Problema running the script via Crontab
« on: July 03, 2012, 02:52:03 PM »
Hello Forum, I would like to discuss here a strange situation I am experiencing with XML-Sitemaps Standalone Generator.

Firstly, the script is currently installed and fully working.

SCENARIO ONE
With the script installed, I manually run the following command in an SSH session:
/usr/bin/php /var/www/mysite.com/generator/runcrawl.php
The result is 2560 URLs in the sitemap.

SCENARIO TWO
The same installation. I have set the very same command in crontab:
/usr/bin/php /var/www/mysite.com/generator/runcrawl.php (the command runs at 3 AM)
The result is 1950 URLs in the sitemap.

THE QUESTION IS: why does running the very same script from crontab yield fewer URLs in the sitemap?
Re: Problema running the script via Crontab
« Reply #1 on: July 04, 2012, 05:16:58 PM »
Hello,

Is the result consistent? I.e., if you run it in an SSH session again, does it create a sitemap with 2560 URLs again?
Re: Problema running the script via Crontab
« Reply #2 on: July 05, 2012, 08:11:36 AM »
Yes, every time I run it via SSH I get 2560 URLs.
Re: Problema running the script via Crontab
« Reply #3 on: July 05, 2012, 02:06:07 PM »
Hello,

Could you please PM me your generator URL, an example URL that is not included in the sitemap, and how it can be reached starting from the homepage?
Re: Problema running the script via Crontab
« Reply #4 on: July 06, 2012, 09:41:14 AM »
OKAY, start from the homepage:

1) [ External links are visible to forum administrators only ] then click on GIOCARE MODERNO
2) [ External links are visible to forum administrators only ] then click on Squadre FUORI CATALOGO
3) [ External links are visible to forum administrators only ] then click on SUCCESSIVO (the blue arrow for "next", 63 pages)

[ External links are visible to forum administrators only ]
THIS PAGE IS NOT in the sitemap.
Re: Problema running the script via Crontab
« Reply #5 on: July 06, 2012, 08:58:54 PM »
If you check the HTML source of the page, the pagination link looks like:
<a class="pagina_successiva" href="/page/1-20/2-20" title="Successivo">Successivo

i.e. it points to the domain root and is only corrected later with JavaScript, which crawler bots will not see.
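
To illustrate what a crawler does with that link, here is a minimal Python sketch. It is not the generator's own code, and the page URL is a made-up placeholder (the real one is hidden in this thread); only the href is taken from the HTML source above. A bot that does not execute JavaScript simply resolves the href against the address of the page it found it on:

```python
from urllib.parse import urljoin

# Hypothetical address of the page containing the pagination link
# (assumption: the real URL is not visible in the thread).
page_url = "http://www.example.com/shop/fuori-catalogo/"

# href exactly as quoted from the HTML source.
href = "/page/1-20/2-20"

# The leading "/" makes the path relative to the domain root,
# so the directory part of page_url is discarded.
resolved = urljoin(page_url, href)
print(resolved)  # http://www.example.com/page/1-20/2-20

# For comparison only: a hypothetical href without the leading slash
# would be resolved relative to the current directory instead.
relative = urljoin(page_url, "page/1-20/2-20")
print(relative)  # http://www.example.com/shop/fuori-catalogo/page/1-20/2-20
```

Whatever address the JavaScript rewrites the link to in a browser, a crawler only ever requests the first form.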