run in background confirmation
« on: March 01, 2006, 06:54:15 PM »
We have a large site of some 1,000,000+ pages. What is the best way, on a Red Hat OS, to ensure the script is running OK after closing the control panel and checking "Do not interrupt the script even after closing the browser window until the crawling is complete", and to see how far it has got at any particular point?

It seems that when I log back in, select crawl and start it again, it starts from the last time I closed the window, which makes me think it's not running in the background.
Re: run in background confirmation
« Reply #1 on: March 01, 2006, 07:14:30 PM »
Hello,

you can set the "Save crawling state every X seconds" option to save the progress state every 10 minutes or so, and then you will see the *.proc file in the data/ folder.
Btw, it is recommended to execute the sitemap crawler from the SSH command line for large sites.
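
For example, a quick way to confirm the crawler is still alive is to check that saved state file from the shell. A minimal sketch, assuming the generator is installed at /path/to/generator as in the command lines below:

ls -l /path/to/generator/data/*.proc

If the file's timestamp keeps moving forward between checks, the crawl is still progressing in the background.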
Re: run in background confirmation
« Reply #2 on: March 02, 2006, 10:05:25 AM »
How do you run it from the command line? I can't see any instructions on how to do that in your docs.
Re: run in background confirmation
« Reply #3 on: March 02, 2006, 06:26:30 PM »
Hi,

you will find the command line to use on the "Crawling" page.
It is usually:
/usr/bin/php /path/to/generator/runcrawl.php
OR
/usr/local/bin/php /path/to/generator/runcrawl.php
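
Note that a crawl started over SSH will normally be killed when the session closes, so for a site this size it is worth detaching the process. A minimal sketch, assuming the /usr/bin/php path from above (crawl.log is just an example log file name):

nohup /usr/bin/php /path/to/generator/runcrawl.php > /path/to/generator/crawl.log 2>&1 &

nohup keeps the process running after you log out, the trailing & returns you to the shell immediately, and you can follow progress with tail -f /path/to/generator/crawl.log.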
Re: run in background confirmation
« Reply #4 on: March 02, 2006, 08:01:58 PM »
OK, thanks for that. A couple more quick questions on this "mega crawl" ;-):

1./ All of the 1,000,000+ pages have a common directory name in the URL, like:
[ External links are visible to forum administrators only ]

If I add "directory" to the "do not parse URL" field, will it skip anything with "directory" in the URL?

2./ I let it crawl 100,000 pages so far, then decided to pause it and adjust the max time to 10 seconds so it would create a sitemap for what it has done so far. I clicked on crawl and resumed the session, but now all that shows on the crawl page is:

Links depth: -
Current page: -
Pages added to sitemap: -
Pages scanned: - (- Kb)
Pages left: - (+ - queued for the next depth level)
Time passed: -
Time left: -
Memory usage: -

and nothing more, and no sitemaps. Any ideas?
« Last Edit: March 02, 2006, 10:45:10 PM by sales32 »
Re: run in background confirmation
« Reply #5 on: March 03, 2006, 04:29:24 PM »
Hello,

1. Yes, that is correct.
2. If you are using the "maximum execution time" setting, the script will simply stop as soon as the time limit is exceeded, wherever it happens to be in the crawl, and no sitemap will be created. You should use the "maximum URLs" option for testing purposes instead, since it lets the run finish cleanly and write the sitemap.
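
Once a limited test crawl completes, you can confirm the output was actually written by checking the sitemap file from the shell. A minimal sketch, assuming the default sitemap.xml file name and that you configured the generator to save it in your site root (adjust the path to wherever you pointed it):

ls -l /path/to/site/root/sitemap.xml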