Large Site Crawl Won't Resume
« on: October 22, 2008, 12:49:29 AM »
Hi - I can see some similar threads, but not this exact scenario. It's a large site (80,000 locations, excluding images, etc.). I have gradually cranked up the memory and am now running the crawl via the command line. The server was rebooted, probably early on the second day of crawling, and now the crawl won't resume.
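
For reference, this is roughly how I've been launching it (the path is illustrative; runcrawl.php is the generator's command-line runner):

    cd /path/to/generator
    php runcrawl.php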

When restarted, it emits the page header, down to the <body> tag, and then returns to the command line. To save memory, I have disabled the HTML sitemap, ROR, etc. (following the install guide recommendations). I switched to the second memory type (var_export) *before* I started the run, and the save-script-state setting is at 180 (the default).

There are files in the data directory, including a non-empty crawl_dump.log, a zero-length interrupt.log, and a non-empty placeholder.txt.
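
A quick way to check whether the resume is making any progress (plain ls, nothing generator-specific) is to watch the state file's size and timestamp:

    ls -l data/crawl_dump.log data/interrupt.log data/placeholder.txt

If crawl_dump.log stops growing while the script is supposedly running, the resume has stalled.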

Suggestions for how to make this proceed further would be most welcome. I'd rather avoid another 24-hour delay, if possible :)

Thanks In Advance!

Re: Large Site Crawl Won't Resume
« Reply #1 on: October 22, 2008, 10:43:50 PM »
Hello,

Did you try increasing the memory_limit setting in the PHP configuration on your server?
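
For example, in php.ini (note that the CLI often reads a different php.ini than the web server; "php --ini" shows which file is loaded):

    memory_limit = 512M

or, to override it for a single command-line run:

    php -d memory_limit=512M runcrawl.php

The 512M value is only an example; use whatever your server can spare.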

Please note that in most cases the "serialize" option takes less memory than "export_vars"; the latter only makes sense on some server configurations where the PHP setup has memory leaks when using "serialize".
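
As a minimal sketch of the difference between the two dump formats (the generator's real state structure is more complex, and the variable names here are made up):

    <?php
    // crawl state that gets periodically dumped to disk (illustrative)
    $state = array('queue' => array('http://example.com/'), 'crawled' => 12345);

    // "serialize" option: compact string representation
    $dump = serialize($state);

    // "export_vars" option: parseable PHP source text; more verbose, but a
    // workaround for PHP setups where serialize() leaks memory
    $dump = var_export($state, true);
    ?>
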
Re: Large Site Crawl Won't Resume
« Reply #2 on: October 23, 2008, 07:48:17 AM »
Thanks.

I've been gradually increasing the memory; it is now set at 256M. If I run the "php .../runcrawl.php" script directly, it normally shows when memory is exhausted. It doesn't show that here - it gets as far as <body> and stops, silently.
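
When the limit is hit, the PHP CLI normally prints something like:

    PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate ... bytes) in ...

(268435456 bytes being 256M), and I'm not seeing that line, which is why I don't think this is a plain out-of-memory stop.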

I've reset the memory type to serialize. It had seemed to run out of memory faster with serialize, though.

Since the incremental dump format seems to depend on the memory type, changing it has restarted the crawl from the beginning. I'll know in a day or so whether it gets any further.
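
To keep an eye on it this time, I'm just watching the state dump grow (standard watch utility, five-minute interval):

    watch -n 300 ls -l data/crawl_dump.log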

Thanks for the help.