Crawl interrupted - out of ideas
« on: September 16, 2010, 05:01:27 PM »
I know that many others have had this problem, and I've tried various fixes without success.  The crawl typically stops at about 5,200 pages, roughly 1,000 short of my site's content.  It seems to be interrupted every night after 50-some minutes.  I created a php.ini file on my server and set the max execution time to 7200 seconds, which should give me a couple of hours.
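
For reference, the override in that php.ini file is just one line (this assumes the host honors per-directory php.ini files, which mine appears to):

max_execution_time = 7200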

The host said the script is being interrupted because it uses too many resources, either exceeding the memory limit or the CPU capacity.  Their memory limit is 64MB, so I set that value in my Generator configuration.  I also have lots of URLs set to "Do not parse", to speed up the process and use fewer resources.  I'm also not generating a changelog, to save resources.

What else can I try?

I've experimented with various delays between requests, but I have no idea what the best setting is here.

Any other suggestions?

thanks,
Mike

Re: Crawl interrupted - out of ideas
« Reply #1 on: September 17, 2010, 08:10:43 AM »
Hello,

please send me your generator URL/login in a private message so I can check this.
In many cases, configuring the "Exclude URLs"/"Do not parse" options can significantly improve crawling speed.
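
For example, entries along these lines (the paths are made-up, just to show the format) cut down how many pages the crawler has to download: "Do not parse" still lists matching pages in the sitemap but skips fetching their content, while "Exclude URLs" omits them entirely.

Do not parse:
  .pdf
  printable=yes

Exclude URLs:
  cart.php
  search.php?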
Re: Crawl interrupted - out of ideas
« Reply #2 on: September 18, 2010, 03:53:05 AM »
PM sent.  I use "Do not parse" quite a bit.  Take a look and see if there's something else I should tweak.

thanks,
Mike
Re: Crawl interrupted - out of ideas
« Reply #3 on: September 19, 2010, 04:39:56 PM »
Have you had a chance to check my settings yet?  Two nights ago the crawl completed, but last night it was interrupted again at about 59 minutes.  Under the Crawling tab I can manually check "Continue the interrupted session."  Is there any way to set it to continue an interrupted session automatically?   I see the setting "Save the script state, every X seconds: this option allows to resume crawling operation if it was interrupted", but what do I put there?  It's currently set to 30 seconds, but obviously that's not working for me.
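
I also wondered whether a cron job that re-runs the crawler script could work as an automatic resume, something like the line below (the PHP binary and generator paths are guesses for my setup, and I don't even know whether my plan allows cron):

0 2 * * * /usr/bin/php /home/mysite/public_html/generator/runcrawl.php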

thanks,
Mike
Re: Crawl interrupted - out of ideas
« Reply #5 on: October 05, 2010, 07:27:28 PM »
Today my host shut my site down due to this XML-Sitemap crawler.  The host sent me an email saying that they had to shut down my site because a script "was causing a high load on the server, and due to it affecting all of the other accounts on the system, I was forced to take immediate action for the health of the server."  The script they blamed was /generator/runcrawl.php.

They're telling me I have to upgrade to a VPS because this script requires too much memory and can't be supported on shared hosting.

Has anyone else experienced issues on shared hosting?
Re: Crawl interrupted - out of ideas
« Reply #6 on: October 05, 2010, 08:51:28 PM »
Hello,

hmm.. you can use "Delay for X seconds after each Y requests" to slow down the requests the generator's crawler makes, and limit the number of URLs added to the sitemap in the generator configuration to reduce memory usage. Disabling the "html/ror" sitemap options would decrease memory usage too.
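
Conceptually, the delay option boils down to something like this sketch (an illustration of the idea only, not the generator's actual code):

<?php
// Sketch: pause for $delay seconds after every $batch requests,
// so the crawler yields CPU back to the shared server.
function fetch_and_parse($url) { /* hypothetical: download page, extract links */ }

$urls  = array('http://example.com/a', 'http://example.com/b'); // crawl queue
$batch = 100; // "each Y requests"
$delay = 5;   // "X seconds"
$count = 0;
foreach ($urls as $url) {
    fetch_and_parse($url);
    if (++$count % $batch === 0) {
        sleep($delay);
    }
}
?>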
Re: Crawl interrupted - out of ideas
« Reply #7 on: October 05, 2010, 09:29:36 PM »
I already have "Delay 5 seconds after each 100 requests" set.  Not sure whether that's a good setting or not.  I've disabled the ror and text sitemap options, but I need the html option.  I'm going to try the "use temporary files to store crawling progress" option tonight, though I've read elsewhere on this forum that it's known to cause issues, too.
Re: Crawl interrupted - out of ideas
« Reply #8 on: October 06, 2010, 07:01:45 PM »
Hello,

issues occur only in some cases, so that might work. I'd also recommend limiting the maximum number of URLs in the sitemap if you have a lot of them.
Re: Crawl interrupted - out of ideas
« Reply #9 on: October 15, 2010, 03:24:11 PM »
OK, I started using the "use temporary files to store crawling progress" option, and now the crawl completes most of the time.  Other times it's interrupted, and when I log in I notice that the crawl itself is pretty much done (0 pages left in the queue), and all that remains is to actually build the sitemap.

I've already turned off every option that states "please note that this option requires more resources to complete", except for the HTML sitemap, because I need that.  I have the sort order set to Alphabetical ... does that take more resources?  Would I be better off leaving it unsorted, or does it not really make much difference in memory use?
Re: Crawl interrupted - out of ideas
« Reply #11 on: November 01, 2010, 02:03:19 PM »
OK, the crawl has been working most of the time over the last couple of weeks, failing to complete only a couple of times.  But last night it happened again: the host shut my site down because runcrawl.php was using too many resources.   :o

I'm not sure what else to try.  I had the delay set at 5 seconds after every 100 requests; I'll try changing that to 10 seconds after every 50 requests, unless you have a different suggestion.  How about Maximum Depth Level?  I have that set at 150 ... would changing it make any difference?  I thought about changing the storage type to var_export, but I read elsewhere on the forum that serialize actually uses less memory.  How about "Save the script state, every X seconds"?  I have that set at 600 seconds.  Should that be higher or lower?
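
As I understand it, the two storage types come down to roughly this (my own sketch, not the generator's code; the crawl_state file names are made up):

<?php
// Hypothetical crawler state to be saved between runs.
$state = array(
    'queue' => array('http://example.com/a'), // URLs still to crawl
    'done'  => array('http://example.com/'),  // URLs already crawled
);

// serialize: compact string, restored with unserialize().
file_put_contents('crawl_state.ser', serialize($state));
$state = unserialize(file_get_contents('crawl_state.ser'));

// var_export: writes valid PHP source, restored by include.
file_put_contents('crawl_state.php',
    '<?php return ' . var_export($state, true) . ';');
$state = include 'crawl_state.php';
?>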
Re: Crawl interrupted - out of ideas
« Reply #12 on: November 01, 2010, 11:40:42 PM »
Edit for more info: I just got clarification from the host; the problem was CPU usage, not memory.  So what can I do to reduce the load on my host's CPU?
Re: Crawl interrupted - out of ideas
« Reply #13 on: November 02, 2010, 06:22:58 PM »
CPU usage can be reduced with the "Delay after each X requests" setting, as described above. You might try a longer delay, or apply it after a smaller number of requests (although that will increase sitemap generation time).
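
As a rough worked example for your ~6,200-page site: at 5 seconds per 100 requests the crawler pauses for about 6200 / 100 × 5 ≈ 310 seconds in total, while at 10 seconds per 50 requests that grows to 6200 / 50 × 10 = 1,240 seconds, i.e. roughly 15 extra minutes of run time in exchange for a lighter average CPU load.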