keep getting 500 error - huge site
« on: December 12, 2011, 11:05:31 AM »
Hi,

we have a huge site, and the script used to run fine once I figured out the right configuration. I had to resume it a couple of times, but it was running.

Now I keep getting 500 errors when trying to resume the script, and I can't get it to run anymore. Is there any way to fix this? The site has about 1 million pages, and 200,000 or so were already crawled.

Also, a sitemap has not been generated yet. Do I have to wait until the whole site has been crawled?

The sitemap script is installed at [ External links are visible to forum administrators only ]

Thank you,

Tino
Re: keep getting 500 error - huge site
« Reply #1 on: December 12, 2011, 01:07:22 PM »
Hello,

with a website of this size the best option is to create a limited sitemap - with the "Maximum depth" or "Maximum URLs" option restricted so that it gathers about 200,000-300,000 URLs, which would be the main pages, representing a "roadmap" sitemap for search engines.
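As a rough sketch, the settings might look something like this (the values are only illustrative, not recommendations - the right numbers depend on how your site is structured):

    Maximum URLs: 250000     (stop crawling once the main pages are collected)
    Maximum depth: 4         (skip pages buried more than four clicks from the homepage)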

The crawling time itself depends mainly on the website's page generation time, since the script crawls the site much like search engine bots do.
For instance, if it takes 1 second to retrieve each page, then 1,000 pages will be crawled in about 16 minutes.

Some real-world examples of big database-driven websites:
about 35,000 URLs indexed - 1h 40min total generation time
about 200,000 URLs indexed - 38 hours total generation time

With "Max urls" options defined it would be much faster than that though.
Re: keep getting 500 error - huge site
« Reply #2 on: December 12, 2011, 01:31:32 PM »
OK, thanks for that.

But how come I keep getting the 500 internal server error from time to time?

Also, it seems the script restarted, and all the previously crawled pages are lost now :(. We already had 200,000+ crawled.

Does the script only create the sitemap after crawling is done?

Did you look at the site and the generator, and maybe fix it?
Re: keep getting 500 error - huge site
« Reply #3 on: December 12, 2011, 01:35:19 PM »
One more question: is the info regarding crawled pages always saved in crawl_dump.log?
Should I make a backup of it frequently?
Re: keep getting 500 error - huge site
« Reply #4 on: December 13, 2011, 09:09:58 PM »
> But how come I keep getting the 500 internal server error from time to time?

it looks like your server configuration doesn't allow the script to run long enough to create a full sitemap. Please try increasing the memory_limit and max_execution_time settings in the PHP configuration at your host (the php.ini file), or contact your hosting support about this.
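For example, the relevant php.ini directives could look roughly like this (the values are only illustrative - what your host actually allows may differ, and some hosts only let you change these through a local php.ini or the hosting control panel):

    ; illustrative values only - raise or lower them to match what your host permits
    memory_limit = 512M          ; more memory for the crawler's URL queue on a ~1M-page site
    max_execution_time = 3600    ; let a single run continue for up to an hour instead of the default 30 seconds

Depending on how PHP is set up, the web server may need a restart before the new limits take effect.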

> One more question: is the info regarding crawled pages always saved in crawl_dump.log?
> Should I make a backup of it frequently?

Yes, it's saved in that file, and I'd recommend saving a copy just in case, since it has already taken a long time to crawl the site.
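A simple way to keep a copy is a timestamped copy from the shell - the paths below are only an assumption, so point them at wherever crawl_dump.log actually sits in your installation's data folder:

    # illustrative backup command - adjust both paths to your actual installation
    cp /path/to/generator/data/crawl_dump.log /path/to/backups/crawl_dump_$(date +%Y%m%d_%H%M).log

Running something like that before each resume (or from a cron job) means a failed run only costs you the pages crawled since the last copy.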