Very large site, taking long to crawl
« on: April 15, 2010, 04:33:01 PM »

I have purchased the standalone crawler and am running it on a very big site (a few million pages). It has been running for a day now and is only at 98,000 pages. I am concerned this is going to take a few weeks.

It is crawling on a dedicated box (which runs only a few sites, including the one it is crawling) with 8 CPUs and 8 GB RAM, and I have not made it store to disk, so it should be running at its full potential.

Any tips on speeding it up?
Once it finishes and I need to update the sitemap (via cron), will it have to recrawl and take this long again?
Re: Very large site, taking long to crawl
« Reply #1 on: April 16, 2010, 11:44:23 AM »

With a website of this size, the best option is to create a limited sitemap, with the "Maximum depth" or "Maximum URLs" option set so that it gathers about 200,000-300,000 URLs. These would be the main pages, serving as a "roadmap" sitemap for search engines.

The crawling time itself depends mainly on the website's page generation time, since the tool crawls the site much like search engine bots do.
For instance, if it takes 1 second to retrieve each page, then 1,000 pages will be crawled in about 16 minutes.
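To see how that per-page latency scales, here is a rough back-of-envelope estimate, assuming (as described above) a sequential crawl whose time is dominated by page fetch/generation time; the site sizes used are illustrative, not measured:

```python
def crawl_time_minutes(pages, seconds_per_page=1.0):
    """Estimate total time for a sequential crawl, in minutes."""
    return pages * seconds_per_page / 60

# 1,000 pages at 1 second each -> about 16.7 minutes
print(round(crawl_time_minutes(1_000), 1))

# A hypothetical 3,000,000-page site at the same rate -> roughly a month
print(round(crawl_time_minutes(3_000_000) / 60 / 24, 1), "days")
```

This is why a multi-million-page site can run for weeks even on strong hardware: the bottleneck is the server's response time per page, not the crawler's CPU or RAM.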

Some real-world examples of big database-driven websites:
about 35,000 URLs indexed - 1 h 40 min total generation time
about 200,000 URLs indexed - 38 hours total generation time

With the "Maximum URLs" option defined, it would be much faster than that.
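Working backwards from the examples above (treating them as representative, not guaranteed), you can derive the effective per-page time and project how long a capped crawl would take:

```python
# Effective per-page time implied by the examples above.
examples = {
    35_000: 100 / 60,  # 1 h 40 min, expressed in hours
    200_000: 38.0,     # 38 hours
}
for urls, hours in examples.items():
    print(f"{urls} URLs -> {hours * 3600 / urls:.2f} s/page")

# At ~0.68 s/page (the slower example), a 250,000-URL capped crawl
# takes about 2 days, while an uncapped 2,000,000-page crawl
# (hypothetical size) would run for roughly two weeks.
per_page = 38.0 * 3600 / 200_000
print(round(250_000 * per_page / 86_400, 1), "days capped")
print(round(2_000_000 * per_page / 86_400, 1), "days uncapped")
```

Note that the larger site in the examples is also slower per page, so projections from small crawls tend to be optimistic.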
Re: Very large site, taking long to crawl
« Reply #2 on: May 19, 2017, 02:54:15 AM »
I am getting 130,000 pages in 23 minutes; it used to take days for 130,000 pages.
The hint was in the "Narrow Indexed Pages Set" section - try the Exclusion preset. I am now using this software without a hassle.