What is the best approach for a big site?
« on: February 06, 2011, 08:26:15 PM »
I have a very big site (over 3M pages).  It's running on an EC2 instance with 1.7GB of memory.  I know that it will probably take several days to create the site map, so I want to configure the crawl in the best way to make sure that once it starts it completes successfully.

My initial attempts got about 9k pages into the site and then started getting memory errors.  Watching the process, it seems like it never releases memory and the memory just keeps growing until it can't allocate any more.  My guess is that I need to do this incrementally, but when I resume the process it seems to just start getting errors again.

Do you have any suggestions for the best way to approach this?  I'm happy to let it run for a long time at a low priority level, as long as it finally completes.
Re: What is the best approach for a big site?
« Reply #1 on: February 06, 2011, 09:25:52 PM »
Hello,

with a website of this size the best option is to create a limited sitemap - with the "Maximum depth" or "Maximum URLs" option set so that it gathers about 50,000-100,000 URLs, which would be the main pages, giving search engines a "roadmap" sitemap.

The crawling time mainly depends on the website's page generation time, since the generator crawls the site much like search engine bots do.
For instance, if it takes 1 second to retrieve each page, then 1,000 pages will be crawled in about 17 minutes.
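
To put rough numbers on it (the 1 second per page is just an assumption - measure it for your own site), the estimate is simply pages multiplied by average page time:

[code]
<?php
// Back-of-the-envelope crawl-time estimate: pages are fetched one at a time,
// so total time is roughly (number of pages) * (average page generation time).
$secondsPerPage = 1.0;  // assumed average retrieval time per page
$pages          = 1000;

printf("%d pages   -> ~%.0f minutes\n", $pages, $pages * $secondsPerPage / 60);    // ~17 minutes
printf("%d pages -> ~%.1f hours\n", 100000, 100000 * $secondsPerPage / 3600);      // ~27.8 hours
[/code]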

Some real-world examples from big database-driven websites:
about 35,000 URLs indexed - 1h 40min total generation time
about 200,000 URLs indexed - 38 hours total generation time

With "Max urls" options defined it would be much faster than that.
Re: What is the best approach for a big site?
« Reply #2 on: February 07, 2011, 05:45:41 PM »
It doesn't look like I can crawl that many URLs in one session.  I was able to crawl 20k, but when I tried a higher number it died with an "unable to allocate memory" error.  memory_limit is 512M and max_execution_time is 300s.
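
If it matters, the limits could presumably be raised along these lines for a command-line run (the values are only illustrative, and they'd have to stay under the instance's 1.7GB of RAM):

[code]
<?php
// Illustrative only: raise PHP limits for a long-running CLI crawl.
// Equivalent php.ini settings: memory_limit = 1024M, max_execution_time = 0
ini_set('memory_limit', '1024M'); // needs to stay well under the 1.7GB of physical RAM
set_time_limit(0);                // remove the execution time limit for this script

echo 'memory_limit: ', ini_get('memory_limit'), "\n";
echo 'max_execution_time: ', ini_get('max_execution_time'), "\n";
[/code]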

So here's my plan, please tell me if you think this will work or if there is a better approach.

Luckily, I can identify branches of my site tree and include only those branches using "Include only" and "Parse only".  My plan is to index about 10k URLs at a time, focusing on one branch per run, then collect the separate sitemapX.xml files and group them together by manually creating a sitemap index file.
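
For the index file itself, something along these lines should do it (the file names and base URL are placeholders for my site):

[code]
<?php
// Minimal sketch: build sitemap-index.xml from per-branch sitemap files.
// File names and base URL are placeholders.
$baseUrl  = 'http://www.example.com/';
$sitemaps = ['sitemap1.xml', 'sitemap2.xml', 'sitemap3.xml'];

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($sitemaps as $file) {
    $xml .= "  <sitemap>\n";
    $xml .= '    <loc>' . htmlspecialchars($baseUrl . $file) . "</loc>\n";
    $xml .= '    <lastmod>' . date('c') . "</lastmod>\n";
    $xml .= "  </sitemap>\n";
}
$xml .= "</sitemapindex>\n";

file_put_contents('sitemap-index.xml', $xml);
[/code]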

Does that sound workable?  Is there a better approach?
Re: What is the best approach for a big site?
« Reply #3 on: February 07, 2011, 07:12:36 PM »
Thinking about how to do this, it would be great if I could create a queue of smaller tasks, each with a different configuration, and accumulate the sitemaps that are generated into a single sitemap set.  That would be a great feature for the next version.

It might also be nice if I could configure a start depth and an end depth.  That way I could start a crawl at depth 2 and only go to depth 3, gradually accumulating sitemaps into a single sitemap set.  Since a lot of people seem to be having trouble with larger crawls, I'm looking for ways to break up each crawl into bite-size pieces and accumulate the results.
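
To make the queue idea concrete, here's a rough sketch of what I mean - run_crawl() is purely a placeholder for however a single limited crawl would actually be launched, and the option names are made up:

[code]
<?php
// Hypothetical sketch of the "queue of smaller tasks" idea - not how the
// generator works today. Each task has its own include mask and depth range,
// and the resulting sitemap files are accumulated for one sitemap index.

// Placeholder: stands in for launching one limited crawl with these settings.
function run_crawl(array $task, string $outFile): void
{
    echo "would crawl {$task['include_only']} at depth {$task['min_depth']}-{$task['max_depth']} -> {$outFile}\n";
}

$tasks = [
    ['include_only' => '/products/', 'min_depth' => 2, 'max_depth' => 3],
    ['include_only' => '/articles/', 'min_depth' => 2, 'max_depth' => 3],
];

$generated = [];
foreach ($tasks as $i => $task) {
    $outFile = sprintf('sitemap%d.xml', $i + 1);
    run_crawl($task, $outFile);
    $generated[] = $outFile;  // these would then feed the sitemap index file
}
[/code]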
Re: What is the best approach for a big site?
« Reply #4 on: February 07, 2011, 08:03:49 PM »
Also, could you give me an idea of how much setting "Progress state storage type" to var_export is likely to reduce memory requirements?  Will it reduce it by 50%?  5%?
Re: What is the best approach for a big site?
« Reply #5 on: February 07, 2011, 10:24:17 PM »
Hello,

you would need to restart crawling from scratch (without resuming) if you reduce the total number of URLs in the sitemap.

> Also, could you give me an idea of how much setting "Progress state storage type" to var_export is likely to reduce memory requirements?  Will it reduce it by 50%?  5%?

In some cases it can actually increase resource usage; I'd recommend keeping it at the default setting.