After the initial crawl
« on: January 07, 2009, 04:20:42 PM »

I have a question about the Standalone Generator, and it seems I'm unable to start a new topic in that forum.  I purchased and installed the generator a couple weeks ago, and I'll be crawling a very large, resource-intensive site in the next few days. I don't think the server is looking forward to this, and I may come back with a question or two along the way.  For now, I have a question about subsequent crawls after the first one.  Pardon my ignorance if this question is answered elsewhere (I skimmed through documentation and posts, but didn't find what I needed to know).

Once a sitemap has been generated, are future sitemaps reconstructed from scratch every time, or is the system able to seek out only new URLs and skip pages that have already been crawled?  Every page of the site we're getting ready to crawl loads a video. Some pages are also hooked up to the Flickr API through a set of custom wrappers, and every page requires 5-10 DB queries (at least the DB is on a different server). We're working on setting up more extensive caching, but right now caching is limited to common crawlers (there's a member login section for the site, and you can't very well cache those pages for others to see!). Sitemap Generator will hit more uncached pages than cached ones, especially since we're dumping the cache every 15 minutes.

The initial crawl will probably reach ~70,000 pages. I know this isn't too out of the ordinary, but with the resources we're pushing, either it's going to take a full day to generate a sitemap every time we need to update, or we're going to need to figure out a way to manage the load even better.  Oh yeah, and they want to update daily  :o

And so my question: how are updates handled - a whole new sitemap every time, or in some way incremental?  Thanks!
Re: After the initial crawl
« Reply #1 on: January 08, 2009, 12:28:29 AM »

Every time a sitemap is created, the generator script crawls the whole site. Otherwise it wouldn't be able to find new pages, since there's no way to know in advance where a new page will appear.
However, with the "Do not parse" / "Exclude URLs" options it's possible to avoid crawling a major part of the site and still include those URLs in the sitemap.
Also, using the "Make delay after each X request" option, you can decrease the server load added by the crawler.
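
To illustrate what a "delay after each X requests" setting generally does, here is a minimal Python sketch of that kind of throttling. It is not the generator's own code: the function name, its parameters, and the `fetch` callback are all made up for illustration.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_throttle(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    requests_per_batch: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch every URL, pausing after each full batch of requests.

    All names here are hypothetical; this only illustrates the idea
    of pausing after every X requests to spread out server load.
    """
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        # Pause once a full batch of requests has been sent.
        if count % requests_per_batch == 0:
            sleep(delay_seconds)
    return results
```

The `sleep` parameter is only there so the pause can be swapped out when trying the sketch without actually waiting.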

PS. To post in the first 2 subforums, you should use the forum username/password that was sent to you along with the sitemap generator download info.

Re: After the initial crawl
« Reply #2 on: January 08, 2009, 05:39:08 PM »
Thanks Oleg.  It was a while ago, and I forgot there was login information in that email. I'm using it now - sorry for the confusion.

Do you have any strategies you can share regarding the number of requests and the time between requests, and how much certain settings actually affect crawl times? I know it will vary from site to site and server to server; I just want to get a good sense of how much impact throttling has had for you or for other users who have experience with the tool.  Thanks!
Re: After the initial crawl
« Reply #3 on: January 08, 2009, 07:24:29 PM »
Indeed, that depends on the site structure and page content type. An example might be a 1-second delay after every 5 requests.
The most important part, though, is to properly configure the "Do not parse" options to improve the crawling rate.
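
For a rough sense of what that example setting costs in crawl time, the arithmetic can be sketched in Python. This assumes one request per page and uses the ~70,000 page estimate from the first post; the function name is made up for illustration, and the result covers only the added pauses, not the time the pages themselves take to load.

```python
def added_delay_seconds(total_requests: int,
                        requests_per_batch: int,
                        delay_seconds: float) -> float:
    """Total pause time added by sleeping after every full batch of requests."""
    if requests_per_batch <= 0:
        raise ValueError("requests_per_batch must be positive")
    pauses = total_requests // requests_per_batch  # number of full batches
    return pauses * delay_seconds


# ~70,000 pages, with a 1-second delay after every 5 requests
overhead = added_delay_seconds(70_000, 5, 1.0)
print(f"{overhead:.0f} s, about {overhead / 3600:.1f} hours")
# prints: 14000 s, about 3.9 hours
```

So under those assumptions the throttling alone adds a bit under four hours to a full crawl, which is one reason trimming the crawl with "Do not parse" matters more than tuning the delay.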