Hello,
I have a question about the Standalone Generator, and it seems I'm unable to start a new topic in that forum. I purchased and installed the generator a couple of weeks ago, and in the next few days I'll be crawling a very large, resource-intensive site. I don't think the server is looking forward to it, and I may come back with a question or two along the way. For now, I have a question about subsequent crawls after the first one. Pardon my ignorance if this is answered elsewhere; I skimmed through the documentation and forum posts but didn't find what I needed to know.
Once a sitemap has been generated, is each subsequent sitemap reconstructed from scratch, or can the system seek out only new URLs and skip pages that have already been crawled?

Here's why I ask: every page of the site we're about to crawl loads a video, some pages are also hooked up to the Flickr API through a set of custom wrappers, and every page requires 5-10 DB queries (at least the DB is on a different server). We're working on more extensive caching, but for now caching is limited to common crawlers: there's a member login section on the site, and you can't very well cache those pages for others to see! We're working on it. So the Sitemap Generator will hit more uncached pages than cached ones, especially since we're dumping the cache every 15 minutes.
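For context, the cache gate works roughly like the sketch below. This isn't our actual code (the site isn't Python, and the crawler list and helper names are made up for illustration); it just shows the shape of the check: cached HTML is served only to user agents we recognize as crawlers, so anything else, the generator included, falls through to the expensive live render.

# Rough sketch only; crawler list and helpers are illustrative, not our real setup.
KNOWN_CRAWLERS = ("Googlebot", "bingbot", "Slurp", "DuckDuckBot")

def is_known_crawler(user_agent):
    """True if this request's User-Agent looks like a crawler we cache for."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in KNOWN_CRAWLERS)

def serve_page(path, user_agent, cache, render_live):
    """Serve cached HTML to crawlers when available; render live otherwise.

    `cache` stands in for whatever cache store is used, and `render_live`
    is the expensive path (5-10 DB queries per page, plus Flickr calls).
    """
    if is_known_crawler(user_agent) and path in cache:
        return cache[path]          # cheap: no DB or Flickr work
    html = render_live(path)        # expensive live render
    if is_known_crawler(user_agent):
        cache[path] = html          # only the anonymous/crawler view gets cached
    return html

The upshot is that unless the generator's user agent is added to that list, every page it requests is a full live render.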
The initial crawl will probably reach ~70,000 pages. I know that isn't out of the ordinary, but with the resources each page consumes, either it's going to take the better part of a day to generate a sitemap every time we need to update (at even a second per page, 70,000 pages is close to 20 hours of crawling), or we're going to need to figure out a way to manage the load better. Oh yeah, and they want it updated daily.
So, my question: when the sitemap is updated, does the generator build a whole new sitemap from scratch, or is the process incremental in some way? Thanks!