How long?
« on: February 26, 2006, 06:43:24 AM »
How long, in general, does it take to crawl sites?  I've been using GSiteCrawler, but I outgrew it because it couldn't crawl some of my bigger sites.  Unless I set something up wrong, this program seems to be crawling quite slowly.  About how many pages should it be crawling per hour?  Is there anything I can do to tweak it to be faster?  I've got a 2.2M page site to crawl, and at the rate it's crawling my 200k+ page site, the 2.2M page site will take over a month.
Re: How long?
« Reply #1 on: February 26, 2006, 12:21:39 PM »
Hello,

The Sitemap Generator script crawls your site (fetching the pages with HTTP requests) to make sure that all pages are included in the sitemap, so the main factor is how fast each page of your site loads (including page generation time). Since the script runs on the same host as your website, its network connection is significantly faster than that of a desktop application running on your own computer.
You will find more details here: https://www.xml-sitemaps.com/forum/index.php/topic,95.msg387.html#msg387
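
As a rough illustration of why page load time dominates, here is a back-of-the-envelope sketch (this is not part of the generator itself, and the 0.5-second response time is just an assumed figure - measure your own site's average page generation plus transfer time):

```python
# Rough crawl-time estimate for a single-threaded crawler.
# avg_seconds_per_page is a hypothetical example value.
avg_seconds_per_page = 0.5   # assumed average time to fetch one page
pages = 2_200_000            # size of the site to crawl

pages_per_hour = 3600 / avg_seconds_per_page
total_hours = pages * avg_seconds_per_page / 3600

print(f"~{pages_per_hour:,.0f} pages/hour, ~{total_hours / 24:.1f} days total")
```

With that assumed half-second response time this works out to roughly 7,200 pages per hour and about 13 days for 2.2M pages; faster pages shorten the crawl proportionally.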

Please note that you can greatly improve generation time using the "Do not parse URLs" option (such pages are included in the sitemap, but not downloaded by the script) and the "Exclude URLs" option (such pages are not included in the sitemap at all).
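
To make the effect of those two options concrete, here is a simplified crawl-loop sketch (this is not the generator's actual code; the filter lists and substring matching are just illustrative assumptions):

```python
import re
import urllib.request
from urllib.parse import urljoin

# Illustrative filter lists -- substring matching is an assumption here,
# not necessarily how the generator's options are implemented.
DO_NOT_PARSE = ["/printable/", ".pdf"]   # listed in the sitemap, never fetched
EXCLUDE = ["/admin/", "sessionid="]      # never listed, never fetched

def crawl(start_url):
    queue, seen, sitemap_urls = [start_url], {start_url}, []
    while queue:
        url = queue.pop()
        if any(p in url for p in EXCLUDE):
            continue                      # excluded: skipped entirely
        sitemap_urls.append(url)
        if any(p in url for p in DO_NOT_PARSE):
            continue                      # included, but no HTTP request is made
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith(start_url) and link not in seen:  # crude same-site check
                seen.add(link)
                queue.append(link)
    return sitemap_urls
```

Every URL matched by either option is one HTTP request the crawler never has to wait for, which is where the time savings come from.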
Re: How long?
« Reply #2 on: February 26, 2006, 12:25:40 PM »
Yeah, ok, that makes sense.

1) What's the biggest site the SiteMap Generator has crawled?
2) Are the sitemap files created along the way, or are they all created once all the crawling is done?  For example, my 2.2M page site will need 45 or so sitemaps.  As it crawls 50k, then 100k, etc., will I see the sitemaps created along the way, or will it wait until the very end before it creates those files?
Re: How long?
« Reply #3 on: March 01, 2006, 05:02:15 PM »
Hello,

1. We're not able to say exactly how many URLs our customers have included in their sitemaps, but you can find some examples on our testimonials page :)
2. All data is collected first, and only then are the sitemaps split and the index file created. This is required to avoid adding duplicate URLs and getting caught in endless loops on linked pages.
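
For illustration, here is a minimal sketch of what that final split-and-index step looks like, assuming the standard sitemaps.org limit of 50,000 URLs per sitemap file (this is not the generator's own code; the file names and helper are just examples):

```python
# Minimal sketch: deduplicate collected URLs, split into 50,000-URL files,
# then write a sitemap index pointing at them.
URLS_PER_FILE = 50_000

def write_file(name, entries, tag):
    root = "sitemapindex" if tag == "sitemap" else "urlset"
    with open(name, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<{root} xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in entries:
            f.write(f"  <{tag}><loc>{url}</loc></{tag}>\n")
        f.write(f"</{root}>\n")

def split_and_index(urls, base_url):
    unique = sorted(set(urls))            # duplicates dropped before splitting
    index_entries = []
    for i in range(0, len(unique), URLS_PER_FILE):
        name = f"sitemap{i // URLS_PER_FILE + 1}.xml"
        write_file(name, unique[i:i + URLS_PER_FILE], "url")
        index_entries.append(base_url + name)
    write_file("sitemap_index.xml", index_entries, "sitemap")
```

At 50,000 URLs per file, 2.2M unique URLs would produce 44 sitemap files plus one index file, which matches the 45 or so you estimated.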