a few questions
« on: March 14, 2008, 08:40:44 AM »
It looks like I have everything working correctly.  I have a few questions though.

1.  My site is very large and there are probably a few million pages.  I only have 1024 MB of total space, and the first crawl took up nearly 450 MB while covering only about 2% of my pages.  The obvious solution is to get a dedicated server with more space, but do you have any suggestions in the meantime, before I move to the new server?

2.  How do I know when Google, Ask, Yahoo, etc. are pinged?  Does the script use the ROR, RSS, XML, and TXT files when it pings these search engines?

Re: a few questions
« Reply #1 on: March 14, 2008, 08:18:02 PM »

1. Do you mean the metric displayed by the sitemap generator while crawling? That figure is not the disk space used by the generator script, but the total size of the pages crawled. You don't need that much space to store the sitemap itself.

2. The script pings the search engines automatically when the sitemap is created. You can also manually submit the XML sitemap through your Google webmaster account: http://www.google.com/webmasters/sitemaps/siteoverview
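For reference, a "ping" is just an HTTP GET request to a search engine's notification endpoint with your sitemap URL as a query parameter. A minimal sketch of how such a ping URL is built, using Google's historical /ping endpoint (the helper name is my own; other engines of this era used similar endpoints):

```python
from urllib.parse import urlencode

# Hypothetical helper: build the notification URL that sitemap tools send
# to Google's historical /ping endpoint. The sitemap URL must be
# percent-encoded, which urlencode handles.
def google_ping_url(sitemap_url: str) -> str:
    return "http://www.google.com/ping?" + urlencode({"sitemap": sitemap_url})

print(google_ping_url("http://example.com/sitemap.xml"))
# http://www.google.com/ping?sitemap=http%3A%2F%2Fexample.com%2Fsitemap.xml
```

Fetching that URL (e.g. with urllib) is all a ping amounts to; the engine then schedules your sitemap for download.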
Re: a few questions
« Reply #2 on: March 15, 2008, 07:48:10 AM »
Well, the crawl_dump.log file is nearly 215 MB, not including the size of the sitemaps themselves.  I used gzip compression, which does help, but I can't figure out why crawl_dump.log is storing so much data.
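As an aside, gzip-compressing a sitemap (producing the sitemap.xml.gz form the major engines accept) is straightforward in most languages. A minimal Python sketch, with placeholder file names and a toy one-URL sitemap:

```python
import gzip
import shutil

# Toy sitemap content for illustration; a real generator writes this file.
xml = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<urlset><url><loc>http://example.com/</loc></url></urlset>\n')
with open("sitemap.xml", "wb") as f:
    f.write(xml)

# Compress sitemap.xml into sitemap.xml.gz; copyfileobj streams in chunks,
# so even a multi-million-URL sitemap compresses in constant memory.
with open("sitemap.xml", "rb") as src, gzip.open("sitemap.xml.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```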

Re: a few questions
« Reply #3 on: March 15, 2008, 11:17:44 PM »
You can disable the HTML/ROR sitemaps to decrease the dump size, since the dump currently stores page titles and descriptions for them.