XML Sitemaps Generator

Author Topic: 500,000 page website - crawler takes too long...  (Read 67987 times)

dak

  • Registered Customer
  • Approved member
  • *
  • Posts: 1
500,000 page website - crawler takes too long...
« on: October 27, 2005, 04:18:44 AM »
Hi... So I've tried every crawler out there and I wanted to give this one a shot.

My site has over 500,000 pages. When I activated the crawler, it was crawling 500-1,000 pages an hour, and at that rate it would take 1-2 weeks to complete the crawl.

Anyway, I turned on the crawler in background mode and just checked back occasionally throughout the week to see the status. It seemed like the crawler was going too slowly, hitting the logs every 10 seconds instead of every second. This could be why it took so long.

After 7 days the crawl got to 200,000 pages, then my webhost decided to restart Apache and those 7 days went down the drain.


Do I have the script configured incorrectly? On other sites I can crawl 50,000 pages in about 1 hour, whereas on this site it takes about a day to do 50,000 pages. I think this product is cool but I wish it were faster.

Any suggestions on what I should do? Thanks


XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 9133
Re: 500,000 page website - crawler takes too long...
« Reply #1 on: October 27, 2005, 12:43:54 PM »
Hello,

1. The speed of the sitemap generator script depends MAINLY on the speed of the site being crawled. For example, if your site takes 5 seconds to create a page (including network connection time), the generator *cannot* process that page before it has been received. So, faster sites are crawled faster. A few examples of sitemap generation times:
Quote
Some of the examples of big db-driven websites:
about 35,000 URLs indexed - 1h 40min total generation time
about 200,000 URLs indexed - 38 hours total generation time
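
To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only turns the two examples above into an average time per URL and scales that to 500,000 pages; the function names are made up for illustration, and it assumes crawl time grows linearly with page count, which may not hold for a real site.

```python
def seconds_per_url(total_urls: int, hours: float) -> float:
    """Average wall-clock seconds spent per URL over a whole crawl."""
    return hours * 3600 / total_urls


def estimated_hours(total_urls: int, secs_per_url: float) -> float:
    """Projected crawl duration in hours, assuming linear scaling."""
    return total_urls * secs_per_url / 3600


# The two examples quoted above.
small = seconds_per_url(35_000, 1 + 40 / 60)   # 1h 40min
large = seconds_per_url(200_000, 38)           # 38 hours

print(f"{small:.2f} s/URL and {large:.2f} s/URL")
print(f"500,000 URLs: {estimated_hours(500_000, small):.0f} "
      f"to {estimated_hours(500_000, large):.0f} hours")
```

At those rates a 500,000-page crawl would land somewhere between roughly one and four days. A site that really needs 5 seconds per page would be far slower than either example.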

2. The 7 days since your first crawler run wouldn't have been lost if you had set the "Save the script state, every X seconds:" option of the script (added in the recent version). With this option set, the script resumes crawling from the last saved state instead of starting over.
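
For anyone curious how periodic state saving makes a crawl resumable, below is a minimal Python sketch of the general idea. It is not the generator's actual code (the script itself is PHP); the file name, the 180-second interval, and the `fetch_links` callback are all assumptions made for illustration.

```python
import json
import os
import time
from collections import deque

STATE_FILE = "crawl_state.json"   # hypothetical file name
SAVE_EVERY = 180                  # seconds between checkpoints (assumed value)


def load_state(path=STATE_FILE):
    """Return (pending queue, visited set), resuming from disk if a state file exists."""
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
        return deque(data["pending"]), set(data["visited"])
    return deque(), set()


def save_state(pending, visited, path=STATE_FILE):
    """Write to a temp file, then rename, so a crash mid-write cannot corrupt the state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"pending": list(pending), "visited": sorted(visited)}, f)
    os.replace(tmp, path)


def crawl(start_url, fetch_links, path=STATE_FILE, save_every=SAVE_EVERY):
    """Breadth-first crawl; fetch_links(url) is supplied by the caller and returns linked URLs."""
    pending, visited = load_state(path)
    if not pending and not visited:
        pending.append(start_url)          # fresh start, nothing to resume
    last_save = time.monotonic()
    while pending:
        url = pending.popleft()
        if url in visited:
            continue
        for link in fetch_links(url):
            if link not in visited:
                pending.append(link)
        visited.add(url)
        if time.monotonic() - last_save >= save_every:
            save_state(pending, visited, path)
            last_save = time.monotonic()
    save_state(pending, visited, path)     # final checkpoint
    return visited
```

If the process is killed (for example by an Apache restart), only the work done since the last checkpoint has to be repeated on the next run.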

3. You can also try to reduce the total number of pages by using the "Do not parse URLs" and "Exclude URLs" options (try eliminating less important or duplicate content).
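
As an illustration of how two such filter lists can cut down a crawl, here is a small Python sketch. It assumes plain substring matching, and it assumes that an excluded URL is dropped entirely while a "do not parse" URL is still listed but not fetched for further links; check the generator's own documentation for the exact behaviour. The patterns themselves are hypothetical.

```python
EXCLUDE = ["/print/", "sessionid=", "sort="]   # hypothetical patterns
DO_NOT_PARSE = [".pdf", "/gallery/"]           # hypothetical patterns


def matches(url, patterns):
    """True if any pattern occurs as a substring of the URL."""
    return any(p in url for p in patterns)


def classify(url):
    """Return 'skip', 'list_only', or 'crawl' for a discovered URL."""
    if matches(url, EXCLUDE):
        return "skip"        # neither listed nor fetched
    if matches(url, DO_NOT_PARSE):
        return "list_only"   # listed in the sitemap, not fetched for links
    return "crawl"           # listed and followed


for u in [
    "http://www.example.com/products/42",
    "http://www.example.com/products/42?sort=price",
    "http://www.example.com/manuals/guide.pdf",
]:
    print(classify(u), u)
```

Filtering out sort orders, session IDs, and printer-friendly copies is often where most of the duplicate pages on a database-driven site come from.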
Oleg Ignatiuk
www.xml-sitemaps.com

For maximum exposure and traffic for your web site check out our additional SEO Services.

abram

  • Registered Customer
  • Approved member
  • *
  • Posts: 4
Re: 500,000 page website - crawler takes too long...
« Reply #2 on: November 18, 2005, 07:57:43 AM »
I did a pretty big site and got this result

Total pages: 97,656
Proc. time: 35,235.65 s = 9.8 hours

It mostly depends on the system doing the crawling and the site you are crawling.

I even set it to pause for 2 seconds after every 15 requests, so I think it did it in pretty good time. Plus, I set the PHP memory limit to about 256 MB on that machine.

You should check with your host and see what's going on in the background of that machine. It might be doing more than you think, and that will increase the time the crawler takes.
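
A pause like that can be sketched in a few lines of Python. This is just an illustration of the throttling idea, not how the generator does it internally; the `throttled` helper and the `fetch` callback are names invented for the example.

```python
import time


def throttled(urls, fetch, batch_size=15, pause=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping `pause` seconds after every `batch_size` requests."""
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if count % batch_size == 0:
            sleep(pause)
    return results


# Rough cost of the pauses for the crawl above, assuming they ran the whole time
# and are included in the reported processing time.
pauses = 97_656 // 15
print(pauses, "pauses,", pauses * 2 / 3600, "hours spent sleeping")
```

Under those assumptions the pauses alone would account for about 3.6 of the 9.8 hours, so the throttle setting is worth a look when a crawl feels slow.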

sales611

  • Registered Customer
  • Approved member
  • *
  • Posts: 1
Re: 500,000 page website - crawler takes too long...
« Reply #3 on: July 27, 2009, 02:08:01 AM »
Perhaps for a new version we could have it set up so it checks for changed items? If links 1-200k haven't changed and, say, you added 5k new pages, it would be good if it could update just that part without having to recrawl the first 200k pages.

Make any sense?

~Andy

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 9133
Re: 500,000 page website - crawler takes too long...
« Reply #4 on: July 27, 2009, 12:34:24 PM »
Hello,

In order to find new pages, the sitemap generator has to crawl your site and search all pages, since links to new content can be added on ANY page.
If you have something like a "new products" section on your site, you can crawl it separately, creating a smaller sitemap more often and keeping the full sitemap unchanged.
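
One way to publish a full sitemap and a smaller, frequently refreshed one side by side is a sitemap index file, as defined by the sitemaps.org protocol. The Python sketch below builds such an index; the two file names are hypothetical, and this shows only the file format, not a feature of the generator.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a string) listing the given sitemap files."""
    ET.register_namespace("", NS)   # emit the namespace as the default xmlns
    root = ET.Element(f"{{{NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(root, f"{{{NS}}}sitemap")
        ET.SubElement(entry, f"{{{NS}}}loc").text = url
    return ET.tostring(root, encoding="unicode")


print(build_sitemap_index([
    "https://www.example.com/sitemap-full.xml",          # regenerated rarely
    "https://www.example.com/sitemap-new-products.xml",  # regenerated often
]))
```

Search engines can then be pointed at the index, and only the small file needs to be regenerated when new products are added.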

 
