dak

500,000 page website - crawler takes too long...
« on: October 27, 2005, 05:18:44 AM »
Hi... So I've tried every crawler out there and I wanted to give this a shot...

My site has over 500,000 pages... When I activated the crawler, it was crawling 500-1,000 pages an hour, and at that rate it would take several weeks to complete the crawl...

Anyway, I turned the crawler on in background mode and just checked back occasionally throughout the week to see the status... It seemed like the crawler was going too slowly, hitting the logs every 10 seconds instead of every second... This could be why it took so long...

So after 7 days the crawl got to 200,000 pages... then my web host decided to restart Apache and those 7 days went down the drain...


Do I have the script configured incorrectly? On other sites I can crawl 50,000 pages in about an hour, whereas here it takes about a day to do 50,000 pages... I think this product is cool, but I wish it was faster...

Any suggestions on what I should do? Thanks

Re: 500,000 page website - crawler takes too long...
« Reply #1 on: October 27, 2005, 01:43:54 PM »
Hello,

1. The speed of the sitemap generator script depends MAINLY on the speed of the site being crawled, i.e., if your site takes 5 seconds to create a page (including network connection time), then the generator *cannot* process it any faster than that. So, faster sites are crawled faster. A few examples of sitemap generation times:
Quote
Some of the examples of big db-driven websites:
about 35,000 URLs indexed - 1h 40min total generation time
about 200,000 URLs indexed - 38 hours total generation time
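As a rough back-of-envelope check (my own arithmetic, not from the generator's documentation), the two quoted figures imply quite different average per-page response times, which is the point about site speed dominating the crawl:

```python
# Throughput implied by the two example crawls quoted above.

def per_page_seconds(urls, total_seconds):
    """Average wall-clock time spent per URL over the whole run."""
    return total_seconds / urls

# ~35,000 URLs in 1h 40min (6,000 seconds)
fast_site = per_page_seconds(35_000, 1 * 3600 + 40 * 60)

# ~200,000 URLs in 38 hours (136,800 seconds)
slow_site = per_page_seconds(200_000, 38 * 3600)

print(f"fast site: {fast_site:.3f} s/page")  # roughly 0.17 s/page
print(f"slow site: {slow_site:.3f} s/page")  # roughly 0.68 s/page
```

At the slower site's pace, a 500,000-page crawl would take on the order of four days of pure fetching, before any pauses or server slowdowns.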

2. The 7 days of crawling wouldn't have been lost if you had set the "Save the script state, every X seconds:" option of the script (added in the recent version). With that option set, the script resumes crawling from the saved state after an interruption.

3. You can also try to reduce the total number of pages with the "Do not parse URLs" and "Exclude URLs" options (try eliminating less important or duplicate content).
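To illustrate how exclusion options like these cut a crawl down, here is a minimal sketch of the idea. This is not the generator's actual code, and its real matching rules may differ; the patterns and URLs below are made-up examples of typical duplicate-content URLs:

```python
# Hypothetical exclusion filter: skip any URL containing one of the
# configured substrings, mirroring what an "Exclude URLs" option does.

EXCLUDE_PATTERNS = ["?sort=", "/print/", "sessionid="]  # example patterns

def should_crawl(url, patterns=EXCLUDE_PATTERNS):
    """Return False if the URL matches any exclusion pattern."""
    return not any(p in url for p in patterns)

urls = [
    "http://example.com/products/widget",
    "http://example.com/products/widget?sort=price",  # duplicate content
    "http://example.com/products/widget/print/",      # printer version
]
crawlable = [u for u in urls if should_crawl(u)]
print(crawlable)  # only the first URL survives the filter
```

On a large database-driven site, sort/filter parameters and printer-friendly pages can multiply the URL count several times over without adding content worth indexing.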
Re: 500,000 page website - crawler takes too long...
« Reply #2 on: November 18, 2005, 07:57:43 AM »
I did a pretty big site and got this result:

Total pages: 97,656
Processing time: 35,235.65 s ≈ 9.8 hours

It mostly depends on the system doing the crawling and the site you are crawling.

I even set it to pause for 2 seconds after every 15 requests, so I think it finished in pretty good time. I also set the PHP memory limit to about 256 MB on that machine.
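For what it's worth, those figures imply the 2-second pauses account for a sizeable share of the total run time. This is my own arithmetic from the numbers in the post, not output from the generator:

```python
# How much of the 9.8-hour run was spent sleeping between request batches?
pages = 97_656
total_seconds = 35_235.65
pause_every = 15   # requests between pauses
pause_len = 2.0    # seconds per pause

pauses = pages // pause_every          # number of full pauses taken
sleep_seconds = pauses * pause_len     # total time spent sleeping
share = sleep_seconds / total_seconds  # fraction of the run spent paused

crawl_seconds = total_seconds - sleep_seconds
rate = pages / crawl_seconds           # effective pages/s while crawling
print(f"sleeping: {share:.0%}, effective rate: {rate:.1f} pages/s")
```

So roughly a third of the run was deliberate throttling; without the pauses, the same crawl would have finished in about 6 hours.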

You should check with your host and see what's going on in the background on that machine; it might be doing more than you think, and that will increase the crawl time.
Re: 500,000 page website - crawler takes too long...
« Reply #3 on: July 27, 2009, 03:08:01 AM »
Perhaps a new version could be set up so it checks for changed items? If links 1-200k haven't changed and, say, you added 5k new pages, it would be good if it could just update that part without having to recrawl the first 200k pages.

Make any sense?

~Andy
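What Andy describes is essentially an incremental crawl. One standard way to get part of that benefit, without the crawler storing content diffs, is HTTP conditional requests: send an If-Modified-Since header and skip any page the server answers with 304 Not Modified. This is not a feature of this generator, just a sketch of the mechanism; the URL and timestamp below are hypothetical:

```python
# Sketch of a conditional-GET incremental crawl using only the stdlib.
import urllib.request
from email.utils import formatdate

def conditional_request(url, last_crawled_epoch):
    """Build a GET that asks the server to answer 304 if unchanged."""
    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since",
                   formatdate(last_crawled_epoch, usegmt=True))
    return req

req = conditional_request("http://example.com/page1", 1_000_000_000)
print(req.get_header("If-modified-since"))
# When sent with urllib.request.urlopen(req), an unchanged page raises
# HTTPError 304, which a crawler can treat as "skip, keep old sitemap entry".
```

The catch, as the next reply points out, is that an unchanged page can still contain links to brand-new pages, so the crawler still has to fetch pages it cannot prove are unchanged.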
Re: 500,000 page website - crawler takes too long...
« Reply #4 on: July 27, 2009, 01:34:24 PM »
Hello,

in order to find new pages, the sitemap generator has to crawl your site and check all pages, since links to new content can be added on ANY page.
If you have something like a "new products" section on your site, you can crawl it separately, creating a smaller sitemap more often and keeping the full sitemap unchanged.
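The two sitemaps can then be tied together with a sitemap index file, per the standard sitemaps.org protocol, so search engines pick up both. The filenames and URLs below are examples, not generator output:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap_index.xml: references both the rarely-regenerated full
     sitemap and the small, frequently-regenerated section sitemap. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap_full.xml</loc>
  </sitemap>
  <sitemap>
    <!-- small sitemap for the "new products" section -->
    <loc>http://www.example.com/sitemap_new_products.xml</loc>
    <lastmod>2009-07-27</lastmod>
  </sitemap>
</sitemapindex>
```

You submit only the index file to the search engines; they fetch the child sitemaps from it.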
Re: 500,000 page website - crawler takes too long...
« Reply #5 on: July 10, 2013, 03:54:44 PM »
I have followed all the instructions for setting up the standalone XML sitemap generator, but after 3 months I still haven't been able to fully crawl my website, either manually or via cron job. I don't think my website is 500K pages big, but this is the first post I saw with the reply option. Please assist.
Re: 500,000 page website - crawler takes too long...
« Reply #6 on: June 15, 2015, 11:38:03 AM »
My website is also 500k pages or even larger; I wonder how much time it would take to crawl. I just bought this tool. Is it possible to crawl the website on my local machine (i.e., localhost)? That would reduce the server response time significantly, resulting in faster crawling.
Re: 500,000 page website - crawler takes too long...
« Reply #7 on: June 16, 2015, 04:16:05 AM »
Hello,

server response time would be faster if you install the generator on the same machine where the website is located (which is recommended).
Installing it on a separate local machine is less optimal.

You can run it locally if needed, though; you would just need to install a web server with PHP support on the local machine.
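For a quick local setup, PHP's built-in development server (available in PHP 5.4 and later) avoids installing a full Apache or WAMP stack. The directory path here is only an example:

```shell
# From the directory where the generator is unpacked (path is an example):
cd ~/generator

# Start PHP's built-in web server on port 8000 (PHP 5.4+).
php -S localhost:8000

# Then open http://localhost:8000/index.php in a browser to run the generator.
```

The built-in server is single-threaded and meant for development only, which is fine for running the generator against a remote site.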
Re: 500,000 page website - crawler takes too long...
« Reply #8 on: July 04, 2015, 04:01:37 PM »
For those of you struggling with this, the last post here seems to have saved me. All the posts I've seen talk about increasing php.ini memory limits and the like. That's difficult or impossible on GoDaddy, and insufficient in any case.

Since I've got a WAMP environment on my PC in my home office, I loaded it onto one of my test sites and wammo! It works! Mind you, it's still slow and still working its way through my 6,000-page site, but unlike the installation at the site on GoDaddy, this one doesn't stop and need to be restarted. I'm sure if I went into my local php.ini it would run faster, but I'm so desperate for this to complete that I'll wait and tweak it later.

Light at the end of the tunnel. Wish I had known earlier that I could load it on another machine... just never thought about that one.