john

This is Going to Take Forever
« on: March 04, 2006, 11:30:54 PM »
I started the Sitemap Generator a few days ago.  It added the first 20,000 pages in the first 6 or 7 hours it was running.   Then it stopped running after about 24 hours, so I restarted it from where it left off.   Now it is scanning very slowly - maybe 100 pages an hour - and it stops every few hours and I need to restart it.  Also, I notice it is scanning URL strings on my exclude list, like Auction/APViewRatings.asp? and Auction/APAddRating.asp?

I almost think I should just stop the thing and start from scratch, because I don't see an end in sight.  If the pages I excluded are really excluded, my sitemap will probably be around 100,000 pages.

Below are snapshots of the progress at four different times today. I have had to restart it twice today so far.  You can see the thing is running super slow.  My site is on a shared Windows IIS server.    Should I stop and start from scratch, or let this thing run for a month or so?

11 AM 3/4/06

Already in progress. Current process state is displayed:
Links depth: 5
Current page: Auction/APViewRatings.asp?UserID=303&returnto=APViewItem%2Easp%3Fid%3D41968
Pages added to sitemap: 23167
Pages scanned: 31880 (738,227.6 Kb)
Pages left: 57714 (+ 16918 queued for the next depth level)
Time passed: 2733:07
Time left: 4947:55
Memory usage: -

12:45PM 3/4/06

Links depth: 5
Current page: Auction/APViewRatings.asp?UserID=303&returnto=APViewItem%2Easp%3Fid%3D22030
Pages added to sitemap: 23237
Pages scanned: 32020 (740,132.9 Kb)
Pages left: 57574 (+ 17092 queued for the next depth level)
Time passed: 2826:17
Time left: 5081:50
Memory usage: -


2:45PM 3/4/06

Links depth: 5
Current page: Auction/APAddRating.asp?UID=7490&ItemID=41111
Pages added to sitemap: 23334
Pages scanned: 32200 (743,074.4 Kb)
Pages left: 57394 (+ 17590 queued for the next depth level)
Time passed: 2948:00
Time left: 5254:36
Memory usage: -


6:15PM 3/4/06

Links depth: 5
Current page: Auction/APViewItem.asp?ID=43845
Pages added to sitemap: 23435
Pages scanned: 32360 (746,032.5 Kb)
Pages left: 57234 (+ 17972 queued for the next depth level)
Time passed: 3137:58
Time left: 5550:02
Memory usage: -
Re: This is Going to Take Forever
« Reply #1 on: March 05, 2006, 12:01:33 AM »
Hello John,

the progress indicator screen may display ANY link found on your pages; it doesn't mean that link is actually *fetched*. It means it is being "processed" at that moment (i.e., compared against the exclusion list).

As for the generation performance, it depends on the speed of fetching the pages from your website. You can *greatly* improve generation time by using the "Do not parse URLs" option.
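
To illustrate the difference, here is a rough sketch of the idea - this is not the generator's actual code, and the function and variable names are invented. Roughly speaking, URLs matching "Exclude URLs" are never listed or fetched at all, while URLs matching "Do not parse URLs" are still listed in the sitemap but are never downloaded and scanned for further links, which is where the time savings come from.

<?php
// Rough illustration only - not the generator's real code.
// $excludeList and $noParseList stand in for the "Exclude URLs"
// and "Do not parse URLs" settings.

function matchesAny($url, array $patterns) {
    foreach ($patterns as $pattern) {
        if ($pattern !== '' && strpos($url, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

function handleUrl($url, array $excludeList, array $noParseList, array &$sitemap, array &$queue) {
    if (matchesAny($url, $excludeList)) {
        return;                       // excluded: not listed, not fetched
    }
    $sitemap[] = $url;                // listed in the sitemap either way
    if (matchesAny($url, $noParseList)) {
        return;                       // listed, but never fetched and parsed for links
    }
    $queue[] = $url;                  // normal page: fetch it and extract links later
}

// Example using URLs from this thread:
$sitemap = array();
$queue   = array();
handleUrl('/Auction/APViewItem.asp?ID=41968',
          array('Auction/APViewRatings.asp?'),   // Exclude URLs
          array('Auction/APViewItem.asp?ID='),   // Do not parse URLs
          $sitemap, $queue);
// -> the item page goes into $sitemap, but not into the fetch queue

Every URL on the "do not parse" list is one less page the crawler has to download and scan.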

This topic has been discussed several times already:

https://www.xml-sitemaps.com/forum/index.php/topic,177.html
https://www.xml-sitemaps.com/forum/index.php/topic,182.html
https://www.xml-sitemaps.com/forum/index.php/topic,95.msg387.html#msg387

john

Re: This is Going to Take Forever
« Reply #2 on: March 05, 2006, 12:10:59 AM »
I already read those posts. I excluded a big portion of the fluff URLs on my site (i.e., user ratings, search) and used "do not parse URLs" on the 50,000 auction pages (/Auction/APViewItem.asp?ID=) on my site, which is probably over 90% of my content.
« Last Edit: March 05, 2006, 12:36:21 AM by john »

john

Re: This is Going to Take Forever
« Reply #3 on: March 05, 2006, 12:28:15 AM »
Well Sitemap Generator took down my entire website.

I checked my site about 10 minutes ago and got a big:

Bad Request (Invalid Hostname)

I contacted my host support and got this nasty message:

The PHP processes spawning from your website were taking up 179MB of RAM per instance, and about 25% of the entire CPU per instance as well.  We have disabled the site for 15 minutes as a temporary server resolution so as not to affect others on the server.  We will re-enable it shortly; however, if the instances continue to occur we will be forced to disable the instances causing them.

So it seems my sitemap generation was slowing to a standstill because it was using up the resources on the shared server where my site resides.

Two questions:

1.) What went wrong? 

2.) How can I get a site map of my site without taking down my server?


Also, when they say 179 MB of RAM per instance, does that mean I had several sitemaps being generated at the same time - like one instance for each time I had to restart it?  Is this even possible?  If so, why did the crawl page say:

Resume last session
Continue the interrupted session (2006-03-04 16:36:06)
Click button below to start crawl manually:

Re: This is Going to Take Forever
« Reply #4 on: March 05, 2006, 12:42:26 AM »
Hello,
Quote
I already read those posts and I excluded a big portion of the fluff URLs on my site (ie user ratings, search)  and did use the "do not parse URLs" on the 50,000 auction pages (/Auction/APViewItem.asp?ID=) on my site which is probably over 90 % of my content.
Great. Also make sure that the entries from "Exclude URLs" are added to "Do not parse URLs" as well.
Quote
Well Sitemap Generator took down my entire website.

I checked my site about 10 minutes ago and got a big:
Maybe you started a new crawling session before the previous one was stopped and had multiple scripts running as a result.
You can use the "Make a delay between requests, X seconds after each N requests" option to reduce the server load.
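
Roughly speaking, the delay option boils down to something like the following - a simplified sketch with invented names, not the generator's real code:

<?php
// Simplified sketch of "delay X seconds after each N requests".
// Not the generator's actual code; names are invented for illustration.

function fetchPage($url) {
    return @file_get_contents($url);   // stand-in for the real HTTP fetch
}

$delaySeconds     = 2;    // X - pause length
$requestsPerBatch = 20;   // N - requests between pauses
$requestCount     = 0;
$queue            = array('http://example.com/');   // URLs waiting to be crawled

foreach ($queue as $url) {
    fetchPage($url);
    $requestCount++;
    if ($requestCount % $requestsPerBatch === 0) {
        sleep($delaySeconds);          // give the shared server time to breathe
    }
}

With 2 seconds after every 20 requests, the added wait averages 0.1 seconds per page - roughly 2.8 hours spread over 100,000 pages - which is usually a fair trade on a shared server.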

john

Re: This is Going to Take Forever
« Reply #5 on: March 05, 2006, 01:53:08 AM »
So how do I start a new crawl, clean - from scratch? Delete the files in the data directory?


Also, how do I know when the script has stopped?

When I hit the Crawl tab, I get the message:

Resume last session
Continue the interrupted session (2006-03-04 16:36:06)
Click button below to start crawl manually:

Does this mean it stopped?  Because when I got this message, I would hit run to continue the interrupted session.  I did this at least 5 times when I thought the script had stopped.


I assumed that when I got a progress report like:

Links depth: 5
Current page: Auction/APViewRatings.asp?UserID=303&returnto=APViewItem%2Easp%3Fid%3D22030
Pages added to sitemap: 23237
Pages scanned: 32020 (740,132.9 Kb)
Pages left: 57574 (+ 17092 queued for the next depth level)
Time passed: 2826:17
Time left: 5081:50
Memory usage: -

the script was running, and that when I got a "resume last session" message, the script had stopped.

I need to make sure this is done right and the script doesn't overload my server again, or they may just take my site down for more than 15 minutes.

Also, I had a delay of 2 seconds after every 20 requests - is this enough?
« Last Edit: March 05, 2006, 01:55:11 AM by john »
Re: This is Going to Take Forever
« Reply #6 on: March 05, 2006, 10:27:29 AM »
Hello John,

you may want to resume the session only when significant time has passed since the specified session time ("Continue the interrupted session (2006-03-04 16:36:06)") - at least an hour, to be sure.
Important: make sure you don't have a small value for the "Save state" option, which would force it to save every few pages (set it to several *minutes* at least).
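
In other words, the state saving should be time-based rather than page-based, something along these lines (an illustrative sketch only - the names are invented and this is not the generator's actual implementation):

<?php
// Sketch: save crawler state at most once every few minutes,
// instead of after every handful of pages.
// Invented names; not the generator's real implementation.

function saveState($stateFile, array $state) {
    file_put_contents($stateFile, serialize($state));
}

$saveIntervalSeconds = 5 * 60;          // "Save state" set to several minutes
$lastSaveTime        = time();
$queue               = array('http://example.com/page1', 'http://example.com/page2');

foreach ($queue as $url) {
    // ... fetch and process $url here ...
    if (time() - $lastSaveTime >= $saveIntervalSeconds) {
        saveState('crawler_state.dat', array('queue' => $queue));
        $lastSaveTime = time();
    }
}

Writing the state file is relatively expensive, so saving after every few pages instead of every few minutes slows the whole crawl down noticeably.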

Also, there is no specific value for the "delay" option that will work in all cases, but 2 seconds after every 20 requests might be enough.

john

Re: This is Going to Take Forever
« Reply #7 on: March 05, 2006, 04:55:36 PM »
This morning I tried to start a sitemap from scratch.  It seemed to be off and running fine - 900 pages crawled in the first 10 minutes.  Then when I checked back later by loading Sitemap Generator in my browser and selecting the Crawl tab, it seemed the crawl had stopped before it even reached my 30-minute save point.    I would guess my host may have put the clampdown on this PHP process after it almost brought down the server yesterday.

Now when I click the Crawl tab I get the page:

Crawling
Run in background
Do not interrupt the script even after closing the browser window until the crawling is complete
Click button below to start crawl manually:
Cron job setup
You can use the following command line to setup the cron job for sitemap generator:
/usr/bin/php d:\i.....

I assume this means it is no longer crawling?  I don't want to hit the crawl button again unless I am sure the crawl has stopped.  If I don't see a crawl status page, does that mean it has stopped crawling?

Now what should I try?



john

Re: This is Going to Take Forever
« Reply #8 on: March 07, 2006, 02:29:11 AM »
Any suggestions?  Ever since the Sitemap Generator almost took down the shared server I am on, it will no longer run more than 2 minutes before it stops.  I can't even get to a save point 5 minutes in.

I suspect my host has blacklisted the program.  If you think I am doing something wrong please let me know.  I am desperate to get a sitemap done for my site and would much rather have a sitemap than my money back.
Re: This is Going to Take Forever
« Reply #9 on: March 07, 2006, 07:47:11 PM »
Hello,

most likely your host didn't blacklist the script, but limited the maximum execution time for all scripts. I suggest running the sitemap generator from the server console (not from the browser) - in that case the limitations usually do not apply.
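
For background: when PHP runs through the web server, the host's max_execution_time limit (often 30-60 seconds on shared hosting) can kill a long-running script, while the command-line (CLI) version of PHP defaults that limit to 0, i.e. unlimited. A hypothetical launcher run as "php console_crawl.php" might look like this - the file name and the commented include path are placeholders, not the generator's actual entry points:

<?php
// console_crawl.php - hypothetical wrapper for starting a crawl from the shell,
// e.g.:  php console_crawl.php
// The include path below is a placeholder, not the generator's real entry point.

if (php_sapi_name() !== 'cli') {
    die("Please run this from the server console, e.g.: php console_crawl.php\n");
}

set_time_limit(0);                                                     // CLI usually has no limit, but make it explicit
echo 'max_execution_time: ' . ini_get('max_execution_time') . "\n";    // typically 0 under CLI

// require '/path/to/generator/runcrawl.php';   // placeholder for the real crawl entry point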