Problems with first Crawl
« on: July 28, 2009, 08:58:51 PM »
Hi,

I'm new and just bought the sitemap program.  Right now I have 16,000 pages and expect it to grow by about 500-1000 pages per day. Anyway, I just started the crawl and it cannot make it through.  It just stops with the following (and it's different every time):

Links depth: 2
Current page: tag/credit/
Pages added to sitemap: 495
Pages scanned: 1620 (60,370.3 KB)
Pages left: 4571 (+ 5817 queued for the next depth level)
Time passed: 1:11:17
Time left: 3:21:08
Memory usage: 4,755.7 Kb

What do I need to do so it can run all the way through?  More memory, more speed, etc.?  The problem is that I'm using shared hosting, so I do not have access to any configs.
Re: Problems with first Crawl
« Reply #1 on: July 29, 2009, 04:57:42 AM »
Hello,

I'd recommend adding these entries to the "Do not parse" option first:
Code:
tag/
feed/
Re: Problems with first Crawl
« Reply #2 on: July 30, 2009, 01:55:56 AM »
That has helped, thanks so much.  I had to run the program 4 times manually before it finally completed.  The problem, though, is that the crawl didn't obtain all my URLs.  It only indexed 3,000 out of 18,000.  What can I do to fix this?
Re: Problems with first Crawl
« Reply #3 on: July 30, 2009, 07:46:50 AM »
Hello,

could you please PM me your generator URL, an example URL that is not included in the sitemap, and how it can be reached from the homepage?
Re: Problems with first Crawl
« Reply #4 on: July 31, 2009, 04:37:22 AM »
Sure, I PM'd all the info.  Just curious, is there any way to enable logging to see where the problem lies?  Also, I was curious whether your software supports remote crawls.  I was thinking that if I created a separate hosting account just for your software and crawled my site remotely, then resources (or whatever it is) would be saved for the site itself.  Does your software support this?
Re: Problems with first Crawl
« Reply #5 on: July 31, 2009, 01:23:45 PM »
Hello,

in most cases this issue is resolved by analyzing the site structure and optimizing the crawler settings with the "Exclude URLs" and "Do not parse" options.
Yes, it is possible to crawl the site from a remote account, but the resulting sitemap files will then have to be manually moved to the main server where the site is hosted.
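For example, a copy step like the one below (run from the remote account) would move the files over; the hostname and all paths are placeholders for illustration, since the actual location of the generated files depends on your install:

Code:
# Copy the generated sitemap from the remote crawling account
# to the main server where the site is hosted.
# Hostname and paths are placeholders -- adjust to your accounts.
scp /home/crawler/public_html/sitemap.xml user@www.example.com:/home/user/public_html/sitemap.xml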
Re: Problems with first Crawl
« Reply #6 on: August 02, 2009, 04:49:32 AM »
Would someone be able to recommend a hosting company that doesn't limit PHP scripts (and is also affordable)?  My host says they do not, but I cannot figure out why the script just stops.  I need to be able to build my sitemap.  Does anyone not have this issue?  If so, what host do you use?
Re: Problems with first Crawl
« Reply #7 on: August 28, 2009, 04:04:58 PM »
OK, I moved my site to a dedicated server (I outgrew shared hosting).  Now the software is still crawling my site, but it stops every so often.  Is there any way to set the software to continue without human intervention?  For example, after two days the software stopped for some reason and I had to manually kick it off again.  Is there a way I can set it to continue once it detects the script stopped?  So far my first crawl has been running for 90+ hours (and I had to manually restart it twice).

Links depth: 4
Current page: friendship-month-%e2%80%93-the-joys-of-friendship/
Pages added to sitemap: 20847
Pages scanned: 26060 (963,508.8 KB)
Pages left: 5131 (+ 8691 queued for the next depth level)
Time passed: 90:21:10
Time left: 17:47:22
Memory usage: 24,787.5 Kb
Re: Problems with first Crawl
« Reply #8 on: August 29, 2009, 03:06:53 AM »
Hello,

yes, you can set up a daily scheduled task (cron job) for the sitemap generator in your hosting control panel, and it will automatically resume generation in case it has stopped.
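A crontab entry for this might look like the sketch below; the PHP binary path, install path, and script name are assumptions here, so check the generator's documentation for the exact command for your copy:

Code:
# Run once a day at 3:00 AM; the generator picks up and resumes an interrupted crawl.
# PHP binary path, install path, and script name are placeholders.
0 3 * * * /usr/bin/php /home/user/public_html/generator/runcrawl.php >/dev/null 2>&1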
Re: Problems with first Crawl
« Reply #9 on: August 31, 2009, 10:00:18 PM »
Great news.  So if the job is running fine and the cron job kicks off, would this cause the job to fail since it's already running?  Just checking, because I was thinking about having it run every 6 hours or so.
Re: Problems with first Crawl
« Reply #10 on: September 01, 2009, 11:54:48 AM »
Yes, it will check whether another job is running and will skip the session in this case.
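If you want an extra safeguard on the cron side as well, a lock wrapper such as flock gives the same skip-if-running behavior; the lock file location, paths, and script name below are placeholders:

Code:
# Every 6 hours; flock -n exits immediately if the previous run still holds the lock.
# Lock file, PHP path, and script name are placeholders -- adjust to your install.
0 */6 * * * /usr/bin/flock -n /tmp/sitemap-gen.lock /usr/bin/php /home/user/public_html/generator/runcrawl.php >/dev/null 2>&1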
Re: Problems with first Crawl
« Reply #11 on: September 02, 2009, 10:11:07 PM »
OK, well it now completes the crawl (very quickly), but it only reports 7,956 URLs rather than my 55,000-plus URLs.  I ran this several times with the same results.  Have you seen this before?  How can I fix this?
Re: Problems with first Crawl
« Reply #13 on: September 03, 2009, 04:06:36 PM »
One of the URLs is

[ External links are visible to forum administrators only ]