Standalone Sitemap Generator Indexing Problem...
« on: April 26, 2008, 04:50:18 AM »
Hi Guys,

I've been using your online Sitemap generator for quite a while now, but I was a little concerned about the standalone version after reading in the forums about some of the install problems people were having. I just installed the standalone version of Sitemap Generator and wanted to let you know that the install took all of 5 minutes from download to generating the first sitemap, and the directions were perfect.

I do have one issue, and I'm sure it's a simple one. With the online generator, my new site generates more than 500 pages. The standalone version only sees the pages within the root directory, even though index.html has links into the directory containing the majority of our content. The URL is [ External links are visible to forum administrators only ]. If you could take a look and provide some info, it would be greatly appreciated.

Thanx,
  Scott
Re: Standalone Sitemap Generator Indexing Problem...
« Reply #1 on: April 26, 2008, 07:24:33 AM »
Hi Guys,

I was able to find the problem and get Sitemap Generator to find the pages that weren't being indexed, but now I have another small issue. When I start a crawl, the generator adds 1594 pages and stops. I then check the "Run in Background" and "Resume last session" boxes and start the generator again. It adds pages and quits here:

Links depth: 4
Current page: CruiseAds/index.php?method=create_form&list=classifiedsuser&rollid=449&fromlist=advertisement&frommethod=showdetails&fromid=449&fromfromlist=advertisement_active&fromfrommethod=showhtmllist
Pages added to sitemap: 3234
Pages scanned: 3240 (36,909.8 Kb)
Pages left: 7224 (+ 8268 queued for the next depth level)
Time passed: 6:03
Time left: 13:30
Memory usage: 14,833.8 Kb
Resuming the last session (last updated: 2008-04-26 01:58:34)


When I go back to the "Crawl" tab to restart, the message under "Resume last session" has not been updated with the new page count and info. It still reads:

Continue the last session (2008-04-26 1:58:34, URLs added 1594, estimated URLs left: 11741)

I then check the "Run in Background" and "Resume last session" boxes and start the generator again. It adds pages and quits, and I get a repeat of the above. It appears to be quitting on the same page each time and not saving the pages added up to that point. If you could provide some info, it would be greatly appreciated. All settings are at default with no modifications...

Thanx,
  Scott
« Last Edit: April 26, 2008, 07:49:44 AM by scott9 »
Re: Standalone Sitemap Generator Indexing Problem...
« Reply #2 on: April 26, 2008, 12:34:12 PM »
Hello,

It looks like your server configuration doesn't allow the script to run long enough to create the full sitemap. Please try increasing the memory_limit and max_execution_time settings in the PHP configuration at your host (php.ini file), or contact hosting support about this.
Re: Standalone Sitemap Generator Indexing Problem...
« Reply #3 on: April 26, 2008, 05:48:54 PM »
Hi Guys,

I increased each of the ;Resource Limits; settings in php.ini by a factor of 4. This lets the generator run longer each time before stopping, and has let me see that more pages are being added to the map on each run. Here's the latest:

Links depth: 5
Current page: CruiseAds/index.php?method=create_form&list=classifiedssearch&fromlist=advertisement&frommethod=showdetails&fromid=523&fromfromlist=advertisement_active&fromfrommethod=showhtmllist&cid=523
Pages added to sitemap: 11573
Pages scanned: 11580 (159,123.8 Kb)
Pages left: 49165 (+ 21428 queued for the next depth level)
Time passed: 27:16
Time left: 115:49
Memory usage: 75,781.7 Kb
Resuming the last session (last updated: 2008-04-26 11:52:24)


I'm still running the generator with its default config, and looking at the pages being generated, it seems I may need to set some config parameters. The generator has 11,573 pages indexed plus 49,165 left to scan, and if the rate at which it is adding pages to the queue holds, it will hold over 100,000 by the time it gets to link level 6. I hate to think where it will be if it goes to more link levels. The site I'm running the generator on has about 9 html pages in the root directory, with a link to the php based classified section one directory up. The classified section currently contains about 500 ads. I'm trying to get a sitemap containing the whole site, including the ads. Do these numbers look normal for this type of map, or do I have the generator indexing pages that should be excluded? Your help is appreciated.
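
As a rough sanity check on those numbers, the counters from the two progress reports quoted in this thread can be compared with the expected size of the site. This is only a sketch: it assumes that "pages added", "pages left" and "queued for the next depth level" can simply be summed into a count of discovered URLs, and the names `reports`, `EXPECTED_PAGES` and `known_urls` are made up for illustration.

```python
# Counters copied from the two progress reports quoted in this thread.
reports = {
    "depth 4": {"added": 3234, "left": 7224, "queued_next": 8268},
    "depth 5": {"added": 11573, "left": 49165, "queued_next": 21428},
}

# Approximate expected size: "about 9 html pages" plus "about 500 ads".
EXPECTED_PAGES = 9 + 500


def known_urls(report):
    """Sum the counters (assumption: together they cover every discovered URL)."""
    return report["added"] + report["left"] + report["queued_next"]


for label, report in reports.items():
    total = known_urls(report)
    ratio = total / EXPECTED_PAGES
    print(f"{label}: {total} URLs discovered, roughly {ratio:.0f}x the expected page count")
```

A discovered-URL count far above the expected page count suggests the crawler is following links that generate many URL variants of the same content, rather than finding genuinely new pages.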

Thanx,
  Scott
Re: Standalone Sitemap Generator Indexing Problem...
« Reply #4 on: April 27, 2008, 06:15:24 PM »
Hello,

You can use the "Do not parse" and "Exclude URLs" options to exclude "noise content" from crawling, such as sorting links, "tell a friend" links, etc. If you need assistance setting this up, please PM me your generator URL/login.
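
To illustrate the general idea of excluding URLs by pattern, here is a minimal sketch. It is not the generator's actual implementation and does not claim to match how its "Exclude URLs" option works internally; it only assumes that a URL is skipped when it contains one of the listed substrings. The function names and the third sample URL are hypothetical; the first two URLs are the "Current page" values quoted earlier in this thread.

```python
# Hypothetical exclusion list: skip any URL containing one of these substrings.
EXCLUDE_SUBSTRINGS = ["method=create_form"]

SAMPLE_URLS = [
    # Taken from the progress reports quoted in this thread.
    "CruiseAds/index.php?method=create_form&list=classifiedsuser&rollid=449"
    "&fromlist=advertisement&frommethod=showdetails&fromid=449"
    "&fromfromlist=advertisement_active&fromfrommethod=showhtmllist",
    "CruiseAds/index.php?method=create_form&list=classifiedssearch"
    "&fromlist=advertisement&frommethod=showdetails&fromid=523"
    "&fromfromlist=advertisement_active&fromfrommethod=showhtmllist&cid=523",
    # Hypothetical ad detail page, invented for this example.
    "CruiseAds/index.php?method=showdetails&list=advertisement&rollid=449",
]


def is_excluded(url, patterns=EXCLUDE_SUBSTRINGS):
    """Return True if the URL contains any of the exclusion substrings."""
    return any(pattern in url for pattern in patterns)


def filter_urls(urls, patterns=EXCLUDE_SUBSTRINGS):
    """Keep only the URLs that match none of the exclusion substrings."""
    return [url for url in urls if not is_excluded(url, patterns)]


kept = filter_urls(SAMPLE_URLS)
print(f"kept {len(kept)} of {len(SAMPLE_URLS)} URLs")
```

Plain substring matching is the simplest rule to reason about; it also means a short pattern can match more URLs than intended, so patterns are best kept specific.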
Re: Standalone Sitemap Generator Indexing Problem...
« Reply #5 on: April 28, 2008, 01:26:24 AM »
Oleg,

Thanx for the offer. I may take you up on it in the future, but I was able to take a look at the log files and see the problem. Just like you said, the majority of the links being indexed after the 3rd link level were "noise content": tell a friend, search, email the advertiser, etc. I just pulled the link depth back to the point where the noise content began. I was then able to set the ;Resource Limits; in php.ini back to the defaults and index the site in under a minute, capturing all the relevant links. It makes me wonder whether a lot of the problems with memory errors and the program stopping that I've been reading about in the forums are caused by not limiting the content the program is allowed to scan. Thanx for your help, and by the way, nice job with the program.
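
For anyone curious why pulling the link depth back helps so much, here is a small sketch of a depth-limited breadth-first crawl over a toy link graph. It is an illustration only, not the generator's code: the `LINKS` graph, its page names and the `crawl` function are all invented, and real crawlers fetch pages over HTTP rather than reading a dictionary.

```python
from collections import deque

# Toy link graph (hypothetical): each page maps to the pages it links to.
LINKS = {
    "index.html": ["ads/list"],
    "ads/list": ["ads/1", "ads/2"],
    "ads/1": ["ads/1/tell-a-friend", "ads/1/email-advertiser"],
    "ads/2": ["ads/2/tell-a-friend", "ads/2/email-advertiser"],
}


def crawl(links, start, max_depth):
    """Breadth-first walk that never follows links out of pages at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: record the page, ignore its links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


for limit in (2, 3):
    print(f"max depth {limit}: {len(crawl(LINKS, 'index.html', limit))} pages")
```

In this toy graph the ads sit at depth 2 and the "noise" links at depth 3, so lowering the limit by one level drops every noise page while keeping every ad. On a real site the cut-off depth would have to be found from the crawl log, as described above.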

Thanx,
  Scott