Pages not indexed and not requested from server
« on: August 13, 2008, 02:32:37 PM »
We are finding that the standalone version of the XML sitemap generator is not indexing a large number of pages on our site. Here's an example of the output while it is running:

Links depth: 3
Current page: HS1118949/A-2-Bedroom-property-in-France/100/
Pages added to sitemap: 1534
Pages scanned: 10400 (87,875.2 KB)
Pages left: 11309 (+ 0 queued for the next depth level)
Time passed: 40:07
Time left: 43:37
Memory usage: 8,211.9 Kb

Even after it has finished scanning more than 20,000 pages, the index only has those 1534 pages in it. I also noticed that during the link depth 3 phase of the scan no requests were being made to the server. For example, the URL given above, 'HS1118949/A-2-Bedroom-property-in-France/100/', doesn't appear in the web server logs, even though earlier requests made by the sitemap generator do appear.

I've set max_execution_time and memory_limit as suggested in other posts, and I've also set xs_crawl_ident, which was suggested as well, but none of this fixes the problem.

Any idea what might be causing the problem?
Re: Pages not indexed and not requested from server
« Reply #1 on: August 13, 2008, 03:46:12 PM »
Hello,

I'd suggest limiting the "maximum URLs" number to, say, 1000 for testing, then creating the sitemap and checking which URLs are actually included in it and which are not.
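
One way to do that check without reading the sitemap by eye is a small script. This is only a rough sketch, not part of the generator: it assumes the output is a standard sitemaps.org XML file, and the function names `sitemap_paths` and `missing_paths` and the `www.example.com` host are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_paths(xml_text):
    """Return the set of URL paths found in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return {
        urlparse(loc.text.strip()).path
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    }


def missing_paths(xml_text, expected_paths):
    """Return the expected paths that are NOT listed in the sitemap, sorted."""
    listed = sitemap_paths(xml_text)
    return sorted(p for p in expected_paths if p not in listed)


if __name__ == "__main__":
    # Hypothetical one-entry sitemap; the host name is a placeholder.
    example = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.example.com/some/page/</loc></url>"
        "</urlset>"
    )
    print(missing_paths(example, ["/some/page/", "/another/page/"]))
```

For a real run you would read the generated sitemap file into `xml_text` and pass in the list of paths you expect to be there.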
Re: Pages not indexed and not requested from server
« Reply #2 on: August 14, 2008, 03:58:05 PM »
The first 1000 pages appear to be indexed correctly; no pages appear to have been skipped at this point. The problem shows up after around 1500 pages have been indexed: for some reason it then stops adding any more. It still appears to be trying to process them, but according to the web server logs no requests are being made.

For instance, the following page is correctly indexed:
/search/holidaysales/Properties-in-Spain/psE10AstpE1AcurE100AcntE309ApoaEtrueAsldEfalse/1/

It has a link to the page:
/HS1182487/Salia-2/100/

which is correctly added to the index. But another page that has been indexed:
/search/holidaysales/Properties-in-Spain/psE10AstpE1AcurE100AcntE309ApoaEtrueAsldEfalse/2/

has a link to the page:
/HS1132529/La-Ciguena/100/

and for some reason this page never gets added to the index.

It is as if something is causing the XML sitemap generator to stop making requests for pages, even though it appears to be processing them.
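
For anyone wanting to repeat the log check, here is roughly how it could be scripted. This is a sketch under assumptions: it expects the access log to be in Apache common/combined format, and the function names `requested_paths` and `never_requested` are made up for illustration. The sample log line is invented, not taken from our server.

```python
import re

# Matches the quoted request part of a common/combined log line,
# e.g. "GET /some/path/ HTTP/1.1", and captures the requested URL.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def requested_paths(log_lines):
    """Return the set of paths (query strings removed) requested in the log."""
    paths = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?", 1)[0])
    return paths


def never_requested(log_lines, paths_to_check):
    """Return the paths from paths_to_check that never appear in the log."""
    seen = requested_paths(log_lines)
    return [p for p in paths_to_check if p not in seen]


if __name__ == "__main__":
    # Invented sample line for illustration only.
    sample_log = [
        '192.0.2.1 - - [14/Aug/2008:15:00:00 +0000] '
        '"GET /HS1182487/Salia-2/100/ HTTP/1.1" 200 5120',
    ]
    to_check = ["/HS1182487/Salia-2/100/", "/HS1132529/La-Ciguena/100/"]
    print(never_requested(sample_log, to_check))
```

Feeding it the real access log and the list of pages missing from the sitemap would show which of them the generator never actually requested.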