Sitemap generator only crawling 1/4th of pages
« on: August 19, 2010, 07:26:18 AM »
URL: [ External links are visible to forum administrators only ]
I’m not exactly sure where to go with this, but I was emailing Philip earlier and couldn’t quite explain my issue in a way he could understand. I’ll go into detail here to make it very clear.
1.) I have been using the XML-Sitemaps generator for about two months without problems.
2.) I recently added some Disallow rules to robots.txt to shore up some duplicate content issues on Google.
3.) One of the directories I disallowed in my robots.txt file is /jreviews/, which is a Joomla component I use. Because /jreviews/ is restricted by robots.txt, Google Webmaster Tools reports sitemap crawl errors for URLs that contain /jreviews/ (an example robots.txt entry is shown after this list).
4.) Again, I have never had a problem with the generator crawling my site, although today Philip said the pages that are not being crawled need to be linked from somewhere accessible in order to be crawled. They are obviously linked from somewhere accessible, so this was a little confusing, since the generator has worked perfectly in the past with no problems as proof. The pages I am referring to are all the poker room listings within the "Poker Room" directory, which is accessible via the main menu on the home page.
5.) Today I added “limit:”, “poker_rooms” and “Poker-Room-Directory” to the excluded URLs in the generator. I went from 221 URLs in my sitemap to around 50. I was experimenting with how the generator blocks certain strings, to figure out what I could add to eliminate /jreviews/ and, more importantly, the crawl errors I am getting from Google Webmaster.
6.) I then deleted those additions from the excluded URLs and ran the generator again, expecting to start with a full sitemap. I again got around 50 URLs. While running the generator I noticed it skipped the main menu item “Poker Rooms” altogether while indexing all the other pages. Poker Rooms is where all my missing pages are. Again, this is where Philip said there needs to be some accessibility from other pages, which there clearly is. The only thing I can think of is that when I sent him examples of the pages that weren’t being indexed, the first example I gave ([ External links are visible to forum administrators only ]) isn’t a real URL, just an example. I can come up with no better reason why he would think there is no accessibility.
7.) This is that email, for clarity:
All of the URLs that are [ External links are visible to forum administrators only ] that aren't linked from the home page. Probably about 140 of them. A few examples:
[ External links are visible to forum administrators only ]
[ External links are visible to forum administrators only ]
[ External links are visible to forum administrators only ]
 
On Wed, Aug 18, 2010 at 4:58 PM, XML-Sitemaps <contact@xml-sitemaps.com> wrote:
Hi,
Can you give me an example of a page not included?
Regards,
Philip
XML-Sitemaps.com
8.) I have a printed copy of the sitemap from 8/4/2010 with the same URL structure as the site has today. Nothing has changed, except that the generator is now unable to crawl the site, or to be exact, the Poker Room directory.
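For reference, the robots.txt entry in question is just a normal Disallow rule along these lines (my actual file has a few more entries; this is only the part relevant here):

User-agent: *
Disallow: /jreviews/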
Re: Sitemap generator only crawling 1/4th of pages
« Reply #1 on: August 19, 2010, 09:23:05 AM »
Hello,

From what I can see, URLs like aliante_casino/ on your site can be reached only by visiting "jreviews/" pages first. If you are blocking "jreviews" then you block all pages that are linked only from it as well, since the generator won't be able to find them.
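To illustrate with a simplified (hypothetical) link chain:

home page  ->  /jreviews/... listing page  ->  aliante_casino/

With /jreviews/ disallowed in robots.txt, the generator never fetches the middle page, so it never finds the link to aliante_casino/, even though that page itself is not disallowed.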
Re: Sitemap generator only crawling 1/4th of pages
« Reply #2 on: August 20, 2010, 06:22:42 AM »
So Googlebot can crawl them but the generator cannot?  Why is there an "Exclude URLs" field in the generator dashboard if the generator simply follows robots.txt?  I don't get it.  Is there a way for the generator to ignore robots.txt?  I could then add the terms to "Exclude URLs" to come out with the correct sitemap.
Re: Sitemap generator only crawling 1/4th of pages
« Reply #3 on: August 20, 2010, 10:59:06 AM »
Googlebot might see the pages linked from elsewhere, even from external sites. You can use "Exclude URLs" *in addition* to robots.txt, to have only the most important pages in the sitemap.
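Entries in the "Exclude URLs" field are matched as substrings of the URL, one per line, which is why adding "limit:" and "poker_rooms" removed so many pages from your sitemap. As an illustration only, using the strings you already mentioned, something like this would keep those URLs out of the generated sitemap:

jreviews/
limit: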
You can disable following robots.txt instructions in the generator/data/generator.conf file; set this option to "0":
<option name="xs_robotstxt">1</option>
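so that after the change the line reads:
<option name="xs_robotstxt">0</option>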