Trouble crawling our site
« on: February 19, 2007, 03:55:55 PM »
Hello-

First off, thanks for making your tool available online. We are looking forward to using it with Google!

I am trying to get it to index our site ([ External links are visible to forum administrators only ]), but for some reason it stops after reading the home page. When I view the page in the simulator, it sees the entire page and extracts all the links. But when I try to generate a sitemap, I see this in the log files:

85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET / HTTP/1.0" 301 245
85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET /home/ HTTP/1.0" 200 20716
85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET /robots.txt HTTP/1.0" 404 6709
85.92.68.99 - - [19/Feb/2007:10:56:09 -0500] "GET /home/robots.txt HTTP/1.0" 200 20716
85.92.68.99 - - [19/Feb/2007:10:56:09 -0500] "GET /home/ HTTP/1.0" 200 20716

And that's it.  The resulting sitemap just contains the homepage URL.

Any help would be greatly appreciated.

Thanks again,

Greg
Re: Trouble crawling our site
« Reply #1 on: February 19, 2007, 09:34:18 PM »
The problem is that your site redirects to [ External links are visible to forum administrators only ] and our generator doesn't follow redirects.

If you set your start page to [ External links are visible to forum administrators only ], you will still get only that page, because there are no other pages in that directory to index.

All your links point to pages in other directories, which wouldn't be considered to be within the domain root of [ External links are visible to forum administrators only ].
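
For anyone wanting to confirm this from their own logs, here is a minimal sketch that scans access-log lines for requests answered with a 3xx status. It assumes the log uses the Common Log Format, as the lines posted above appear to. The name find_redirects is made up for this example and is not part of the generator.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def find_redirects(log_lines):
    """Return (path, status) pairs for requests answered with a 3xx status."""
    redirects = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status").startswith("3"):
            redirects.append((match.group("path"), int(match.group("status"))))
    return redirects


# The first three log lines from the original post.
log = [
    '85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET / HTTP/1.0" 301 245',
    '85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET /home/ HTTP/1.0" 200 20716',
    '85.92.68.99 - - [19/Feb/2007:10:56:08 -0500] "GET /robots.txt HTTP/1.0" 404 6709',
]

print(find_redirects(log))  # [('/', 301)]
```

The 301 on "/" is the redirect in question: the request for the site root is answered with a redirect rather than with a page.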


Philip Nicosia
Re: Trouble crawling our site
« Reply #2 on: February 19, 2007, 09:48:55 PM »
Hmm... it seems fairly common to have different paths for different site components. Why does it consider "/home" to be part of the DOMAIN when it's really part of the PATH?
Re: Trouble crawling our site
« Reply #3 on: February 19, 2007, 11:20:32 PM »
It’s simply because of what Google expects to see and what it will accept in a sitemap.

The script treats home as the root of the site because that is your start page. If your start page were [ External links are visible to forum administrators only ], it wouldn’t be an issue. Google won’t accept any URLs that are not within the home path, so they are not included.

The script has been designed to work exactly this way and to adhere strictly to paths, so that it never includes URLs that would cause Google to reject your sitemap.

Unfortunately, that doesn’t help you with the way your site is set up.
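
To illustrate the path rule described above, here is a rough sketch of a scope check. It is not the generator's actual code. The function name in_scope and the www.example.com URLs are placeholders, and the rule is assumed to be "same scheme and host, and the path starts with the directory of the start page".

```python
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url):
    """Return True if candidate_url lies under the directory of start_url."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)

    # Directory of the start page: everything up to and including the last "/".
    start_path = start.path or "/"
    prefix = start_path[: start_path.rfind("/") + 1]

    same_site = (start.scheme, start.netloc.lower()) == (
        candidate.scheme,
        candidate.netloc.lower(),
    )
    return same_site and (candidate.path or "/").startswith(prefix)


# Start page under /home/: links into other directories fall outside the scope.
print(in_scope("http://www.example.com/home/", "http://www.example.com/home/about.html"))     # True
print(in_scope("http://www.example.com/home/", "http://www.example.com/products/list.html"))  # False

# Start page at the site root: every path on the host is in scope.
print(in_scope("http://www.example.com/", "http://www.example.com/products/list.html"))       # True
```

Under this assumption, a start page of /home/ limits the crawl to URLs beginning with /home/, which matches the single-page sitemap reported in the first post.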
Philip Nicosia
Re: Trouble crawling our site
« Reply #4 on: February 20, 2007, 07:00:30 PM »
Thank you very much for your help. We are going to make some changes to how our site's URLs work and then try this again.