robot text file not excluding generator from indexing
« on: February 09, 2008, 05:53:34 AM »
I am trying to use the robots.txt file to exclude certain pages from being indexed by Googlebot.

At the moment I am getting a number of links like this:
/option,com_netinvoice/action,orders/task,order/cid,1/Itemid,170.html
and like this:
/index.php?option=com_content&task=view&id=17&Itemid=1

As there are so many of these in Google's index of my site :( , I am using the Disallow directive in this format:
Disallow: /*.html
Disallow: /*.php
Disallow: /*itemid
Disallow: /*Itemid
And then I use the Allow directive to allow the 15 or so links that are important.

It works in that the links I want are in my sitemap :) , but the ones I don't want are still there too ??? How come my Disallow: /*.html didn't stop this: /option,com_netinvoice/action,orders/task,order/cid,1/Itemid,170.html

And why didn't Disallow: /*Itemid and Disallow: /*.php stop /index.php?option=com_content&task=view&id=17&Itemid=1

Even though these links that I don't want are in my sitemap, will Googlebot still treat them as disallowed? And will this approach of disallowing everything with Disallow: /*.html and then letting my own links through with Allow cause me problems in some way?

Any thoughts would be really great  ;)
Re: robot text file not excluding generator from indexing
« Reply #1 on: February 10, 2008, 01:13:19 AM »
Hello,

Not every crawler supports wildcards in robots.txt, which is why those URLs are still included in the sitemap. Google will NOT index those pages even if they are included in the sitemap though, since Google supports wildcards and you excluded them in robots.txt (you might want to check this with the "Analyze robots.txt" tool in your Google webmaster account to make sure they are excluded).
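
To illustrate, here is a rough Python sketch of how Google-style wildcard matching would treat the rules from the first post. It is only an approximation, not Google's actual code: the function names are made up for this example, the /contact.html Allow rule is a hypothetical stand-in for one of the "important" links, and the precedence rule (longest matching pattern wins, ties go to Allow) is taken from Google's robots.txt documentation.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally and case-sensitively.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if url_path may be crawled under the given rules.

    rules is a list of (directive, pattern) pairs. The longest matching
    pattern wins; on a tie, Allow beats Disallow. No match means allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        # re.match anchors at the start of the path, like robots.txt rules
        if rule_to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Rules from the first post, plus one hypothetical Allow line
rules = [
    ("Disallow", "/*.html"),
    ("Disallow", "/*.php"),
    ("Disallow", "/*itemid"),
    ("Disallow", "/*Itemid"),
    ("Allow", "/contact.html"),  # hypothetical "important" link
]

blocked_html = "/option,com_netinvoice/action,orders/task,order/cid,1/Itemid,170.html"
blocked_php = "/index.php?option=com_content&task=view&id=17&Itemid=1"

print(is_allowed(blocked_html, rules))     # False
print(is_allowed(blocked_php, rules))      # False
print(is_allowed("/contact.html", rules))  # True
```

A crawler that does not understand wildcards would read /*.html as a literal path prefix, so none of these Disallow lines would match anything for it.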
Re: robot text file not excluding generator from indexing
« Reply #2 on: February 10, 2008, 04:18:45 AM »
Thanks for the reply. I checked and yes, they are excluded by Google, even though they are in my sitemap.  ;)
So to get them out of my sitemap, I would put Itemid in my 'do not parse urls' and 'exclude urls' configuration settings, right?
And to confirm: wildcards in the robots.txt file are not supported by xml-sitemaps, and we need to use the config settings instead, right?
Re: robot text file not excluding generator from indexing
« Reply #3 on: February 10, 2008, 10:43:33 PM »
Quote
So to get them out of my sitemap, I would put Itemid in my 'do not parse urls' and 'exclude urls' configuration settings, right?
Yes. It should be added to both the "Do not parse" and "Exclude URLs" options, though.
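
As a rough picture of what an exclusion entry like Itemid does, here is a small Python sketch. It assumes the option works as a plain, case-sensitive substring match against each URL; that is an assumption for illustration, and filter_urls is a made-up helper, not the generator's real code.

```python
def filter_urls(urls, exclude_substrings):
    """Keep only URLs that contain none of the excluded substrings.

    Assumption: exclusion is a plain, case-sensitive substring test.
    """
    return [
        url for url in urls
        if not any(fragment in url for fragment in exclude_substrings)
    ]

# Two URLs from the first post plus one hypothetical clean URL
crawled = [
    "/option,com_netinvoice/action,orders/task,order/cid,1/Itemid,170.html",
    "/index.php?option=com_content&task=view&id=17&Itemid=1",
    "/contact.html",  # hypothetical
]

kept = filter_urls(crawled, ["Itemid"])
print(kept)  # ['/contact.html']
```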