Robots.txt
« on: November 22, 2005, 05:46:24 PM »
I was under the impression that the standalone version honors the robots.txt file; however, it is crawling pages that are disallowed in the robots.txt file in the root of my localhost server (iMac G5, Mac OS X 10.3). I know the crawler is finding the file, as I am not seeing any errors in my Apache error log.

Any ideas?
Re: Robots.txt
« Reply #1 on: November 22, 2005, 05:50:29 PM »
Hello,

Please post the contents of your robots.txt and an example URL that matches a Disallow directive but is still included in the sitemap.
Re: Robots.txt
« Reply #2 on: November 22, 2005, 06:30:07 PM »
robots.txt
Code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/large/
Disallow: /index.php?main_page=advanced_search
Disallow: /index.php?main_page=login
Disallow: /index.php?main_page=logoff
Disallow: /index.php?main_page=product_reviews_write
Disallow: /index.php?main_page=redirect
Disallow: /index.php?main_page=shopping_cart
Disallow: /index.php?main_page=tell_a_friend
Disallow: /index.php?main_page=create_account
Disallow: /index.php?main_page=checkout_shipping
Disallow: /index.php?main_page=password_forgotten
Disallow: /index.php?main_page=images
Disallow: /index.php?main_page=ask_a_question
Disallow: /index.php?main_page=product_reviews
Disallow: /index.php?main_page=address_book
Disallow: /index.php?main_page=account_notifications

Sitemap.xml
Code:
<url>
  <loc>http://bob.local/harborfare/index.php?main_page=login</loc>
  <priority>0.5</priority>
  <lastmod>2005-11-22T02:46:19+00:00</lastmod>
  <changefreq>monthly</changefreq>
</url>
<url>
  <loc>http://bob.local/harborfare/index.php?main_page=logoff</loc>
  <priority>0.5</priority>
  <lastmod>2005-11-22T02:46:19+00:00</lastmod>
  <changefreq>monthly</changefreq>
</url>
<url>
  <loc>http://bob.local/harborfare/index.php?main_page=shopping_cart</loc>
  <priority>0.5</priority>
  <lastmod>2005-11-22T02:46:19+00:00</lastmod>
  <changefreq>monthly</changefreq>
</url>
Re: Robots.txt
« Reply #3 on: November 22, 2005, 10:46:57 PM »
Hi,

The robots.txt file resides at your domain root, and its Disallow values are matched against the URL path starting from that root.
Given the robots.txt contents you posted, it disallows URLs whose path begins with /index.php?main_page=login and NOT URLs whose path begins with /harborfare/index.php?main_page=login, which is where your sitemap entries live.
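To illustrate the prefix matching described above, here is a minimal Python sketch. It is not the sitemap generator's actual code; the helper names `disallowed_prefixes` and `is_blocked` are made up for this example, and it assumes a single "User-agent: *" group with plain prefix rules (no wildcards or Allow lines).

```python
from urllib.parse import urlsplit

# Two of the rules from the robots.txt posted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /index.php?main_page=login
"""


def disallowed_prefixes(robots_txt):
    """Collect Disallow values (hypothetical helper, single group assumed)."""
    prefixes = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def is_blocked(url, prefixes):
    """Return True if the URL's path (plus query) starts with any prefix."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(target.startswith(prefix) for prefix in prefixes)


prefixes = disallowed_prefixes(ROBOTS_TXT)

# A page at the domain root matches the rule.
root_url = "http://bob.local/index.php?main_page=login"
# The same page under /harborfare/ does not, because the path starts differently.
shop_url = "http://bob.local/harborfare/index.php?main_page=login"

print(is_blocked(root_url, prefixes))
print(is_blocked(shop_url, prefixes))

# A rule that includes the subdirectory would match the sitemap URL.
print(is_blocked(shop_url, ["/harborfare/index.php?main_page=login"]))
```

Under these assumptions, the rules would need the /harborfare prefix (for example "Disallow: /harborfare/index.php?main_page=login") to cover the URLs that ended up in the sitemap.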