XML Sitemaps Generator

Topic: Robots.txt disallow, nofollow links and "Exclude URLs"  (Read 5610 times)

info2289

  • Registered Customer
  • Approved member
  • *
  • Posts: 2
Robots.txt disallow, nofollow links and "Exclude URLs"
« on: July 20, 2014, 02:14:02 AM »
Does the sitemap generator skip paths that are disallowed ("Disallow:") in robots.txt, particularly wildcard directives?

I was hoping that leaving these settings at their defaults would simulate a crawl the way Googlebot does it. However, I see the sitemap generator crawling and parsing URLs that are disallowed in robots.txt.

I'm crawling a Magento site with ~700 products in ~100 categories. Without robots.txt Disallow rules, "noindex" meta tags and nofollow links, this quickly turns into tens of thousands of URLs, which I want neither indexed (duplicate content) nor crawled by the bots (unnecessary server load).

e.g., in robots.txt:
Code:
Disallow: /products/*?dir*
Disallow: /products/*?dir=asc

but I still see entries in the debug log like:
Code:
((include https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price))
I have robots.txt turned on (full config attached):
Code:
<option name="xs_robotstxt">1</option>
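
As a sanity check on the expectation above, here is a minimal Python sketch of Googlebot-style wildcard matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It is not the sitemap generator's own logic, which is not shown in this thread; the helper names `rule_to_regex` and `is_disallowed` are hypothetical and exist only for illustration.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one Disallow pattern into a regex (assumed Google-style rules)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url: str, rules: Iterable[str]) -> bool:
    """Return True if the URL's path plus query string matches any rule."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, so each rule behaves as a prefix pattern.
    # Empty rules are skipped because an empty Disallow blocks nothing.
    return any(rule_to_regex(r).match(target) for r in rules if r)


rules = ["/products/*?dir*", "/products/*?dir=asc"]
logged = "https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price"
print(is_disallowed(logged, rules))
print(is_disallowed("https://domain.com/products/horseshoes-aluminium.html", rules))
```

Under these assumptions the logged URL matches both rules, so a crawler that honours wildcards would be expected to skip it. Note that both patterns require the query string to begin with `?dir`, so a URL where `dir` is not the first parameter would not be caught by them.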

The sitemap generator actually chokes on this crawl and has to be restarted multiple times.
Code:
Continue the interrupted session
Updated on 2014-07-20 01:48:42, Time elapsed: 2:02:04,
Pages crawled: 7911 (4481 added in sitemap), Queued: 16, Depth level: 8
Current page: https://domain.com/products/horseshoes-glue-on/sigafoos.html?dir=asc&limit=9&order=name&p=11 (1.1)
« Last Edit: July 20, 2014, 02:25:59 AM by info2289 »

info2289

Re: Robots.txt disallow, nofollow links and "Exclude URLs"
« Reply #1 on: July 20, 2014, 04:14:53 AM »

Actually, I'm not even sure it is honouring any Disallow directive in robots.txt:
Code:
Disallow: /products/customer/
debug.log
Code:
((include https://domain.com/products/customer/account/forgotpassword/))
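
For a plain prefix rule like this one, the expected behaviour can be checked independently with Python's standard `urllib.robotparser`. This is only a sketch: the `User-agent: *` line is an assumption (the post shows just the Disallow line), and this parser does simple prefix matching, so it is not suitable for the wildcard rules from the first post.

```python
from urllib.robotparser import RobotFileParser

robots_lines = [
    "User-agent: *",  # assumed; only the Disallow line appears in the post
    "Disallow: /products/customer/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# URL taken from the debug.log entry above.
logged_url = "https://domain.com/products/customer/account/forgotpassword/"
logged_allowed = parser.can_fetch("*", logged_url)

# A product page outside the disallowed prefix, for comparison.
product_allowed = parser.can_fetch(
    "*", "https://domain.com/products/horseshoes-aluminium.html"
)

print("logged URL allowed:", logged_allowed)
print("product page allowed:", product_allowed)
```

If the first result is reported as not allowed, a crawler honouring this rule should not be including that URL.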

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 10621
Re: Robots.txt disallow, nofollow links and "Exclude URLs"
« Reply #2 on: July 21, 2014, 04:36:56 AM »
Hello,

Please send me your generator URL and login in a private message so that I can check this.
Oleg Ignatiuk
www.xml-sitemaps.com