Robots.txt disallow, nofollow links and "Exclude URLs"
« on: July 20, 2014, 03:14:02 AM »
Does the sitemap generator ignore paths set as "Exclude:" in robots.txt, particularly wildcard directives?

I was hoping that leaving these settings at their defaults would simulate a crawl the way Googlebot would perform it; however, I see the sitemap generator crawling and parsing URLs that are excluded in robots.txt.

I'm crawling a Magento site with ~700 products in ~100 categories. Without the robots.txt exclusions, "noindex" meta tags and nofollow links, this quickly turns into tens of thousands of URLs, which I want neither indexed (duplicate content) nor crawled by bots (unnecessary server load).

Code:
Disallow: /products/*?dir*
Disallow: /products/*?dir=asc
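For reference, Googlebot treats `*` in these patterns as "match any sequence of characters". A minimal sketch of that matching logic in Python (my own illustration of the wildcard rules, not the sitemap generator's actual code):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a Googlebot-style robots.txt path pattern to a regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match at the end of the URL path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/products/*?dir*")
print(bool(rule.match("/products/widgets?dir=asc")))  # True: blocked
print(bool(rule.match("/products/widgets")))          # False: allowed
```

By this logic, a URL like `/products/widgets?dir=asc` should never be fetched, which is what I expected the generator to honour.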

but I still see URLs matching those patterns in the debug log.
I have robots.txt support turned on (full config attached):
Code:
<option name="xs_robotstxt">1</option>

The sitemap generator actually chokes on this and needs to be restarted multiple times:
Code:
Continue the interrupted session
Updated on 2014-07-20 01:48:42, Time elapsed: 2:02:04,
Pages crawled: 7911 (4481 added in sitemap), Queued: 16, Depth level: 8
Current page: (1.1)
« Last Edit: July 20, 2014, 03:25:59 AM by info2289 »
Re: Robots.txt disallow, nofollow links and "Exclude URLs"
« Reply #1 on: July 20, 2014, 05:14:53 AM »

Actually, I'm not even sure it is honouring any Disallow directive in robots.txt:
Code:
Disallow: /products/customer/
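For a plain prefix rule like this one (no wildcards), Python's standard-library robots.txt parser gives a quick sanity check of what a standards-compliant crawler should do. A minimal sketch, using a hypothetical example.com URL:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the rules directly instead of fetching a live robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /products/customer/",
])

print(rp.can_fetch("*", "http://example.com/products/customer/account"))  # False
print(rp.can_fetch("*", "http://example.com/products/widgets"))           # True
```

Anything under `/products/customer/` should be skipped, yet the generator still crawls those URLs.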