Robots.txt disallow, nofollow links and "Exclude URLs"
« on: July 20, 2014, 03:14:02 AM »
Does the sitemap generator ignore paths set as "Exclude:" in robots.txt, particularly wildcard directives?

I was hoping that leaving these settings at their defaults would simulate a crawl the way Googlebot would perform it; however, I see the sitemap generator crawling and parsing URLs that are excluded in robots.txt.

I'm crawling a Magento site with ~700 products in ~100 categories. Without the robots.txt exclusions, "noindex" meta tags and nofollow links, this quickly turns into tens of thousands of URLs, which I want neither indexed (duplicate content) nor crawled by bots (unnecessary server load).

Code:
Disallow: /products/*?dir*
Disallow: /products/*?dir=asc
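For reference, Googlebot treats `*` in these patterns as "match any sequence of characters". A minimal sketch of that matching logic in Python (my own illustration of the wildcard rules, not the sitemap generator's actual code):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a Googlebot-style robots.txt path pattern to a regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match at the end of the URL path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/products/*?dir*")
print(bool(rule.match("/products/widgets?dir=asc")))  # True: blocked
print(bool(rule.match("/products/widgets")))          # False: allowed
```

By this logic, a URL like `/products/widgets?dir=asc` should never be fetched, which is what I expected the generator to honour.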

but I still see URLs matching those patterns in the debug log.
I have robots.txt support turned on (full config attached):
Code:
<option name="xs_robotstxt">1</option>

The sitemap generator actually chokes on this and needs to be restarted multiple times:
Code:
Continue the interrupted session
Updated on 2014-07-20 01:48:42, Time elapsed: 2:02:04,
Pages crawled: 7911 (4481 added in sitemap), Queued: 16, Depth level: 8
Current page: (1.1)
« Last Edit: July 20, 2014, 03:25:59 AM by info2289 »
Re: Robots.txt disallow, nofollow links and "Exclude URLs"
« Reply #1 on: July 20, 2014, 05:14:53 AM »

Actually, I'm not even sure it is honouring any Disallow directive in robots.txt:
Code:
Disallow: /products/customer/
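For a plain prefix rule like this one (no wildcards), Python's standard-library robots.txt parser gives a quick sanity check of what a standards-compliant crawler should do. A minimal sketch, using a hypothetical example.com URL:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the rules directly instead of fetching a live robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /products/customer/",
])

print(rp.can_fetch("*", "http://example.com/products/customer/account"))  # False
print(rp.can_fetch("*", "http://example.com/products/widgets"))           # True
```

Anything under `/products/customer/` should be skipped, yet the generator still crawls those URLs.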