 

Robots.txt disallow, nofollow links and "Exclude URLs"

Started by info2289, July 20, 2014, 03:14:02 AM


info2289

Does the sitemap generator skip paths listed under "Disallow:" in robots.txt, particularly wildcard directives?

I was hoping that leaving these settings at their defaults would simulate a crawl the way Googlebot would; however, I see the sitemap generator crawling and parsing URLs that are disallowed in robots.txt.

I'm crawling a Magento site with ~700 products in ~100 categories. Without robots.txt disallows, "noindex" meta tags, and nofollow links, this quickly turns into tens of thousands of URLs, which I neither want indexed (duplicate content) nor crawled by bots (unnecessary server load).
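For a rough sense of where those numbers come from, here is a small illustration of how the sort/pagination parameters multiply the URL count (the parameter counts below are assumptions chosen for the example, not exact figures from my catalog):

# Illustration only: assumed parameter counts, not exact figures from the catalog.
category_pages = 100     # category listing pages
sort_directions = 2      # ?dir=asc / ?dir=desc
sort_fields = 3          # ?order=name / price / position
page_sizes = 3           # ?limit=9 / 18 / 36
pagination_pages = 5     # ?p=1 .. ?p=5 per listing, on average

variants_per_listing = sort_directions * sort_fields * page_sizes * pagination_pages
print(variants_per_listing)                    # 90 URL variants per category listing
print(category_pages * variants_per_listing)   # 9000 crawlable URLs for the same 100 listings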

e.g., robots.txt contains:

Disallow: /products/*?dir*
Disallow: /products/*?dir=asc


but I still see entries in the debug log like:
((include https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price))

I have robots.txt support turned on (full config attached):
<option name="xs_robotstxt">1</option>
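For what it's worth, here is a minimal Python sketch (not the generator's own code) of Googlebot-style wildcard matching for Disallow rules. It ignores Allow rules and longest-match precedence, so it is only an illustration, but by these semantics the URL above should be skipped:

import re
from urllib.parse import urlsplit

def disallow_to_regex(rule):
    # Googlebot-style semantics: '*' matches any run of characters,
    # a trailing '$' anchors the match to the end of the URL.
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + pattern + ("$" if anchored else ""))

def is_disallowed(url, rules):
    parts = urlsplit(url)
    # Disallow rules are matched against the path plus the query string.
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(disallow_to_regex(r).match(target) for r in rules)

rules = ["/products/*?dir*", "/products/*?dir=asc"]
url = "https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price"
print(is_disallowed(url, rules))   # True -- both rules match, so the crawler should skip it

So unless the generator only supports plain prefix Disallow rules and drops the wildcards, that URL should not be showing up in the debug log.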


The sitemap generator actually just chokes on this and needs to be restarted multiple times. Status when continuing the interrupted session:
Updated on 2014-07-20 01:48:42, Time elapsed: 2:02:04,
Pages crawled: 7911 (4481 added in sitemap), Queued: 16, Depth level: 8
Current page: https://domain.com/products/horseshoes-glue-on/sigafoos.html?dir=asc&limit=9&order=name&p=11 (1.1)

info2289


Actually, I'm not even sure it is honouring any Disallow directive in robots.txt:
Disallow: /products/customer/

debug.log
((include https://domain.com/products/customer/account/forgotpassword/))
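Even plain prefix matching, with no wildcards at all, should catch this one. A one-line check along the lines of the sketch above (again just an illustration, not the generator's code):

from urllib.parse import urlsplit

rule = "/products/customer/"
url = "https://domain.com/products/customer/account/forgotpassword/"
# A non-wildcard Disallow rule is a simple prefix match on the URL path.
print(urlsplit(url).path.startswith(rule))   # True -- this URL should not be crawled either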