limit URLs crawled by Sitemap Generator
« on: June 19, 2012, 02:52:29 AM »
Hello,
We are using an enormous amount of server resources each time XML Sitemap generates a sitemap for us. I have configured many URL exclusions, but they do not seem to affect which pages are crawled; they only keep those pages out of the sitemap.

Is there a way to limit the URLs that are actually crawled when the sitemap is being generated?

One of our main problems is that our sitemaps are generating many redundant entries for products because of the Joomla/VirtueMart URL syntax. It appears that a product is indexed first as a single product page, then again as part of a category, and then again as a "manufacturer" search result. The resulting unnecessary URLs look like this:
[ External links are visible to forum administrators only ]
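
Since the real links are hidden above, here is the general shape of the duplicates, using illustrative VirtueMart-style parameters (the IDs are made up, not our actual values):

index.php?option=com_virtuemart&page=shop.product_details&product_id=123
index.php?option=com_virtuemart&page=shop.product_details&product_id=123&category_id=6
index.php?option=com_virtuemart&page=shop.product_details&product_id=123&manufacturer_id=9

All three point at the same product page; only the first one belongs in the sitemap.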

I would like to keep the redundant results from being crawled, but since they are dynamically generated I can't figure out how to limit them with a URL exclusion.

Any ideas?

I know these seem like two different issues, but an answer to either would help us accomplish our objective of decreasing server resource usage.

Thank you for an excellent product!
Re: limit URLs crawled by Sitemap Generator
« Reply #2 on: June 19, 2012, 05:57:02 PM »
Thank you. That does seem obvious...

But this does not limit the pages crawled, does it? Only the pages added to the sitemap? Is there a way to exclude the same set of URLs from being crawled by the Sitemap Generator at all?
Re: limit URLs crawled by Sitemap Generator
« Reply #4 on: August 12, 2012, 11:32:21 PM »
Hi,

I think I may have gone overboard in limiting the crawling, and now no pages are indexed when crawled. I apologize for asking something that is probably very simple, but can you tell me what in my robots.txt file might be causing the XML crawl to end so quickly with no pages indexed?

Here are the parameters for the xml-sitemaps crawler in my robots.txt file:

User-agent: pro1.pro-sitemaps.com
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /resized/
Disallow: /xmlrpc/
Disallow: /index2.php
Disallow: /logs/
Disallow: /*?page=shop.registration
Disallow: /*?page=shop.cart
Disallow: /*?page=shop.ask
Disallow: /*?page=shop.manufacturer_page
Disallow: /*?page=account
Disallow: /*category_id=6&
Disallow: /components/mailto
Disallow: /cgi-bin/
Disallow: /com_fireboard/sources
Disallow: /themes

I have also set up many limits in the "Exclude URLs" configuration settings of our PRO Sitemap Service:

feed
func=userlist
catid=11
option=com_user
func=fb_pdf
do=reply
pop=0
flypage=flypage-ask.tpl
print=
do_pdf=
pop=1
task=emailform
task=trackback
task=rss
component/mailto
page=shop.manufacturer_page
page=account
page=shop.cart
page=shop.ask
/com_fireboard/sources
/themes
option=com_redmystic&view=sitemap
/archives/prods.zip
/archives/prods.xml
/sitemap*
/generator
/ror.xml
/administrator
page=account.favorite_products
page=shop.cart
option=com_user
page=checkout.index
pop=
manufacturer_id=
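
As I understand it, these Exclude URLs entries are matched as substrings of each URL, so if that is correct, a few of the lines above overlap and could be trimmed, for example:

pop=               (already covers pop=0 and pop=1)
page=shop.cart     (listed twice)
option=com_user    (listed twice)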

We are using Joomla and VirtueMart, and the purpose of all of this is to eliminate duplicate entries in our sitemap caused by extra "paths" to the products. It would be greatly appreciated if you could let me know where the conflict is, so that we get a full crawl and index while still eliminating as much redundancy and server resource usage as possible.

Thank you again. We REALLY appreciate your product and service!
Re: limit URLs crawled by Sitemap Generator
« Reply #5 on: August 13, 2012, 12:12:54 PM »
Hello,

please check your site with the search engine bot simulator tool to see if the pages are blocked: https://www.xml-sitemaps.com/se-bot-simulator.html
Note, though, that the generator only uses the entries from the "User-agent: *" and "User-agent: googlebot" sections in robots.txt.
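
For example, if you want the generator to honor the rules you listed, they would need to appear under one of those sections, like this (keep in mind that anything under "User-agent: *" applies to all well-behaved crawlers, not only our generator):

User-agent: *
Disallow: /administrator/
Disallow: /*?page=shop.manufacturer_page
Disallow: /*?page=shop.cart
(...the rest of your Disallow lines...)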