How to exclude pages from being indexed by the generator
« on: June 16, 2006, 06:32:32 PM »
Hi,

The sitemap generator has been running for more than 1700 minutes and has already scanned 150,000+ pages on my client's site. I have no idea how long it will take to finish crawling the site...
I need to prevent the generator from indexing some of the dynamic pages. So here are my questions:

1. How can I use the "Do not parse URLs" configuration setting to exclude dynamic pages with certain query string parameters from being scanned?
e.g. I have a dynamic page called forum_post.asp whose URL takes 4 different query string parameters, and the content is generated dynamically based on their values: TID = thread id, PN = page number, GET = can't recall what it's for, and TPN = can't recall what it's for (e.g. forum_post.asp?TID=1&PN=4&GET=last&TPN=5). How do I tell the generator to ignore the GET and TPN query string parameters to reduce the number of scanned pages? There isn't enough information or enough examples covering "Do not parse URLs" or the "Exclude URLs:" feature... It would be nice if you could provide a few examples.

2. How do I tell the generator to ignore a certain query string parameter sitewide? For example, I want the generator to ignore the 'SortCol' parameter on all the dynamic pages in my website, so that I don't have to specify an exclusion rule for each page.


The generator is very powerful, but it gives me a headache because I can't figure out how to make it ignore URLs that do not contain unique links to other pages... thanks for the help!

eddie
Re: How to exclude pages from being indexed by the generator
« Reply #1 on: June 19, 2006, 07:10:30 PM »
Hello guys, can the admin or the developers of this tool respond to this post? I really need to see more examples of how to keep a query string parameter from being recognized as a new unique URL. Thanks in advance!
Re: How to exclude pages from being indexed by the generator
« Reply #2 on: June 20, 2006, 04:06:17 PM »
Hello,

Sorry for the delay.
1. As described in the sitemap generator manual:
Quote
"Do not parse URLs" works ... to increase the speed of sitemap generation. If you are sure that some pages at your site do not contain the unique links to other pages, you can tell generator not to fetch them.
For instance, if your site has "view article" pages with urls like "viewarticle.php?..", you may want to add them here, because most likely all links inside these pages are already listed at "higher level" (like the list of articles) documents as well:
    * viewarticle.php?id=
If you are not sure what to write here, just leave this field empty. Please note that these pages are still included into sitemap.
In your case, you can use the following entries in "Do not parse URLs":
Quote
GET=
TPN=
As a result, ALL pages that have "GET=" OR "TPN=" in the URL will not be requested from your site.
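
To illustrate the matching described above, here is a minimal Python sketch. It assumes each entry is compared as a plain substring against the full URL, which is what "have ... in the URL" suggests; the generator's own matching code is not shown in this thread, and `matches_any` is a hypothetical helper, not part of the tool.

```python
def matches_any(url, patterns):
    """Return True if any pattern occurs in the URL as a plain substring.

    Assumption: the generator treats each "Do not parse URLs" entry
    as a simple substring, not as a regular expression or wildcard.
    """
    return any(pattern in url for pattern in patterns)


# Entries suggested in the reply above.
do_not_parse = ["GET=", "TPN="]

# Sample URLs modelled on the forum_post.asp example from the question.
urls = [
    "forum_post.asp?TID=1&PN=4&GET=last&TPN=5",
    "forum_post.asp?TID=1&PN=4",
    "forum_post.asp?TID=1",
]

# URLs the crawler would skip fetching versus URLs it would still request.
skipped = [u for u in urls if matches_any(u, do_not_parse)]
fetched = [u for u in urls if not matches_any(u, do_not_parse)]

for u in skipped:
    print("not fetched:", u)
for u in fetched:
    print("fetched:    ", u)
```

If matching really is by substring, short entries can catch more than intended: "GET=" would also match a parameter named "WIDGET=", and an entry of "PN=" would match every URL containing "TPN=". Choosing the most specific string available, such as "&GET=", may reduce accidental matches.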

2. Similar to the above: just add the following to "Do not parse URLs" (and probably to "Exclude URLs" as well):
Quote
SortCol=
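
The difference between the two settings can be sketched in Python as follows. The "Do not parse URLs" behaviour (page is not fetched but is still listed in the sitemap) comes from the manual quote above. The "Exclude URLs" behaviour (page is left out of the sitemap entirely) is an assumption based on the option's name, as are the substring matching and the `classify` helper, which is hypothetical and not part of the generator.

```python
def classify(url, do_not_parse, exclude):
    """Describe how a URL would be handled under the two settings.

    Assumptions: entries are plain substrings, and "Exclude URLs"
    takes priority and drops the URL from the sitemap entirely.
    """
    if any(pattern in url for pattern in exclude):
        return "excluded"
    if any(pattern in url for pattern in do_not_parse):
        return "listed, not fetched"
    return "fetched and listed"


sorted_url = "products.asp?cat=3&SortCol=price"
plain_url = "products.asp?cat=3"

# Only "Do not parse URLs" is set: the sorted page stays in the sitemap.
print(classify(sorted_url, ["SortCol="], []))

# "Exclude URLs" is set too: the sorted page is dropped from the sitemap.
print(classify(sorted_url, ["SortCol="], ["SortCol="]))

# A URL without the parameter is crawled normally either way.
print(classify(plain_url, ["SortCol="], ["SortCol="]))
```

Under these assumptions, a single "SortCol=" entry applies to every page on the site, so no per-page rule is needed.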