Generator including excluded querystrings - bug?
« on: March 24, 2010, 03:46:57 AM »
Not sure if this is a bug or a mis-configuration, but I can't figure it out nor find any similar posts in the forums:

Our website uses a bunch of querystrings for passing data between pages. These querystrings don't change the content of the page in any significant way and hence shouldn't be included in our sitemap. They are not actually session ID querystrings, but you can think of them as being very similar.

To prevent the Generator from including these querystrings in its crawl / generated sitemap, I've added them to the "Remove session ID from URLs" field. This is generally working as expected, but one specific querystring is still being included in the crawl / generated sitemap URLs, even though it is listed in the "Remove session ID from URLs" field.

Specifically, the querystring is called "q" and always contains a GUID, e.g.:

[ External links are visible to forum administrators only ]

The above URL is being included in the generated sitemap as listed above, whereas I believe the presence of "q" in the "Remove session ID from URLs" field should result in this URL being listed instead:

[ External links are visible to forum administrators only ]


Is that correct? Have I misunderstood or misconfigured something? Or is this a bug?

Thanks.
Re: Generator including excluded querystrings - bug?
« Reply #1 on: March 24, 2010, 05:14:22 PM »
Hello,

Since your GUID includes dashes, it's not treated as a "session ID". You can define this setting manually in the data/generator.conf file:
Code:
<option name="xs_cleanurls">#\bq=[\-a-z0-9]+#i</option>
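
To see what that pattern picks up, here is a minimal Python sketch of the same regular expression. It assumes the leading and trailing # characters are PHP-style delimiters and the trailing i means case-insensitive; the URLs and the helper name matched_part are made up for illustration and say nothing about how the generator applies the match internally.

```python
import re

# Python equivalent of the pattern \bq=[\-a-z0-9]+ with the "i" flag.
# Assumption: the # characters in the config value are regex delimiters.
CLEAN_Q = re.compile(r"\bq=[\-a-z0-9]+", re.IGNORECASE)


def matched_part(url):
    """Return the substring the pattern matches, or None if nothing matches."""
    m = CLEAN_Q.search(url)
    return m.group(0) if m else None


# Hypothetical URLs, for illustration only.
guid_url = "http://www.example.com/page?q=3F2504E0-4F89-11D3-9A0C-0305E82C3301&lang=en"
other_url = "http://www.example.com/page?faq=12345"

print(matched_part(guid_url))   # the whole q=<GUID> pair, dashes included
print(matched_part(other_url))  # None: \b keeps "faq=" from matching
```

The word boundary (\b) is what stops the pattern from touching other parameters whose names merely end in "q".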
Re: Generator including excluded querystrings - bug?
« Reply #2 on: March 25, 2010, 01:53:23 AM »
Thanks for the quick response.

I've just tested your solution and it appears to work fine - we ended up modding the regex to strip all querystrings which is what we were trying to achieve anyway.

One further quick question: Is there a reason why dashes prevent a querystring parameter from being treated as a session ID? Seems like this shouldn't be a barrier.

Cheers.
Re: Generator including excluded querystrings - bug?
« Reply #3 on: March 25, 2010, 09:07:07 PM »
Usually a session ID is either a number or a hash (e.g. MD5), and both contain only alphanumeric characters. If other characters are present, the parameter is left untouched to avoid modifying other (valid) parameters with the same name.
Re: Generator including excluded querystrings - bug?
« Reply #4 on: March 26, 2010, 02:19:56 AM »
Can I suggest a feature then: a field used to specify a list of querystrings to be excluded completely from crawling / the generated sitemap? That is, the sitemap generator would strip those querystrings out of the URL completely before crawling / adding to the sitemap.

As I said in my original post, we have lots of querystring parameters that should be ignored when generating a sitemap (or when being indexed by a search engine - but we use the Canonical Link tag to fix that).

We thought the Session IDs field would perform a similar function to "strip these querystrings", but it sounds like it's not quite designed to do that. We can strip out querystrings using the xs_cleanurls parameter you posted about above, which is easy enough if we want to remove ALL querystrings (\?-.*), but gets very tricky if we only want to strip out some querystrings.
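
To make the request concrete, here is a rough Python sketch of the behaviour I have in mind: drop only the named parameters and leave everything else in the URL alone. This is not how the generator works today; the names EXCLUDED_PARAMS and strip_params and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names to drop; not a generator setting.
EXCLUDED_PARAMS = {"q", "ref"}


def strip_params(url, excluded=EXCLUDED_PARAMS):
    """Return url with every query parameter named in `excluded` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in excluded
    ]
    # Rebuild the URL; an empty query drops the "?" entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs, for illustration only.
print(strip_params("http://www.example.com/page?q=3f2504e0-4f89-11d3-9a0c-0305e82c3301&id=7"))
print(strip_params("http://www.example.com/page?q=abc"))
```

Parsing the querystring instead of using a regex avoids the tricky cases (parameter first vs. last, leftover "?" or "&") that make selective stripping awkward with xs_cleanurls.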

Thanks for your help.
Re: Generator including excluded querystrings - bug?
« Reply #6 on: March 29, 2010, 03:22:20 AM »
Cool, thanks again.