Exclude based on language part of URL

Started by d.d, January 16, 2013, 10:22:41 AM

d.d

Apologies if this has been covered before - I've searched without any joy.

I have a site with multiple language parts:
mysite.com/en  for English
mysite.com/fr  for French
mysite.com/de  for German

etc.

I want to set up the sitemap generator to scan and include URLs from the English part of the site (mysite.com/en) only.

The English part of the site has links to other language parts of the site, but I don't want to include or follow them.

I do not want to scan other parts of the site for links to English pages.
I do not want any links outside the /en part of the site to appear in the sitemap.

If I put "en" in "Include only URLs" and "Parse only URLs" it still seems to follow some links outside the /en part of the site.

I'm guessing this is because "en" is quite a common pair of letters to find in other parts of a URL.

If I put "en/" or "/en" in "Include only URLs" and "Parse only URLs" it fails to scan fully.

Is there a way to parse and include only URLs which start with "/en"?
(Perhaps by putting a regular expression in those fields?)

Thanks in advance for any help you can offer.
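
For anyone reading later, here is a minimal Python sketch of why a plain substring filter can let other language sections through. It assumes the "Include only URLs" field does a simple substring test, which is only my guess at how the generator behaves. The function names and the sample URLs under mysite.com are made up for illustration.

```python
from urllib.parse import urlparse


def substring_match(url, needle):
    """Assumed behaviour of a plain substring filter (a guess, not the generator's code)."""
    return needle in url


def path_prefix_match(url, prefix="/en"):
    """True only when the URL path is exactly the prefix or sits below it."""
    path = urlparse(url).path
    return path == prefix or path.startswith(prefix + "/")


# Hypothetical sample URLs, invented for this example.
urls = [
    "http://mysite.com/en/about_us",
    "http://mysite.com/fr/events",   # "events" contains "en"
    "http://mysite.com/fr/entry",    # contains "/en"
    "http://mysite.com/de/garden/",  # contains "en/"
]

for url in urls:
    print(url, substring_match(url, "en"), path_prefix_match(url))
```

In this sketch all four URLs pass the substring test for "en", while only the first passes the path-prefix test.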

XML-Sitemaps Support

Hello,

Please try specifying [ External links are visible to logged in users only ] (with a trailing slash) as the Starting URL.

d.d

Thanks very much - that worked just fine.

To be clear:

  • I set the starting URL as [ External links are visible to forum administrators only ]
  • I emptied the "Exclude URLs", "Do not parse URLs", "Include ONLY URLs" and "Parse ONLY URLs" fields

I had to adjust my .htaccess file. URLs such as /en on my site are virtual (a central URL handler serves them), and for SEO purposes I had been stripping the final '/'.
I changed the .htaccess file to allow 'mysite.com/en/' (and force the trailing '/' there), while still removing the trailing '/' from lower-level URLs (mysite.com/en/about_us).

Extract from .htaccess shown below. YMMV.


# force trailing '/' on bare language specifier URL (eg [ External links are visible to forum administrators only ])
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2})$ $1/ [R=301,L]

# remove trailing slash on lower-level URLs (eg [ External links are visible to forum administrators only ])
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2}/.+)/$ $1 [R=301,L]
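
A rough way to sanity-check the two patterns outside Apache is to run them through Python's `re` module. This is only a sketch: it assumes the path is matched without its leading slash (as in a per-directory .htaccess), it ignores the RewriteCond lines (GET only, not an existing file or directory), and the helper name `redirect_target` is made up for the example.

```python
import re

# The two patterns from the RewriteRule lines, copied as-is.
FORCE_SLASH = re.compile(r"^/?([^/]{2})$")
STRIP_SLASH = re.compile(r"^/?([^/]{2}/.+)/$")


def redirect_target(path):
    """Return the path the rules would redirect to, or None if neither rule matches."""
    m = FORCE_SLASH.match(path)
    if m:
        return m.group(1) + "/"   # bare language code: add the trailing slash
    m = STRIP_SLASH.match(path)
    if m:
        return m.group(1)         # lower-level URL: drop the trailing slash
    return None


for path in ["en", "en/", "en/about_us/", "en/about_us"]:
    print(repr(path), "->", redirect_target(path))
```

Note that both patterns accept any two-character first segment, so they apply to /fr and /de in the same way as /en.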