d.d

Exclude based on language part of URL
« on: January 16, 2013, 10:22:41 AM »
Apologies if this has been covered before - I've searched without any joy.

I have a site with multiple language parts:
mysite.com/en for English
mysite.com/fr for French
mysite.com/de for German

etc.

I want to set up the sitemap generator to scan and include URLs from the English part of the site (mysite.com/en) only.

The English part of the site has links to the other language parts, but I don't want to include or follow them.

I do not want to scan other parts of the site for links to English pages.
I do not want any links outside the /en part of the site to appear in the sitemap.

If I put "en" in "Include only URLs" and "Parse only URLs" it still seems to follow some links outside the /en part of the site.

I'm guessing this is because "en" is quite a common pair of letters to find in other parts of a URL.

If I put "en/" or "/en" in "Include only URLs" and "Parse only URLs" it fails to scan fully.

Is there a way to parse and include only URLs whose path starts with "/en"?
(Perhaps by putting a regular expression in those fields?)

Thanks in advance for any help you can offer.
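
The guess above about "en" can be illustrated with a small Python sketch. It assumes the filter fields do plain substring matching, which this thread does not confirm, and it uses made-up example URLs. The function names are hypothetical and are not part of the sitemap generator.

```python
from urllib.parse import urlparse

# Hypothetical URLs, for illustration only.
urls = [
    "http://mysite.com/en/about_us",
    "http://mysite.com/fr/evenements",
    "http://mysite.com/de/kontakt/senden",
    "http://mysite.com/fr/contact",
]

def substring_match(url, needle="en"):
    """Naive filter (assumed behaviour): keep the URL if the needle occurs anywhere."""
    return needle in url

def first_segment_match(url, lang="en"):
    """Anchored filter: keep the URL only if its first path segment is the language code."""
    segments = urlparse(url).path.strip("/").split("/")
    return segments[0] == lang

naive = [u for u in urls if substring_match(u)]
anchored = [u for u in urls if first_segment_match(u)]

print(naive)     # also contains the /fr/evenements and /de/kontakt/senden URLs
print(anchored)  # only the /en/ URL
```

With substring matching, "evenements" and "senden" both contain "en", so French and German pages slip through. The anchored check looks only at the first path segment.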
Re: Exclude based on language part of URL
« Reply #1 on: January 17, 2013, 12:52:29 PM »
Hello,

Please try specifying mysite.com/en/ (with a trailing slash) as the Starting URL.
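
One plausible reason the trailing slash matters is that a crawler may limit itself to the "directory" of the starting URL. This is an assumption: the thread does not say how the generator scopes a crawl. Under standard URL resolution, the directory of mysite.com/en is the site root, while the directory of mysite.com/en/ is /en/ itself. A minimal Python sketch of that idea, with hypothetical helper names:

```python
from urllib.parse import urljoin

def crawl_scope(start_url):
    """Return the directory prefix of the starting URL (hypothetical scoping rule)."""
    # Resolving "." against a URL yields the directory that contains its last segment.
    return urljoin(start_url, ".")

def in_scope(url, start_url):
    """True if url lies under the starting URL's directory."""
    return url.startswith(crawl_scope(start_url))

print(crawl_scope("http://mysite.com/en"))   # scope is the whole site
print(crawl_scope("http://mysite.com/en/"))  # scope is the /en/ directory
print(in_scope("http://mysite.com/fr/contact", "http://mysite.com/en/"))
```

If the generator works along these lines, starting at /en without the slash would leave every language section in scope.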

d.d

Re: Exclude based on language part of URL
« Reply #2 on: January 23, 2013, 10:03:59 AM »
Thanks very much - that worked just fine.

To be clear:
  • I set the starting URL as [ External links are visible to forum administrators only ]
  • I emptied the "Exclude URLs", "Do not parse URLs", "Include ONLY URLs" and "Parse ONLY URLs" fields

I had to adjust my .htaccess file, since URLs such as /en on my site are virtual - I have a central URL handler, so for SEO purposes I had stripped the final '/'.
I adjusted the .htaccess file to allow for 'mysite.com/en/' (and force the trailing '/'), while removing the trailing '/' from lower URLs (mysite.com/en/about_us).

Extract from .htaccess shown below. YMMV.


# Extract only: "RewriteEngine On" must already be set earlier in the file.
# Note: [^/]{2} matches any two-character first segment, not only language codes.

# Force trailing '/' on a bare language specifier URL (e.g. mysite.com/en/)
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2})$ $1/ [R=301,L]

# Remove trailing '/' on lower-level URLs (e.g. mysite.com/en/about_us)
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2}/.+)/$ $1 [R=301,L]
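
To see which requests the two rules above would redirect, the patterns can be tried outside Apache. The following Python sketch reuses the same regular expressions, which are simple enough to behave the same in Python's re module. It is only an approximation: it ignores the RewriteCond checks (GET method, not an existing file or directory), and the function name is made up for illustration.

```python
import re

# Same patterns as in the .htaccess extract.
ADD_SLASH = re.compile(r"^/?([^/]{2})$")
STRIP_SLASH = re.compile(r"^/?([^/]{2}/.+)/$")

def redirect_target(path):
    """Return the substitution a matching rule would produce, or None if no rule matches.

    The path is given as mod_rewrite sees it in .htaccess context,
    i.e. without the leading slash (the patterns tolerate one anyway).
    """
    m = ADD_SLASH.match(path)
    if m:
        return m.group(1) + "/"   # first rule: add the trailing slash
    m = STRIP_SLASH.match(path)
    if m:
        return m.group(1)         # second rule: drop the trailing slash
    return None

for p in ["en", "en/", "en/about_us/", "en/about_us"]:
    print(repr(p), "->", redirect_target(p))
```

Here "en" is redirected to "en/" and "en/about_us/" to "en/about_us", while "en/" and "en/about_us" match neither rule and are served as they are.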