XML Sitemaps Generator

Topic: Exclude based on language part of URL

d.d

Exclude based on language part of URL
« on: January 16, 2013, 10:22:41 AM »
Apologies if this has been covered before - I've searched without any joy.

I have a site with multiple language parts:
mysite.com/en for English
mysite.com/fr for French
mysite.com/de for German

etc.

I want to set up the sitemap generator to scan and include URLs from the English part of the site (mysite.com/en) only.

The English part of the site has links to other language parts of the site, but I don't want to include or follow them.

I do not want to scan other parts of the site for links to English pages.
I do not want any links outside the /en part of the site to appear in the sitemap.

If I put "en" in "Include only URLs" and "Parse only URLs" it still seems to follow some links outside the /en part of the site.

I'm guessing this is because "en" is quite a common pair of letters to find in other parts of a URL.

If I put "en/" or "/en" in "Include only URLs" and "Parse only URLs" it fails to scan fully.

Is there a way to parse and include only URLs whose path starts with "/en"?
(Perhaps by putting a regular expression in those fields?)

Thanks in advance for any help you can offer.
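
To illustrate the suspicion above, here is a minimal Python sketch comparing a plain substring test for "en" with a test on the start of the URL path. It is only an illustration, not how the sitemap generator works internally; the sample URLs and the function names are made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical sample URLs; mysite.com is the placeholder domain used in this post.
urls = [
    "http://mysite.com/en/about_us",
    "http://mysite.com/fr/content",
    "http://mysite.com/de/events",
    "http://mysite.com/en",
]


def substring_match(url, needle="en"):
    """True if the needle appears anywhere in the URL (plain substring test)."""
    return needle in url


def in_english_section(url):
    """True only if the URL path is /en itself or lies below /en/."""
    path = urlparse(url).path
    return path == "/en" or path.startswith("/en/")


substring_hits = [u for u in urls if substring_match(u)]
prefix_hits = [u for u in urls if in_english_section(u)]

print(substring_hits)  # includes the /fr and /de URLs ("content", "events")
print(prefix_hits)     # only the two /en URLs
```

The substring test also accepts the French and German URLs because "content" and "events" both contain "en", while the path-prefix test keeps only the English section.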

XML-Sitemaps Support

Re: Exclude based on language part of URL
« Reply #1 on: January 17, 2013, 12:52:29 PM »
Hello,

Please try specifying mysite.com/en/ (with a trailing slash) as the Starting URL.
Oleg Ignatiuk
www.xml-sitemaps.com

d.d

Re: Exclude based on language part of URL
« Reply #2 on: January 23, 2013, 10:03:59 AM »
Thanks very much - that worked just fine.

To be clear:
  • I set the starting URL as [external links are visible to admins only]
  • I emptied the "Exclude URLs", "Do not parse URLs", "Include ONLY URLs" and "Parse ONLY URLs" fields

I had to adjust my .htaccess file, since URLs such as /en on my site are virtual - I have a central URL handler, so for SEO purposes I had stripped the final '/'.
I adjusted the .htaccess file to allow for 'mysite.com/en/' (and force the trailing '/'), while removing the trailing '/' from lower URLs (mysite.com/en/about_us).

Extract from .htaccess shown below. YMMV.


# force trailing '/' on bare language specifier URL (eg mysite.com/en/)
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2})$ $1/ [R=301,L]

# remove trailing slash on lower-level URLs (eg mysite.com/en/about_us)
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?([^/]{2}/.+)/$ $1 [R=301,L]
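
For anyone wanting to check which paths the two patterns above would catch, here is a rough Python sketch that applies the same regular expressions to a few sample paths. It is an approximation under some assumptions: it ignores the RewriteCond lines (request method, existing files and directories), it assumes the path is matched without a leading slash as in .htaccess context, and the function name and sample paths are made up for the example.

```python
import re

# Same patterns as the two RewriteRule lines in the extract.
ADD_SLASH = re.compile(r"^/?([^/]{2})$")         # bare two-character segment, e.g. "en"
STRIP_SLASH = re.compile(r"^/?([^/]{2}/.+)/$")   # lower-level URL ending in "/"


def redirect_target(path):
    """Return the substituted path the rules would produce,
    or None if neither pattern matches (no redirect)."""
    m = ADD_SLASH.match(path)
    if m:
        return m.group(1) + "/"   # corresponds to the substitution "$1/"
    m = STRIP_SLASH.match(path)
    if m:
        return m.group(1)         # corresponds to the substitution "$1"
    return None


for sample in ["en", "en/", "en/about_us/", "en/about_us", "about"]:
    print(repr(sample), "->", repr(redirect_target(sample)))
```

Note that `[^/]{2}` matches any two-character first segment, not only language codes, so a site with other two-character top-level paths may need a stricter pattern.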

 
