Problems with SEF links in cyrillic alphabet
« on: September 02, 2012, 06:00:13 PM »
Hello!

I use the standalone version of XML Sitemap Generator.

My site (Joomla) has a SEF component that rewrites links into the Cyrillic alphabet.

I run the script on my local computer, where I have an Apache web server installed, because the real site contains over 20,000 pages. In the configuration section I pointed the script at the real site. After that I uploaded the generated sitemap to the real site, after changing the site URL in it, of course.

Now I am facing two problems.

1) There were 1000 broken links reported; however, they look OK, and I can open the pages from the list of broken links just by clicking on them. What should I do with them?

2) The sitemap contains links that are supposed to be in Cyrillic but look like gibberish, and of course they give a 404 error when clicked.

I am not much of an expert in this. Could you please advise where to look, or suggest relevant reading? I will send you my sitemap address by PM so you can take a look.
Thanks.
Re: Problems with SEF links in cyrillic alphabet
« Reply #1 on: September 02, 2012, 11:28:09 PM »
Hello,

1. What is an example of a broken link, plus the page referring to it?

2. I'd need an example URL here as well. You might just try enabling the UTF-8 support setting in the generator configuration, though.
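
To illustrate why the charset setting matters for Cyrillic links, here is a minimal Python sketch. It is not the generator's code. The path is a made-up example (the real URLs are not shown in this thread), and the function names are hypothetical. It shows a Cyrillic path percent-encoded as UTF-8, and the unreadable text you get if the same bytes are read as Windows-1251.

```python
from urllib.parse import quote, unquote

# Hypothetical SEF path, used only for illustration.
EXAMPLE_PATH = "/каталог/товар"

def encode_path(path):
    """Percent-encode a path as UTF-8 bytes, keeping '/' unescaped."""
    return quote(path, safe="/", encoding="utf-8")

def misread_as_cp1251(path):
    """Show the UTF-8 bytes of a path as they look when decoded as Windows-1251."""
    return path.encode("utf-8").decode("cp1251", errors="replace")

if __name__ == "__main__":
    encoded = encode_path(EXAMPLE_PATH)
    print(encoded)                            # form suitable for a sitemap <loc>
    print(unquote(encoded))                   # decodes back to the Cyrillic path
    print(misread_as_cp1251(EXAMPLE_PATH))    # garbled text from a charset mismatch
```

If the links in the sitemap look like the last line of output, a charset mismatch somewhere between the site and the generator is a likely cause.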
Re: Problems with SEF links in cyrillic alphabet
« Reply #2 on: September 03, 2012, 09:15:03 AM »
Hello,

As for 1) you can see them at [ External links are visible to forum administrators only ].

Regarding 2), I will follow your advice and report the results later. So far I can only say that when I ran the free online generator, the Russian SEF links in the XML sitemap were OK. I suspect it may also have to do with the settings of my local web server.

BRG
Re: Problems with SEF links in cyrillic alphabet
« Reply #3 on: September 03, 2012, 02:00:15 PM »
I set UTF-8 and it's OK now. I also set a crawl delay. It looks better: no 404 pages during a short test run.

Thanks for your advice!

I've got only two questions left at the moment.

1) If some links are still reported as broken because of slow server response, although they are in fact OK, are they nevertheless included in the sitemap, or do I need to do something to have them included?

2) The generator crawls the "print" and "ask a question" pages in the Virtuemart shop, even though I chose the Joomla exclusion preset in the config section. These pages look like:

/index.php?page=shop.ask&flypage=flypage.tpl&product_id=123522&category_id=51634&option=com_virtuemart&Itemid=44

/index2.php?option=com_virtuemart&page=shop.product_details&only_page=1&category_id=51634&product_id=123527&pop=1&tmpl=component&

What string do I need to add to "Exclude URLs" to prevent the generator from crawling these pages?

Thanks again!
Re: Problems with SEF links in cyrillic alphabet
« Reply #4 on: September 03, 2012, 04:52:26 PM »
1. If the generator gets a "not found" response, it won't include those pages in the sitemap.
2. I'd recommend adding them to the "Exclude URLs" setting with:
Code:
shop.ask
pop=1
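
As a rough illustration of how these two entries would catch the URLs quoted above, here is a small Python sketch. It assumes each "Exclude URLs" entry is matched as a plain substring of the URL; that is an assumption for this example, not a statement of how the generator is implemented. The function name and the third URL are hypothetical.

```python
# Assumption: every exclusion entry is treated as a plain substring.
EXCLUDE = ["shop.ask", "pop=1"]

def is_excluded(url, patterns=EXCLUDE):
    """Return True if any exclusion entry occurs anywhere in the URL."""
    return any(p in url for p in patterns)

URLS = [
    # The two URLs quoted earlier in the thread.
    "/index.php?page=shop.ask&flypage=flypage.tpl&product_id=123522"
    "&category_id=51634&option=com_virtuemart&Itemid=44",
    "/index2.php?option=com_virtuemart&page=shop.product_details&only_page=1"
    "&category_id=51634&product_id=123527&pop=1&tmpl=component&",
    # Hypothetical regular product page that should stay in the sitemap.
    "/index.php?page=shop.product_details&product_id=123527&option=com_virtuemart",
]

if __name__ == "__main__":
    for url in URLS:
        print(is_excluded(url), url)
```

Under that assumption a short entry such as pop=1 would also match pop=10 or any other URL containing the same characters, so it is worth checking the result against a few regular pages.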
Re: Problems with SEF links in cyrillic alphabet
« Reply #5 on: September 03, 2012, 05:35:33 PM »
Thanks,

It works. However, I had the impression that when I added "shop.feed" to the URL exclusion list, the sitemap for some reason still included links containing "shop.feed". So I turned these links off on the site for the duration of the scan.

I also set a delay of 2 sec between every 5 requests. I think my hosting has anti-flooding software running. Slowly but surely  :)
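
For anyone reading later, this is roughly what I understand that setting to do, written as a small Python sketch. It is only my reading of "2 sec between 5 requests" (a pause after every fifth request), not the generator's actual code. The names crawl_with_delay, fetch, delay_sec and every are made up for the example.

```python
import time

def crawl_with_delay(urls, fetch, delay_sec=2.0, every=5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_sec after every `every` requests.

    `fetch` is any callable supplied by the caller; `sleep` can be replaced
    in tests so that no real waiting happens.
    """
    urls = list(urls)
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        # Pause after each full batch, but not after the very last request.
        if count % every == 0 and count < len(urls):
            sleep(delay_sec)
    return results

if __name__ == "__main__":
    # Dummy fetch that only echoes the URL, so the sketch runs without a network.
    pages = ["/page-%d" % n for n in range(1, 13)]
    fetched = crawl_with_delay(pages, fetch=lambda u: u, delay_sec=0.1)
    print(len(fetched), "pages fetched")
```

A longer pause or a smaller batch lowers the load on the hosting at the cost of a longer run.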

BRG.
Re: Problems with SEF links in cyrillic alphabet
« Reply #6 on: September 04, 2012, 05:18:45 PM »
Hello Oleg!

After a new run, 1000 links were again reported as broken. Could you please take a look at them here [ External links are visible to forum administrators only ].

Also, around 5000 links are not crawled at all. I can't see why.

Thanks.

PS. As I said in the beginning, I run the generator from my local PC and afterwards upload the sitemap and the generator itself to the real site, since the memory limit and execution time do not allow running it on the hosting.
Re: Problems with SEF links in cyrillic alphabet
« Reply #7 on: September 06, 2012, 06:09:15 AM »
Hello,

The problem in this case is not the server where the generator is running, but the server where the site is hosted, since it stops responding after some time when the generator crawls the site.
Re: Problems with SEF links in cyrillic alphabet
« Reply #8 on: September 06, 2012, 06:08:30 PM »
I will try experimenting with the delay time between requests, as well as with the number of requests between delays.

Could you please tell me whether there are any limits on the length of encoded URLs for the generator to crawl them successfully?
Re: Problems with SEF links in cyrillic alphabet
« Reply #10 on: October 01, 2012, 01:57:00 PM »
Hello!

Regardless of the crawl delay settings, I always get exactly 1000 links reported as broken, even on another domain. Why is it always 1000, and why are the vast majority of them always the same links, which actually work? Could you please take a look at [ External links are visible to forum administrators only ]. Do I need to re-engineer product and category names?

I use Joomla 1.5, Virtuemart 1.1.9 and SH404SEF.

Thanks

« Last Edit: October 01, 2012, 01:59:05 PM by capricorn »