XML Sitemaps Generator

Author Topic: Do not parse URLs are also parsed  (Read 16130 times)

kodokmarton

  • Registered Customer
  • Jr. Member
  • *
  • Posts: 31
Do not parse URLs are also parsed
« on: February 13, 2011, 11:41:24 AM »
I have set up a regular expression like this in the `Do not parse` section:

Code:
.*\-[0-9]+\-[0-9]+\.html
This would match a URL like:
Code:
http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html
The matched section would be:
Code:
92417-2.html
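For reference, the pattern can be checked against that URL with a quick standalone script (plain preg_match; nothing generator-specific is assumed here):

Code:
<?php
// Quick standalone check: does the "Do not parse" pattern match the example URL?
// This only exercises the regular expression, not the generator itself.
$pattern = '#.*\-[0-9]+\-[0-9]+\.html#';
$url = 'http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html';

echo preg_match($pattern, $url) ? "pattern matches\n" : "pattern does not match\n";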
But I've noticed from the statistics that the number of parsed URLs is not correct.

For example I have this:
Code:
Links depth: 6
Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
Pages added to sitemap: 269177
Pages scanned: 269180 (409,694.0 KB)
Pages left: 36058 (+ 11303 queued for the next depth level)
Time passed: 18:06:14
Time left: 2:25:30
Memory usage: 426,070.2 Kb

As you can see in the section above, the following errors are present:
  • `Do not parse` URLs should never be listed in the `Current page` line, yet one is:
    Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
  • The `Pages scanned` count should not include `Do not parse` URLs; those links should be added directly to `Pages added to sitemap`:
    Pages scanned: 269180 (409,694.0 KB)
  • `Do not parse` URLs should never be queued for the next depth level; they should be added to the sitemap immediately and show up in the `Pages added to sitemap` line:
    Pages left: 36058 (+ 11303 queued for the next depth level)

Am I doing something wrong, or are these bugs?

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 10624
Re: Do not parse URLs are also parsed
« Reply #1 on: February 13, 2011, 09:52:20 PM »
Hello,

The progress indicator lists all URLs that are added to the sitemap (including those that are not parsed). If you want to exclude them completely, add that expression to the "Exclude URLs" setting instead.
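For reference, here is a rough sketch of the difference in effect between the two settings (all helper names, variables, and the `/private/` pattern below are purely illustrative; this is not the generator's actual code):

Code:
<?php
// Hypothetical sketch of how the two settings differ in effect
// (illustration only, not the generator's actual implementation).
function matchesAny($url, array $patterns)
{
    foreach ($patterns as $p) {
        if (preg_match($p, $url)) {
            return true;
        }
    }
    return false;
}

$excludePatterns    = array('#/private/#');                 // "Exclude URLs" (example value)
$doNotParsePatterns = array('#.*\-[0-9]+\-[0-9]+\.html#');  // "Do not parse" (pattern from post #1)

$foundLinks = array(
    'http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
    'http://www.example.org/private/admin.html',
    'http://www.example.org/index.html',
);

$sitemap = array();
$crawlQueue = array();
foreach ($foundLinks as $url) {
    if (matchesAny($url, $excludePatterns)) {
        continue;                          // "Exclude URLs": dropped entirely
    }
    $sitemap[] = $url;                     // "Do not parse" URLs still end up in the sitemap
    if (!matchesAny($url, $doNotParsePatterns)) {
        $crawlQueue[] = $url;              // only these pages are fetched and parsed later
    }
}

print_r($sitemap);     // two URLs: the -92417-2.html page and index.html
print_r($crawlQueue);  // one URL: index.html only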
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #2 on: February 13, 2011, 10:00:12 PM »
Hey, I want to include them, but not retrieve them.

So I have added the regular expression to the Do not parse configuration box.

I expected it to mean 'do not retrieve pages that contain these substrings in the URL, but still INCLUDE them in the sitemap'.

So I expected these URLs to show up in the sitemap without ever being retrieved.
The statistics should say:
Pages added to sitemap: 15
Pages scanned: 1

That means 1 page was scanned and 14 additional links were found; they were included in the sitemap but never retrieved, so they are not counted under `Pages scanned`.
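To put numbers on it, here is the counting I would expect, written out as a tiny sketch (hypothetical code, just to illustrate the expectation):

Code:
<?php
// Hypothetical illustration of the counting I expect for one parsed page
// that contains 14 links matching the "Do not parse" pattern.
$pagesScanned = 0;   // pages actually retrieved and parsed
$pagesAdded   = 0;   // URLs written to the sitemap

$pagesScanned += 1;  // the start page is fetched and parsed...
$pagesAdded   += 1;  // ...and it is also listed in the sitemap

$doNotParseLinks = 14;            // links on it that match "Do not parse"
$pagesAdded += $doNotParseLinks;  // listed in the sitemap without being fetched

echo "Pages added to sitemap: $pagesAdded\n"; // 15
echo "Pages scanned: $pagesScanned\n";        // 1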

This is definitely working wrong!

Do you understand the problem I have?

XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #3 on: February 13, 2011, 10:47:11 PM »
That is OK then - "scanned" doesn't mean "parsed" or "retrieved".
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #4 on: February 14, 2011, 06:40:08 AM »
It's not ok.

That's what I mean.

Look at the numbers at the top:
Code:
Pages added to sitemap: 269177
Pages scanned: 269180

They are very close.

It should be:
Code:
Pages added to sitemap: 269177
Pages scanned: 2345

Can you give me your phone number or Skype contact so I can explain in more detail?

XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #5 on: February 14, 2011, 10:15:31 AM »
"Pages scanned" is NOT the number of pages retrieved from server.
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #6 on: February 14, 2011, 05:26:23 PM »
If I create a page with 300,000 links on it, then when that page is parsed, the found `Do not parse` links are not instantly added to the sitemap.


XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #7 on: February 14, 2011, 08:42:41 PM »
Sorry, I'm not sure I understand what you mean; do you have a question related to the "Do not parse" setting?
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #8 on: February 14, 2011, 08:44:32 PM »
I expected URLs that match the pattern defined in `Do not parse` to be added to the sitemap automatically, without fetching the content of those URLs.

XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #9 on: February 15, 2011, 01:36:26 PM »
Exactly, that's what happens (they are still included in the pages counter, though).
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #10 on: February 15, 2011, 07:12:51 PM »
I don't think it works well.
Put a lot of URLs on a page and set up the `Do not parse` section to match those URLs.

The script adds around 80 items per minute to the sitemap.
A plain PHP loop would add hundreds of thousands of items in a minute, not just 80.
My impression is that the pages are actually being checked and downloaded.

XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #11 on: February 16, 2011, 11:08:06 AM »
I just tried the settings provided in post #1 above with the example link from the same post, and it worked correctly - the page was not retrieved from the server but was added to the sitemap.
Note that although the generator doesn't retrieve the page, it still needs to check every link for duplicates (i.e. scan the array of already-added pages to see whether the new link exists), which can be resource-consuming on large sites.
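For illustration, that kind of linear duplicate check looks roughly like this (a simplified sketch with an illustrative helper name, not the actual generator code):

Code:
<?php
// Simplified illustration of a linear duplicate check over already-added links.
// With n links collected so far, every new link costs up to n comparisons,
// which adds up quickly on sites with hundreds of thousands of pages.
function addLinkIfNew(array &$links, $url)
{
    foreach ($links as $existing) {   // scans the whole list in the worst case
        if ($existing === $url) {
            return false;             // already known, nothing to do
        }
    }
    $links[] = $url;                  // new link, remember it
    return true;
}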
Oleg Ignatiuk
www.xml-sitemaps.com

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #12 on: February 20, 2011, 08:45:04 AM »
The method you use to verify whether a URL already exists is very inefficient.

You need something more efficient than scanning the array every time; that is a really poor approach.

You need to change the structure of your collection so that an md5 hash of the URL is stored in the `key` part of the array. That way you don't need to iterate over the array at all: you just assume the URL does not exist yet and assign it. If it does already exist, the assignment is only a simple overwrite of the same key, which takes far less time than a full array scan (PHP array key lookups are effectively constant time, because arrays are hash tables internally).

Here is some sample code to help you.

Code:
// $pageUrls holds the links found on the current page, $referrer the page they came from
$collection['links'] = array();
foreach ($pageUrls as $currentUrl) {
    // use the md5 hash of the URL as the array key, so a duplicate URL simply overwrites itself
    $currentKey = md5($currentUrl);
    $collection['links'][$currentKey] = array($currentUrl, 1, 2, 3, $referrer, 5, 'or any other variable you want to set');
}
If you do a print_r($collection['links']), you will see something like:

Code:
Array
(
    [96e5494f6c488eec4eddca6df12af745] => Array
        (
            [0] => licitatie-publica-ro/furnizare-srot-soia--92417-2.html
            [1] => 1
            [2] => 2
            [3] => 3
            [4] => licitatie-publica-ro/agricultura-2.html
            [5] => 5
            [6] => or any other variable you want to set
        )

)

Benefits:
  • There won't be two keys with the same hash in the collection.
  • You don't need to scan the array each time you want to add a new unique link.
  • You take advantage of the array's unique-key behaviour.
  • It's the fastest way to do this (see the small lookup sketch below).
  • Your script will use less memory too (it doesn't have to keep an iteration copy in memory).
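To make the lookup explicit, here is a small sketch of the check itself (the helper name is just for illustration):

Code:
<?php
// With the md5 hash as the array key, checking for an existing URL is a
// single key lookup instead of a scan over everything collected so far.
function addLinkHashed(array &$links, $url, array $meta = array())
{
    $key = md5($url);
    if (isset($links[$key])) {
        return false;                 // URL already in the collection
    }
    $links[$key] = array_merge(array($url), $meta);
    return true;
}

$links = array();
addLinkHashed($links, 'licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
              array(1, 2, 3, 'licitatie-publica-ro/agricultura-2.html', 5));
addLinkHashed($links, 'licitatie-publica-ro/furnizare-srot-soia--92417-2.html'); // duplicate, ignored
echo count($links) . "\n"; // prints 1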

Please tell me whether the approach described here is clear, and also how soon you could optimize the sitemap generator to use it.

kodokmarton

Re: Do not parse URLs are also parsed
« Reply #13 on: March 13, 2011, 11:49:26 AM »
How is it going with the above implementation? I haven't heard back from you in weeks.

XML-Sitemaps Support

Re: Do not parse URLs are also parsed
« Reply #14 on: March 13, 2011, 01:52:58 PM »
We will take your post into consideration; however, we don't have immediate plans to implement it. Our investigation has shown that it wouldn't provide benefits in memory or other resource usage.
Oleg Ignatiuk
www.xml-sitemaps.com

 
