Do not parse URLs are also parsed
« on: February 13, 2011, 11:41:24 AM »
I have set up a regular expression like this in the `Do not parse` section:

Code: [Select]
.*\-[0-9]+\-[0-9]+\.html
This would match a URL like:
Code: [Select]
http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html
The part of the URL that triggers the match is:
Code: [Select]
92417-2.html
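
As a standalone sanity check (not the generator's own code), the pattern can be tested against the example URL with PHP's preg_match; the '#' delimiters below are only for illustration:
Code: [Select]
<?php
// Quick check that the example URL matches the "Do not parse" pattern
// (it ends in -digits-digits.html).
$pattern = '#.*\-[0-9]+\-[0-9]+\.html#';
$url = 'http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html';

if (preg_match($pattern, $url)) {
    echo "URL matches the Do not parse pattern\n";
} else {
    echo "URL does not match\n";
}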
But I've noticed from the statistics that the number of parsed URLs is not correct.

For example I have this:
Code: [Select]
Links depth: 6
Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
Pages added to sitemap: 269177
Pages scanned: 269180 (409,694.0 KB)
Pages left: 36058 (+ 11303 queued for the next depth level)
Time passed: 18:06:14
Time left: 2:25:30
Memory usage: 426,070.2 Kb

As you can see in the section above, the following errors are present:
  • It should never list `Do not parse` URLs as the current page:
    Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
  • Pages scanned: 269180 (409,694.0 KB)
    It should never include the `Do not parse` URLs in this count.
    Those links should be added directly to the Pages added to sitemap count.
  • `Do not parse` URLs should never be queued for the next depth level; they should be added to the sitemap automatically, so they should show up in the `Pages added to sitemap` line instead.
    Pages left: 36058 (+ 11303 queued for the next depth level)

Am I doing something wrong, or are these bugs?
Re: Do not parse URLs are also parsed
« Reply #1 on: February 13, 2011, 09:52:20 PM »
Hello,

the progress indicator lists all URLs that are added to the sitemap (including those that are not parsed). If you want to completely exclude them, you should add that expression to the "Exclude URLs" setting.
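
In other words, roughly speaking (an illustrative sketch with made-up patterns and helper functions, not the generator's actual implementation):
Code: [Select]
<?php
// Sketch of the difference between the two settings. The patterns and URLs
// here are examples only.
function matchesExcludeUrls($url) {
    return (bool) preg_match('#/private/#', $url);                // example "Exclude URLs" pattern
}
function matchesDoNotParse($url) {
    return (bool) preg_match('#.*\-[0-9]+\-[0-9]+\.html#', $url); // the "Do not parse" pattern
}

$sitemap = array();
$crawlQueue = array();
$foundUrls = array(
    'http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
    'http://www.example.org/private/internal.html',
    'http://www.example.org/index.html',
);

foreach ($foundUrls as $url) {
    if (matchesExcludeUrls($url)) {
        continue;                  // "Exclude URLs": dropped entirely, never appears anywhere
    }
    $sitemap[] = $url;             // both normal and "Do not parse" URLs end up in the sitemap
    if (!matchesDoNotParse($url)) {
        $crawlQueue[] = $url;      // only normal URLs are fetched and scanned for more links
    }
}

print_r($sitemap);                 // two entries
print_r($crawlQueue);              // one entry (index.html)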
Re: Do not parse URLs are also parsed
« Reply #2 on: February 13, 2011, 10:00:12 PM »
Hey man, I want to include them, but not retrieve them.

So I have added the regular expression to the Do not parse configuration box.

And I expected 'do not retrieve pages that contain these substrings in the URL, but still INCLUDE them in the sitemap'.

So I expected these URLs to show up in the sitemap, but never to be retrieved.
The statistics should say:
Code: [Select]
Pages added to sitemap: 15
Pages scanned: 1

That means 1 page was scanned, 14 additional links were found and included in the sitemap, and those links were never retrieved, so they are not counted in the `Pages scanned` line.
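
A toy illustration of the counting I expect (nothing to do with the generator's internals):
Code: [Select]
<?php
// One page is actually fetched; the 14 "Do not parse" links found on it go
// straight into the sitemap without being retrieved.
$pagesScanned = 1;
$doNotParseLinksFound = 14;
$pagesAddedToSitemap = $pagesScanned + $doNotParseLinksFound;

echo "Pages added to sitemap: $pagesAddedToSitemap\n"; // 15
echo "Pages scanned: $pagesScanned\n";                 // 1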

This is definitely working wrong!

Do you understand the problem I have?
Re: Do not parse URLs are also parsed
« Reply #4 on: February 14, 2011, 06:40:08 AM »
It's not ok.

That's what I mean.

Look at the numbers on top:
Code: [Select]
Pages added to sitemap: 269177
Pages scanned: 269180

They are very close.

It should be
Code: [Select]
Pages added to sitemap: 269177
Pages scanned: 2345

Can you give me your phone number or Skype so I can explain more?
Re: Do not parse URLs are also parsed
« Reply #6 on: February 14, 2011, 05:26:23 PM »
If I create a page with 300,000 links on it, when that page is parsed, the found `Do not parse` links are not added to the sitemap instantly.

Re: Do not parse URLs are also parsed
« Reply #7 on: February 14, 2011, 08:42:41 PM »
Sorry, I'm not sure I understand what you mean. Do you have a question related to the "Do not parse" setting?
Re: Do not parse URLs are also parsed
« Reply #8 on: February 14, 2011, 08:44:32 PM »
I expected URLs that match the pattern defined in `Do not parse` to be automatically added to the sitemap, without fetching the content of the URL.
Re: Do not parse URLs are also parsed
« Reply #9 on: February 15, 2011, 01:36:26 PM »
Exactly, that's what happens (they are still included in the page counters, though).
Re: Do not parse URLs are also parsed
« Reply #10 on: February 15, 2011, 07:12:51 PM »
I don't think this works well.
Put a lot of URLs on a page, and set up the Do not parse section to match those URLs.

The script adds around 80 items per minute to the sitemap.
A plain PHP loop would add hundreds of thousands of items in a minute, not just 80.
My feeling is that the pages are actually being checked and downloaded.
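
For comparison, a trivial loop like this (just a rough timing sketch, unrelated to the generator's code) appends 100,000 URLs to a plain array in a fraction of a second on ordinary hardware:
Code: [Select]
<?php
// Rough timing sketch: append 100,000 URLs to a plain PHP array.
$start = microtime(true);
$links = array();
for ($i = 0; $i < 100000; $i++) {
    $links[] = "http://www.example.org/page-$i-1.html";
}
printf("Added %d links in %.3f seconds\n", count($links), microtime(true) - $start);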
Re: Do not parse URLs are also parsed
« Reply #11 on: February 16, 2011, 11:08:06 AM »
I just tried the settings provided in post #1 above, with the example link from the same post, and it worked correctly: the page was not retrieved from the server but was added to the sitemap.
Note that although the generator doesn't retrieve the page, it still needs to check every link for duplicates (i.e. scan the array of pages already added to see whether the new link already exists), which might be resource consuming on large sites.
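
Roughly speaking, such a duplicate check behaves like the following sketch (illustrative only, not the exact generator code); the cost grows with the number of pages already added:
Code: [Select]
<?php
// Linear duplicate check: every new link is compared against every page
// already added, so each check is O(n).
function alreadyAdded($pages, $newUrl) {
    foreach ($pages as $existingUrl) {
        if ($existingUrl === $newUrl) {
            return true;
        }
    }
    return false;
}

$pages = array();
foreach (array('a-1-1.html', 'b-2-1.html', 'a-1-1.html') as $url) {
    if (!alreadyAdded($pages, $url)) {
        $pages[] = $url;
    }
}
print_r($pages); // the duplicate 'a-1-1.html' is skipped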
Re: Do not parse URLs are also parsed
« Reply #12 on: February 20, 2011, 08:45:04 AM »
The method you use to verify whether a URL already exists is very inefficient.

You need something more efficient than scanning the array each time; that is a really bad approach.

You need to change the structure of your collection so it stores an md5 hash of the URL in the `key` part of the array. That way you never need to iterate over the array: you just assume the URL does not exist yet and assign it. If it does exist, the assignment is only a simple overwrite of the same key, which takes far less time than a scan of the array.

Here is some sample code to help you.

Code: [Select]
$collection['links'] = array();

foreach ($pageUrls as $currentUrl) {
    // Use the md5 hash of the URL as the array key, so a duplicate URL simply
    // overwrites the same entry instead of requiring a scan of the array.
    $currentKey = md5($currentUrl);
    $collection['links'][$currentKey] = array($currentUrl, 1, 2, 3, $referrer, 5, 'or any other variable you want to set');
}
If you do a print_r($collection['links']);

you will see something like:

Code: [Select]
['96E5494F6C488EEC4EDDCA6DF12AF745']=>array(
  'licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
  '1',
  '2',
  '3',
  'licitatie-publica-ro/agricultura-2.html',
  '5',
   'or any other variable you want to set'
);
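
With this structure, checking whether a link already exists becomes a single key lookup instead of an array scan, for example:
Code: [Select]
<?php
// isset() on the hashed key replaces the scan over all existing links.
$collection = array('links' => array());

$newUrl = 'licitatie-publica-ro/furnizare-srot-soia--92417-2.html';
$key = md5($newUrl);

if (!isset($collection['links'][$key])) {
    $collection['links'][$key] = array($newUrl /* referrer, depth, etc. */);
}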

Benefits:
  • There won't be two keys with the same hash in the collection.
  • You don't need to scan the array each time you want to add a new unique link.
  • You take advantage of the array's unique-key behavior.
  • It's the fastest way to do this.
  • Your script will use less memory too (it does not have to keep an iteration copy in memory).

Please tell me if you understand the approach described here, and also how soon you can optimize the sitemap generator to use it.
Re: Do not parse URLs are also parsed
« Reply #13 on: March 13, 2011, 11:49:26 AM »
How is it going with the above implementation? I haven't heard back from you in weeks.
Re: Do not parse URLs are also parsed
« Reply #14 on: March 13, 2011, 01:52:58 PM »
We will take your post into consideration; however, we don't have immediate plans to implement it. Our investigation has shown that it would not provide benefits in memory or other resource usage.