"Do not parse" URLs are also parsed

Started by kodokmarton, February 13, 2011, 11:41:24 AM


kodokmarton

I have set up a regular expression like this in the `Do not parse` section:

.*\-[0-9]+\-[0-9]+\.html

This would match a URL like:
http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html

The matched section would be:
92417-2.html
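
For example, a quick PHP test shows the pattern catches that URL (the `~` delimiters and the preg_match() call are just my illustration of applying a PCRE pattern; I don't know how the generator applies it internally):

$pattern = '~.*\-[0-9]+\-[0-9]+\.html~';
$url     = 'http://www.example.org/licitatie-publica-ro/furnizare-srot-soia--92417-2.html';
if (preg_match($pattern, $url)) {
    echo "URL is caught by the Do not parse rule\n";
}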

But I've noticed from the statistics that the number of parsed URLs is not correct.

For example I have this:
Links depth: 6
Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
Pages added to sitemap: 269177
Pages scanned: 269180 (409,694.0 KB)
Pages left: 36058 (+ 11303 queued for the next depth level)
Time passed: 18:06:14
Time left: 2:25:30
Memory usage: 426,070.2 Kb


As you can see in the section above, the following errors are present:

  • The generator should never list "Do not parse" URLs as the Current page, yet it does:
    Current page: licitatie-publica-ue/bg-plovdiv:-servicii-de-asistenţă-informatică-277076-34.html
  • Pages scanned: 269180 (409,694.0 KB)
    This count should never include the "Do not parse" URLs.
    Those links should be added directly to the `Pages added to sitemap` count.
  • "Do not parse" URLs should never be queued for the next depth level; they should be added to the sitemap automatically, so they should show up in the `Pages added to sitemap` line (see the sketch after this list):
    Pages left: 36058 (+ 11303 queued for the next depth level)
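
Roughly, this is the accounting I would expect when a page's links are processed (a hypothetical sketch with made-up variable names, not the generator's real code):

$doNotParsePattern   = '~.*\-[0-9]+\-[0-9]+\.html~';
$foundLinks          = array(
    'licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
    'licitatie-publica-ro/agricultura-2.html',
);
$pagesAddedToSitemap = array();
$queue               = array();

foreach ($foundLinks as $link) {
    if (preg_match($doNotParsePattern, $link)) {
        $pagesAddedToSitemap[] = $link; // counted under "Pages added to sitemap" right away, never fetched
    } else {
        $queue[] = $link;               // only these are retrieved, so only they increase "Pages scanned"
    }
}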

Am I doing something wrong, or are these bugs?

XML-Sitemaps Support

Hello,

the progress indicator lists all URLs that are added to the sitemap (including those that are not parsed). If you want to exclude them completely, you should add that expression to the "Exclude URLs" setting.

kodokmarton

Hey man, I want to include them, but not retrieve them.

So I have added the regular expression to the Do not parse configuration box.

And I expected 'do not retrieve pages that contain these substrings in URL, but still INCLUDE them in sitemap'

So I expected these URLs to show up in the sitemap without ever being retrieved.
The statistics should say:
Pages added to sitemap: 15
Pages scanned: 1

That means 1 page was scanned, 14 additional links were found and included in the sitemap, and since those links were never retrieved they are not counted in the `Pages scanned` line.

This is definitely working wrong!

Did you understand the problem I have?


kodokmarton

It's not ok.

That's what I mean.

Look at the numbers on top:
Pages added to sitemap: 269177
Pages scanned: 269180


They are very close.

It should be
Pages added to sitemap: 269177
Pages scanned: 2345


Can you give me your phone number or Skype to explain more?


kodokmarton

If I create a page with 300,000 links on it, then when that page is parsed, the found "Do not parse" links are not instantly added to the sitemap.


XML-Sitemaps Support

Sorry, I'm not sure I understand what you mean. Do you have a question related to the "Do not parse" setting?

kodokmarton

I expected URLs that match the pattern defined in `Do not parse` to be automatically added to the sitemap, without fetching the content of the URL.


kodokmarton

I don't think it works well.
Put a lot of URLs on a page, and set up the `Do not parse` section to match those URLs.

The script adds around 80 items per minute to the sitemap.
A plain PHP loop would add hundreds of thousands of items in a minute, not just 80.
My feeling is that the pages are actually being checked and downloaded.
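
A trivial test shows the difference (just an illustrative micro-benchmark on plain arrays, not the generator's code):

$start = microtime(true);
$links = array();
for ($i = 0; $i < 300000; $i++) {
    $links[] = "http://www.example.org/page-$i-1.html";
}
printf("Added %d links in %.2f seconds\n", count($links), microtime(true) - $start);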

XML-Sitemaps Support

I just tried the settings provided in post #1 above with the example link from the same post, and it worked correctly - the page was not retrieved from the server but was added to the sitemap.
Note that although the generator doesn't retrieve the page, it still needs to check every link for duplicates (i.e. scan the array of pages already added to see whether the new link already exists), which might be resource-consuming on large sites.
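
For example, a duplicate check along these lines (purely illustrative; it is not necessarily the generator's exact code) has to walk the whole list for every new link found:

$sitemapUrls = array(); // URLs already collected - hundreds of thousands on a large site
$newUrl      = 'licitatie-publica-ro/furnizare-srot-soia--92417-2.html';
if (!in_array($newUrl, $sitemapUrls)) { // linear scan on every candidate link
    $sitemapUrls[] = $newUrl;
}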

kodokmarton

The method you use to verify whether a URL already exists is very inefficient.

You need something more efficient than scanning the array each time; that is a really bad approach.

You need to change the structure of your collection to store an md5 hash of the URL in the `key` part of the array. This way you don't need to iterate over the array: you just assume the entry does not exist yet and simply assign it. If it does already exist, the assignment is only a simple reassign, which takes very little time compared to an array scan.

Here is some sample code to help you.

// Key the collection by the md5 hash of each URL so duplicates simply overwrite each other.
$collection['links'] = array();
foreach ($pageUrls as $currentUrl) {
    $currentKey = md5($currentUrl);
    // Store whatever fields you need (URL, depth, referrer, etc.) under the hash key.
    $collection['links'][$currentKey] = array($currentUrl, 1, 2, 3, $referrer, 5, 'or any other variable you want to set');
}

If you do a print_r($collection['links']);

you will see something like

['96E5494F6C488EEC4EDDCA6DF12AF745']=>array(
  'licitatie-publica-ro/furnizare-srot-soia--92417-2.html',
  '1',
  '2',
  '3',
  'licitatie-publica-ro/agricultura-2.html',
  '5',
   'or any other variable you want to set'
);


Benefits:

  • There won't be two keys with the same hash in the collection.
  • You don't need to scan the array each time you want to add a new unique link (see the lookup sketch after this list).
  • You take advantage of the array's unique-key functionality.
  • It's the fastest way to do this.
  • Your script will use less memory too (it does not have to store an iteration copy in memory).
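
A duplicate check against such a hash-keyed collection is then a single isset() call instead of an array scan. For example, continuing the sample above with an example URL:

$collection = array('links' => array());
$collection['links'][md5('licitatie-publica-ro/agricultura-2.html')] = array('licitatie-publica-ro/agricultura-2.html');

$candidateUrl = 'licitatie-publica-ro/furnizare-srot-soia--92417-2.html';
$candidateKey = md5($candidateUrl);
if (!isset($collection['links'][$candidateKey])) {
    // First time we see this URL: store it; no iteration over the existing entries is needed.
    $collection['links'][$candidateKey] = array($candidateUrl);
}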

Please tell me if you understand the approach described here, and also how soon you can optimize the sitemap generator to use it.

kodokmarton

How is it going with the above implementation? I haven't heard back from you in weeks.

XML-Sitemaps Support

We will take your post into consideration; however, we don't have immediate plans to implement it. Our investigation has shown that it won't provide benefits in memory or other resource usage.