long running crawl
« on: October 08, 2008, 01:21:03 PM »
Hello Oleg,
  I have a question. This is our configuration:

Do not parse URLs: 184/
Progress state storage type: var_export
Maximum depth level: 7

We are crawling our big site, and at some point we see the following statistics:

Time is 9:18

Links depth: 7
Current page: 184/europe/deutschland/mecklenburg-vorpommern/fischland-darss-zingst/dierhagen/atr182756.html
Pages added to sitemap: 220441
Pages scanned: 220460 (1,030,267.2 Kb)

crawl_dump.log  77732K

The next time crawl_dump.log was updated was at

09:54

Links depth: 7
Current page: 184/europe/schweiz/wallis/crans-montana/edom-7789.html
Pages added to sitemap: 220481
Pages scanned: 220500 (1,030,267.2 Kb)
Pages left: 11051 (+ 0 queued for the next depth level)
Time passed: 7178:16
Time left: 359:45


During this period the runcrawl program (version 2.8) used 90% of the CPU and didn't hit the HTTP server. So, was the program parsing the crawl_dump.log file? Is this normal behaviour?

Thank you.

 
« Last Edit: October 08, 2008, 01:27:48 PM by lenk »
Re: long running crawl
« Reply #1 on: October 08, 2008, 11:26:59 PM »
Hello,

since you have "184/" defined in the "Do not parse" option, the sitemap generator doesn't request those URLs from your server (which improves performance) and just scans all remaining URLs to include them in the sitemap (hence the higher CPU usage).
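
Roughly, the behaviour described above looks like the sketch below. This is only a conceptual illustration in Python (the generator itself is a PHP application whose internals aren't shown here), and the fetch/extract_links callables and the prefix-matching rule are assumptions for illustration:

from collections import deque
from typing import Iterable, List

DO_NOT_PARSE_PREFIXES = ["184/"]  # assumed to mirror the "Do not parse URLs" setting

def crawl(start_urls: Iterable[str], fetch, extract_links) -> List[str]:
    """Conceptual sketch: URLs matching a "do not parse" prefix are still
    added to the sitemap but are never requested, so the loop becomes
    CPU-bound in-memory list processing rather than HTTP traffic."""
    queue = deque(start_urls)
    seen, sitemap = set(), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        sitemap.append(url)  # every discovered URL is still included in the sitemap
        if any(prefix in url for prefix in DO_NOT_PARSE_PREFIXES):
            continue  # skip the HTTP request and link extraction for these URLs
        for link in extract_links(fetch(url)):  # fetch/extract_links are hypothetical helpers
            queue.append(link)
    return sitemap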
Re: long running crawl
« Reply #2 on: October 09, 2008, 08:38:24 AM »
"Scans" means from the crawl_dump.log file?
Re: long running crawl
« Reply #3 on: October 10, 2008, 07:21:12 AM »
It scans the URL list in memory; the crawl_dump file is only loaded once, when the crawler starts.
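
In other words, the dump is read once to restore progress when the crawler (re)starts, the URL list then lives in memory, and the dump is only written back periodically so an interrupted crawl can resume. A rough illustration of that resume pattern, again in Python rather than the generator's actual PHP, with pickle used as a stand-in for the var_export storage type and the file name reused from this thread:

import os
import pickle

DUMP_FILE = "crawl_dump.log"  # the real generator stores a PHP var_export dump here;
                              # pickle is only a stand-in for this sketch

def load_state() -> dict:
    """Restore crawl progress once, at startup; afterwards everything is kept in memory."""
    if os.path.exists(DUMP_FILE):
        with open(DUMP_FILE, "rb") as f:
            return pickle.load(f)
    return {"queue": [], "sitemap": [], "seen": set()}

def save_state(state: dict) -> None:
    """Written back periodically so an interrupted crawl can be resumed."""
    with open(DUMP_FILE, "wb") as f:
        pickle.dump(state, f)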