XML Sitemaps Generator

Author Topic: long running crawl  (Read 13769 times)

lenk

  • Registered Customer
  • Approved member
  • *
  • Posts: 8
long running crawl
« on: October 08, 2008, 12:21:03 PM »
Hello Oleg,
  I have a question. This is our configuration:

Do not parse URLs: 184/
Progress state storage type: var_export
Maximum depth level: 7

We are crawling our big site, and at some point we see the following statistics:

Time is 9:18

Links depth: 7
Current page: 184/europe/deutschland/mecklenburg-vorpommern/fischland-darss-zingst/dierhagen/atr182756.html
Pages added to sitemap: 220441
Pages scanned: 220460 (1,030,267.2 Kb)

crawl_dump.log  77732K

The next time crawl_dump.log was updated was at

09:54

Links depth: 7
Current page: 184/europe/schweiz/wallis/crans-montana/edom-7789.html
Pages added to sitemap: 220481
Pages scanned: 220500 (1,030,267.2 Kb)
Pages left: 11051 (+ 0 queued for the next depth level)
Time passed: 7178:16
Time left: 359:45


During this period the runcrawl program (version 2.8) used 90% of the CPU and did not hit the HTTP server at all. So, was the program parsing the crawl_dump.log file? Is this normal behaviour?

Thank you.

 
« Last Edit: October 08, 2008, 12:27:48 PM by lenk »

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 10624
Re: long running crawl
« Reply #1 on: October 08, 2008, 10:26:59 PM »
Hello,

Since you have "184/" defined in the "Do not parse" option, the sitemap generator doesn't request those URLs from your server (which improves performance) and just scans all remaining URLs to include them in the sitemap (hence the higher CPU usage).
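To illustrate the behaviour described above, here is a minimal hypothetical sketch (not the generator's actual PHP code): URLs matching a "Do not parse" prefix such as "184/" are still added to the sitemap, but are never fetched or parsed, so the crawler spends CPU on in-memory list bookkeeping rather than HTTP requests.

```python
# Hypothetical sketch of the "Do not parse" option, assuming prefix
# matching against site-relative URLs as in the forum post.
DO_NOT_PARSE_PREFIXES = ["184/"]

def should_fetch(relative_url: str) -> bool:
    """Return False for URLs that are sitemapped but never requested."""
    return not any(relative_url.startswith(p) for p in DO_NOT_PARSE_PREFIXES)

def crawl_step(relative_url: str, sitemap: list) -> bool:
    """Add the URL to the sitemap; report whether it would also be fetched."""
    sitemap.append(relative_url)          # included in the sitemap either way
    return should_fetch(relative_url)     # fetched and parsed only if True
```

In this sketch a matched URL costs no network round trip, which matches the observation that the crawler was CPU-bound without hitting the HTTP server.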
Oleg Ignatiuk
www.xml-sitemaps.com
Send me a Private Message

For maximum exposure and traffic for your web site check out our additional SEO Services.

lenk

  • Registered Customer
  • Approved member
  • *
  • Posts: 8
Re: long running crawl
« Reply #2 on: October 09, 2008, 07:38:24 AM »
"Scans" means from the crawl_dump.log file?

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 10624
Re: long running crawl
« Reply #3 on: October 10, 2008, 06:21:12 AM »
It scans the URL list in memory; the crawl_dump is only extracted once, when the crawl is started.
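A hypothetical sketch of that resume flow (the actual generator is PHP and uses var_export-style state; the function names and state layout here are assumptions): the dump is parsed exactly once at startup, and every subsequent scan works on the resulting in-memory structures.

```python
import ast

def load_state(dump_text: str) -> dict:
    """Parse a saved crawl state ONCE at startup.

    Assumes a Python-literal-like dump for illustration; the real tool
    stores PHP var_export output. Later steps never re-read the file.
    """
    return ast.literal_eval(dump_text)

def pending_urls(state: dict) -> list:
    """Scan the in-memory lists to find URLs still left to process."""
    done = set(state.get("done", []))
    return [u for u in state.get("queue", []) if u not in done]
```

So the periodic crawl_dump.log writes seen in the log are checkpoints being saved, not the dump being re-parsed on every iteration.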
Oleg Ignatiuk
www.xml-sitemaps.com
Send me a Private Message


 
