long running crawl
« on: October 08, 2008, 01:21:03 PM »
Hello Oleg,
  I have a question. This is our configuration:

Do not parse URLs: 184/
Progress state storage type: var_export
Maximum depth level: 7

We are crawling our big site, at some moment we see the following statistic:

Time is 9:18

Links depth: 7
Current page: 184/europe/deutschland/mecklenburg-vorpommern/fischland-darss-zingst/dierhagen/atr182756.html
Pages added to sitemap: 220441
Pages scanned: 220460 (1,030,267.2 Kb)

crawl_dump.log  77732K

Next time when the crawl_dump.log was updated is


Links depth: 7
Current page: 184/europe/schweiz/wallis/crans-montana/edom-7789.html
Pages added to sitemap: 220481
Pages scanned: 220500 (1,030,267.2 Kb)
Pages left: 11051 (+ 0 queued for the next depth level)
Time passed: 7178:16
Time left: 359:45

During this period the runcrawl program (version 2.8) took 90% of CPU and didn't hit the HTTP server. So, was the program parsing the crawl_dump.log file? Is this normal behaviour?

Thank you.

« Last Edit: October 08, 2008, 01:27:48 PM by lenk »
Re: long running crawl
« Reply #1 on: October 08, 2008, 11:26:59 PM »

Since you have "184/" defined in the Do not parse option, the sitemap generator doesn't request those URLs from your server (which improves performance) and just scans all remaining URLs to include them in the sitemap (hence the higher CPU usage).
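The behaviour described above can be sketched roughly as follows. This is a hypothetical illustration, not the generator's actual code (which isn't shown in the thread): the pattern list, function names, and prefix-matching rule are all assumptions.

```python
# "Do not parse" prefixes, as configured in the first post.
DO_NOT_PARSE = ["184/"]

def fetch_and_extract_links(url):
    """Stub standing in for an HTTP fetch plus HTML link extraction."""
    return []

def matches_do_not_parse(url):
    """True if the URL starts with any 'Do not parse' prefix."""
    return any(url.startswith(prefix) for prefix in DO_NOT_PARSE)

def process(url, sitemap, queue):
    if matches_do_not_parse(url):
        # Matching URLs go straight into the sitemap without an HTTP
        # request -- the server is never hit, only CPU work remains.
        sitemap.append(url)
    else:
        # Other URLs are fetched and parsed for new links to crawl.
        sitemap.append(url)
        queue.extend(fetch_and_extract_links(url))
```

This would explain the reported symptoms: with over 200,000 URLs under `184/`, the loop spends all its time in string matching and list handling rather than waiting on network I/O.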
Re: long running crawl
« Reply #2 on: October 09, 2008, 08:38:24 AM »
"Scans" means from the crawl_dump.log file?
Re: long running crawl
« Reply #3 on: October 10, 2008, 07:21:12 AM »
It scans the URL list in memory; the crawl_dump is only extracted once, when the crawler starts.
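A rough sketch of that resume flow, under stated assumptions: the state format, file layout, and function name below are invented for illustration (the real generator is PHP and uses the var_export storage type mentioned in the first post; pickle stands in for it here).

```python
import pickle

def resume_crawl(dump_path):
    """Load saved crawl state once, then work entirely in memory."""
    # The dump file is read a single time, at startup.
    with open(dump_path, "rb") as f:
        state = pickle.load(f)
    pending = list(state["pending"])
    sitemap = list(state["sitemap"])
    # From here on the crawler operates only on these in-memory lists;
    # the dump is periodically rewritten for resume, never re-read.
    while pending:
        url = pending.pop(0)
        sitemap.append(url)
    return sitemap
```

So heavy CPU use late in a run points at the in-memory scan of a very large URL list, not at repeated parsing of crawl_dump.log.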