large site, generator keeps dying
« on: July 22, 2008, 12:41:05 PM »
I have posted this basic topic before for another large site, and the only way I solved it was to cut out a large portion of the site from the sitemap.  This meant that far fewer URLs were submitted to Google.

Here are the current details:

Approximate URLs to be included: 3.5 million
Current status: generator keeps dying

ls -al in generator/data:

drwxrwxrwx  2 admin  admin       4096 Jul 22 04:29 .
drwxrwxr-x  5 admin  admin       4096 Jul 15 12:05 ..
-rw-r--r--  1 apache apache 196096423 Jul 22 04:31 crawl_dump.log
-rw-r--r--  1 apache apache       174 Jul 22 04:31 crawl_state.log
-rw-rw-r--  1 admin  admin         11 Jul 15 02:15 placeholder.txt

I have assigned 3 GB of memory to the generator; the server has 6 GB total.  The generator/index.php page says it is using about 495 MB.  The crawl page, when it loads at all, says there are about 385k URLs indexed with another 895k URLs to go.

It has been running several days and has died at least 20 times.

Is there a way to get this program to index very large sites?  If not, I am considering making a manual sitemap for the majority of the URLs and excluding them from the automated process.
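
If it comes to that, the manual sitemap would really just be a script that dumps the detail URLs straight from the database into files of 50,000 URLs each (the sitemap protocol limit) plus a sitemap index.  Something along these lines, as a rough sketch only; the "providers" table, its columns, and www.example.com are placeholders for my real schema and domain:

<?php
// Rough sketch of a manual sitemap build for the individual detail pages.
// Table and column names ("providers", "in_npi", "state") are placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=directory', 'user', 'pass');
$stmt = $pdo->query('SELECT in_npi, state FROM providers');

$perFile = 50000;          // sitemap protocol limit per file
$fileNum = 0;
$count   = 0;
$fh      = null;

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    if ($count % $perFile === 0) {
        if ($fh) { fwrite($fh, "</urlset>\n"); fclose($fh); }
        $fileNum++;
        $fh = fopen("sitemap_detail_{$fileNum}.xml", 'w');
        fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        fwrite($fh, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    }
    $url = 'http://www.example.com/detail.php?in_npi=' . urlencode($row['in_npi'])
         . '&amp;state=' . urlencode($row['state']);      // & must be escaped in XML
    fwrite($fh, "  <url><loc>{$url}</loc></url>\n");
    $count++;
}
if ($fh) { fwrite($fh, "</urlset>\n"); fclose($fh); }

// Sitemap index pointing at the generated files, for submission to Google.
$idx = fopen('sitemap_detail_index.xml', 'w');
fwrite($idx, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($idx, "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
for ($i = 1; $i <= $fileNum; $i++) {
    fwrite($idx, "  <sitemap><loc>http://www.example.com/sitemap_detail_{$i}.xml</loc></sitemap>\n");
}
fwrite($idx, "</sitemapindex>\n");
fclose($idx);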

But I would like to try again to get this to work.

Thanks
Mike
Re: large site, generator keeps dying
« Reply #1 on: July 22, 2008, 04:56:46 PM »
Hello,

Make sure that you have ROR and HTML sitemap generation disabled, as they consume additional memory to store page titles/descriptions (changing this requires restarting the sitemap generator).
Re: large site, generator keeps dying
« Reply #2 on: July 25, 2008, 11:22:16 AM »
So far, so good.

I have disabled the ROR sitemap.  The crawl_dump.log file is now at 156 MB.  Here are the stats from the crawl page:

Links depth: 5
Current page: detail.php?in_npi=1467466706&state=CA
Pages added to sitemap: 36852
Pages scanned: 90700 (4,347,643.9 KB)
Pages left: 630350 (+ 46121 queued for the next depth level)
Time passed: 667:48
Time left: 4641:11
Memory usage: 255,449.1 Kb

There are about 3.5 million pages on this site.  I suspect that if this run goes to completion, it will take a lot more than 4 days.

I have set the "Progress state storage type" to var_export.  What is the memory issue between this setting and serialize?  Which uses less memory?  The var_export option is easier to read by humans.

I have set the "Save the script state" option to 3600, or 1 hour.  What is the memory issue associated with this setting? 

I have set the "Maximum memory usage" to 3000Mb.  It is only using 300Mb.  The machine has 6Gb of ram.  Is there a way to give it more memory, or is this an httpd/php limitation controlled elsewhere?

Thanks
Mike
Re: large site, generator keeps dying
« Reply #3 on: July 25, 2008, 04:34:10 PM »
Quote
I have set the "Progress state storage type" to var_export.  What is the memory issue between this setting and serialize?  Which uses less memory?  The var_export option is easier to read by humans.
Usually the serialize option consumes less memory, but in some PHP environments it has caused memory *leaks*, so we added the "var_export" option as well.
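Roughly, the difference between the two storage types looks like this (a simplified illustration only, not the generator's actual code):

<?php
// Simplified illustration of the two "Progress state storage type" options.
$state = array(0 => 29203.2, 1 => 'detail.php?in_npi=1508846221&state=CA', 2 => 629941);

// serialize: compact machine-readable string, restored with unserialize()
file_put_contents('crawl_state.log', serialize($state));
$restored = unserialize(file_get_contents('crawl_state.log'));

// var_export: valid, human-readable PHP source, restored by evaluating it
file_put_contents('crawl_state.log', 'return ' . var_export($state, true) . ';');
$restored = eval(file_get_contents('crawl_state.log'));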
Quote
I have set the "Save the script state" option to 3600, or 1 hour.  What is the memory issue associated with this setting? 
Ideally, saving the state should not cause leaks, but as mentioned above it sometimes does.

Quote
I have set the "Maximum memory usage" to 3000Mb.  It is only using 300Mb.  The machine has 6Gb of ram.  Is there a way to give it more memory, or is this an httpd/php limitation controlled elsewhere?
It probably just doesn't need more memory at this point.
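
If you want to verify what PHP itself will allow, independent of the generator setting, a quick check with standard PHP functions works:

<?php
// Compare the PHP-level memory limit with what the script actually uses.
echo 'memory_limit:  ' . ini_get('memory_limit') . "\n";
echo 'current usage: ' . round(memory_get_usage() / 1048576, 1) . " MB\n";
echo 'peak usage:    ' . round(memory_get_peak_usage() / 1048576, 1) . " MB\n";

// If the PHP limit ever turns out to be the bottleneck, it can be raised with
// ini_set('memory_limit', '3000M');       // at runtime, or
// php_value memory_limit 3000M            // in .htaccess under Apache/mod_php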


Also, check your site structure for further optimizations using the "Exclude URLs" and "Do not parse" options. The first excludes specific (non-important) URLs from the sitemap, while the second includes pages in the sitemap but doesn't *crawl* them. In your case, adding "detail.php" to the "Do not parse" option would probably make sense.
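
Conceptually, the two options behave like this (an illustrative sketch, not the generator's actual code; the helper functions and the extra exclude patterns are made up for the example):

<?php
// Illustrative sketch of the two URL filters.
$excludePatterns    = array('login.php', 'print.php');   // "Exclude URLs" (made-up examples)
$doNotParsePatterns = array('detail.php');                // "Do not parse"

function matchesAny($url, $patterns) {
    foreach ($patterns as $p) {
        if (strpos($url, $p) !== false) return true;
    }
    return false;
}

function handleUrl($url, $excludePatterns, $doNotParsePatterns) {
    if (matchesAny($url, $excludePatterns)) {
        return;                      // dropped entirely: not crawled, not in the sitemap
    }
    addToSitemap($url);              // every remaining URL is listed in the sitemap
    if (!matchesAny($url, $doNotParsePatterns)) {
        crawlForLinks($url);         // only these pages are fetched and parsed for links
    }
}

function addToSitemap($url)  { /* write a <url> entry */ }
function crawlForLinks($url) { /* fetch the page, queue discovered links */ }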
Re: large site, generator keeps dying
« Reply #4 on: July 26, 2008, 02:20:48 AM »
Thank you, this seems to be an improvement.  The crawl_dump.log file has reached 164 MB in just a few hours with no hanging.  I will monitor it and see what happens.

Meanwhile, the HTML access page (index.php) will not load, and I would like to monitor progress.  Can you help me analyze the crawl_state.log file?

The HTML display from a previous crawl attempt showed:

Links depth: 5
Current page: detail.php?in_npi=1467466706&state=CA
Pages added to sitemap: 36852
Pages scanned: 90700 (4,347,643.9 KB)
Pages left: 630350 (+ 46121 queued for the next depth level)
Time passed: 667:48
Time left: 4641:11
Memory usage: 255,449.1 Kb

A current snapshot of crawl_state.log shows:

 more crawl_state.log
array (
  0 => 29203.245185852,
  1 => 'detail.php?in_npi=1508846221&state=CA',
  2 => 629941,
  3 => 99620,
  4 => 4571292527,
  5 => 5,
  6 => '269,618.0 Kb',
  7 => 92568,
  8 => 354,

Some of the items are obvious, some are not.  Can you list the elements of this array?
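
My guesses so far, matching against the earlier HTML display (pure inference on my part, so please correct me):

0 - time passed, apparently in seconds
1 - current page
2 - pages left at the current depth level
3 - pages scanned
4 - bytes scanned (roughly matches the KB figure on the crawl page)
5 - links depth
6 - memory usage
7, 8 - not sure; possibly pages added to the sitemap and pages queued for the next depth level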

Thanks
Mike
Re: large site, generator keeps dying
« Reply #5 on: July 26, 2008, 12:31:03 PM »
Also I have a question/suggestion for development.

Would it be possible to control the levels step-by-step?

Here is the layout of this particular site.  It is relevant to a lot of large directory sites.  I own and maintain several.

Top-tier pages:  Geographical list (by state).  Lists all entries for that state, 100 per page with a "Next" link.  The next levels go as deep as needed; in this case, California has approximately 220,000 entries, or 2,200 "Next" links.  Each entry on these pages links to an individual page.

Specialty pages:  List by specialty and by state.  Each page lists a certain specialty within a specific state, and these pages also have "Next" links.  It is important to have these pages indexed.  Each entry on these pages links to an individual page.  The same entries are also included on the geographical pages, but ordered differently.

Individual pages:  One entry without any further links.

Here is a possible suggestion:

1. Run a preliminary crawl, completely excluding the individual pages from being crawled or added to sitemaps, but including all the general and specialty pages;

2. THEN reset the configuration parameters to READ the previously spidered-but-not-crawled links into a second, de-duplicated crawl list;

3. ADD the individual pages from the list to the sitemap as one crawl.

Personally, I would be willing to pay extra for a customized program. 

Thanks,
Mike
Re: large site, generator keeps dying
« Reply #6 on: July 26, 2008, 02:01:04 PM »
Hello,

you can use "Maximum depth level" option in sitemap generator configuration to limit the "depth of crawling", so that individual pages are added without crawling (also you can define individual page URL in "Do not parse" option for the same purpose).