crawling just stops after several hours
« on: July 20, 2016, 11:14:17 PM »
I have the sitemap generator running for several other sites on the same server, but the newest one never finishes. All have very similar configurations and are run via cron jobs.

The new generator (config file attached) keeps stopping after running for a while.
I got it to generate successfully after setting max pages to 100 and then to 1000, but after setting it to 10000 it ran for over six hours, then stopped with no new sitemap.

This new site is "deeper" and has many more links (over 15K products), but I thought this was supposed to handle an unlimited number of links?

The _only_ thing changed from the downloaded zip file is the /data/generator.conf file.

I'm running it from the command line (via a shell script):

#!/bin/bash
cd /var/www/sorrelsky/sitemap-generator || exit 1
/usr/bin/php /var/www/sorrelsky/sitemap-generator/runcrawl.php

I have the generator files above the site root (the same as for the other sites I've set up).
Re: crawling just stops after several hours
« Reply #2 on: July 21, 2016, 01:50:09 PM »
I ran it with nohup, so I captured nohup.out.

The top of the output is:
PHP Stack trace:
PHP   1. {main}() /var/www/sorrelsky/sitemap-generator/runcrawl.php:0
PHP   2. include() /var/www/sorrelsky/sitemap-generator/runcrawl.php:102
PHP   3. include() /var/www/sorrelsky/sitemap-generator/index.php:102
PHP   4. SiteCrawler->pJIy8HIUg() /var/www/sorrelsky/sitemap-generator/pages/page-crawlproc.inc.php:212
PHP   5. preg_replace() /var/www/sorrelsky/sitemap-generator/pages/class.grab.inc.php:1265
0 | 0 | 0.0 | 0:00:05 | 0:00:00 | 0 | 3,020.5 Kb | 0 | 0 | 3020
1 | 79 | 34.4 | 0:00:08 | 0:11:08 | 1 | 3,185.1 Kb | 1 | 79 | 3185
3 | 77 | 98.1 | 0:00:13 | 0:05:49 | 1 | 3,187.2 Kb | 3 | 3 | 3187
6 | 74 | 214.5 | 0:00:18 | 0:03:50 | 1 | 3,276.0 Kb | 6 | 30 | 3276
9 | 71 | 350.3 | 0:00:25 | 0:03:19 | 1 | 3,324.3 Kb | 9 | 69 | 3324
12 | 68 | 485.0 | 0:00:31 | 0:02:56 | 1 | 3,369.4 Kb | 12 | 108 | 3369
16 | 64 | 644.3 | 0:00:36 | 0:02:24 | 1 | 3,421.1 Kb | 16 | 149 | 3421
...

The end of the file is:
...
13492 | 175 | 734,310.8 | 6:28:50 | 0:05:02 | 6 | 76,352.6 Kb | 4434 | 2015 | 76352
13496 | 171 | 734,492.4 | 6:28:56 | 0:04:55 | 6 | 76,374.7 Kb | 4435 | 2028 | 76374
13499 | 168 | 734,661.1 | 6:29:02 | 0:04:50 | 6 | 76,386.9 Kb | 4435 | 2028 | 76386
13500 | 167 | 734,717.4 | 6:29:02 | 0:04:48 | 6 | 76,389.7 Kb | 4435 | 2028 | 3
13503 | 164 | 734,886.3 | 6:29:08 | 0:04:43 | 6 | 76,404.3 Kb | 4435 | 2028 | 76404
13506 | 161 | 735,055.7 | 6:29:13 | 0:04:38 | 6 | 76,369.9 Kb | 4435 | 2028 | 76369
13509 | 158 | 735,176.3 | 6:29:20 | 0:04:33 | 6 | 76,434.3 Kb | 4436 | 2038 | 76434
13513 | 154 | 735,397.9 | 6:29:26 | 0:04:26 | 6 | 76,452.0 Kb | 4436 | 2038 | 76452
13516 | 151 | 735,564.8 | 6:29:31 | 0:04:21 | 6 | 76,467.7 Kb | 4436 | 2038 | 76467
13520 | 147 | 735,793.0 | 6:29:35 | 0:04:14 | 6 | 76,480.7 Kb | 4436 | 2038 | 76480
13521 | 146 | 735,849.0 | 6:29:41 | 0:04:12 | 6 | 76,487.4 Kb | 4436 | 2038 | 76487
Re: crawling just stops after several hours
« Reply #3 on: July 21, 2016, 11:17:55 PM »
I just ran it again with max pages set to 5000 and debug on, and it finished.

But... I also turned off canonical URLs (which I'd really prefer to keep enabled), so I'll try again with the same settings except with canonical URLs turned back on.
Re: crawling just stops after several hours
« Reply #4 on: July 22, 2016, 06:16:57 AM »
Also, you might want to increase the "Maximum running time" setting in the generator configuration, since the process takes quite a long time to finish in this case.
Re: crawling just stops after several hours
« Reply #5 on: July 22, 2016, 06:11:25 PM »
I tried running it again with debug on and max pages set to 10000, and it just stopped after several hours: no errors in debug.log, just the last page processed.

I have max execution time set to 0 [unlimited], and have had for all my attempts.
Is there some other setting?
Re: crawling just stops after several hours
« Reply #6 on: July 23, 2016, 12:33:42 AM »
It stopped again many hours later, but I managed to grab the stderr output:

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 116857 bytes) in /var/www/sorrelsky/sitemap-generator/pages/class.grab.inc.php on line 1265
PHP Stack trace:
PHP   1. {main}() /var/www/sorrelsky/sitemap-generator/runcrawl.php:0
PHP   2. include() /var/www/sorrelsky/sitemap-generator/runcrawl.php:102
PHP   3. include() /var/www/sorrelsky/sitemap-generator/index.php:102
PHP   4. SiteCrawler->pJIy8HIUg() /var/www/sorrelsky/sitemap-generator/pages/page-crawlproc.inc.php:212
PHP   5. preg_replace() /var/www/sorrelsky/sitemap-generator/pages/class.grab.inc.php:1265
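As a sanity check, the "Allowed memory size" in that fatal error is exactly a 128M limit expressed in bytes:

```php
<?php
// 134217728 bytes from the fatal error, converted to megabytes:
// 134217728 / 1024 / 1024 = 128, i.e. the process hit a 128M memory_limit.
echo 134217728 / (1024 * 1024), " MB\n"; // prints "128 MB"
```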

Perhaps there is a memory leak in runcrawl?

The memory_limit on the server is 128M.
Re: crawling just stops after several hours
« Reply #7 on: July 23, 2016, 05:54:04 AM »
Hello,

You need to increase the memory_limit setting, since 128M is not sufficient in this case.
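A minimal way to do that for this one CLI run, without touching the server-wide php.ini (the 512M value is an assumption on my part; size it above the peak usage you observe):

```php
<?php
// Sketch: raise the limit at the very top of runcrawl.php, before the
// crawler code is included. This affects only the current process.
// 512M is a guessed value -- pick one above the crawl's peak usage.
ini_set('memory_limit', '512M');

// Confirm the new limit took effect:
echo ini_get('memory_limit'), "\n"; // prints "512M"
```

Alternatively, the limit can be raised per-invocation from the wrapper script with `php -d memory_limit=512M runcrawl.php`, which avoids editing the generator's files at all.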
Re: crawling just stops after several hours
« Reply #8 on: July 25, 2016, 02:04:47 PM »
I had to increase the memory limit with ini_set() at the top of runcrawl.php, and now it finishes.
It still seems like there is a memory leak somewhere.