Memory size exhausted at only 13K URLs
« on: April 26, 2009, 12:56:06 AM »
I've kept kicking up my php.ini memory allocation, so now I'm at 256MB. Unfortunately, at 128MB I got the following error after just 14,900 URLs (at least I think that's the number in the far-left column that counts up by 20). I'm using the command-line interface to save on memory and processing, but I'm still getting nailed.

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 16372403 bytes) in /home/mysite/public_html/generator/pages/class.utils.inc.php(2) : eval()'d code on line 6

Do I have a large site? Not as large as some. I've got a forum with about 30K threads and 90K posts, plus a few hundred other pages of content. But why is increasing the memory allocation doing nothing? I got the following error at 256MB as well, so it's clearly not just the php.ini setting that's causing the problem.


16000 | 54845 | 1,456,153.2 | 33:08 | 113:35 | 4 | 64,080.6 Kb | 14748 | 5711 | -491
16080 | 54765 | 1,464,312.4 | 33:18 | 113:27 | 4 | 65,042.1 Kb | 14811 | 5993 | 962
16100 | 54745 | 1,466,099.6 | 33:21 | 113:26 | 4 | 65,083.2 Kb | 14825 | 6043 | 41

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 16371921 bytes) in /home/mysite/public_html/generator/pages/class.utils.inc.php(2) : eval()'d code on line 6
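(A quick aside on the standard PHP knobs involved here, nothing specific to this generator: the php.ini directive is

memory_limit = 256M

Note that the CLI binary can read a different php.ini than the web server does, which is one common reason an increased limit doesn't show up in CLI errors. For a command-line run you can also override the limit per invocation without editing php.ini at all, e.g.

php -d memory_limit=256M runcrawl.php

where runcrawl.php is the generator's command-line script.)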
Re: Memory size exhausted at only 13K URLs
« Reply #1 on: April 26, 2009, 05:08:28 AM »
OK - I'm up to 53,000 links, but now I'm told that level 2 has 70,000 more links. What is the best way to handle a sitemap this large?
Re: Memory size exhausted at only 13K URLs
« Reply #2 on: April 26, 2009, 06:36:15 AM »
Hmm... my sitemap generation will take at least 6-8 hours. Apparently I've now got 64,000 pages added to the sitemap, with 60,000 pages left and 30,000 more on the next level and counting. I really need a plan for what to do. I think an HTML sitemap might be completely worthless at this size, although I'm not sure what other search engines use to spider a site. What do some of you do with this many pages?

Now it's stopped, and at this rate a sitemap will never be completed:

Links depth: 5
Current page: forums/showthread.php?p=57779&mode=threaded
Pages added to sitemap: 65881
Pages scanned: 87120 (7,031,332.3 KB)
Pages left: 57767 (+ 38193 queued for the next depth level)
Time passed: 278:46
Time left: 184:50
Memory usage: 174,120.3 Kb
Resuming the last session (last updated: 2009-04-26 00:58:58)
Fatal error: Allowed memory size of 373293056 bytes exhausted (tried to allocate 44317056 bytes) in /home/mysite/public_html/generator/pages/class.utils.inc.php(2) : eval()'d code on line 6
« Last Edit: April 26, 2009, 07:01:23 AM by paypal346 »
Re: Memory size exhausted at only 13K URLs
« Reply #3 on: April 26, 2009, 02:02:59 PM »
I tried restarting the script without generating HTML maps. This time it's even worse. The script froze in the web browser, and running it from the CLI produced the error below, despite setting a huge memory limit (256MB) and execution time in the .htaccess file as described here:

> php runcrawl.php
<html>
<head>
<title>XML Sitemaps - Generation</title>
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-15" />
<link rel=stylesheet type="text/css" href="pages/style.css">
</head>
<body>
Resuming the last session (last updated: 2009-04-26 06:16:36)
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 262144 bytes) in /home/mysite/public_html/generator/pages/class.grab.inc.php(2) : eval()'d code on line 101
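One likely reason the error still shows 134217728 bytes (128MB): php_value settings in .htaccess only apply to PHP running under Apache, so a CLI run never sees them and falls back to whatever php.ini sets. If the host allows it, the limit can be raised for the CLI run directly, e.g.

php -d memory_limit=512M runcrawl.php

(512M is just an example figure; use whatever the server can actually spare.)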

Regardless, I don't see a setting for making multiple sitemaps - I just realized that 50,000 is the limit per file anyway, so I'll need multiple sitemaps. Does xml-sitemaps do this automatically if you limit the link number?
« Last Edit: April 26, 2009, 02:10:12 PM by paypal346 »
Re: Memory size exhausted at only 13K URLs
« Reply #4 on: April 26, 2009, 06:33:10 PM »
Fine - now we've got it working: it did the first 30,000 links and stopped. But I can't find instructions on how to have xml-sitemaps produce multiple sitemaps of 30,000 links each. Could you help me out here? I've searched the forums but don't see a simple answer - do I run the crawl again after the first 30,000 links? How do I know it's being split into a complementary sitemap and not starting again from the beginning?
Re: Err... 50k pages... Now what?
« Reply #5 on: April 27, 2009, 09:35:11 AM »
Hello,

Google allows up to 50,000 URLs *per sitemap file*.
You can decrease that number, though, in the generator/config.inc.php file:
'xs_sm_size' => '20000',

and regenerate the sitemap.
Make sure that you create sitemap1.xml, sitemap2.xml, etc. files in the domain root and set 0666 permissions on them so that the sitemap generator can store the multi-file sitemap.
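If you have shell access, a quick way to do that (the path below is just this thread's example path - adjust it for your own site, and create as many numbered files as your URL count will need):

cd /home/mysite/public_html
touch sitemap.xml sitemap1.xml sitemap2.xml sitemap3.xml
chmod 666 sitemap.xml sitemap1.xml sitemap2.xml sitemap3.xml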
OK - so I assume these are the partial instructions for how to generate a sitemap for sites larger than 50K URLs. I'm just not sure what this means. Who is supposed to create the multiple sitemap.xml files, in sequential order? If I regenerate, it seems that the original file is recreated from scratch or updated. Unfortunately there is no manual explaining the process - what is the first file supposed to be called? Is it automatic, or do we rename it? Right now I'm working with 25K URLs at a time, since that stays within a reasonable memory limit, and the XML files are 5-6MB, which is a manageable size. Could you provide a sticky with the process for creating multiple sitemaps for large sites that can be submitted to Google?
Re: Memory size exhausted at only 13K URLs
« Reply #6 on: April 28, 2009, 05:51:08 PM »
Hello,

Even though sitemaps are split into multiple files, ALL links still have to be found in a single crawl, to avoid including duplicate URLs and running into endless loops of links.
If you are running a vBulletin-powered website, make sure to use the settings described at https://www.xml-sitemaps.com/forum/index.php/topic,241.html


Also, replied here: https://www.xml-sitemaps.com/forum/index.php/topic,2494.msg10556.html#msg10556
Re: Memory size exhausted at only 13K URLs
« Reply #7 on: April 28, 2009, 07:11:06 PM »
Thanks for the tips - I'll see what can be done. The challenge is that on sites with many pages the memory requirements can still go crazy. I was getting out-of-memory errors at 66K links (with 384MB allocated), and I have over 100,000 links in Google.

So let me understand - for large sites, set the memory limit high (384MB is quite high, but I'll try 512MB next time), run from the command line, don't set a limit on the number of links, and the script will automatically break up the XML files after it's finished running?

(1) Is there a way to set the generator to put fewer than 50K links in each sitemap, e.g. 40K, so that the file size of the XML files stays manageable?

(2) Is there a suggested memory size per number of links? If I'm hitting memory errors at 66K links, I'm wondering if I'll ever be able to complete the map - but it seems you have handled much larger sites.

(3) Are there any other settings/options we need to know about to split sitemaps into several files? How will they be placed?

(4) I saw the vBulletin exclude list - while I understand cutting down crawl time, why would you want to exclude items like profile.php and member.php? These are individual member profiles that do get indexed by Google, and wouldn't removing these pages from the sitemap result in far fewer pages being indexed than without a sitemap? Maybe I'm missing something here.

My suggestion is that you include a short section in the manual about sites with over 50K links - how to set it up and what to expect. That would probably help answer questions like mine! As always, great support here.
« Last Edit: April 28, 2009, 07:15:48 PM by paypal346 »
Re: Memory size exhausted at only 13K URLs
« Reply #8 on: April 29, 2009, 08:19:42 PM »
Quote
So let me understand - for large sites, set the memory limit high (384MB is quite high, but I'll try 512MB next time), run from the command line, don't set a limit on the number of links, and the script will automatically break up the XML files after it's finished running?
Yes, that's correct.
Quote
(1) Is there a way to set the generator to put fewer than 50K links in each sitemap, e.g. 40K, so that the file size of the XML files stays manageable?
You can set the maximum number of URLs per sitemap file manually in the generator/config.inc.php file with:
'xs_sm_size' => 40000,
(40,000 is the default)
Quote
(2) Is there a suggested memory size per number of links? If I'm hitting memory errors at 66K links, I'm wondering if I'll ever be able to complete the map - but it seems you have handled much larger sites.
That depends on many factors, actually, including the URL structure (longer URLs take more memory since all of them are kept in memory), the size of page titles (which are extracted for the HTML and ROR sitemaps), the number of links *waiting in the queue* since they are stored too, and even the PHP version, since different versions have different memory management schemes. Normally it's a trial-and-error process - you run it, and if the memory limit is reached you increase it and resume sitemap generation.
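If you want a rough feel for the per-URL memory cost on your own server, a small standalone test script (purely illustrative - this is not part of the generator) can measure it:

<?php
// Hypothetical experiment: store 100,000 URL-like strings in an array
// and see roughly how many bytes PHP spends per entry on this server.
$count  = 100000;
$before = memory_get_usage();
$urls   = array();
for ($i = 0; $i < $count; $i++) {
    $urls[] = "http://example.com/forums/showthread.php?t=" . $i;
}
$after = memory_get_usage();
printf("~%.0f bytes per stored URL\n", ($after - $before) / $count);

Multiply that figure by scanned plus queued URLs (plus titles, if HTML/ROR sitemaps are enabled) to get a ballpark for the crawler's link list alone.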
Quote
(3) Are there any other settings/options we need to know about to split sitemaps into several files? How will they be placed?
The sitemap generator will create files like:
sitemap.xml (an index file that lists all the other sitemap files)
sitemap1.xml
sitemap2.xml
etc., depending on the total number of URLs.
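For reference, the index file follows the standard sitemaps.org sitemap-index format, roughly like this (domain and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
    <lastmod>2009-04-29</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
    <lastmod>2009-04-29</lastmod>
  </sitemap>
</sitemapindex>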

Note that since the script doesn't have permission to create files in the domain root, you should manually create all those files (before sitemap generation finishes) and set 0666 permissions on them, so that the generator can write to them.

Quote
(4) I saw the vBulletin exclude list - while I understand cutting down crawl time, why would you want to exclude items like profile.php and member.php? These are individual member profiles that do get indexed by Google, and wouldn't removing these pages from the sitemap result in far fewer pages being indexed than without a sitemap? Maybe I'm missing something here.
Those are non-content pages though, so it's rare to see a visitor coming *from a search engine* who was looking for that member rather than the *content* on your forum.
Alternatively, you can add them to the "Do not parse" option, so they are included in the sitemap but not fetched from the site, which is much faster.