Solutions to Memory Problem
« on: March 17, 2008, 02:11:28 AM »
I am stuck in the same bind that many people trying to use this script are in. My site is hosted on a shared server at a hosting company. We are not able to increase the amount of memory available to the process. Based on the other threads I've seen in these forums, it appears 32M is the standard amount we all get allocated.

I'm looking to do two things: (1) figure out a way I can index my site given the constraints I have and (2) offer suggestions for improving the script so it will work under lower memory constraints.

(1) I bought this script because I want to index more than just 100 pages of my site. Given the memory constraints, that's about all I get. If I go another level deeper, I run out of memory. If all I get is 100 pages, this script isn't much use to me. Is it possible to run this script on one of my personal machines, which I control and can give as much memory as I want, and have it index a web site that lives on a different machine (the site on the hosted server)? I assume I can. However, the resulting files will be saved on my local machine, correct? So does this mean I will have to write a batch job to call the sitemap script, then, when it's done, FTP the files, and when that's done, ping Google?
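For what it's worth, here is a rough PHP sketch of what I imagine that batch job would look like. Everything in it is a placeholder on my part: the generator entry point ("runcrawl.php"), the paths, the FTP host, and the credentials.

```php
<?php
// Rough sketch only -- the generator entry point, paths, hostnames and
// credentials below are placeholders, not the script's real names.

// 1. Run the sitemap generator locally, where memory isn't restricted.
passthru('php /path/to/generator/runcrawl.php');

// 2. Upload the finished sitemap to the shared host over FTP.
$conn = ftp_connect('ftp.example.com');
ftp_login($conn, 'username', 'password');
ftp_pasv($conn, true);
ftp_put($conn, '/public_html/sitemap.xml', '/local/path/sitemap.xml', FTP_BINARY);
ftp_close($conn);

// 3. Ping Google with the public sitemap URL.
file_get_contents('http://www.google.com/ping?sitemap='
    . urlencode('http://www.example.com/sitemap.xml'));
```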

(2) I haven't looked at the code of the script yet, but I will assume you are not already doing these things.
(2a) Have you considered some sort of compression algorithm to make the URL list take up less space in memory?
(2b) Have you considered adding a config value for the maximum memory the script should use, and then, if the process needs more than that, having it start reading/writing the URLs to disk? It would take longer, but at least it would work. Rather than one huge, growing file of URLs, which would take far too long to scan for each URL check, use a bunch of smaller files: take the URL, add up the ASCII values of its characters, then MOD 1000. Create/open the file whose name includes that result, say tmp743.txt, and read the URLs from it one at a time to see if any match. If no match is found, add this URL to the end of that file. If you combine this with 2a, it will go even faster (see the sketch below). :)
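To make (2b) concrete, here is a rough PHP sketch of the bucketed-file lookup I mean; the function names and the tmpNNN.txt naming are just for illustration.

```php
<?php
// Sketch of suggestion (2b): keep the list of already-seen URLs in many
// small files on disk instead of one big in-memory array.

function bucket_file($url) {
    // Add up the ASCII values of the characters, then MOD 1000.
    $sum = array_sum(array_map('ord', str_split($url)));
    return 'tmp' . ($sum % 1000) . '.txt';   // e.g. tmp743.txt
}

// Returns true if the URL was seen before; otherwise records it and returns false.
function seen_before($url) {
    $file = bucket_file($url);
    if (file_exists($file)) {
        foreach (file($file, FILE_IGNORE_NEW_LINES) as $line) {
            if ($line === $url) {
                return true;   // already crawled
            }
        }
    }
    // New URL: append it to its bucket. (Combined with 2a, the stored
    // string could be shortened or compressed before writing.)
    file_put_contents($file, $url . "\n", FILE_APPEND);
    return false;
}
```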
Re: Solutions to Memory Problem
« Reply #1 on: March 17, 2008, 10:42:37 PM »
Hello,

If you only get 100 pages crawled and the 32M memory limit is exceeded at that point, it's possible the sitemap generator is trying to download a large file. Please PM me your generator URL so that I can check.

We have plans for further improvements to Sitemap Generator's memory usage, targeting low-memory server packages.
Re: Solutions to Memory Problem
« Reply #2 on: March 18, 2008, 05:10:44 PM »
With a max depth of 4, the script runs out of memory.  With a max depth of 3, the script completes but only includes approximately 175 pages.
Re: Solutions to Memory Problem
« Reply #4 on: March 19, 2008, 08:02:18 PM »
Well, I don't know exactly what it was failing on, but I played with the settings (including adding some URL exclusions) and reran it. I was able to get it to crawl 5,000+ pages with a max depth of 8 without erroring out. Again, I don't know what was wrong or which of my changes fixed it.
Re: Solutions to Memory Problem
« Reply #5 on: April 02, 2008, 01:21:34 AM »
The answer to your problem is simple: run the script on its own on a home/office server. I had been running the script for months on SiteGround shared hosting with no luck. It was only when I read a reply from the admin on this forum that I realized the script did not have to be on the same server as the website being crawled.

I have used XAMPP on a Windows machine; I had to change the php.ini file in about 5 locations to increase max_execution_time and memory. Then I just set the file locations and names in the configuration to suit my site.
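For anyone else trying this, the directives to look for are memory_limit and max_execution_time; the values below are only examples, and XAMPP keeps more than one copy of php.ini, so make sure you edit the one Apache actually loads.

```ini
; example values only -- pick whatever suits your machine
memory_limit = 256M
max_execution_time = 3600
```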

Apart from taking up a massive amount of memory and time, the script runs sweet! I assume it's taking so long to crawl because of the limit my shared host puts on the number of HTTP requests allowed.

I hope this has been helpful to anyone using shared hosting!

Dan
Re: Solutions to Memory Problem
« Reply #6 on: April 09, 2008, 01:28:07 AM »
I fixed this by opening .htaccess in my root (public_html) folder, adding "php_value memory_limit 16M" (without the quotes) to the bottom, and saving it.
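So the bottom of the .htaccess file simply ends up with the line below. (This only works where the host runs PHP as an Apache module and allows such overrides, and a larger value can be used if the host permits it.)

```
php_value memory_limit 16M
```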

Hope that helps. :)