multiple simultaneous executions
« on: August 02, 2008, 01:25:54 PM »
Is it possible to execute more than one instance of the program at one time?

For example, in a large site I am experimenting with generating a large number of individual sitemaps, each of which includes specific terms in the url.

Is it possible to execute more than one of these runs at the same time?

Re: multiple simultaneous executions
« Reply #1 on: August 02, 2008, 05:01:13 PM »

yes, but you will have to install separate generator instances for that (one copy of the script per run)
Re: multiple simultaneous executions
« Reply #2 on: August 03, 2008, 02:14:26 PM »
Thanks, Oleg

This is just a follow-up to describe how the process is working.  I currently have nine instances of the generator going, using about 96 exclusion terms (to keep matching URLs out of both the sitemap and the parse queue) and one required inclusion term.

I was unable to run them all simultaneously from the browser interface, so I did the setup in the browser and then launched the runs with runcrawl.php from the command prompt.  Each page is actually backed by a MySQL query; none of these databases is very big, so this is loading MySQL somewhat, but it is still working.  top shows CPU usage of about 12%.
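For anyone repeating this, the parallel command-line launches can be sketched as below. The install paths are hypothetical placeholders; substitute the directories where you unpacked each generator copy (runcrawl.php is the script shipped with the generator):

```shell
#!/bin/sh
# Launch one crawl per generator install, in parallel.
# /path/to/gen1 etc. are hypothetical install directories --
# replace them with wherever you unpacked each instance.
for dir in /path/to/gen1 /path/to/gen2 /path/to/gen3; do
  if [ -d "$dir" ]; then
    (cd "$dir" && php runcrawl.php > crawl.log 2>&1) &
  fi
done
wait  # return only when every background crawl has finished
```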

I am creating individual sitemaps, one per category, and listing them in sitemap-index.xml.  I re-submit the index to Google once per day as new maps are generated.  When it is all complete I will extract the URLs and manually build a urllist.txt.gz to submit to Yahoo.  I do this by grepping "http" from all the files, directing the output to a new file, and then using perl to strip the <loc>..</loc> tags.
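The extract-and-strip step can be done as a single pipeline. A minimal sketch, assuming the generated maps match a `sitemap-*.xml` naming pattern (the sample file here just stands in for one of them):

```shell
#!/bin/sh
# Sample input standing in for one generated per-category map;
# a real run would have many sitemap-*.xml files.
printf '<url>\n  <loc>http://example.com/page1</loc>\n</url>\n' > sitemap-sample.xml

# Pull every URL line out of the maps, strip the <loc> tags and
# surrounding whitespace, then compress into urllist.txt.gz for Yahoo.
grep -h "http" sitemap-*.xml \
  | perl -pe 's{</?loc>}{}g; s/^\s+|\s+$//g; $_ .= "\n"' \
  > urllist.txt
gzip -c urllist.txt > urllist.txt.gz
```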

On a different site I ended up building the sitemap 100% manually, since I was able to list all the page names; that method won't work for the current site, however.  That site had 2.5 million URLs, and I did it manually in a few hours, while the generator had been running for a week and had only reached about 600,000.

For me, practically speaking, the generator seems to max out at about 200,000 URLs.  I have been thinking about it, and there seems to be no easy way to automate the process for very large sites where pages are cross-referenced and cross-linked.  You either have to check each grabbed URL against the list of already-spidered/included URLs (which gets slower and slower as the list grows), or grab everything, duplicates included, and remove the duplicates afterwards.  Either way you reach a breaking point.
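Of the two options, the second (collect everything with repeats, then de-duplicate once at the end) is at least cheap to do outside the generator; one sort pass is O(n log n) total instead of re-scanning an ever-growing seen-list on every grab.  A minimal sketch with made-up URLs:

```shell
#!/bin/sh
# Simulated crawl output: cross-linked pages get grabbed repeatedly.
printf 'http://example.com/a\nhttp://example.com/b\nhttp://example.com/a\n' > urls-raw.txt

# A single sort pass removes the duplicates at the end of the run.
sort -u urls-raw.txt > urls-unique.txt
```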

I have reached the conclusion that building smaller maps is the best solution, when it can be done.

Thanks for your help, your tech support is really very good.