multiple simultaneous executions
« on: August 02, 2008, 01:25:54 PM »
Is it possible to execute more than one instance of the program at one time?

For example, on a large site I am experimenting with generating a large number of individual sitemaps, each of which includes only URLs containing specific terms.

Is it possible to execute more than one of these runs at the same time?

Thanks
Mike
Re: multiple simultaneous executions
« Reply #1 on: August 02, 2008, 05:01:13 PM »
Hello,

Yes, but you will have to install separate generator instances for that (e.g. domain.com/generator1/, domain.com/generator2/, etc.).
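
Roughly, the layout on the server would look something like this (the paths and directory names are only examples; each copy keeps its own configuration and data in its own directory):

    # make one copy of the generator per crawl you want to run in parallel;
    # each copy is then configured independently through its own browser interface
    cd /var/www/domain.com
    cp -r generator/ generator1/
    cp -r generator/ generator2/
    cp -r generator/ generator3/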
Re: multiple simultaneous executions
« Reply #2 on: August 03, 2008, 02:14:26 PM »
Thanks, Oleg

This is just a follow-up to describe how the process is working. I currently have 9 instances of the generator going, each using about 96 exclusion terms (to prevent both inclusion and parsing) and one required inclusion term.

I was unable to run them all simultaneously from the browser interface, so I did the setup in the browser and then started the crawls by running runcrawl.php from the command prompt. Each page is actually a MySQL query, and although none of these databases is very big, it is loading MySQL up a bit but still working. top shows CPU usage at about 12%.
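
For anyone trying the same thing, the command-line part looks roughly like this (the paths, log files, and exact php invocation are just my setup, so adjust to your own install):

    # kick off the crawl for each generator instance in the background,
    # logging each run's output so it can be checked later
    cd /var/www/domain.com
    for i in 1 2 3 4 5 6 7 8 9; do
        (cd generator$i && php runcrawl.php > crawl$i.log 2>&1) &
    done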

I am creating individual sitemaps, one for each category, and including them in sitemap-index.xml. I am re-submitting the index to Google once per day as new maps are generated. When it is all completed I will extract the URLs and manually build a urllist.txt.gz to submit to Yahoo. I do this by grepping "http" from all the files, directing the output to a new file, and then using perl to remove the <loc>..</loc> tags.
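
In case it is useful to anyone, the extraction step boils down to something like this (file names are examples; grepping for "<loc>" rather than "http" also avoids picking up the xmlns line at the top of each sitemap):

    # pull the <loc> lines out of every per-category sitemap
    grep -h "<loc>" sitemap-*.xml > urls-raw.txt
    # keep only the URL between the <loc>..</loc> tags, one per line
    perl -ne 'print "$1\n" if m{<loc>(.*?)</loc>}' urls-raw.txt > urllist.txt
    # gzip the list for submission to Yahoo
    gzip -c urllist.txt > urllist.txt.gz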

On a different site I ended up building the sitemap 100% manually, as I was able to list all the page names; that method won't work for the current site, however. That site had 2.5 million URLs and I built the map by hand in a few hours, while the generator had been running for a week and was only at about 600,000.

For me, practically speaking, it seems the generator is maxed out at about 200,000 URLs. I have been thinking about it, and it seems there is no easy way to automate the process for very large sites where pages are cross-referenced and cross-linked. You either have to check the list of spidered/included URLs with each grab (which becomes slower and slower as the list grows), or grab them all many times and then remove duplicates. Either way you reach a breaking point.
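
To illustrate the "remove duplicates afterwards" side of that trade-off, on the command line it comes down to something like this (the file names are hypothetical):

    # merge the raw URL lists collected by the separate runs and drop duplicates;
    # sorting once at the end avoids checking every new URL against a growing list
    cat urls-from-run-*.txt | sort -u > all-urls.txt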

I have reached the conclusion that building smaller maps is the best solution, when it can be done.

Thanks for your help; your tech support is really very good.

Mike