This is just a follow-up to describe how the process is working. I currently have 9 instances of the generator running, using about 96 exclusion terms (applied to both including and parsing) and one required inclusion term.
I was unable to run them all simultaneously from the browser interface, so I did the setup in the browser and then ran the crawls using runcrawl.php from the command prompt. Each page is actually generated by a MySQL query. None of the databases is very big, so this loads MySQL a bit, but it is still working. top shows CPU usage at about 12%.
I am creating individual sitemaps, one for each category, and including them in sitemap-index.xml. I re-submit the index to Google once per day as new maps are generated. When it is all complete I will extract the URLs and manually build a urllist.txt.gz to submit to Yahoo. I do this by grepping "http" from all the files and directing the output to a new file, then using perl to remove the <loc>..</loc> tags.
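For anyone who prefers a single script to the grep-and-perl step above, here is a rough Python sketch of the same idea. It is only an illustration, not what I actually ran: the function names are made up, and it assumes the per-category sitemaps are uncompressed XML files. Leave sitemap-index.xml out of the input, since its <loc> entries point at sitemaps rather than pages.

```python
import gzip
import html
import re
from pathlib import Path

# Captures the text between <loc> and </loc>, ignoring surrounding whitespace.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.DOTALL)


def extract_urls(xml_text):
    """Return every <loc> value found in one sitemap's text, in file order."""
    # Sitemap XML escapes characters such as & as &amp;; a plain-text URL
    # list needs the raw characters back.
    return [html.unescape(value) for value in LOC_RE.findall(xml_text)]


def build_urllist(sitemap_paths, out_path):
    """Write one URL per line to a gzip-compressed text file; return the count."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for path in sitemap_paths:
            text = Path(path).read_text(encoding="utf-8")
            for url in extract_urls(text):
                out.write(url + "\n")
                count += 1
    return count
```

Called as something like build_urllist(["sitemap-cat1.xml", "sitemap-cat2.xml"], "urllist.txt.gz"), where the input file names are placeholders for whatever the category maps are actually called.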
On a different site I ended up building the sitemap 100% manually, as I was able to list all the page names. That site had 2.5 million URLs; I built its sitemap by hand in a few hours, while the generator had been running for a week and was at about 600,000. That method won't work for the current site, however.
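When the full list of page URLs is already known, the manual build can be sketched roughly as below. This is an illustration under assumptions, not the exact procedure I used: the sitemap-N.xml naming, the base_url argument and the function names are all hypothetical. The one hard fact it relies on is the sitemaps.org limit of 50,000 URLs per sitemap file, which is why the list is split into chunks and tied together with an index.

```python
from pathlib import Path
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit in the sitemaps.org protocol
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sitemap_xml(urls):
    """Return the XML text of one sitemap listing the given page URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<urlset xmlns="{NS}">']
    lines += [f"  <url><loc>{escape(u)}</loc></url>" for u in urls]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def index_xml(sitemap_urls):
    """Return the XML text of a sitemap index pointing at the given sitemaps."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<sitemapindex xmlns="{NS}">']
    lines += [f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls]
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def write_sitemaps(urls, base_url, out_dir, size=MAX_URLS_PER_SITEMAP):
    """Write sitemap-1.xml, sitemap-2.xml, ... plus sitemap-index.xml."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for n, chunk in enumerate(chunked(urls, size), start=1):
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        (out / name).write_text(sitemap_xml(chunk), encoding="utf-8")
        names.append(name)
    root = base_url.rstrip("/")
    index = index_xml([f"{root}/{name}" for name in names])
    (out / "sitemap-index.xml").write_text(index, encoding="utf-8")
    return names
```

At 50,000 URLs per file, 2.5 million URLs would come out as 50 sitemap files plus the index.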
For me, practically speaking, the generator seems to max out at about 200,000 URLs. I have been thinking about it, and there seems to be no easy way to automate the process for very large sites where pages are cross-referenced and cross-linked. You have to either check the list of spidered/included URLs with each grab (which becomes slower and slower as the list grows), or grab them all many times and then remove duplicates. Either way you reach a breaking point.
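To make the first option concrete, here is a small Python sketch of the "check with each grab" approach. I do not know how the generator does this internally, so this is purely an assumption-laden illustration: get_links is a hypothetical callable that returns the links found on a page. It keeps the already-seen URLs in a hash set, which keeps each individual check cheap compared with scanning a plain list, although memory use still grows with the number of URLs.

```python
from collections import deque


def crawl(start_url, get_links, limit=None):
    """Breadth-first crawl that queues each URL at most once.

    get_links is a hypothetical callable: given a URL, it returns the
    URLs linked from that page. limit optionally caps the pages visited.
    """
    seen = {start_url}          # every URL ever queued, for the duplicate check
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []                # URLs in the order they were fetched
    while queue:
        url = queue.popleft()
        visited.append(url)
        if limit is not None and len(visited) >= limit:
            break
        for link in get_links(url):
            if link not in seen:  # set lookup, not a scan of a growing list
                seen.add(link)
                queue.append(link)
    return visited
```

On a heavily cross-linked site most links fail the "not in seen" test, so the check runs far more often than there are pages, which is why its cost matters so much.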
I have reached the conclusion that building smaller maps is the best solution, when it can be done.
Thanks for your help. Your tech support is really very good.