So let me understand: for large sites, set the memory limit high (384MB is quite high, but I'll try 512MB next time), try to run the script from the command line, don't set a limit on the number of links, and the script will automatically break up the XML files after it has finished running?
Yes, that's correct.
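As an aside, the memory limit can also be raised without editing the script: either per run on the command line (for example `php -d memory_limit=512M <path-to-the-generator's-crawl-script>` - the exact script name depends on your installation), or globally in php.ini. A minimal php.ini fragment:

```ini
; php.ini - raise the PHP memory ceiling for large crawls
memory_limit = 512M
```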
(1) Is there a way we can set the multiple-sitemaps generator to put out fewer than 50K links per sitemap, e.g. 40K, so that the file size of the XML files stays manageable?
You can set the maximum number of URLs per sitemap file manually in the generator/config.inc.php file with:
'xs_sm_size' => 40000,
(40000 is the default.)
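To see how that setting plays out, here is a small sketch (not the generator's own code) of how many sitemap files a given URL count produces at a given per-file cap:

```python
# Sketch: number of sitemap files for a crawl of total_urls links when each
# file holds at most urls_per_file URLs (40000 mirrors the xs_sm_size default).
import math

def sitemap_file_count(total_urls, urls_per_file=40000):
    """Files needed, rounding up; at least one file even for a tiny site."""
    return max(1, math.ceil(total_urls / urls_per_file))

print(sitemap_file_count(66000))          # 2 files at the 40000 default
print(sitemap_file_count(66000, 50000))   # 2 files at the 50K protocol cap
```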
(2) Is there a suggested memory size per number of links? If I'm hitting memory errors at 66K links, I wonder if I'll ever be able to complete the map - but it seems you have handled much larger sites.
That depends on many factors, actually, including the URL structure (longer URLs take more memory, since all of them are kept in memory), the size of page titles (which are extracted for the HTML and ROR sitemaps), the number of links *waiting in the queue* (since they are stored too), and even the PHP version, since different versions use different memory management schemes. Normally it's a trial-and-error process: you run it, and if the memory limit is reached, you increase it and resume sitemap generation.
(3) Are there any other settings/options we need to know about to split sitemaps into several files? How will they be placed?
The sitemap generator will create files like:
sitemap.xml (an index file that lists all other sitemap files)
sitemap1.xml, sitemap2.xml, etc., depending on the total number of URLs.
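As an illustration only (this is not the generator's actual code, and real sitemap files carry an XML declaration and namespace that are omitted here), splitting a URL list into numbered sitemap files plus an index could look like:

```python
# Sketch: split a URL list into sitemap1.xml, sitemap2.xml, ... of at most
# per_file URLs each, plus a sitemap.xml index referencing every chunk file.
def split_sitemaps(urls, base="http://example.com/", per_file=40000):
    files = {}
    chunks = [urls[i:i + per_file] for i in range(0, len(urls), per_file)]
    for n, chunk in enumerate(chunks, start=1):
        body = "".join(f"<url><loc>{u}</loc></url>" for u in chunk)
        files[f"sitemap{n}.xml"] = f"<urlset>{body}</urlset>"
    index = "".join(f"<sitemap><loc>{base}sitemap{n}.xml</loc></sitemap>"
                    for n in range(1, len(chunks) + 1))
    files["sitemap.xml"] = f"<sitemapindex>{index}</sitemapindex>"
    return files
```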
Note that since the script doesn't have permission to create files in the domain root, you should manually create all of those files before running the generator and set 0666 permissions on them, so that the generator can write to them.
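A shell sketch of that preparation step - the file names here are examples; create one file per sitemap you expect, plus the index:

```shell
# Pre-create the sitemap files in the web root and make them world-writable
# so the generator (running as the web server user) can write to them.
touch sitemap.xml sitemap1.xml sitemap2.xml
chmod 666 sitemap.xml sitemap1.xml sitemap2.xml
```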
(4) I saw the vBulletin exclude list - while I understand cutting down crawl time, why would you want to exclude items like profile.php and member.php? These are individual member profile pages that do get indexed by Google, and wouldn't removing them from the sitemap result in far fewer pages indexed than without a sitemap? Maybe I'm missing something here.
Those are non-content pages, though, so it's rare to see a visitor coming *from a search engine* who was looking for that member rather than for the *content* on your forum.
At the least, you can add them to the "Do not parse" option, so they are included in the sitemap but not fetched from the site, which is much faster.
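If you prefer to configure that in the file rather than the web interface, the "Do not parse" list lives in generator/config.inc.php alongside the other options. The option key below is an assumption - check your config.inc.php for the exact entry name used by your version:

```php
// Hypothetical key name - verify against your generator/config.inc.php.
// The value is a newline-separated list of URL substrings to include in the
// sitemap without fetching them from the site.
'xs_sm_donotparse' => 'profile.php
member.php',
```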