Really large site
« on: January 10, 2010, 11:51:25 AM »
Hello,
I've got a really large site, which consists of over 10 million pages and grows every day.
I ran the sitemap generator, and it loads approximately 20-30 thousand pages per day, so the complete sitemap will be ready in about a year.
I have set the generator to split the XML files at 20,000 pages (Number of URLs per file in XML sitemap: 20 000).
But the crawl is still running: over 200,000 pages have been loaded, there is no saved XML sitemap file yet, and the temp file is getting really large.

Is it possible to somehow set the generator to save the XML files continuously, with the pages already crawled? For example, after every 20 thousand pages?

Thank you
Re: Really large site
« Reply #1 on: January 10, 2010, 07:46:20 PM »
Hello,

with a website of this size the best option is to create a limited sitemap - restrict the "Maximum depth" or "Maximum URLs" option so that the crawl gathers about 200-300,000 URLs. These would be the main pages, serving as a "roadmap" sitemap for search engines.
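For reference, a "roadmap" sitemap produced this way is just a standard sitemaps.org file listing the main section pages; the URLs below are only illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
  <url>
    <loc>http://www.example.com/category/</loc>
  </url>
  <url>
    <loc>http://www.example.com/category/subcategory/</loc>
  </url>
</urlset>
```

Search engines follow the links on these listed pages to discover the rest of the site.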
Re: Really large site
« Reply #2 on: January 12, 2010, 11:45:51 PM »
I am having the same issue. I have a site with over 400k pages, and it is a photo site, so I want the photos to be part of the sitemap so Google will have an easier time indexing them. I will try what you suggested, but what else can I do to make sure all my pages are added, and that new pages are added when we release new content to the site?
Re: Really large site
« Reply #3 on: January 12, 2010, 11:49:49 PM »
What settings would I want to use then for my max depth and max URLs? How do I know that it will pick up the right pages for the roadmap? And what can I do if I want all the pages indexed, but it looks like that is going to take over a week?
Re: Really large site
« Reply #5 on: January 18, 2010, 08:46:56 AM »
Hello,
for us it's impossible to make a sitemap with only 400k links as a roadmap. The website is an image bank, and each of its over 6 million pages has its own keywords and SEO parameters. Clients are looking for unique photos with unique keywords.

In such a configuration the sitemap generator we bought is, sadly, useless for us :( Do you plan to upgrade SG so that it stores the sitemap continuously?

Thank You
Re: Really large site
« Reply #6 on: January 18, 2010, 12:02:11 PM »
Hello,

it is required for the sitemap generator to keep the complete list of crawled links in memory to avoid including duplicate URLs in the sitemap (otherwise the generator would crawl the site indefinitely). Creating a shorter XML sitemap is still very useful - search engines will also include the other pages of your site, but they will be able to find them faster/easier with a "shortlist" sitemap that exposes the main "routes" to the rest of your pages.
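To illustrate the trade-off: even if finished sitemap files were flushed to disk every 20,000 URLs (as requested above), the set of already-crawled URLs would still have to stay in memory for duplicate detection. A minimal sketch in Python - not the generator's actual code; the function names and chunk size are illustrative:

```python
import xml.sax.saxutils as su


def write_sitemap_chunks(urls, chunk_size=20000):
    """Deduplicate URLs and emit one sitemap XML document per chunk.

    The `seen` set is the part that cannot be dropped: without it the
    crawler would revisit the same URLs forever. The finished chunks,
    however, could be written to disk as soon as they are full.
    """
    seen = set()      # full crawled-URL list, kept in memory for dedup
    chunks = []
    current = []
    for url in urls:
        if url in seen:           # skip duplicates so the crawl terminates
            continue
        seen.add(url)
        current.append(url)
        if len(current) == chunk_size:
            chunks.append(_render(current))   # flush a finished sitemap
            current = []
    if current:                   # remainder after the last full chunk
        chunks.append(_render(current))
    return chunks


def _render(urls):
    """Render one chunk as a sitemaps.org <urlset> document."""
    body = "\n".join(
        "  <url><loc>%s</loc></url>" % su.escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + body + "\n</urlset>"
    )
```

Each returned chunk is an independent, valid sitemap file, so they could be saved (and even submitted) while the crawl is still running.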