Will my sitemap ever complete or should I try again?
« on: July 18, 2010, 05:23:25 PM »
I have a very large site dynamically generated from a database. I honestly don't even know how many pages it is, I have never successfully been able to sitemap it.

I have tried a few times with XML Sitemaps but have not been able to complete without fatal server errors.

I just moved the site to our own dedicated virtual server and am attempting to sitemap it again with XML Sitemaps. I have max_execution_time set to 30000 and the memory limit at 500M.

When I started the generator last night it was moving along at a good pace. I watched it for 2 hours and then went to bed. Today it is still running, but I'm not sure it will ever complete. Here is a current snapshot of its crawl:

Links depth: 5
Current page: quotes/22189/harry-s-truman/shall-able-remove-suspicion
Pages added to sitemap: 127955
Pages scanned: 127960 (1,332,251.1 KB)
Pages left: 56580 (+ 116185 queued for the next depth level)
Time passed: 12:15:30
Time left: 5:25:12
Memory usage: 211,146.2 Kb

What is happening is that the 'Auto Start Monitoring' counter counts down 121 seconds, and then the stats above update by about 140 pages. With at least 172,765 pages to go (pages left plus queued), that works out to roughly 1,235 more passes of this 121-second countdown, or about 149,400 seconds: roughly 41.5 hours.

I don't see that completing before it hits the 500M max memory setting in my php.ini. I'm not even sure why it has run past the 30000-second (8.3-hour) max_execution_time mark.
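To sanity-check my arithmetic, here is the same estimate in a few lines of Python (the per-pass rate is just what I observed in the snapshot above):

```python
# Rough remaining-time estimate based on the observed crawl rate.
pages_left = 56580 + 116185      # pages left + pages queued for the next depth level
pages_per_pass = 140             # pages added per stats update (observed)
seconds_per_pass = 121           # countdown between stats updates (observed)

passes = -(-pages_left // pages_per_pass)   # ceiling division
seconds = passes * seconds_per_pass
hours = seconds / 3600
print(passes, seconds, round(hours, 1))
```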

Any thoughts on what I should do?

Maybe it will complete? Maybe I can split the crawls into different url/directory paths?


Re: Will my sitemap ever complete or should I try again?
« Reply #1 on: July 19, 2010, 09:17:35 AM »

With a website of this size the best option is to create a limited sitemap: set the "Maximum depth" or "Maximum URLs" option so that it gathers about 100,000-150,000 URLs. Those would be the main pages, acting as a "roadmap" sitemap for search engines.

The crawling time itself depends mainly on the website's page generation time, since the generator crawls the site much like search engine bots do.
For instance, if it takes 1 second to retrieve each page, then 1,000 pages will be crawled in roughly 16-17 minutes.
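That estimate is simple sequential-fetch arithmetic, assuming one fetch at a time at a fixed per-page time:

```python
# Sequential crawl-time estimate: one fetch per page, fixed fetch time.
seconds_per_page = 1.0
pages = 1000
total_minutes = pages * seconds_per_page / 60
print(round(total_minutes, 1))
```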

Some of the real-world examples of big db-driven websites:
about 35,000 URLs indexed - 1h 40min total generation time
about 200,000 URLs indexed - 38 hours total generation time

With the "Maximum URLs" option defined, it would be much faster than that.
Re: Will my sitemap ever complete or should I try again?
« Reply #2 on: July 19, 2010, 03:51:50 PM »

Thanks for the reply. I think you are right in that a 'partial sitemap' is better than 'no sitemap'.

I talked to the web host that runs our dedicated virtual server, curious how high I could set the memory limit in php.ini. I think I could crank it to a seemingly ridiculous amount like a gig or more, and then I might not run out of memory while sitemapping. (That is what finally happened in yesterday's attempt after my first post: a fatal error for exceeding the 500MB limit I had set in php.ini.)

As I watched the sitemap build I was confused by what it was identifying as 'Links Depth'. At times it said '5' even though my link depth seemed to be only 4 (e.g. domain/quotes/22189/harry-s-truman/shall-able-remove-suspicion). That is a 4, right?

Likewise, while it was at depth 4 or 5 it showed it was indexing a page like "domain/quotes/keyword/funny", or even back at a link such as "domain/quotes/author/maurice-chevalier", when it had already worked itself deeper at earlier points.

Does the link depth start with the root? or the first level down from it?

I'm asking that question because optimally I would like to attempt a site map for sections as opposed to a max url count down to a specified depth.

The site is well organized. So, I'd like to index say just:

  • domain/quotes/author/{author-name} (and go no lower)
  • domain/authors/type/{author-type} (and go no lower)
  • domain/quotes/topic/{topic-name} (and go no lower)

How would I achieve that?
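To make concrete what I am hoping to express, here is a sketch in Python of the matching rule I am after; the path patterns are just the hypothetical section prefixes from my list above, not anything the generator actually uses:

```python
import re

# Hypothetical section prefixes to include, exactly one level deep.
SECTION_PATTERNS = [
    re.compile(r"^/quotes/author/[^/]+$"),   # domain/quotes/author/{author-name}
    re.compile(r"^/authors/type/[^/]+$"),    # domain/authors/type/{author-type}
    re.compile(r"^/quotes/topic/[^/]+$"),    # domain/quotes/topic/{topic-name}
]

def in_sitemap(path: str) -> bool:
    """True if the path matches an allowed section and goes no lower."""
    return any(p.match(path) for p in SECTION_PATTERNS)

print(in_sitemap("/quotes/author/maurice-chevalier"))        # included
print(in_sitemap("/quotes/author/maurice-chevalier/page/2")) # too deep, excluded
```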

Set the "Parse ONLY" URLs option to domain/author/ (like that, specifically? I am terrible with server paths) and then set "Maximum depth level" to what?

I appreciate your help and advice!

Re: Will my sitemap ever complete or should I try again?
« Reply #3 on: July 19, 2010, 09:36:46 PM »

actually the link depth is not the depth of "subfolders" in the URL, but *the number of clicks it takes to reach the page, starting from the homepage*. That is how the crawler works: it opens the homepage, finds all links on that page, then opens each of those pages, finds the links on them, and so on.
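That definition is a breadth-first traversal of the link graph. A toy sketch (the link graph below is made up; it only shows how a URL with four path segments can sit at link depth 2 if an author page links to it directly):

```python
from collections import deque

# Made-up link graph: page -> links found on that page.
links = {
    "/": ["/quotes/author/maurice-chevalier"],
    "/quotes/author/maurice-chevalier":
        ["/quotes/22189/harry-s-truman/shall-able-remove-suspicion"],
    "/quotes/22189/harry-s-truman/shall-able-remove-suspicion": [],
}

def link_depth(start="/"):
    """Breadth-first search: depth = clicks from the homepage."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in links.get(page, []):
            if link not in depth:
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth

print(link_depth())
```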

You can add this in "Do not parse" setting:
Code:
The generator *will* still include URLs containing those substrings, but it will not fetch them from the server to find inner links, so it should work *much* faster.
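To illustrate what "Do not parse" does (a sketch of the idea, not the generator's actual code, and the substrings are hypothetical): matching URLs are recorded in the sitemap but never downloaded, so no time is spent fetching them or extracting their links.

```python
# Sketch of "Do not parse": URLs matching any substring are added to the
# sitemap, but their pages are never fetched to discover inner links.
DO_NOT_PARSE = ["/quotes/keyword/", "/quotes/author/"]  # hypothetical substrings

def crawl_step(url, sitemap, queue, fetch_links):
    sitemap.add(url)                            # always listed in the sitemap
    if any(s in url for s in DO_NOT_PARSE):
        return                                  # listed, but never downloaded
    for link in fetch_links(url):               # normal case: follow inner links
        if link not in sitemap:
            queue.append(link)

sitemap, queue = set(), []
crawl_step("/quotes/keyword/funny", sitemap, queue,
           fetch_links=lambda u: ["/quotes/22189/harry-s-truman/x"])
print(len(queue))  # 0: the page was listed but its links were never followed
```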
Re: Will my sitemap ever complete or should I try again?
« Reply #4 on: May 19, 2017, 02:54:04 AM »
I am now getting 130,000 pages in 23 minutes. It used to take days to crawl 130,000 pages.
The hint was in the "Narrow Indexed Pages Set" section: try the Exclusion preset. I am now using this software without a hassle.