very large site
« on: April 20, 2008, 02:08:37 PM »
I just purchased and installed Unlimited Sitemap Generator, and am currently running my first crawl.  The previous sitemap, created by GSiteCrawler on a personal computer, took about a week to finish.  That sitemap had over 430,000 entries.

I set the program to keep running, with no timeouts and no stop on log-off.  Then I clicked a link in the browser tab, and afterwards I could not reload the page that shows how many pages have been spidered.

I am on a Linux server, and the top command shows a load average of about 1.40, which is OK.  I am using ls -al to check the logs in the data directory: crawl_dump.log is about 78 MB after roughly half an hour, and both crawl_dump.log and crawl_state.log keep updating their last-modified times.

I have 2 questions:

1.  If there is a problem, such as CPU usage climbing, how do I stop the process from a Linux command?  I cannot identify the process.  Will an Apache restart, which reloads PHP, stop the program?

2.  Is there any way to see how many pages have been indexed, and from that estimate how much time remains before the sitemap is complete -- without being able to view the PHP page in a browser?  Do the numbers in crawl_state.log contain this information?

Re: very large site
« Reply #1 on: April 20, 2008, 06:38:39 PM »

1. You can create an empty file named "interrupt.log" in the data/ folder to stop the generator. An Apache restart will also stop the generator script if you run it via the web interface (i.e., not from the command line).
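From the shell, the graceful stop comes down to a single touch command. A minimal sketch, assuming the generator lives in a folder here called generator/ (substitute your real install path; the mkdir line only exists so the demo runs standalone):

```shell
# "generator" stands in for wherever the script is installed on your
# server; data/ is the folder that holds crawl_dump.log.
GEN_DIR=generator
mkdir -p "$GEN_DIR/data"   # demo only -- the real folder already exists

# An empty interrupt.log asks the crawler to stop cleanly; the script
# checks for this file between pages.
touch "$GEN_DIR/data/interrupt.log"

# To locate the running process instead (useful if it was started from
# the command line rather than through Apache):
pgrep -af php || echo "no php processes found"
```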

2. You should see the progress on the Crawling tab page while the generator is running in the background. And yes, the progress is stored in the crawl_state.log file.
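When the browser page will not load, the same liveness check can be done from the shell, much as the original poster was already doing with ls -al. The exact format of crawl_state.log is internal to the generator, but file size and modification time are enough to confirm the crawl is alive. A sketch using a stand-in data/ folder (on the server, point these commands at the real data/ directory):

```shell
# Demo stand-in for the generator's data/ folder.
mkdir -p data && echo 'demo state' > data/crawl_state.log

# Fresh last-modified times and a growing crawl_dump.log mean the
# crawl is still running, even when the browser page is unresponsive:
ls -al data/crawl_state.log

# GNU stat prints mtime and size on one line, handy for a watch loop:
stat -c '%y %s' data/crawl_state.log
```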
Re: very large site
« Reply #2 on: April 20, 2008, 07:42:19 PM »
Thank you Oleg

What I was getting at was that the tab page would not reload after I left it once, so I could not see the progress.

But now I check back and the page loads normally.

Current page: cgi-bin/search.cgi?program_type=hematology&searchtype=city&state=TN&city1=L
Pages added to sitemap: 69996
Pages scanned: 70080 (2,075,245.3 Kb)
Pages left: 174933 (+ 74844 queued for the next depth level)
Time passed: 369:45
Time left: 922:57
Memory usage: 176,415.4 Kb
sitemaps not generated
« Reply #3 on: April 21, 2008, 11:02:30 AM »
This is a follow-up to my previous query titled "very large site".  The crawler stopped running.  In the /data directory, crawl_state.log is no longer there; crawl_dump.log is 129 MB and was last modified about 10 hours ago.  The /sitemaps directory is empty.  All subdirectories are chmod 777 (drwxrwxrwx).

Logging in to the main web page says "sitemaps not generated, go to crawl page".  The Crawling tab shows only the first line of text, with a checkbox that says "run in background".  The Sitemaps tab shows 3 sitemap links, but they all return 404 Not Found.

The Crawling tab takes a long time to load.  The page is cut off, and view source shows the last lines are:

<div class="inptitle">Run in background</div>
<input type="checkbox" name="bg" value="1" id="in1"><label for="in1"> Do not interrupt the script even after closing the browser window until the crawling is complete</label>

The server is running normally and no daemons have been stopped.  The top command shows a load average of 0.15, which is the normal minimum.

Has the program stopped or hung because of the large size of crawl_dump.log?  This server has 6 GB of RAM and dual Xeon(TM) 3.20 GHz CPUs.

If so, and if I delete the crawl_dump.log, are there recommended settings that will streamline the program for a second try?

Re: sitemaps not generated
« Reply #4 on: April 21, 2008, 02:06:21 PM »
One more thing: I downloaded the crawl_dump.log file, and here is the tail of the file:

Re: very large site
« Reply #5 on: April 21, 2008, 04:40:01 PM »
In most cases for large sites, it is possible to optimize the crawler settings so that the majority of pages are not parsed, which dramatically increases performance. Could you please PM me your generator URL so that I can check it?
Re: very large site
« Reply #6 on: April 28, 2008, 10:37:32 AM »
This is a thank-you to Oleg, who followed through with tech support, logging in to my server and modifying the program to handle large numbers of excluded URLs in a way that now works great!

This product solves a problem I have had for years.  The Windows-based sitemap generators are OK for smaller sites, but they had to run for a week to index this huge site.  Now it completes in a matter of hours.

This product is the best buy for the money I have seen for a long time.
Re: very large site
« Reply #7 on: April 28, 2008, 09:13:02 PM »
Can someone tell us what settings were changed to make this work?

Re: very large site
« Reply #8 on: April 28, 2008, 10:33:55 PM »

Using the "Do not parse" and "Exclude URLs" options to avoid crawling "noise content" pages helps a lot in most cases for large sites.
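For anyone wondering what such settings look like: both fields normally take one pattern per line, and any URL containing that substring is matched. The patterns below are purely illustrative, modeled on the cgi-bin/search.cgi URLs shown in the progress output earlier in this thread; your own "noise" URLs will differ:

```
Exclude URLs (matching pages are skipped entirely and left out of the sitemap):
  searchtype=city

Do not parse (matching pages are added to the sitemap, but are not fetched
and scanned for further links):
  cgi-bin/search.cgi
```

The second option is the big performance win on search-result-style sites: the URLs still appear in the sitemap, but the crawler no longer downloads and parses each one looking for new links.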