I think I broke it..
« on: September 03, 2010, 11:31:03 PM »
I have a very large site in multiple languages: [ External links are visible to forum administrators only ]  ::)

I did get this to work once, but I think it's choking on all the content. There are thousands of screen captures in browsable folders, and hundreds of pages that have been added since 2004. I also noticed it is not indexing most of the screen captures, and many directories are being duplicated.

I have fiddled with the memory and time delay settings for a few days with no luck. It has been crawling for about 6 hours now but has not finished; I believe it has croaked.

Another thing: on my Linux server it will not run when called from a crontab job. I have tried the basic PHP command line given on the Crawling page, and also a Perl wrapper file my webhost suggested for launching the cron job; that doesn't work either. I have verified with my server host that the crontab schedule is being executed, but the program is either not running or croaked during the crawl.
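
For reference, the kind of crontab entry I've been trying looks roughly like this (the script name and paths here are only placeholders; the real command is whatever the generator's Crawling page shows for my install):

# run the generator every night at 3:00 AM and keep the output for debugging
0 3 * * * /usr/bin/php /home/USERNAME/public_html/generator/runcrawl.php >> /home/USERNAME/generator-cron.log 2>&1

At least with the output redirected to a log file I can see whether the script starts at all and what it prints before it dies.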

The program works great on another site with fewer than 100 URLs, but the crontab job will not run for that site either. It seems to be a decent program if I can get it to handle the large site and get it called by crontab each night.

Doc
Re: I think I broke it..
« Reply #1 on: September 03, 2010, 11:45:37 PM »
Oh, another thing: I would like to show all those screen captures by image name. I have tried enabling the "add images info" option and removing the exclusion of .jpg, .gif, and .png, and it still will not add them to the sitemap.

The html sitemap says the total number of pages is 1965, but only 739 pages are in the sitemap.

It's over here if anyone wants to take a look: ebaymotorssucks.com  sitemap.html
Re: I think I broke it..
« Reply #2 on: September 04, 2010, 08:33:04 AM »
Hello,

1. I've updated your crawler settings, please check now. (With a 20-second delay after each request it will run terribly long; that delay is not needed.)

2. Please try running that cron task command line yourself after connecting to your server via SSH, to see if there is an error (example below).

3. Image info is only included in sitemap.xml, not in the HTML sitemap (an example entry is shown below). Note that including images noticeably increases the memory usage of the generator script and increases crawling time.
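
For point 2, something like this (the path is a placeholder; use the exact command from your cron setup) will show any PHP errors directly in the terminal instead of them being lost by cron:

ssh USERNAME@yourserver.com
/usr/bin/php /home/USERNAME/public_html/generator/runcrawl.php

For point 3, an image picked up by the crawler ends up in sitemap.xml as an entry roughly like this (the URLs are only an illustration; the image namespace is declared on the urlset tag as xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"):

<url>
  <loc>http://www.example.com/ebayhacks/equipment/dirinfo.htm</loc>
  <image:image>
    <image:loc>http://www.example.com/ebayhacks/equipment/capture1.jpg</image:loc>
  </image:image>
</url>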
Re: I think I broke it..
« Reply #3 on: September 04, 2010, 12:45:53 PM »
It's messed up bad today.

The html sitemap has weird entries and broken links.

1   6     2010-09-04T10:00:00+00:00     6     
2   b    2010-09-03T22:00:00+00:00    b    b
3   3    2010-09-04T07:00:00+00:00    3    3
4   f    2010-09-03T18:00:00+00:00    f    f
5   1    2010-09-04T05:00:00+00:00    1    1
6   a    2010-09-03T23:00:00+00:00    a    a

I have not tried running it yet, but I need to; Google will barf all over this list and I don't want to lose my ranking.
Re: I think I broke it..
« Reply #4 on: September 04, 2010, 02:11:10 PM »
Seems to have locked up while scanning.

Already in progress. Current process state is displayed:
Links depth: 4
Current page: ?tag=free-vehicle-history-report&gtlang=tr
Pages added to sitemap: 10325
Pages scanned: 21060 (465,633.7 KB)
Pages left: 4754 (+ 114 queued for the next depth level)
Time passed: 0:59:06
Time left: 0:13:20
Memory usage: -

Sat Sep 04 2010 09:08:29 GMT-0400 (Eastern Daylight Time): resuming generator (121 seconds with no response)
Sat Sep 04 2010 09:06:28 GMT-0400 (Eastern Daylight Time): resuming generator (120 seconds with no response)
Sat Sep 04 2010 09:04:28 GMT-0400 (Eastern Daylight Time): resuming generator (121 seconds with no response)
Sat Sep 04 2010 09:02:27 GMT-0400 (Eastern Daylight Time): resuming generator (121 seconds with no response)
Re: I think I broke it..
« Reply #6 on: September 04, 2010, 08:14:09 PM »
It's working; I guess the html sitemap has to be created in the Data sub-dir. OK, so now it crawls the site in about 10 minutes and creates the sitemaps.

Now, I created a new directory called screen-captures with two sub-dirs hanging off of it containing .jpg and .png images. Neither the directory, the sub-dirs, nor the contained images are being added.

With the other screen capture folders under "ebayhacks", some are not added, and none of the images in that directory tree are being indexed or added to the sitemap.

Also, the directories that are being added are duplicated 8 times, and none of the images in those directories are added to the sitemap.

Equipment is 1 sub-directory with 7 images in it.

equipment 9 pages
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment
Index of /ebayhacks/equipment

And it's similar within the other sub-dirs, with no images being added.

Re: I think I broke it..
« Reply #7 on: September 05, 2010, 09:09:54 AM »
Hello,

only images embedded on HTML pages are included in the XML sitemap. Also, there must be a way to reach those pages starting from the homepage so that the generator bot can find them.
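
For example, a small page like this (the file and image names are only placeholders), linked from a page the crawler can already reach, would get both the page and its images picked up:

<html>
<head><title>Equipment screen captures</title></head>
<body>
<!-- images must be embedded with img tags on a reachable page; bare files in a folder are not crawled -->
<h1>Equipment screen captures</h1>
<img src="/ebayhacks/equipment/capture1.jpg" alt="capture 1">
<img src="/ebayhacks/equipment/capture2.png" alt="capture 2">
</body>
</html>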
Re: I think I broke it..
« Reply #8 on: September 06, 2010, 01:31:06 AM »
Hi Oleg..  Thanks for the response.

I put an HTML file (dirinfo.htm) in each directory and sub-directory, which allows the bot to index the area. But how do I stop the generator from duplicating the directories? Below is only one sub-directory, but it's being duplicated 9 times for some reason.

ebm-discussions 10 pages
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Index of /screen-captures/ebm-discussions
Screen Captures Of eBay and eBay Motors Pages
This is a browse able directory of various eBay and eBay Motors screen captures of possible interest archived for preservation
Re: I think I broke it..
« Reply #9 on: September 06, 2010, 09:40:10 AM »
Those are default Apache index pages with different sorting options. What are the URLs of all those pages?
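
If they turn out to be the standard Apache directory listings, the duplicates usually come from the column-sort links in the listing header, which look roughly like this:

/screen-captures/ebm-discussions/?C=N;O=D
/screen-captures/ebm-discussions/?C=M;O=A
/screen-captures/ebm-discussions/?C=S;O=A
/screen-captures/ebm-discussions/?C=D;O=A

Four sortable columns times two sort orders plus the plain URL would explain seeing 9 copies of each directory. In that case, adding ?C= to the "Exclude URLs" setting should collapse each listing back to a single entry (assuming that setting matches URLs by substring).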