Crawl Issue-Can't get Sitemap Generator to finish
« on: March 22, 2008, 01:14:20 PM »

Hi,
I've been running the standalone sitemap generator for over 6 hours.
This is the current picture:
Links depth: 9
Current page: index.php?main_page=shopping_cart&manufacturers_id=3&sort=20a&products_id=40&action=notify&zenid=a6b7e9f56537ab7ed59f4fa314a8cc69
Pages added to sitemap: 3527
Pages scanned: 16320 (268,577.3 Kb)
Pages left: 6111 (+ 2880 queued for the next depth level)
Time passed: 377:03
Time left: 141:11
Memory usage: 14,574.4 Kb
 
My site is not very large; I only have 110 items listed in my online store at this time.
The website URL is: elliesox.com

Am I doing anything wrong? Could it be the config, or the files/folders it's crawling?
Thanks
 
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #1 on: March 22, 2008, 04:10:48 PM »
Hello,

It looks like your shopping software generates a lot of "noise" content pages that normally should not be indexed (sorting pages, etc.). Please let me know your generator URL and I will check the crawler exclusion list to resolve the issue.
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #2 on: March 22, 2008, 04:32:18 PM »
Hi,

The URL is: [ External links are visible to forum administrators only ]

But now when I go to the URL, the site loads with a 404 error and I'm not getting the Sitemap menu.
Should I delete the generator files, re-upload them, and then email you to take a look at the configs?

Thanks
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #3 on: March 23, 2008, 02:16:09 PM »
Hi,
I ended up deleting and re-uploading the generator folder to my server, and now I can access the sitemap generator again.
I have not run it again; I'll wait until you can look at the config.
I did save the error_log and crawl_dump.log files and can email them to you if you'd like to see them.
The crawl dump log is very large: 9.49 MB.

Let me know.

Thanks
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #4 on: March 23, 2008, 04:31:29 PM »
I am having the same problem here with a shopping cart, trying to exclude URLs ending like these:
/Compressors/?page=1&sort=2a
/Compressors/?page=1&sort=3a
/product_reviews.php/products_id/6437?osCsid=caab1a59ead8d2536ca6c11f0f5d3a41
/shop/product_reviews.php/cPath/295_559_563/products_id/2747
/shop/ACCESSORIES/WARN+PRODUCTS/Warn+Winch+Accessories/?page=1&sort=4d

I have tried the excludes, but they do not seem to work.
You will notice it is even adding the session ID, even though dropping it is a default option.
My sitemap this morning was over 7 MB, and the run took all night.

How can we stop all this???

As I understand it, the exclusions and extension exclusions all go on one line with a space between them, right? (No carriage returns?)
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #5 on: March 23, 2008, 07:58:38 PM »
Recommended settings for X-Cart websites:
Do Not parse URLs option:
Code: [Select]
js=
sort=
action=
write_review
product_reviews
reviews_write
printable=
language=
manufacturers_id=
bestseller=
sort/
action/
js/
printable/
language/
redirect.php
price_match.php

Exclude URLs option:
Code: [Select]
redirect.php
js=
sort=
action=
write_review
reviews_write
printable=
manufacturers_id=
bestseller=
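For what it's worth, both options above take one pattern per line and behave like plain substring filters: a URL matches if it contains any listed fragment anywhere. A minimal sketch of that behaviour (in Python rather than the generator's own PHP, and the substring-matching detail is my assumption, not taken from the generator's source):

```python
# Illustrative only: mimics the "Exclude URLs" option above as simple
# substring matching (an assumption; the real generator is PHP).
EXCLUDE = [
    "redirect.php", "js=", "sort=", "action=", "write_review",
    "reviews_write", "printable=", "manufacturers_id=", "bestseller=",
]

def is_excluded(url: str) -> bool:
    """Skip a URL if it contains any excluded fragment."""
    return any(pattern in url for pattern in EXCLUDE)

print(is_excluded("index.php?main_page=shopping_cart&sort=20a"))      # True
print(is_excluded("index.php?main_page=product_info&products_id=40")) # False
```

This is why a pattern like `sort=` catches every sorted variant of a category page regardless of the rest of the query string.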
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #6 on: March 23, 2008, 08:38:53 PM »
I run osCommerce and my cart is in the /shop directory. Should any of the above be preceded with that path, i.e. shop/?
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #7 on: March 24, 2008, 02:22:05 AM »
Hi,
The new config has corrected the problem.
Now, with the completed sitemap generated, I have run into another problem: I have 5 broken links, but cannot find the pages they are referred from.
All 5 "referred from" URLs are similar; an example of one is:
index.php?main_page=tell_a_friend&products_id=21&zenid=d6cf8ed5e8c06a327c9c4a5410f5f971

Any suggestions on how to correct this?

Thanks
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #8 on: March 24, 2008, 02:48:48 AM »
I submitted it to Google to see what it would come up with, and Google came up with 10 "Paths don't match" errors.

Let me know if I should PM you the URL(s)

Thanks,
Robert
 :(
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #9 on: March 24, 2008, 03:00:08 PM »
The X-Cart suggestions you included, along with a couple more, have solved ALL my problems with the osCommerce cart. It was a pleasure to see everything run from cron last night with no errors and NO MORE TRASH output.

Many thanks for your help. May I suggest making your cart suggestions part of a sticky FAQ for "Shopping Cart Operators", or including them in your doc files?

Thanks,

TwoEyesOfBlue
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #10 on: March 25, 2008, 01:04:05 AM »
Replied to your PM, Robert.
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #11 on: March 25, 2008, 01:04:41 AM »
I'm glad that worked for you. Thank you for the suggestion; that is a good idea.

Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #12 on: March 25, 2008, 03:53:31 PM »
Well, I spoke a little too soon, I guess, as I thought that since I received no errors from cron, it had run OK. Later yesterday I realized I had pulled a good one: I set up the cron job from cPanel but did not activate it (a good reason I got no errors, I guess). However, all sitemaps and Google Base files were created this morning just fine and completed with no errors. The HTML sitemap did not update, though, and cron sent hundreds of these repeating lines:

Warning: fwrite(): supplied argument is not a valid stream resource in /home/twoeyesofblue/public_html/generator/pages/class.xml-creator.inc.php(2) : eval()'d code on line 169

Warning: fwrite(): supplied argument is not a valid stream resource in /home/twoeyesofblue/public_html/generator/pages/class.xml-creator.inc.php(2) : eval()'d code on line 171



I have checked all files inside the /generator/pages directory, and all have the same 0644 file permissions.
All sitemap files are 0666, and all updated fine except sitemap.html.

Any ideas?

I ran it manually yesterday and saw no errors. I started with an empty sitemap.html and it wrote to it fine. The problem appears to be in updating it.
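As an aside, for anyone else scheduling this: a cPanel cron job is equivalent to a crontab entry along these lines. The install path, the PHP binary location, and the entry script name (`runcrawl.php`) are assumptions for illustration; check the cron instructions that came with your own generator install.

```shell
# Illustrative crontab line (not from this thread): run the generator's
# crawl script daily at 03:00. Path and script name are assumptions.
0 3 * * * /usr/bin/php /home/USER/public_html/generator/runcrawl.php
```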

TwoEyesOfBlue
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #13 on: March 26, 2008, 02:52:02 AM »
Hello,

If you have any files inside the generator/data/ folder, set their permissions to 0666.
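For anyone with shell access, this demo shows the effect of that fix in a throwaway directory (the path is illustrative; on a real install it would be something like public_html/generator/data/). The point of 0666 is that both your manual login runs and cron runs executing as the web server user (e.g. "nobody") can write the same files:

```shell
# Demo in a scratch directory; substitute your real generator/data/ path.
mkdir -p /tmp/generator-demo/data
touch /tmp/generator-demo/data/crawl_dump.log
chmod 644 /tmp/generator-demo/data/crawl_dump.log   # writable by owner only

# Make every data file writable by any user, so a cron job running as
# a different user ("nobody"/Apache) can still update them:
find /tmp/generator-demo/data -type f -exec chmod 666 {} \;

stat -c '%a' /tmp/generator-demo/data/crawl_dump.log
```

Note that chmod alone does not help with files a different user must *create*; for that, the containing directory also needs to be writable by that user.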
Re: Crawl Issue-Can't get Sitemap Generator to finish
« Reply #14 on: March 26, 2008, 12:11:32 PM »
Well, I can't change them all. The ones that were generated by a manual run from a login are owned by me, the user. All those generated by cron are owned by nobody (Apache) and are at 0644.

Here is something else. I have sitemap.html thru sitemap25.html defined and properly chmodded in the root. However I see sitemap.html thru sitemap26.html in /generator/pages directory. The cron run did not write ANYthing in any of them out in the /webroot directory. I went ahead and made a sitemap26.html for the webroot and chmodded it correctly. I am going to do a manual run and see what happens. I am starting to suspect the run from cron is failing while the other day on the 1st. install that worked was a manual run.  What I see too at looking at those in the /generator/pages directory are full of the trash fromthe shopping cart that we are now blocking from your suggestions. Let you know what happens on a manual run.