Only spiders a small amount of my pages
« on: August 03, 2005, 11:26:10 PM »
Hi,

My site has around 100,000 pages.

The php generator only spiders 7631 pages.

The large majority of those pages on my website are part of my forum (vbulletin).

Is this a known issue?

I have tried every permutation of settings on the config screen, including re-uploading all files. I have also cleared my data folder before retrying.

I have run this through the web and also through shell (which I would prefer) and both end up with the same figure.

The script isn't timing out, because I get the apparent success message through the shell saying that it has finished.

The limit I have on the generator is 50010. I chose this figure to see if it would span sitemapX.xml.gz files properly.
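For reference, the Sitemap protocol caps each file at 50,000 URLs, so a limit of 50,010 should produce two files. A minimal Python sketch of that split follows. The generator itself is PHP, so `split_into_sitemaps` and the file naming are assumptions for illustration only, not the generator's actual code:

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL cap in the Sitemap protocol


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Return (filename, chunk) pairs, each chunk holding at most `limit` URLs."""
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    # Hypothetical naming scheme: sitemap1.xml.gz, sitemap2.xml.gz, ...
    return [(f"sitemap{n}.xml.gz", chunk) for n, chunk in enumerate(chunks, start=1)]


if __name__ == "__main__":
    # Placeholder URLs; 50,010 entries should spill over into a second file.
    urls = [f"http://www.mydomain.com/page{i}" for i in range(50_010)]
    for name, chunk in split_into_sitemaps(urls):
        print(name, len(chunk))
```

If the crawl itself stops at a few thousand URLs, only the first file is ever written, so the split cannot be verified until the crawl problem is solved.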

I am using 1.08 btw

Thanks
« Last Edit: August 03, 2005, 11:37:47 PM by support1 »
Re: Only spiders a small amount of my pages
« Reply #1 on: August 03, 2005, 11:50:34 PM »
Hi,

Is it possible that you have a robots.txt file with an exclusion list? (The generator script honors it in the same way search engine bots do.)
Otherwise, please send me your generator URL by PM.
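One way to rule out robots.txt exclusions independently of the generator is to test a few known URLs against the rules with Python's standard `urllib.robotparser`. This is only a sketch: the sample rules and URLs below are made up, and `blocked_urls` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    # Made-up rules and URLs, for illustration only.
    sample_robots = "User-agent: *\nDisallow: /forum/private/\n"
    sample_urls = [
        "http://www.mydomain.com/forum/showthread.php?t=1",
        "http://www.mydomain.com/forum/private/admin.php",
    ]
    print(blocked_urls(sample_robots, sample_urls))
```

An empty result for URLs that are missing from the sitemap would suggest that robots.txt is not the cause.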
Re: Only spiders a small amount of my pages
« Reply #2 on: August 04, 2005, 12:23:56 PM »
Hi,

I don't use a robots.txt; instead I use Apache to block bad spiders and to provide authentication where needed. I don't really block directories, only bad spiders.

I do have several 301 redirects, though, as I have updated a lot of URLs recently. That said, there is still an easily navigable path to all my webpages for human and spider alike ;)

Will PM you in a sec
« Last Edit: August 04, 2005, 12:39:05 PM by support1 »
Re: Only spiders a small amount of my pages
« Reply #3 on: August 04, 2005, 10:59:41 PM »
Also, when I run the php script from a shell I get this in my httpd error_log:

[Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/

I get hundreds of those every time I run the script.
Re: Only spiders a small amount of my pages
« Reply #4 on: August 05, 2005, 07:34:35 PM »
Anyone?
Re: Only spiders a small amount of my pages
« Reply #5 on: August 06, 2005, 01:04:54 AM »
Hi,

Hmm, are those error_log entries produced by the generator's own requests? I.e., is the client IP address in the log the same as the host IP?
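A quick way to answer that is to count the failing requests per client IP in the error log. The sketch below assumes the log lines look like the one quoted in this thread; the regular expression, the function name, and the sample IP addresses (taken from the documentation range 192.0.2.0/24) are illustrative assumptions.

```python
import re
from collections import Counter

# Matches lines shaped like the error_log entry quoted in this thread.
CLIENT_RE = re.compile(
    r"\[client ([^\]]+)\] request failed: error reading the headers"
)


def count_failed_header_reads(log_lines):
    """Count 'error reading the headers' entries per client IP."""
    counts = Counter()
    for line in log_lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder log lines; in practice read them from the httpd error_log.
    sample = [
        "[Thu Aug 04 12:28:44 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
        "[Thu Aug 04 12:28:45 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
        "[Thu Aug 04 12:28:46 2005] [error] [client 192.0.2.99] File does not exist: /x",
    ]
    for ip, n in count_failed_header_reads(sample).most_common():
        print(ip, n)
```

If nearly all entries come from the server's own address, the failures are most likely triggered by the generator's requests.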
Re: Only spiders a small amount of my pages
« Reply #6 on: August 06, 2005, 01:08:44 AM »
Hi,

Sorry for the delay! I tried to open the generator instance you PM'ed me, but it reports that the target file is not writable. That is why the script cannot save the sitemap.
If that was not the issue before, please PM me an example URL that is missing from the sitemap but reachable by links from the starting URL. Thanks!
Re: Only spiders a small amount of my pages
« Reply #7 on: August 06, 2005, 06:08:58 PM »
Yes, the client IP address in those error_log entries is the same as the host IP. This is because the PHP script is being called from the same server that the generator is crawling.
« Last Edit: August 06, 2005, 06:11:54 PM by support1 »
Re: Only spiders a small amount of my pages
« Reply #8 on: August 06, 2005, 06:11:14 PM »
Hi,

The unwritable target file is an unrelated issue that I have now rectified. I made some changes to the server that caused it, but only after I had filed this bug report.

The original problem still exists.
Re: Only spiders a small amount of my pages
« Reply #9 on: August 06, 2005, 10:14:54 PM »
If the unwritable file was not the issue before, please PM me an example URL that is missing from the sitemap but reachable by links from the starting URL. Thanks!
Please also include your server details (Apache version, PHP type/version). Thanks.
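To find such an example URL, one option is to compare a list of pages known to exist with the `<loc>` entries of the generated sitemap. This is a rough sketch with made-up data; `sitemap_urls` and `missing_from_sitemap` are hypothetical helper names, and the parser deliberately ignores the XML namespace so that it does not depend on a particular sitemap schema version.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return the set of <loc> values in a sitemap, whatever its namespace."""
    root = ET.fromstring(xml_text)
    return {
        el.text.strip()
        for el in root.iter()
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    }


def missing_from_sitemap(expected_urls, xml_text):
    """Return the expected URLs that do not appear in the sitemap, sorted."""
    return sorted(set(expected_urls) - sitemap_urls(xml_text))


if __name__ == "__main__":
    # Made-up sitemap and URL list, for illustration only.
    sample_sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.mydomain.com/</loc></url>"
        "<url><loc>http://www.mydomain.com/forum/</loc></url>"
        "</urlset>"
    )
    known_pages = [
        "http://www.mydomain.com/",
        "http://www.mydomain.com/forum/",
        "http://www.mydomain.com/forum/showthread.php?t=42",
    ]
    print(missing_from_sitemap(known_pages, sample_sitemap))
```

Any URL this reports that can also be reached by clicking through from the start page is a good candidate to send along.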
Re: Only spiders a small amount of my pages
« Reply #10 on: August 06, 2005, 11:03:47 PM »
Apache 2 latest version
PHP 4.4.0
RHEL3
Re: Only spiders a small amount of my pages
« Reply #11 on: August 07, 2005, 09:52:18 PM »
Have you been able to replicate this?
Re: Only spiders a small amount of my pages
« Reply #12 on: August 08, 2005, 07:58:09 AM »
Hi,

Yes, I see the problem when I crawl using your generator instance. However, when I run the crawler against your site from my own host, it works correctly :-\
If you could give me FTP access to your generator folder, I could do some debugging remotely and would probably find the problem.
Re: Only spiders a small amount of my pages
« Reply #13 on: August 09, 2005, 08:08:12 PM »
Hi,

The new version didn't fix this. It's weird - if I specify to spider just the forums the spider can crawl them fine. If I start from my domain root it can only spider 9500 pages.

Furthermore, when started from the main page it does reach the forums, but again it doesn't spider all of them.

I'm afraid I cannot give you FTP access. However the problem really seems to be caused by the crawler not being able to follow some anchors or mod_rewrite'd urls.
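To illustrate what "following anchors" involves, here is a small Python sketch of link extraction that resolves relative hrefs against the page URL and drops `#fragment` parts, so the same page is not counted twice. It is not the generator's code (the generator is PHP); `LinkCollector`, `extract_links`, and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute, fragment-free link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links, then strip any "#anchor" part.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        self.links.add(absolute)


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)


if __name__ == "__main__":
    # Made-up page with a rewritten-style URL, an absolute path and a bare anchor.
    page = (
        '<a href="thread-123.html#post5">last post</a>'
        '<a href="/forum/thread-123.html">thread</a>'
        '<a href="#top">top</a>'
    )
    for link in extract_links("http://www.mydomain.com/forum/index.php", page):
        print(link)
```

Running the same kind of extraction on a page whose links are not being followed can show whether the hrefs resolve to the URLs one expects.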

Until this bug is fixed I cannot use the generator.

This is getting frustrating :-/
« Last Edit: August 09, 2005, 09:14:39 PM by support1 »
Re: Only spiders a small amount of my pages
« Reply #14 on: August 09, 2005, 11:58:49 PM »
Hi,

As I mentioned, I was able to crawl your website successfully from my host (I stopped the process at 15,000 URLs and it was still counting), so this is probably a server configuration issue specific to your host. It is hard to pin down exactly without being able to test on your server. The only thing I noticed is that your site is running eAccelerator: if you can try disabling it and crawling the site again, that would at least rule it out as the cause.