XML Sitemaps Generator

Topic: Only spiders a small amount of my pages (Read 69753 times)

support1

  • Registered Customer
  • Jr. Member
  • *
  • Posts: 15
Only spiders a small amount of my pages
« on: August 03, 2005, 10:26:10 PM »
Hi,

My site has around 100,000 pages.

The PHP generator only spiders 7631 pages.

The large majority of the pages on my website are part of my forum (vBulletin).

Is this a known issue?

I have tried every permutation of settings on the config screen, including re-uploading all files. I have also cleared my data folder and retried.

I have run this through the web and also through the shell (which I would prefer), and both end up with the same figure.

The script isn't timing out, because through the shell I get the confirmation message of apparent success saying that it has finished.

The limit I have set on the generator is 50010. I chose this figure to see if it would span sitemapX.xml.gz files properly.

I am using version 1.08, btw.

Thanks
« Last Edit: August 03, 2005, 10:37:47 PM by support1 »
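
To put the 50010 limit in context: the sitemap protocol allows at most 50,000 URLs per sitemap file, so a second sitemapX.xml.gz file is only needed once the crawl finds more than 50,000 URLs. The following is a minimal Python sketch of that arithmetic, not the generator's own code; the function names are made up for illustration.

```python
import math

# Per-file URL limit defined by the sitemap protocol.
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_file_count(url_count, per_file=MAX_URLS_PER_SITEMAP):
    """Return how many sitemap files are needed to hold url_count URLs."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    return math.ceil(url_count / per_file)


def split_urls(urls, per_file=MAX_URLS_PER_SITEMAP):
    """Yield successive chunks of at most per_file URLs each."""
    for start in range(0, len(urls), per_file):
        yield urls[start:start + per_file]


if __name__ == "__main__":
    # 7631 is the crawl result reported above, 50010 the configured limit,
    # 100000 the approximate size of the site.
    for count in (7631, 50010, 100000):
        print(count, "URLs ->", sitemap_file_count(count), "file(s)")
```

With only 7631 URLs found, everything fits in a single file, so the spanning behaviour cannot be observed until the crawl itself gets past 50,000 URLs.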

XML-Sitemaps Support

  • Administrator
  • Hero Member
  • *****
  • Posts: 10624
Re: Only spiders a small amount of my pages
« Reply #1 on: August 03, 2005, 10:50:34 PM »
Hi,

Is it possible that you have a robots.txt file with an exclusion list? (The generator script honors it in the same way as search engine bots do.)
Otherwise, please let me know your generator URL by PM.
Oleg Ignatiuk
www.xml-sitemaps.com
Send me a Private Message

For maximum exposure and traffic for your web site check out our additional SEO Services.
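
One way to rule out robots.txt as the cause is to test a few known URLs against the file's rules. The sketch below uses Python's standard `urllib.robotparser`; it is not the generator's code, and the user agent name, rules, and URLs are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="example-crawler"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed when the text is already at hand.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical rules and URLs, for illustration only.
    rules = "User-agent: *\nDisallow: /forum/\n"
    candidates = [
        "http://www.example.com/index.php",
        "http://www.example.com/forum/showthread.php?t=1",
    ]
    for url in blocked_urls(rules, candidates):
        print("blocked:", url)
```

If no URL comes back as blocked, an exclusion list is unlikely to explain the missing pages.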

support1

Re: Only spiders a small amount of my pages
« Reply #2 on: August 04, 2005, 11:23:56 AM »
Hi,

I don't use a robots.txt; instead I use Apache to block bad spiders and to provide authentication where needed. Furthermore, I don't really block directories, only bad spiders.

I do have several 301 redirects though, as I have updated a lot of URLs recently. That said, there is still an easily navigable path for all my webpages to be accessed by human and spider alike ;)

Will PM you in a sec
« Last Edit: August 04, 2005, 11:39:05 AM by support1 »

support1

Re: Only spiders a small amount of my pages
« Reply #3 on: August 04, 2005, 09:59:41 PM »
Also, when I run the php script from a shell I get this in my httpd error_log:

[Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/

I get hundreds of those every time I run the script.
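
To see which client addresses these entries come from, the error_log can be tallied per client. This is a rough Python sketch that assumes every line follows the layout of the entry quoted above; the sample addresses are placeholders from the documentation ranges, standing in for the masked IP.

```python
import re
from collections import Counter

# Layout of an error_log line like the one quoted above:
# [timestamp] [level] [client address] message
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>[^\]]+)\] "
    r"\[client (?P<client>[^\]]+)\] (?P<message>.*)$"
)


def header_errors_by_client(lines):
    """Count 'error reading the headers' entries per client address."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "error reading the headers" in match.group("message"):
            counts[match.group("client")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; 192.0.2.10 stands in for the masked address.
    sample = [
        "[Thu Aug 04 12:28:44 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
        "[Thu Aug 04 12:28:45 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
        "[Thu Aug 04 12:29:01 2005] [error] [client 198.51.100.7] "
        "File does not exist: /var/www/favicon.ico",
    ]
    for client, count in header_errors_by_client(sample).most_common():
        print(client, count)
```

If nearly all of the entries share the server's own address, the failing requests are most likely the generator's.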

support1

Re: Only spiders a small amount of my pages
« Reply #4 on: August 05, 2005, 06:34:35 PM »
Anyone?

XML-Sitemaps Support

Re: Only spiders a small amount of my pages
« Reply #5 on: August 06, 2005, 12:04:54 AM »
Quote from: support1
> Also, when I run the php script from a shell I get this in my httpd error_log:
>
> [Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/
>
> I get hundreds of those every time I run the script.

Hi,

Hmm, is it produced by the generator's requests? I.e., is the client IP address the same as the host IP?
Oleg Ignatiuk
www.xml-sitemaps.com

XML-Sitemaps Support

Re: Only spiders a small amount of my pages
« Reply #6 on: August 06, 2005, 12:08:44 AM »
Quote from: support1
> Hi,
>
> I don't use a robots.txt; instead I use Apache to block bad spiders and to provide authentication where needed. Furthermore, I don't really block directories, only bad spiders.
>
> I do have several 301 redirects though, as I have updated a lot of URLs recently. That said, there is still an easily navigable path for all my webpages to be accessed by human and spider alike ;)
>
> Will PM you in a sec

Hi,

Sorry for the delay! I tried to open the generator instance you PM'ed me, but it says that the target file is not writable. That's why the script cannot save the sitemap.
If that was not an issue before, then please PM me an example URL that is not stored in the sitemap and has a link path from the starting URL. Thanks!
Oleg Ignatiuk
www.xml-sitemaps.com
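
Finding an example URL that is not stored in the sitemap can be done by reading the generated file and comparing it with a list of pages that should be there. A minimal Python sketch follows, assuming the sitemap is a standard XML file (optionally gzipped) with one `<loc>` element per URL; the file name and URLs in the demo are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemap_urls(data):
    """Return the set of <loc> URLs found in raw sitemap bytes (plain or gzipped)."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    # Compare on the local tag name so that any sitemap namespace version works.
    return {
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    }


def missing_from_sitemap(expected_urls, data):
    """Return the expected URLs that do not appear in the sitemap, in order."""
    present = sitemap_urls(data)
    return [url for url in expected_urls if url not in present]


if __name__ == "__main__":
    # Hypothetical file name and URLs, for illustration only.
    with open("sitemap.xml.gz", "rb") as handle:
        raw = handle.read()
    wanted = ["http://www.example.com/forum/showthread.php?t=1"]
    for url in missing_from_sitemap(wanted, raw):
        print("not in sitemap:", url)
```

Any URL it reports is a candidate to send, provided it can also be reached by following links from the starting URL.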

support1

Re: Only spiders a small amount of my pages
« Reply #7 on: August 06, 2005, 05:08:58 PM »
Quote from: XML-Sitemaps Support
> Quote from: support1
> > Also, when I run the php script from a shell I get this in my httpd error_log:
> >
> > [Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/
> >
> > I get hundreds of those every time I run the script.
>
> Hi,
>
> Hmm, is it produced by the generator's requests? I.e., is the client IP address the same as the host IP?

Yes, it is. This is because the PHP script is being called from the same server that the generator is crawling.
« Last Edit: August 06, 2005, 05:11:54 PM by support1 »

support1

Re: Only spiders a small amount of my pages
« Reply #8 on: August 06, 2005, 05:11:14 PM »
Quote from: XML-Sitemaps Support
> Quote from: support1
> > Hi,
> >
> > I don't use a robots.txt; instead I use Apache to block bad spiders and to provide authentication where needed. Furthermore, I don't really block directories, only bad spiders.
> >
> > I do have several 301 redirects though, as I have updated a lot of URLs recently. That said, there is still an easily navigable path for all my webpages to be accessed by human and spider alike ;)
> >
> > Will PM you in a sec
>
> Hi,
>
> Sorry for the delay! I tried to open the generator instance you PM'ed me, but it says that the target file is not writable. That's why the script cannot save the sitemap.
> If that was not an issue before, then please PM me an example URL that is not stored in the sitemap and has a link path from the starting URL. Thanks!

Hi,

This is an unrelated issue that I have now rectified. I made some changes to the server that caused it, but that happened after I had filed this bug report.

The problem still exists.

XML-Sitemaps Support

Re: Only spiders a small amount of my pages
« Reply #9 on: August 06, 2005, 09:14:54 PM »
Quote from: XML-Sitemaps Support
> If that was not an issue before, then please PM me an example URL that is not stored in the sitemap and has a link path from the starting URL. Thanks!

Please also include your server details (Apache version, PHP type/version). Thanks.
Oleg Ignatiuk
www.xml-sitemaps.com

support1

Re: Only spiders a small amount of my pages
« Reply #10 on: August 06, 2005, 10:03:47 PM »
Apache 2 latest version
PHP 4.4.0
RHEL3

support1

Re: Only spiders a small amount of my pages
« Reply #11 on: August 07, 2005, 08:52:18 PM »
Have you been able to replicate this?

XML-Sitemaps Support

Re: Only spiders a small amount of my pages
« Reply #12 on: August 08, 2005, 06:58:09 AM »
Hi,

Yes, I see the problem when trying to crawl using your generator instance. However, when I run the crawler for your site from my host, it runs correctly  :-\
If you could give me FTP access to your generator folder, I could do some debugging remotely and would probably find the problem.
Oleg Ignatiuk
www.xml-sitemaps.com

support1

Re: Only spiders a small amount of my pages
« Reply #13 on: August 09, 2005, 07:08:12 PM »
Hi,

The new version didn't fix this. It's weird: if I tell it to spider just the forums, the spider can crawl them fine. If I start from my domain root, it can only spider 9500 pages.

Furthermore, starting from the main page, it does reach the forums but again doesn't spider all of them.

I'm afraid I cannot give you FTP access. However, the problem really seems to be caused by the crawler not being able to follow some anchors or mod_rewrite'd URLs.

Until this bug is fixed I cannot use the generator.

This is getting frustrating :-/
« Last Edit: August 09, 2005, 08:14:39 PM by support1 »
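
For reference, this is roughly how a crawler is expected to turn the anchors on a page into crawlable URLs: resolve relative links against the page URL, drop `#fragment` parts, and keep only same-host HTTP links. The Python sketch below illustrates that general technique and is not the generator's actual link handling; the class and function names are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the raw href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawlable_links(page_url, html):
    """Return unique same-host http(s) URLs linked from html, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links, then strip the fragment so that
        # page.php#post5 and page.php count as the same URL.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


if __name__ == "__main__":
    # Hypothetical page and links, for illustration only.
    page = "http://www.example.com/forum/index.php"
    body = '<a href="showthread.php?t=1#post5">post</a> <a href="/forum/thread-1.html">thread</a>'
    for link in crawlable_links(page, body):
        print(link)
```

Comparing the output for a page whose links go missing against the links a browser shows would help narrow down whether relative, anchored, or rewritten URLs are the ones being dropped.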

XML-Sitemaps Support

Re: Only spiders a small amount of my pages
« Reply #14 on: August 09, 2005, 10:58:49 PM »
Hi,

As I mentioned, I have successfully crawled your website from my host (I stopped the process at 15,000 URLs and it was still counting), so this is probably a server configuration issue specific to your host. It is hard to pin down exactly without being able to test on your host. The only thing I noticed is that your site is running eAccelerator: if you can try disabling it and crawling the site again, that will at least rule it out as the cause.
Oleg Ignatiuk
www.xml-sitemaps.com

 
