Sitemap Generator Forum
Topic: Only spiders a small amount of my pages (Read 32854 times)
support1 (Registered Customer, Jr. Member, Posts: 15)
Only spiders a small amount of my pages
« on: August 03, 2005, 11:26:10 PM »
Hi,
My site has around 100,000 pages.
The php generator only spiders 7631 pages.
The large majority of those pages on my website are part of my forum (vbulletin).
Is this a known issue?
I have tried every permutation of settings on the config screen, including re-uploading all files. I have also cleared my data folder and retried.
I have run this through the web interface and also from the shell (which I would prefer), and both end up with the same figure.
The script isn't timing out, because the shell run ends with the apparent success message saying that it has finished.
The limit I have set on the generator is 50010. I chose this figure to see if it would span sitemapX.xml.gz files properly.
I am using 1.08, btw.
Thanks
« Last Edit: August 03, 2005, 11:37:47 PM by support1 »
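Editor's note on the 50010 limit mentioned above: the sitemap protocol caps a single sitemap file at 50,000 URLs, so 50010 URLs should spill into a second file. The sketch below is not the generator's own code; it only illustrates the splitting idea. The helper names, the `sitemap1.xml.gz`-style numbering, and the sitemaps.org 0.9 namespace are assumptions made for illustration.

```python
import gzip
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file URL cap in the sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def render_sitemap(urls):
    """Return the XML text for one sitemap file (namespace is an assumption)."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</urlset>\n"
    )


def write_sitemaps(urls, prefix="sitemap"):
    """Write <prefix>1.xml.gz, <prefix>2.xml.gz, ... and return the file names.

    The numbering scheme is hypothetical; the real generator may name files differently.
    """
    names = []
    for index, chunk in enumerate(chunk_urls(urls), start=1):
        name = f"{prefix}{index}.xml.gz"
        with gzip.open(name, "wt", encoding="utf-8") as handle:
            handle.write(render_sitemap(chunk))
        names.append(name)
    return names
```

With 50010 URLs, `chunk_urls` yields one chunk of 50,000 and one of 10, which is the behaviour the 50010 setting was meant to exercise.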
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #1 on: August 03, 2005, 11:50:34 PM »
Hi,
Is it possible that you have a robots.txt file with an exclusion list? (The generator script honors it in the same way search engine bots do.)
Otherwise, please send me your generator URL by PM.
Oleg Ignatiuk
http://www.xml-sitemaps.com
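Editor's note: one way to test the robots.txt theory without running a full crawl is to feed the exclusion rules to Python's standard `urllib.robotparser` and see which URLs a well-behaved crawler would skip. This is a minimal sketch, not the generator's logic; the rules, the `SitemapBot` user agent, and the URLs are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that would hide a whole forum from crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/
"""


def allowed_urls(robots_txt, urls, user_agent="SitemapBot"):
    """Return the URLs that the given robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://www.mydomain.com/index.html",
        "http://www.mydomain.com/forum/showthread.php?t=1",
    ]
    for url in allowed_urls(ROBOTS_TXT, candidates):
        print("crawlable:", url)
```

If most of a site sits under an excluded path, a crawler that respects robots.txt will report far fewer pages than the site actually has.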
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #2 on: August 04, 2005, 12:23:56 PM »
Hi,
I don't use a robots.txt; instead I use Apache to block bad spiders and to provide authentication where needed. I don't block directories, only bad spiders.
I do have several 301 redirects, though, as I have updated a lot of URLs recently. That said, there is still an easily navigable path by which humans and spiders alike can reach all my webpages.
Will PM you in a sec.
« Last Edit: August 04, 2005, 12:39:05 PM by support1 »
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #3 on: August 04, 2005, 10:59:41 PM »
Also, when I run the PHP script from a shell I get this in my httpd error_log:
[Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/
I get hundreds of those every time I run the script.
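Editor's note: to see where these errors come from, the error_log entries can be tallied by client address. The sketch below assumes log lines shaped like the one quoted above; the regular expression, the function name, and the `192.0.2.10` sample address are illustrative assumptions, not taken from the poster's server.

```python
import re
from collections import Counter

# Matches lines like:
# [Thu Aug 04 12:28:44 2005] [error] [client 192.0.2.10] request failed: ...
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>[^\]]+)\] "
    r"\[client (?P<client>[^\]]+)\] (?P<message>.*)$"
)


def count_errors_by_client(lines):
    """Count [error] entries per client address; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "error":
            counts[match.group("client")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[Thu Aug 04 12:28:44 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
        "[Thu Aug 04 12:28:45 2005] [error] [client 192.0.2.10] request failed: "
        "error reading the headers, referer: http://www.mydomain.com/",
    ]
    for client, total in count_errors_by_client(sample).most_common():
        print(client, total)
```

If nearly all of the entries share the server's own address, the failed requests are most likely the generator's.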
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #4 on: August 05, 2005, 07:34:35 PM »
Anyone?
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #5 on: August 06, 2005, 01:04:54 AM »
Quote from: support1 on August 04, 2005, 10:59:41 PM
Also, when I run the php script from a shell I get this in my httpd error_log:
[Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/
I get hundreds of those every time I run the script.
Hi,
Hmm, is it produced by the generator's requests? I.e., is the client IP address the same as the host IP?
Oleg Ignatiuk
http://www.xml-sitemaps.com
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #6 on: August 06, 2005, 01:08:44 AM »
Quote from: support1 on August 04, 2005, 12:23:56 PM
Hi,
I don't use a robots.txt instead use apache to block bad spiders, and provide authentication where needed. Furthermore, I don't really block directories, but do block bad spiders.
I do have several 301 redirects though as have updated a lot of urls recently. That said, there is still an easily navigable path for all my webpages to be accessed by human and spider alike
Will PM you in a sec
Hi,
Sorry for the delay! I tried to open the generator instance you PM'ed me, but it says that the target file is not writable. That's why the script cannot save the sitemap.
If that was not an issue before, then please PM me an example URL that is missing from the sitemap but has a link path from the starting URL. Thanks!
Oleg Ignatiuk
http://www.xml-sitemaps.com
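Editor's note: a "target file is not writable" condition can be checked up front, before a long crawl starts. The helper below is a hypothetical sketch using only the Python standard library; it is not part of the generator, and the returned messages are invented for illustration.

```python
import os


def check_writable(path):
    """Return a short diagnosis of whether `path` can be written by this process."""
    if os.path.exists(path):
        # Existing file: the process needs write permission on the file itself.
        return "ok" if os.access(path, os.W_OK) else "file exists but is not writable"
    parent = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(parent):
        return "parent directory does not exist"
    # New file: the process needs write permission on the containing directory.
    return "ok" if os.access(parent, os.W_OK) else "parent directory is not writable"


if __name__ == "__main__":
    print(check_writable("sitemap.xml"))
```

Note that a check like this reflects the user running it, so a script run from the shell and the same script run by the web server can give different answers.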
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #7 on: August 06, 2005, 06:08:58 PM »
Quote from: admin on August 06, 2005, 01:04:54 AM
Quote from: support1 on August 04, 2005, 10:59:41 PM
Also, when I run the php script from a shell I get this in my httpd error_log:
[Thu Aug 04 12:28:44 2005] [error] [client xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: http://www.mydomain.com/
I get hundreds of those every time I run the script.
Hi,
hmm.. is it produced by the generator requests? i.e., is the client IP address the same as host IP?
Yes, it is. That is because the PHP script is run on the same server that the generator is crawling.
« Last Edit: August 06, 2005, 06:11:54 PM by support1 »
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #8 on: August 06, 2005, 06:11:14 PM »
Quote from: admin on August 06, 2005, 01:08:44 AM
Quote from: support1 on August 04, 2005, 12:23:56 PM
Hi,
I don't use a robots.txt instead use apache to block bad spiders, and provide authentication where needed. Furthermore, I don't really block directories, but do block bad spiders.
I do have several 301 redirects though as have updated a lot of urls recently. That said, there is still an easily navigable path for all my webpages to be accessed by human and spider alike
Will PM you in a sec
Hi,
Sorry for delay! I tried to open the generator instance you PM'ed me, but it says that the target file is not writable. That's why the script cannot save the sitemap.
If that was not an issue before, then please PM me the example URL that is not stored in sitemap and has a link path from the starting url. Thanks!
Hi,
That is an unrelated issue, which I have now rectified. It was caused by some changes I made to the server after I had filed this bug report.
The original problem still exists.
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #9 on: August 06, 2005, 10:14:54 PM »
Quote from: admin on August 06, 2005, 01:08:44 AM
If that was not an issue before, then please PM me the example URL that is not stored in sitemap and has a link path from the starting url. Thanks!
Please also include your server details (Apache version, PHP type/version). Thanks.
Oleg Ignatiuk
http://www.xml-sitemaps.com
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #10 on: August 06, 2005, 11:03:47 PM »
Apache 2 latest version
PHP 4.4.0
RHEL3
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #11 on: August 07, 2005, 09:52:18 PM »
Have you been able to replicate this?
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #12 on: August 08, 2005, 07:58:09 AM »
Hi,
Yes, I see the problem when trying to crawl using your generator instance. However, when I run the crawler against your site from my host, it runs correctly.
If you could give me FTP access to your generator folder, I could do some debugging remotely and probably find the problem.
Oleg Ignatiuk
http://www.xml-sitemaps.com
support1 (Registered Customer, Jr. Member, Posts: 15)
Re: Only spiders a small amount of my pages
« Reply #13 on: August 09, 2005, 08:08:12 PM »
Hi,
The new version didn't fix this. It's weird: if I tell it to spider just the forums, the spider crawls them fine. If I start from my domain root, it only spiders 9500 pages.
Furthermore, starting from the main page it does reach the forums, but again it doesn't spider all of them.
I'm afraid I cannot give you FTP access. However, the problem really seems to be caused by the crawler not being able to follow some anchors or mod_rewrite'd URLs.
Until this bug is fixed I cannot use the generator.
This is getting frustrating :-/
« Last Edit: August 09, 2005, 09:14:39 PM by support1 »
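Editor's note: the request earlier in the thread for "an example URL that is not stored in the sitemap and has a link path from the starting URL" can be framed as a graph search. The sketch below assumes the site's links are already available as a plain dictionary (page to linked pages); the function names and sample paths are hypothetical, and this is not how the generator itself works.

```python
from collections import deque


def find_link_path(links, start, target):
    """Breadth-first search; return the shortest click path from start to target, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == target:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for linked in links.get(page, ()):
            if linked not in parents:
                parents[linked] = page
                queue.append(linked)
    return None


def missing_but_reachable(links, start, sitemap_urls):
    """Return pages reachable from start that the sitemap does not list, sorted."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for linked in links.get(page, ()):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return sorted(seen - set(sitemap_urls))


if __name__ == "__main__":
    links = {"/": ["/forum/", "/about"], "/forum/": ["/forum/t1", "/forum/t2"]}
    sitemap = {"/", "/about", "/forum/"}
    for url in missing_but_reachable(links, "/", sitemap):
        print(url, "via", " -> ".join(find_link_path(links, "/", url)))
```

Any URL this reports, together with its click path, is the kind of concrete example that helps narrow down which anchors or rewritten URLs the crawler fails to follow.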
admin (Administrator, Hero Member, Posts: 2837)
Re: Only spiders a small amount of my pages
« Reply #14 on: August 09, 2005, 11:58:49 PM »
Hi,
As I mentioned, I crawled your website successfully from my host (I stopped the process at 15,000 URLs and it was still counting), so this is probably a server configuration issue specific to your host, and it is hard to pin down without being able to test there. The only thing I noticed is that your site is running eAccelerator: if you can disable it and crawl the site again, that will at least rule it out as the cause.
Oleg Ignatiuk
http://www.xml-sitemaps.com