Excluding folders and robots.txt
« on: July 09, 2009, 08:01:56 AM »
Very confused... I crawled without having a robots.txt on my host AND without entering anything in the 'Exclude URLs' area - result: 1788 pages indexed.

Then I crawled with robots.txt excluding some folders - result: 1788 pages indexed.

Then I crawled with robots.txt AND folders entered in the 'Exclude URLs' area - result: 1788 pages indexed. Format for entering folders in the 'Exclude' area is:
folder1/
folder2/

How can I make sure my sensitive stuff is excluded (buyer histories, personal buyer data, etc)?
Thx
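
For anyone wanting to check what a given robots.txt actually blocks, here is a minimal sketch using Python's standard urllib.robotparser. The folder names match the placeholders in the post above; the www.example.com URLs and the is_allowed helper are made up for illustration, and this does not show how the sitemap generator itself reads the 'Exclude URLs' setting. Keep in mind that robots.txt only asks well-behaved crawlers to stay away. It does not restrict access, so truly sensitive data needs server-side protection such as a login.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; folder names are the placeholders from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.com is a stand-in domain, not the poster's site.
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/folder1/orders.html",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```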
Re: Excluding folders and robots.txt
« Reply #2 on: July 09, 2009, 03:36:55 PM »
Hello Oleg,
No - they are not in the sitemap, but I'm trying to understand the logic... I crawled BEFORE uploading robots.txt to my server and BEFORE entering anything in the 'Exclude URLs' area of 'Configuration'... and still got the same 1,788 pages crawled. While it's true that the sitemap has been generated the way that I want (with these excluded), my question is: How?? (The only reason I even care is because I want to make sure I understand the logic of the script in order to make sure I am doing everything properly).

Also, I cannot seem to find an explanation of ROR.XML - what is this and how/why/under what circumstances do I use it?
Thanks so much

Re: Excluding folders and robots.txt
« Reply #3 on: July 10, 2009, 07:48:43 PM »
Hello,

So that means those pages were not in the sitemap before either. Perhaps there is no way to reach them starting from the homepage, so the sitemap generator's crawler cannot find them.
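
To illustrate the reachability point, here is a rough sketch of how a link-following crawler discovers pages. It is not the sitemap generator's actual code; the SITE link map and the page paths in it are invented for the example. A page that nothing links to is never visited, whether or not it is excluded anywhere.

```python
from collections import deque


def reachable_pages(links, start):
    """Breadth-first walk over a link map; returns every page reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: each key is a page, each value lists the pages it links to.
SITE = {
    "/": ["/products/", "/contact/"],
    "/products/": ["/"],
    "/orders/history.html": ["/"],  # links out, but no page links to it
}

if __name__ == "__main__":
    found = reachable_pages(SITE, "/")
    print(sorted(found))
    print("/orders/history.html" in found)
```

In this made-up site, the order history page never shows up in the result because the walk starts at the homepage and no visited page points to it.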

You can find details on ROR sitemaps here: http://rorweb.com/
Re: Excluding folders and robots.txt
« Reply #4 on: July 10, 2009, 09:10:12 PM »
I think you may be right... there isn't a way to reach them from the homepage. However, my blog is reachable from home (and every) page, but it was not crawled. If you wouldn't mind, please have a look and let me know: www.h eal thyleg depot.com
and add '/blog' to go straight to the blog. (I put the spaces in the URL so the bulletin board system won't mess it up, AND I don't want search engines crawling this discussion and pointing it to my site.)
Thx
Re: Excluding folders and robots.txt
« Reply #5 on: July 10, 2009, 11:44:32 PM »
Hello,

Your blog is redirected to http://domain.com/blog/ (without www), while the main site is on http://www.domain.com/ (with www). You should put both on the same hostname, either with www or without.
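
As a sketch of why the www mismatch matters: assuming the crawler only follows links on the same hostname it started from (an assumption about its behavior, consistent with the advice above), a simple hostname comparison treats www.domain.com and domain.com as two different sites. The same_host helper below is hypothetical.

```python
from urllib.parse import urlsplit


def same_host(start_url: str, link_url: str) -> bool:
    """Return True if both URLs share exactly the same hostname."""
    return urlsplit(start_url).hostname == urlsplit(link_url).hostname


if __name__ == "__main__":
    start = "http://www.domain.com/"
    for link in ("http://www.domain.com/blog/", "http://domain.com/blog/"):
        print(link, "followed" if same_host(start, link) else "skipped")
```

Under that assumption, a blog that redirects to the non-www address would be skipped by a crawl started at the www address.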
Re: Excluding folders and robots.txt
« Reply #6 on: July 11, 2009, 04:25:55 AM »
This topic of whether or not to use the www is so confusing to me. Is this something that I can fix by setting a 301 redirect from [ External links are visible to forum administrators only ] to [ External links are visible to forum administrators only ] (and I guess it should be without using masking)? Or is it something that I need to have my designer change in the CSS?  If I go to my site WITH the www, and I lay my mouse over 'blog', it shows that I will be going to [ External links are visible to forum administrators only ] BUT if I go to my site without the www, that same link shows without the www.
Once I fix this, xml-sitemap will crawl the blog also?
Thanks again Oleg - sorry to take so much of your time, but you've been a fantastic help!
Re: Excluding folders and robots.txt
« Reply #7 on: July 11, 2009, 03:08:45 PM »
Oleg, you were on the money! I went into the WordPress admin panel for the blog... changed the blog location to include the www, re-ran the crawl, and I see the blog is now included! Nice!