robots.txt processing enabled - error
« on: July 15, 2015, 05:01:06 PM »
With robots.txt processing enabled, the sitemap generator exits immediately (both 6.1 and 7.1).

I use a block of agents (maybe it doesn't like it), i.e.

User-agent: *
Disallow: /

User-agent: googlebot
User-agent: bingbot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow:
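
For reference, Python's standard-library robots.txt parser reads that file as "everything blocked" for any agent not in the named list and "everything allowed" for the named bots (the MySitemapBot user agent and example.com URL below are only placeholders):

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: googlebot
User-agent: bingbot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "http://example.com/"))     # True  -- matches the named group, which allows everything
print(rp.can_fetch("MySitemapBot", "http://example.com/"))  # False -- falls back to "*", which disallows everything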

Here's the report:

============================================================
2015-07-15 09:08:01


(memory up: 1,567.2 Kb)
0 | 0 | 0.0 | 0:00:01 | 0:00:00 | 0 | 1,567.2 Kb | 0 | 0 | 1567

[ 1 - , 1]

NEXT LEVEL:1

({skipped  - })

(memory: 1,509.9 Kb)
(saving dump)


Crawling completed
<h4>Completed</h4>Total pages indexed: 0
<br>Creating sitemaps...
 and calculating changelog...
<div id="percprog"></div>
Creating HTML sitemap...<div id="percprog2"></div>sorting.. |  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
 |  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0

*** *** [ External links are visible to forum administrators only ]

*** time: 10.263481855392 ***
 |  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
 |  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
<br />Done, redirecting to sitemap view page. <script> top.location = 'index.php?op=view' </script>
Re: robots.txt processing enabled - error
« Reply #1 on: July 16, 2015, 12:13:39 PM »
Hello,

The generator follows the "User-agent: *" rule.
However, you can disable the "Support robots.txt" setting in the generator configuration.
Re: robots.txt processing enabled - error
« Reply #2 on: July 16, 2015, 05:50:01 PM »
The generator follows the "User-agent: *" rule.
However, you can disable the "Support robots.txt" setting in the generator configuration.

According to the documentation, the generator also follows the "User-agent: googlebot" rule, which clearly shows a problem with the current robots.txt implementation.

I suggest you recode it to follow the well-accepted standards.

Googlebot, Bingbot, AhrefsBot et al. also see the "User-agent: *" rule, and that doesn't stop them from crawling the site, because each of them obeys only the most specific group that matches its own name.
Re: robots.txt processing enabled - error
« Reply #3 on: July 17, 2015, 08:00:25 AM »
It follows both the "googlebot" and "*" rules, combining them in a restrictive way.
Re: robots.txt processing enabled - error
« Reply #4 on: July 17, 2015, 04:19:03 PM »
Yes, I can see that, and it should not. It should process them separately and sequentially whenever both are present, not OR them together, e.g. if ($xxx == '*' || strstr($xxx, 'google')) {...}. What has been disallowed in the first block should be re-allowed in the second block.

In other words, rules like these should cancel each other out:

User-agent: *
Disallow: /

User-agent: googlebot
Disallow:


It should have the same effect as the well-known Apache directives:
order deny,allow
deny from all
allow from google
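
For what it's worth, Python's standard-library robots.txt parser already handles it this way; a quick sketch (example.com and the SomeOtherBot name are only placeholders):

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: googlebot
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# the specific "googlebot" group overrides "*" instead of being combined with it
print(rp.can_fetch("googlebot", "http://example.com/page"))     # True
print(rp.can_fetch("SomeOtherBot", "http://example.com/page"))  # False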


Re: robots.txt processing enabled - error
« Reply #5 on: July 18, 2015, 08:40:00 AM »
It is designed this way since the generator bot is not actually "googlebot". Thank you for the suggestion though; we will consider changing the approach in future versions.
Re: robots.txt processing enabled - error
« Reply #6 on: July 18, 2015, 04:22:34 PM »
Thank you actually, Oleg.  I appreciate it.