robots.txt processing enabled - error

Started by shiz, July 15, 2015, 05:01:06 PM


shiz

With robots.txt processing enabled, the sitemap generator exits immediately (in both 6.1 and 7.1).

I use a block of agents (maybe it doesn't like it), i.e.:

User-agent: *
Disallow: /

User-agent: googlebot
User-agent: bingbot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow:

Here's the report:

============================================================
2015-07-15 09:08:01


(memory up: 1,567.2 Kb)
0 | 0 | 0.0 | 0:00:01 | 0:00:00 | 0 | 1,567.2 Kb | 0 | 0 | 1567

[ 1 - , 1]

NEXT LEVEL:1

({skipped  - })

(memory: 1,509.9 Kb)
(saving dump)


Crawling completed
<h4>Completed</h4>Total pages indexed: 0
<br>Creating sitemaps...
and calculating changelog...
<div id="percprog"></div>
Creating HTML sitemap...<div id="percprog2"></div>sorting.. |  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
|  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0

*** *** [ External links are visible to forum administrators only ]

*** time: 10.263481855392 ***
|  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
|  | 0.0 | 0:00:00 | 0:00:00 |  |  |  |  | 0
<br />Done, redirecting to sitemap view page. <script> top.location = 'index.php?op=view' </script>

XML-Sitemaps Support

Hello,

The generator follows the "User-agent: *" rule.
However, you can disable the "Support robots.txt" setting in the generator configuration.

shiz

Quote from: XML-Sitemaps Support on July 16, 2015, 12:13:39 PM
The generator follows the "User-agent: *" rule.
However, you can disable the "Support robots.txt" setting in the generator configuration.

According to the documentation, the generator also follows the "User-agent: googlebot" rule, which clearly shows a problem with the current robots.txt implementation.

I suggest you recode it following the well-accepted standards.

Googlebot, Bingbot, AhrefsBot, et al. also abide by the "User-agent: *" rule, and that doesn't stop them from crawling the site.
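To make that concrete, here's a minimal sketch in PHP of how standard group selection works (the function name and array layout are made up for illustration, I'm not quoting your generator's code): a crawler obeys only the one group whose User-agent token most specifically matches its own name, and falls back to the "*" group only when no named group matches.

<?php
// Sketch only: pick the robots.txt group that applies to a given bot name.
// The most specific "User-agent" match wins; "*" is only a fallback.
function select_group($groups, $botName)
{
    $botName  = strtolower($botName);
    $fallback = null;   // the "User-agent: *" group, if present
    $best     = null;   // the most specific named match so far
    $bestLen  = -1;

    foreach ($groups as $group) {
        // each $group is array('agents' => array('googlebot', ...), 'rules' => array(...))
        foreach ($group['agents'] as $agent) {
            $agent = strtolower($agent);
            if ($agent === '*') {
                $fallback = $group;
            } elseif (strpos($botName, $agent) !== false && strlen($agent) > $bestLen) {
                $best    = $group;
                $bestLen = strlen($agent);
            }
        }
    }
    // A named match overrides the wildcard group entirely, so the
    // "Disallow: /" under "User-agent: *" never applies to googlebot here.
    return $best !== null ? $best : $fallback;
}

With my robots.txt above, select_group($groups, 'googlebot') returns the second group, whose empty Disallow allows everything - which is exactly why the real bots keep crawling the site.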


shiz

Yes, I can see that, and it should not. It should process them separately and sequentially whenever both are present, not OR them together, e.g. if ($xxx == '*' || strstr($xxx, 'google')) {...}. What has been disallowed in the first block should be re-allowed in the second block (rough parsing sketch at the end of this post).

In other words, rules like these should cancel each other out:

User-agent: *
Disallow: /

User-agent: googlebot
Disallow:


It should have the same effect as the well-known Apache directives:
order deny,allow
deny from all
allow from google
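And here's a rough sketch of the parsing side (again, the names are illustrative, not your code): read robots.txt into separate groups, one per run of User-agent lines, and only afterwards decide which single group applies, instead of OR-ing "*" and "googlebot" into one rule set.

<?php
// Sketch only: split robots.txt into separate groups so each block keeps its own rules.
function parse_groups($robotsTxt)
{
    $groups = array();
    $i = -1;                // index of the group currently being filled
    $lastWasAgent = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // drop comments
        if ($line === '' || !preg_match('/^([A-Za-z-]+)\s*:\s*(.*)$/', $line, $m)) {
            continue;
        }
        $field = strtolower($m[1]);
        $value = trim($m[2]);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                // a User-agent line after rules (or at the top) opens a new group
                $groups[] = array('agents' => array(), 'rules' => array());
                $i++;
            }
            $groups[$i]['agents'][] = $value;
            $lastWasAgent = true;
        } elseif ($i >= 0) {
            $groups[$i]['rules'][] = array($field, $value);  // e.g. array('disallow', '/')
            $lastWasAgent = false;
        }
    }
    return $groups;
}

Fed my robots.txt, that produces two separate groups, and the second one (googlebot, bingbot, etc. with an empty Disallow) overrides the first for those bots - the same effect as the Apache directives above.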



XML-Sitemaps Support

It is designed this way because the generator bot is not actually "googlebot". Thank you for the suggestion, though; we will consider changing the approach in future versions.

shiz

Actually, thank you, Oleg. I appreciate it.