BUG REPORT: Sitemap Generator
« on: September 07, 2012, 06:54:13 PM »
I would have posted this under Bug Reports, but that forum refuses to let me post. In fact, the Unlimited Sitemap Generator forum will not let me post at all, and I am referring to the paid version, not the free version.

The Sitemap Generator reads content inside <script> tags, and it should not. This causes all sorts of bogus entries to be picked up, and then my "Broken Links" page is loaded with a bajillion errors about "pages" that should never have been treated as URLs.

For example, if the page [ External links are visible to forum administrators only ] has:
<script type="text/javascript">
  var somelink = 'a.html';
  function DoSomething() {
    document.getElementById('someId').innerHTML = '<a href="' + somelink + '">text<\/a>';
  }
</script>

Then a ton of links to this appear:
[ External links are visible to forum administrators only ]" + somelink + "

This creates a TON of superfluous page hits that never should have happened in the first place.

Secondly, when I view the Broken Links tab, these URLs are not properly scrubbed, so the HTML for the link above comes out as:
<a href="[ External links are visible to forum administrators only ]"%20+%20somelink%20+%20""> ... </a>
which is broken HTML and does not even link to the bogus URL in question.

Steps to fix:
1) The Sitemap Generator needs to ignore content inside <script> tags.
2) Scrub your content (with htmlentities($url); if nothing else) before you put it into the <a href=""> (see the sketch below for what I mean).
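Something like the following is all I mean for item 2. This is only a sketch in PHP; the $brokenUrls array is made up for illustration, since I have no idea how your Broken Links page is actually built:

Code:
<?php
// Hypothetical list of raw strings the crawler extracted as "URLs".
$brokenUrls = array('a.html', '" + somelink + "');

foreach ($brokenUrls as $url) {
    // htmlentities() converts the quotes into &quot;, so the extracted
    // string can no longer break out of the href attribute.
    $safe = htmlentities($url, ENT_QUOTES, 'UTF-8');
    echo '<a href="' . $safe . '">' . $safe . '</a>' . "\n";
}
?>

With that in place, even a garbage entry like the one above renders as valid (if useless) HTML instead of breaking the report page.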
Re: BUG REPORT: Sitemap Generator
« Reply #1 on: September 08, 2012, 09:26:07 PM »
Hello,

You would need to use the forum login/password that was sent to you after ordering the generator license in order to post in those sections.

Make sure that your JavaScript entries are added as:

<script type="text/javascript">
<!--
  var somelink = 'a.html';
  function DoSomething() {
    document.getElementById('someId').innerHTML = '<a href="' + somelink + '">text<\/a>';
  }
//-->
</script>
Re: BUG REPORT: Sitemap Generator
« Reply #2 on: September 10, 2012, 06:15:00 PM »
I cannot make massive changes throughout 30+ MB of website code to satisfy this condition, especially since it would cause problems elsewhere. That style of comment-wrapping JavaScript is effectively obsolete because of XHTML.

We wrap our JS in the XHTML-style // <![CDATA[ ... // ]]> structure instead. Is it possible to make your parser <![CDATA[-aware so that it will not parse content inside these regions?
Re: BUG REPORT: Sitemap Generator
« Reply #4 on: September 12, 2012, 10:01:32 PM »
[ External links are visible to forum administrators only ]

View source. Line #449 is the "link" that is being picked up improperly.
Re: BUG REPORT: Sitemap Generator
« Reply #5 on: September 14, 2012, 01:05:53 PM »
Hello,

Are you using the standalone sitemap generator? You can add this in the "Exclude URLs" setting:
Code:
all_link
Re: BUG REPORT: Sitemap Generator
« Reply #6 on: September 14, 2012, 09:48:31 PM »
Yes, I have the standalone sitemap generator. Adding that will fix my problem, but it does not fix the bug in the crawler.

You can easily fix this in the code with something as simple as this before crawling:
$content = preg_replace('/<!\[CDATA\[.*?\]\]>/s', '', $content); // the /s modifier lets .*? span line breaks
and then it will no longer be an issue.
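If you want to be thorough, the same idea can also drop whole <script> blocks, so pages that do not use the CDATA wrapper are covered as well. Again, this is just a sketch and has not been tested against your crawler:

Code:
// Strip entire <script> blocks (case-insensitive, across line breaks)
// so nothing inside them is ever scanned for URLs.
$content = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $content);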