Getting image metadata into sitemap_images.xml
« on: January 09, 2017, 08:17:26 PM »
Hello,

From an older post here in the forum I learned that the sitemap generator pulls descriptions from the <img> tags rather than from the EXIF data of the actual image.

Most of my photos are managed by gallery software that makes heavy use of CSS and JavaScript. It creates a fancy lightbox display and thumbnail preview pages, resulting in links that look like this:

<a class="pswp_go"
     href="[ External links are visible to forum administrators only ]
        /RL16-17_20160827_001_S04_-_TSG_Hin_IMGP6966-single.php"
    data-href="[ External links are visible to forum administrators only ]
        _TSG_Hin/photos/RL16-17_20160827_001_S04_-_TSG_Hin_IMGP6966.jpg"
    data-size="1024x323"
    style="background-image: url([ External links are visible to forum administrators only ]
        /Saison_16-17/FCS04_-_TSG_Hin/thumbnails/RL16-17_20160827_001_S04_-
        _TSG_Hin_IMGP6966.jpg); top: 64px; left: 7px; width: 168px; height: 53px;">
    <img
        src="[ External links are visible to forum administrators only ]
            /thumbnails/RL16-17_20160827_001_S04_-_TSG_Hin_IMGP6966.jpg"
        id="photo-RL16-17_20160827_001_S04_-_TSG_Hin_IMGP6966"
        style="height: 53px; width: 168px;"
        alt="RL16-17_20160827_001_S04_-_TSG_Hin_IMGP6966.jpg"
        title="FC Schalke 04 U23 - TSG Sprockhövel 2:0, Regionalliga West 16/17
            aufgenommen am 27.08.2016 - Michael Hilgenstock"
        width="168"
        height="53">
</a>

I can manage what goes into the title="" attribute but not the alt="".
But at the moment neither the title nor the alt attribute is collected into the sitemap for the image above.

Is there an option to configure the sitemap generator so that it takes the title attribute from an <img> as displayed above?
The images contained in WordPress posts show up fine in the sitemap with descriptions from the alt attribute.
Could it be the "id=" attribute that breaks it?
Re: Getting image metadata into sitemap_images.xml
« Reply #1 on: January 09, 2017, 10:05:20 PM »
Hello,

please disregard my question.

I believe I was searching in the wrong place. What you see in the first post is the <img> of the thumbnail image. When I follow the link to the lightbox, I find an <img> tag without any alt or title attribute.
I believe I will have to address the matter there.

Sorry for the inconvenience.
Michael
Re: Getting image metadata into sitemap_images.xml
« Reply #2 on: January 10, 2017, 08:13:01 AM »
So, I temporarily changed the settings of my gallery to display images on a page without the lightbox foo.
Then I checked debug.log to see which image URI the crawler reads, and there I find <img> tags like the following.

<img
    src="[ External links are visible to forum administrators only ]
        rwe_-_tsg_hin/photos/RL16-17_20161111_186_RWE_-_TSG_Hin_IMGP1778.jpg"
    alt="Rot-Weiss Essen - TSG Sprockhövel 3:2, Regionalliga West 16/17"
    title="Rot-Weiss Essen - TSG Sprockhövel 3:2, Regionalliga West 16/17
        aufgenommen am 11.11.2016 - Michael Hilgenstock"
    id="image-RL16-17_20161111_186_RWE_-_TSG_Hin_IMGP1778"
    class="" width="960"
    height="641"
>

That looks fine to me, yet neither the alt nor the title content appears in the image sitemap.

I have reset the gallery to the lightbox display for another crawler run to compare, and I will keep you updated.
Re: Getting image metadata into sitemap_images.xml
« Reply #3 on: January 10, 2017, 08:41:15 AM »
Checked the debug.log again.
The crawler always sees the page with the <img> that has alt and title. Only user agents that execute JavaScript can ever see the lightbox.

What did I miss, such that the information on the images is not harvested into the image sitemap?
The images in the WordPress portion of my site all have descriptions.

Thank you in advance.
Michael
Re: Getting image metadata into sitemap_images.xml
« Reply #5 on: January 10, 2017, 06:29:33 PM »
Thank you for looking.

An example of a thumbnail page would be
Code:
https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/
The crawler (or a browser that has JavaScript disabled) will follow the links of the thumbnails to pages like

Code:
https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761-single.php
A mouse click, by contrast, results in a lightbox display that is invisible to the crawler.

I searched debug.log for the filename of the image and found exactly one instance:

Code:
<image:image>
         <image:loc>https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/photos/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761.jpg</image:loc>
</image:image>
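
For comparison, here is a minimal Python sketch of what an entry with a description could look like. It assumes the `image:caption` and `image:title` tags of Google's image sitemap extension (namespace version 1.1) as they were documented at the time; which of them the sitemap generator actually writes is not confirmed in this thread. The function name `image_entry` and the example.org URL are placeholders for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace of the image sitemap extension (version 1.1).
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
ET.register_namespace("image", IMAGE_NS)


def image_entry(loc, caption=None, title=None):
    """Build one <image:image> element; caption and title are optional."""
    entry = ET.Element(f"{{{IMAGE_NS}}}image")
    ET.SubElement(entry, f"{{{IMAGE_NS}}}loc").text = loc
    if caption:
        ET.SubElement(entry, f"{{{IMAGE_NS}}}caption").text = caption
    if title:
        ET.SubElement(entry, f"{{{IMAGE_NS}}}title").text = title
    return entry


# Placeholder values, not taken from the real site.
entry = image_entry(
    "https://example.org/photos/example.jpg",
    caption="Caption taken from the alt attribute",
    title="Title taken from the title attribute",
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

An entry built without `caption` and `title` contains only `image:loc`, which matches the entry quoted above.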

The image sitemap is at
Code:
https://fotos.michilge.de/sitemap_images.xml
I temporarily moved the debug.log of the last run to

Code:
https://fotos.michilge.de/debug.log
In case
In debug.log I searched for the basename of the image and found two URLs where the crawler encountered it:

Code:
[ 926 - galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761-single.php, 1]
 { https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761-single.php }

*** *** https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761-single.php

*** time: 0.370609998703 ***

[[[ 200 OK ]]] - 0.37s (0.00 + 0.00)
array (
  'date' => 'Tue, 10 Jan 2017 08:22:38 GMT',
  'content-type' => 'text/html',
  'transfer-encoding' => 'chunked',
  'connection' => 'close',
  'server' => 'Apache',
  'p3p' => 'CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"',
  'expires' => 'Thu, 19 Nov 1981 08:52:00 GMT',
  'cache-control' => 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
  'pragma' => 'no-cache',
  'x_csize' => 18910,
)
((include https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761-single.php))

and

Code:
[ 4980 - galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/single.php?id=RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761, 1]
 { https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/single.php?id=RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761 }

*** *** https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/single.php?id=RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761

*** time: 1.2153618335724 ***

[[[ 200 OK ]]] - 1.22s (0.00 + 0.00)
array (
  'date' => 'Tue, 10 Jan 2017 10:05:50 GMT',
  'content-type' => 'text/html',
  'transfer-encoding' => 'chunked',
  'connection' => 'close',
  'server' => 'Apache',
  'p3p' => 'CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"',
  'expires' => 'Thu, 19 Nov 1981 08:52:00 GMT',
  'cache-control' => 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
  'pragma' => 'no-cache',
  'x_csize' => 18928,
)
((include https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/TSG_-_BvB_Hin/single.php?id=RL16-17_20160820_120_TSG_-_BVB_Hin_IMGP7761))
Re: Getting image metadata into sitemap_images.xml
« Reply #6 on: January 10, 2017, 08:30:04 PM »
The first occurrence of the image is detected in the <a data-href=""> tag, which doesn't include an alt or title attribute. The same image is then found in the <img> tag, but it is ignored because this image was already included in the sitemap.
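
The behaviour described here can be sketched in a few lines of Python. This is only an illustration of "first occurrence wins", not the generator's actual code; the function name `collect_first_seen` and the example.org URLs are made up for the example.

```python
def collect_first_seen(occurrences):
    """Keep only the first occurrence of each image URL.

    `occurrences` is a list of (image_url, caption_or_None) pairs in the
    order the crawler meets them. Later duplicates are dropped, even if
    they carry a caption that the first occurrence lacked.
    """
    seen = {}
    for url, caption in occurrences:
        if url not in seen:
            seen[url] = caption
    return seen


# Hypothetical crawl order for one photo.
occurrences = [
    ("https://example.org/photos/a.jpg", None),             # from <a data-href="">
    ("https://example.org/photos/a.jpg", "Match caption"),  # from <img alt="">
]
result = collect_first_seen(occurrences)
print(result)
```

With this order the caption from the <img> tag never reaches the sitemap, which is the symptom reported in this thread.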
Re: Getting image metadata into sitemap_images.xml
« Reply #7 on: January 10, 2017, 09:17:44 PM »
Do you see a chance to make the sitemap generator ignore data-href attributes?
Or will I have to find a way to add an alt and a title to the "pswp_go" class?
Re: Getting image metadata into sitemap_images.xml
« Reply #8 on: January 11, 2017, 06:12:39 AM »
Also, currently your alt attribute for <img> tag looks like:
 alt="RL16-17_20160820_083_TSG_-_BVB_Hin_IMGP7657.jpg"

Since it goes first, it takes priority over the "title" attribute.
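
A small Python sketch of that priority rule, assuming "goes first" means that alt is checked before title and that title is only used when alt is empty. The class name `ImgCaptions` and the sample markup are hypothetical; the real generator may work differently.

```python
from html.parser import HTMLParser


class ImgCaptions(HTMLParser):
    """Collect one caption per <img src>, preferring alt over title."""

    def __init__(self):
        super().__init__()
        self.captions = {}

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # alt is checked first; title is only a fallback.
            self.captions[src] = attributes.get("alt") or attributes.get("title")


parser = ImgCaptions()
parser.feed(
    '<img src="a.jpg" alt="a.jpg" title="Match report">'
    '<img src="b.jpg" title="Only a title">'
)
print(parser.captions)
```

In this sketch a filename in alt shadows a more useful title, so the descriptive text has to go into alt to show up.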
Re: Getting image metadata into sitemap_images.xml
« Reply #9 on: January 11, 2017, 06:41:35 AM »
Hi Oleg,

I could change the content of the alt attribute, since the scripts that generate my albums are open source. It is a matter of two minutes: only one PHP variable has to be exchanged. However, I might have to redo that fix each time the gallery scripts get updated.
I have started discussing it with the author, in the hope that he will make it configurable as a permanent change to his product.

But he made his point about data-* attributes clear to me. He is not inclined to add an alt and a title alongside his data-href="", even if the crawler would be happy with it, because it is meant to appear in an <a> element, which in his view does not take alt and title by the standard.

I have now read a bit about data-* attributes, and I can understand his point that all data-* attributes should be treated as transparent and are not meant to contain human-readable content.

He proposes that crawling should either ignore data-* attributes completely or fall back to them only if more appropriate sources are not available.
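
The fallback variant of that proposal could look roughly like the following Python sketch. It ranks the place where an image URL was found and keeps the occurrence from the most appropriate source. The labels "img" and "data-href", the ranking table and the function name `pick_sources` are assumptions for illustration, not part of any existing crawler.

```python
# Lower rank is preferred. The labels are hypothetical.
SOURCE_RANK = {"img": 0, "data-href": 1}


def pick_sources(occurrences):
    """For each image URL keep the occurrence from the best-ranked source.

    `occurrences` is a list of (image_url, source_label, caption_or_None).
    A data-* occurrence is kept only if no <img> occurrence exists.
    """
    chosen = {}
    for url, source, caption in occurrences:
        kept = chosen.get(url)
        if kept is None or SOURCE_RANK[source] < SOURCE_RANK[kept[0]]:
            chosen[url] = (source, caption)
    return chosen


occurrences = [
    ("https://example.org/photos/a.jpg", "data-href", None),
    ("https://example.org/photos/a.jpg", "img", "Match caption"),
    ("https://example.org/photos/b.jpg", "data-href", None),
]
chosen = pick_sources(occurrences)
print(chosen)
```

Here a.jpg gets its caption from the <img> tag although the data-href was seen first, while b.jpg is still listed because nothing better was found.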

Thank you for responding
Michael
Re: Getting image metadata into sitemap_images.xml
« Reply #11 on: January 11, 2017, 09:25:39 PM »
Hello,

Quote:
Hello,

I meant changing the alt attribute in the <img> tag, not in the <a> tag.

Yes, I understood you, and I can change that.
The gallery scripts have variables prepared with the image properties, so I can change the alt="" to contain the caption instead of the filename.

I will try that next.

I have waited for the crawler to finish the last run.
The alt and title that I added to the <a> element with the data-href attribute have been recognized and are now contained in sitemap_images.xml. Although this is not valid HTML, it has been read.

I wanted to see if that would work.
I will roll back that change and then put a caption into the alt of the <img> that follows next after the data-href still within the same <a>.
After that I will start the crawler again.

I haven't yet set up a smaller test site on a local machine, so the crawler will run for some hours. A test site would have been wise, but setting one up also takes time.
When the next crawler run is finished, I will report here.
Re: Getting image metadata into sitemap_images.xml
« Reply #12 on: January 12, 2017, 12:42:20 AM »
Now the next crawl is through. I have a proper alt and title in the <img>, but everything in the <img> element below the data-href link will be ignored anyway, because I put the string "thumbnail" in the "Exclude URLs" section. I prefer larger renditions of the photos in the sitemap over thumbnails, and I do not want the thumbnail renditions listed redundantly alongside the larger versions.
That was the case before I excluded "thumbnail".
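
As a rough Python illustration of such an exclusion, assuming the "Exclude URLs" setting is a plain substring match (which this thread suggests but does not confirm). The helper name `is_excluded` and the example.org URLs are placeholders.

```python
def is_excluded(url, exclude_substrings):
    """Return True if the URL contains any of the configured substrings."""
    return any(part in url for part in exclude_substrings)


# Placeholder URLs modelled on the gallery layout (photos/ and thumbnails/).
urls = [
    "https://example.org/gallery/photos/img1.jpg",
    "https://example.org/gallery/thumbnails/img1.jpg",
]
kept = [url for url in urls if not is_excluded(url, ["thumbnail"])]
print(kept)
```

Only the larger rendition under photos/ survives the filter, so the thumbnail never competes with it in the sitemap.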

If the sitemap generator ignored data-* attributes, either by default or as an option (as the Google crawler is said to do), it would no longer skip the well-formed <img> elements on the dedicated single-image pages. These pages are included in sitemap.xml, and they appear in debug.log. I can also see in the server log that Googlebot has visited some of them.

Code:
<a class="pswp_go"
    href="https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/rwe_-_tsg_hin
        /RL16-17_20161111_051_RWE_-_TSG_Hin_IMGP1125-single.php"
    data-href="https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/
        rwe_-_tsg_hin/photos/RL16-17_20161111_051_RWE_-_TSG_Hin_IMGP1125.jpg"
    data-size="960x641"
    style="background-image: url(https://fotos.michilge.de/galleries/TSG-Fussball
        /Saison_16-17/rwe_-_tsg_hin/thumbnails/RL16-17_20161111_051_RWE_-
        _TSG_Hin_IMGP1125.jpg); top: 35px; left: 7px; width: 168px; height: 112px;"
>
<img
    src="https://fotos.michilge.de/galleries/TSG-Fussball/Saison_16-17/rwe_-_tsg_hin
        /thumbnails/RL16-17_20161111_051_RWE_-_TSG_Hin_IMGP1125.jpg"
    id="photo-RL16-17_20161111_051_RWE_-_TSG_Hin_IMGP1125"
    style="height: 112px; width: 168px;"
    alt="Rot-Weiss Essen - TSG Sprockhövel 3:2, Regionalliga West 16/17"
    title="Rot-Weiss Essen - TSG Sprockhövel 3:2, Regionalliga West 16/17
        aufgenommen am 11.11.2016 - Michael Hilgenstock"
    width="168"
    height="112"
>
</a>
Re: Getting image metadata into sitemap_images.xml
« Reply #14 on: January 13, 2017, 03:26:53 PM »
Yes, indeed. The data-href="" attribute is now ignored, and the single-image landing page for each photo is included in my sitemap_images.xml.
Very fine!
I now have the choice to include the thumbnail pages, the single-image pages, or both by configuring the "Exclude URLs" section of the sitemap generator.
Thank you for your quick response.