no1

Some pages are not getting crawled nor indexed
« on: March 19, 2010, 06:05:35 PM »
Hi there Mr Oleg Ignatiuk and everyone else.

This is my first day using the software and it looks very promising. However, I still have two issues that I'm trying to solve, and I'm hoping to get some help here.

1) How do I add a sitemap entry attribute for the URL "[ External links are visible to forum administrators only ]"?

I tried "/,2010-03-19,daily,1.0", but then every page on the whole site gets the same attributes. (index.php has the canonical page / and index.php$ is not used.)

2) 96 pages like the following example are neither crawled nor indexed: [ External links are visible to forum administrators only ]

Here is the navigation path to the example page:

[ External links are visible to forum administrators only ]
[ External links are visible to forum administrators only ]
[ External links are visible to forum administrators only ]

I may have done something wrong entering Excluded URLs but I can't figure out where the problem is.

Sitemap is here: [ External links are visible to forum administrators only ]

Below follows my new config.

Thanks a lot for helping me out!!!!!

Regards
no1

NEW CONFIG:

Code: [Select]
<xmlsitemaps_settings>
    <option name="xs_inc_skip">\.(pdf|doc|txt|rtf)</option>
    <option name="xs_exc_skip">\.(zip|m4a|m4v|rar|tar|bz2|tgz|exe|gif|tif|jpg|png|class|jar|mpeg|mpg|mp3|wav|mp4|avi|wmv|gz|mov|mid|ra|ram)</option>
    <option name="xs_proto_skip">(\#|mms:|mailto:|javascript:|ftp:|news:|aim:)</option>
    <option name="xs_exec_time">9000</option>
    <option name="xs_initurl">http://chaimai.com</option>
    <option name="xs_freq">weekly</option>
    <option name="xs_lastmod">1</option>
    <option name="xs_priority">0.5</option>
    <option name="xs_descpriority">0.8</option>
    <option name="xs_autopriority">1</option>
    <option name="xs_smname">/www/somef*ckingfolder/sitemap.xml</option>
    <option name="xs_smurl">http://chaimai.com/sitemap.xml</option>
    <option name="xs_gping">0</option>
    <option name="xs_makehtml">1</option>
    <option name="xs_maketxt">1</option>
    <option name="xs_makeror">1</option>
    <option name="xs_savestate_time">30</option>
    <option name="xs_sm_size">40000</option>
    <option name="xs_robotstxt">1</option>
    <option name="xs_dumptype">serialize</option>
    <option name="xs_cleanpar">PHPSESSID</option>
    <option name="xs_chlogorder">asc</option>
    <option name="xs_exclude_check">1</option>
    <option name="xs_dateformat">Y, F j</option>
    <option name="xs_allow_httpcode">200</option>
    <option name="xs_weblog_ping">http://rpc.technorati.com/rpc/ping</option>
    <option name="xs_purgelogs">30</option>
    <option name="xs_htmlpart">10000</option>
    <option name="xs_max_depth">0</option>
    <option name="xs_memlimit">256</option>
    <option name="xs_no_cookies">1</option>
    <option name="xs_htmlname">/www/www/somef*ckingfolder/chaimai.com/public_html/sitemap.html</option>
    <option name="xs_notconfigured">0</option>
    <option name="xs_lastmodtime">2010-03-18 13:46:44</option>
    <option name="xs_max_pages">1000</option>
    <option name="xs_delay_req"></option>
    <option name="xs_delay_ms"></option>
    <option name="xs_yping"></option>
    <option name="xs_excl_urls">sort
;wap
;wap2
;imode
topicseen
attachments
avatars
createimage
cur_topic_id
errors
greybox
FCKeditor
Themes
Smileys
Sources
Packages
Themes
tp-downloads
tp-images
tpstart
tpadmin
action=activate
action=admin
action=calendar
action=dlattach
action=findmember
action=help
action=login
action=mlist
action=mgallery
action=post
action=permissions
action=pm
action=printpage
action=profile
action=register
action=reminder
action=search
action=stats
action=tpadmin
action=verificationcode
action=vote
action=who
prev_next
page
swedish
ftlang=svth
index.php?board=1.0
search
unread
new
msg
.*[topic|board]=\d+$
.*sa=details$
.*start=[A-Za-z%]*</option>
    <option name="xs_incl_urls">page=58</option>
    <option name="xs_incl_only"></option>
    <option name="xs_parse_only"></option>
    <option name="xs_ind_attr">/index.php?language=thai-utf8,2010-03-19,daily,1.0
/index.php?action=forum,2010-03-19,monthly,1.0
/index.php?action=forum;language=thai-utf8,2010-03-19,monthly,1.0
/index.php?action=dictionary,2010-03-19,monthly,1.0
/index.php?action=dictionary;language=thai-utf8,2010-03-19,monthly,1.0
/index.php?board=2.0,2010-03-19,daily,0.9
/index.php?board=25.0,2010-03-19,daily,0.9
/index.php?board=14.0,2010-03-19,daily,0.9
/index.php?board=44.0,2010-03-19,daily,0.9
/index.php?board=21.0,2010-03-19,daily,0.9
/index.php?board=49.0,2010-03-19,daily,0.9
/index.php?board=36.0,2010-03-19,daily,0.9
/index.php?board=27.0,2010-03-19,monthly,0.9
/index.php?board=41.0,2010-03-19,daily,0.9
/index.php?board=50.0,2010-03-19,daily,0.9
/index.php?board=51.0,2010-03-19,daily,0.9
/index.php?board=52.0,2010-03-19,daily,0.9
/index.php?action=thaikeyboard,2010-03-19,yearly,1.0
/thailandskt_tangentbord.php,2010-03-19,yearly,1.0
http://chaimai.com/thai_keyboard.php,2010-03-19,yearly,1.0</option>
    <option name="xs_login">ling</option>
    <option name="xs_email">m@4m.se</option>
    <option name="xs_chlog">0</option>
    <option name="xs_extlinks">0</option>
    <option name="xs_makemob">0</option>
    <option name="xs_compress">1</option>
    <option name="xs_usecurl">0</option>
    <option name="xs_memsave">0</option>
    <option name="xs_ipconnection"></option>
    <option name="xs_metadesc">0</option>
</xmlsitemaps_settings>
Re: Some pages are not getting crawled nor indexed
« Reply #1 on: March 20, 2010, 08:10:30 AM »
Hello,

1. please try this:
$,2010-03-19,daily,1.0

2. The URL matches this exclusion that you have defined:
.*[topic|board]=\d+$

you should replace it with:
.*(topic|board)=\d+$
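
The difference between the two patterns can be checked with a short Python sketch. Square brackets define a character class, so [topic|board] matches any single one of the characters t, o, p, i, c, |, b, a, r, d, while (topic|board) matches the whole word "topic" or "board". The URLs below are hypothetical examples, and the sketch assumes the generator tests each exclusion as an unanchored regular expression search, which may differ from what the software actually does.

```python
import re

# Pattern from the original config: square brackets form a character class.
CHAR_CLASS = re.compile(r".*[topic|board]=\d+$")
# Corrected pattern: parentheses form a group with two alternatives.
GROUP = re.compile(r".*(topic|board)=\d+$")


def is_excluded(pattern, url):
    """Return True if the pattern matches anywhere in the URL (assumed behaviour)."""
    return pattern.search(url) is not None


# Hypothetical URLs, used only for illustration.
urls = [
    "index.php?topic=45",
    "index.php?action=dictionary;id=123",
]

for url in urls:
    print(url, is_excluded(CHAR_CLASS, url), is_excluded(GROUP, url))
```

The second URL ends in "d=123", and "d" is one of the characters in the class, so the original pattern excludes it even though it contains neither "topic" nor "board".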

no1

Re: Some pages are not getting crawled nor indexed
« Reply #2 on: March 20, 2010, 09:04:44 AM »
Quote
Hello,

1. please try this:
$,2010-03-19,daily,1.0

2. The URL matches this exclusion that you have defined:
.*[topic|board]=\d+$

you should replace it with:
.*(topic|board)=\d+$


Thank you very much for the kind help!  ;D

I did some experimenting this morning and found that the regexp .*start=[A-Za-z%]* produced results I did not expect. When generating maps, it excludes the target string index.php?action=dictionary;sa=all;start=50, even though it shouldn't according to a regular expression test page I found at regexplanet.com/simple/index.html

However, I changed it to .*start=[A-Za-z%]+ and now everything seems to run the way I want: I now have 7335 pages indexed.
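
A possible explanation, sketched in Python: with *, the character class may match zero characters, so the pattern is already satisfied by "start=" alone and any URL containing "start=" counts as a hit in an unanchored search. A tester that requires the whole string to match would report no match, because the trailing "50" is not covered. That the generator searches while the test page matches the whole string is an assumption here, not something confirmed in this thread.

```python
import re

URL = "index.php?action=dictionary;sa=all;start=50"

STAR = re.compile(r".*start=[A-Za-z%]*")  # class may match zero characters
PLUS = re.compile(r".*start=[A-Za-z%]+")  # class must match at least one character

# Unanchored search: "start=" followed by zero letters is enough for STAR.
star_search = STAR.search(URL) is not None
# Whole-string match: fails because the digits "50" are left over.
star_full = STAR.fullmatch(URL) is not None
# With +, at least one letter or % must follow "start=", so "start=50" is not a hit.
plus_search = PLUS.search(URL) is not None

print(star_search, star_full, plus_search)
```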

I will try your other recommendations out.

Thanks again and keep up the good work with this super software!  ;D

no1

Re: Some pages are not getting crawled nor indexed
« Reply #3 on: March 20, 2010, 09:23:36 AM »
Quote
...
1. please try this:
$,2010-03-19,daily,1.0
...

Great it works!

// EDIT REASON: I was too fast to reply. It doesn't work. All links are now getting the same date, frequency and priority.
$,2010-03-19,daily,1.0 seems to match ALL pages indexed.

By the way, do I always have to edit this list manually to set the current date, or can I enter some kind of placeholder, e.g. something like %today%, to let the generator use the current date?
« Last Edit: March 20, 2010, 09:42:19 AM by no1 »
Re: Some pages are not getting crawled nor indexed
« Reply #4 on: March 22, 2010, 10:35:04 AM »
Please try to change the attribute line to:
/$,2010-03-19,daily,1.0
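
Why "$" alone matched every page while "/$" should not can be illustrated with a small Python sketch. It assumes the attribute pattern is applied as an unanchored regular expression search against the full URL and that the home page URL ends in a slash. The helper name matching_urls and the sample list are made up for the illustration.

```python
import re


def matching_urls(pattern, urls):
    """Return the URLs in which the pattern is found (assumed matching rule)."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Sample URLs; the home page is assumed to end with a slash.
URLS = [
    "http://chaimai.com/",
    "http://chaimai.com/index.php?board=2.0",
    "http://chaimai.com/thai_keyboard.php",
]

# "$" only asserts the end of the string, which every URL has.
print(matching_urls("$", URLS))
# "/" occurs somewhere in every URL, so it also matches all of them.
print(matching_urls("/", URLS))
# "/$" requires the URL to end with a slash.
print(matching_urls("/$", URLS))
```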

Quote
By the way, do I always have to edit this list manually to set the current date, or can I enter some kind of placeholder, e.g. something like %today%, to let the generator use the current date?
You can simply omit the date part, like this:
/$,,daily,1.0

The generator will then insert the default date according to your configuration (which can be set to "current time").
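
As a rough illustration of the empty date field, the Python sketch below splits an attribute line into its four comma-separated parts and substitutes a default date when the second part is empty. The function name, the dictionary keys and the parsing approach are assumptions made for the example and do not describe the generator's actual code.

```python
from datetime import date


def parse_attribute_line(line, default_date):
    """Split 'pattern,lastmod,changefreq,priority' into a dict.

    Splitting from the right keeps any commas inside the URL pattern intact.
    An empty lastmod field falls back to default_date.
    """
    pattern, lastmod, changefreq, priority = line.rsplit(",", 3)
    return {
        "pattern": pattern,
        "lastmod": lastmod or default_date.isoformat(),
        "changefreq": changefreq,
        "priority": float(priority),
    }


# Date omitted: the supplied default is used.
print(parse_attribute_line("/$,,daily,1.0", date(2010, 3, 22)))
# Date given explicitly: it is kept as written.
print(parse_attribute_line("/$,2010-03-19,daily,1.0", date(2010, 3, 22)))
```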