XML Sitemaps Generator

Standalone Sitemap Generator Documentation

  1. Requirements
  2. Installation
  3. Configuration options
  4. Configuration tips
  5. Usage
  6. About Google Sitemaps

1. Requirements

  • The PHP sitemap generator works with PHP 4.3.x or higher in its default configuration, in an Apache web-server environment as well as on Windows/IIS servers.
  • The sitemap generator connects to your website via HTTP on port 80, so your host must allow local network connections for PHP scripts (this is the default configuration).
  • For file permission requirements, please refer to the "Installation" section.
  • The memory requirements (as well as the time required to complete sitemap generation) depend on the number of pages your website contains.

2. Installation

Note: please check our Sitemap Generator Installation Guide for detailed instructions.
  1. Unpack the contents of distribution archive to the target folder on your server.
  2. Make sure to set the following file permissions:
    • data/ folder - 0777 (rwxrwxrwx)
    • /path/to/your/sitemap.xml - 0666 (rw-rw-rw-) see below (3.14)
    • /path/to/your/ror.xml - 0666 (rw-rw-rw-)
    • [optional]/path/to/your/sitemap_images.xml + sitemap_video.xml + sitemap_news.xml - 0666, only if you are creating corresponding sitemap types (Images/Video/News sitemap)
    Note: on Windows/IIS servers it is not possible to set permissions via FTP, so you will have to do that using your hosting control panel: allow write access to the files and folders mentioned above.
  3. If you want the sitemap to be built periodically (daily, weekly, etc.), you should set up a cron job to run the script using your hosting control panel. The command to use for the cron job is shown on the "Crawling" page.
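Steps 2 and 3 above can be sketched in shell. The paths here are stand-ins (a temporary directory replaces your real document root), and the cron line is a placeholder only; use the exact command shown on the generator's "Crawling" page:

```shell
# Stand-in web root for demonstration; substitute your real document root.
root=$(mktemp -d)
mkdir -p "$root/generator/data"

# Step 2: permissions for the data/ folder and the sitemap files.
chmod 0777 "$root/generator/data"
touch "$root/sitemap.xml" "$root/ror.xml"
chmod 0666 "$root/sitemap.xml" "$root/ror.xml"

# Step 3 (placeholder schedule): a daily 3:00 AM crontab entry; the real
# command to run is the one shown on the generator's "Crawling" page.
echo '0 3 * * * (command shown on the "Crawling" page)'
```

On Windows/IIS, as noted above, use the hosting control panel instead of chmod.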

2.1 Upgrade

If you have a previous version of Sitemap Generator already installed, the following steps are required:
  1. Unpack the contents of the distribution archive and upload the files to the target folder on your server, overwriting existing files.

3. Configuration options

  • Starting URL

    Use the full URL of your site for the "Starting URL" option. The crawler will explore only the URLs within the starting directory, e.g. when the starting URL is "http://www.example.com/path/index.html", "http://www.example.com/path/sub/page.html" will be indexed, but "http://www.example.com/other/index.html" will NOT be indexed.
  • Save sitemap to

    This is the filename in the "public_html/" folder of your website. The file must be writable by the script. To make sure it is, create the file and set its permissions to 0666.
  • Your Sitemap URL

    This is the link to your sitemap; it must correspond to the "Save sitemap to" setting defined above.
    Example: http://www.yourdomain.com/sitemap.xml

Other Sitemap Types Section

  • Create XML Sitemap (Text/ROR/HTML/Mobile Sitemap etc)

    You can enable or disable each of the available sitemap types depending on your needs.
  • HTML Sitemap filename

    A full path, including the filename, pointing to the desired location of the first page of the HTML sitemap (it will be split into multiple files automatically if needed).
  • Images/Video/News/RSS feed/Mobile Sitemap filename

    Only a filename should be entered in this option (like "sitemap_images.xml") - the sitemap will be created in the same folder where the main sitemap.xml is located.
  • Images/News/RSS inclusion mask

    Define this to allow only specific images (or news/RSS feed pages) to be indexed in the sitemap (separate entries with spaces).
    Example: "images/large/ images/medium/"
  • Allow including images from these domains

    If the images on your website are located on an external domain (not the same as the website domain), you can specify it here.
    Example: http://static.yourdomain.com
  • Publication title and language

    Attributes of news sitemap entries, according to the definition described here
  • Feed title

    The title of your RSS feed
  • Include only entries from last X days

    Limit your RSS feed by including only pages added to your site within the last X days ("0" means include all pages)

Sitemap Entry Attributes Section

  • Change frequency

    This value indicates how frequently the content at a particular URL is likely to change.
  • Last modification

    The time the URL was last modified. It is recommended to use "Server's response" for the "Last modification" field. In this case entries for static pages will be filled with their real last-modification time, while for dynamic pages the current time is used.
  • Priority

    The priority of a particular URL relative to other pages on the same site. The value for this tag is a number between 0.0 and 1.0.
  • Automatic Priority

    Enable this option to automatically reduce priority depending on the page's depth level
  • Individual attributes

    This option allows you to set specific values for last-modification time, frequency and priority per page. To use it, define specific frequency and priority attributes in the following format: "url substring,lastupdate YYYY-mm-dd,frequency,priority".
    Example:
    page.php?product=,2005-11-14,monthly,0.9

Miscellaneous Settings Section

  • Require authorization to access generator interface

    You can define the login name and password that will be required to access sitemap generator interface.
  • Send email notifications

    Enter your email address here to receive a notification report when sitemap is created.
  • Number of URLs per file in XML sitemap and maximum file size

    Set the maximum number of URLs per file and the maximum file size - the sitemap will automatically be split into multiple files if either limit is exceeded.
  • Number of links per page and sort order in HTML sitemap

    Define the number of pages listed per HTML sitemap file, the sorting type of pages (alphabetical/unsorted), and the way URLs are grouped in the list (tree-like or folders list).
  • Compress sitemap using GZip

    Enable this option to create compressed sitemaps (sitemap.xml.gz)
  • Inform (ping) Search Engines upon completion

    Send a notification to major search engines when sitemap is updated
  • Send "weblogUpdate" type of Ping Notification to

    Optionally ping other website(s) when sitemap is updated
  • Calculate changelog

    Keep a log of changes between created sitemaps (URLs that were added/removed in the sitemap compared to the previous one)
  • Store the external links list

    Track the links pointing to external domains from your website. Optionally, you can define exclusion mask in "Excluding matching URLs" setting.
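As an illustration of the "Compress sitemap using GZip" option above, the result is an ordinary gzip-compressed file alongside (or instead of) the plain XML. This sketch reproduces the effect with the standalone gzip tool in a temporary directory; the generator itself produces the .gz file for you:

```shell
cd "$(mktemp -d)"
# A minimal stand-in sitemap file.
printf '<?xml version="1.0" encoding="UTF-8"?>\n<urlset></urlset>\n' > sitemap.xml
gzip -k sitemap.xml    # -k keeps the original next to sitemap.xml.gz
ls sitemap.xml sitemap.xml.gz
```

Search engines accept the .gz form directly, which is why only the filename extension changes.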

Narrow Indexed Pages Set Section

  • Exclude from sitemap extensions

    These file types are not crawled and not included in the sitemap
  • Exclude URLs

    To exclude part of your website from the sitemap, use the "Exclude URLs" setting: all URLs that contain the specified strings will be skipped.
    For instance, to exclude all pages within "www.domain.com/folder/" add this line:
    • folder/

    If your site has pages with lists that can be reordered by columns and URLs look like "list.php?sort=column2", add this line to exclude duplicate content:
    • sort=

    You may also leave this box empty to have ALL pages listed.

    These two options work similarly to the robots "noindex, nofollow" directive
  • Add directly in sitemap (do not parse) extensions

    These files will be added to the sitemap, but not fetched from your server, to save bandwidth and time, because usually they are not HTML files and have no embedded links. Please make sure these files are indexed by Google, since otherwise there is no sense in adding them to the sitemap!
  • Add directly in sitemap (do not parse) URLs

    "Do not parse URLs" works together with the option above to increase the speed of sitemap generation. If you are sure that some pages on your site do not contain unique links to other pages, you can tell the generator not to fetch them.
    For instance, if your site has "view article" pages with URLs like "viewarticle.php?..", you may want to add them here, because most likely all links inside these pages are already listed in "higher level" documents (like the list of articles) as well:
    • viewarticle.php?id=

    If you are not sure what to define here, just leave this field empty.
    Please note that these pages are still included in sitemap.

    These two options work similarly to the robots "index, nofollow" directive
  • Crawl, but do not include URLs

    Crawl pages that contain these substrings in the URL, but do NOT include them in the sitemap

    This option is similar to robots "noindex, follow" directive
  • "Include ONLY" URLs

    Fill this in if you would like to include in the sitemap ONLY those URLs that match the specified strings; separate multiple matches with spaces. Leave this field empty by default.
  • "Parse ONLY" URLs

    Fill this in if you would like to parse (crawl) ONLY those URLs that match the specified strings; separate multiple matches with spaces. Leave this field empty by default.
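The substring matching used by the exclusion settings above can be illustrated with a plain grep filter. The URLs and patterns are the examples from this section, not the generator's internals:

```shell
# "Exclude URLs" skips any URL containing one of the substrings,
# here "folder/" and "sort=" from the examples above.
printf '%s\n' \
  'http://www.domain.com/folder/page1.html' \
  'http://www.domain.com/list.php?sort=column2' \
  'http://www.domain.com/about.html' |
  grep -v -e 'folder/' -e 'sort='
# -> prints only http://www.domain.com/about.html
```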

Crawler Limitations, Finetune Section

  • Maximum pages

    This limits the number of pages crawled. Enter "0" for unlimited crawling.
  • Maximum depth level

    Limit the number of "clicks" it takes to reach a page starting from the homepage. "0" for unlimited
  • Maximum execution time

    Limit the running time in seconds. This option may not work depending on the server configuration.
  • Maximum memory usage

    Limit the memory usage in MB. This option may not work depending on the server configuration.
  • Save the script state, every X seconds

    This option allows you to resume a crawling operation if it was interrupted. "0" disables state saving
  • Make a delay between requests, X seconds after each N requests

    This option allows you to slow down crawling to avoid overloading your server. "0" for no delay

Advanced Settings Section

  • Allow subdomains

    Allow including pages from any subdomain of your website (like news.yourdomain.com or mobile.yourdomain.com)
  • Additional "Starting URLs"

    If some pages on your website cannot be reached by clicking links starting from the homepage, you can define additional entry points for the sitemap generator bot in this setting.
  • Support cookies

    With this option enabled, the generator bot will support cookies set by your website, similar to browsers.
  • Use robots.txt file

    The generator bot will read and follow blocking directives from your website's robots.txt file if this option is enabled (it checks the "User-agent: *" and "User-agent: googlebot" sections)
  • Detect canonical URL meta tag

    If the canonical meta tag in the HTML source of a page points to a different URL, the generator bot will follow the link defined in the tag. More details
  • Crawl Ajax content

    The sitemap generator bot will crawl AJAX-powered resources on your website, as long as they comply with the AJAX crawling specification
  • Remove session ID from URLs

    URL parameters defined in this option will be removed from links. For instance, "page.html?article=new&sid=XXXXXX" will be replaced with "page.html?article=new" if "sid" is defined in this option.
  • Include hreflang for language URLs in sitemap

    Automatically detect hreflang tags if you have them included on your website. More details here.
  • Custom alternative language pages

    In addition to the setting above, you can specify alternative language versions for your pages: enter a page URL followed by a list of language identifiers with alternative URLs.
    Example:
    http://www.example.com/ de http://www.example.com/de/ es http://www.example.com/es/
  • Custom Accept-Language http header

    You can specify your own "Accept-Language" header for the sitemap generator bot. The default is "en-us,en;q=0.5".
  • Use IP address for crawling

    If you get a "Connection error" message when trying to run the sitemap generator, you can try defining your website's IP address in this field.
  • Use CURL extension for http requests

    Enable usage of the cURL library for the generator bot, provided the library is enabled in the server's PHP configuration.
  • Enable stylesheet for XML sitemap

    Display xml sitemap formatted for browser output using the stylesheet
  • Remove "Created by.." links from sitemap

    Do not include a line showing that sitemap was created with our sitemap generator
  • Store referring links, Maximum referring pages to store

    Track which pages are linking to other pages on your website (and define the limit for number of referrers per page). The details can be checked on a special "Referrers" page in sitemap generator interface.
  • Site uses UTF-8 charset

    Please make sure this option is enabled if your website uses the UTF-8 charset.
  • Enable last-modification time tag for "not parsed" URLs

    Send an additional HTTP HEAD request for each page included in the sitemap, even if it matches the "Add directly in sitemap" setting.
  • Extract meta description tag

    Sitemap generator will store all description meta tags and will use them in HTML sitemap and RSS feed.
  • Minimize script memory usage

    With this option, the sitemap generator will use temporary files to store crawling progress. However, this may significantly increase crawling time.
  • Monitor crawler window and automatically resume if it stops

    Define the number of seconds (60 is recommended) to be used as the timeout to check whether crawling is still in progress, and resume it automatically if it has stopped. This setting allows you to run the generator once in a browser and keep the window open, letting the generator continue the process on its own.
  • Show debug output when crawling

    An option to display debugging information during crawling (such as HTTP request/response headers, page URLs, etc.)
  • Check for new versions of sitemap generator

    Check if there is a new version available each time sitemap generator is opened in browser
  • Purge log records older than X days

    Remove outdated logs, so that only the latest records are shown on Changelog page
  • Custom groups for "analyze" tab

    You can define a URL substring that will be shown as a separate entry on the "Analyze" page. This way you can find out how many pages on the website match a specific URL type.
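The "Remove session ID from URLs" rewriting described above behaves roughly like the following sed expression. This is a sketch of the idea using the example URL from that option, not the generator's actual code:

```shell
url='page.html?article=new&sid=XXXXXX'
# Drop the "sid" parameter wherever it appears, then trim a dangling ? or &.
clean=$(printf '%s\n' "$url" | sed -E 's/([?&])sid=[^&]*&?/\1/; s/[?&]$//')
echo "$clean"    # page.html?article=new
```

The same substitution also handles "sid" as the first parameter ("page.html?sid=X&article=new" becomes "page.html?article=new").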

4. Configuration tips

  • You may want to limit the number of pages to index to make sure crawling will not run endlessly if your website has an error such as infinitely looping links.
  • To limit the maximum running time of the script, define the "Maximum execution time" field (in seconds).
  • To be able to use the "Resume session" feature, define the "Save the script state" field. This value defines the interval at which the crawler state is saved, so if the script is interrupted you can continue the process from the last saved point. Set this value to "0" to disable saving.
  • To reduce the load the sitemap generator puts on your server, you can add a "sleep" delay of X seconds (configured) after each N requests (configured) to your site. Leave blank ("0") values to crawl the site without delays.
  • Google doesn't support sitemap files with more than 50,000 pages. That's why the script supports "Sitemap Index" creation for big sites: it will create one sitemap index file and multiple sitemap files with up to 50,000 pages each.
    For instance, your website has about 140,000 pages. The XML sitemap generator will create these files:
    • "sitemap.xml" - sitemap index file that includes links to other files (filename depends on what you entered in the "Save sitemap to" field)
    • "sitemap1.xml" - sitemap file (URLs from 1 to 50,000)
    • "sitemap2.xml" - sitemap file (URLs from 50,001 to 100,000)
    • "sitemap3.xml" - sitemap file (URLs from 100,001 to 140,000)
    Please make sure all of these files are writable if your website is large.
  • Enable the "Create HTML Sitemap" option to let the generator create a sitemap for your visitors. You should also define the "HTML Sitemap filename" where the sitemap will be stored. It is possible to split the HTML sitemap into multiple files by defining the "Number of links per page in HTML sitemap" option.
    The filenames are like the following:
    • "sitemap.html" - in case when all links fit in one file
      OR
    • "sitemap1.html" - site map file, page 1
    • "sitemap2.html" - site map file, page 2
    • etc

    Same as the point above: please make sure all of these files are writable.

    The sitemap page layout can be modified to suit your website in the pages/mods/sitemap_tpl.html file.
    Besides modifying the stylesheet for the HTML sitemap, you can change the way it is formatted. The basic template commands are:
    • <TLOOP XX>...</TLOOP> - defines a repeating sequence of code (like page numbers or sitemap links)
    • <TIF XX>...</TIF> - defines a conditional statement that is inserted only when a specific condition is met
    • <TVAR XX> - inserts a value of a specified variable
    Please refer to the sitemap_tpl.html file for a usage example.
  • Enable GZip compression of sitemap files to save on disk space and bandwidth. In this case ".gz" will be added to sitemap filenames (like "sitemap.xml.gz").
  • "Sitemap URL" is the same file entered in the "Save sitemap to" field, but in URL form. It is required to inform Google of your sitemap's address.
  • Enable the "Ping Google" checkbox to let the script inform Google on every sitemap change. This way Google will always know about the fresh information on your site.
  • If you want to restrict access to your generator pages, set the login and password here.
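For the 140,000-page example above, the generated sitemap index file would look roughly like this. This is a hand-written sketch following the sitemaps.org index format; the domain and lastmod values are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap1.xml</loc>
    <lastmod>2005-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap2.xml</loc>
    <lastmod>2005-11-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap3.xml</loc>
    <lastmod>2005-11-14</lastmod>
  </sitemap>
</sitemapindex>
```

Each referenced file (sitemap1.xml, etc.) is an ordinary sitemap with up to 50,000 URLs, which is why all of them must be writable.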

5. Usage

  1. The first step is the script "Configuration". The script will show alert messages if a problem is found (e.g., the config file is not writable).
    Do not forget to save the settings for your website after making the changes.
  2. Try to crawl your site using the "Crawling" page. Just press the "Run" button and you will see generation progress information, including:
    • Links depth
    • Current page
    • Pages scanned
    • Pages left
    • Time passed
    • Time left (estimated)
    Please be patient and wait for crawling to complete; for large sites it may take significant time. Upon completion the script will automatically redirect you to the "View Sitemap" page.
  3. For large websites you may want to use the "Run in background" option. In this case the crawler will keep working even after you click to another page or close your browser. Note that this option might not work depending on the server configuration.
  4. When your previous session was interrupted by you, or the script was suspended by the system, you can resume the process from the last saved state. The interval for state saving is defined on the "Configuration" screen.
  5. Later on you may want to set up a cron job to refresh your sitemap (described above in the "Installation" section).
  6. When the generator script is running (either via cron or using the "Run in background" feature), you will see its progress state on the "Crawling" page. There you will also find a link to stop the script, which is very useful for big sites because you don't have to wait for it to finish if you want to modify the configuration and re-run the script.
  7. On the "View Sitemap" page the content of the most recently generated sitemap is displayed. For large sites multiple parts are shown, including the sitemap index and every sitemap file separately.
  8. When the sitemap has been generated, a "Sitemap details" block appears in the left column of the pages. It contains a link to download the XML sitemap and also a sitemap in text format (one URL per line). Some other details are also available:
    • Request date
    • Processing time (sec)
    • Pages indexed
    • Sitemap files
    • Pages size (Mb)
  9. The "Analyze" feature allows you to easily investigate the site structure. It presents a tree-like list of the directories of your website, indicating the number of pages in every folder. You can expand/collapse parts of the tree by clicking the [x] signs.
  10. Sometimes it is very helpful to know the dynamics of the site's contents. The "ChangeLog" page shows the list of all crawling sessions, including:
    • Date/Time
    • Total pages
    • Proc.time, sec
    • Bandwidth, Mb
    • Number of New URLs
    • Number of Removed URLs
    • Number of Broken links
    You can click any session title to see a detailed page with the full list of "Added URLs" and "Removed URLs". On this page you can easily track how the website changes over time, which is especially useful for large, dynamic, database-driven sites.
  11. One more feature naturally supported by the website crawler is the "Broken Links" page. You will see the URLs of all pages that failed to load (HTTP code 404 was returned), along with the corresponding list of pages that refer to those broken pages. With this page on the screen you can easily fix the problem on your website.
  12. In conclusion: if you set up a cron job to run the sitemap generator script and enable the "Inform Google" feature, everything will work automatically without user interaction required.
    And you still can refer to interesting details at Analyze, ChangeLog, Broken Links and View Sitemap pages at any time.