I'm using the standalone version, which is not accessible to the world (so I cannot give you access). Some info:
- the sites are HTTPS to the world behind a load balancer that acts as an SSL proxy.
- the generator is on the same server as the website to be crawled.
- for other reasons there is a hosts entry for the domain on the web server (so other processes avoid the load balancer and hit the web server directly)
- I obviously want the sitemap to have HTTPS links in it
- I don't care if I crawl HTTP or HTTPS
So I can get this working OK using HTTP and hitting the website directly, but this generates an XML sitemap with HTTP links.
I've tried using the 'IP address for crawling' as the external IP and setting the starting URL as the HTTPS address but get:
There was an error while retrieving the URL specified: [external links are visible to admins only]
I can telnet to the external IP on port 443 OK, so it can certainly connect.
- Am I doing something wrong or missing a config setting?
- Is there a way to crawl HTTP but use HTTPS links? I'm guessing not.
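If there is no built-in option, one fallback I could live with is post-processing the generated file. A minimal sketch, assuming the sitemap is plain XML with `<loc>` entries and that a simple text rewrite is acceptable; `force_https` and the example hostnames are my own placeholders, not anything from the generator:

```python
import re


def force_https(sitemap_xml: str, host: str) -> str:
    """Rewrite http:// to https:// inside <loc> entries for one host only.

    Hypothetical helper: the xmlns declaration and links to other hosts
    are left untouched because only <loc>http://<host> is matched.
    """
    pattern = re.compile(
        r"(<loc>\s*)http://(" + re.escape(host) + r")(?=[/<\s])",
        re.IGNORECASE,
    )
    return pattern.sub(r"\1https://\2", sitemap_xml)


# Placeholder data standing in for the generated sitemap.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/page</loc></url>"
    "<url><loc>http://other.example.org/</loc></url>"
    "</urlset>"
)
fixed = force_https(sample, "www.example.com")
print(fixed)
```

This would run as a step after the HTTP crawl finishes, reading the sitemap file and writing it back. I would still prefer a proper setting if one exists.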
A side issue - I want to invoke the crawl via a remote scheduler using the web cron URL:
- I want to hit the generator internally using an IP, but this breaks the XSL stylesheet links. Can I adjust that? I've disabled the XSL stylesheet in the config for now, so it's not a massive issue.
- This web cron endpoint always seems to report a 200 even if there was a failure. I want the scheduler to know there was a failure so it can tell us. Is that possible?
- This web cron endpoint returns lots of HTML - it would be nice if there was a raw text response mode. Is that possible?
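To illustrate what I am after for the last two points, here is a rough sketch of the wrapper I would otherwise have to put between the scheduler and the web cron URL. It assumes a failed run prints a recognisable phrase in the HTML; the `ERROR_MARKERS` value is a guess based on the error message quoted above, and `check_response` and `main` are my own hypothetical names:

```python
import urllib.request
from html.parser import HTMLParser

# Assumption: failure pages contain one of these phrases (lower-case).
ERROR_MARKERS = ("there was an error",)


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def check_response(html: str):
    """Return (plain_text, failed) for a web cron response body."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    failed = any(marker in text.lower() for marker in ERROR_MARKERS)
    return text, failed


def main(url: str) -> int:
    """Fetch the web cron URL, print plain text, return an exit code."""
    with urllib.request.urlopen(url, timeout=600) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    text, failed = check_response(html)
    print(text)
    return 1 if failed else 0
```

The scheduler would run a small script that calls `raise SystemExit(main(url))` with the web cron URL, so a non-zero exit code signals a failure. A real status code and a plain text mode from the endpoint itself would make this unnecessary.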
Many thanks, Chris