SSL proxy
« on: April 28, 2016, 06:46:26 PM »
Hi,

I'm using the standalone generator on a server that is not accessible to the world (so I cannot give you access). Some info:

- the sites are HTTPS to the world behind a load balancer that acts as an SSL proxy.
- the generator is on the same server as the website to be crawled.
- for other reasons there is a hosts entry for the domain on the web server (so other processes avoid the load balancer and hit the web server directly)
- I obviously want the sitemap to have HTTPS links in it
- I don't care if I crawl HTTP or HTTPS

So I can get this working OK over HTTP by hitting the website directly, but this generates an XML sitemap with HTTP links.

I've tried setting the 'IP address for crawling' option to the external IP and the starting URL to the HTTPS address, but I get:

There was an error while retrieving the URL specified: [ External links are visible to forum administrators only ]

I can telnet to the external IP on port 443 OK, so it can certainly connect.

So,

- Am I doing something wrong or missing a config setting?
- Is there a way to crawl HTTP but use HTTPS links? I'm guessing not.

A side issue - I want to invoke the crawl via a remote scheduler using the web cron URL:

 - I want to hit the generator internally using an IP but this breaks the XSL stylesheet links - can I adjust that? I've disabled the XSL stylesheet for now in the config so not a massive issue.
 - This web cron endpoint always seems to report a 200 even if there was a failure. I want the scheduler to know there was a failure so it can tell us. Is that possible?
 - This web cron endpoint returns lots of HTML - it would be nice if there was a raw text response mode. Is that possible?

Many thanks, Chris
Re: SSL proxy
« Reply #1 on: April 29, 2016, 04:47:14 PM »
OK. Can I ask then; how should the IP address for crawling work? Does it bypass resolving the domain?
Re: SSL proxy
« Reply #2 on: April 30, 2016, 04:54:57 PM »
Hello,

>> OK. Can I ask then; how should the IP address for crawling work? Does it bypass resolving the domain?

Yes, it does.
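
For anyone wondering what "bypassing resolution" means in practice, here is a minimal Python sketch of the general technique: open the connection to a fixed IP, but still send the site's domain in the Host header (and, for HTTPS, as the TLS server name). This is only an illustration and not the generator's own code; the function names, the example hostname and the IP in the usage note are all made up for the example.

```python
import socket
import ssl


def build_get_request(hostname: str, path: str = "/") -> bytes:
    """Build a raw HTTP/1.1 GET whose Host header names the site's domain."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_via_ip(ip: str, hostname: str, path: str = "/",
                 use_tls: bool = True, timeout: float = 10.0) -> bytes:
    """Connect to `ip` directly (no DNS lookup) and request `path` for `hostname`.

    Hypothetical helper for illustration only.
    """
    port = 443 if use_tls else 80
    sock = socket.create_connection((ip, port), timeout=timeout)
    try:
        if use_tls:
            context = ssl.create_default_context()
            # server_hostname sets SNI and is the name the certificate is
            # checked against, even though we connected to a bare IP.
            sock = context.wrap_socket(sock, server_hostname=hostname)
        sock.sendall(build_get_request(hostname, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()
```

Something like `fetch_via_ip("203.0.113.10", "www.example.com")` would then talk to that IP while presenting the domain name. A plain telnet to port 443 only proves the TCP connection works; the TLS handshake and certificate check happen after that, so a failure can still occur at that stage.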

>> - Am I doing something wrong or missing a config setting?

You can try using the Curl setting in the generator config with the HTTPS link.

>> - Is there a way to crawl HTTP but use HTTPS links? I'm guessing not.

No.
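
Since the generator itself cannot do this, one possible workaround outside the generator is to post-process the finished sitemap file and switch the scheme in each `<loc>` entry. The sketch below is an assumption-laden example, not a generator feature: it assumes a standard sitemaps.org `urlset` file, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace. Note that it starts with "http://" itself,
# which is why a blind search-and-replace over the whole file is risky.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def rewrite_sitemap_to_https(xml_text: str) -> str:
    """Return the sitemap with every http:// <loc> URL changed to https://."""
    # Keep the sitemap namespace as the default one when serialising.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(xml_text)
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        url = (loc.text or "").strip()
        if url.startswith("http://"):
            loc.text = "https://" + url[len("http://"):]
    return ET.tostring(root, encoding="unicode")
```

Only the `<loc>` values are touched, so the namespace declaration stays intact. One caveat: parsing this way drops anything that sits before the root element, such as an `xml-stylesheet` line, so it suits the case where the XSL stylesheet is disabled anyway.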

>>  - I want to hit the generator internally using an IP but this breaks the XSL stylesheet links - can I adjust that? I've disabled the XSL stylesheet for now in the config so not a massive issue.

You can modify generator/pages/mods/sitemap_xml_tpl.xml file for that.

>>  - This web cron endpoint always seems to report a 200 even if there was a failure. I want the scheduler to know there was a failure so it can tell us. Is that possible?

>>  - This web cron endpoint returns lots of HTML - it would be nice if there was a raw text response mode. Is that possible?

Neither is possible, since the web cron URL is essentially the same as running the generator in a browser.
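
If the scheduler can run a script instead of calling the URL directly, a small wrapper could work around both points: fetch the web cron page, reduce the HTML to plain text, and return a non-zero status when the text contains an error message. The Python sketch below is only a rough idea. The error marker is taken from the message quoted earlier in this thread ("There was an error ..."); other failure messages may exist and would need adding, and all function names are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser

# Phrases treated as a failure. Only the first is known from this thread;
# extend the tuple with whatever other messages your installation shows.
ERROR_MARKERS = ("There was an error",)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML page to its visible text, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def crawl_failed(html: str) -> bool:
    """True if the page text contains any known error marker."""
    text = html_to_text(html).lower()
    return any(marker.lower() in text for marker in ERROR_MARKERS)


def run_web_cron(url: str, timeout: float = 3600.0) -> int:
    """Call the web cron URL, print plain text, return 0 on success, 1 on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    print(html_to_text(body))
    return 1 if crawl_failed(body) else 0
```

The scheduler would then call something like `sys.exit(run_web_cron(your_web_cron_url))` and alert on a non-zero exit code. Matching on message text is fragile, so it is worth checking the markers against real output from a failed run.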