I have set up a Squid server to use as a proxy for my web crawler.
For plain HTTP URLs, Squid works very well; for instance:
curl -is -L http://www.baidu.com --proxy http://<squid_address>:3129
returns the HTML source of Baidu's index page.
However, when I try to connect to an HTTPS URL, something odd happens. The following curl command returns nothing but a short, useless HTML snippet:
curl -is -L https://www.baidu.com --proxy http://<squid_address>:3129
The result (including the response headers):
HTTP/1.1 200 OK
Server: bfe/1.0.8.14
Date: Fri, 04 Mar 2016 01:24:26 GMT
Content-Type: text/html
Content-Length: 227
Connection: keep-alive
Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=4617F4EBA3F7D1A94B06FFE0B72E02B7; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1457054666; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
P3P: CP=" OTI DSP COR IVA OUR IND COM "
X-UA-Compatible: IE=Edge,chrome=1
Pragma: no-cache
Cache-control: no-cache
BDPAGETYPE: 1
BDQID: 0xd7f5c039000f7c84
BDUSERID: 0
Accept-Ranges: bytes
Set-Cookie: __bsi=16735848709246631383_00_175_N_N_1_0303_C02F_N_N_Y_0; expires=Fri, 04-Mar-16 01:24:31 GMT; domain=www.baidu.com; path=/

<html><head><script> location.replace(location.href.replace("https://","http://"));</script></head><body><noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript></body></html>
I suspect the cause is that the Baidu server detected that the incoming connection is not using HTTPS (note the BD_NOT_HTTPS=1 cookie, and the returned JavaScript that rewrites "https://" back to "http://").
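For reference, a minimal sketch of the squid.conf rules that govern HTTPS tunneling via CONNECT, assuming a default-style install with only the listening port changed to 3129 (I am not certain my setup matches this exactly):

```
# ACLs from the stock squid.conf: CONNECT tunnels are only
# permitted to the standard TLS port.
acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 443         # https
acl CONNECT method CONNECT

http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localhost
http_access deny all

# Plain forward-proxy listener, no intercept/ssl-bump options.
http_port 3129
```

My understanding is that for an https:// URL, curl should send a CONNECT request to the proxy and Squid should only relay the encrypted bytes; if the listening port carried intercept or ssl-bump options, or CONNECT were denied, the behavior would differ.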
How do I configure the Squid server so that it supports HTTPS URLs?