When I use HTTPS as the URL scheme (https://), I get the response I want (the page source). With any HTTP URL (http://), the request either fails or returns what looks like a dump of request headers, when I expect to be redirected to the page source, and I don't understand why. This matters because the URLs I process can be either http:// or https://, and I need them to redirect appropriately.
Attempt #1: the link is http:// and the proxy is HTTPS-based. The result is the same with both HTTP and HTTPS proxies. The proxies are public.
    import requests
    from bs4 import BeautifulSoup
    from fake_useragent import UserAgent

    ua = UserAgent(browsers=['Edge', 'Chrome', 'Firefox', 'Google'], os='Windows', platforms='desktop')

    headers = {
        'Accept': 'application/json',
        'User-Agent': ua.random,  # generic user agent
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
    }

    htmlRequest = requests.get(
        "http://link.springer.com/10.1023/A:1012637309336",  # Another example link that presents the same behavior - https://ieeexplore.ieee.org/document/10152818/
        headers=headers,
        verify=False,  # verify is necessary for https proxy, or I'll receive an "Cannot set verify_mode to CERT_NONE when check_hostname is enabled" error. Either solution works, but not the focus.
        #verify="springer-com-chain.pem",  # verify is necessary for https proxy, or I'll receive an "Cannot set verify_mode to CERT_NONE when check_hostname is enabled" error. Either solution works, but not the focus. This file is downloaded directly from the link in the get request above.
        allow_redirects=True,
        #proxies={"http": "http://3.21.101.158:3128"},
        proxies={"http": "https://204.236.176.61:3128"},
        timeout=30
    )

    print(f"Status Code: {htmlRequest.status_code}")
    print(f"URL History: {htmlRequest.history}\n")

    soup = BeautifulSoup(htmlRequest.content, 'html.parser')
    print(soup.prettify())
Attempt #1 Error
    Status Code: 200
    URL History: []

    REMOTE_ADDR = 13.56.247.133
    REMOTE_PORT = 56719
    REQUEST_METHOD = GET
    REQUEST_URI = http://link.springer.com/10.1023/A:1012637309336
    REQUEST_TIME_FLOAT = 1739322382.2113674
    REQUEST_TIME = 1739322382
    HTTP_HOST = link.springer.com
    HTTP_USER-AGENT = Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
    HTTP_ACCEPT-ENCODING = gzip, deflate
    HTTP_ACCEPT = application/json
    HTTP_CONNECTION = keep-alive
    HTTP_ACCEPT-LANGUAGE = en-GB,en-US;q=0.9,en;q=0.8
The first line is the status code. It shows a 200 response, but the next line shows no redirect history. If I open the link in a browser, it automatically redirects to https://. I understand the stacks are different, but what is missing, especially since requests is supposed to handle redirects? What do I do with this header dump? Why am I receiving it? I could just manually make sure every URL in the GET request is https://, but then I wouldn't understand why this is an issue.
Attempt #2 works: changing the URL from http to https returns the page source and shows the various redirects.
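For reference, the manual workaround I mention (forcing every URL to https:// before the GET request) could look roughly like the sketch below. It uses only the standard library; `force_https` is a hypothetical helper name made up for illustration, and it assumes the target site actually serves the same path over HTTPS. It avoids the symptom but does not explain it.

```python
from urllib.parse import urlsplit


def force_https(url: str) -> str:
    """Return the URL with an http scheme swapped for https.

    URLs that already use another scheme (e.g. https) are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a named tuple, so _replace returns a modified copy.
        parts = parts._replace(scheme="https")
    return parts.geturl()


print(force_https("http://link.springer.com/10.1023/A:1012637309336"))
print(force_https("https://ieeexplore.ieee.org/document/10152818/"))
```

The rewritten string would then be passed to `requests.get` in place of the original URL.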
    ...
    htmlRequest = requests.get("https://link.springer.com/10.1023/A:1012637309336",
    ...
Thank you kindly and hopefully this is helpful for others!