I am trying to set up a Selenium web scraper in a Jupyter notebook on Azure Databricks (on a Linux cluster). I can set up the service, options, and driver without issue, but when I call `driver.get(url)` I get either ERR_CONNECTION_RESET or a connection timeout.
Example 1 - dibbs website
- If I try driver.get("webpage") the connection times out.
Example 2 - google
- If I try driver.get("google") I get ERR_CONNECTION_RESET immediately.
My guess is that a firewall is somehow blocking the response, but I'm not sure, because in example 1 I first get a status code of 200 and then 500.
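One way to test the firewall guess from the same notebook is to try a plain TCP connection to each host, outside the browser. The following is a minimal sketch using only the standard library; `probe_tcp` is a hypothetical helper name, and the default port 443 is an assumption (HTTPS). It also prints the proxy settings the Python process inherited from the environment, which may differ from what Edge uses.

```python
import socket
import urllib.request


def probe_tcp(host, port=443, timeout=5.0):
    """Try a plain TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout (packets may be dropped silently).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except ConnectionResetError:
        # The connection was torn down by the peer or something in between.
        return "reset"
    except OSError as exc:
        # Anything else, for example a DNS lookup failure.
        return f"error: {exc}"


# Proxy settings picked up from the environment (empty dict if none are set).
print(urllib.request.getproxies())
```

If `probe_tcp` returns "timeout" or "reset" for the same hosts that fail in Selenium, that would point to the cluster's outbound network rules rather than to the driver setup. A "connected" result only shows that TCP is allowed, so it does not rule out filtering at the TLS or HTTP level.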
Screenshots of Output
- Example 1, actual output: [screenshot: what the Selenium debugging output actually looks like]
- Example 1, desired output: [screenshot: what the Selenium debugging output should look like]
- Example 2, output: [screenshot: what the Selenium debugging output actually looks like]
My Code
```python
# imports
import logging
import sys

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service as EdgeService
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set logging level to debug for selenium
logger = logging.getLogger('selenium')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s \n')
handler.setFormatter(formatter)
if not logger.hasHandlers():  # prevent duplicate handlers
    logger.addHandler(handler)

# set driver and executable paths
driver_executable_path = path
edge_binary_path = path

# set up the Edge WebDriver
service = EdgeService(driver_executable_path)
options = webdriver.EdgeOptions()
options.binary_location = edge_binary_path
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--disable-infobars")
driver = webdriver.Edge(service=service, options=options)

# URL of the awards dates page
url = "dibbs website"  # example 1
# url = "google"  # example 2

# Open the URL
driver.get(url)

# Wait for the consent page to load and click the "OK" button if present
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "butAgree"))
    ).click()
    print("Clicked OK on consent page.")
except Exception as e:
    print("No consent page or failed to click OK button:", e)
```
Requests - if it helps
I tried using the requests module in an effort to diagnose the issue. Here is the code and output.
```python
import requests

url = "stack overflow"
r = requests.get(url)
r.status_code
```
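A status code alone does not show whether a request failed at the HTTP level or at the transport level. The sketch below separates the two cases. It swaps in the standard library's `urllib.request` in place of `requests` so that it runs without extra packages; `fetch_status` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import socket
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code, or a short label for a transport failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server (or a proxy in between) answered with an error status.
        return exc.code
    except urllib.error.URLError as exc:
        # No HTTP response at all: DNS failure, refused connection, timeout, ...
        return f"failed: {exc.reason}"
    except (socket.timeout, ConnectionError) as exc:
        # The connection broke or stalled while the response was being read.
        return f"failed: {exc!r}"
```

An integer result means some server or proxy produced an HTTP response, while a "failed: ..." string means the connection itself did not complete. Running this for both example URLs might help tell a blocked connection apart from an error page returned by an intermediate proxy.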
Can anyone help me diagnose the root cause here?