luminati-io/undetected-chromedriver-web-scraping


Using Undetected ChromeDriver for Web Scraping

This guide explains how to use the Undetected ChromeDriver library for Python to bypass anti-bot systems for web scraping.

What Is Undetected ChromeDriver?

Undetected ChromeDriver is a Python library that offers a modified version of Selenium’s ChromeDriver. It minimizes browser "leaks" to reduce detection by anti-bot services like Imperva, DataDome, and Distil Networks, and can also help bypass some Cloudflare protections. This makes it especially useful for web scraping on sites with robust anti-scraping measures.

How It Works

Undetected ChromeDriver minimizes detection by Cloudflare, Imperva, DataDome, and similar solutions through several techniques:

  • Variable Renaming: It renames Selenium variables to mirror those used by genuine browsers.
  • Authentic User-Agent Strings: It employs real-world User-Agent strings to avoid being flagged.
  • Simulated Human Interaction: It allows for natural, human-like interactions.
  • Cookie & Session Management: It properly manages cookies and sessions during browsing.
  • Proxy Support: It enables the use of proxies to bypass IP blocking and rate limiting.

These strategies work together to help the browser controlled by the library effectively bypass anti-scraping defenses.
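To make these "leaks" concrete, here is a minimal, hypothetical server-side heuristic of the kind anti-bot services build on. It flags requests whose User-Agent advertises headless Chrome or whose navigator.webdriver flag is true; real systems combine hundreds of such signals, so this is only an illustration:

```python
def looks_automated(user_agent: str, navigator_webdriver: bool) -> bool:
    # Two classic leaks: the "HeadlessChrome" token in the User-Agent,
    # and navigator.webdriver being true in the page's JavaScript context
    return navigator_webdriver or "HeadlessChrome" in user_agent


# A default headless Selenium session trips both signals...
assert looks_automated(
    "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0.0.0", True
)
# ...while a patched browser with a realistic User-Agent trips neither
assert not looks_automated(
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0.0.0", False
)
```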

Using Undetected ChromeDriver for Web Scraping: Step-by-Step Guide

Many websites implement sophisticated anti-bot measures that are highly effective at stopping automated scripts, including web scraping bots, from accessing their content.

Let's scrape the title and description from the following GoDaddy product page:

The GoDaddy target page

With plain Selenium in Python, your scraping script will look like this:

# pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# configure a Chrome instance to start in headless mode
options = Options()
options.add_argument("--headless")

# create a Chrome web driver instance
driver = webdriver.Chrome(service=Service(), options=options)

# connect to the target page
driver.get("https://www.godaddy.com/hosting/wordpress-hosting")

# scraping logic...

# close the browser
driver.quit()

Running this script will fail because it will be blocked by an anti-bot solution (Akamai, in this case):

An "Access Denied" page from GoDaddy

To work around that, you need to use the undetected_chromedriver Python library.

Step #1: Prerequisites and Project Setup

Undetected ChromeDriver has the following prerequisites:

  • Latest version of Chrome
  • Python 3.6+: If Python 3.6 or later is not installed on your machine, download it from the official site and follow the installation instructions.

Note:

The library automatically downloads and patches the driver binary for you, so there is no need to manually download ChromeDriver.

Now, use the following command to create a directory for your project:

mkdir undetected-chromedriver-scraper

The undetected-chromedriver-scraper directory will serve as the project folder for your Python scraper.

Navigate into it and initialize a virtual environment:

cd undetected-chromedriver-scraper
python -m venv env

Open the project folder in your preferred Python IDE and create a scraper.py file inside it, following the structure shown below:

scraper.py in the project folder

Activate the virtual environment. On Linux or macOS, use:

source ./env/bin/activate

For Windows, run:

env\Scripts\activate

Step #2: Install Undetected ChromeDriver

In an activated virtual environment, install Undetected ChromeDriver:

pip install undetected_chromedriver 

Step #3: Initial Setup

Import undetected_chromedriver:

import undetected_chromedriver as uc

Initialize a Chrome WebDriver:

driver = uc.Chrome()

Like Selenium, this tool launches a browser window that you can control using the Selenium API. The driver object supports all standard Selenium methods, plus some extra features.

Important:

The main distinction is that this patched Chrome driver is engineered to bypass certain anti-bot solutions.

Call the quit() method to close the driver:

driver.quit() 

Here is a basic Undetected ChromeDriver setup:

import undetected_chromedriver as uc

# Initialize a Chrome instance
driver = uc.Chrome()

# Scraping logic...

# Close the browser and release its resources
driver.quit()

Step #4: Use It for Web Scraping

Use the get() method to navigate the browser to your target page:

driver.get("https://www.godaddy.com/hosting/wordpress-hosting")

Next, visit the page in incognito mode in your browser and inspect the element you want to scrape:

The DevTools inspection of the HTML elements to scrape data with

Let's extract the product title, tagline, and description. Here is how you can scrape all of these:

headline_element = driver.find_element(By.CSS_SELECTOR, "[data-cy=\"headline\"]")

title_element = headline_element.find_element(By.CSS_SELECTOR, "h1")
title = title_element.text

tagline_element = headline_element.find_element(By.CSS_SELECTOR, "h2")
tagline = tagline_element.text

description_element = headline_element.find_element(By.CSS_SELECTOR, "[data-cy=\"description\"]")
description = description_element.text

Import By from Selenium to make the above code work:

from selenium.webdriver.common.by import By

Store the scraped data in a Python dictionary:

product = {
  "title": title,
  "tagline": tagline,
  "description": description
}

Finally, export the data to a JSON file:

with open("product.json", "w") as json_file:
  json.dump(product, json_file, indent=4)

Import json from the Python standard library:

import json
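As a quick offline sanity check (using illustrative values standing in for the scraped fields, not live data), the export logic round-trips cleanly through json:

```python
import json
import io

# Illustrative values standing in for the scraped fields
product = {
    "title": "Managed WordPress Hosting",
    "tagline": "Get WordPress hosting, simplified",
    "description": "We make it easier to create, launch, and manage your WordPress site",
}

# Serialize to an in-memory buffer exactly as the script writes product.json
buffer = io.StringIO()
json.dump(product, buffer, indent=4)

# Reading the serialized text back yields the original dictionary
assert json.loads(buffer.getvalue()) == product
```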

Step #5: Put It All Together

This is the final scraping script:

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import json

# Create a Chrome web driver instance
driver = uc.Chrome()

# Connect to the target page
driver.get("https://www.godaddy.com/hosting/wordpress-hosting")

# Scraping logic
headline_element = driver.find_element(By.CSS_SELECTOR, "[data-cy=\"headline\"]")

title_element = headline_element.find_element(By.CSS_SELECTOR, "h1")
title = title_element.text

tagline_element = headline_element.find_element(By.CSS_SELECTOR, "h2")
tagline = tagline_element.text

description_element = headline_element.find_element(By.CSS_SELECTOR, "[data-cy=\"description\"]")
description = description_element.text

# Populate a dictionary with the scraped data
product = {
  "title": title,
  "tagline": tagline,
  "description": description
}

# Export the scraped data to JSON
with open("product.json", "w") as json_file:
  json.dump(product, json_file, indent=4)

# Close the browser and release its resources
driver.quit() 

Execute it:

python3 scraper.py

Or, on Windows:

python scraper.py

This will open a browser showing the target web page:

a browser showing the target web page

The script will extract data from the page and produce the following product.json file:

{
    "title": "Managed WordPress Hosting",
    "tagline": "Get WordPress hosting — simplified",
    "description": "We make it easier to create, launch, and manage your WordPress site"
}

Advanced Usage of undetected_chromedriver

Choosing a Specific Chrome Version

You can specify a particular version of Chrome for the library to use by setting the version_main argument:

import undetected_chromedriver as uc

# Specify the target version of Chrome
driver = uc.Chrome(version_main=105)

The library also works with other Chromium-based browsers, but that requires some additional tweaking.

The with Syntax

Use the with syntax to avoid manually calling the quit() method when you no longer need the driver:

import undetected_chromedriver as uc

with uc.Chrome() as driver:
    driver.get("<YOUR_URL>")

When the code inside the with block completes, Python will automatically close the browser for you.

Note:

This syntax is supported starting from version 3.1.0.

Proxy Integration

The syntax for adding a proxy to Undetected ChromeDriver is similar to regular Selenium. Simply pass your proxy URL to the --proxy-server flag as shown below:

import undetected_chromedriver as uc

proxy_url = "<YOUR_PROXY_URL>"

options = uc.ChromeOptions()
options.add_argument(f"--proxy-server={proxy_url}")

driver = uc.Chrome(options=options)

Note:

Chrome does not support authenticated proxies through the --proxy-server flag.
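If your proxy URL embeds credentials, one workaround is to split them out yourself: pass only scheme://host:port to --proxy-server and handle authentication another way (for example, IP allowlisting on the proxy provider's side). Here is a small standard-library helper for that split (illustrative, not part of undetected_chromedriver):

```python
from urllib.parse import urlsplit


def split_proxy(proxy_url: str):
    # Chrome's --proxy-server flag accepts only scheme://host:port,
    # so strip any embedded username/password and return them separately
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    return server, parts.username, parts.password


server, username, password = split_proxy("http://user:pass@proxy.example.com:8080")
# server is now safe to pass via --proxy-server
assert server == "http://proxy.example.com:8080"
assert (username, password) == ("user", "pass")
```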

Extended API

The undetected_chromedriver library has some extra methods that extend regular Selenium functionality:

  • WebElement.click_safe(): Use it when clicking a link causes detection.
  • WebElement.children(tag=None, recursive=False): Use it to easily find child elements. For example:
# Get the child at index 6 (of any tag) within the body, then find all <img> elements recursively
images = body.children()[6].children("img", True)
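The semantics of children() are easiest to see in a pure-Python analogue (a hypothetical Node class, not part of the library): by default it returns direct children, optionally filtered by tag, and recursive=True walks the whole subtree:

```python
class Node:
    # Hypothetical stand-in for a DOM element, for illustration only
    def __init__(self, tag, children=None):
        self.tag = tag
        self._children = children or []

    def children(self, tag=None, recursive=False):
        # Direct children by default; the full subtree when recursive=True;
        # filtered to a specific tag name when tag is given
        found = []
        for child in self._children:
            if tag is None or child.tag == tag:
                found.append(child)
            if recursive:
                found.extend(child.children(tag, recursive=True))
        return found


# <body><div><img></div><img></body>
body = Node("body", [Node("div", [Node("img")]), Node("img")])
assert [c.tag for c in body.children()] == ["div", "img"]  # direct children only
assert len(body.children("img", recursive=True)) == 2      # whole subtree
```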

Limitations of the undetected_chromedriver Library

While undetected_chromedriver is a powerful Python library, it has some known limitations. Here are the most important ones to be aware of.

IP Blocks

The GitHub page for the library clearly states: "This package does not hide your IP address". Running your script from a datacenter IP may still result in detection, and a home IP with a poor reputation can also lead to blocks.

IP Blocks Warning on GitHub

To hide your IP, you must integrate the controlled browser with a proxy server, as demonstrated earlier.

No Support for GUI Navigation

Due to how the module works, you need to navigate programmatically using the get() method. Avoid manual navigation through the browser GUI, as using your keyboard or mouse increases the risk of detection.

This rule also applies when managing new tabs. If you require multiple tabs, open a new one with a blank page by using the URL data:, (including the comma), which the driver accepts. Then, continue with your normal automation workflow.

Following these guidelines will help reduce detection and ensure smoother web scraping sessions.

Limited Support for Headless Mode

Since version 3.4.5, the undetected_chromedriver library has offered an experimental (read: not guaranteed) headless mode. Try it like this:

driver = uc.Chrome(headless=True)

Stability Issues

As noted on the package’s PyPI page, outcomes can vary due to many factors. While there's no guarantee of success, the developers continually work to understand and counter detection algorithms.

Alert about unpredictable results on PyPI

This means a script that bypasses anti-bot measures like Distil, Cloudflare, Imperva, DataDome, or hCaptcha today might fail if these defenses are updated tomorrow:

CAPTCHA triggered by Undetected ChromeDriver

The image above, taken from the official documentation, shows that even developer-provided scripts can sometimes trigger a CAPTCHA, potentially halting your automation.

Conclusion

While Undetected ChromeDriver provides a patched ChromeDriver for web scraping, advanced anti-bot systems like Cloudflare can still block your scripts. The issue isn’t with Selenium’s API but with the browser’s settings. The true solution is a cloud-based, always-updated, scalable browser with built-in anti-bot capabilities—enter Scraping Browser.

Create a free Bright Data account today to try out our scraping browser or test our proxies.
