Engineering · April 22, 2026 · 10 min read

Web Scraping with Playwright: A Practical 2026 Tutorial

Playwright is the headless-browser library most teams reach for in 2026 when a target site renders content with JavaScript or fingerprints automation. This guide walks through install, basic extraction, login flows, infinite scroll, proxy rotation, and the anti-bot pitfalls that catch most first-time scraper builds.

When to use Playwright (and when not to)

Reach for Playwright when the page you're scraping renders content via React, Vue, or another client-side framework — i.e. the HTML you get from a plain HTTP request is mostly empty. Also reach for it when the target site fingerprints automation: WAFs like Cloudflare and DataDome check TLS handshake, JavaScript execution, and behavior signals that a plain HTTP client cannot satisfy.

Skip Playwright when the page is already server-rendered. A real browser uses 50–200× more CPU and memory per page than a plain HTTP request — using one when you don't need to is the most common over-engineering mistake in scraping.
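A quick way to make that call: fetch the page with a plain HTTP client and check whether the data you want is already in the raw HTML. A minimal sketch of that check using only the standard library; the helper names and the snippets tested for are illustrative, not part of any scraping API:

```python
import re
import urllib.request


def content_in_raw_html(html: str, expected_snippet: str) -> bool:
    """True if the snippet survives with tags stripped, i.e. the page
    is server-rendered and a plain HTTP client is enough."""
    text = re.sub(r"<[^>]+>", " ", html)
    return expected_snippet in text


def fetch_raw(url: str) -> str:
    """Plain HTTP fetch: no JavaScript execution, no browser."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Server-rendered page: the price is in the initial HTML.
assert content_in_raw_html("<div class='price'>$9.99</div>", "$9.99")
# Client-rendered page: only an empty mount point comes back.
assert not content_in_raw_html("<div id='root'></div>", "$9.99")
```

If the snippet is already there, stick with the HTTP client and save the browser overhead for pages that need it.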

Install

Python (preferred for most data pipelines):

pip install playwright
playwright install chromium

Node.js:

npm install playwright
npx playwright install chromium

Basic extraction

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    products = page.eval_on_selector_all(
        ".product-card",
        """nodes => nodes.map(n => ({
            title: n.querySelector('.title')?.textContent?.trim(),
            price: n.querySelector('.price')?.textContent?.trim(),
            url:   n.querySelector('a')?.href,
        }))"""
    )
    print(products)
    browser.close()

Two things to notice. First, wait_for_selector is how you avoid race conditions against the JavaScript render — never trust goto alone. Second, doing the field extraction inside eval_on_selector_all runs the loop in the browser, which is ~10× faster than pulling each element across the protocol bridge.

Infinite scroll

prev_count = 0
while True:
    page.mouse.wheel(0, 4000)
    page.wait_for_timeout(800)
    count = page.eval_on_selector_all(".product-card", "n => n.length")
    if count == prev_count:
        break
    prev_count = count

Use page.mouse.wheel rather than page.evaluate("window.scrollTo(...)"): the wheel event triggers the scroll listeners many anti-bot systems use as a humanity check.
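The stopping condition above (scroll until the item count stops growing) is worth factoring out so it can carry a safety cap against truly endless feeds. A minimal sketch with the scrolling and counting abstracted behind callables; the function names and the `max_rounds` cap are my additions, not Playwright API:

```python
from typing import Callable


def scroll_until_stable(scroll: Callable[[], None],
                        count_items: Callable[[], int],
                        max_rounds: int = 50) -> int:
    """Repeat scroll() until count_items() stops increasing,
    or until max_rounds is exhausted. Returns the final count."""
    prev = -1
    for _ in range(max_rounds):
        current = count_items()
        if current == prev:
            return current
        prev = current
        scroll()
    return prev


# With Playwright you would pass, for example:
#   scroll=lambda: (page.mouse.wheel(0, 4000), page.wait_for_timeout(800)),
#   count_items=lambda: page.eval_on_selector_all(".product-card", "n => n.length")
```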

Login flows

For sites where you have legitimate authenticated access, persist the storage state once and reuse it on every job. Don't replay login on each run — it triggers more anti-bot scrutiny than steady-state authenticated traffic.

# One-time login
ctx = browser.new_context()
page = ctx.new_page()
page.goto("https://example.com/login")
page.fill("#email", USER)
page.fill("#password", PASS)
page.click("button[type=submit]")
page.wait_for_url("**/dashboard")
ctx.storage_state(path="auth.json")

# Every subsequent run
ctx = browser.new_context(storage_state="auth.json")

Proxies

browser = p.chromium.launch(
    proxy={
        "server": "http://gateway.example.com:7777",
        "username": "USER",
        "password": "PASS",
    },
    headless=True,
)

For rotating residential or mobile proxies, point Playwright at the provider's gateway URL; the provider rotates the exit IP behind it on each request. Datacenter proxies are cheap and fast but quickly blocked by mature anti-bot systems, so use them only for sources you know don't fingerprint clients.
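Playwright also accepts the same `proxy` option on `browser.new_context(...)`, so with a list of gateways you can rotate at the context level instead of relaunching the browser. A minimal sketch of the rotation logic; the gateway hostnames are placeholders:

```python
from itertools import cycle

PROXIES = [
    {"server": "http://gw1.example.com:7777", "username": "USER", "password": "PASS"},
    {"server": "http://gw2.example.com:7777", "username": "USER", "password": "PASS"},
]
_rotation = cycle(PROXIES)


def next_proxy() -> dict:
    """Round-robin over the configured gateways."""
    return next(_rotation)


# Usage with Playwright:
# ctx = browser.new_context(proxy=next_proxy())
```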

Anti-bot pitfalls

  • Default Chromium fingerprint. Anti-bot vendors recognise the stock Playwright build instantly. playwright-stealth patches the obvious tells; for serious targets you need a hardened browser image and matching TLS / JA3 fingerprint.
  • Headless mode detection. navigator.webdriver and missing chrome.runtime properties are the classic giveaways. Either run headed inside Xvfb or use a stealth plugin.
  • Predictable timing. Real users don't click 200ms after page load. Add jitter — page.wait_for_timeout(random.randint(800, 2400)) — to interaction sequences.
  • Browser pool reuse. Long-lived browsers accumulate state and get flagged. Recycle every 50–200 pages.
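The last two bullets, jittered delays and periodic browser recycling, are easy to fold into the crawl loop's bookkeeping. A minimal sketch in plain Python; the 800–2400 ms window matches the bullet above, and the recycle threshold of 120 pages is an assumption to tune inside the 50–200 range:

```python
import random


def jitter_ms(low: int = 800, high: int = 2400) -> int:
    """Human-ish pause, for page.wait_for_timeout(jitter_ms())."""
    return random.randint(low, high)


def should_recycle(pages_done: int, limit: int = 120) -> bool:
    """Tear down and relaunch the browser every `limit` pages so
    accumulated state never gets old enough to be flagged."""
    return pages_done > 0 and pages_done % limit == 0
```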

When you outgrow it

Playwright is excellent for tens of thousands of pages a day on stable targets. It starts to strain past that — orchestration, monitoring, anti-bot upgrades, and selector drift become the real cost. At that scale most teams either move to a managed scraping API for the fetch layer or hand the whole pipeline to a managed provider. We've written about that decision in our tools category guide.

Skip the maintenance burden.

If you'd rather receive structured CSV than maintain a Playwright fleet, we deliver scheduled data on a fixed monthly contract.