Web Scraping with Playwright: A Practical 2026 Tutorial
Playwright is the headless-browser library most teams reach for in 2026 when a target site renders content with JavaScript or fingerprints automation. This guide walks through install, basic extraction, login flows, infinite scroll, proxy rotation, and the anti-bot pitfalls that catch most first-time scraper builds.
When to use Playwright (and when not to)
Reach for Playwright when the page you're scraping renders content via React, Vue, or another client-side framework — i.e., when the HTML you get from a plain HTTP request is mostly empty. Also reach for it when the target site fingerprints automation: WAFs like Cloudflare and DataDome check the TLS handshake, JavaScript execution, and behavioral signals that a plain HTTP client cannot satisfy.
Skip Playwright when the page is already server-rendered. A real browser uses 50–200× more CPU and memory per page than a plain HTTP request — using one when you don't need to is the most common over-engineering mistake in scraping.
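A quick way to check which case you're in is to fetch the page with a plain HTTP client and look for a marker you expect in the rendered result. A minimal sketch, assuming the requests library; the URL and the product-card marker are placeholders:

```python
import requests

resp = requests.get("https://example.com/products", timeout=10)

# If the marker is already in the raw HTML, the page is server-rendered
# and a plain HTTP client is enough; otherwise it renders client-side.
if "product-card" in resp.text:
    print("server-rendered: a plain HTTP client will do")
else:
    print("client-rendered: reach for Playwright")
```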
Install
Python (preferred for most data pipelines):
```
pip install playwright
playwright install chromium
```

Node.js:

```
npm install playwright
npx playwright install chromium
```

Basic extraction
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")
    products = page.eval_on_selector_all(
        ".product-card",
        """nodes => nodes.map(n => ({
            title: n.querySelector('.title')?.textContent?.trim(),
            price: n.querySelector('.price')?.textContent?.trim(),
            url: n.querySelector('a')?.href,
        }))"""
    )
    print(products)
    browser.close()
```

Two things to notice. First, wait_for_selector is how you avoid race conditions against the JavaScript render — never trust goto alone. Second, doing the field extraction inside eval_on_selector_all runs the loop in the browser, which is ~10× faster than pulling each element across the protocol bridge.
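To see where that factor comes from, compare the per-element version of the same extraction. This contrast sketch is not part of the pipeline above; every query_selector and text_content call below is its own round trip over the protocol bridge:

```python
# Slow variant: one protocol round trip per element and per field.
products = []
for card in page.query_selector_all(".product-card"):
    title = card.query_selector(".title")
    price = card.query_selector(".price")
    link = card.query_selector("a")
    products.append({
        "title": (title.text_content() or "").strip() if title else None,
        "price": (price.text_content() or "").strip() if price else None,
        "url": link.get_attribute("href") if link else None,
    })
```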
Infinite scroll
```python
prev_count = 0
while True:
    page.mouse.wheel(0, 4000)
    page.wait_for_timeout(800)
    count = page.eval_on_selector_all(".product-card", "n => n.length")
    if count == prev_count:
        break
    prev_count = count
```

Use page.mouse.wheel rather than page.evaluate("window.scrollTo(...)"): the wheel event triggers the scroll listeners many anti-bot systems use as a humanity check.
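One caveat: on a genuinely endless feed the count never plateaus and the loop above never exits. A defensive variant with a round cap and jittered waits; both numbers are assumptions to tune per target:

```python
import random

prev_count, rounds = 0, 0
while rounds < 50:                      # hard cap so endless feeds terminate
    page.mouse.wheel(0, 4000)
    page.wait_for_timeout(random.randint(600, 1400))  # jittered settle time
    count = page.eval_on_selector_all(".product-card", "n => n.length")
    if count == prev_count:             # no new cards: reached the end
        break
    prev_count = count
    rounds += 1
```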
Login flows
For sites where you have legitimate authenticated access, persist the storage state once and reuse it on every job. Don't replay login on each run — it triggers more anti-bot scrutiny than steady-state authenticated traffic.
```python
# One-time login
ctx = browser.new_context()
page = ctx.new_page()
page.goto("https://example.com/login")
page.fill("#email", USER)
page.fill("#password", PASS)
page.click("button[type=submit]")
page.wait_for_url("**/dashboard")
ctx.storage_state(path="auth.json")

# Every subsequent run
ctx = browser.new_context(storage_state="auth.json")
```
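Saved state eventually expires. A sketch for validating it at the start of a run; the /dashboard URL and the redirect-to-login behavior are assumptions about the target site:

```python
ctx = browser.new_context(storage_state="auth.json")
page = ctx.new_page()
page.goto("https://example.com/dashboard")

# Many sites bounce an expired session back to the login page.
if "/login" in page.url:
    pass  # repeat the one-time login above and re-save auth.json
```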
Proxies

```python
browser = p.chromium.launch(
    proxy={
        "server": "http://gateway.example.com:7777",
        "username": "USER",
        "password": "PASS",
    },
    headless=True,
)
```

For rotating residential or mobile proxies, point at the gateway URL — your provider rotates the exit IP per request. Datacenter proxies are cheap and fast but get blocked by mature anti-bot systems; use them only for sources you know don't fingerprint.
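To give different jobs different exits from one long-lived browser, you can also attach the proxy per context: new_context accepts the same proxy dict as launch. A sketch with placeholder gateway and credentials; note that on some platforms Chromium needs a browser-wide proxy set at launch before per-context proxies take effect:

```python
# Browser launched once; each context gets its own exit.
browser = p.chromium.launch(headless=True)
ctx = browser.new_context(
    proxy={
        "server": "http://gateway.example.com:7777",  # placeholder gateway
        "username": "USER",
        "password": "PASS",
    }
)
```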
Anti-bot pitfalls
- Default Chromium fingerprint. Anti-bot vendors recognise the stock Playwright build instantly. playwright-stealth patches the obvious tells; for serious targets you need a hardened browser image and a matching TLS/JA3 fingerprint.
- Headless mode detection. navigator.webdriver and missing chrome.runtime properties are the classic giveaways. Either run headed inside Xvfb or use a stealth plugin.
- Predictable timing. Real users don't click 200ms after page load. Add jitter — page.wait_for_timeout(random.randint(800, 2400)) — to interaction sequences.
- Browser pool reuse. Long-lived browsers accumulate state and get flagged. Recycle every 50–200 pages (a sketch of the recycling loop follows this list).
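A sketch of the recycling idea from the last bullet, assuming a flat list of URLs and content-dump extraction; the 100-page threshold is a placeholder to tune against your block rate:

```python
from playwright.sync_api import sync_playwright

PAGES_PER_BROWSER = 100  # placeholder; tune between ~50 and ~200

def scrape_all(urls):
    results = []
    with sync_playwright() as p:
        browser, served = None, 0
        for url in urls:
            # Recycle: tear the browser down after N pages so accumulated
            # state (cache, cookies, fingerprint drift) doesn't get flagged.
            if browser is None or served >= PAGES_PER_BROWSER:
                if browser is not None:
                    browser.close()
                browser = p.chromium.launch(headless=True)
                served = 0
            page = browser.new_page()
            page.goto(url)
            results.append(page.content())  # stand-in for real extraction
            page.close()
            served += 1
        if browser is not None:
            browser.close()
    return results
```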
When you outgrow it
Playwright is excellent for tens of thousands of pages a day on stable targets. It starts to strain past that — orchestration, monitoring, anti-bot upgrades, and selector drift become the real cost. At that scale most teams either move to a managed scraping API for the fetch layer or hand the whole pipeline to a managed provider. We've written about that decision in our tools category guide.
Skip the maintenance burden.
If you'd rather receive structured CSV than maintain a Playwright fleet, we deliver scheduled data on a fixed monthly contract.