Engineering · April 30, 2026 · 9 min read

Headers, Pagination, and CAPTCHAs: The Three Walls of Web Scraping

Most scrapers don't fail because the parsing logic is wrong — they fail because of headers, pagination, or CAPTCHAs. This post covers all three at the level of "what's actually blocking you" rather than "what does HTTP say."

Part 1: Web scraping headers

The headers you send say more about whether you'll be blocked than the IP you send from. A real Chrome request carries 12–15 headers in a specific order, with values that match each other (the User-Agent's stated platform must match the sec-ch-ua-platform hint, etc.). A default requests or curl call sends 3, in the wrong order, with mismatched values.

The headers that matter most (a minimal Chrome-like set is sketched after this list):

  • User-Agent. Default python-requests/2.x is blocked by most defended sites. Use a current Chrome / Firefox UA — and update it; UAs older than 6 months are themselves a signal.
  • Accept / Accept-Language / Accept-Encoding. Real browsers always send these. Missing values are a tell.
  • Sec-Fetch-* headers. Chrome attaches sec-fetch-site, sec-fetch-mode, sec-fetch-dest to every request. Most non-browser HTTP clients omit them.
  • Sec-Ch-Ua-* client hints. Modern Chrome sends these on top-level requests; their values must be coherent with the User-Agent.
  • Cookie. A request that should be authenticated but arrives without a cookie is an obvious bot signal.
  • Referer. Real users arrive from somewhere. A direct request to a deep product URL with no Referer is suspicious on most sites.
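
Pulling those together, here is a minimal sketch of a Chrome-like header set sent with Python's requests library. The specific values (Chrome 124, Windows, the client-hint strings) are illustrative, not canonical; copy fresh ones from DevTools → Network on a real browser visit. Note the limits: requests only gives you dict-order control over headers and speaks HTTP/1.1, so this fixes the header layer but not the TLS layer discussed next.

```python
import requests

# Illustrative Chrome-like header set. The values (Chrome version, platform,
# client-hint strings) are examples only; copy fresh ones from
# DevTools -> Network -> Request Headers on a real browser visit.
CHROME_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # "br" needs the brotli package installed; drop it otherwise.
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://example.com/",  # real users arrive from somewhere
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    # Client hints must stay coherent with the User-Agent above.
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
}

session = requests.Session()
session.headers.clear()               # drop the python-requests defaults
session.headers.update(CHROME_HEADERS)

resp = session.get("https://example.com/products", timeout=30)
resp.raise_for_status()
```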

Beyond headers, anti-bot vendors fingerprint the TLS handshake itself (JA3 / JA4) — the order of cipher suites, extensions, and supported curves. requests and stock curl have signatures vendors recognise on sight. Fixes: curl-impersonate, tls-client in Python, or a real browser via Playwright.
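
One way to get a browser-grade TLS fingerprint from Python is curl_cffi, the binding around curl-impersonate. Below is a minimal sketch, assuming curl_cffi is installed; the impersonation target depends on the profiles your installed version ships, so treat "chrome" as illustrative and check the library's docs.

```python
# pip install curl_cffi   (Python bindings for curl-impersonate)
from curl_cffi import requests as curl_requests

# impersonate=... makes curl present a Chrome-like ClientHello (cipher
# suites, extensions, curves), so the JA3/JA4 fingerprint matches a real
# browser instead of stock libcurl. "chrome" is illustrative; check which
# profiles your installed version supports.
resp = curl_requests.get(
    "https://example.com/products",
    impersonate="chrome",
    headers={"Accept-Language": "en-US,en;q=0.9"},
)
print(resp.status_code, len(resp.text))
```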

Part 2: Pagination patterns

Pagination is where a scraper that "works on the first page" silently breaks. The four patterns you'll meet:

  • Numbered pages (?page=2). The simplest. Loop until you stop seeing the next-page link or the result list goes empty. Watch for soft 200 responses that hide "no more results" inside the HTML.
  • Cursor / token-based. The response carries a next_cursor field; you pass it on the next request. More resilient against duplicate / missing pages than numbered pagination (a minimal loop is sketched after this list).
  • Offset / limit (?offset=200&limit=50). Common on REST APIs. Watch for backends that change ordering between requests — you'll get duplicates and gaps. Add a stable sort key.
  • Infinite scroll. No URL changes; the page fires an XHR for each batch. Inspect DevTools → Network for the actual API call and treat it as cursor-based or offset-based depending on what it returns.
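
For the cursor pattern (which is also what most infinite-scroll XHRs reduce to once you find them in DevTools), the loop looks roughly like this. The endpoint, the field names items and next_cursor, and the limit parameter are hypothetical placeholders; substitute whatever the site's API actually returns.

```python
import time
import requests

def fetch_all(session: requests.Session, url: str) -> list[dict]:
    """Follow a next_cursor token until the API stops returning one.

    The field names (items, next_cursor) and the limit parameter are
    placeholders; inspect DevTools -> Network to find the real ones.
    """
    records: list[dict] = []
    cursor = None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor
        resp = session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()

        batch = payload.get("items", [])
        records.extend(batch)

        cursor = payload.get("next_cursor")
        if not cursor or not batch:   # no token or an empty page: done
            break
        time.sleep(1.0)               # pacing between pages
    return records
```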

Two pagination traps worth naming:

  • Hard caps. Many sites cap results at 1,000 or 10,000 records and quietly stop returning new ones past that point. If the count of records you collect is suspiciously round, you're hitting a cap — split the query into smaller filters (by category, by date range, by ZIP) until each shard fits under the cap; a sketch follows this list.
  • Order drift. If the site re-sorts on every request, paginating by page number gives you duplicates and gaps; switch to a deterministic cursor or a stable sort key.
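
Here is a sketch of the sharding idea for hard caps, splitting one query into month-long date windows and flagging any window that still looks capped. The 1,000-record cap and the fetch_window callable are hypothetical; plug in your own paginated fetcher and whichever filters the site supports.

```python
from datetime import date, timedelta

CAP = 1000  # illustrative: the point where results quietly stop growing

def month_windows(start: date, end: date):
    """Yield (window_start, window_end) pairs, one per calendar month."""
    cur = start
    while cur < end:
        nxt = (cur.replace(day=1) + timedelta(days=32)).replace(day=1)
        yield cur, min(nxt, end)
        cur = nxt

def fetch_sharded(fetch_window, start: date, end: date) -> list[dict]:
    """fetch_window(date_from, date_to) is your own paginated fetcher.

    If any single window still returns a cap-sized result, split it
    further (by week, category, ZIP, ...) and retry that window.
    """
    records: list[dict] = []
    for win_start, win_end in month_windows(start, end):
        batch = fetch_window(win_start, win_end)
        if len(batch) >= CAP:
            print(f"warning: {win_start}..{win_end} likely capped; split further")
        records.extend(batch)
    return records
```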

Part 3: CAPTCHAs

CAPTCHAs are not a single thing — they're a category. The category determines what (if anything) you should do.

  • Cloudflare Turnstile / hCaptcha invisible. Often passes silently when your fingerprint and behavior look human. The fix is upstream — better headers, real browser, behavioral jitter — not "solve the CAPTCHA."
  • reCAPTCHA v2 (image grids). Designed to be hard to automate. Solver services exist and work but are slow, expensive, and may be against the target site's TOS.
  • reCAPTCHA v3 (score-based). Returns a risk score 0–1; the site decides what threshold to act on. Improving fingerprint and behavior usually moves your score enough to clear.
  • Custom anti-bot challenges. Akamai Bot Manager, DataDome, PerimeterX, Kasada — each runs JavaScript challenges in the browser. Solving them programmatically is a moving target; mature anti-bot vendors update challenges every few weeks. The practical answer is a real browser with stealth or a managed scraping API that handles the challenges as a service.
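
For the real-browser route, a minimal Playwright sketch is below. It shows only the plain Playwright skeleton plus simple behavioral jitter; dedicated stealth plugins that patch obvious automation tells (navigator.webdriver and friends) layer on top of this, but their APIs change often, so none is shown here.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Headed mode with a realistic locale/viewport looks far more human
    # than a bare headless launch.
    browser = p.chromium.launch(headless=False)
    context = browser.new_context(
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto("https://example.com/products", wait_until="networkidle")

    # Behavioral jitter: scroll and pause the way a person would.
    page.mouse.wheel(0, 1200)
    page.wait_for_timeout(1500)

    html = page.content()
    browser.close()
```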

Two principles. First, treat CAPTCHAs as a downstream symptom of a bad fingerprint, not a problem to solve. If you fix the fingerprint, most CAPTCHAs stop firing. Second, respect the legal layer. Some jurisdictions interpret bypassing access controls as a CFAA-adjacent issue; CAPTCHAs are ambiguous here, but it's a fact pattern worth scoping with counsel before automating around them at scale. See our legal updates hub for current case law.

The practical hierarchy

  1. Send the right headers in the right order with coherent values.
  2. Match TLS fingerprint to the browser you're impersonating.
  3. Add behavioral jitter — pacing, mouse movement on real-browser scrapes.
  4. Inspect pagination patterns first; never trust ?page=N without sanity checks.
  5. Treat CAPTCHAs as a fingerprint diagnostic, not a wall to brute-force.
  6. When the maintenance cost exceeds the data value, hand the fetch layer to a managed service.

Past the cat-and-mouse stage?

We deliver structured CSV / JSON on a schedule — proxies, headers, anti-bot, and pagination all included.