Web Scraping with curl: When the Right Tool Is Just curl
Most web scraping tutorials reach for Python before the user has even checked whether the page ships its data over a clean JSON API. Often the right answer is a one-line curl command, piped into jq. This post covers the patterns that get you 80% of the way without touching a parsing library.
Inspect first, scrape second
Open the target page in Chrome DevTools, switch to the Network tab, filter by fetch / XHR, and reload. Most modern sites fetch their data as JSON in the background and render it client-side. If you spot a JSON endpoint that returns the data you actually want, you don't need a scraper — you need curl.
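A quick way to confirm a candidate endpoint actually returns parseable JSON is to pipe the body through a `jq` type check. A minimal sketch, with the response body simulated inline (the payload is a made-up example; in practice you would pipe the output of the real curl request into the same check):

```shell
# Simulated response body; replace with: curl -sS "<endpoint>" | jq -e ...
response='{"results":[{"title":"Widget","price":9.99,"sku":"W-1"}]}'

# jq -e sets the exit code from the filter result, so this doubles
# as a shell-friendly true/false test.
if printf '%s' "$response" | jq -e 'type == "object"' > /dev/null; then
  echo "endpoint returns JSON"
fi
```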
The basic GET
curl -sS "https://example.com/api/products?page=1" \
-H "accept: application/json" \
-H "user-agent: Mozilla/5.0" \
| jq '.results[] | {title, price, sku}'

-sS hides the progress bar but still shows errors. The user-agent header is critical: many APIs serve a redirect or a 403 to the default curl/8.x agent.
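To see what that jq filter actually does, here it is run against an inline sample payload (the field names mirror the hypothetical API shape used in the curl example above):

```shell
# Object construction in jq: {title, price, sku} is shorthand for
# {title: .title, price: .price, sku: .sku}; unlisted fields are dropped.
printf '%s' '{"results":[{"title":"Widget","price":9.99,"sku":"W-1","internal_id":42}]}' \
  | jq '.results[] | {title, price, sku}'
```

Note that `internal_id` is silently discarded, which is usually what you want when trimming an API response down to the columns you care about.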
Copy as cURL
DevTools → Network → right-click any request → Copy → Copy as cURL. Paste into your terminal and you have a working request with every header, cookie, and parameter the browser sent. Strip down to the headers that actually matter (often: cookie, user-agent, x-csrf-token) once you've confirmed it works.
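One way to do the stripping-down safely is to rebuild the pasted command as an argument array and print it before running it. This is a sketch, not a prescribed workflow; the URL, cookie, and token values are placeholders:

```shell
# Hypothetical pared-down request reconstructed from a Copy-as-cURL paste.
# Keeping the arguments in a bash array makes it easy to comment headers
# in and out while checking which ones the API actually requires.
args=(
  -sS "https://example.com/api/products?page=1"
  -H "user-agent: Mozilla/5.0"
  -H "cookie: session=PLACEHOLDER"
  -H "x-csrf-token: PLACEHOLDER"
)

# Print the final command for review instead of executing it blindly.
printf 'curl %s\n' "${args[*]}"
# To run it for real: curl "${args[@]}"
```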
Pagination loop
for page in $(seq 1 50); do
curl -sS "https://example.com/api/products?page=$page" \
-H "user-agent: Mozilla/5.0" \
| jq -c '.results[]'
sleep 1
done > products.jsonl

Output as JSON Lines (-c) makes downstream processing easy: every line is one record. sleep 1 is a polite floor; some APIs need slower, some tolerate faster. Always check response times and headers like Retry-After before tuning.
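Because each line of the output file is one self-contained record, filtering it later is a one-liner. A sketch with two inline records standing in for products.jsonl (the field names are illustrative):

```shell
# Two JSON Lines records simulating the scraped output file.
printf '%s\n' \
  '{"title":"Basic Widget","price":5}' \
  '{"title":"Deluxe Widget","price":12}' |
  jq -r 'select(.price > 10) | .title'   # keep only records over 10
```

The same pattern scales to millions of lines, since jq streams input line by line rather than loading the whole file.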
Cookies and sessions
# Save cookies on first request
curl -sS "https://example.com/login" \
-c cookies.txt -b cookies.txt \
--data "email=USER&password=PASS"
# Reuse on subsequent requests
curl -sS "https://example.com/api/account" -b cookies.txt

Through a proxy
curl -sS "https://target.example/api/items" \
--proxy "http://USER:[email protected]:7777"

When curl is no longer enough
- The page renders content with JavaScript and inspection finds no clean API. Move to Playwright or Selenium.
- Anti-bot systems fingerprint the TLS handshake (Cloudflare, Akamai). curl's TLS signature is recognisable; you'll need curl-impersonate or a real browser.
- You need parallelism above ~10 concurrent requests with retry and rate-limit awareness. Move to a real client (Python httpx or Go).
- The data needs structured parsing beyond what jq can handle. JSONPath / CSS-selector parsers in a real language make sense.
The case for staying in curl as long as you can
The two best things about a curl-based scraper are that it's trivially debuggable (you can run the exact request from anywhere) and trivially shareable (one line in a runbook). For internal-facing or one-off jobs, those two properties matter more than the things curl can't do.
For production-grade pipelines that need monitoring, retry, and selector-drift recovery, see our tools category guide.
Past the curl one-liner stage?
If you'd rather receive structured CSV than maintain bash scripts, we deliver scheduled data on a fixed monthly contract.