Web Scraping with curl: When the Right Tool Is Just curl
Most web scraping tutorials reach for Python before the user has even checked whether the page ships its data over a clean JSON API. Often the right answer is a one-line curl command, piped into jq. This post covers the patterns that get you 80% of the way without touching a parsing library.
Inspect first, scrape second
Open the target page in Chrome DevTools, switch to the Network tab, filter by fetch / XHR, and reload. Most modern sites fetch their data as JSON in the background and render it client-side. If you spot a JSON endpoint that returns the data you actually want, you don't need a scraper — you need curl.
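A quick way to confirm a candidate endpoint actually returns parseable JSON is to pipe the body through a `jq` type check. A minimal sketch, with the response body simulated inline (the payload is a made-up example; in practice you would pipe the output of the real curl request into the same check):

```shell
# Simulated response body; replace with: curl -sS "<endpoint>" | jq -e ...
response='{"results":[{"title":"Widget","price":9.99,"sku":"W-1"}]}'

# jq -e sets the exit code from the filter result, so this doubles
# as a shell-friendly true/false test.
if printf '%s' "$response" | jq -e 'type == "object"' > /dev/null; then
  echo "endpoint returns JSON"
fi
```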
The basic GET
curl -sS "https://example.com/api/products?page=1" \
-H "accept: application/json" \
-H "user-agent: Mozilla/5.0" \
| jq '.results[] | {title, price, sku}'

-sS hides the progress bar but still shows errors. The user-agent header is critical: many APIs serve a redirect or a 403 to the default curl/8.x agent.
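To see what that jq filter actually does, here it is run against an inline sample payload (the field names mirror the hypothetical API shape used in the curl example above):

```shell
# Object construction in jq: {title, price, sku} is shorthand for
# {title: .title, price: .price, sku: .sku}; unlisted fields are dropped.
printf '%s' '{"results":[{"title":"Widget","price":9.99,"sku":"W-1","internal_id":42}]}' \
  | jq '.results[] | {title, price, sku}'
```

Note that `internal_id` is silently discarded, which is usually what you want when trimming an API response down to the columns you care about.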
Copy as cURL
DevTools → Network → right-click any request → Copy → Copy as cURL. Paste into your terminal and you have a working request with every header, cookie, and parameter the browser sent. Strip down to the headers that actually matter (often: cookie, user-agent, x-csrf-token) once you've confirmed it works.
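One way to do the stripping-down safely is to rebuild the pasted command as an argument array and print it before running it. This is a sketch, not a prescribed workflow; the URL, cookie, and token values are placeholders:

```shell
# Hypothetical pared-down request reconstructed from a Copy-as-cURL paste.
# Keeping the arguments in a bash array makes it easy to comment headers
# in and out while checking which ones the API actually requires.
args=(
  -sS "https://example.com/api/products?page=1"
  -H "user-agent: Mozilla/5.0"
  -H "cookie: session=PLACEHOLDER"
  -H "x-csrf-token: PLACEHOLDER"
)

# Print the final command for review instead of executing it blindly.
printf 'curl %s\n' "${args[*]}"
# To run it for real: curl "${args[@]}"
```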
Pagination loop
for page in $(seq 1 50); do
curl -sS "https://example.com/api/products?page=$page" \
-H "user-agent: Mozilla/5.0" \
| jq -c '.results[]'
sleep 1
done > products.jsonl

Output as JSON Lines (-c) makes downstream processing easy: every line is one record. sleep 1 is a polite floor; some APIs need slower, some tolerate faster. Always check response times and headers like Retry-After before tuning.
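Because each line of the output file is one self-contained record, filtering it later is a one-liner. A sketch with two inline records standing in for products.jsonl (the field names are illustrative):

```shell
# Two JSON Lines records simulating the scraped output file.
printf '%s\n' \
  '{"title":"Basic Widget","price":5}' \
  '{"title":"Deluxe Widget","price":12}' |
  jq -r 'select(.price > 10) | .title'   # keep only records over 10
```

The same pattern scales to millions of lines, since jq streams input line by line rather than loading the whole file.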
Cookies and sessions
# Save cookies on first request
curl -sS "https://example.com/login" \
-c cookies.txt -b cookies.txt \
--data "email=USER&password=PASS"
# Reuse on subsequent requests
curl -sS "https://example.com/api/account" -b cookies.txt

Through a proxy
curl -sS "https://target.example/api/items" \
--proxy "http://USER:[email protected]:7777"

When curl is no longer enough
- The page renders content with JavaScript and inspection finds no clean API. Move to Playwright or Selenium.
- Anti-bot systems fingerprint the TLS handshake (Cloudflare, Akamai). curl's TLS signature is recognisable; you'll need curl-impersonate or a real browser.
- You need parallelism above ~10 concurrent requests with retry and rate-limit awareness. Move to a real client (Python httpx or Go).
- The data needs structured parsing beyond what jq can handle. JSONPath / CSS-selector parsers in a real language make sense.
The case for staying in curl as long as you can
The two best things about a curl-based scraper are that it's trivially debuggable (you can run the exact request from anywhere) and trivially shareable (one line in a runbook). For internal-facing or one-off jobs, those two properties matter more than the things curl can't do.
For production-grade pipelines that need monitoring, retry, and selector-drift recovery, see our tools category guide.
Past the curl one-liner stage?
If you'd rather receive structured CSV than maintain bash scripts, we deliver scheduled data on a fixed monthly contract.