What HTTP status codes mean (briefly)
Every request gets back a status code. 2xx is success, 3xx is redirect, 4xx is “you did something wrong” (or the site says you did), 5xx is “the server did something wrong”. The right retry behaviour for a 429 is wrong for a 403, and a 200 isn’t always actually a success — read the body shape too.
Status code reference
The full list scrapers actually encounter — what each one means in practice, and the first thing to try.
400: Malformed syntax, invalid headers, improperly encoded parameters. Often a coding bug, not a target-site issue.
401: Target requires authentication. Common on APIs requiring bearer tokens or pages behind a login wall.
403: The most common scraping error. Triggered by bot detection, WAFs (Cloudflare, Akamai), IP blacklists, or missing browser-like headers. First thing to try: stealth=true + premium_proxy=true.
404: URL doesn’t exist. Stale sitemaps, restructured URL patterns, or dynamic IDs that have rotated.
408: Server didn’t receive a complete request in time. Usually network-side, sometimes a sign of an upstream issue.
410: Resource permanently removed (vs 404 which is “maybe just missing”). Treat as a hard signal to drop the URL.
418: Some anti-bot vendors return 418 for known-bot fingerprints; Twitter’s old API famously did this for unauthenticated calls. First thing to try: stealth=true; the request is being explicitly fingerprint-rejected.
422: Our API uses 422 for refused requests: out-of-scope CAPTCHAs (login, banking), invalid combinations of params, or terms-violating targets.
429: Target’s rate limit hit. Often per-IP or per-session; sometimes per-account. The polite ones include a Retry-After header. First thing to try: auto_proxy=true for IP rotation.
451: Geo-blocked content (GDPR opt-outs, regional licensing). Usually fixable with the right proxy_country. First thing to try: set proxy_country to a permitted region.
499: Nginx’s code for “client gave up before we replied”. Usually a timeout on your side.
500: Genuine server failure on the target. Retryable; not your fault.
502: Upstream server is misbehaving. Common when the target uses a load balancer with a sick backend.
503: Server overloaded or maintenance. Also used by anti-bot systems (Cloudflare) as JS challenge pages. First thing to try: js=true to execute challenge pages.
504: Upstream took too long. Server-side timeout (vs 408 = client-side). First thing to try: wait_for for slow pages.
52x (Cloudflare-specific): Origin returned an empty/invalid response. Usually a sick origin behind CF.
52x (Cloudflare-specific): Cloudflare can’t reach the origin. Origin really is down.
52x (Cloudflare-specific): Cloudflare reached the origin but it didn’t respond in time.
The retry policy table
Copy this into your scraper. Each row is the safe default — short on retries, polite on backoff, no infinite loops. auto_proxy=true implements all of this server-side, but you’ll want the same shape on your own client retries too.
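As a rough illustration of the shape such a policy can take on the client side, here is a minimal Python sketch. The names RETRY_POLICY, parse_retry_after and next_delay are hypothetical, and the retry counts and base delays are illustrative assumptions drawn loosely from the guidance in this article (exponential backoff starting at 1s, a longer backoff for 504), not the provider's actual values.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Illustrative defaults only (assumption): status -> (max_retries, base_delay_seconds).
# Any status not listed here is treated as "do not retry".
RETRY_POLICY = {
    408: (2, 1.0),
    429: (4, 1.0),
    500: (2, 1.0),
    502: (2, 1.0),
    503: (4, 1.0),
    504: (2, 2.0),  # longer backoff for upstream timeouts
}


def parse_retry_after(value, now=None):
    """Return the seconds to wait from a Retry-After value, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "30"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(status, attempt, retry_after=None):
    """Return seconds to wait before retrying (attempt is 0-based), or None to stop."""
    policy = RETRY_POLICY.get(status)
    if policy is None:
        return None
    max_retries, base = policy
    if attempt >= max_retries:
        return None  # bounded retries: no infinite loops
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return hinted  # the server's own hint wins over our backoff
    return base * 2 ** attempt
```

A caller would sleep for the returned number of seconds and retry, and give up (or rotate IP, or drop the URL) when next_delay returns None.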
Why these errors actually fire
WAFs, fingerprinting, and behavioural analysis identify automated traffic. Missing headers, no JS, or a bot-class TLS hash trigger 403.
Servers enforce per-IP / per-session quotas. Exceeding returns 429 with a Retry-After. Some sites throttle silently before blocking outright.
Content gated to specific regions returns 403 or 451. Datacenter IPs from unexpected geos are the loudest tell.
Routes that don’t exist as server resources return 404 to plain HTTP — they need JS to render. Enable js=true and check.
5xx codes (especially 502/520–524) often have nothing to do with you. The target’s LB or origin is having a bad day.
On authenticated APIs, 429 sometimes counts per-account, not per-IP. Rotating proxy doesn’t help; slowing down does.
Why your 200 might be a lie
Not every 200 is a real success. Anti-bot vendors regularly serve a 200 with a CAPTCHA HTML body — your scraper sees “OK”, your data pipeline sees garbage. Always validate the body shape, not just the status.
    # Substrings that commonly appear in challenge or block pages served with a 200.
    SOFT_BLOCK_MARKERS = (
        'cf-browser-verification',
        'g-recaptcha',
        'datadome', 'dd-blocked',
        'pmx-protected',
        'just a moment',
    )

    def is_real_response(html, expected_selector='article'):
        lowered = html.lower()
        # A known anti-bot marker anywhere in the body means a soft block.
        if any(m in lowered for m in SOFT_BLOCK_MARKERS):
            return False
        # A short page that mentions a CAPTCHA is almost certainly a challenge.
        if len(html) < 2000 and 'captcha' in lowered:
            return False
        # Crude substring check, not a real CSS selector match.
        return expected_selector in html

Common “200 lying” patterns: Cloudflare interstitials, DataDome blocked landings, login redirects served as 200 instead of 302, and empty SPA shells that need JS to render. We expose X-Body-Class on the response with a hint when we detect one of these.
Hidden retry-after headers (read them)
The polite servers tell you exactly when to come back. A 429 or 503 often includes Retry-After — either seconds (Retry-After: 30) or an HTTP date. Honour it before retrying. Hammering an endpoint without backoff is the fastest way to escalate a soft block into a permanent IP ban.
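    async function fetchWithRetry(url, attempt = 0) {
      const res = await fetch(url);
      if ((res.status === 429 || res.status === 503) && attempt < 4) {
        const after = res.headers.get('Retry-After');
        // Retry-After is either a number of seconds or an HTTP date.
        let wait = Number(after) * 1000;
        if (after && Number.isNaN(wait)) wait = Date.parse(after) - Date.now();
        // Fall back to exponential backoff when the header is missing,
        // unparseable, or names a time already in the past.
        if (!after || !(wait >= 0)) wait = 2 ** attempt * 1000;
        await new Promise(r => setTimeout(r, wait));
        return fetchWithRetry(url, attempt + 1);
      }
      return res;
    }

The five metrics worth alerting on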
Put these on a dashboard, group by host. The one that’s out-of-distribution names the problem.
What counts toward your credits
We only charge for results you can use. The table below is what each status costs in our credit model — confirm with your provider if you’re comparing.
200, real 3xx, and 404 consume credits.
Code examples
Auto-recover from 403 / 429 / 503
    curl -X GET 'https://api.example.com/scrape' \
      -H 'ApiKey: YOUR_API_KEY' \
      -G \
      --data-urlencode 'url=https://protected-site.com' \
      --data-urlencode 'auto_proxy=true' \
      --data-urlencode 'js=true' \
      --data-urlencode 'stealth=true' \
      --data-urlencode 'solve_captcha=true'

Inspect upstream status, proxy used, retries
    import requests

    r = requests.get(
        'https://api.example.com/scrape',
        params={'url': 'https://example.com', 'auto_proxy': 'true'},
        headers={'ApiKey': 'YOUR_API_KEY'},
    )
    # Diagnostic headers set by the scraping API on its response.
    print('upstream status:', r.headers.get('X-Upstream-Status'))
    print('proxy used:     ', r.headers.get('X-Proxy-Type'))
    print('retries:        ', r.headers.get('X-Retries'))
    print('body class:     ', r.headers.get('X-Body-Class'))

FAQ
Almost always bot-detection. Add stealth=true and consider premium_proxy=true. Datacenter IPs are pre-flagged on most major sites.
Read Retry-After and wait. If absent, exponential backoff: 1s, 2s, 4s, 8s. After 3–4 retries with the same IP, rotate.
If the body contains cf-browser-verification or __cf_chl_, it’s a JS challenge. Enable js=true.
No. Only 200, real 3xx, and 404 consume credits. Failed requests are zero.
Likely an empty SPA shell that hydrates with JS, or a soft-block 200 with a CAPTCHA body. Enable js=true and validate against your expected selector.
Both are upstream issues. 502 = origin gave a bad response. 504 = origin gave no response in time. Both are retryable; 504 wants a longer backoff.
They’re Cloudflare-specific codes for “the origin behind us is sick.” Treat them as 5xx but back off harder — the origin needs time to recover.
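The 503 answer above (check the body for cf-browser-verification or __cf_chl_) can be sketched as a small helper. This is a minimal illustration: the function name looks_like_js_challenge is hypothetical, the marker list contains only the two strings named in this FAQ, and real challenge pages may carry other markers.

```python
# Markers taken from the FAQ answer above; the list is not exhaustive (assumption).
CHALLENGE_MARKERS = ("cf-browser-verification", "__cf_chl_")


def looks_like_js_challenge(status, body):
    """Return True when a 503 body carries one of the known challenge markers."""
    if status != 503:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

When this returns True, re-issuing the request with js=true is the suggested next step; when it returns False, the 503 is more likely genuine overload or maintenance and ordinary backoff applies.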