
Debug HTTP errors. Scrape reliably.

403, 429, 503, and the dozen other codes scrapers actually meet. The right retry behaviour for each, the policy table to copy into your code, and the five metrics to put on a dashboard so you catch problems before your data pipeline does.

Last updated Mar 2026
1

What HTTP status codes mean (briefly)

Every request gets back a status code. 2xx is success, 3xx is redirect, 4xx is “you did something wrong” (or the site says you did), 5xx is “the server did something wrong”. The right retry behaviour for a 429 is wrong for a 403, and a 200 isn’t always actually a success — read the body shape too.

2xx Success — page loaded
3xx Redirect — follow the URL
4xx Client error — usually blocked
5xx Server error — retry later
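The four classes above can be told apart with simple integer ranges. A minimal sketch follows; the function name status_class and the labels it returns are illustrative choices, not part of any API.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to its coarse class."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "unknown"  # 1xx and anything outside the standard ranges


for code in (200, 302, 403, 503):
    print(code, status_class(code))
```

The class alone is only a first cut: the retry behaviour still depends on the specific code, and a "success" still needs its body checked.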
2

Status code reference

The full list scrapers actually encounter — what each one means in practice, and the first thing to try.

400 Bad Request · Client

Malformed syntax, invalid headers, improperly encoded parameters. Often a coding bug, not a target-site issue.

→ Validate URL encoding and parameter format before retrying.
401 Unauthorized · Auth

Target requires authentication. Common on APIs requiring bearer tokens or pages behind a login wall.

→ Set cookies/auth headers via the API. Don’t solve auth CAPTCHAs.
403 Forbidden · Blocked

The most common scraping error. Triggered by bot detection, WAFs (Cloudflare, Akamai), IP blacklists, or missing browser-like headers.

→ Add stealth=true + premium_proxy=true.
404 Not Found · Not Found

URL doesn’t exist. Stale sitemaps, restructured URL patterns, or dynamic IDs that have rotated.

→ Validate URLs before scraping; handle gracefully — not retried.
408 Request Timeout · Timeout

Server didn’t receive a complete request in time. Usually network-side, sometimes a sign of an upstream issue.

→ Retry once. If it persists, check origin status.
410 Gone · Not Found

Resource permanently removed (vs 404 which is “maybe just missing”). Treat as a hard signal to drop the URL.

→ Mark URL as dead; don’t retry.
418 I'm a teapot · Honeypot

Some anti-bot vendors return 418 for known-bot fingerprints — Twitter’s old API famously did this for unauthenticated calls.

→ Switch to stealth=true; the request is being explicitly fingerprint-rejected.
422 Unprocessable · Refused

Our API uses 422 for refused requests — out-of-scope CAPTCHAs (login, banking), invalid combinations of params, or terms-violating targets.

→ Read the response body for the specific reason.
429 Too Many Requests · Rate Limit

Target’s rate limit hit. Often per-IP or per-session; sometimes per-account. The polite ones include a Retry-After header.

→ Use auto_proxy=true for IP rotation.
451 Unavailable For Legal Reasons · Geo

Geo-blocked content (GDPR opt-outs, regional licensing). Usually fixable with the right proxy_country.

→ Set proxy_country to a permitted region.
499 Client Closed Request · Cancelled

Nginx’s code for “client gave up before we replied”. Usually a timeout on your side.

→ Increase your client timeout above ~30s.
500 Internal Server Error · Server

Genuine server failure on the target. Retryable; not your fault.

→ Retry with exponential backoff, max 3 attempts.
502 Bad Gateway · Server

Upstream server is misbehaving — common when the target uses a load balancer with a sick backend.

→ Retry; usually transient.
503 Service Unavailable · Server

Server overloaded or maintenance. Also used by anti-bot systems (Cloudflare) as JS challenge pages.

→ Enable js=true to execute challenge pages.
504 Gateway Timeout · Server

Upstream took too long. Server-side timeout (vs 408 = client-side).

→ Retry; consider adding wait_for for slow pages.
520 Cloudflare: Unknown · Origin

Cloudflare-specific. Origin returned an empty/invalid response. Usually a sick origin behind CF.

→ Retry; if persistent, the origin is down.
521 Cloudflare: Origin Down · Origin

Cloudflare can’t reach the origin. Origin really is down.

→ Wait; not your problem to fix.
522 Cloudflare: Timeout · Origin

Cloudflare reached the origin but it didn’t respond in time.

→ Retry; back off if it persists.
3

The retry policy table

Copy this into your scraper. Each row is the safe default — short on retries, polite on backoff, no infinite loops. auto_proxy=true implements all of this server-side, but you’ll want the same shape on your own client retries too.

| Status | Action | Wait | Max | Notes |
| --- | --- | --- | --- | --- |
| 200 | Return | - | - | But validate body shape; see “Why your 200 might be a lie” below. |
| 301/302 | Follow redirect | - | ≤5 | Detect cycles, don’t loop forever. |
| 400 | Don’t retry | - | 1 | Fix the request shape first. |
| 403 | Escalate proxy | 0 | 3 | Switch tier (DC → res → mobile), add stealth. |
| 404 | Don’t retry | - | 1 | URL really doesn’t exist. Drop from queue. |
| 408 | Retry | 2s | 2 | Transient. Exponential backoff. |
| 429 | Wait + retry | Retry-After or 2^n s | 4 | Honour the header. Rotate IP after attempt 2. |
| 500 | Retry | 1s, 2s, 4s | 3 | Server-side; back off politely. |
| 502/504 | Retry | 2s | 3 | Origin or LB issue, usually transient. |
| 503 | js=true + retry | 2s | 3 | Often a CF JS challenge; JS rendering resolves it. |
| 520–524 | Retry slowly | 5s, 30s, 60s | 3 | Origin is sick. Long backoff. |
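One way to carry the table into a client is as plain data plus a small lookup. This is a sketch under stated assumptions: the names Rule, POLICY and plan are hypothetical, "Max" is read as total attempts including the first, and redirects are left to the HTTP client.

```python
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    action: str                    # what to do before the next attempt
    max_attempts: int              # assumed: total attempts, first one included
    waits: Tuple[float, ...] = ()  # seconds to wait before retry 1, 2, ...


# Values copied from the retry policy table; action labels are illustrative.
POLICY = {
    400: Rule("drop", 1),
    403: Rule("escalate_proxy", 3, (0,)),
    404: Rule("drop", 1),
    408: Rule("retry", 2, (2,)),
    429: Rule("wait_and_retry", 4),           # wait is computed below
    500: Rule("retry", 3, (1, 2, 4)),
    502: Rule("retry", 3, (2,)),
    504: Rule("retry", 3, (2,)),
    503: Rule("render_js_and_retry", 3, (2,)),
    520: Rule("retry_slowly", 3, (5, 30, 60)),  # stands in for 520-524
}


def plan(status: int, attempt: int, retry_after: Optional[float] = None):
    """Return (action, wait_seconds) for the next step.

    attempt is the number of attempts already made, starting at 1.
    Returns ("give_up", None) when the policy says to stop.
    """
    if status == 200:
        return ("return", None)
    key = 520 if 520 <= status <= 524 else status
    rule = POLICY.get(key)
    if rule is None or attempt >= rule.max_attempts:
        return ("give_up", None)
    if status == 429:
        # Honour Retry-After when present, otherwise 2^n seconds (1, 2, 4).
        wait = retry_after if retry_after is not None else 2 ** (attempt - 1)
    else:
        # Reuse the last listed wait if the schedule is shorter than the cap.
        wait = rule.waits[min(attempt - 1, len(rule.waits) - 1)]
    return (rule.action, wait)


print(plan(500, 1))   # first 500: retry after the first listed wait
print(plan(404, 1))   # 404 is never retried
```

Keeping the policy as data means a change to a wait or a cap is a one-line edit, and unknown codes fall through to "give_up" rather than looping.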
4

Why these errors actually fire

Bot detection

WAFs, fingerprinting, and behavioural analysis identify automated traffic. Missing headers, no JS execution, or a TLS fingerprint associated with bots can each trigger a 403.

Rate limiting

Servers enforce per-IP or per-session quotas. Exceeding one returns a 429, often with a Retry-After header. Some sites throttle silently before blocking outright.

Geo-restrictions

Content gated to specific regions returns 403 or 451. Datacenter IPs from unexpected geos are the loudest tell.

Dynamic SPAs

Routes that don’t exist as server resources return 404 to plain HTTP — they need JS to render. Enable js=true and check.

Origin sickness

5xx codes (especially 502/520–524) often have nothing to do with you. The target’s LB or origin is having a bad day.

Account-bound limits

On authenticated APIs, 429 sometimes counts per-account, not per-IP. Rotating proxy doesn’t help; slowing down does.
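When the limit is counted per account, the fix is pacing rather than rotation. The sketch below spaces out calls per account; AccountThrottle and its parameters are hypothetical names, and the right rate depends on the target's published or observed quota.

```python
import time


class AccountThrottle:
    """Space out calls per account so a per-account quota is not exceeded."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._next_allowed = {}  # account id -> earliest time of the next call

    def wait(self, account: str) -> float:
        """Block until the account may send again; return the seconds slept."""
        now = self._clock()
        ready_at = self._next_allowed.get(account, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot one interval after this call's slot.
        self._next_allowed[account] = max(now, ready_at) + self.min_interval
        return delay
```

Call throttle.wait(account_id) immediately before each request for that account. Different accounts are paced independently, so one busy account does not slow the others.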

5

Why your 200 might be a lie

Not every 200 is a real success. Anti-bot vendors regularly serve a 200 with a CAPTCHA HTML body — your scraper sees “OK”, your data pipeline sees garbage. Always validate the body shape, not just the status.

Python · validate response shape
SOFT_BLOCK_MARKERS = (
    'cf-browser-verification',
    'g-recaptcha',
    'datadome', 'dd-blocked',
    'pmx-protected',
    'just a moment',
)

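def is_real_response(html, expected_selector='article'):
    """Return True only if the body looks like real content, not a soft block."""
    lowered = html.lower()  # markers are lowercase, so compare case-insensitively
    if any(marker in lowered for marker in SOFT_BLOCK_MARKERS):
        return False
    # Short pages that mention a CAPTCHA are almost always challenge pages.
    if len(html) < 2000 and 'captcha' in lowered:
        return False
    # Plain substring check: pass a tag or class name that real pages contain.
    return expected_selector in html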

Common “200 lying” patterns: Cloudflare interstitials, DataDome blocked landings, login redirects served as 200 instead of 302, and empty SPA shells that need JS to render. We expose X-Body-Class on the response with a hint when we detect one of these.

6

Hidden retry-after headers (read them)

The polite servers tell you exactly when to come back. A 429 or 503 often includes Retry-After — either seconds (Retry-After: 30) or an HTTP date. Honour it before retrying. Hammering an endpoint without backoff is the fastest way to escalate a soft block into a permanent IP ban.

Node · honour Retry-After
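async function fetchWithRetry(url, attempt = 0) {
  const res = await fetch(url);
  if ((res.status === 429 || res.status === 503) && attempt < 4) {
    const after = res.headers.get('Retry-After');
    let wait = 2 ** attempt * 1000; // fallback: exponential backoff
    if (after !== null) {
      // Retry-After is either delta-seconds ("30") or an HTTP date.
      const seconds = Number(after);
      const parsed = Number.isFinite(seconds) ? seconds * 1000 : Date.parse(after) - Date.now();
      if (Number.isFinite(parsed)) wait = Math.max(0, parsed);
    }
    await new Promise(r => setTimeout(r, wait));
    return fetchWithRetry(url, attempt + 1);
  }
  return res;
}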
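The same header handling in Python, for clients that are not on Node. This is a sketch using only the standard library; the function name retry_after_seconds and the optional now parameter (there to make the date branch testable) are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to a wait in seconds.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to exponential backoff.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():                      # delta-seconds form, e.g. "30"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):          # raised type varies by Python version
        return None
    if when.tzinfo is None:                  # treat a naive date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (when - now).total_seconds())
```

A caller would typically write wait = retry_after_seconds(header) and, if that is None, use its own backoff schedule instead.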
7

The five metrics worth alerting on

Put these on a dashboard, group by host. The one that’s out-of-distribution names the problem.

| Metric | What it tells you | If it’s off |
| --- | --- | --- |
| success_rate | Percentage of requests that returned a usable 2xx body. < 95% means something’s tilted. | Group by host; the bad host is the suspect. |
| block_rate | Ratio of requests returning 403 / 451 / 422-blocked. Healthy is < 2% with auto_proxy. | Add stealth=true; verify proxy_country matches. |
| rate_limit_rate | Ratio of 429s. Healthy is < 1% with sticky-session strategy on session-tracking sites. | Slow concurrency; rotate sticky session keys. |
| origin_error_rate | 5xx rate excluding 503. Above ~1% sustained means the target is sick, not you. | Pause; alert. Wait for origin to recover. |
| avg_latency_p95 | 95th-percentile end-to-end latency per host. Spikes correlate with newly-deployed challenges. | Compare host vs baseline; check vendor-update log. |
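The five metrics can be derived from a flat request log. The sketch below assumes a hypothetical record shape (status, usable, latency_s) and counts every 422 as blocked, which overstates block_rate if some 422s are parameter errors.

```python
from typing import Dict, Iterable, Mapping


def host_metrics(records: Iterable[Mapping]) -> Dict[str, float]:
    """Compute the five dashboard metrics for one host.

    Each record is assumed to carry: status (int), usable (bool, the body
    passed validation) and latency_s (float, end-to-end seconds).
    """
    rows = list(records)
    total = len(rows)
    if total == 0:
        raise ValueError("no records")

    def share(predicate) -> float:
        return sum(1 for r in rows if predicate(r)) / total

    latencies = sorted(r["latency_s"] for r in rows)
    rank = (95 * total + 99) // 100  # nearest-rank p95, 1-based, integer ceil
    return {
        "success_rate": share(lambda r: 200 <= r["status"] < 300 and r["usable"]),
        # Assumption: all 422s are treated as blocks.
        "block_rate": share(lambda r: r["status"] in (403, 451, 422)),
        "rate_limit_rate": share(lambda r: r["status"] == 429),
        "origin_error_rate": share(
            lambda r: 500 <= r["status"] < 600 and r["status"] != 503
        ),
        "avg_latency_p95": latencies[rank - 1],
    }
```

Run it once per host over a fixed window (for example the last 15 minutes) and compare hosts side by side; the thresholds in the table then apply to each returned value.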
8

What counts toward your credits

We only charge for results you can use. The table below is what each status costs in our credit model — confirm with your provider if you’re comparing.

| Status | Meaning | Charged? |
| --- | --- | --- |
| 200 | Successful: page fetched, body returned | Yes |
| 301/302 | Redirect: followed to final URL automatically | Yes |
| 404 | Page not found: URL doesn’t exist (real answer) | Yes |
| 400 | Bad request: malformed parameters | No |
| 401 | Authentication required by the target | No |
| 403 | Blocked by bot detection or WAF | No |
| 408/499 | Timeout (either side) | No |
| 422 | Refused: out of scope or invalid combination | No |
| 429 | Rate limited by the target | No |
| 500–504 | Server-side errors on the target | No |
| 520–524 | Cloudflare origin errors: origin sick | No |
Pay only for results
If we couldn’t get the data, you don’t pay. Only 200, real 3xx, and 404 consume credits.
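To estimate usage from your own logs, count only the statuses the table marks as charged. This sketch counts billable requests rather than credits, because the credit cost per request is not stated here; is_billable and billable_count are hypothetical helper names, and any 3xx is treated as a "real 3xx".

```python
from typing import Iterable


def is_billable(status: int) -> bool:
    """True for statuses the table marks as charged: 200, 3xx and 404."""
    return status == 200 or 300 <= status < 400 or status == 404


def billable_count(statuses: Iterable[int]) -> int:
    """Count how many logged responses would consume credits."""
    return sum(1 for s in statuses if is_billable(s))


log = [200, 200, 302, 404, 403, 429, 503, 522]
print(billable_count(log), "of", len(log), "requests are billable")
```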
9

Code examples

Auto-recover from 403 / 429 / 503

cURL · auto-recovery stack
curl 'https://api.example.com/scrape' \
  -H 'ApiKey: YOUR_API_KEY' \
  -G \
  --data-urlencode 'url=https://protected-site.com' \
  --data-urlencode 'auto_proxy=true' \
  --data-urlencode 'js=true' \
  --data-urlencode 'stealth=true' \
  --data-urlencode 'solve_captcha=true'

Inspect upstream status, proxy used, retries

Python · response inspection
import requests

r = requests.get(
    'https://api.example.com/scrape',
    params={'url': 'https://example.com', 'auto_proxy': 'true'},
    headers={'ApiKey': 'YOUR_API_KEY'},
    timeout=60,  # requests has no default timeout; keep this above ~30s
)
print('upstream status:', r.headers.get('X-Upstream-Status'))
print('proxy used:    ', r.headers.get('X-Proxy-Type'))
print('retries:       ', r.headers.get('X-Retries'))
print('body class:    ', r.headers.get('X-Body-Class'))
10

FAQ

Why am I getting 403 on a public page?

Almost always bot-detection. Add stealth=true and consider premium_proxy=true. Datacenter IPs are pre-flagged on most major sites.

How should I handle 429 in my own retry logic?

Read Retry-After and wait. If absent, exponential backoff: 1s, 2s, 4s, 8s. After 3–4 retries with the same IP, rotate.

When is 503 actually a Cloudflare challenge?

If the body contains cf-browser-verification or __cf_chl_, it’s a JS challenge. Enable js=true.
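That check is a two-marker substring test. A minimal sketch, with is_cloudflare_challenge as an illustrative name; the markers are the two named in the answer above and may change as the vendor updates its pages.

```python
CF_CHALLENGE_MARKERS = ("cf-browser-verification", "__cf_chl_")


def is_cloudflare_challenge(status: int, body: str) -> bool:
    """True when a 503 body carries one of the known challenge markers."""
    if status != 503:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in CF_CHALLENGE_MARKERS)


print(is_cloudflare_challenge(503, '<form id="cf-browser-verification">'))
print(is_cloudflare_challenge(503, "<h1>Down for maintenance</h1>"))
```

A 503 that fails this test is more likely a real outage or maintenance page, so plain retry with backoff is the better response than enabling JS rendering.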

Do you charge for 403 / 429 / 5xx responses?

No. Only 200, real 3xx, and 404 consume credits. Failed requests are zero.

My 200 has no data — why?

Likely an empty SPA shell that hydrates with JS, or a soft-block 200 with a CAPTCHA body. Enable js=true and validate against your expected selector.

What’s the difference between 502 and 504?

Both are upstream issues. 502 = origin gave a bad response. 504 = origin gave no response in time. Both are retryable; 504 wants a longer backoff.

Are 520–524 different from regular 5xx?

They’re Cloudflare-specific codes for “the origin behind us is sick.” Treat them as 5xx but back off harder — the origin needs time to recover.
