What HTTP status codes mean (briefly)

Every request gets back a status code. 2xx is success, 3xx is redirect, 4xx is “you did something wrong” (or the site says you did), 5xx is “the server did something wrong”. The class alone is not enough to act on: the right retry behaviour for a 429 is wrong for a 403, and a 200 is not always a real success, so check the body shape too.

2xx Success - page loaded

3xx Redirect - follow the URL

4xx Client error - usually blocked

5xx Server error - retry later
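As a rough illustration of the four classes above, a scraper can bucket any status code before deciding what to do with it. This is a minimal sketch; the function name `status_class` and the bucket labels are made up for this example and are not part of any API.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to the coarse class used in this guide."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "unknown"  # 1xx or anything outside the standard ranges
```

The class is only a first cut: 403 and 429 land in the same bucket but need different handling, which is what the reference below is for.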

2

Status code reference

The full list scrapers actually encounter - what each one means in practice, and the first thing to try.

400 Bad Request Client

Malformed syntax, invalid headers, improperly encoded parameters. Often a coding bug, not a target-site issue.

→ Validate URL encoding and parameter format before retrying.

401 Unauthorized Auth

Target requires authentication. Common on APIs requiring bearer tokens or pages behind a login wall.

→ Set cookies/auth headers via the API. Login CAPTCHAs are out of scope, so don’t try to solve them.

403 Forbidden Blocked

The most common scraping error. Triggered by bot detection, WAFs (Cloudflare, Akamai), IP blacklists, or missing browser-like headers.

→ Add stealth=true + premium_proxy=true.

404 Not Found Not Found

URL doesn’t exist. Stale sitemaps, restructured URL patterns, or dynamic IDs that have rotated.

→ Validate URLs before scraping; handle gracefully and don’t retry.

408 Request Timeout Timeout

Server didn’t receive a complete request in time. Usually network-side, sometimes a sign of an upstream issue.

→ Retry once. If it persists, check origin status.

410 Gone Not Found

Resource permanently removed (vs 404 which is “maybe just missing”). Treat as a hard signal to drop the URL.

→ Mark URL as dead; don’t retry.

418 I'm a teapot Honeypot

Some anti-bot vendors return 418 for known-bot fingerprints. (Twitter’s old API is sometimes cited here, but its famous non-standard code was 420 “Enhance Your Calm”, which it used for rate limiting.)

→ Switch to stealth=true; the request is being explicitly fingerprint-rejected.

422 Unprocessable Refused

Our API uses 422 for refused requests - out-of-scope CAPTCHAs (login, banking), invalid combinations of params, or terms-violating targets.

→ Read the response body for the specific reason.

429 Too Many Requests Rate Limit

Target’s rate limit hit. Often per-IP or per-session; sometimes per-account. The polite ones include a Retry-After header.

→ Use auto_proxy=true for IP rotation.

451 Unavailable For Legal Reasons Geo

Geo-blocked content (GDPR opt-outs, regional licensing). Usually fixable with the right proxy_country.

→ Set proxy_country to a permitted region.

499 Client Closed Request Cancelled

Nginx’s code for “client gave up before we replied”. Usually a timeout on your side.

→ Increase your client timeout above ~30s.

500 Internal Server Error Server

Genuine server failure on the target. Retryable; not your fault.

→ Retry with exponential backoff, max 3 attempts.

502 Bad Gateway Server

Upstream server is misbehaving - common when the target uses a load balancer with a sick backend.

→ Retry; usually transient.

503 Service Unavailable Server

Server overloaded or maintenance. Also used by anti-bot systems (Cloudflare) as JS challenge pages.

→ Enable js=true to execute challenge pages.

504 Gateway Timeout Server

Upstream took too long. The gateway or proxy gave up waiting on the server behind it (vs 408, where the server gave up waiting on the client’s request).

→ Retry; consider adding wait_for for slow pages.

520 Cloudflare: Unknown Origin

Cloudflare-specific. Origin returned an empty/invalid response. Usually a sick origin behind CF.

→ Retry; if persistent, the origin is down.

521 Cloudflare: Origin Down Origin

Cloudflare can’t reach the origin. Origin really is down.

→ Wait; not your problem to fix.

522 Cloudflare: Timeout Origin

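Cloudflare’s connection to the origin timed out before it was established. (524 is the code for an origin that accepted the connection but didn’t respond in time.)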

→ Retry; back off if it persists.
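The "first thing to try" arrows above can be collected into a small lookup. The sketch below is hypothetical glue code, not part of the API: `FIRST_FIX`, `DROP_URL` and `next_params` are names invented here, and only the parameter names (`stealth`, `premium_proxy`, `auto_proxy`, `js`) come from this guide. 451 is left out because `proxy_country` needs a value that depends on the target.

```python
from typing import Optional

# Status -> parameters to add on the next attempt (taken from the reference above).
FIRST_FIX = {
    403: {"stealth": "true", "premium_proxy": "true"},
    418: {"stealth": "true"},
    429: {"auto_proxy": "true"},
    503: {"js": "true"},
}
DROP_URL = {404, 410}  # dead URLs: handle gracefully, never retry


def next_params(status: int, params: dict) -> Optional[dict]:
    """Return updated params for the next attempt, or None if the URL should be dropped."""
    if status in DROP_URL:
        return None
    # Statuses without a specific fix (e.g. 500) keep the params unchanged.
    return {**params, **FIRST_FIX.get(status, {})}
```

Returning a new dict rather than mutating the input keeps the original request around for logging.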

3

The retry policy table

Copy this into your scraper. Each row is the safe default - short on retries, polite on backoff, no infinite loops. auto_proxy=true implements all of this server-side, but you’ll want the same shape on your own client retries too.

| Status | Action | Wait | Max | Notes |
|---|---|---|---|---|
| 200 | Return | - | - | But validate body shape; see "200 lying" below. |
| 301/302 | Follow redirect | - | ≤5 | Detect cycles, don’t loop forever. |
| 400 | Don’t retry | - | 1 | Fix the request shape first. |
| 403 | Escalate proxy | 0 | 3 | Switch tier (DC → res → mobile), add stealth. |
| 404 | Don’t retry | - | 1 | URL really doesn’t exist. Drop from queue. |
| 408 | Retry | 2s | 2 | Transient. Exponential backoff. |
| 429 | Wait + retry | Retry-After or 2^n s | 4 | Honour the header. Rotate IP after attempt 2. |
| 500 | Retry | 1s, 2s, 4s | 3 | Server-side; back off politely. |
| 502/504 | Retry | 2s | 3 | Origin or LB issue, usually transient. |
| 503 | js=true + retry | 2s | 3 | Often a CF JS challenge - JS rendering resolves it. |
| 520–524 | Retry slowly | 5s, 30s, 60s | 3 | Origin is sick. Long backoff. |
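One way to carry the table into client code is as data plus a single lookup function. The sketch below makes a few assumptions that the table does not settle: "Max" is read as a cap on retries, the 408 waits are taken as 2s then 4s (from "Exponential backoff"), and the names `Policy`, `POLICIES` and `next_wait` are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    action: str               # what to do on this status
    waits: Tuple[float, ...]  # seconds to wait before retry 1, 2, ...
    max_retries: int          # the table's "Max" column, read as a retry cap (assumption)


POLICIES = {
    403: Policy("escalate_proxy", (0, 0, 0), 3),
    408: Policy("retry", (2, 4), 2),
    429: Policy("wait_retry", (1, 2, 4, 8), 4),  # 2^n seconds when Retry-After is absent
    500: Policy("retry", (1, 2, 4), 3),
    502: Policy("retry", (2, 2, 2), 3),
    503: Policy("js_retry", (2, 2, 2), 3),
    504: Policy("retry", (2, 2, 2), 3),
    **{code: Policy("retry_slowly", (5, 30, 60), 3) for code in range(520, 525)},
}


def next_wait(status: int, retries_done: int,
              retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before the next retry, or None to stop retrying."""
    policy = POLICIES.get(status)
    if policy is None or retries_done >= policy.max_retries:
        return None  # 400, 404 and anything unlisted: do not retry
    if status == 429 and retry_after is not None:
        return retry_after  # honour the server's header over our own backoff
    return policy.waits[min(retries_done, len(policy.waits) - 1)]
```

Keeping the policy as a table makes it easy to diff against the rows above when either one changes. Proxy escalation and `js=true` are left to the caller, which can switch on `POLICIES[status].action`.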

4

Why these errors actually fire

Bot detection

WAFs, fingerprinting, and behavioural analysis identify automated traffic. Missing headers, no JS, or a bot-class TLS hash trigger 403.

Rate limiting

Servers enforce per-IP / per-session quotas. Exceeding them returns 429, often with a Retry-After header. Some sites throttle silently before blocking outright.

Geo-restrictions

Content gated to specific regions returns 403 or 451. Datacenter IPs from unexpected geos are the loudest tell.

Dynamic SPAs

Routes that don’t exist as server resources return 404 to plain HTTP - they need JS to render. Enable js=true and check.

Origin sickness

5xx codes (especially 502/520–524) often have nothing to do with you. The target’s LB or origin is having a bad day.

Account-bound limits

On authenticated APIs, 429 sometimes counts per-account, not per-IP. Rotating proxy doesn’t help; slowing down does.
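For account-bound limits, "slowing down" can be as simple as enforcing a minimum gap between calls that share an account. The class below is a minimal sketch under that assumption; `MinIntervalThrottle` is a made-up name, the right interval depends on the target's quota, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Dict, Optional


class MinIntervalThrottle:
    """Space out calls that share a key (e.g. an account ID) by at least `interval` seconds."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait(self, key: str) -> float:
        """Block until `key` may be used again; return the seconds slept."""
        now = self._clock()
        last: Optional[float] = self._last.get(key)
        delay = 0.0 if last is None else max(0.0, last + self.interval - now)
        if delay:
            self._sleep(delay)
        self._last[key] = now + delay  # the moment this call is allowed to proceed
        return delay
```

Call `throttle.wait(account_id)` right before each request. Keys are independent, so one busy account does not slow down the others.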

5

Why your 200 might be a lie

Not every 200 is a real success. Anti-bot vendors regularly serve a 200 with a CAPTCHA HTML body - your scraper sees “OK”, your data pipeline sees garbage. Always validate the body shape, not just the status.

Python · validate response shape

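SOFT_BLOCK_MARKERS = (
    'cf-browser-verification',
    'g-recaptcha',
    'datadome', 'dd-blocked',
    'pmx-protected',
    'just a moment',
)

def is_real_response(html, expected_selector='article'):
    """Return True if the body looks like real content rather than a soft block."""
    lowered = html.lower()  # lowercase once; all markers are lowercase
    if any(m in lowered for m in SOFT_BLOCK_MARKERS):
        return False
    # Short pages that mention a CAPTCHA are almost certainly challenge pages.
    if len(html) < 2000 and 'captcha' in lowered:
        return False
    # Plain substring check, not a CSS selector match; use an HTML parser for stricter validation.
    return expected_selector in html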

Common “200 lying” patterns: Cloudflare interstitials, DataDome blocked landings, login redirects served as 200 instead of 302, and empty SPA shells that need JS to render. We expose X-Body-Class on the response with a hint when we detect one of these.
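The empty-SPA-shell case is not caught by marker strings, because the shell contains no block page at all, just scripts and an empty mount point. One heuristic is to measure how much visible text the body carries. The sketch below uses only the standard library; `looks_like_empty_shell` and its 200-character threshold are assumptions to tune per site, not a documented rule.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style>, or <noscript>."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if not self._depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """True when the page has almost no visible text (threshold is an arbitrary assumption)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A page that trips this check is a candidate for a second fetch with `js=true` rather than a hard failure.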

6

Hidden retry-after headers (read them)

The polite servers tell you exactly when to come back. A 429 or 503 often includes Retry-After - either seconds (Retry-After: 30) or an HTTP date. Honour it before retrying. Hammering an endpoint without backoff is the fastest way to escalate a soft block into a permanent IP ban.

Node · honour Retry-After

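// Convert a Retry-After value (seconds or an HTTP date) to milliseconds.
function retryAfterMs(header, fallbackMs) {
  if (!header) return fallbackMs;
  const secs = Number(header);
  if (Number.isFinite(secs)) return Math.max(0, secs * 1000);
  const date = Date.parse(header);
  return Number.isNaN(date) ? fallbackMs : Math.max(0, date - Date.now());
}

async function fetchWithRetry(url, attempt = 0) {
  const res = await fetch(url);
  if ((res.status === 429 || res.status === 503) && attempt < 4) {
    // Fall back to exponential backoff (1s, 2s, 4s, 8s) if the header is absent or unparseable.
    const wait = retryAfterMs(res.headers.get('Retry-After'), 2 ** attempt * 1000);
    await new Promise(r => setTimeout(r, wait));
    return fetchWithRetry(url, attempt + 1);
  }
  return res;
}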
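For Python clients, the same header can be parsed with the standard library. This is a sketch rather than a drop-in: `retry_after_seconds` is a name chosen here, and it returns `None` for a missing or unparseable value so the caller can fall back to its own backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header (seconds or an HTTP date) into seconds to wait."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: caller should fall back to its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat a zone-less date as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())  # a date in the past means "retry now"
```

The optional `now` argument exists so the date branch can be exercised deterministically; in production code it can be left out.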

7

The five metrics worth alerting on

Put these on a dashboard, group by host. The one that’s out-of-distribution names the problem.

| Metric | What it tells you | If it’s off |
|---|---|---|
| success_rate | Percentage of requests that returned a usable 2xx body. < 95% means something’s tilted. | Group by host; the bad host is the suspect. |
| block_rate | Ratio of requests returning 403 / 451 / 422-blocked. Healthy is < 2% with auto_proxy. | Add stealth=true; verify proxy_country matches. |
| rate_limit_rate | Ratio of 429s. Healthy is < 1% with sticky-session strategy on session-tracking sites. | Slow concurrency; rotate sticky session keys. |
| origin_error_rate | 5xx rate excluding 503. Above ~1% sustained means the target is sick - not you. | Pause; alert. Wait for origin to recover. |
| avg_latency_p95 | 95th-percentile end-to-end latency per host. Spikes correlate with newly-deployed challenges. | Compare host vs baseline; check vendor-update log. |
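The four rate metrics can be computed from a plain request log. The sketch below assumes each record is a `(host, status, body_ok)` tuple, where `body_ok` is whatever your body validation returned. That record shape and the name `host_metrics` are assumptions for this example. It leaves out 422-blocked responses and latency, which need data this record shape does not carry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def host_metrics(records: Iterable[Tuple[str, int, bool]]) -> Dict[str, Dict[str, float]]:
    """records: (host, status, body_ok) per request. Returns per-host rates in [0, 1]."""
    counts = defaultdict(lambda: defaultdict(int))
    for host, status, body_ok in records:
        c = counts[host]
        c["total"] += 1
        if 200 <= status < 300 and body_ok:
            c["success"] += 1          # usable 2xx body
        elif status in (403, 451):
            c["block"] += 1
        elif status == 429:
            c["rate_limit"] += 1
        elif 500 <= status < 600 and status != 503:
            c["origin_error"] += 1     # 503 excluded: often a challenge, not a sick origin
    return {
        host: {
            "success_rate": c["success"] / c["total"],
            "block_rate": c["block"] / c["total"],
            "rate_limit_rate": c["rate_limit"] / c["total"],
            "origin_error_rate": c["origin_error"] / c["total"],
        }
        for host, c in counts.items()
    }
```

A 2xx with a failed body check counts against `success_rate` without landing in any error bucket, which is the "200 lying" case showing up in the numbers.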

8

What counts toward your credits

We only charge for results you can use. The table below is what each status costs in our credit model - confirm with your provider if you’re comparing.

| Status | Meaning | Charged? |
|---|---|---|
| 200 | Successful - page fetched, body returned | Yes |
| 301/302 | Redirect - followed to final URL automatically | Yes |
| 404 | Page not found - URL doesn’t exist (real answer) | Yes |
| 400 | Bad request - malformed parameters | No |
| 401 | Authentication required by the target | No |
| 403 | Blocked by bot detection or WAF | No |
| 408/499 | Timeout (either side) | No |
| 422 | Refused - out of scope or invalid combination | No |
| 429 | Rate limited by the target | No |
| 500–504 | Server-side errors on the target | No |
| 520–524 | Cloudflare origin errors - origin sick | No |

Pay only for results

If we couldn’t get the data, you don’t pay. Only 200, real 3xx, and 404 consume credits.
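To estimate credit usage from your own logs, the rule above reduces to a one-line predicate. This is a sketch for rough budgeting only: `is_billable` and `billable_count` are names invented here, treating every 3xx as a "real 3xx" is an assumption, and the provider's own accounting remains the source of truth.

```python
from typing import Iterable


def is_billable(status: int) -> bool:
    """True for the statuses this guide lists as charged: 200, 3xx, and 404."""
    return status == 200 or 300 <= status < 400 or status == 404


def billable_count(statuses: Iterable[int]) -> int:
    """Count how many logged statuses would consume a credit under that rule."""
    return sum(1 for s in statuses if is_billable(s))
```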

9

Code examples

Auto-recover from 403 / 429 / 503

cURL · auto-recovery stack

curl -X GET 'https://api.example.com/scrape' \
  -H 'ApiKey: YOUR_API_KEY' \
  -G \
  --data-urlencode 'url=https://protected-site.com' \
  --data-urlencode 'auto_proxy=true' \
  --data-urlencode 'js=true' \
  --data-urlencode 'stealth=true' \
  --data-urlencode 'solve_captcha=true'

Inspect upstream status, proxy used, retries

Python · response inspection

import requests

r = requests.get(
    'https://api.example.com/scrape',
    params={'url': 'https://example.com', 'auto_proxy': 'true'},
    headers={'ApiKey': 'YOUR_API_KEY'},
    timeout=60,  # requests has no default timeout; without one a stalled call hangs forever
)
# Diagnostic headers describing what happened upstream of the API.
print('upstream status:', r.headers.get('X-Upstream-Status'))
print('proxy used:    ', r.headers.get('X-Proxy-Type'))
print('retries:       ', r.headers.get('X-Retries'))
print('body class:    ', r.headers.get('X-Body-Class'))

10

FAQ

Why am I getting 403 on a public page?

Almost always bot-detection. Add stealth=true and consider premium_proxy=true. Datacenter IPs are pre-flagged on most major sites.

How should I handle 429 in my own retry logic?

Read Retry-After and wait. If absent, exponential backoff: 1s, 2s, 4s, 8s. After 3–4 retries with the same IP, rotate.

When is 503 actually a Cloudflare challenge?

If the body contains cf-browser-verification or __cf_chl_, it’s a JS challenge. Enable js=true.
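That check is short enough to inline in a retry loop. A minimal sketch, using only the two markers named in this answer (the function and constant names are made up here):

```python
CF_CHALLENGE_MARKERS = ("cf-browser-verification", "__cf_chl_")


def is_cloudflare_challenge(status: int, body: str) -> bool:
    """True when a 503 body carries one of the challenge markers named above."""
    return status == 503 and any(marker in body for marker in CF_CHALLENGE_MARKERS)
```

A 503 that fails this check is more likely plain overload or maintenance, where backing off is the better move than enabling JS rendering.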

Do you charge for 403 / 429 / 5xx responses?

No. Only 200, real 3xx, and 404 consume credits. Failed requests cost zero credits.

My 200 has no data - why?

Likely an empty SPA shell that hydrates with JS, or a soft-block 200 with a CAPTCHA body. Enable js=true and validate against your expected selector.

What’s the difference between 502 and 504?

Both are upstream issues. 502 = origin gave a bad response. 504 = origin gave no response in time. Both are retryable; 504 wants a longer backoff.

Are 520–524 different from regular 5xx?

They’re Cloudflare-specific codes for “the origin behind us is sick.” Treat them as 5xx but back off harder - the origin needs time to recover.
