Extract API
The /extract endpoint is the unified entry point for content extraction.
Internally it routes to one of several specialized handlers based on the
type query/body parameter.
Without a type parameter, requests fall through to our article extractor —
the same engine you may have been calling at /auto-extract before — and
support the full set of article-mode parameters documented below. Pass
type=auto for AI-driven extraction of arbitrary pages, or
type=<sitename> for a domain-specific preset.
GET https://api.ujeebu.com/extract
POST https://api.ujeebu.com/extract
NOTE — Backwards compatibility
/auto-extract is still served as an alias and is equivalent to calling /extract?type=auto. New integrations should call /extract directly.
Dispatch logic
| type value | Handler |
|---|---|
| (omitted) / article | Article extractor: cleaned text + metadata from blog posts, news, long-form pages |
| auto | AI-powered auto-extract: schema-less structured extraction of any page |
| lead, yelp, crunchbase, … | Domain-specific preset templates |
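The dispatch rule above can be sketched as a small function. This is only an illustration of the routing described in the table, not the server's implementation; the function name and the returned labels are hypothetical.

```python
def select_handler(type_value=None):
    """Return the handler family for a given `type` value.

    The labels "article", "auto" and "preset" are illustrative names,
    not identifiers used by the API itself.
    """
    # Omitted, empty, or "article" -> default article extractor.
    if type_value in (None, "", "article"):
        return "article"
    # "auto" -> AI-powered auto-extract.
    if type_value == "auto":
        return "auto"
    # Any other value is treated as a domain-specific preset name.
    return "preset"


print(select_handler())        # no type parameter
print(select_handler("auto"))
print(select_handler("yelp"))
```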
Choosing a mode
| Goal | Call |
|---|---|
| Article / blog / news clean-text extraction | /extract?url=… |
| AI-driven structured extraction of arbitrary pages | /extract?url=…&type=auto |
| Domain-specific preset (e.g. Yelp business page) | /extract?url=…&type=yelp |
| Native Go extractor (bypass remote API) | /extract?url=…&mode=super |
| Backwards-compat auto-extract | /auto-extract?url=… (equivalent to type=auto) |
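As a rough sketch, the calls in the table can be assembled with the Python standard library. Authentication is not covered in this section, so the example only builds URLs and sends nothing. The helper name `build_extract_url` and the `true`/`false` encoding of booleans are assumptions, and the example.com URLs are placeholders.

```python
from urllib.parse import urlencode

EXTRACT_ENDPOINT = "https://api.ujeebu.com/extract"


def build_extract_url(page_url, **params):
    """Build a GET URL for /extract (hypothetical helper)."""
    query = {"url": page_url}
    for name, value in params.items():
        # Assumption: booleans are sent as lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = value
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return EXTRACT_ENDPOINT + "?" + urlencode(query)


article_url = build_extract_url("https://example.com/post")
auto_url = build_extract_url("https://example.com/product", type="auto")
yelp_url = build_extract_url("https://example.com/biz", type="yelp")
super_url = build_extract_url("https://example.com/post", mode="super")

for url in (article_url, auto_url, yelp_url, super_url):
    print(url)
```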
Article mode parameters (default)
When called without a type parameter (or with type=article), the request
uses the article extractor.
Required (one of)
Exactly one of url or raw_html must be provided.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes (one of) | - | URL to extract from |
| raw_html | string | Yes (one of) | - | Pre-fetched HTML (skip fetching) |
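A client can enforce the "exactly one of" rule before sending a request. The check below is a minimal sketch; the function name is hypothetical and the server's own error behavior is not described here.

```python
def check_article_source(params):
    """Return which source parameter is set, or raise if the rule is violated."""
    has_url = bool(params.get("url"))
    has_raw_html = bool(params.get("raw_html"))
    # Both set or both missing violates "exactly one of".
    if has_url == has_raw_html:
        raise ValueError("provide exactly one of 'url' or 'raw_html'")
    return "url" if has_url else "raw_html"


print(check_article_source({"url": "https://example.com/post"}))
print(check_article_source({"raw_html": "<html><body>Hello</body></html>"}))
```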
Extraction flags
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | bool | No | true | Return cleaned text |
| html | bool | No | true | Return cleaned HTML |
| media | bool | No | false | Extract videos / embeds |
| feeds | bool | No | false | Extract RSS feeds |
| images | bool | No | true | Extract image URLs |
| author | bool | No | true | Extract author |
| pub_date | bool | No | true | Extract publication date |
| publisher_country | bool | No | false | Detect publisher country |
| publisher_tz | bool | No | false | Detect publisher timezone |
| is_article | bool | No | true | Return article-probability score |
| partial | int | No | 0 | Partial extraction mode |
| quick_mode | bool | No | false | Fast, less thorough |
| heavy_mode | bool | No | false | Slower, more thorough |
| strip_tags | string or []string | No | "form" | Tags to strip from HTML |
Image processing
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| image_analysis | bool | No | true | Analyze images (dims, relevance) |
| min_image_width | int | No | 200 | Min image width (px) |
| min_image_height | int | No | 100 | Min image height (px) |
| image_timeout | int | No | 2 | Per-image timeout (s) |
| return_only_enclosed_text_images | bool | No | true | Only images surrounded by text |
| main_image_in_html | bool | No | false | Embed main image in returned HTML |
Browser / JS
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| js | bool | No | false | Enable JS rendering |
| js_timeout | int | No | 30 | JS timeout (s) |
| wait_until | enum | No | load | One of load, domcontentloaded, networkidle, commit |
| scroll_down | bool | No | false | Scroll page after load |
| scroll_wait | int | No | 100 | Wait between scrolls (ms) |
| scroll_percent | int | No | 0 | Scroll fraction |
| progressive_scroll | bool | No | false | Incremental scroll |
| scroll_callback | string | No | null | JS to run during scroll |
| scroll_to_selector | string | No | null | CSS selector to scroll to |
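As an illustration of how these options combine, the sketch below assembles a parameter set for a JS-rendered request and rejects `wait_until` values outside the documented enum. The helper name is hypothetical, and the defaults simply mirror the table above.

```python
WAIT_UNTIL_VALUES = {"load", "domcontentloaded", "networkidle", "commit"}


def js_render_params(wait_until="load", js_timeout=30,
                     scroll_down=False, scroll_wait=100):
    """Assemble browser parameters for article mode (hypothetical helper)."""
    if wait_until not in WAIT_UNTIL_VALUES:
        raise ValueError(f"wait_until must be one of {sorted(WAIT_UNTIL_VALUES)}")
    # js defaults to false in article mode, so it is enabled explicitly.
    params = {"js": True, "wait_until": wait_until, "js_timeout": js_timeout}
    if scroll_down:
        # scroll_wait is in milliseconds, unlike js_timeout (seconds).
        params["scroll_down"] = True
        params["scroll_wait"] = scroll_wait
    return params


print(js_render_params(wait_until="networkidle", scroll_down=True, scroll_wait=250))
```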
Proxy
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| proxy_type | string | No | null | datacenter, residential, premium, custom, … |
| proxy_country | string | No | null | Two-letter country code |
| session_id | string | No | null | Sticky session id |
| auto_proxy | bool | No | false | Smart proxy selection |
| auto_premium_proxy | bool | No | false | Auto-upgrade to premium on failure |
| custom_proxy | string | Conditional | null | Required if proxy_type=custom |
| custom_proxy_username | string | No | null | Custom proxy auth |
| custom_proxy_password | string | No | null | Custom proxy auth |
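The one conditional requirement in this table (`custom_proxy` when `proxy_type=custom`) is easy to check client-side. The following is a minimal sketch with a hypothetical function name; it also checks that `proxy_country` looks like a two-letter code, without verifying that the code exists.

```python
def check_proxy_params(params):
    """Return a list of problems found in the proxy parameters (empty if none)."""
    errors = []
    # custom_proxy is only required when proxy_type is "custom".
    if params.get("proxy_type") == "custom" and not params.get("custom_proxy"):
        errors.append("custom_proxy is required when proxy_type=custom")
    country = params.get("proxy_country")
    # Shape check only: two alphabetic characters.
    if country is not None and (len(country) != 2 or not country.isalpha()):
        errors.append("proxy_country must be a two-letter country code")
    return errors


print(check_proxy_params({"proxy_type": "custom"}))
print(check_proxy_params({"proxy_type": "residential", "proxy_country": "US"}))
```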
Pagination
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| pagination | bool | No | false | Follow pagination links |
| pagination_max_pages | int | No | 0 | Max pages to follow (0 = unlimited) |
CAPTCHA
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| auto_captcha_solve | bool | No | false | Detect & solve CAPTCHAs (forces super-mode) |
| auto_captcha_solve_timeout | int | No | 0 | CAPTCHA solve timeout (ms, 0 = use server default) |
Other
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| timeout | int | No | 60 | Overall timeout (s) |
| cookies | string or map | No | null | Cookies to send |
| extra_headers | object | No | null | Forwarded headers (also UJB-* request headers) |
| mode | string | No | null | super / d1* to force native QuickExtract |
| type | string | No | "article" | Extractor selector (see dispatch table above) |
Auto mode parameters (type=auto)
When called with type=auto, the request is routed to our AI-powered auto-extract
handler. It uses AI to generate extraction rules dynamically and has its own slim
parameter set with browser-friendly defaults (js, auto_proxy, and
auto_captcha_solve all default to true because auto-extract typically targets
dynamic, AI-relevant pages).
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes (one of) | - | URL to extract from |
| html | string | Yes (one of) | - | Pre-fetched HTML (skip browser fetch) |
| force_refresh | bool | No | false | Regenerate AI rules instead of using cache |
| provider | string | No | null | google, openai, anthropic |
| model | string | No | null | AI model name |
| wait_for | string or array | No | null | CSS selector(s) to wait for |
| wait_for_timeout | int (ms) | No | 30000 | Wait-for timeout |
| timeout | int (ms) | No | 120000 | Overall timeout |
| proxy_type | string | No | null | Proxy type |
| auto_proxy | bool | No | true | Smart proxy selection |
| js | bool | No | true | JS rendering |
| scroll_down | bool | No | false | Scroll before extraction |
| auto_captcha_solve | bool | No | true | Auto CAPTCHA solving |
| auto_captcha_solve_timeout | int (ms) | No | 0 | CAPTCHA solve timeout (0 = use server default) |
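Because several parameters share a name across modes but differ in default or unit, it can help to see the two default sets side by side. The sketch below copies a few defaults from the article-mode and auto-mode tables and merges caller overrides on top; the dictionary and function names are hypothetical, and the server applies its own defaults regardless of what a client assumes.

```python
# Defaults copied from the parameter tables; timeout units differ by mode.
ARTICLE_DEFAULTS = {
    "js": False,
    "auto_proxy": False,
    "auto_captcha_solve": False,
    "timeout": 60,        # seconds in article mode
}
AUTO_DEFAULTS = {
    "js": True,
    "auto_proxy": True,
    "auto_captcha_solve": True,
    "timeout": 120000,    # milliseconds in auto mode
}
DEFAULTS_BY_MODE = {"article": ARTICLE_DEFAULTS, "auto": AUTO_DEFAULTS}


def effective_params(mode="article", **overrides):
    """Return the defaults for `mode` with caller overrides applied on top."""
    if mode not in DEFAULTS_BY_MODE:
        raise ValueError("this sketch only covers 'article' and 'auto'")
    return {**DEFAULTS_BY_MODE[mode], **overrides}


print(effective_params("article"))
print(effective_params("auto", js=False))
```

Note that `timeout` is 60 seconds in article mode but 120000 milliseconds in auto mode, so a value carried over unchanged from one mode to the other would be off by a factor of 1000.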
Typed mode (type=lead|yelp|crunchbase|…)
When type is set to any value other than "", article, or auto, the
request is routed to a preset AI-extraction template for specific
domains / page types. Typed extractors accept the same auto-extract base
parameters plus type-specific schema overrides.
Related
- AI Scraper for fully custom schemas
- Scrape API for raw HTML / screenshot / PDF
- Markdown API for clean Markdown output for RAG