Extract API

The /extract endpoint is the unified entry point for content extraction. Internally it routes to one of several specialized handlers based on the type query/body parameter.

Without a type parameter, requests fall through to the article extractor and support the full set of article-mode parameters documented below. Pass type=auto for AI-driven extraction of arbitrary pages, or type=<sitename> for a domain-specific preset.

GET https://api.ujeebu.com/extract

POST https://api.ujeebu.com/extract
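
As a rough sketch of how the two verbs might be used from Python's standard library, the example below only builds the request objects and does not send them. The JSON encoding of the POST body is an assumption (the text above only says that type may be a query or body parameter), the target URL is a placeholder, and authentication is omitted because this section does not describe it.

```python
import json
import urllib.parse
import urllib.request

EXTRACT_ENDPOINT = "https://api.ujeebu.com/extract"


def build_get_request(params: dict) -> urllib.request.Request:
    """Encode the parameters into the query string of a GET request."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{EXTRACT_ENDPOINT}?{query}", method="GET")


def build_post_request(params: dict) -> urllib.request.Request:
    """Carry the same parameters in the request body (JSON is an assumption)."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder target page; nothing is sent over the network here.
example_params = {"url": "https://example.com/post", "type": "auto"}
get_req = build_get_request(example_params)
post_req = build_post_request(example_params)
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url)
```

Passing either object to urllib.request.urlopen would perform the call once credentials are added.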

NOTE — Backwards compatibility

/auto-extract is still served as an alias and is equivalent to calling /extract?type=auto. New integrations should call /extract directly.
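
For integrations that still hold legacy URLs, the equivalence in the note can be applied mechanically. The helper below is a hypothetical client-side sketch (migrate_auto_extract is not part of the API) that rewrites an /auto-extract URL into the /extract?type=auto form and leaves other URLs untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def migrate_auto_extract(legacy_url: str) -> str:
    """Rewrite /auto-extract?... into the equivalent /extract?...&type=auto."""
    parts = urlsplit(legacy_url)
    if parts.path.rstrip("/") != "/auto-extract":
        return legacy_url  # not a legacy call, leave as-is
    # Keep every existing parameter except a stray type, then force type=auto.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "type"
    ]
    query.append(("type", "auto"))
    return urlunsplit(
        (parts.scheme, parts.netloc, "/extract", urlencode(query), parts.fragment)
    )


print(migrate_auto_extract(
    "https://api.ujeebu.com/auto-extract?url=https%3A%2F%2Fexample.com"
))
```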

Dispatch logic

| type value | Handler |
| --- | --- |
| (omitted) / article | Article extractor: cleaned text + metadata from blog posts, news, long-form pages |
| auto | AI-powered auto-extract: schema-less structured extraction of any page |
| lead, yelp, crunchbase, … | Domain-specific preset templates |
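
The dispatch table can be read as a three-way branch. The function below is only a client-side model of that routing for illustration, not the server's implementation; the name resolve_handler and the returned labels are made up for this sketch.

```python
from typing import Optional


def resolve_handler(type_value: Optional[str] = None) -> str:
    """Mirror the dispatch table: which handler a given type value selects."""
    if type_value is None or type_value in ("", "article"):
        return "article"  # default article extractor
    if type_value == "auto":
        return "auto"     # AI-powered auto-extract
    return "preset"       # any other value selects a domain-specific template


for value in (None, "article", "auto", "yelp"):
    print(repr(value), "->", resolve_handler(value))
```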

Choosing a mode

| Goal | Call |
| --- | --- |
| Article / blog / news clean-text extraction | /extract?url=… |
| AI-driven structured extraction of arbitrary pages | /extract?url=…&type=auto |
| Domain-specific preset (e.g. Yelp business page) | /extract?url=…&type=yelp |
| Native Go extractor (bypass remote API) | /extract?url=…&mode=super |
| Backwards-compat auto-extract | /auto-extract?url=… (equivalent to type=auto) |

Article mode parameters (default)

When called without a type parameter (or with type=article), the request uses the article extractor.

Required (one of)

Exactly one of url or raw_html must be provided.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes (one of) | - | URL to extract from |
| raw_html | string | Yes (one of) | - | Pre-fetched HTML (skip fetching) |
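
A client can enforce the "exactly one of" rule before sending anything. This is a minimal sketch of such a pre-flight check; check_source is a hypothetical helper name, and how the server itself reports a violation is not covered here.

```python
def check_source(params: dict) -> str:
    """Return which input is set, or raise unless exactly one of them is."""
    has_url = bool(params.get("url"))
    has_raw_html = bool(params.get("raw_html"))
    if has_url == has_raw_html:  # both set, or neither set
        raise ValueError("provide exactly one of 'url' or 'raw_html'")
    return "url" if has_url else "raw_html"


print(check_source({"url": "https://example.com/post"}))
print(check_source({"raw_html": "<article><p>Hello</p></article>"}))
```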

Extraction flags

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | bool | No | true | Return cleaned text |
| html | bool | No | true | Return cleaned HTML |
| media | bool | No | false | Extract videos / embeds |
| feeds | bool | No | false | Extract RSS feeds |
| images | bool | No | true | Extract image URLs |
| author | bool | No | true | Extract author |
| pub_date | bool | No | true | Extract publication date |
| publisher_country | bool | No | false | Detect publisher country |
| publisher_tz | bool | No | false | Detect publisher timezone |
| is_article | bool | No | true | Return article-probability score |
| partial | int | No | 0 | Partial extraction mode |
| quick_mode | bool | No | false | Fast, less thorough |
| heavy_mode | bool | No | false | Slower, more thorough |
| strip_tags | string or []string | No | "form" | Tags to strip from HTML |
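
Because every flag has a default, a request only needs to carry the flags that differ from the table. The sketch below copies the defaults from the table and filters a set of desired values down to the actual overrides; flag_overrides is a hypothetical helper, and how booleans should be serialized on the wire is left open because this section does not specify it.

```python
# Defaults copied from the extraction flags table.
ARTICLE_FLAG_DEFAULTS = {
    "text": True,
    "html": True,
    "media": False,
    "feeds": False,
    "images": True,
    "author": True,
    "pub_date": True,
    "publisher_country": False,
    "publisher_tz": False,
    "is_article": True,
    "partial": 0,
    "quick_mode": False,
    "heavy_mode": False,
    "strip_tags": "form",
}


def flag_overrides(**wanted) -> dict:
    """Keep only the flags whose requested value differs from the default."""
    unknown = sorted(set(wanted) - set(ARTICLE_FLAG_DEFAULTS))
    if unknown:
        raise KeyError(f"not an extraction flag: {', '.join(unknown)}")
    return {
        name: value
        for name, value in wanted.items()
        if ARTICLE_FLAG_DEFAULTS[name] != value
    }


# text=True matches the default, so only html and feeds remain.
print(flag_overrides(text=True, html=False, feeds=True))
```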

Image processing

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| image_analysis | bool | No | true | Analyze images (dims, relevance) |
| min_image_width | int | No | 200 | Min image width (px) |
| min_image_height | int | No | 100 | Min image height (px) |
| image_timeout | int | No | 2 | Per-image timeout (s) |
| return_only_enclosed_text_images | bool | No | true | Only images surrounded by text |
| main_image_in_html | bool | No | false | Embed main image in returned HTML |

Browser / JS

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| js | bool | No | false | Enable JS rendering |
| js_timeout | int | No | 30 | JS timeout (s) |
| wait_until | enum | No | load | One of load, domcontentloaded, networkidle, commit |
| scroll_down | bool | No | false | Scroll page after load |
| scroll_wait | int | No | 100 | Wait between scrolls (ms) |
| scroll_percent | int | No | 0 | Scroll fraction |
| progressive_scroll | bool | No | false | Incremental scroll |
| scroll_callback | string | No | null | JS to run during scroll |
| scroll_to_selector | string | No | null | CSS selector to scroll to |
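
The browser options mix units (js_timeout is in seconds, scroll_wait in milliseconds) and wait_until accepts only four values, so a small builder can catch mistakes early. This is an illustrative sketch; js_render_params is a hypothetical helper and it covers only a subset of the table.

```python
WAIT_UNTIL_VALUES = ("load", "domcontentloaded", "networkidle", "commit")


def js_render_params(
    wait_until: str = "load",
    js_timeout: int = 30,      # seconds
    scroll_down: bool = False,
    scroll_wait: int = 100,    # milliseconds between scrolls
) -> dict:
    """Build article-mode parameters for a JS-rendered fetch."""
    if wait_until not in WAIT_UNTIL_VALUES:
        raise ValueError(f"wait_until must be one of {WAIT_UNTIL_VALUES}")
    params = {"js": True, "wait_until": wait_until, "js_timeout": js_timeout}
    if scroll_down:
        # scroll_wait only matters when scrolling is enabled.
        params["scroll_down"] = True
        params["scroll_wait"] = scroll_wait
    return params


print(js_render_params(wait_until="networkidle", scroll_down=True))
```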

Proxy

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| proxy_type | string | No | null | datacenter, residential, premium, custom, … |
| proxy_country | string | No | null | Two-letter country code |
| session_id | string | No | null | Sticky session id |
| auto_proxy | bool | No | false | Smart proxy selection |
| auto_premium_proxy | bool | No | false | Auto-upgrade to premium on failure |
| custom_proxy | string | Conditional | null | Required if proxy_type=custom |
| custom_proxy_username | string | No | null | Custom proxy auth |
| custom_proxy_password | string | No | null | Custom proxy auth |
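
The one conditional rule in this table (custom_proxy is required when proxy_type=custom) is easy to check before a request goes out. The sketch below does that plus a shape check on proxy_country; check_proxy is a hypothetical helper, and the proxy address in the example is a placeholder whose format is an assumption.

```python
def check_proxy(params: dict) -> dict:
    """Validate the proxy-related parameters and return them unchanged."""
    if params.get("proxy_type") == "custom" and not params.get("custom_proxy"):
        raise ValueError("custom_proxy is required when proxy_type=custom")
    country = params.get("proxy_country")
    if country is not None and not (len(country) == 2 and country.isalpha()):
        raise ValueError("proxy_country must be a two-letter country code")
    return params


print(check_proxy({"proxy_type": "residential", "proxy_country": "US"}))
print(check_proxy({
    "proxy_type": "custom",
    "custom_proxy": "http://proxy.example:8080",  # placeholder address
}))
```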

Pagination

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| pagination | bool | No | false | Follow pagination links |
| pagination_max_pages | int | No | 0 | Max pages to follow (0 = unlimited) |

CAPTCHA

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| auto_captcha_solve | bool | No | false | Detect & solve CAPTCHAs (forces super-mode) |
| auto_captcha_solve_timeout | int | No | 0 | CAPTCHA solve timeout (ms, 0 = use server default) |

Other

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| timeout | int | No | 60 | Overall timeout (s) |
| cookies | string or map | No | null | Cookies to send |
| extra_headers | object | No | null | Forwarded headers (also UJB-* request headers) |
| mode | string | No | null | super / d1* to force native QuickExtract |
| type | string | No | "article" | Extractor selector (see the dispatch table above) |

Auto mode parameters (type=auto)

When called with type=auto, the request is routed to the AI-powered auto-extract handler, which generates extraction rules dynamically. It has its own slim parameter set with browser-friendly defaults: js, auto_proxy, and auto_captcha_solve all default to true, because auto-extract typically targets dynamic pages.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes (one of) | - | URL to extract from |
| html | string | Yes (one of) | - | Pre-fetched HTML (skip browser fetch) |
| force_refresh | bool | No | false | Regenerate AI rules instead of using cache |
| provider | string | No | null | google, openai, anthropic |
| model | string | No | null | AI model name |
| wait_for | string or array | No | null | CSS selector(s) to wait for |
| wait_for_timeout | int (ms) | No | 30000 | Wait-for timeout |
| timeout | int (ms) | No | 120000 | Overall timeout |
| proxy_type | string | No | null | Proxy type |
| auto_proxy | bool | No | true | Smart proxy selection |
| js | bool | No | true | JS rendering |
| scroll_down | bool | No | false | Scroll before extraction |
| auto_captcha_solve | bool | No | true | Auto CAPTCHA solving |
| auto_captcha_solve_timeout | int (ms) | No | 0 | CAPTCHA solve timeout (0 = use server default) |
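
Comparing this table with the article-mode tables shows three easy-to-miss differences: pre-fetched markup is passed as html rather than raw_html, html is no longer an output flag, and timeout is in milliseconds rather than seconds. The sketch below converts an article-mode parameter dict accordingly. to_auto_params is a hypothetical helper, and dropping parameters that are not listed for auto mode is a choice made for this example, not documented server behavior.

```python
# Parameter names listed in the auto mode table.
AUTO_PARAMS = {
    "url", "html", "force_refresh", "provider", "model", "wait_for",
    "wait_for_timeout", "timeout", "proxy_type", "auto_proxy", "js",
    "scroll_down", "auto_captcha_solve", "auto_captcha_solve_timeout",
}


def to_auto_params(article_params: dict) -> dict:
    """Translate article-mode parameters into their type=auto counterparts."""
    out = {"type": "auto"}
    for key, value in article_params.items():
        if key == "raw_html":
            out["html"] = value                 # input markup is renamed
        elif key == "html":
            continue                            # article-mode output flag only
        elif key == "timeout":
            out["timeout"] = int(value * 1000)  # seconds -> milliseconds
        elif key in AUTO_PARAMS:
            out[key] = value
        # Anything else (images, author, ...) has no auto-mode counterpart.
    return out


print(to_auto_params({
    "raw_html": "<p>hi</p>",
    "html": True,
    "timeout": 60,
    "images": True,
    "js": False,
}))
```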

Typed mode (type=lead|yelp|crunchbase|…)

When type is set to any value other than "", article, or auto, the request is routed to a preset AI-extraction template for specific domains / page types. Typed extractors accept the same auto-extract base parameters plus type-specific schema overrides.
