Use case · Article extraction for AI

The best article extractor around. Hands down.

Clean title, author, date, and body from any publisher on earth. Six years of tuning, JavaScript rendering and anti-bot built in, and output that is genuinely model-ready - all in one API call.

5,000 free credits · no card · failed requests not billed

The challenge

The article extraction challenge.

AI applications need clean, structured text. Raw web pages are 90% noise - and noise degrades model performance and pollutes training data.

Noisy HTML

Ads, navs, sidebars, cookie banners. Drag them into embeddings and your retrieval quality collapses.

Inconsistent structure

Every publisher formats articles differently. Reliable extraction across thousands of sources is the hard part nobody tells you about.

Boilerplate everywhere

Menus, footers, "related articles", share widgets. All polluting the body text you actually want.

Fragile scrapers

Custom extraction scripts break the day a site redesigns. The maintenance burden eats your roadmap.

Use cases

What AI teams build with it.

RAG pipelines

Feed your retrieval. Skip the cleanup.

Pipe clean, structured article content straight into your vector store. Embeddings capture meaning, not boilerplate. Metadata (date, author, topic) enables filtering and relevance ranking.
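As a rough illustration of that hand-off, the sketch below packs an extracted article into paragraph-aligned chunks and attaches the metadata to each one, so a vector store can filter on it. The key names (`text`, `title`, `author`, `pub_date`, `url`) are assumptions for illustration, since this page does not spell out the response schema, and `chunk_article` is a hypothetical helper, not part of the API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_article(article: dict, max_chars: int = 1000) -> list[Chunk]:
    """Pack paragraphs into chunks of at most max_chars, each carrying metadata.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    # Hypothetical key names; adjust them to the actual response schema.
    metadata = {key: article.get(key) for key in ("title", "author", "pub_date", "url")}
    paragraphs = [p.strip() for p in article.get("text", "").split("\n\n") if p.strip()]

    chunks: list[Chunk] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(Chunk(current, dict(metadata)))
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, dict(metadata)))
    return chunks
```

Each `Chunk.text` would then go to your embedding model, with `Chunk.metadata` stored alongside the vector for date, author, or source filtering.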

Business outcomes

RAG answer accuracy lifts when embeddings stop encoding nav bars
10,000+ articles/day ingested automatically
~90% reduction in content preprocessing time
Metadata-based filtering for more relevant retrieval

Fine-tuning

Build training corpora from the open web.

Extract clean text with consistent formatting across hundreds of sources. Build domain-specific corpora for fine-tuning. Markdown mode preserves headings, lists, and emphasis so models learn formatting too.

Business outcomes

Domain-specific training corpora from thousands of sources
Better model performance via clean, well-structured data
Dataset prep cut from weeks to hours
Consistent text quality across diverse source formats

AI news aggregation

Power topic-aware news feeds.

Collect, categorise, and summarise articles from hundreds of sources. AI-powered topic classification, sentiment analysis, and trend detection on consistently formatted data. Power automated briefings, alerts, and dashboards.

Business outcomes

10,000+ articles aggregated and categorised daily
Breaking trends detected ~80% faster
Auto-generated summaries and briefings
Personalised feeds via topic + sentiment analysis

Chatbot KB

Keep the chatbot KB current, automatically.

Ingest help-center articles, docs, blog posts, and FAQs to feed conversational AI. Clean extraction means the chatbot retrieves relevant answers without HTML artifacts. Auto-tagging routes questions to the right content.

Business outcomes

Higher chatbot answer accuracy
KB stays current via automated ingestion
~85% reduction in manual KB maintenance
Source attribution via metadata

Sources

Extract from every publisher format.

One API handles every layout - no custom parsers, no per-site maintenance.

News sites

Reuters, AP, Bloomberg, and thousands more. Clean text, headlines, authors, dates across any layout.

Tech blogs

Medium, Dev.to, Hashnode, and tens of thousands of tech blogs. Diverse layouts + JS-rendered content handled.

Research publications

Articles + papers from research repositories. Structured content with abstracts, citations, author info for academic AI.

Corporate blogs

Company blogs, press rooms, product announcements. Thought leadership and industry analysis for competitive intel.

Government sites

Policy docs, press releases, regulatory updates. Build compliance + policy-analysis AI on official sources.

Substack newsletters

Newsletter content from Substack and similar platforms. Author insights, analysis, commentary for content curation.

RSS / Atom feeds

Enhance feed items with full-text extraction. Beyond truncated summaries - complete text + metadata.

Academic journals

Research papers, abstracts, supplementary materials. Build AI training datasets from peer-reviewed content.

How it works

From URL to AI-ready content in 3 steps.

1

Send article URLs

Pass any URL to the Extract API or Markdown API. JS rendering and anti-bot are automatic. For site-wide ingestion, build a URL list from your sitemap and submit it as an async batch. Article boundaries are detected regardless of publisher CMS.

2

Automatic content isolation

Ads, navs, sidebars, footers, cookie banners, related-articles widgets - all stripped. Multi-page articles auto-merged. Output is clean, readable content ready for AI.

3

Receive structured output

Title, author, date, language, images, article-confidence score. Choose plain text for embeddings, markdown for LLM prompts, or cleaned HTML for display.

Try it

Drop an article URL in.

See clean extraction in real time - title, author, date, body, metadata.

curl 'https://api.ujeebu.com/article' \
  -H 'ApiKey: YOUR_API_KEY' \
  -G \
  --data-urlencode 'url=https://www.theverge.com/tech/935898/asus-rog-zephyrus-g14-2026-intel-nvidia-review' \
  --data-urlencode 'summary=true' \
  --data-urlencode 'html=false'

No API key required for testing in the playground. Powered by /article
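
The same request can be sketched in Python with only the standard library. The endpoint, the `ApiKey` header, and the `url`, `summary`, and `html` parameters are copied from the curl command above; the helper names are ours, and the response is assumed to be JSON, which this page does not state explicitly.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://api.ujeebu.com/article"  # taken from the curl example


def build_article_request(article_url: str, api_key: str, **params: str) -> urllib.request.Request:
    """Build (but do not send) a GET request mirroring the curl example."""
    query = {"url": article_url, **params}
    full_url = f"{API_ENDPOINT}?{urllib.parse.urlencode(query)}"
    return urllib.request.Request(full_url, headers={"ApiKey": api_key})


def fetch_article(article_url: str, api_key: str, **params: str) -> dict:
    """Send the request and decode the body, assuming a JSON response."""
    request = build_article_request(article_url, api_key, **params)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling `fetch_article("https://example.com/post", "YOUR_API_KEY", summary="true", html="false")` would issue the equivalent of the curl call; `build_article_request` is split out so the URL can be inspected without touching the network.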

Features

Built for production AI pipelines.

Clean text extraction

Pure article text with ads, navs, sidebars, boilerplate removed. Output ready for embeddings, summarisation, or LLM prompts without preprocessing.

Automatic metadata

Title, author, publication date, language, site name, hero image - parsed from Open Graph, JSON-LD, HTML meta. Use for filtering, dedup, and organisation.

Markdown for LLMs

LLM-optimised markdown via the Markdown API. Preserves headings, lists, emphasis, and links - formats models understand natively.

100+ languages

Multi-byte characters, RTL text, complex scripts handled. Language detection tags each article for multilingual training and filtering.

Pagination handling

Multi-page articles auto-detected and merged. Up to 30 pages combined seamlessly. No more truncated articles in your dataset.

Batch processing

Feed sitemap-derived URL lists into async batch jobs. Thousands of articles extracted in parallel with rate limiting and retry baked in.
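
The client-side half of that workflow might look like the sketch below: pull the `<loc>` entries out of a standard sitemap and slice them into batches. How a batch is actually submitted is not described on this page, so the sketch stops before that step; `urls_from_sitemap` and `in_batches` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap <urlset> document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def in_batches(urls: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

Each yielded list is one unit of work to hand to a batch job. Sitemap index files (a sitemap that lists other sitemaps) would need one more level of fetching, which is left out here.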

Powered by the Article Extractor and the Markdown API.

FAQ

Frequently asked.

How clean is the extracted text vs raw HTML scraping?

Significantly cleaner. The Extract API removes ads, navs, sidebars, footers, cookie banners, share buttons, and related-articles widgets, so the output contains only the actual article: headline, body, and inline images. The article-confidence score lets you programmatically filter out non-articles. For more aggressive content isolation, the Markdown API with the `fit` filter goes even further.
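
Filtering on that score could be as simple as the sketch below. The key name `confidence`, the 0-to-1 scale, and the 0.8 threshold are all assumptions for illustration; check the actual response schema before relying on them.

```python
def keep_articles(results: list[dict], min_confidence: float = 0.8,
                  score_key: str = "confidence") -> list[dict]:
    """Keep only results whose score meets the threshold.

    `score_key` is a hypothetical field name; a missing or empty score
    is treated as 0.0, so that result is dropped.
    """
    return [r for r in results if float(r.get(score_key) or 0.0) >= min_confidence]
```

Treating a missing score as zero is a conservative choice: pages the extractor could not score are dropped rather than silently entering the dataset.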

Can I extract from paywalled or login-protected sites?

Yes - forward your own credentials with UJB-prefixed headers (UJB-Authorization, UJB-Cookie) so the request authenticates as you. Useful for premium publications and subscriber-only blogs where you have legitimate access. We forward your existing credentials; we don’t bypass paywalls.
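
A minimal sketch of that header mapping, assuming the `UJB-` prefix is applied to the name of each header you want forwarded, as the two examples above suggest. `forward_headers` and `request_headers` are hypothetical helpers; the `ApiKey` header comes from the curl example on this page.

```python
def forward_headers(own_headers: dict[str, str], prefix: str = "UJB-") -> dict[str, str]:
    """Prefix each credential header so it is forwarded to the target site."""
    return {f"{prefix}{name}": value for name, value in own_headers.items()}


def request_headers(api_key: str, own_headers: dict[str, str]) -> dict[str, str]:
    """Combine the API key header with the prefixed credential headers."""
    return {"ApiKey": api_key, **forward_headers(own_headers)}
```

For example, `request_headers("YOUR_API_KEY", {"Cookie": "session=abc"})` yields an `ApiKey` header plus `UJB-Cookie`, keeping your API credentials separate from the credentials meant for the publisher.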

How does it handle multi-page articles?

Automatically. Pagination links are detected, followed in order (up to 30 pages by default), and content is merged into a single unified text. One URL in, complete article out. Configure max pages or disable pagination via `pagination_max_pages`.
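
Capping the page count might look like this, assuming `pagination_max_pages` is passed as a query parameter alongside `url`. The value that disables pagination is not given here, so the sketch only shows setting a cap; `article_query` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode


def article_query(article_url: str, max_pages: Optional[int] = None) -> str:
    """Build the query string, adding pagination_max_pages only when a cap is given."""
    params = {"url": article_url}
    if max_pages is not None:
        # Assumed to be a plain integer query parameter.
        params["pagination_max_pages"] = str(max_pages)
    return urlencode(params)
```

Omitting `max_pages` leaves the parameter out entirely, so the documented default of up to 30 pages applies.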

How does this compare to newspaper3k / readability?

Four differences matter in production:
JS-rendered content: most modern sites need it. We handle it; they don’t.
Proxy rotation and anti-bot bypass: included here; with open-source libraries you build them separately.
Accuracy across thousands of layouts: we have tuned for six years; open-source extractors tend to break on non-standard HTML.
Production reliability: automatic retries, CAPTCHA solving, and rate-limit handling are included.
Great prototyping tools become production liabilities at scale.

Can I get markdown output for LLM prompts?

Yes - the Markdown API is purpose-built for it. The `fit` filter keeps the main content; the `bm25` filter returns query-relevant slices (useful for focused context windows). Citation references are included for source attribution, and document structure is preserved so models understand hierarchy.

Rate limits and capacity?

Capacity is a plan-dependent monthly credit pool. Each Extract call costs 5 credits (10 with JS rendering, 30 with premium proxies). Markdown wraps /extract and bills at the same rate. Enterprise plans for high-volume extraction across thousands of sources come with dedicated infrastructure and custom limits.
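
To size a plan, the arithmetic can be sketched as follows. The per-call rates are the three quoted above, treated as flat rates per mode; the mode names and function names are ours.

```python
# Credits per Extract call, as quoted in the answer above.
CREDITS_PER_CALL = {"basic": 5, "js_rendering": 10, "premium_proxy": 30}


def calls_affordable(credits: int, mode: str = "basic") -> int:
    """How many whole calls a credit balance covers in the given mode."""
    return credits // CREDITS_PER_CALL[mode]


def credits_needed(calls: int, mode: str = "basic") -> int:
    """Credits consumed by a number of successful calls in the given mode."""
    return calls * CREDITS_PER_CALL[mode]
```

On these rates, 5,000 credits cover 1,000 basic calls, 500 calls with JS rendering, or 166 calls with premium proxies, and 10,000 basic extractions a day would consume 50,000 credits a day.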

5,000 free credits to start.

No credit card. Failed requests cost zero.

Start extracting articles for AI today.

Join AI teams building RAG pipelines, training datasets, and content-intelligence systems on clean article extraction.
