Content intelligence

Every article, every blog, every release, structured.

Newsrooms, market intel, brand monitoring, content aggregation. Clean text, real publish dates, deduped across syndication. Boilerplate-free, language-tagged, ready for the next step in your pipeline.

Start free Try the API

40+ languages, RTL included

6 yrs tuning the extractor

1M+ sites tested

< 1s p50 extraction

What you actually need

The hard parts, already solved.

Body, not boilerplate

Six years of tuning. Nav, ads, "you might also like", cookie banners: all stripped. Only the article remains, in clean HTML or plain text.

Language-tagged

Every article is tagged with its detected language and a confidence score. RTL languages (Arabic, Hebrew) are handled correctly, with text direction preserved in the HTML output.
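A rough sketch of how language tags might be used downstream: keep only confident detections, and pick a text direction for rendering. The `TaggedArticle` shape, its `confidence` field name, the 0.9 threshold, and the helper names are assumptions for illustration, not the API's documented response.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface TaggedArticle {
  language: string;   // ISO 639-1 code such as "en" or "ar"
  confidence: number; // detection confidence between 0 and 1
  html: string;
}

// A few common right-to-left languages; not an exhaustive list.
const RTL_LANGUAGES = new Set(["ar", "he", "fa", "ur"]);

// Value for an HTML `dir` attribute when rendering the article body.
function textDirection(language: string): "rtl" | "ltr" {
  return RTL_LANGUAGES.has(language.toLowerCase()) ? "rtl" : "ltr";
}

// Keep only articles in one language whose detection is confident enough.
function confidentIn(
  articles: TaggedArticle[],
  language: string,
  minConfidence = 0.9
): TaggedArticle[] {
  return articles.filter(
    a => a.language === language && a.confidence >= minConfidence
  );
}
```

Filtering on confidence before sentiment or topic models run keeps short or mixed-language pages from being scored with the wrong model.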

Real publish dates

Parsed in order of reliability: JSON-LD first, then OpenGraph, then DOM heuristics. We tell you whether each date is exact or an estimate. Critical for time-series content analysis.
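The fallback order described above can be sketched as a small resolver. This is an illustration only, assuming the three candidate strings have already been pulled from the page; the `DateCandidates` shape, the `resolvePublishDate` name, and the `exact` flag are hypothetical and not the API's actual response fields.

```typescript
// Hypothetical input: raw date strings already pulled from the page.
interface DateCandidates {
  jsonLd?: string;    // e.g. datePublished from a JSON-LD block
  openGraph?: string; // e.g. an article:published_time meta tag
  domGuess?: string;  // e.g. a date found near the byline
}

interface ResolvedDate {
  iso: string;
  source: "json-ld" | "opengraph" | "dom";
  exact: boolean; // false when the date came from a heuristic
}

// Normalise a raw string to ISO 8601, or null if it does not parse.
function parseIso(raw: string | undefined): string | null {
  if (!raw) return null;
  const ms = Date.parse(raw);
  return Number.isNaN(ms) ? null : new Date(ms).toISOString();
}

// Try the structured sources first and fall back to the heuristic last.
function resolvePublishDate(c: DateCandidates): ResolvedDate | null {
  const jsonLd = parseIso(c.jsonLd);
  if (jsonLd) return { iso: jsonLd, source: "json-ld", exact: true };

  const og = parseIso(c.openGraph);
  if (og) return { iso: og, source: "opengraph", exact: true };

  const dom = parseIso(c.domGuess);
  if (dom) return { iso: dom, source: "dom", exact: false };

  return null;
}
```

Carrying the `exact` flag alongside the date lets a time-series job down-weight or exclude estimated dates instead of treating every timestamp as equally reliable.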

Syndication dedup

The same wire story can appear on 200 outlets. A content hash collapses the copies into one record with a list of where it ran. Your sentiment metrics stop double-counting.

Paywall-aware

Soft paywalls are bypassed automatically. Hard paywalls return a clear paywalled:true flag, so no half-extracted nonsense pollutes your dataset.
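One way to act on the flag is to split results before they reach analysis. A minimal sketch, assuming each record carries the `paywalled` flag mentioned above; the rest of the `ExtractedArticle` shape and the `partitionByPaywall` helper are hypothetical.

```typescript
// Hypothetical record shape; only `paywalled` is named in the text above.
interface ExtractedArticle {
  url: string;
  paywalled: boolean;
  text?: string;
}

// Separate usable articles from hard-paywalled URLs so the latter can be
// logged or retried through another route instead of being analysed.
function partitionByPaywall(articles: ExtractedArticle[]): {
  usable: ExtractedArticle[];
  blocked: string[];
} {
  const usable: ExtractedArticle[] = [];
  const blocked: string[] = [];
  for (const a of articles) {
    if (a.paywalled) blocked.push(a.url);
    else usable.push(a);
  }
  return { usable, blocked };
}
```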

RSS, sitemap, or list

Feed us RSS feeds, sitemap URLs, or just a list of article URLs. We discover, extract, dedupe, and stream results via webhook.

The pipeline

4 steps. One pipeline.

Discover

RSS, sitemap, SERP query, or seed list. We expand to article URLs and queue them for extraction.

Extract

Article Extractor returns title, author, date, body HTML, lead image, language, word count, and content hash.

Dedupe

Matching content hashes collapse syndicated reposts. The canonical URL wins; the rest become metadata.
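The collapse step can be sketched as a group-by on the content hash. This is an illustration under assumptions: `content_hash` comes from the extractor output listed above, while the `canonical_url` field, the `Story` shape, and the tie-break (first copy seen when no canonical URL is declared) are hypothetical.

```typescript
interface Article {
  url: string;
  canonical_url?: string; // assumed field: canonical URL declared by the page
  content_hash: string;
}

interface Story {
  content_hash: string;
  canonical: string;
  also_ran_on: string[]; // the other URLs where the same content appeared
}

function collapseSyndicated(articles: Article[]): Story[] {
  // Group copies by content hash, preserving first-seen order.
  const groups = new Map<string, Article[]>();
  for (const a of articles) {
    const group = groups.get(a.content_hash);
    if (group) group.push(a);
    else groups.set(a.content_hash, [a]);
  }

  const stories: Story[] = [];
  for (const [content_hash, copies] of groups) {
    // Prefer a canonical URL declared by any copy; else use the first copy.
    const declared = copies.find(c => c.canonical_url !== undefined);
    const canonical = declared?.canonical_url ?? copies[0].url;
    const also_ran_on = copies.map(c => c.url).filter(u => u !== canonical);
    stories.push({ content_hash, canonical, also_ran_on });
  }
  return stories;
}
```

Counting stories rather than raw articles is what keeps a single wire piece from registering as hundreds of separate mentions.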

Enrich

Pipe to your NLP step: entity extraction, sentiment, topic classification. Or hand off to Markdown for RAG ingest.

full pipeline · 30 lines

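// content-firehose.ts
import { Ujeebu } from "ujeebu";
import uniqBy from "lodash/uniqBy";
// Hypothetical local module: seen() checks your dedup store,
// pipeline publishes to your downstream queue.
import { seen, pipeline } from "./app";

const uj = new Ujeebu(process.env.UJEEBU_KEY);

const FEEDS = [
  "https://www.economist.com/rss",
  "https://www.wired.com/rss",
  "https://www.lemonde.fr/rss",
];

for (const feed of FEEDS) {
  const items = await uj.rss(feed);

  const articles = await Promise.all(
    items.map(({ link }) => uj.article({ url: link }))
  );

  // Drop duplicate copies within this feed
  const unique = uniqBy(articles, a => a.content_hash);

  // filter() cannot await, so resolve the seen() lookups first;
  // seen() is what catches copies already ingested from other feeds
  const known = await Promise.all(unique.map(a => seen(a.content_hash)));
  const fresh = unique.filter((_, i) => !known[i]);

  await pipeline.publish(fresh);
}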
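The pipeline above calls seen() without showing it. A minimal in-memory sketch of what such a helper could look like follows; the `seenHashes` set and the `checkAndRecord` name are hypothetical, and this version records the hash as a side effect of checking it, which is one design choice among several.

```typescript
// Hypothetical in-memory stand-in. A real deployment would likely back this
// with a database or cache so the state survives restarts.
const seenHashes = new Set<string>();

// Returns true if the hash was already recorded; records it either way.
function checkAndRecord(contentHash: string): boolean {
  const alreadySeen = seenHashes.has(contentHash);
  seenHashes.add(contentHash);
  return alreadySeen;
}

// Async wrapper so callers can `await seen(hash)`, leaving room to swap in
// a networked store later without changing call sites.
async function seen(contentHash: string): Promise<boolean> {
  return checkAndRecord(contentHash);
}
```

Because the check also records the hash, a story that arrives from a second feed later in the same run is reported as already seen.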

Endpoints in play

Use what fits. Skip what doesn’t.

Article Extractor

The core. Boilerplate-free article in 600ms. 2 credits.

Markdown

When you need LLM-ready chunks instead of raw article HTML.

SERP: topic discovery

Find articles on a topic across the web; pipe URLs into Article Extractor.

Generic Scraper

For paywalled or geo-restricted content where you need session control.

Built for content intelligence. Stress-tested by 3,000+ teams.
