Content intelligence

Every article, every blog, every release, structured.

Newsrooms, market intel, brand monitoring, content aggregation. Clean text, real publish dates, deduped across syndication. Boilerplate-free, language-tagged, ready for the next step in your pipeline.

40+ languages, RTL included
6 yrs tuning the extractor
1M+ sites tested
< 1s p50 extraction
What you actually need

The hard parts, already solved.

Body, not boilerplate

Six years of tuning. Nav, ads, "you might also like", cookie banners: all stripped. Only the article remains, in clean HTML or plain text.

Language-tagged

Every article tagged with detected language + confidence. RTL languages (Arabic, Hebrew) handled correctly, text direction preserved in HTML output.
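As a rough illustration of how a consumer might act on the language tag, here is a minimal sketch. The field names `language` and `language_confidence`, the 0..1 confidence range, and the 0.8 threshold are all assumptions made for this example, not the documented response shape.

```typescript
// Hypothetical response shape: field names are assumptions for illustration.
interface TaggedArticle {
  language: string;            // language code such as "ar" or "he-IL"
  language_confidence: number; // assumed to be in the range 0..1
  html: string;
}

// Languages commonly written right-to-left (not an exhaustive list).
const RTL_LANGUAGES = new Set(["ar", "he", "fa", "ur"]);

// Map a language code to the text direction a renderer should use.
function textDirection(language: string): "rtl" | "ltr" {
  const primary = language.toLowerCase().split("-")[0];
  return RTL_LANGUAGES.has(primary) ? "rtl" : "ltr";
}

// Keep only articles whose detected language is trusted enough to route on.
function confidentlyTagged(
  articles: TaggedArticle[],
  minConfidence = 0.8
): TaggedArticle[] {
  return articles.filter(a => a.language_confidence >= minConfidence);
}
```

A downstream step could call `textDirection(article.language)` when rendering, and send low-confidence articles to a manual review queue instead of dropping them.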

Real publish dates

Parsed from JSON-LD first, then OpenGraph, then DOM heuristics. We tell you when a date is exact and when it's an estimate. Critical for time-series content analysis.
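The fallback order described above can be sketched as a simple priority chain. This is an illustration only: the types, the `resolvePublishDate` name, and the rule that structured metadata counts as exact while DOM heuristics count as estimates are assumptions, not a description of the extractor's internals.

```typescript
type DateSource = "json-ld" | "opengraph" | "dom-heuristic";

interface DateCandidate {
  source: DateSource;
  value: string | undefined; // raw string found on the page, if any
}

interface ResolvedDate {
  date: Date;
  source: DateSource;
  exact: boolean; // assumption: only heuristic dates are flagged as estimates
}

// Sources are tried in this order; the first parseable value wins.
const PRIORITY: DateSource[] = ["json-ld", "opengraph", "dom-heuristic"];

function resolvePublishDate(candidates: DateCandidate[]): ResolvedDate | null {
  for (const source of PRIORITY) {
    const raw = candidates.find(c => c.source === source)?.value;
    if (!raw) continue;
    const date = new Date(raw);
    if (Number.isNaN(date.getTime())) continue; // unparseable, try the next source
    return { date, source, exact: source !== "dom-heuristic" };
  }
  return null; // no usable date anywhere on the page
}
```

Carrying `source` and `exact` alongside the date lets a time-series job weight or exclude estimated dates rather than treating every timestamp as equally reliable.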

Syndication dedup

The same wire story can appear on 200 outlets. A content hash collapses the copies into one record with a list of where it ran, so your sentiment metrics stop double-counting.
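To make the idea concrete, here is a minimal sketch of collapsing articles by content hash. The `content_hash` field name comes from the pipeline example on this page; the `ran_on` field and the `collapseSyndicated` helper are hypothetical names chosen for illustration.

```typescript
interface Article {
  url: string;
  content_hash: string; // field name as used in the pipeline example
}

interface DedupedRecord {
  content_hash: string;
  url: string;      // first URL seen for this hash
  ran_on: string[]; // hypothetical field: every URL that carried the same content
}

// Group articles by content hash, keeping one record per distinct body.
function collapseSyndicated(articles: Article[]): DedupedRecord[] {
  const byHash = new Map<string, DedupedRecord>();
  for (const { url, content_hash } of articles) {
    const existing = byHash.get(content_hash);
    if (existing) {
      existing.ran_on.push(url); // a syndicated copy: record where it ran
    } else {
      byHash.set(content_hash, { content_hash, url, ran_on: [url] });
    }
  }
  return Array.from(byHash.values());
}
```

Counting sentiment once per record, rather than once per URL, is what keeps a widely syndicated story from dominating the metric.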

Paywall-aware

Soft paywalls bypassed automatically. Hard paywalls return a clear paywalled:true flag, with no half-extracted nonsense polluting your dataset.
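A consumer would typically branch on that flag before anything reaches the dataset. The sketch below assumes a result object with a boolean `paywalled` field (the flag named above) plus hypothetical `url` and `text` fields; the real response may be shaped differently.

```typescript
interface ExtractionResult {
  url: string;
  paywalled: boolean; // flag named in the text above
  text?: string;      // hypothetical: body text, assumed absent when paywalled
}

// Split results into usable articles and URLs blocked by a hard paywall.
function partitionByPaywall(results: ExtractionResult[]): {
  usable: ExtractionResult[];
  blocked: string[];
} {
  const usable: ExtractionResult[] = [];
  const blocked: string[] = [];
  for (const r of results) {
    if (r.paywalled) blocked.push(r.url);
    else usable.push(r);
  }
  return { usable, blocked };
}
```

Keeping the blocked URLs, instead of silently discarding them, makes it possible to report coverage gaps per source.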

RSS, sitemap, or list

Feed us RSS feeds, sitemap URLs, or just a list of article URLs. We discover, extract, dedupe, and stream results via webhook.

The pipeline

4 steps. One pipeline.

01 Discover
RSS, sitemap, SERP query, or seed list. We expand to article URLs and queue them for extraction.

02 Extract
Article Extractor returns title, author, date, body HTML, lead image, language, word count, content hash.

03 Dedupe
Content-hash matches collapse syndicated reposts. Canonical URL wins; rest become metadata.

04 Enrich
Pipe to your NLP step: entity extraction, sentiment, topic classification. Or hand off to Markdown for RAG ingest.
full pipeline · 30 lines
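// content-firehose.ts
import { Ujeebu } from "ujeebu";
import { uniqBy } from "lodash";
// Assumed local module, not part of the SDK: `seen` checks a content hash
// against your own store, `pipeline` is your downstream publisher.
import { seen, pipeline } from "./pipeline";

const uj = new Ujeebu(process.env.UJEEBU_KEY);

const FEEDS = [
  "https://www.economist.com/rss",
  "https://www.wired.com/rss",
  "https://www.lemonde.fr/rss",
];

for (const feed of FEEDS) {
  const items = await uj.rss(feed);

  const articles = await Promise.all(
    items.map(({ link }) => uj.article({ url: link }))
  );

  // Collapse syndicated copies within this batch by content hash
  const unique = uniqBy(articles, a => a.content_hash);

  // Drop anything already published from an earlier feed or run.
  // Array.prototype.filter cannot await, so resolve the lookups first.
  const known = await Promise.all(unique.map(a => seen(a.content_hash)));
  const fresh = unique.filter((_, i) => !known[i]);

  await pipeline.publish(fresh);
}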

Built for content intelligence. Stress-tested by 3,000+ teams.
