Every article, every blog, every release, structured.
Newsrooms, market intel, brand monitoring, content aggregation. Clean text, real publish dates, deduped across syndication. Boilerplate-free, language-tagged, ready for the next step in your pipeline.
The hard parts, already solved.
Six years of tuning. Nav, ads, "you might also like", cookie banners: all stripped. Only the article remains, in clean HTML or plain text.
Every article tagged with detected language + confidence. RTL languages (Arabic, Hebrew) handled correctly, text direction preserved in HTML output.
Publish dates are parsed from JSON-LD → OpenGraph → DOM heuristics, in that order. We tell you when a date is exact and when it's an estimate. Critical for time-series content analysis.
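To make the fallback order concrete, here is a minimal sketch of how a date could be resolved from the three sources. It is an illustration, not the service's implementation: `PageSignals`, `PublishDate`, and `resolvePublishDate` are hypothetical names, and treating structured metadata as "exact" and DOM guesses as "estimate" is an assumption.

```typescript
// Hypothetical shapes for illustration; none of these names come from the Ujeebu API.
type DateSource = "json-ld" | "opengraph" | "dom-heuristic";

interface PageSignals {
  jsonLdDatePublished?: string; // schema.org `datePublished`
  ogPublishedTime?: string;     // <meta property="article:published_time">
  domGuess?: string;            // e.g. a date string found near the byline
}

interface PublishDate {
  iso: string;
  source: DateSource;
  exact: boolean; // assumption: structured metadata is exact, DOM guesses are estimates
}

function resolvePublishDate(s: PageSignals): PublishDate | null {
  // Ordered from most to least trustworthy.
  const candidates: Array<[string | undefined, DateSource, boolean]> = [
    [s.jsonLdDatePublished, "json-ld", true],
    [s.ogPublishedTime, "opengraph", true],
    [s.domGuess, "dom-heuristic", false],
  ];
  for (const [raw, source, exact] of candidates) {
    if (!raw) continue;
    const t = Date.parse(raw);
    if (Number.isNaN(t)) continue; // unparseable value: fall through to the next source
    return { iso: new Date(t).toISOString(), source, exact };
  }
  return null; // no usable date signal at all
}
```

Carrying `source` and `exact` alongside the timestamp lets a downstream time-series job drop or down-weight estimated dates.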
The same wire story can appear on 200 outlets. A content hash collapses the copies into one record with a list of where it ran. Your sentiment metrics stop double-counting.
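The idea can be sketched in a few lines. This is an assumption-laden illustration rather than the actual algorithm: it hashes whitespace- and case-normalised text with SHA-256, so it only catches near-verbatim copies, and `Article`, `StoryRecord`, `contentHash`, and `collapseSyndicated` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only.
interface Article {
  url: string;
  text: string;
}

interface StoryRecord {
  contentHash: string;
  text: string;
  ranOn: string[]; // every URL where the same story appeared
}

// Normalise case and whitespace so trivial formatting differences hash the same.
function contentHash(text: string): string {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Group articles by hash: the first copy becomes the record, later copies only add their URL.
function collapseSyndicated(articles: Article[]): StoryRecord[] {
  const byHash = new Map<string, StoryRecord>();
  for (const a of articles) {
    const h = contentHash(a.text);
    const existing = byHash.get(h);
    if (existing) {
      existing.ranOn.push(a.url);
    } else {
      byHash.set(h, { contentHash: h, text: a.text, ranOn: [a.url] });
    }
  }
  return Array.from(byHash.values());
}
```

Counting records instead of raw articles is what keeps a single wire story from being tallied once per outlet.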
Soft paywalls bypassed automatically. Hard paywalls return a clear paywalled:true flag, with no half-extracted nonsense polluting your dataset.
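On the consuming side, the flag makes it easy to keep blocked pages out of a dataset. A small sketch, assuming each result carries a boolean `paywalled` field as described above; `ExtractResult` and `splitByPaywall` are hypothetical names, and the other fields are placeholders.

```typescript
// Hypothetical result shape; only the `paywalled` flag is taken from the description above.
interface ExtractResult {
  url: string;
  paywalled: boolean;
  text?: string;
}

// Separate usable articles from hard-paywalled URLs so the latter can be logged or retried.
function splitByPaywall(results: ExtractResult[]): { usable: ExtractResult[]; blocked: string[] } {
  const usable: ExtractResult[] = [];
  const blocked: string[] = [];
  for (const r of results) {
    if (r.paywalled) {
      blocked.push(r.url);
    } else {
      usable.push(r);
    }
  }
  return { usable, blocked };
}
```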
Feed us RSS feeds, sitemap URLs, or just a list of article URLs. We discover, extract, dedupe, and stream results via webhook.
4 steps. One pipeline.
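// content-firehose.ts
import { Ujeebu } from "ujeebu";
import { uniqBy } from "lodash";
// `seen` and `pipeline` are your own helpers (not shown): a hash lookup and a publisher.

const uj = new Ujeebu(process.env.UJEEBU_KEY);

const FEEDS = [
  "https://www.economist.com/rss",
  "https://www.wired.com/rss",
  "https://www.lemonde.fr/rss",
];

for (const feed of FEEDS) {
  const items = await uj.rss(feed);
  const articles = await Promise.all(
    items.map(({ link }) => uj.article({ url: link }))
  );

  // Collapse syndicated copies within this feed
  const unique = uniqBy(articles, (a) => a.content_hash);
  // Drop stories already seen in earlier feeds or runs.
  // `filter` cannot await, so resolve the lookups first.
  const alreadySeen = await Promise.all(unique.map((a) => seen(a.content_hash)));
  const fresh = unique.filter((_, i) => !alreadySeen[i]);

  await pipeline.publish(fresh);
}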