The web → training-quality Markdown.
Feed any URL or sitemap; get back chunked, deduped, embed-ready Markdown. Built for the people building the next models, and the people building the next agents on top of them.
The hard parts, already solved.
Headings stay headings. Tables stay tables. Code blocks keep language hints. Links keep anchors. Not the GPT-flattened mess most "to-markdown" tools produce.
Chunking: semantic (header-aware), fixed-token, sentence, or none. Token counts are precomputed for cl100k / o200k / claude. Drop chunks directly into your vector store.
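To make "header-aware" concrete, here is a minimal sketch of the idea, not the service's actual implementation. It starts a new chunk at every Markdown heading and splits further once a size budget is exceeded. The names `Chunk` and `chunkByHeading` are hypothetical, and whitespace-separated words stand in for real tokenizer counts.

```typescript
// Hypothetical sketch: split Markdown at headings, then cap each chunk by an
// approximate token budget (word count is a stand-in for a real tokenizer).
interface Chunk {
  heading: string; // nearest heading above the chunk, "" if none
  text: string;
}

function chunkByHeading(markdown: string, maxTokens: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];
  let count = 0;

  // Emit the buffered lines as one chunk under the current heading.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) chunks.push({ heading, text });
    buffer = [];
    count = 0;
  };

  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // a heading always closes the previous chunk
      heading = match[1].trim();
      continue;
    }
    const words = line.split(/\s+/).filter(Boolean).length;
    // Split before this line if adding it would exceed the budget.
    if (count > 0 && count + words > maxTokens) flush();
    buffer.push(line);
    count += words;
  }
  flush();
  return chunks;
}
```

The sketch ignores fenced code blocks, so a `#` comment inside a fence would be mistaken for a heading; a production chunker would need to track fence state.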
Pass embed:"openai" or embed:"voyage" and we attach vectors. One API call instead of fetch → strip → chunk → embed pipelines.
Every chunk ships with source URL, fetch date, content hash. Critical for filtering training data and showing citations in RAG outputs.
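As an illustration of how that provenance could be used downstream, the sketch below drops stale chunks, removes exact duplicates by hash, and formats a citation. Only `source_url` appears in the sample code on this page; the field names `fetched_at` and `content_hash`, and the helpers `filterFresh` and `cite`, are assumptions made for the example.

```typescript
// Hypothetical chunk shape: fetched_at and content_hash are assumed names.
interface ChunkMeta {
  source_url: string;
  fetched_at: string; // ISO 8601 timestamp
  content_hash: string;
  text: string;
}

// Keep chunks fetched on or after `notBefore`, first occurrence per hash.
function filterFresh(chunks: ChunkMeta[], notBefore: string): ChunkMeta[] {
  const cutoff = Date.parse(notBefore);
  const seen = new Set<string>();
  return chunks.filter((c) => {
    if (Date.parse(c.fetched_at) < cutoff) return false;
    if (seen.has(c.content_hash)) return false;
    seen.add(c.content_hash);
    return true;
  });
}

// Build a short citation string for display next to a RAG answer.
function cite(c: ChunkMeta): string {
  return `${c.source_url} (fetched ${c.fetched_at.slice(0, 10)})`;
}
```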
Feed us a sitemap or URL list; we extract from each, dedup, and stream chunks via webhook. No fetcher code, no retry logic for you to write.
We never train on your data. BYO LLM keys for AI Scraper if you want to keep extraction in your own provider account. SOC 2 in progress, GDPR DPA available.
4 steps. One pipeline.
Set embed to attach vectors, or skip it and embed downstream; chunk ids are stable across runs.
// rag-ingest.ts
import { Ujeebu } from "ujeebu";
import { Pinecone } from "@pinecone-database/pinecone";
const uj = new Ujeebu(process.env.UJEEBU_KEY);
const pc = new Pinecone();
const index = pc.index("docs");
// Feed the docs sitemap as an async batch, embed, upsert.
const job = await uj.markdown({
  sitemap: "https://docs.stripe.com/sitemap.xml",
  chunk: "semantic",
  max_tokens: 800,
  embed: "openai",
  webhook: "https://your-app.com/ingest",
});

// Webhook handler:
export async function ingest({ chunks }) {
  await index.upsert(chunks.map(c => ({
    id: c.id, values: c.embedding,
    metadata: { url: c.source_url, text: c.text }
  })));
}
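A webhook delivery for a large sitemap may carry more records than a vector store accepts in one request. One possible way to handle that, sketched here under that assumption, is to split the mapped records into fixed-size batches before upserting. The helper `toBatches` is hypothetical and not part of any SDK.

```typescript
// Hypothetical helper: split an array into consecutive batches of at most
// `size` items, preserving order.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

Inside the handler this would look like `for (const batch of toBatches(records, 100)) await index.upsert(batch);`, where 100 is an arbitrary batch size chosen for illustration.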