LLM training & RAG ingest

The web → training-quality Markdown.

Feed any URL or sitemap; get back chunked, deduped, embed-ready Markdown. Built for the people building the next models, and the people building the next agents on top of them.

10B+ tokens delivered
4 chunk strategies
3 embedding providers
18k docs sites, fresh nightly
What you actually need

The hard parts, already solved.

Real Markdown

Headings stay headings. Tables stay tables. Code blocks keep language hints. Links keep anchors. Not the GPT-flattened mess most "to-markdown" tools produce.

Smart chunking

Four strategies: semantic (header-aware), fixed-token, sentence, or none. Token counts are precomputed for the cl100k, o200k, and Claude tokenizers, so chunks drop directly into your vector store.
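To make "header-aware" concrete, here is a minimal sketch of one way such a chunker could work: start a new chunk at every Markdown heading and split again when a token budget is exceeded. It is an illustration, not the service's implementation. `chunkByHeading` and `countTokens` are hypothetical names, and tokens are approximated by whitespace-separated words rather than a real cl100k or o200k tokenizer.

```typescript
// Sketch only: header-aware chunking with an approximate token budget.
interface Chunk {
  heading: string; // nearest heading above the chunk, "" if none
  text: string; // body text, heading line excluded
  approxTokens: number; // word count, NOT a real tokenizer count
}

// Assumption: whitespace-separated words stand in for tokens.
const countTokens = (s: string): number =>
  s.split(/\s+/).filter(Boolean).length;

// Code-fence marker, built programmatically to keep this example readable.
const FENCE = "`".repeat(3);

function chunkByHeading(markdown: string, maxTokens: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];
  let inFence = false;

  // Emit the buffered lines as one chunk under the current heading.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ heading, text, approxTokens: countTokens(text) });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // Track fenced code so "# comment" lines inside code are not headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;

    if (!inFence && /^#{1,6}\s/.test(line)) {
      flush(); // a heading always closes the previous chunk
      heading = line.replace(/^#{1,6}\s+/, "").trim();
      continue;
    }

    // Split within a section once adding this line would exceed the budget.
    const candidate = buffer.join("\n") + "\n" + line;
    if (buffer.length > 0 && countTokens(candidate) > maxTokens) flush();
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

A single line longer than the budget is emitted as its own oversized chunk; a production chunker would likely fall back to sentence splitting at that point.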

Embed in one round-trip

Pass embed:"openai" or embed:"voyage" and we attach vectors. One API call instead of fetch → strip → chunk → embed pipelines.

Provenance + license

Every chunk ships with its source URL, fetch date, and content hash. Critical for filtering training data and showing citations in RAG outputs.
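As a rough illustration of how these provenance fields could be used downstream, the sketch below filters chunks by source host and fetch date, then formats a citation string. Only `source_url` appears in the pipeline example on this page; `fetched_at` and `content_hash` are assumed field names, and both helper functions are hypothetical.

```typescript
// Sketch only: field names other than source_url are assumptions.
interface ProvenancedChunk {
  id: string;
  text: string;
  source_url: string;
  fetched_at: string; // ISO 8601 timestamp (assumed field name)
  content_hash: string; // assumed field name
}

// Keep chunks from allow-listed hosts that were fetched on or after a cutoff.
function filterByProvenance(
  chunks: ProvenancedChunk[],
  allowedHosts: ReadonlySet<string>,
  notBefore: Date,
): ProvenancedChunk[] {
  return chunks.filter((c) => {
    const host = new URL(c.source_url).hostname;
    const fetched = new Date(c.fetched_at).getTime();
    return allowedHosts.has(host) && fetched >= notBefore.getTime();
  });
}

// Build a short citation line for a RAG answer: URL plus the fetch date.
function formatCitation(c: ProvenancedChunk): string {
  return `${c.source_url} (fetched ${c.fetched_at.slice(0, 10)})`;
}
```

The same filter shape works for training-data curation: swap the host allow-list for whatever licensing or freshness rule applies to your corpus.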

Sitemap ingestion

Feed us a sitemap or URL list; we extract from each, dedup, and stream chunks via webhook. No fetcher code, no retry logic for you to write.
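The page does not say how deduplication is done. One common approach, sketched here under that caveat, is to hash whitespace-normalized chunk text and drop repeats. `normalize`, `contentHash`, and `dedupe` are hypothetical helpers, and SHA-256 is an assumed choice of hash.

```typescript
import { createHash } from "node:crypto";

// Collapse whitespace so trivially different copies hash identically.
const normalize = (text: string): string => text.replace(/\s+/g, " ").trim();

// SHA-256 of the normalized text, hex-encoded (assumed hashing scheme).
const contentHash = (text: string): string =>
  createHash("sha256").update(normalize(text), "utf8").digest("hex");

// Keep the first chunk seen for each distinct content hash.
function dedupe<T extends { text: string }>(chunks: readonly T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const chunk of chunks) {
    const hash = contentHash(chunk.text);
    if (seen.has(hash)) continue;
    seen.add(hash);
    unique.push(chunk);
  }
  return unique;
}
```

Exact hashing only catches identical text after normalization; near-duplicates such as the same page with a different footer would need a fuzzier method.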

Your data, your rules

We never train on your data. BYO LLM keys for AI Scraper if you want to keep extraction in your own provider account. SOC 2 in progress, GDPR DPA available.

The pipeline

4 steps. One pipeline.

01 Source
Sitemap, RSS, list of URLs, or SERP queries. Feed us the URLs and we extract from each.

02 Extract + chunk
The Markdown endpoint strips boilerplate, converts the page to clean Markdown, and chunks it by your chosen strategy. Token counts are precomputed.

03 Embed
Optional: pass embed to attach vectors. Or skip it and embed downstream; chunk ids are stable across runs.

04 Store
Stream to Pinecone, Weaviate, pgvector, Chroma, or your warehouse. Webhooks deliver chunks as they complete, and delivery is resumable on failure.
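Step 03 notes that chunk ids are stable across runs. If you embed downstream, one way to use that (a sketch, assuming you keep your own record of ids you have already embedded) is to split each delivery into new and already-known chunks and only pay for embeddings on the new ones. `partitionByKnownIds` is a hypothetical helper.

```typescript
interface IdentifiedChunk {
  id: string;
  text: string;
}

// Split incoming chunks into those not yet embedded and those already known.
function partitionByKnownIds<T extends IdentifiedChunk>(
  incoming: readonly T[],
  knownIds: ReadonlySet<string>,
): { fresh: T[]; known: T[] } {
  const fresh: T[] = [];
  const known: T[] = [];
  for (const chunk of incoming) {
    (knownIds.has(chunk.id) ? known : fresh).push(chunk);
  }
  return { fresh, known };
}
```

Only `fresh` needs to go to your embedding provider; `known` can reuse vectors you stored on an earlier run.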
full pipeline · 30 lines
// rag-ingest.ts
import { Ujeebu } from "ujeebu";
import { Pinecone } from "@pinecone-database/pinecone";

const uj = new Ujeebu(process.env.UJEEBU_KEY);
const pc = new Pinecone();
const index = pc.index("docs");

// Feed the docs sitemap as an async batch, embed, upsert.
const job = await uj.markdown({
  sitemap:    "https://docs.stripe.com/sitemap.xml",
  chunk:      "semantic",
  max_tokens: 800,
  embed:      "openai",
  webhook:    "https://your-app.com/ingest",
});

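// Webhook handler: upsert each delivered chunk into the index.
// Chunk lists only the fields this handler reads.
type Chunk = { id: string; embedding: number[]; source_url: string; text: string };
export async function ingest({ chunks }: { chunks: Chunk[] }) {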
  await index.upsert(chunks.map(c => ({
    id: c.id, values: c.embedding,
    metadata: { url: c.source_url, text: c.text }
  })));
}
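
The handler above sends every chunk in a delivery in one upsert. If deliveries can be large, it may be safer to upsert in fixed-size batches, since vector stores typically limit request size. A minimal sketch, where `toBatches` is a hypothetical helper and the batch size is yours to choose:

```typescript
// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Inside the handler, you would map the chunks to records as before, then loop over `toBatches(records, 100)` and await one `index.upsert(batch)` per batch; 100 is an illustrative size, not a documented limit.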

Built for LLM data. Stress-tested by 3,000+ teams.
