LLM training & RAG ingest

The web → training-quality Markdown.

Feed any URL or sitemap; get back chunked, deduped, embed-ready Markdown. Built for the people building the next models, and the people building the next agents on top of them.


10B+ tokens delivered

chunk strategies

embedding providers

18k docs sites, fresh nightly

What you actually need

The hard parts, already solved.

Real Markdown

Headings stay headings. Tables stay tables. Code blocks keep language hints. Links keep anchors. Not the GPT-flattened mess most "to-markdown" tools produce.

Smart chunking

Choose semantic (header-aware), fixed-token, sentence, or no chunking. Token counts come precomputed for the cl100k, o200k, and Claude tokenizers. Drop the chunks directly into your vector store.
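To make "header-aware" concrete, here is a minimal sketch of the idea, not the service's implementation. It splits Markdown at ATX headings and packs whole paragraphs under a token budget. The names `chunkByHeading`, `approxTokenCount`, and the `Chunk` shape are hypothetical, and the word count is only a rough stand-in for a real tokenizer such as cl100k.

```typescript
// Hypothetical chunk shape for illustration; not the API's response schema.
interface Chunk {
  heading: string;      // nearest heading above the text ("" if none)
  text: string;         // one or more paragraphs joined by blank lines
  approxTokens: number; // word count, a crude proxy for tokenizer output
}

// Assumption: whitespace-separated words approximate tokens well enough for a sketch.
function approxTokenCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Split at ATX headings ("# ", "## ", ...), then pack paragraphs up to maxTokens.
// Limitations: expects a blank line after each heading, ignores fenced code,
// and emits a single oversized paragraph as its own chunk rather than splitting it.
function chunkByHeading(markdown: string, maxTokens: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let paragraphs: string[] = [];

  // Emit everything collected under the current heading.
  const flush = (): void => {
    let buffer: string[] = [];
    let count = 0;
    const emit = (): void => {
      if (buffer.length === 0) return;
      chunks.push({ heading, text: buffer.join("\n\n"), approxTokens: count });
      buffer = [];
      count = 0;
    };
    for (const p of paragraphs) {
      const n = approxTokenCount(p);
      // Start a new chunk when adding this paragraph would exceed the budget.
      if (count > 0 && count + n > maxTokens) emit();
      buffer.push(p);
      count += n;
    }
    emit();
    paragraphs = [];
  };

  for (const block of markdown.split(/\n{2,}/)) {
    const trimmed = block.trim();
    if (trimmed === "") continue;
    const match = /^#{1,6}\s+(.*)$/.exec(trimmed);
    if (match) {
      flush();                  // close out the previous section first
      heading = match[1] ?? "";
    } else {
      paragraphs.push(trimmed);
    }
  }
  flush();
  return chunks;
}
```

Because chunks never cross a heading, each one can carry its section title as context when it is embedded or cited.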

Embed in one round-trip

Pass embed:"openai" or embed:"voyage" and we attach vectors. One API call instead of fetch → strip → chunk → embed pipelines.

Provenance + license

Every chunk ships with its source URL, fetch date, and content hash. That metadata is critical for filtering training data and for showing citations in RAG outputs.
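As an illustration of how such metadata can be used downstream, the sketch below builds a provenance record and verifies it later. It assumes a SHA-256 hash over whitespace-normalized text; the actual hashing scheme is not documented here, and every name except `source_url` (`fetched_at`, `content_hash`, `withProvenance`, `isUnchanged`) is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; only source_url appears in the sample pipeline.
interface ProvenancedChunk {
  text: string;
  source_url: string;
  fetched_at: string;   // ISO 8601 timestamp
  content_hash: string; // hex SHA-256 of the normalized text (assumed scheme)
}

// Collapse whitespace runs so cosmetic reflows do not change the hash.
function contentHash(text: string): string {
  const normalized = text.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Attach provenance fields to a piece of extracted text.
function withProvenance(
  text: string,
  sourceUrl: string,
  fetchedAt: Date = new Date(),
): ProvenancedChunk {
  return {
    text,
    source_url: sourceUrl,
    fetched_at: fetchedAt.toISOString(),
    content_hash: contentHash(text),
  };
}

// True when the stored hash still matches the stored text.
function isUnchanged(chunk: ProvenancedChunk): boolean {
  return contentHash(chunk.text) === chunk.content_hash;
}
```

A filter over training data can then drop records whose hash no longer verifies, and a RAG answer can cite `source_url` alongside the fetch date.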

Sitemap ingestion

Feed us a sitemap or URL list; we extract from each, dedup, and stream chunks via webhook. No fetcher code, no retry logic for you to write.
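For intuition, deduplication across many URLs can be as simple as keeping the first chunk seen for each content hash. The sketch below assumes each chunk exposes a `content_hash` field (a hypothetical name); it is one plausible approach, not a description of how the service deduplicates.

```typescript
// Keep the first occurrence of each content hash and preserve input order.
// Assumption: identical content always yields an identical content_hash.
function dedupByHash<T extends { content_hash: string }>(chunks: readonly T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const chunk of chunks) {
    if (seen.has(chunk.content_hash)) continue; // repeated boilerplate or mirrored page
    seen.add(chunk.content_hash);
    unique.push(chunk);
  }
  return unique;
}
```

Exact-hash matching only catches identical chunks; near-duplicates such as the same page with a different footer would need a fuzzier comparison.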

Your data, your rules

We never train on your data. SOC 2 in progress, GDPR DPA available.

The pipeline

4 steps. One pipeline.

Source

Sitemap, RSS, list of URLs, or SERP queries. Feed us the URLs and we extract from each.

Extract + chunk

The Markdown endpoint strips boilerplate, converts the page to clean Markdown, and chunks it by your chosen strategy. Token counts come precomputed.

Embed

Optional: pass embed to attach vectors, or skip it and embed downstream. Chunk ids stay stable across runs.

Store

Stream to Pinecone, Weaviate, pgvector, Chroma, or your warehouse. Webhooks deliver as chunks complete; resumable on failure.

full pipeline · 30 lines

// rag-ingest.ts
import { Ujeebu } from "ujeebu";
import { Pinecone } from "@pinecone-database/pinecone";

const uj = new Ujeebu(process.env.UJEEBU_KEY);
const pc = new Pinecone();
const index = pc.index("docs");

// Feed the docs sitemap as an async batch, embed, upsert.
const job = await uj.markdown({
  sitemap:    "https://docs.stripe.com/sitemap.xml",
  chunk:      "semantic",
  max_tokens: 800,
  embed:      "openai",
  webhook:    "https://your-app.com/ingest",
});

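// Webhook handler: upsert each delivered chunk into the Pinecone index.
// Chunk lists only the fields this handler reads from the payload.
interface Chunk { id: string; embedding: number[]; source_url: string; text: string }

export async function ingest({ chunks }: { chunks: Chunk[] }) {
  await index.upsert(chunks.map(c => ({
    id: c.id, values: c.embedding,
    metadata: { url: c.source_url, text: c.text }
  })));
}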
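A single webhook delivery may carry more chunks than you want to send in one upsert request. The sketch below splits the records into fixed-size batches before handing them to a sink. `toBatches` and `upsertInBatches` are hypothetical helpers, and the batch size of 100 in the usage note is an assumption to tune against your vector store's request limits.

```typescript
// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Send batches one at a time; `sink` is any async function that accepts one batch.
// Returns the number of items handed to the sink.
async function upsertInBatches<T>(
  items: readonly T[],
  size: number,
  sink: (batch: T[]) => Promise<unknown>,
): Promise<number> {
  let sent = 0;
  for (const batch of toBatches(items, size)) {
    await sink(batch); // sequential on purpose: simpler to retry when a batch fails
    sent += batch.length;
  }
  return sent;
}
```

Inside the handler above, the direct upsert could become `await upsertInBatches(records, 100, b => index.upsert(b))`, where `records` is the mapped array. Since chunk ids are stable, replaying a failed delivery overwrites existing records instead of duplicating them.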

Endpoints in play

Use what fits. Skip what doesn’t.

Markdown

The core. Clean MD, smart chunks, optional embeddings. 5 credits (wraps /extract), ~1.1s.

Article Extractor

Faster, cheaper alternative for news/blog content where chunking isn't needed.

MCP server

Give your agent native web access at inference time, not just at training time.

Built for LLM data. Stress-tested by 3,000+ teams.
