Use case · Structured data for LLMs

Web pages, JSON-shaped. For training, RAG, and agents.

LLMs need clean, structured input. Raw HTML wastes tokens and pollutes embeddings. Convert any URL into structured JSON or LLM-optimised Markdown - built for fine-tuning, RAG, evaluation, and agentic browsing.


5,000 free credits · no card · failed requests not billed

The challenge

The LLM data pipeline challenge.

You can’t feed raw web pages to a model and expect great answers. Token waste and schema chaos lead to bad RAG, expensive inference, and fragile fine-tunes.

Unstructured web data

Raw HTML does not fit LLM pipelines as-is. It needs extensive preprocessing before any model can use it.

Token waste

Nav menus, ads, boilerplate HTML burn precious context-window tokens with zero informational value.

Schema inconsistency

Data varies wildly across sources. Without constant reconciliation, it is impossible to build reliable structured datasets.

Scale challenges

Collecting millions of pages for training data requires robust extraction that handles anti-bot measures and rate limits without breaking.

Use cases

How AI teams use it.

RAG

Build knowledge bases that actually ground answers.

Extract clean, structured content from docs sites, knowledge bases, authoritative sources. Convert to token-efficient Markdown that preserves heading hierarchy and semantic structure. BM25 filtering extracts only domain-relevant passages - lower storage cost, higher retrieval precision. Feed Pinecone, Weaviate, Chroma.
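As a rough illustration of why preserved heading hierarchy matters for retrieval, the sketch below splits Markdown into one chunk per heading and records the heading trail as metadata. It is a minimal client-side example, not part of the API: the function name `chunk_markdown` and the `heading_path` field are hypothetical, and the embedding and upsert step for your vector store is left out.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into heading-scoped chunks with their heading trail."""
    chunks: List[Dict[str, str]] = []
    path: List[str] = []   # current heading trail, e.g. ["Guide", "Install"]
    body: List[str] = []   # lines collected under the current heading
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]            # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded and stored with `heading_path` as metadata, so a retrieved passage still carries the context of the section it came from.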

Business outcomes

~60% hallucination reduction with grounded knowledge bases
~70% token cost reduction vs raw HTML
100,000+ pages indexed with consistent structure
Automated re-extraction keeps KBs current

Fine-tuning

Generate training datasets at scale.

Extract structured fields with extract_rules mapped to your training format - Q&A pairs, instruction-response examples, or domain entities. Validate the returned JSON against your own Pydantic or JSON Schema. Scrape API + Markdown API for diverse, well-formatted corpora.
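To make the "mapped to your training format" step concrete, here is a small sketch that turns extracted Q&A records into instruction-response JSONL. The `question` and `answer` keys are assumed names you would choose in your own extract_rules, and the `instruction`/`response` output keys are one common convention rather than a required format.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Convert extracted Q&A records into instruction-response JSONL."""
    lines: List[str] = []
    for record in records:
        question = (record.get("question") or "").strip()
        answer = (record.get("answer") or "").strip()
        if not question or not answer:
            continue  # skip incomplete pairs rather than train on them
        example = {"instruction": question, "response": answer}
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines)
```

Dropping incomplete pairs at this stage keeps the dataset consistent; you may prefer to log them instead so you can fix the selectors that produced them.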

Business outcomes

50,000+ validated training examples from web sources
Schema consistency by validating every record against your own JSON Schema or Pydantic model
Dataset prep cut from weeks to hours
Higher fine-tuned model accuracy from clean data

Evaluation

Build evaluation benchmarks from real sources.

Extract ground-truth data from authoritative web sources: government databases, academic repositories, and reference sites, using structured schemas that capture entities, relationships, and attributes. Schema validation keeps benchmark quality high. Automatic refresh keeps evaluations aligned with current information.

Business outcomes

Domain-specific evaluation suites with verified ground truth
Automated benchmark refresh for continuous evaluation
Compare model performance across extraction complexity
Detect model regressions early via structured tests

Agentic browsing

Give agents a usable data layer.

When autonomous agents browse the web, raw HTML is unusable. Provide clean Markdown for reading comprehension and structured JSON for data extraction. extract_rules pull exactly the fields the agent needs. Consistent output regardless of source layout - fits in context windows.

Business outcomes

Agents process web content ~10x faster than when given raw HTML
~75% reduction in agent token consumption
JS-heavy sites + anti-bot handled transparently
Consistent data structure regardless of source

Sources

Convert any web content into LLM-ready data.

Each source type maps to a recommended extraction strategy.

Documentation sites

API references, tutorials, guides. Structured Markdown preserving code blocks, parameter tables, heading hierarchy for RAG.

Knowledge bases / wikis

Enterprise wikis, Confluence, KMS. Articles with metadata, categories, cross-references for LLM training.

Academic papers

Abstracts, methodologies, findings, citations from research repositories and preprint servers. Domain-specific scientific KBs.

Product catalogs

Names, descriptions, specifications, pricing, reviews with validated JSON schemas. Product intelligence for e-commerce AI.

News archives

Articles with structured headlines, authors, dates, body text. Temporal knowledge bases for current events.

Government portals

Regulations, public records, policy documents. Complex legal + regulatory content for compliance and legal AI.

Forum discussions

Q&A from Stack Overflow, Reddit, community forums. Structured Q&A pairs for fine-tuning conversational + technical models.

Technical blogs

Articles, code tutorials, engineering posts with preserved formatting. Clean Markdown with intact code snippets.


How it works

Three steps from web page to LLM-ready data.

1

Define your fields

Decide the fields you need. For RAG: Markdown API with BM25 filtering. For structured JSON: Scrape API with extract_rules - CSS selectors mapped to field names, with nested objects and arrays. Validate the response against your own JSON Schema or Pydantic model.

2

Extract and validate

Submit URLs. JS rendering, anti-bot, rate limits handled automatically. Scrape API returns JSON matching your extract_rules. Markdown API strips boilerplate; BM25 filtering scores passages against your relevance queries.

3

Feed your pipeline

JSON or Markdown via API response or webhook. Push to vector DB for RAG, format as JSONL for fine-tuning, pipe Markdown into context windows. Token counts, timing, validation warnings in metadata.

Try it

Drop a URL in.

See schema-validated structured data come back - ready for vector store, fine-tuning, or context window.


curl 'https://api.ujeebu.com/scrape' \
  -H 'ApiKey: YOUR_API_KEY' \
  -G \
  --data-urlencode 'url=https://docs.python.org/3/tutorial/datastructures.html' \
  --data-urlencode 'extract_rules={"title":"h1","sections":"h2","code_examples":"pre"}'

No API key required for testing in the playground. Powered by /scrape
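For pipelines written in Python, the curl command above can be approximated with the standard library. This sketch only builds the request; the endpoint, the `ApiKey` header, and the `url` and `extract_rules` parameters come from the curl example, while the helper name `build_scrape_request` is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://api.ujeebu.com/scrape"


def build_scrape_request(api_key: str, page_url: str, extract_rules: dict) -> Request:
    """Build a GET request equivalent to the curl example (nothing is sent)."""
    query = urlencode({
        "url": page_url,
        # extract_rules travels as a URL-encoded JSON string, as in the curl call
        "extract_rules": json.dumps(extract_rules),
    })
    return Request(f"{API_URL}?{query}", headers={"ApiKey": api_key}, method="GET")
```

To send it, pass the result to `urllib.request.urlopen` and decode the body with `json.loads`. The exact response envelope is not described on this page, so inspect a real response before relying on specific keys.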

Features

Built for production LLM data pipelines.

Structured JSON output

extract_rules map CSS selectors to field names, returning predictable JSON with nested objects and arrays. Deterministic and repeatable - validate the response against your own JSON Schema or Pydantic model.

Token-optimised Markdown

Clean Markdown preserving semantic structure. Navs/ads/footers/boilerplate stripped. Heading hierarchy, code blocks, tables, lists preserved. ~70% token reduction vs raw HTML.

BM25 content filtering

Provide a query, receive only the passages most relevant. Dramatically reduces noise for RAG KBs. Pre-filter content before it hits your vector DB.

Batch processing

Thousands of URLs in parallel with consistent schema enforcement per page. Build large-scale datasets without infra complexity. Proxy rotation + CAPTCHA solving + JS rendering included.

Deterministic extraction

extract_rules return the same JSON every time, with no LLM variability. Define rules once, reuse across matching pages. Parse the response straight into your Pydantic models or dataclasses for local validation.

Multi-format output

Choose: schema-validated JSON for entity extraction, clean Markdown for text comprehension, Markdown with citations for research. Each format tuned for specific LLM use cases.

Powered by

Scrape API · Markdown API · Article Extractor

FAQ

Frequently asked.

What output formats are available?

Multiple formats for different LLM workflows. Scrape API with extract_rules returns structured JSON - ideal for fine-tuning datasets and entity extraction. Markdown API returns three flavours: raw (full page), fit (main content only), bm25 (passages ranked by relevance to your query). Both endpoints strip ads/navs/boilerplate. Markdown-with-citations format includes numbered references for source attribution in LLM responses.

How do I control the output structure?

With the Scrape API, extract_rules map CSS selectors to field names. The returned JSON mirrors that structure exactly - nested objects, arrays, and per-field output types (text, HTML, attributes). Because it is selector-based, results are deterministic and repeatable, so you can validate them against your own JSON Schema or Pydantic model on your side without any LLM variability.

Can I match the output to my Pydantic model?

Yes. extract_rules keys become JSON keys, so shape the rules to match your model fields. Parse the response straight into a Pydantic model or dataclass and validate locally - the same model you use elsewhere in your pipeline. Because extraction is deterministic, that validation stays stable across runs.
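A minimal sketch of that local validation step, using a standard-library dataclass so it runs without extra dependencies (a Pydantic model would follow the same shape). The field names mirror the extract_rules in the curl example above; `DocPage` and `parse_doc_page` are hypothetical names, and the code assumes a selector may yield either a single string or a list of strings, since the response shape is not spelled out here.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class DocPage:
    title: str
    sections: List[str]
    code_examples: List[str]


def _as_str_list(name: str, value: Any) -> List[str]:
    # Assumption: a selector may return one string or a list of strings.
    items = [value] if isinstance(value, str) else value
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError(f"{name} must be a string or a list of strings")
    return items


def parse_doc_page(extracted: Dict[str, Any]) -> DocPage:
    """Validate extracted fields and build a DocPage, raising on bad input."""
    missing = [f.name for f in fields(DocPage) if f.name not in extracted]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    title = extracted["title"]
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    return DocPage(
        title=title.strip(),
        sections=_as_str_list("sections", extracted["sections"]),
        code_examples=_as_str_list("code_examples", extracted["code_examples"]),
    )
```

Failing loudly on missing or mistyped fields is a deliberate choice: a page whose layout changed surfaces as a validation error instead of silently producing partial records.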

What is BM25 content filtering, and when should I use it?

BM25 scores text passages by query relevance. Markdown API with filter=bm25 + a query parameter extracts the full page content, then returns only the passages most relevant to your query. Particularly valuable for RAG pipelines that need specific info from long pages without ingesting irrelevant content. E.g. extract only installation instructions from lengthy docs, or pull pricing details from a marketing page. Reduces token consumption + improves retrieval precision.
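For intuition about what "scores text passages by query relevance" means, here is a compact sketch of the standard Okapi BM25 formula in plain Python. It illustrates the ranking idea only and is not the service's implementation; the tokenizer, the `k1` and `b` defaults, and the function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, passages: List[str],
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[float, str]]:
    """Return (score, passage) pairs sorted from most to least relevant."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Number of passages that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for passage, tokens in zip(passages, docs):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long passages need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Keeping only the top few passages from the ranked list is the filtering step: passages with no query terms score zero and drop out before they reach your vector DB.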

How many pages can I process for training datasets?

Built for high volume. Throughput reaches thousands of pages per hour depending on plan, and all plans support parallel requests. Each request handles JS rendering, proxy rotation, anti-bot, and CAPTCHA solving. For large-scale generation, feed a sitemap or SERP-discovered URL list into an async batch job and the Scrape API extracts each page in parallel. Scaling is horizontal; enterprise plans support million-page datasets with custom rate limits.
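On the client side, parallel requests can be sketched with a thread pool. This is a generic pattern rather than the batch API itself: `extract_batch` is a hypothetical helper, and `fetch` stands in for whatever function you write to call the Scrape API for one URL and return its parsed JSON.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def extract_batch(
    urls: Iterable[str],
    fetch: Callable[[str], dict],
    max_workers: int = 8,
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run fetch(url) in parallel; return (results by URL, errors by URL)."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; record the failure per URL
                errors[url] = str(exc)
    return results, errors
```

Collecting errors per URL, instead of aborting on the first failure, lets you retry just the failed pages. Set `max_workers` to stay within your plan's concurrency limit.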

How does this compare to traditional web scraping for LLMs?

The heavy lifting - JS rendering, proxy rotation, anti-bot, CAPTCHA solving - is handled for you, so hand-rolled scrapers do not break the moment a site adds protection. Define extract_rules once with CSS selectors and reuse them across similar pages. For LLM workflows specifically: the Markdown API gives token-efficient output preserving semantic structure, while the Scrape API gives structured JSON matching your target fields. Combined, you skip most of the preprocessing pipeline traditional scraping requires.



Start building LLM data pipelines today.

Join AI teams using structured web extraction for RAG, fine-tuning, and agentic workflows.
