Use case · Structured data for LLMs

Web pages, JSON-shaped. For training, RAG, and agents.

LLMs need clean, structured input. Raw HTML wastes tokens and pollutes embeddings. Convert any URL into schema-validated JSON or LLM-optimised Markdown — built for fine-tuning, RAG, evaluation, and agentic browsing.

5,000 free credits · no card · failed requests not billed
The challenge

The LLM data pipeline challenge.

You can’t feed raw web pages to a model and expect great answers. Wasted tokens and inconsistent schemas lead to weak RAG, expensive inference, and fragile fine-tunes.

Unstructured web data

Raw HTML does not fit LLM pipelines. It needs extensive preprocessing before any model can use it.

Token waste

Nav menus, ads, and boilerplate HTML burn context-window tokens while adding no informational value.

Schema inconsistency

Data structure varies widely across sources, so building reliable structured datasets takes constant reconciliation.

Scale challenges

Collecting millions of pages for training data requires robust extraction that handles anti-bot measures and rate limits without breaking.

Use cases

How AI teams use it.

RAG

Build knowledge bases that actually ground answers.

Extract clean, structured content from docs sites, knowledge bases, and other authoritative sources. Convert it to token-efficient Markdown that preserves heading hierarchy and semantic structure. BM25 filtering keeps only domain-relevant passages, which lowers storage cost and raises retrieval precision. Feed the result to Pinecone, Weaviate, or Chroma.

Business outcomes
  • ~60% hallucination reduction with grounded knowledge bases
  • ~70% token cost reduction vs raw HTML
  • 100,000+ pages indexed with consistent structure
  • Automated re-extraction keeps KBs current
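One way to use heading-preserving Markdown in a RAG pipeline is to split it into chunks that carry their heading trail before embedding. The following is a minimal sketch of that idea, not the product's own chunker; the function name `chunk_by_heading` and the `path`/`text` record shape are assumptions made for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section (hypothetical helper)."""
    chunks, path, buf = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # lines inside code fences are never headings
            buf.append(line)
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the section that belonged to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks


sample = "\n".join([
    "# Guide", "Intro text.",
    "## Install", FENCE, "# not a heading", FENCE,
    "## Usage", "Call the tool.",
])
chunks = chunk_by_heading(sample)
```

Each chunk keeps its heading path (for example `Guide > Install`), which can be stored as metadata next to the embedding in whichever vector store you use.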
Fine-tuning

Generate training datasets at scale.

Extract structured data using JSON schemas that match your training format. Define Pydantic-compatible schemas for Q&A pairs, instruction-response examples, or domain entities. Output is validated against the schema automatically. Combine AI Scraper and the Markdown API to build diverse, well-formatted corpora.

Business outcomes
  • 50,000+ validated training examples from web sources
  • Schema consistency via automatic JSON validation
  • Dataset prep cut from weeks to hours
  • Higher fine-tuned model accuracy from clean data
Evaluation

Build evaluation benchmarks from real sources.

Extract ground-truth data from authoritative web sources: government databases, academic repos, and reference sites, using structured schemas that capture entities, relationships, and attributes. Schema validation ensures benchmark quality. Auto-refresh keeps evaluations aligned with current information.

Business outcomes
  • Domain-specific evaluation suites with verified ground truth
  • Automated benchmark refresh for continuous evaluation
  • Compare model performance across extraction complexity
  • Detect model regressions early via structured tests
Agentic browsing

Give agents a usable data layer.

When autonomous agents browse the web, raw HTML is unusable. Provide clean Markdown for reading comprehension and structured JSON for data extraction. Natural-language prompts extract exactly what the agent needs. Consistent output regardless of source layout — fits in context windows.

Business outcomes
  • Agents process web content ~10x faster than raw HTML
  • ~75% reduction in agent token consumption
  • JS-heavy sites + anti-bot handled transparently
  • Consistent data structure regardless of source
Sources

Convert any web content into LLM-ready data.

Each source type maps to a recommended extraction strategy.

Documentation sites

API references, tutorials, guides. Structured Markdown preserving code blocks, parameter tables, heading hierarchy for RAG.

Knowledge bases / wikis

Enterprise wikis, Confluence, KMS. Articles with metadata, categories, cross-references for LLM training.

Academic papers

Abstracts, methodologies, findings, citations from research repositories and preprint servers. Domain-specific scientific KBs.

Product catalogs

Names, descriptions, specifications, pricing, reviews with validated JSON schemas. Product intelligence for e-commerce AI.

News archives

Articles with structured headlines, authors, dates, body text. Temporal knowledge bases for current events.

Government portals

Regulations, public records, policy documents. Complex legal + regulatory content for compliance and legal AI.

Forum discussions

Q&A from Stack Overflow, Reddit, community forums. Structured Q&A pairs for fine-tuning conversational + technical models.

Technical blogs

Articles, code tutorials, engineering posts with preserved formatting. Clean Markdown with intact code snippets.

Start building pipelines. No credit card required.
How it works

Three steps from web page to LLM-ready data.

1

Define your schema

Use JSON Schema or Pydantic-compatible definitions to specify fields, types, and validation rules. For RAG: Markdown API with BM25 filtering. For structured data: AI Scraper with natural-language prompts. Nested objects, arrays, enums, and required fields are supported.

2

Extract and validate

Submit URLs. JS rendering, anti-bot, rate limits handled automatically. AI Scraper extracts data matching your schema with automatic validation and retry. Markdown API strips boilerplate; BM25 filtering scores passages against your relevance queries.

3

Feed your pipeline

JSON or Markdown via API response or webhook. Push to vector DB for RAG, format as JSONL for fine-tuning, pipe Markdown into context windows. Token counts, timing, validation warnings in metadata.
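As an illustration of the "format as JSONL for fine-tuning" step, here is a small sketch that turns extracted records into JSONL. The field names `instruction` and `response` and the helper `to_jsonl` are hypothetical; the actual shape depends on the schema you define and the format your training framework expects.

```python
import json


def to_jsonl(records, fields=("instruction", "response")) -> str:
    """Serialise records to JSONL, skipping any record with a missing or empty field."""
    lines = []
    for record in records:
        if all(record.get(name) for name in fields):
            # Keep only the training fields; extra metadata is dropped here.
            row = {name: record[name] for name in fields}
            lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


# Example records, shaped as an extraction schema might return them (made up for this sketch).
records = [
    {"instruction": "What is a list?", "response": "A mutable sequence.",
     "source_url": "https://example.com/a"},
    {"instruction": "What is a tuple?", "response": ""},  # incomplete, so skipped
]
jsonl = to_jsonl(records)
```

Filtering incomplete records at this stage is a design choice for the sketch; you may prefer to log them instead of dropping them silently.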

Try it

Drop a URL in.

See schema-validated structured data come back — ready for vector store, fine-tuning, or context window.

curl 'https://api.ujeebu.com/ai-scraper' \
  -H 'ApiKey: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.python.org/3/tutorial/datastructures.html",
    "prompt": "extract page title, main topic, all code examples with their explanations, all the key concepts mentioned",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "concepts": { "type": "array", "items": { "type": "string" } },
        "code_examples": { "type": "array" }
      },
      "required": ["title", "concepts"]
    }
  }'
No API key required for testing in the playground. Powered by /ai-scraper
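The same call can be prepared from Python. The sketch below only builds the request object and does not send it, so it runs offline. The endpoint, `ApiKey` header, and body fields come from the curl example above; the `Content-Type` header is an assumption (a JSON body normally needs it), and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://api.ujeebu.com/ai-scraper"  # endpoint from the curl example


def build_request(api_key: str, url: str, prompt: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    body = json.dumps({"url": url, "prompt": prompt, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "ApiKey": api_key,
            "Content-Type": "application/json",  # assumption: the endpoint expects JSON
        },
    )


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "concepts": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "concepts"],
}
req = build_request(
    "YOUR_API_KEY",
    "https://docs.python.org/3/tutorial/datastructures.html",
    "extract page title and all the key concepts mentioned",
    schema,
)
# Sending would be urllib.request.urlopen(req); the response shape is not
# documented on this page, so parsing it is left out of the sketch.
```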
Features

Built for production LLM data pipelines.

JSON Schema validated

Standard JSON Schema. Extracted data is validated automatically and retried once on validation failure. Type-safe output with required-field enforcement, enum validation, and nested objects.

Token-optimised Markdown

Clean Markdown preserving semantic structure. Navs/ads/footers/boilerplate stripped. Heading hierarchy, code blocks, tables, lists preserved. ~70% token reduction vs raw HTML.

BM25 content filtering

Provide a query, receive only the passages most relevant. Dramatically reduces noise for RAG KBs. Pre-filter content before it hits your vector DB.

Batch processing

Thousands of URLs in parallel with consistent schema enforcement per page. Build large-scale datasets without infra complexity. Proxy rotation + CAPTCHA solving + JS rendering included.

Pydantic-compatible

Schemas from Python Pydantic models with full $ref, $defs, anyOf, oneOf, allOf support. Define data models in Python, export schema, pass to API. References resolved automatically.

Multi-format output

Choose: schema-validated JSON for entity extraction, clean Markdown for text comprehension, Markdown with citations for research. Each format tuned for specific LLM use cases.

FAQ

Frequently asked.

What output formats are available?
Multiple formats for different LLM workflows. AI Scraper returns schema-validated JSON — ideal for fine-tuning datasets and entity extraction. Markdown API returns three flavours: raw (full page), fit (main content only), bm25 (passages ranked by relevance to your query). Both endpoints strip ads/navs/boilerplate. Markdown-with-citations format includes numbered references for source attribution in LLM responses.
How does JSON Schema validation work?
Define the expected output structure in standard JSON Schema: field names, types, required fields, nested objects, arrays, enums. AI Scraper passes the schema to the LLM along with the page content and prompt. After extraction, the response is validated against the schema. If validation fails because of missing fields or type mismatches, the system retries automatically with corrective instructions. If the retry also fails, you get the best-effort response plus detailed validation warnings showing which fields failed and why.
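The validate-then-retry flow described here can be sketched in a few lines. This is an illustration of the pattern only, not the service's implementation: the validator covers just required fields and top-level types, and `validate`, `extract_with_retry`, and `fake_extract` are hypothetical names standing in for the real extraction step.

```python
TYPE_MAP = {"string": str, "array": list, "object": dict, "boolean": bool}


def validate(data: dict, schema: dict) -> list[str]:
    """Return validation warnings; an empty list means the data passed."""
    warnings = []
    for field in schema.get("required", []):
        if field not in data:
            warnings.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if field in data and expected and not isinstance(data[field], expected):
            warnings.append(f"wrong type for {field}: expected {spec['type']}")
    return warnings


def extract_with_retry(extract, schema: dict, max_retries: int = 1):
    """Call extract(hints); on failure, retry with the warnings as corrective hints."""
    hints = []
    for _ in range(max_retries + 1):
        data = extract(hints)
        warnings = validate(data, schema)
        if not warnings:
            return data, []
        hints = warnings
    return data, warnings  # best-effort result plus the remaining warnings


def fake_extract(hints):
    """Stand-in for the LLM call: wrong on the first try, correct once given hints."""
    if not hints:
        return {"concepts": "lists"}
    return {"title": "Data Structures", "concepts": ["lists", "tuples"]}


schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "concepts": {"type": "array"}},
    "required": ["title", "concepts"],
}
data, warnings = extract_with_retry(fake_extract, schema)
stuck, stuck_warnings = extract_with_retry(lambda hints: {"concepts": "lists"}, schema)
```

The second call shows the fallback path: an extractor that never improves still returns its last output, along with warnings naming the fields that failed.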
Can I use Pydantic models?
Yes, fully supported. Pydantic v1 (definitions format) and Pydantic v2 ($defs format) both work. The API auto-resolves references, handles anyOf/oneOf for optional fields, and merges allOf entries. Export your model schema via model_json_schema() in Pydantic v2 (or schema() in v1) and pass it as the schema parameter. Define models in Python, validate locally, and use the same schema for web extraction.
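To show what resolving references means in practice, the sketch below inlines local `$ref` pointers in a schema dict of the kind Pydantic v2 exports. It is a simplified illustration, not the API's resolver: it handles only `#/$defs/...` and `#/definitions/...` pointers and would recurse forever on self-referencing models. The sample schema is hand-written, and `inline_schema` and `resolve_refs` are hypothetical names.

```python
def resolve_refs(node, defs: dict):
    """Recursively replace local $ref pointers with the definitions they name."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(("#/$defs/", "#/definitions/")):
            name = ref.rsplit("/", 1)[-1]
            return resolve_refs(defs[name], defs)
        return {key: resolve_refs(value, defs) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, defs) for item in node]
    return node


def inline_schema(schema: dict) -> dict:
    """Return a copy of the schema with definitions inlined and the defs block removed."""
    defs = {**schema.get("definitions", {}), **schema.get("$defs", {})}
    body = {k: v for k, v in schema.items() if k not in ("$defs", "definitions")}
    return resolve_refs(body, defs)


# Hand-written schema in the shape Pydantic v2 produces for a nested model.
schema = {
    "$defs": {
        "Author": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        }
    },
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"$ref": "#/$defs/Author"},
    },
    "required": ["title", "author"],
}
flat = inline_schema(schema)
```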
What is BM25 content filtering, and when should I use it?
BM25 scores text passages by their relevance to a query. Calling the Markdown API with filter=bm25 and a query parameter extracts the full page content, then returns only the passages most relevant to your query. This is particularly valuable for RAG pipelines that need specific information from long pages without ingesting irrelevant content. For example, extract only the installation instructions from lengthy docs, or pull pricing details from a marketing page. The result is lower token consumption and higher retrieval precision.
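For readers unfamiliar with BM25, here is a compact sketch of the standard Okapi BM25 formula applied to a handful of passages. It shows the general technique only; the tokenizer, the `k1` and `b` defaults, and the function names are assumptions, and the service's own passage splitting and scoring may differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, passages: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each passage against the query with Okapi BM25."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter()  # number of passages containing each term
    for d in docs:
        df.update(set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_passages(query: str, passages: list[str], limit: int = 1) -> list[str]:
    """Return up to `limit` passages with a positive score, best first."""
    scores = bm25_scores(query, passages)
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in ranked[:limit] if score > 0]


passages = [
    "Install the package with pip, then verify the install by checking the version.",
    "Our pricing starts with a free tier and scales by usage.",
    "The company was founded in 2015 and has offices worldwide.",
]
best = top_passages("how to install", passages)
```

Here only the first passage contains a query term, so it is the only one returned; passages that score zero are dropped rather than padded into the result.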
How many pages can I process for training datasets?
Built for high volume. Throughput is thousands of pages per hour depending on plan, and all plans support parallel requests. Each request handles JS rendering, proxy rotation, anti-bot measures, and CAPTCHA solving. For large-scale generation, feed a sitemap or a SERP-discovered URL list into an async batch job and AI Scraper extracts each page in parallel. Scaling is horizontal; enterprise plans support million-page datasets with custom rate limits.
How does this compare to traditional web scraping for LLMs?
Traditional scraping needs hand-written CSS/XPath selectors that break on every redesign. AI Scraper uses LLMs to understand page content semantically, so it extracts based on prompts and JSON schemas rather than selectors. One config works across similar content types on different sites. For LLM workloads specifically: the Markdown API gives token-efficient output that preserves semantic structure, and AI Scraper gives schema-validated JSON matching your training data format. Combined, they let you skip the preprocessing pipeline that traditional scraping requires.

Start building LLM data pipelines today.

Join AI teams using structured web extraction for RAG, fine-tuning, and agentic workflows.