
AI Scraper

Extract structured data from websites using natural language prompts powered by Large Language Models

Overview

The AI Scraper API uses Large Language Models to extract structured data from websites based on natural language instructions. Instead of writing complex CSS selectors or parsing HTML, you describe the data you want in plain English, and the model returns it as structured output.

Perfect for:

  • Extracting data from complex or dynamically structured pages
  • Websites where CSS selectors change frequently
  • Unstructured content like articles, reviews, or social media
  • Quick prototyping without reverse-engineering page structure
  • Projects that need flexibility without selector maintenance

Quick Start

Basic Extraction

The simplest way to use AI Scraper is with just a URL and a prompt:

cURL:

curl -X POST https://api.ujeebu.com/ai-scraper \
  -H "Content-Type: application/json" \
  -H "ApiKey: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/product",
    "prompt": "Extract the product name, price, and rating"
  }'

Node.js:

import { UjeebuClient } from '@ujeebu-org/ujeebu-sdk';

const client = new UjeebuClient('YOUR_API_KEY');

const response = await client.aiScrape(
  'https://example.com/product',
  'Extract the product name, price, and rating'
);

console.log(response.data.data);

Python:

from ujeebu_python import UjeebuClient

client = UjeebuClient('YOUR_API_KEY')

response = client.ai_scrape(
    'https://example.com/product',
    'Extract the product name, price, and rating'
)

print(response.json()['data'])

Go:

package main

import (
    "fmt"
    "log"
    ujeebu "github.com/ujeebu/ujeebu-go"
)

func main() {
    client, err := ujeebu.NewClient("YOUR_API_KEY")
    if err != nil {
        log.Fatal(err)
    }

    result, _, err := client.AIScrape(ujeebu.AIScrapeParams{
        URL:    "https://example.com/product",
        Prompt: "Extract the product name, price, and rating",
    })

    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(result.Data)
}

Response

{
  "success": true,
  "data": {
    "product_name": "Wireless Headphones",
    "price": "$79.99",
    "rating": 4.5
  },
  "metadata": {
    "html_length": 15420,
    "chunks_processed": 2,
    "extraction_time_ms": 2300,
    "input_tokens": 3200,
    "output_tokens": 150
  }
}

Request Parameters

Required Parameters

Parameter | Type | Required | Default | Description
url | string | Yes | - | URL of the webpage to scrape. Supports all standard scraping features (JavaScript rendering, proxies, etc.).
prompt | string | Conditional | - | Natural language instruction describing what data to extract. Be specific about field names and data types. Required unless a schema is provided; with a schema, the prompt is optional.

AI Configuration

Parameter | Type | Required | Default | Description
temperature | number | No | 0.0 | LLM temperature (0.0–1.0). Lower values are more deterministic; higher values are more creative.
schema | object | No | - | JSON schema defining the expected structure of extracted data. Ensures consistent, type-safe output.
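
The rules in the two tables above (a URL is always needed, the prompt may be dropped only when a schema is given, temperature stays within 0.0–1.0) can be checked before a request is sent. The following is a minimal client-side sketch, not part of any SDK; the helper name build_payload is hypothetical.

```python
from typing import Any, Optional


def build_payload(
    url: str,
    prompt: Optional[str] = None,
    schema: Optional[dict[str, Any]] = None,
    temperature: float = 0.0,
) -> dict[str, Any]:
    """Assemble an AI Scraper request body, enforcing the documented rules."""
    if not url:
        raise ValueError("url is required")
    # The prompt may be omitted only when a schema is supplied.
    if prompt is None and schema is None:
        raise ValueError("provide a prompt, a schema, or both")
    # The documented temperature range is 0.0 to 1.0.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")

    payload: dict[str, Any] = {"url": url, "temperature": temperature}
    if prompt is not None:
        payload["prompt"] = prompt
    if schema is not None:
        payload["schema"] = schema
    return payload


print(build_payload("https://example.com/product", prompt="Extract the product name"))
```

Failing early on the client avoids spending a round trip on a request the API would presumably reject.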

Standard Scraping Parameters

All standard scraping parameters are supported:

Parameter | Type | Required | Default | Description
js | boolean | No | false | Enable JavaScript rendering in the browser.
proxy_type | string | No | - | Proxy type: 'rotating', 'advanced', 'premium', 'residential', 'residential_us', 'residential_geo'. Auto proxy is enabled by default if not specified.
proxy_country | string | No | - | ISO country code for proxy location (e.g., 'US', 'GB', 'FR').
timeout | number | No | 120 | Request timeout in seconds.
wait_for | string or number | No | - | -
auto_captcha_solve | boolean | No | true | Enable automatic CAPTCHA solving. Note: unlike other endpoints, this defaults to true here.
auto_captcha_solve_timeout | number | No | - | Timeout in milliseconds for CAPTCHA solving.
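
For environments without the SDK, the same request can be assembled with Python's standard library. This is a sketch under the assumption that the scraping parameters sit at the top level of the JSON body next to url and prompt, as in the request examples on this page; the helper name build_request is hypothetical, and the network call is left commented out.

```python
import json
import urllib.request

API_URL = "https://api.ujeebu.com/ai-scraper"  # endpoint from the curl example


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Create a POST request carrying the JSON body and the ApiKey header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "ApiKey": api_key},
        method="POST",
    )


payload = {
    "url": "https://example.com/product",
    "prompt": "Extract the product name, price, and rating",
    "js": True,               # render JavaScript before extraction
    "proxy_type": "premium",  # one of the documented proxy types
    "timeout": 120,           # seconds
}
request = build_request("YOUR_API_KEY", payload)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request, timeout=130) as response:
#     body = json.loads(response.read())
#     print(body["data"], response.headers.get("Ujb-credits"))
```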

Structured Extraction with Schemas

Schemas ensure consistent, type-safe output by defining the exact structure you expect.

Basic Schema Example

{
  "url": "https://example.com/product",
  "prompt": "Extract product with variants and reviews",
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "variants": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "color": { "type": "string" },
            "size": { "type": "string" },
            "price": { "type": "number" },
            "available": { "type": "boolean" }
          }
        }
      },
      "reviews": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "author": { "type": "string" },
            "rating": { "type": "integer" },
            "comment": { "type": "string" },
            "date": { "type": "string" }
          }
        }
      }
    }
  }
}

Schema Response

{
  "success": true,
  "data": {
    "name": "Premium T-Shirt",
    "variants": [
      {
        "color": "Blue",
        "size": "M",
        "price": 29.99,
        "available": true
      },
      {
        "color": "Red",
        "size": "L",
        "price": 29.99,
        "available": false
      }
    ],
    "reviews": [
      {
        "author": "John D.",
        "rating": 5,
        "comment": "Great quality!",
        "date": "2025-01-05"
      }
    ]
  },
  "metadata": {
    "html_length": 24530,
    "chunks_processed": 3,
    "extraction_time_ms": 3100,
    "input_tokens": 5400,
    "output_tokens": 280
  }
}
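
Even with a schema, it can be worth re-checking the returned data on the client before it enters a pipeline. The sketch below covers only a small subset of JSON Schema (type, properties, items, required, enum) and is not a replacement for a full validator; the function name check is hypothetical.

```python
from typing import Any

# JSON Schema type names mapped to Python types.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def check(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of mismatches between value and a small JSON Schema subset."""
    problems: list[str] = []
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        # In Python, bool is a subclass of int, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"{path}: {value!r} not in {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required field")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                problems.extend(check(value[key], sub, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            problems.extend(check(item, schema["items"], f"{path}[{i}]"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "variants": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "price": {"type": "number"},
                    "available": {"type": "boolean"},
                },
            },
        },
    },
    "required": ["name"],
}
good = {"name": "Premium T-Shirt", "variants": [{"price": 29.99, "available": True}]}
bad = {"variants": [{"price": "29.99", "available": True}]}
print(check(good, schema))  # []
print(check(bad, schema))
```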

Common Examples

Example 1: E-commerce Product

Extract product details with variants and pricing:

cURL:

curl -X POST https://api.ujeebu.com/ai-scraper \
  -H "Content-Type: application/json" \
  -H "ApiKey: YOUR_API_KEY" \
  -d '{
    "url": "https://shop.example.com/product",
    "prompt": "Extract product name, price, currency, rating, and available colors",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "rating": {"type": "number"},
        "colors": {
          "type": "array",
          "items": {"type": "string"}
        }
      },
      "required": ["name", "price", "currency"]
    }
  }'

Node.js:

import { UjeebuClient } from '@ujeebu-org/ujeebu-sdk';

const client = new UjeebuClient('YOUR_API_KEY');

const response = await client.aiScrape(
  'https://shop.example.com/product',
  'Extract product name, price, currency, rating, and available colors',
  {
    schema: {
      type: 'object',
      properties: {
        name: { type: 'string' },
        price: { type: 'number' },
        currency: { type: 'string' },
        rating: { type: 'number' },
        colors: {
          type: 'array',
          items: { type: 'string' }
        }
      },
      required: ['name', 'price', 'currency']
    }
  }
);

console.log(response.data.data);

Python:

from ujeebu_python import UjeebuClient

client = UjeebuClient('YOUR_API_KEY')

response = client.ai_scrape(
    'https://shop.example.com/product',
    'Extract product name, price, currency, rating, and available colors',
    params={
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'currency': {'type': 'string'},
                'rating': {'type': 'number'},
                'colors': {
                    'type': 'array',
                    'items': {'type': 'string'}
                }
            },
            'required': ['name', 'price', 'currency']
        }
    }
)

data = response.json()
print(data['data'])

Go:

package main

import (
    "fmt"
    "log"
    ujeebu "github.com/ujeebu/ujeebu-go"
)

func main() {
    client, err := ujeebu.NewClient("YOUR_API_KEY")
    if err != nil {
        log.Fatal(err)
    }

    result, _, err := client.AIScrape(ujeebu.AIScrapeParams{
        URL:    "https://shop.example.com/product",
        Prompt: "Extract product name, price, currency, rating, and available colors",
        Schema: map[string]any{
            "type": "object",
            "properties": map[string]any{
                "name":     map[string]any{"type": "string"},
                "price":    map[string]any{"type": "number"},
                "currency": map[string]any{"type": "string"},
                "rating":   map[string]any{"type": "number"},
                "colors": map[string]any{
                    "type":  "array",
                    "items": map[string]any{"type": "string"},
                },
            },
            "required": []string{"name", "price", "currency"},
        },
    })

    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(result.Data)
}

Example 2: News Article Extraction

Extract article content with metadata:

{
  "url": "https://news.example.com/article",
  "prompt": "Extract the article headline, author, publish date, full content, and tags",
  "schema": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "subheadline": { "type": "string" },
      "author": { "type": "string" },
      "publish_date": { "type": "string" },
      "content": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" }
      },
      "category": { "type": "string" }
    }
  }
}

Example 3: Restaurant Reviews

Extract multiple reviews from a restaurant page:

{
  "url": "https://restaurant-reviews.example.com/place/123",
  "prompt": "Extract all customer reviews including ratings and dates",
  "schema": {
    "type": "object",
    "properties": {
      "restaurant_name": { "type": "string" },
      "overall_rating": { "type": "number" },
      "reviews": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "author": { "type": "string" },
            "rating": { "type": "integer" },
            "title": { "type": "string" },
            "comment": { "type": "string" },
            "date": { "type": "string" },
            "helpful_count": { "type": "integer" }
          }
        }
      }
    }
  }
}

Example 4: Contact Information

Extract contact details from a business website:

{
  "url": "https://business.example.com/contact",
  "prompt": "Extract all contact information including phone, email, address, and social media",
  "schema": {
    "type": "object",
    "properties": {
      "phone": { "type": "string" },
      "email": { "type": "string" },
      "address": {
        "type": "object",
        "properties": {
          "street": { "type": "string" },
          "city": { "type": "string" },
          "state": { "type": "string" },
          "zip": { "type": "string" },
          "country": { "type": "string" }
        }
      },
      "social_media": {
        "type": "object",
        "properties": {
          "facebook": { "type": "string" },
          "twitter": { "type": "string" },
          "instagram": { "type": "string" },
          "linkedin": { "type": "string" }
        }
      }
    }
  }
}

Best Practices

Writing Effective Prompts

Be Specific:

  • ❌ "Get product info"
  • ✅ "Extract the product name, price in USD, availability status, and customer rating"

Mention Field Names:

  • ❌ "Extract price and stock"
  • ✅ "Extract 'price' as a number and 'in_stock' as a boolean"

Specify Data Types:

  • ❌ "Extract the rating"
  • ✅ "Extract the rating as a decimal number between 0 and 5"

Handle Missing Data:

  • ✅ "If rating is not available, return null"
  • ✅ "If price includes currency symbol, remove it and return only the number"
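
Prompt instructions like these are requests to the model, so a client-side safety net can still be useful. The first response on this page, for instance, returns the price as the string "$79.99". Below is a minimal normalizer sketch; the name parse_price is hypothetical, and it assumes a dot as the decimal separator and commas as thousands separators.

```python
import re
from typing import Optional


def parse_price(raw: object) -> Optional[float]:
    """Turn values like "$79.99" or "1,299.00 USD" into a float; None if absent."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        return float(raw)
    # Drop thousands separators, then take the first decimal number in the text.
    match = re.search(r"\d+(?:\.\d+)?", str(raw).replace(",", ""))
    return float(match.group()) if match else None


print(parse_price("$79.99"))        # 79.99
print(parse_price("1,299.00 USD"))  # 1299.0
print(parse_price("N/A"))           # None
```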

Schema Design Tips

Use Required Fields: Mark essential fields as required to ensure they're always present:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "number" }
  },
  "required": ["name", "price"]
}

Define Defaults: Provide default values for optional fields:

{
  "properties": {
    "rating": { "type": "number", "default": 0 },
    "in_stock": { "type": "boolean", "default": false }
  }
}
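
In JSON Schema itself, default is an annotation rather than a rule that validators enforce, and this page does not state whether the service fills defaults into the output. A client-side sketch that applies them after the fact, assuming that absent and null fields should both receive the default (apply_defaults is a hypothetical helper):

```python
import copy
from typing import Any


def apply_defaults(data: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of data with schema defaults filled in for absent or null fields."""
    result = copy.deepcopy(data)
    for key, sub in schema.get("properties", {}).items():
        if result.get(key) is None and "default" in sub:
            result[key] = copy.deepcopy(sub["default"])
        elif isinstance(result.get(key), dict):
            # Recurse into nested objects so their defaults are applied too.
            result[key] = apply_defaults(result[key], sub)
    return result


schema = {
    "properties": {
        "rating": {"type": "number", "default": 0},
        "in_stock": {"type": "boolean", "default": False},
    }
}
print(apply_defaults({"rating": None}, schema))  # {'rating': 0, 'in_stock': False}
```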

Use Enums for Fixed Values: Constrain values to specific options:

{
  "properties": {
    "size": {
      "type": "string",
      "enum": ["S", "M", "L", "XL"]
    }
  }
}

Performance Optimization

Enable JavaScript Only When Needed: static pages extract faster with js left at false:

{
  "url": "https://static-site.example.com",
  "prompt": "Extract content",
  "js": false
}

Chunk Large Pages: The AI automatically chunks content, but for very large pages, consider extracting specific sections:

{
  "prompt": "Extract only the main article content, ignore navigation and footer"
}

Error Handling

Common Errors

Invalid Schema:

{
  "success": false,
  "error": "Invalid schema: property 'price' must be of type number"
}

LLM Timeout:

{
  "success": false,
  "error": "LLM request timeout after 60s. Try reducing content size or increasing timeout."
}

Extraction Failed:

{
  "success": true,
  "data": null,
  "error": "Could not extract requested data from page content"
}
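
Note that the "Extraction Failed" case reports success as true with data set to null, so checking success alone is not enough. A small sketch of a helper that treats both situations as failures, based only on the three response bodies shown above (unwrap and ExtractionError are hypothetical names):

```python
from typing import Any


class ExtractionError(Exception):
    """Raised when the response body does not contain usable data."""


def unwrap(body: dict[str, Any]) -> Any:
    """Return the extracted data or raise with the error message from the body."""
    message = body.get("error", "unknown error")
    if not body.get("success", False):
        raise ExtractionError(f"request failed: {message}")
    # success can be true while data is null (see "Extraction Failed").
    if body.get("data") is None:
        raise ExtractionError(f"no data extracted: {message}")
    return body["data"]


print(unwrap({"success": True, "data": {"name": "Wireless Headphones"}}))
```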

Error Recovery

Fallback to Extract Rules: For structured pages, consider using traditional extract rules as a fallback.

Response Format

Success Response

{
  "success": true,
  "data": {
    // Your extracted data based on prompt/schema
  },
  "metadata": {
    "html_length": 15420,
    "chunks_processed": 2,
    "extraction_time_ms": 2300,
    "input_tokens": 3200,
    "output_tokens": 150
  }
}

INFO — Credits Header

Credits consumed are returned in the Ujb-credits response header, not in the response body.

Field Descriptions

Field | Type | Description
success | boolean | Whether extraction was successful
data | object | Extracted structured data matching your prompt/schema
metadata.html_length | number | Size of HTML content fetched in bytes
metadata.chunks_processed | number | Number of content chunks processed (for large pages)
metadata.extraction_time_ms | number | AI extraction time in milliseconds
metadata.input_tokens | number | Number of input tokens sent to the LLM
metadata.output_tokens | number | Number of output tokens received from the LLM
metadata.validation_warnings | array | Warnings if extracted data doesn't fully match the schema (only present when there are warnings)
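
For logging or monitoring, the metadata fields and the Ujb-credits header can be condensed into one record. The sketch below assumes the headers arrive as a plain dict keyed exactly as "Ujb-credits" (real HTTP clients usually offer case-insensitive lookup) and keeps the header value as the raw string, since its exact format is not described here; summarize is a hypothetical helper.

```python
from typing import Any


def summarize(body: dict[str, Any], headers: dict[str, str]) -> dict[str, Any]:
    """Condense response metadata and the credits header into one record."""
    meta = body.get("metadata", {})
    return {
        "total_tokens": meta.get("input_tokens", 0) + meta.get("output_tokens", 0),
        "seconds": meta.get("extraction_time_ms", 0) / 1000,
        "chunks": meta.get("chunks_processed", 0),
        # validation_warnings is only present when there are warnings.
        "warnings": meta.get("validation_warnings", []),
        "credits": headers.get("Ujb-credits"),
    }


body = {
    "success": True,
    "data": {},
    "metadata": {
        "html_length": 15420,
        "chunks_processed": 2,
        "extraction_time_ms": 2300,
        "input_tokens": 3200,
        "output_tokens": 150,
    },
}
print(summarize(body, {"Ujb-credits": "15"}))
```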

Comparison: AI Scraper vs Extract Rules

Choose the right tool for your use case:

Feature | AI Scraper | Extract Rules
Setup Time | Instant (just write a prompt) | Requires CSS selector analysis
Maintenance | Low (AI adapts to changes) | Medium (update selectors when page changes)
Cost | 15-40+ credits per page | 1-2 credits per page
Speed | 2-10 seconds | 1-3 seconds
Accuracy | High for unstructured content | Very high for structured content
Best For | Dynamic layouts, unstructured data | Static selectors, high volume
Flexibility | Very high | Medium

When to Use AI Scraper

  • ✅ Content structure varies between pages
  • ✅ Need to extract meaning, not just text
  • ✅ Prototyping and quick extraction
  • ✅ Low-volume, high-value data
  • ✅ Unstructured content (articles, reviews)

When to Use Extract Rules

  • ✅ High-volume scraping (cost-effective)
  • ✅ Consistent page structure
  • ✅ Speed is critical
  • ✅ Need precise control over extraction
  • ✅ Static websites

Hybrid Approach

Combine both for optimal results:

// Try the AI scraper first for flexibility
const result = await client.aiScrape(url, 'Extract product details');

// Fall back to extract rules if the AI returned no data
// (result.data is the response body; its data field holds the extraction)
let fallback = null;
if (!result.data.data) {
  fallback = await client.scrape({
    url,
    extract_rules: {
      name: { selector: '.product-title', type: 'text' },
      price: { selector: '.price', type: 'text' }
    }
  });
}

Credits & Billing

Credit Cost by Proxy Type

AI Scraper always uses browser rendering. Pricing reflects the proxy cost plus an AI processing premium.

Proxy Type | Credits per Request
rotating (default) | 15
advanced | 20
premium | 25
residential | 40
residential_us | 40
residential_geo | 25 + 10/MB over 1MB

INFO — Residential Proxy Pricing

US residential proxies (residential or residential_us) cost a flat 40 credits. Non-US residential proxies (residential_geo) have a base cost of 25 credits, plus 10 credits per MB of document size over 1MB.

INFO — Auto Proxy

If no proxy_type is specified, auto proxy is enabled by default. The system automatically tries different proxies until one succeeds. Credits are charged based on the proxy that was actually used for the successful request.

Cost Example

// Extract 50 product pages, assuming each request is served by the rotating proxy
for (const url of productUrls) {
  await client.aiScrape(url, 'Extract product details');
}
// Total: 50 × 15 = 750 credits

// Extract 50 pages with residential proxy (US)
for (const url of productUrls) {
  await client.aiScrape(url, 'Extract product details', {
    proxy_type: 'residential'
  });
}
// Total: 50 × 40 = 2000 credits

Billing Notes

  • Credits are charged only on successful extraction
  • Failed requests (4xx, 5xx errors) are not charged
  • With auto proxy, you are charged for the proxy that succeeded (not failed attempts)
  • When a CAPTCHA is detected and solved, an additional +5 credits surcharge is applied on top of the base request cost (auto_captcha_solve is enabled by default)
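
The pricing rules above can be folded into a rough estimator for budgeting. This is a sketch, not an official calculator: the name estimate_credits is hypothetical, and it assumes that each started MB beyond the first is billed as a whole MB for residential_geo, which this page does not specify.

```python
import math

# Base credits per request, taken from the proxy table above.
BASE_CREDITS = {
    "rotating": 15,
    "advanced": 20,
    "premium": 25,
    "residential": 40,
    "residential_us": 40,
    "residential_geo": 25,
}
CAPTCHA_SURCHARGE = 5


def estimate_credits(proxy_type: str, size_mb: float = 0.0, captcha_solved: bool = False) -> int:
    """Estimate the credits charged for one successful request."""
    credits = BASE_CREDITS[proxy_type]
    if proxy_type == "residential_geo" and size_mb > 1:
        # Assumption: each started MB beyond the first is billed as a whole MB.
        credits += 10 * math.ceil(size_mb - 1)
    if captcha_solved:
        credits += CAPTCHA_SURCHARGE
    return credits


print(50 * estimate_credits("rotating"))              # 750
print(estimate_credits("residential_geo", 3.0))       # 45
print(estimate_credits("premium", captcha_solved=True))  # 30
```

With auto proxy, the proxy that finally succeeds is not known in advance, so an estimate like this is best treated as a lower bound based on the cheapest tier.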

Next Steps

  • Learn about Extract Rules for structured extraction
  • Check out Templates for pre-built configurations
  • Read the Node.js SDK documentation