AI Scraper
Extract structured data from websites using natural language prompts powered by Large Language Models
Overview
The AI Scraper API uses Large Language Models to extract structured data from websites using simple natural language instructions. Instead of writing complex CSS selectors or parsing HTML, you describe what you want to extract in plain English, and the AI does the heavy lifting.
Perfect for:
- Extracting data from complex or dynamically structured pages
- Websites where CSS selectors change frequently
- Unstructured content like articles, reviews, or social media
- Quick prototyping without reverse-engineering page structure
- Projects that need flexibility without ongoing selector maintenance
Quick Start
Basic Extraction
The simplest way to use AI Scraper is with just a URL and a prompt:
curl -X POST https://api.ujeebu.com/ai-scraper \
-H "Content-Type: application/json" \
-H "ApiKey: YOUR_API_KEY" \
-d '{
"url": "https://example.com/product",
"prompt": "Extract the product name, price, and rating"
}'
import { UjeebuClient } from '@ujeebu-org/ujeebu-sdk';
const client = new UjeebuClient('YOUR_API_KEY');
const response = await client.aiScrape(
'https://example.com/product',
'Extract the product name, price, and rating'
);
console.log(response.data.data);
from ujeebu_python import UjeebuClient
client = UjeebuClient('YOUR_API_KEY')
response = client.ai_scrape(
'https://example.com/product',
'Extract the product name, price, and rating'
)
print(response.json()['data'])
package main
import (
"fmt"
"log"
ujeebu "github.com/ujeebu/ujeebu-go"
)
func main() {
client, err := ujeebu.NewClient("YOUR_API_KEY")
if err != nil {
log.Fatal(err)
}
result, _, err := client.AIScrape(ujeebu.AIScrapeParams{
URL: "https://example.com/product",
Prompt: "Extract the product name, price, and rating",
})
if err != nil {
log.Fatal(err)
}
fmt.Println(result.Data)
}
Response
{
"success": true,
"data": {
"product_name": "Wireless Headphones",
"price": "$79.99",
"rating": 4.5
},
"metadata": {
"html_length": 15420,
"chunks_processed": 2,
"extraction_time_ms": 2300,
"input_tokens": 3200,
"output_tokens": 150
}
}
Request Parameters
Required Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | URL of the webpage to scrape. Supports all standard scraping features (JavaScript rendering, proxies, etc.) |
| prompt | string | No | | Natural language instruction describing what data to extract. Be specific about field names and data types. Required unless a schema is provided; when a schema is given, the prompt is optional. |
AI Configuration
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| temperature | number | No | 0.0 | LLM temperature (0.0–1.0). Lower values = more deterministic, higher = more creative. |
| schema | object | No | | JSON schema defining the expected structure of extracted data. Ensures consistent, type-safe output. |
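The constraints in the two tables above (a prompt is needed unless a schema is supplied, and temperature stays within 0.0–1.0) can be checked client-side before a request is sent. The following is a minimal sketch; `build_ai_scraper_payload` is a hypothetical helper, not part of any Ujeebu SDK, and the API may apply its own validation regardless.

```python
from typing import Any, Optional


def build_ai_scraper_payload(
    url: str,
    prompt: Optional[str] = None,
    schema: Optional[dict] = None,
    temperature: float = 0.0,
    **scraping_options: Any,
) -> dict:
    """Assemble a request body and reject combinations the tables rule out."""
    if not url:
        raise ValueError("url is required")
    if prompt is None and schema is None:
        raise ValueError("provide a prompt, a schema, or both")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")

    payload = {"url": url, "temperature": temperature}
    if prompt is not None:
        payload["prompt"] = prompt
    if schema is not None:
        payload["schema"] = schema
    # Pass-through for standard scraping parameters such as js or proxy_type.
    payload.update(scraping_options)
    return payload


payload = build_ai_scraper_payload(
    "https://example.com/product",
    prompt="Extract the product name, price, and rating",
)
```

Failing early on a missing prompt or an out-of-range temperature avoids spending a request on input the tables already rule out.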
Standard Scraping Parameters
All standard scraping parameters are supported:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| js | boolean | No | false | Enable JavaScript rendering in the browser. |
| proxy_type | string | No | | Proxy type: 'rotating', 'advanced', 'premium', 'residential', 'residential_us', 'residential_geo'. Auto proxy is enabled by default if not specified. |
| proxy_country | string | No | | ISO country code for proxy location (e.g., 'US', 'GB', 'FR') |
| timeout | number | No | 120 | Request timeout in seconds. |
| wait_for | `string \| number` | No | | |
| auto_captcha_solve | boolean | No | true | Enable automatic CAPTCHA solving. Note: unlike other endpoints this defaults to true here. |
| auto_captcha_solve_timeout | number | No | | Timeout in milliseconds for CAPTCHA solving |
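As a rough illustration of how the scraping parameters travel with a request, the sketch below builds the same POST as the Quick Start curl example using only the Python standard library. The endpoint and the `ApiKey` header come from that example; `make_request`, `send`, and the `client_timeout` value are assumptions made for this sketch.

```python
import json
import urllib.request

API_URL = "https://api.ujeebu.com/ai-scraper"  # endpoint from the Quick Start curl example


def make_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "ApiKey": api_key},
        method="POST",
    )


def send(request: urllib.request.Request, client_timeout: float = 150.0) -> dict:
    """Send the request and decode the JSON body.

    client_timeout is an assumption: it is set above the documented
    120-second default so the client does not give up before the API does.
    """
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "url": "https://example.com/product",
    "prompt": "Extract the product name, price, and rating",
    "js": True,
    "proxy_type": "premium",
    "proxy_country": "US",
    "timeout": 120,
}
request = make_request("YOUR_API_KEY", payload)
# body = send(request)  # performs the network call; needs a real API key
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx statuses, so callers would typically wrap `send` in a try/except.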
Structured Extraction with Schemas
Schemas ensure consistent, type-safe output by defining the exact structure you expect.
Basic Schema Example
{
"url": "https://example.com/product",
"prompt": "Extract product with variants and reviews",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"variants": {
"type": "array",
"items": {
"type": "object",
"properties": {
"color": { "type": "string" },
"size": { "type": "string" },
"price": { "type": "number" },
"available": { "type": "boolean" }
}
}
},
"reviews": {
"type": "array",
"items": {
"type": "object",
"properties": {
"author": { "type": "string" },
"rating": { "type": "integer" },
"comment": { "type": "string" },
"date": { "type": "string" }
}
}
}
}
}
}
Schema Response
{
"success": true,
"data": {
"name": "Premium T-Shirt",
"variants": [
{
"color": "Blue",
"size": "M",
"price": 29.99,
"available": true
},
{
"color": "Red",
"size": "L",
"price": 29.99,
"available": false
}
],
"reviews": [
{
"author": "John D.",
"rating": 5,
"comment": "Great quality!",
"date": "2025-01-05"
}
]
},
"metadata": {
"html_length": 24530,
"chunks_processed": 3,
"extraction_time_ms": 3100,
"input_tokens": 5400,
"output_tokens": 280
}
}
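To confirm on the client that a response matches the schema that was sent, a small checker can be enough. This is a sketch that covers only the keywords used on this page (`type`, `properties`, `items`, `required`); it is not a full JSON Schema validator, and `check_against_schema` is a hypothetical name.

```python
def check_against_schema(value, schema, path="$"):
    """Return a list of mismatch messages; an empty list means the value fits."""
    checks = {
        "object": lambda v: isinstance(v, dict),
        "array": lambda v: isinstance(v, list),
        "string": lambda v: isinstance(v, str),
        "boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    expected = schema.get("type")
    if expected in checks and not checks[expected](value):
        return [f"{path}: expected {expected}, got {type(value).__name__}"]

    problems = []
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                problems.append(f"{path}.{name}: required field is missing")
        for name, subschema in schema.get("properties", {}).items():
            if name in value:
                problems += check_against_schema(value[name], subschema, f"{path}.{name}")
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            problems += check_against_schema(item, schema["items"], f"{path}[{index}]")
    return problems


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "variants": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "price": {"type": "number"},
                    "available": {"type": "boolean"},
                },
            },
        },
    },
    "required": ["name"],
}
data = {"name": "Premium T-Shirt", "variants": [{"price": 29.99, "available": True}]}
problems = check_against_schema(data, schema)
```

For production use, a complete validator library is likely a better fit than this subset.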
Common Examples
Example 1: E-commerce Product
Extract product details with variants and pricing:
curl -X POST https://api.ujeebu.com/ai-scraper \
-H "Content-Type: application/json" \
-H "ApiKey: YOUR_API_KEY" \
-d '{
"url": "https://shop.example.com/product",
"prompt": "Extract product name, price, currency, rating, and available colors",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string"},
"rating": {"type": "number"},
"colors": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["name", "price", "currency"]
}
}'
import { UjeebuClient } from '@ujeebu-org/ujeebu-sdk';
const client = new UjeebuClient('YOUR_API_KEY');
const response = await client.aiScrape(
'https://shop.example.com/product',
'Extract product name, price, currency, rating, and available colors',
{
schema: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
rating: { type: 'number' },
colors: {
type: 'array',
items: { type: 'string' }
}
},
required: ['name', 'price', 'currency']
}
}
);
console.log(response.data.data);
from ujeebu_python import UjeebuClient
client = UjeebuClient('YOUR_API_KEY')
response = client.ai_scrape(
'https://shop.example.com/product',
'Extract product name, price, currency, rating, and available colors',
params={
'schema': {
'type': 'object',
'properties': {
'name': {'type': 'string'},
'price': {'type': 'number'},
'currency': {'type': 'string'},
'rating': {'type': 'number'},
'colors': {
'type': 'array',
'items': {'type': 'string'}
}
},
'required': ['name', 'price', 'currency']
}
}
)
data = response.json()
print(data['data'])
package main
import (
"fmt"
"log"
ujeebu "github.com/ujeebu/ujeebu-go"
)
func main() {
client, err := ujeebu.NewClient("YOUR_API_KEY")
if err != nil {
log.Fatal(err)
}
result, _, err := client.AIScrape(ujeebu.AIScrapeParams{
URL: "https://shop.example.com/product",
Prompt: "Extract product name, price, currency, rating, and available colors",
Schema: map[string]any{
"type": "object",
"properties": map[string]any{
"name": map[string]any{"type": "string"},
"price": map[string]any{"type": "number"},
"currency": map[string]any{"type": "string"},
"rating": map[string]any{"type": "number"},
"colors": map[string]any{
"type": "array",
"items": map[string]any{"type": "string"},
},
},
"required": []string{"name", "price", "currency"},
},
})
if err != nil {
log.Fatal(err)
}
fmt.Println(result.Data)
}
Example 2: News Article Extraction
Extract article content with metadata:
{
"url": "https://news.example.com/article",
"prompt": "Extract the article headline, author, publish date, full content, and tags",
"schema": {
"type": "object",
"properties": {
"headline": { "type": "string" },
"subheadline": { "type": "string" },
"author": { "type": "string" },
"publish_date": { "type": "string" },
"content": { "type": "string" },
"tags": {
"type": "array",
"items": { "type": "string" }
},
"category": { "type": "string" }
}
}
}
Example 3: Restaurant Reviews
Extract multiple reviews from a restaurant page:
{
"url": "https://restaurant-reviews.example.com/place/123",
"prompt": "Extract all customer reviews including ratings and dates",
"schema": {
"type": "object",
"properties": {
"restaurant_name": { "type": "string" },
"overall_rating": { "type": "number" },
"reviews": {
"type": "array",
"items": {
"type": "object",
"properties": {
"author": { "type": "string" },
"rating": { "type": "integer" },
"title": { "type": "string" },
"comment": { "type": "string" },
"date": { "type": "string" },
"helpful_count": { "type": "integer" }
}
}
}
}
}
}
Example 4: Contact Information
Extract contact details from a business website:
{
"url": "https://business.example.com/contact",
"prompt": "Extract all contact information including phone, email, address, and social media",
"schema": {
"type": "object",
"properties": {
"phone": { "type": "string" },
"email": { "type": "string" },
"address": {
"type": "object",
"properties": {
"street": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" },
"zip": { "type": "string" },
"country": { "type": "string" }
}
},
"social_media": {
"type": "object",
"properties": {
"facebook": { "type": "string" },
"twitter": { "type": "string" },
"instagram": { "type": "string" },
"linkedin": { "type": "string" }
}
}
}
}
}
Best Practices
Writing Effective Prompts
Be Specific:
- ❌ "Get product info"
- ✅ "Extract the product name, price in USD, availability status, and customer rating"
Mention Field Names:
- ❌ "Extract price and stock"
- ✅ "Extract 'price' as a number and 'in_stock' as a boolean"
Specify Data Types:
- ❌ "Extract the rating"
- ✅ "Extract the rating as a decimal number between 0 and 5"
Handle Missing Data:
- ✅ "If rating is not available, return null"
- ✅ "If price includes currency symbol, remove it and return only the number"
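Even with a prompt that asks for a bare number, a model may occasionally return a price as text such as "$79.99" (as in the Quick Start response). A client-side normalizer can act as a safety net. The sketch below is a hypothetical helper that assumes US-style formatting: commas as thousands separators and a dot as the decimal point.

```python
import re
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Turn values such as "$79.99" or "1,299.00 USD" into a float."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        return float(raw)
    # Drop thousands separators, then take the first decimal number found.
    match = re.search(r"\d+(?:\.\d+)?", str(raw).replace(",", ""))
    return float(match.group()) if match else None


price = parse_price("$79.99")
```

Values with no digits at all, such as "Call for price", come back as `None`, which matches the "return null when missing" convention suggested above.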
Schema Design Tips
Use Required Fields: Mark essential fields as required to ensure they're always present:
{
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" }
},
"required": ["name", "price"]
}
Define Defaults: Provide default values for optional fields:
{
"properties": {
"rating": { "type": "number", "default": 0 },
"in_stock": { "type": "boolean", "default": false }
}
}
Use Enums for Fixed Values: Constrain values to specific options:
{
"properties": {
"size": {
"type": "string",
"enum": ["S", "M", "L", "XL"]
}
}
}
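Whether the API itself fills in `default` values is not stated here, so a cautious client may want to apply them after the response arrives. The sketch below does this for top-level properties only; `apply_schema_defaults` is a hypothetical helper, not an SDK function.

```python
import copy


def apply_schema_defaults(data: dict, schema: dict) -> dict:
    """Return a copy of data with missing properties filled from schema defaults."""
    result = copy.deepcopy(data)
    for name, subschema in schema.get("properties", {}).items():
        if name not in result and "default" in subschema:
            result[name] = copy.deepcopy(subschema["default"])
    return result


schema = {
    "properties": {
        "rating": {"type": "number", "default": 0},
        "in_stock": {"type": "boolean", "default": False},
    }
}
filled = apply_schema_defaults({"name": "Wireless Headphones", "rating": 4.5}, schema)
```

Values already present in the data are left untouched; only absent keys receive a default.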
Performance Optimization
Enable JavaScript Only When Needed: static pages are faster to process with js set to false.
{
"url": "https://static-site.example.com",
"prompt": "Extract content",
"js": false
}
Chunk Large Pages: The AI automatically chunks content, but for very large pages, consider extracting specific sections:
{
"prompt": "Extract only the main article content, ignore navigation and footer"
}
Error Handling
Common Errors
Invalid Schema:
{
"success": false,
"error": "Invalid schema: property 'price' must be of type number"
}
LLM Timeout:
{
"success": false,
"error": "LLM request timeout after 60s. Try reducing content size or increasing timeout."
}
Extraction Failed:
{
"success": true,
"data": null,
"error": "Could not extract requested data from page content"
}
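The three response shapes above can be funnelled through one function so calling code handles a single exception type. This is a sketch under the assumption that the body always carries `success` and, on failure, an `error` string as shown; `AIScraperError` and `unwrap_response` are hypothetical names.

```python
class AIScraperError(Exception):
    """Raised when the response body reports a failure or carries no data."""


def unwrap_response(body: dict):
    """Return the extracted data, or raise AIScraperError with the API message."""
    if not body.get("success"):
        raise AIScraperError(body.get("error", "unknown error"))
    if body.get("data") is None:
        # The "Extraction Failed" shape: success is true but nothing was extracted.
        raise AIScraperError(body.get("error", "no data extracted"))
    return body["data"]


data = unwrap_response({"success": True, "data": {"product_name": "Wireless Headphones"}})
```

Treating a null `data` as an error keeps the "Extraction Failed" case from silently passing as a success.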
Error Recovery
Fall Back to Extract Rules: For structured pages, consider using traditional extract rules as a fallback.
Response Format
Success Response
{
"success": true,
"data": {
// Your extracted data based on prompt/schema
},
"metadata": {
"html_length": 15420,
"chunks_processed": 2,
"extraction_time_ms": 2300,
"input_tokens": 3200,
"output_tokens": 150
}
}
INFO — Credits Header
Credits consumed are returned in the Ujb-credits response header, not in the response body.
Field Descriptions
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether extraction was successful |
| data | object | Extracted structured data matching your prompt/schema |
| metadata.html_length | number | Size of HTML content fetched in bytes |
| metadata.chunks_processed | number | Number of content chunks processed (for large pages) |
| metadata.extraction_time_ms | number | AI extraction time in milliseconds |
| metadata.input_tokens | number | Number of input tokens sent to the LLM |
| metadata.output_tokens | number | Number of output tokens received from the LLM |
| metadata.validation_warnings | array | Warnings if extracted data doesn't fully match the schema (only present when there are warnings) |
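For logging or monitoring, the metadata fields and the credits header can be condensed into one record. The sketch below assumes the response headers are available as a plain dict keyed exactly as `Ujb-credits` (real HTTP clients often expose case-insensitive header mappings) and leaves the header value as the raw string, since its format is not specified here. `summarize_metadata` is a hypothetical helper.

```python
def summarize_metadata(body: dict, headers: dict) -> dict:
    """Condense response metadata and the credits header into one record."""
    metadata = body.get("metadata", {})
    return {
        "total_tokens": metadata.get("input_tokens", 0) + metadata.get("output_tokens", 0),
        "extraction_seconds": metadata.get("extraction_time_ms", 0) / 1000,
        "chunks": metadata.get("chunks_processed", 0),
        # validation_warnings is only present when there are warnings.
        "warnings": metadata.get("validation_warnings", []),
        "credits": headers.get("Ujb-credits"),
    }


body = {
    "success": True,
    "data": {},
    "metadata": {
        "html_length": 15420,
        "chunks_processed": 2,
        "extraction_time_ms": 2300,
        "input_tokens": 3200,
        "output_tokens": 150,
    },
}
summary = summarize_metadata(body, {"Ujb-credits": "15"})
```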
Comparison: AI Scraper vs Extract Rules
Choose the right tool for your use case:
| Feature | AI Scraper | Extract Rules |
|---|---|---|
| Setup Time | Instant (just write prompt) | Requires CSS selector analysis |
| Maintenance | Low (AI adapts to changes) | Medium (update selectors when page changes) |
| Cost | 15-40+ credits per page | 1-2 credits per page |
| Speed | 2-10 seconds | 1-3 seconds |
| Accuracy | High for unstructured content | Very high for structured content |
| Best For | Dynamic layouts, unstructured data | Static selectors, high volume |
| Flexibility | Very high | Medium |
When to Use AI Scraper
- ✅ Content structure varies between pages
- ✅ Need to extract meaning, not just text
- ✅ Prototyping and quick extraction
- ✅ Low-volume, high-value data
- ✅ Unstructured content (articles, reviews)
When to Use Extract Rules
- ✅ High-volume scraping (cost-effective)
- ✅ Consistent page structure
- ✅ Speed is critical
- ✅ Need precise control over extraction
- ✅ Static websites
Hybrid Approach
Combine both for optimal results:
// Try the AI scraper first for flexibility
const result = await client.aiScrape(url, 'Extract product details');
// Fall back to extract rules if the AI returned no data
if (!result.data.data) {
  const fallback = await client.scrape({
    url,
    extract_rules: {
      name: { selector: '.product-title', type: 'text' },
      price: { selector: '.price', type: 'text' }
    }
  });
}
Credits & Billing
Credit Cost by Proxy Type
AI Scraper always uses browser rendering. Pricing reflects the proxy cost plus an AI processing premium.
| Proxy Type | Credits per Request |
|---|---|
| rotating (default) | 15 |
| advanced | 20 |
| premium | 25 |
| residential | 40 |
| residential_us | 40 |
| residential_geo | 25 + 10/MB over 1MB |
INFO — Residential Proxy Pricing
US residential proxies (residential or residential_us) cost a flat 40 credits. Non-US residential proxies (residential_geo) have a base cost of 25 credits, plus 10 credits per MB of document size over 1MB.
INFO — Auto Proxy
If no proxy_type is specified, auto proxy is enabled by default. The system automatically tries different proxies until one succeeds. Credits are charged based on the proxy that was actually used for the successful request.
Cost Example
// Extract 50 product pages with default proxy (rotating)
for (const url of productUrls) {
await client.aiScrape(url, 'Extract product details');
}
// Total: 50 × 15 = 750 credits
// Extract 50 pages with residential proxy (US)
for (const url of productUrls) {
await client.aiScrape(url, 'Extract product details', {
proxy_type: 'residential'
});
}
// Total: 50 × 40 = 2000 credits
Billing Notes
- Credits are charged only on successful extraction
- Failed requests (4xx, 5xx errors) are not charged
- With auto proxy, you are charged for the proxy that succeeded (not failed attempts)
- When a CAPTCHA is detected and solved, an additional +5 credits surcharge is applied on top of the base request cost (auto_captcha_solve is enabled by default)
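The pricing rules in this section can be turned into a rough budget estimate. The sketch below encodes the credit table and the CAPTCHA surcharge; it assumes that each started MB beyond the first is billed as a full MB for residential_geo (the rounding rule is not stated), and it takes the document size in MB as an input. `estimate_credits` is a hypothetical helper, and with auto proxy the actual charge depends on which proxy succeeded.

```python
import math

# Flat per-request costs from the credit table.
FLAT_CREDITS = {
    "rotating": 15,
    "advanced": 20,
    "premium": 25,
    "residential": 40,
    "residential_us": 40,
}
CAPTCHA_SURCHARGE = 5


def estimate_credits(proxy_type="rotating", document_mb=0.0, captcha_solved=False):
    """Estimate the credits for one successful request."""
    if proxy_type == "residential_geo":
        # Assumption: partial megabytes over the first MB are rounded up.
        extra_mb = max(0, math.ceil(document_mb - 1))
        credits = 25 + 10 * extra_mb
    elif proxy_type in FLAT_CREDITS:
        credits = FLAT_CREDITS[proxy_type]
    else:
        raise ValueError(f"unknown proxy_type: {proxy_type}")
    if captcha_solved:
        credits += CAPTCHA_SURCHARGE
    return credits


# Reproduces the totals from the cost example: 50 pages per proxy type.
rotating_total = 50 * estimate_credits("rotating")
residential_total = 50 * estimate_credits("residential")
```

Because failed requests are not charged, an estimate like this is an upper bound only when every request succeeds without a CAPTCHA; adding the surcharge per page gives a more conservative figure.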
Next Steps
- Learn about Extract Rules for structured extraction
- Check out Templates for pre-built configurations
- Read the Node.js SDK documentation