
AI-Powered Web Scraping with PHP

1

What is AI-Powered Web Scraping?

AI-powered web scraping leverages Large Language Models (LLMs) to extract structured data from unstructured HTML content using natural language instructions instead of CSS selectors.

Traditional vs AI Scraping

Aspect         | Traditional                 | AI-Powered
Setup          | Write CSS selectors         | Write natural language prompt
Flexibility    | Breaks when HTML changes    | Adapts to structure changes
Learning Curve | Requires HTML/CSS knowledge | Plain English instructions
Cost           | Free (after setup)          | Costs per request (LLM usage)
Speed          | Very fast                   | Slower (LLM processing)

Why Use AI Scraping?

AI scraping removes selector maintenance, takes its instructions in natural language, handles complex nested data, and adapts to most page structure changes without code edits.
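To make the "Traditional" column concrete, here is a minimal sketch of selector-based extraction using PHP's built-in DOM extension. The markup, class names, and the `extractWithSelectors` helper are hypothetical and only illustrate why a renamed class breaks a selector.

```php
<?php
// Hypothetical product markup, inlined so the sketch runs without network access
$html = '<div class="product"><h1 class="title">Desk Lamp</h1><span class="price">$24.99</span></div>';

function extractWithSelectors(string $html): array {
    // Collect libxml parse warnings internally instead of printing them
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Return the trimmed text of the first matching node, or null if nothing matches
    $first = function (string $query) use ($xpath): ?string {
        $nodes = $xpath->query($query);
        if ($nodes === false || $nodes->length === 0) {
            return null;
        }
        return trim($nodes->item(0)->textContent);
    };

    return [
        'name'  => $first('//h1[@class="title"]'),
        'price' => $first('//span[@class="price"]'),
    ];
}

$before = extractWithSelectors($html);

// Simulate a redesign that renames the price class: the selector silently stops matching
$after = extractWithSelectors(str_replace('class="price"', 'class="amount"', $html));

print_r($before);
print_r($after);
```

After the simulated redesign the price comes back as null until someone updates the selector, which is the maintenance cost that a prompt-based approach aims to avoid.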

2

Basic AI Scraping

Let's start with simple AI-powered data extraction using natural language prompts.

Example 1: Extract Product Information

<?php
function aiScrapeProduct($url) {
    $apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
    
    $prompt = "Extract the following from this product page:
        - Product name
        - Current price (with currency)
        - Star rating (as a number)
        - Number of customer reviews
        - Availability (In Stock, Out of Stock, etc.)
        - Main product image URL
        - Product description (brief summary)
    ";
    
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/ai-scraper',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'ApiKey: ' . $apiKey,
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'url' => $url,
            'prompt' => $prompt,
            'mm_model' => 'gpt-4o-mini',  // Model to use
            'temperature' => 0.0  // Deterministic output
        ])
    ]);
    
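    $response = curl_exec($ch);
    $curlError = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    // curl_exec() returns false on transport errors (DNS, timeout, TLS)
    if ($response === false) {
        throw new Exception("AI scraping failed: $curlError");
    }
    
    if ($httpCode !== 200) {
        throw new Exception("AI scraping failed: HTTP $httpCode");
    }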
    
    $data = json_decode($response, true);
    return $data['data'] ?? null;
}

// Test
$url = "https://www.amazon.com/dp/B08N5WRWNW";
$product = aiScrapeProduct($url);

echo json_encode($product, JSON_PRETTY_PRINT) . "\n";
?>

Example 2: Extract Multiple Items

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';

$prompt = "Find all products on this page and extract:
    - Product name
    - Price
    - Rating (as a number)
    - Image URL

    Return as a list of products.
";

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/ai-scraper',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey,
        'Content-Type: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://store.example.com/products',
        'prompt' => $prompt,
        'mm_model' => 'gpt-4o-mini'
    ])
]);

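$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Stop early on transport errors or non-200 responses
if ($response === false || $httpCode !== 200) {
    exit("AI scraping failed: HTTP $httpCode\n");
}

$data = json_decode($response, true);

// Assumes the model returns the list under a 'products' key
$products = $data['data']['products'] ?? [];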

echo "Found " . count($products) . " products:\n";
foreach ($products as $product) {
    echo "- " . ($product['name'] ?? 'N/A') . ": " . ($product['price'] ?? 'N/A') . "\n";
}
?>
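
Example 2 reads the list from a `products` key, but without a schema the key that holds the list is not guaranteed. The sketch below is one defensive way to locate the list. The `extractItemList` helper is hypothetical, and it assumes the payload is either a list itself or an object with one list-valued key.

```php
<?php
// True when the array has sequential integer keys starting at 0
function isList(array $a): bool {
    return $a === array_values($a);
}

/**
 * Return the list of items from a decoded AI response.
 * Tries the payload itself, then the preferred key, then the first non-empty list found.
 */
function extractItemList(?array $payload, string $preferredKey = 'products'): array {
    if ($payload === null) {
        return [];
    }
    if (isList($payload)) {
        return $payload;
    }
    if (isset($payload[$preferredKey]) && is_array($payload[$preferredKey])) {
        return $payload[$preferredKey];
    }
    foreach ($payload as $value) {
        if (is_array($value) && $value !== [] && isList($value)) {
            return $value;
        }
    }
    return [];
}

// Sample payloads standing in for the decoded 'data' field of a response
$underProducts = ['products' => [['name' => 'A']]];
$underItems    = ['items' => [['name' => 'A'], ['name' => 'B']]];
$bareList      = [['name' => 'A']];

echo count(extractItemList($underProducts)), "\n";
echo count(extractItemList($underItems)), "\n";
echo count(extractItemList($bareList)), "\n";
```

In Example 2 this would replace the direct lookup with `extractItemList($data['data'] ?? null)`. Passing a schema, as covered later, is the more reliable fix.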
3

Prompt Engineering Best Practices

Writing effective prompts is key to getting accurate results from AI scraping.
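
One way to keep prompts specific is to build them from an explicit field list rather than writing them freehand. The sketch below assumes nothing about the API. `buildExtractionPrompt` is a hypothetical helper that only assembles a string in the same bullet style as the prompts in the examples above.

```php
<?php
/**
 * Build an extraction prompt that names every field explicitly.
 * $fields maps a field label to a short description of what to return.
 */
function buildExtractionPrompt(string $pageType, array $fields): string {
    $lines = ["Extract the following from this $pageType:"];
    foreach ($fields as $label => $description) {
        $lines[] = "- $label: $description";
    }
    return implode("\n", $lines);
}

$specific = buildExtractionPrompt('product page', [
    'Product name'  => 'the main title of the product',
    'Current price' => 'including the currency symbol',
    'Availability'  => "one of 'In Stock' or 'Out of Stock'",
]);

echo $specific, "\n";
```

Keeping the field list in one array also makes it easy to reuse the same labels when checking the response.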

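1. Be Specific and Clear

A prompt such as "Get product info" is too vague. Name each field you want and say what it should contain, as the prompt in Example 1 does.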

2. Specify Data Types

<?php
$prompt = "Extract product information:
- Title: The main product name (string)
- Price: Current price as a string in format '\$XX.XX' including the dollar sign
- Rating: Average star rating as a number 0-5 (float)
- Reviews: Total review count as integer (if no reviews, return 0)
- Stock: 'in_stock', 'out_of_stock', 'limited', or 'pre_order' (string)
- Features: List of key product features as array (max 5, if none available return empty array)
";
?>
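
Even with types spelled out in the prompt, it is safer to coerce the returned values before using them. The sketch below is one possible normalizer. It assumes the response keys match the labels in the prompt (Title, Price, Rating, Reviews, Stock, Features), which is an assumption rather than documented behavior, and `normalizeProduct` is a hypothetical helper.

```php
<?php
function normalizeProduct(array $raw): array {
    $allowedStock = ['in_stock', 'out_of_stock', 'limited', 'pre_order'];

    // Accept numeric strings such as "4.5", then clamp to the 0-5 range from the prompt
    $rating = null;
    if (isset($raw['Rating']) && is_numeric($raw['Rating'])) {
        $rating = max(0.0, min(5.0, (float) $raw['Rating']));
    }

    // Counts may arrive as strings like "1,234"; keep digits only
    $reviews = (int) preg_replace('/\D/', '', (string) ($raw['Reviews'] ?? 0));

    $stock = $raw['Stock'] ?? null;
    $features = $raw['Features'] ?? [];

    return [
        'title'    => isset($raw['Title']) ? trim((string) $raw['Title']) : null,
        'price'    => isset($raw['Price']) ? trim((string) $raw['Price']) : null,
        'rating'   => $rating,
        'reviews'  => $reviews,
        // Anything outside the allowed values becomes null instead of passing through
        'stock'    => in_array($stock, $allowedStock, true) ? $stock : null,
        // Enforce the "max 5" limit from the prompt
        'features' => is_array($features) ? array_slice(array_values($features), 0, 5) : [],
    ];
}

$clean = normalizeProduct([
    'Title'    => ' Desk Lamp ',
    'Price'    => '$24.99',
    'Rating'   => '4.5',
    'Reviews'  => '1,234',
    'Stock'    => 'sold out',
    'Features' => ['a', 'b', 'c', 'd', 'e', 'f'],
]);

print_r($clean);
```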
4

Schema Validation

Define an expected output schema so that results come back in a consistent structure you can validate.

Using Schema for Validation

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';

// Define expected output schema (standard JSON Schema)
$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string', 'description' => 'Product name'],
        'price' => ['type' => 'string', 'description' => 'Price with currency symbol'],
        'currency' => ['type' => 'string', 'description' => 'Currency code'],
        'rating' => ['type' => 'number', 'description' => 'Star rating 0-5'],
        'total_reviews' => ['type' => 'number', 'description' => 'Number of reviews'],
        'in_stock' => ['type' => 'boolean', 'description' => 'Whether product is in stock'],
        'features' => [
            'type' => 'array',
            'items' => ['type' => 'string'],
            'description' => 'Key product features'
        ]
    ],
    'required' => ['name', 'price']
];

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/ai-scraper',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey,
        'Content-Type: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://example.com/product',
        'prompt' => 'Extract all product information including name, price, rating, reviews, stock status, and key features',
        'schema' => $schema,
        'mm_model' => 'gpt-4o-mini'
    ])
]);

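$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Only trust the payload when the request itself succeeded
$data = ($response !== false && $httpCode === 200) ? json_decode($response, true) : null;

// With a schema, the returned fields follow the structure defined above
$product = $data['data'] ?? null;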
?>
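
A schema sent to the API shapes the output, but a client-side check still catches missing or mistyped fields before they reach your database. The sketch below is a deliberately small validator covering only `required` and flat `type` checks. It is not a full JSON Schema implementation, and `validateAgainstSchema` is a hypothetical helper.

```php
<?php
/**
 * Check decoded data against the 'required' and 'properties' parts of a schema.
 * Returns a list of error messages; an empty list means the data passed.
 */
function validateAgainstSchema(array $data, array $schema): array {
    $errors = [];

    foreach ($schema['required'] ?? [] as $field) {
        if (!array_key_exists($field, $data) || $data[$field] === null) {
            $errors[] = "Missing required field: $field";
        }
    }

    // Map JSON Schema type names to PHP checks (flat types only)
    $checks = [
        'string'  => function ($v) { return is_string($v); },
        'number'  => function ($v) { return is_int($v) || is_float($v); },
        'integer' => function ($v) { return is_int($v); },
        'boolean' => function ($v) { return is_bool($v); },
        'array'   => function ($v) { return is_array($v); },
    ];

    foreach ($schema['properties'] ?? [] as $field => $spec) {
        if (!isset($data[$field])) {
            continue; // absent optional fields are fine
        }
        $type = $spec['type'] ?? '';
        if (isset($checks[$type]) && !$checks[$type]($data[$field])) {
            $errors[] = "Field $field should be of type $type";
        }
    }

    return $errors;
}

$schema = [
    'type' => 'object',
    'properties' => [
        'name'   => ['type' => 'string'],
        'price'  => ['type' => 'string'],
        'rating' => ['type' => 'number'],
    ],
    'required' => ['name', 'price'],
];

$good = validateAgainstSchema(['name' => 'Desk Lamp', 'price' => '$24.99', 'rating' => 4.5], $schema);
$bad  = validateAgainstSchema(['name' => 'Desk Lamp', 'rating' => '4.5'], $schema);

print_r($good);
print_r($bad);
```

For nested objects, enums, or formats, a dedicated JSON Schema validation library would be a better fit than extending this sketch.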
5

Cost Optimization Strategies

AI scraping can be expensive at scale. Here are strategies to reduce costs:

1. Use Cheaper Models

The ratings below are relative LLM provider costs per request, separate from Ujeebu API credits. Select a model with the mm_model parameter.

Model             | Cost | Quality   | Speed
gpt-4o-mini       | Low  | Good      | Fast
claude-3-haiku    | Low  | Good      | Fast
gemini-1.5-flash  | Low  | Good      | Very Fast
llama3.2 (Ollama) | Free | Good      | Medium
gpt-4o            | High | Excellent | Medium
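
One way to act on this table is to try a cheap model first and only fall back to an expensive one when required fields are missing. The sketch below shows that escalation pattern under stated assumptions. `scrapeWithEscalation` is a hypothetical helper, and the stub stands in for a real /ai-scraper call; its behavior is made up for the demo and says nothing about how these models actually perform.

```php
<?php
/**
 * Try models in the given order until the result contains every required field.
 * $scrape is any callable taking a model name and returning the decoded data or null.
 */
function scrapeWithEscalation(callable $scrape, array $models, array $requiredFields): ?array {
    foreach ($models as $model) {
        $result = $scrape($model);
        if (!is_array($result)) {
            continue;
        }
        $missing = array_filter($requiredFields, function ($field) use ($result) {
            return !isset($result[$field]) || $result[$field] === '';
        });
        if ($missing === []) {
            return ['model' => $model, 'data' => $result];
        }
    }
    return null; // no model produced a complete result
}

// Stub so the sketch runs offline: the cheaper model "misses" the price
$fakeScrape = function (string $model): ?array {
    return $model === 'gpt-4o-mini'
        ? ['name' => 'Desk Lamp']
        : ['name' => 'Desk Lamp', 'price' => '$24.99'];
};

$outcome = scrapeWithEscalation($fakeScrape, ['gpt-4o-mini', 'gpt-4o'], ['name', 'price']);
echo $outcome['model'], "\n";
```

The trade-off is that a failed cheap attempt is still billed, so this only pays off when the cheaper model succeeds most of the time.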

2. Generate Rules Once, Use Many Times

<?php
function getOrCreateRules($url, $prompt, $rulesFile = 'rules.json') {
    $apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
    
    // Check if rules already exist
    if (file_exists($rulesFile)) {
        echo "Loading existing rules from $rulesFile\n";
        return json_decode(file_get_contents($rulesFile), true);
    }
    
    // Generate new rules with AI (one-time cost)
    echo "Generating new rules for $url\n";
    
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/guess-er',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'ApiKey: ' . $apiKey,
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'url' => $url,
            'prompt' => $prompt,
            'provider' => 'openai'
        ])
    ]);
    
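    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    $data = $response === false ? null : json_decode($response, true);
    $rules = $data['extract_rules'] ?? null;
    
    // Do not cache a failed generation, or every later run would load empty rules
    if ($httpCode !== 200 || $rules === null) {
        throw new Exception("Rule generation failed: HTTP $httpCode");
    }
    
    // Save rules for future use
    file_put_contents($rulesFile, json_encode($rules, JSON_PRETTY_PRINT));
    echo "Rules saved to $rulesFile\n";
    
    return $rules;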
}

// Usage: Generate once, use many times
$rules = getOrCreateRules(
    'https://store.example.com/product',
    'Extract product details',
    'product_rules.json'
);

// Now scrape many pages with the same rules (no LLM cost)
$productUrls = [
    'https://store.example.com/product/1',
    'https://store.example.com/product/2',
    'https://store.example.com/product/3',
];

foreach ($productUrls as $url) {
    // Use /scrape with extract_rules (no AI cost!)
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/scrape',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'ApiKey: ' . getenv('UJEEBU_API_KEY'),
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'url' => $url,
            'extract_rules' => $rules,
            'js' => true,
            'response_type' => 'json'
        ])
    ]);
    
    $response = curl_exec($ch);
    $data = json_decode($response, true);
    curl_close($ch);
    
    // Process product data (extracted without any LLM call)
}
?>

Cost Savings Tip

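Generate extract rules once with AI (a one-time cost), then reuse them for later pages without further LLM cost. For repeated scraping of pages that share a layout, this removes nearly all AI calls: with 100 pages, one rule-generation call replaces 100 AI extraction calls, a 99% reduction.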
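
The size of the saving depends on how many pages share one set of rules. As a rough sketch, the function below counts AI calls only. It assumes one AI call per page without rules and a single rule-generation call with them, and it ignores any difference in price between the two call types. `aiCallSavings` is a hypothetical helper.

```php
<?php
/**
 * Fraction of AI calls avoided when one rule-generation call
 * replaces one AI extraction call per page.
 */
function aiCallSavings(int $pages): float {
    if ($pages < 1) {
        throw new InvalidArgumentException('pages must be at least 1');
    }
    return 1 - 1 / $pages;
}

foreach ([10, 100, 1000] as $n) {
    printf("%d pages: %.1f%% fewer AI calls\n", $n, aiCallSavings($n) * 100);
}
```

The saving approaches 100% as the page count grows, but it resets whenever the site layout changes enough that the rules must be regenerated.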

6

Best Practices

01. Use Temperature 0.0 (Essential)

Set temperature to 0.0 for the most consistent, repeatable results. Higher temperatures add randomness, which is usually unwanted for data extraction.

02. Always Use a Schema (Recommended)

Define an output schema to keep the structure consistent and make results easy to validate. This prevents downstream errors and makes data processing predictable.

03. Generate Rules First (Cost Saving)

For repeated scraping, use /guess-er to generate extract rules once, then use /scrape with those rules to extract data without further LLM cost.

04. Handle Errors Gracefully (Production)

Always check HTTP status codes and handle API errors. Implement retry logic for transient failures, and validate extracted data before using it.
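
Retry logic for transient failures can be sketched as a small wrapper with exponential backoff. The version below makes several assumptions: `withRetry` is a hypothetical helper, the request is passed in as a callable returning `[httpCode, body]`, and the set of retryable status codes (including 0 for cURL transport failures) is a common convention rather than something taken from the API documentation.

```php
<?php
/**
 * Call $request until it returns HTTP 200 or attempts run out.
 * $request must return [int $httpCode, string $body].
 */
function withRetry(callable $request, int $maxAttempts = 3, int $baseDelayMs = 500): array {
    $retryable = [0, 429, 500, 502, 503, 504]; // assumed transient failures

    for ($attempt = 1; ; $attempt++) {
        [$httpCode, $body] = $request();

        if ($httpCode === 200) {
            $data = json_decode($body, true);
            if (is_array($data)) {
                return $data;
            }
            throw new RuntimeException('Response was not valid JSON');
        }

        if (!in_array($httpCode, $retryable, true) || $attempt >= $maxAttempts) {
            throw new RuntimeException("Request failed after $attempt attempt(s): HTTP $httpCode");
        }

        // Exponential backoff: base, 2x base, 4x base, ...
        usleep($baseDelayMs * 1000 * 2 ** ($attempt - 1));
    }
}

// Offline stub: fails twice with 503, then succeeds
$calls = 0;
$flaky = function () use (&$calls): array {
    $calls++;
    return $calls < 3 ? [503, ''] : [200, '{"data":{"name":"Desk Lamp"}}'];
};

$result = withRetry($flaky, 3, 1);
echo "Succeeded after $calls calls\n";
```

In real use, the callable would wrap the `curl_exec` and `curl_getinfo` calls from the earlier examples and return the status code together with the response body.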
