Overview

PHP is a practical choice for web scraping, especially when the scraper is part of a web application or API. With PHP's cURL extension and the Ujeebu API, you can fetch and extract data from both static and JavaScript-rendered pages.

Why Use PHP for Web Scraping?

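Built-in HTTP support: the cURL extension handles requests, headers, and timeouts
Easy JSON handling: json_encode() and json_decode() ship with PHP
Server-side integration: scraped data can feed directly into an existing PHP application or API
Wide hosting support: PHP is available on most shared and managed hosting plans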

2. Prerequisites

Before starting, you'll need:

PHP 7.4+ with cURL extension enabled
Ujeebu API Key (sign up at ujeebu.com)
Basic PHP knowledge (variables, functions, arrays)

Check PHP cURL Support

<?php
// Check if cURL is available
if (function_exists('curl_init')) {
    echo "✅ cURL is enabled\n";
} else {
    echo "❌ cURL is not enabled. Please enable the cURL extension.\n";
}

// Check PHP version
echo "PHP Version: " . PHP_VERSION . "\n";
?>

3. Basic Web Scraping

Let's start with the simplest form of web scraping - fetching HTML content from a static webpage.

Basic Scraping Example

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
$url = 'https://example.com';

// Initialize cURL
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/scrape?' . http_build_query([
        'url' => $url
    ]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey
    ]
]);

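// Execute request
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$curlError = curl_error($ch);
curl_close($ch);

// Check if successful
if ($response === false) {
    // Transport failure (DNS, TLS, timeout): there is no HTTP status to inspect
    echo "❌ cURL error: $curlError\n";
} elseif ($httpCode === 200) {
    $data = json_decode($response, true);
    // Use the raw body if the response is not a JSON object
    $html = is_array($data) ? ($data['html'] ?? $data['text'] ?? '') : $response;
    echo "✅ Successfully scraped " . strlen($html) . " bytes\n";
    echo substr($html, 0, 500) . "...\n";
} else {
    echo "❌ Error: HTTP $httpCode\n";
    echo $response . "\n";
}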
?>

The same request with command-line cURL:

curl -X GET "https://api.ujeebu.com/scrape" \
  -H "ApiKey: YOUR_API_KEY" \
  -G \
  --data-urlencode "url=https://example.com"

Using DOMDocument for HTML Parsing

<?php
function scrapePage($url) {
    $apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
    
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/scrape?' . http_build_query(['url' => $url]),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => ['ApiKey: ' . $apiKey]
    ]);
    
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    if ($httpCode !== 200) {
        return null;
    }
    
    $data = json_decode($response, true);
    // Use the raw body if the response is not a JSON object
    return is_array($data) ? ($data['html'] ?? $data['text'] ?? '') : $response;
}

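// Scrape and parse HTML (the selectors below match Hacker News markup)
$html = scrapePage('https://news.ycombinator.com');
if ($html) {
    $dom = new DOMDocument();
    // Collect libxml warnings about imperfect HTML instead of silencing them with @
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    
    // Extract article titles
    $articles = [];
    // Match rows whose class list contains "athing", even when other classes are present
    $nodes = $xpath->query("//tr[contains(concat(' ', normalize-space(@class), ' '), ' athing ')]");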
    
    foreach ($nodes as $node) {
        $titleNode = $xpath->query(".//span[@class='titleline']/a", $node)->item(0);
        if ($titleNode) {
            $articles[] = [
                'title' => $titleNode->textContent,
                'url' => $titleNode->getAttribute('href'),
                'id' => $node->getAttribute('id')
            ];
        }
    }
    
    // Display results
    foreach (array_slice($articles, 0, 10) as $idx => $article) {
        echo ($idx + 1) . ". " . $article['title'] . "\n";
        echo "   " . $article['url'] . "\n\n";
    }
}
?>
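
To see the XPath technique in isolation, here is a minimal offline sketch that runs the same kind of query against an inline HTML string, so no API call is needed. The sample markup and the helper name extractLinks are made up for illustration, and the class name is assumed to be a plain token without quotes.

```php
<?php
// Hypothetical helper (not part of the Ujeebu API): returns title/url pairs
// for every <a> found inside elements that carry the given class.
function extractLinks(string $html, string $class): array {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Token-safe class match: also matches class="story featured"
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]//a",
        $class
    );

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [
            'title' => trim($a->textContent),
            'url'   => $a->getAttribute('href'),
        ];
    }
    return $links;
}

// Sample markup (an assumption, for demonstration only)
$sample = '<html><body>
  <div class="story featured"><a href="/a">First</a></div>
  <div class="story"><a href="/b">Second</a></div>
  <div class="ad"><a href="/c">Sponsored</a></div>
</body></html>';

$links = extractLinks($sample, 'story');
foreach ($links as $link) {
    echo $link['title'] . ' -> ' . $link['url'] . "\n";
}
```

Only the two links inside "story" elements are returned; the link in the "ad" block is skipped because its parent does not carry the class.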

4. JavaScript-Rendered Pages

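Modern websites often use JavaScript frameworks (React, Vue, Angular) to render content dynamically. A plain HTTP fetch of such a page returns only the initial HTML, often without the content you want, so the page has to be rendered in a browser first. Setting the js parameter to true asks the API to do that.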

Basic JavaScript Rendering

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
$url = 'https://example-spa.com';

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/scrape',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey,
        'Content-Type: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => $url,
        'js' => true,  // Enable JavaScript execution
        'wait_for' => 3000  // Wait 3 seconds for content to load
    ])
]);

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($httpCode === 200) {
    $data = json_decode($response, true);
    $html = $data['html'] ?? '';
    echo "Rendered HTML: " . strlen($html) . " characters\n";
}
?>

The same request with command-line cURL:

curl -X POST "https://api.ujeebu.com/scrape" \
  -H "ApiKey: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-spa.com",
    "js": true,
    "wait_for": 3000
  }'

Waiting for Specific Elements

<?php
function scrapeWithJS($url, $waitForSelector = null) {
    $apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
    
    $params = [
        'url' => $url,
        'js' => true,
        'timeout' => 60
    ];
    
    if ($waitForSelector) {
        $params['wait_for'] = $waitForSelector;
    }
    
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/scrape',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'ApiKey: ' . $apiKey,
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode($params)
    ]);
    
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    if ($httpCode === 200) {
        return json_decode($response, true);
    }
    
    return null;
}

// Wait for product cards to load
$result = scrapeWithJS('https://example.com/products', '.product-card');
?>

5. Structured Data Extraction with Extract Rules

Instead of parsing HTML manually, use Extract Rules: you describe each field with a CSS selector and a type, and the API returns the extracted values as structured JSON.

Basic Extract Rules Example

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';

// Define extraction rules
$extractRules = [
    'products' => [
        'selector' => '.product-card',
        'type' => 'obj',
        'multiple' => true,
        'children' => [
            'name' => [
                'selector' => '.product-title',
                'type' => 'text'
            ],
            'price' => [
                'selector' => '.product-price',
                'type' => 'text'
            ],
            'image' => [
                'selector' => 'img.product-image',
                'type' => 'image'
            ],
            'url' => [
                'selector' => 'a.product-link',
                'type' => 'link'
            ]
        ]
    ]
];

// Scrape with rules
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/scrape',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey,
        'Content-Type: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://store.example.com/products',
        'js' => true,
        'wait_for' => '.product-card',
        'extract_rules' => $extractRules,
        'response_type' => 'json'
    ])
]);

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Process results
if ($httpCode === 200) {
    $data = json_decode($response, true);
    $products = $data['products'] ?? [];
    
    echo "Extracted " . count($products) . " products:\n";
    foreach ($products as $product) {
        echo "- " . ($product['name'] ?? 'N/A') . ": " . ($product['price'] ?? 'N/A') . "\n";
    }
}
?>

The same rules as a JSON payload:

{
  "products": {
    "selector": ".product-card",
    "type": "obj",
    "multiple": true,
    "children": {
      "name": {
        "selector": ".product-title",
        "type": "text"
      },
      "price": {
        "selector": ".product-price",
        "type": "text"
      },
      "image": {
        "selector": "img.product-image",
        "type": "image"
      },
      "url": {
        "selector": "a.product-link",
        "type": "link"
      }
    }
  }
}
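
Writing nested rule arrays by hand gets repetitive once a page has many fields. The sketch below builds the same structure from a compact field map. buildListRule is a hypothetical helper, not part of the Ujeebu API, and it uses only the rule keys shown above (selector, type, multiple, children).

```php
<?php
// Hypothetical helper: builds a "list of objects" rule from a map of
// field name => [CSS selector, type].
function buildListRule(string $itemSelector, array $fields): array {
    $children = [];
    foreach ($fields as $name => [$selector, $type]) {
        $children[$name] = ['selector' => $selector, 'type' => $type];
    }
    return [
        'selector' => $itemSelector,
        'type'     => 'obj',
        'multiple' => true,
        'children' => $children,
    ];
}

$extractRules = [
    'products' => buildListRule('.product-card', [
        'name'  => ['.product-title', 'text'],
        'price' => ['.product-price', 'text'],
        'image' => ['img.product-image', 'image'],
        'url'   => ['a.product-link', 'link'],
    ]),
];

// Print the rules as JSON to compare with the payload above
echo json_encode($extractRules, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n";
```

The resulting array can be passed as the extract_rules value in the request body exactly like the hand-written version.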

6. AI-Powered Scraping

The most flexible method: describe the data you want in a natural-language prompt and let an AI model extract it, with no CSS selectors required.

Basic AI Scraping

<?php
$apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';

// AI-powered extraction with natural language
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://api.ujeebu.com/ai-scraper',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'ApiKey: ' . $apiKey,
        'Content-Type: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://store.example.com/product/12345',
        'prompt' => 'Extract the product name, price, rating, and availability status',
        'mm_model' => 'gpt-4o-mini',
        'temperature' => 0.0
    ])
]);

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($httpCode === 200) {
    $data = json_decode($response, true);
    echo json_encode($data['data'] ?? [], JSON_PRETTY_PRINT) . "\n";
}
?>

The same request with command-line cURL:

curl -X POST "https://api.ujeebu.com/ai-scraper" \
  -H "ApiKey: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/12345",
    "prompt": "Extract the product name, price, rating, and availability status",
    "mm_model": "gpt-4o-mini",
    "temperature": 0.0
  }'

Auto-Generate Extract Rules with AI

<?php
function generateExtractRules($url, $prompt) {
    $apiKey = getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY';
    
    // Step 1: Generate extract rules using AI
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/guess-er',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'ApiKey: ' . $apiKey,
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'url' => $url,
            'prompt' => $prompt,
            'provider' => 'openai',
            'model' => 'gpt-4o-mini'
        ])
    ]);
    
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    if ($httpCode === 200) {
        $data = json_decode($response, true);
        return $data['extract_rules'] ?? null;
    }
    
    return null;
}

// Generate rules
$rules = generateExtractRules(
    'https://store.example.com/products',
    'Extract product name, price, image, and rating'
);

if ($rules) {
    echo "Generated rules:\n";
    echo json_encode($rules, JSON_PRETTY_PRINT) . "\n";
    
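    // Step 2: Reuse the generated rules with the regular scrape endpoint (no further AI calls)
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => 'https://api.ujeebu.com/scrape',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            // Same key fallback as in generateExtractRules()
            'ApiKey: ' . (getenv('UJEEBU_API_KEY') ?: 'YOUR_API_KEY'),
            'Content-Type: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'url' => 'https://store.example.com/products',
            'extract_rules' => $rules,
            'js' => true,
            'response_type' => 'json'
        ])
    ]);
    
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    
    if ($httpCode === 200) {
        $data = json_decode($response, true);
        // 'products' is only present if the generated rules define a field with that name
        $products = $data['products'] ?? [];
        echo "\nExtracted " . count($products) . " products without another AI call\n";
    } else {
        echo "❌ Error: HTTP $httpCode\n";
    }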
}
?>

7. Best Practices

01. Rate Limiting (Essential)

Add delays between requests to avoid overwhelming servers and getting rate-limited. Use 1-2 second delays for production scraping.
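
A fixed pause between requests can be sketched as follows. scrapeAll and the $fetch callable are placeholders for whatever request function you use (for example the scrapePage() function from earlier); the delay is given in seconds, and the demo uses a very short one so it finishes quickly.

```php
<?php
// Hypothetical helper: calls $fetch for each URL and sleeps between calls.
function scrapeAll(array $urls, callable $fetch, float $delaySeconds = 1.5): array {
    $results = [];
    $last = count($urls) - 1;
    foreach (array_values($urls) as $i => $url) {
        $results[$url] = $fetch($url);
        if ($i < $last) {
            // usleep() takes microseconds; no pause after the final request
            usleep((int) round($delaySeconds * 1000000));
        }
    }
    return $results;
}

// Demo with a stand-in fetcher instead of a real HTTP call
$start = microtime(true);
$pages = scrapeAll(
    ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    fn(string $url) => 'fetched ' . $url,
    0.05
);
$elapsed = microtime(true) - $start;
echo count($pages) . " pages in " . round($elapsed, 2) . "s\n";
```

For production use, the default of 1.5 seconds sits inside the 1-2 second range suggested above.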

02. Error Handling (Recommended)

Always check HTTP status codes and handle errors gracefully. Implement retry logic for transient failures.
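
Retry logic for transient failures might look like the sketch below, which uses exponential backoff. withRetry is a hypothetical helper, the list of retryable status codes is an assumption rather than something taken from the Ujeebu documentation, and the request callable is expected to return an array with status and body keys.

```php
<?php
// Hypothetical helper: retries $request while it reports a transient failure.
// $request must return ['status' => int, 'body' => string].
function withRetry(callable $request, int $maxAttempts = 3, float $baseDelay = 1.0): array {
    $transient = [429, 500, 502, 503, 504];   // assumed set of retryable statuses
    $result = ['status' => 0, 'body' => ''];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $request();
        // Status 0 stands for a transport error (no HTTP response at all)
        $retryable = $result['status'] === 0
            || in_array($result['status'], $transient, true);
        if (!$retryable || $attempt === $maxAttempts) {
            break;
        }
        // Exponential backoff: baseDelay, then 2x, 4x, ...
        usleep((int) round($baseDelay * (2 ** ($attempt - 1)) * 1000000));
    }
    return $result;
}

// Demo: a fake request that fails twice with 503, then succeeds
$calls = 0;
$response = withRetry(function () use (&$calls) {
    $calls++;
    return $calls < 3
        ? ['status' => 503, 'body' => 'busy']
        : ['status' => 200, 'body' => 'ok'];
}, 3, 0.01);

echo "Status {$response['status']} after $calls attempts\n";
```

Non-retryable responses such as a 404 are returned immediately, so a permanent error does not cost extra requests.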

03. Use Extract Rules (Cost Saving)

Generate rules once with AI, then reuse them with the regular scrape endpoint without further AI calls. For repeated tasks this is much more cost-effective than running AI scraping on every request.
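
One way to make sure rules are generated only once is to store them on disk. In this sketch, loadOrGenerateRules is a hypothetical helper and the generator is a stand-in closure; in practice the callable would wrap generateExtractRules() from the previous section.

```php
<?php
// Hypothetical helper: returns rules from $file if present, otherwise calls
// $generate once and stores the result as JSON.
function loadOrGenerateRules(string $file, callable $generate): ?array {
    if (is_file($file)) {
        $rules = json_decode((string) file_get_contents($file), true);
        if (is_array($rules)) {
            return $rules;
        }
    }
    $rules = $generate();
    if (is_array($rules)) {
        file_put_contents($file, json_encode($rules, JSON_PRETTY_PRINT), LOCK_EX);
        return $rules;
    }
    return null;   // generation failed; nothing is written
}

// Demo with a stand-in generator (no API call is made here)
$file = sys_get_temp_dir() . '/ujeebu_rules_demo_' . uniqid() . '.json';
$generated = 0;
$generate = function () use (&$generated) {
    $generated++;
    return ['title' => ['selector' => 'h1', 'type' => 'text']];
};

$first  = loadOrGenerateRules($file, $generate);   // generates and saves
$second = loadOrGenerateRules($file, $generate);   // read back from disk
echo "Generator ran $generated time(s)\n";
unlink($file);   // clean up the demo file
```

If the target site changes its markup, deleting the stored file forces a fresh generation on the next run.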

04. Cache Results (Performance)

Cache scraped data to avoid redundant requests. Use file-based or database caching for production applications.
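
A minimal file-based cache with a time-to-live could be sketched like this. The cached function, the file naming scheme, and the use of the system temp directory are all assumptions made for the example; a production setup would more likely use a dedicated cache directory or a database.

```php
<?php
// Hypothetical helper: returns the cached value for $key if it is younger than
// $ttl seconds, otherwise calls $produce and caches its result.
function cached(string $key, int $ttl, callable $produce, ?string $dir = null) {
    $dir  = $dir ?? sys_get_temp_dir();
    $file = $dir . '/scrape_cache_' . sha1($key) . '.json';

    clearstatcache(true, $file);   // avoid stale file metadata within one script run
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $hit = json_decode((string) file_get_contents($file), true);
        if (is_array($hit) && array_key_exists('value', $hit)) {
            return $hit['value'];
        }
    }

    $value = $produce();
    file_put_contents($file, json_encode(['value' => $value]), LOCK_EX);
    return $value;
}

// Demo: the stand-in fetcher runs once, the second call is served from the cache
$url = 'https://example.com/demo-' . uniqid();   // unique key so the demo starts cold
$fetches = 0;
$fetch = function () use (&$fetches) {
    $fetches++;
    return '<html>stub</html>';
};

$a = cached($url, 3600, $fetch);
$b = cached($url, 3600, $fetch);
echo "Fetched $fetches time(s)\n";
```

Choose the TTL according to how quickly the source data changes: an hour may suit a product catalogue, while a news listing may need a few minutes.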

Web Scraping with PHP - Complete Guide