
Clean Article Extraction

1

Overview

The Extract API extracts the main content from article pages, blog posts, and news stories. Unlike the Scrape API, which requires you to specify selectors, the Extract API identifies and extracts the relevant content automatically.

What You'll Get

Clean article title
Main article text
Featured image
Author name
Publication date
Site name & language

Why Use the Extract API?

No need to write CSS selectors or understand page structure. The Extract API works across different websites with different layouts automatically.

2

Extract API vs Scrape API

Understanding when to use each API is important:

| Feature | Extract API | Scrape API |
|---|---|---|
| Best For | Articles, blogs, news | Any structured data |
| Setup Required | None (automatic) | CSS selectors needed |
| Accuracy | High for articles | Depends on selectors |
| Flexibility | Limited to content extraction | Fully customizable |
| Speed | Fast (optimized) | Depends on complexity |
| Maintenance | None | Update selectors as sites change |

3

Make the API Request

curl -X GET "https://api.ujeebu.com/extract" \
  -H "ApiKey: YOUR_API_KEY" \
  -G \
  --data-urlencode "url=https://example.com/blog/article-title"

4

Response Format

JSON - API Response
{
  "article": {
    "url": "https://example.com/blog/article-title",
    "canonical_url": "https://example.com/blog/article-title",
    "title": "How to Build a Successful Startup",
    "text": "Starting a business is challenging but rewarding...",
    "html": "<article>Starting a business is challenging...</article>",
    "author": "John Smith",
    "pub_date": "2024-01-15 12:00:00",
    "modified_date": "2024-01-16 10:30:00",
    "image": "https://example.com/images/startup.jpg",
    "images": [
      "https://example.com/images/startup.jpg",
      "https://example.com/images/team.jpg"
    ],
    "summary": "A comprehensive guide to building your first startup...",
    "site_name": "Example Blog",
    "language": "en",
    "is_article": 0.95,
    "favicon": "https://example.com/favicon.ico",
    "encoding": "utf-8"
  },
  "time": 1.25
}
Note

The text field contains plain text, while html contains the formatted HTML with proper headings, links, and paragraphs preserved.
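
As a minimal sketch of consuming this response, the snippet below picks between the `text` and `html` fields and parses `pub_date`. The helper names `parse_pub_date` and `pick_body` are hypothetical, and the date format is an assumption inferred from the sample response above; the API may return other formats.

```python
from datetime import datetime

# Sample payload trimmed from the response format shown above.
sample = {
    "article": {
        "title": "How to Build a Successful Startup",
        "text": "Starting a business is challenging but rewarding...",
        "html": "<article>Starting a business is challenging...</article>",
        "author": "John Smith",
        "pub_date": "2024-01-15 12:00:00",
    },
    "time": 1.25,
}


def parse_pub_date(value):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string; return None if missing or malformed.

    The format is assumed from the sample response, not confirmed by the API docs.
    """
    if not value:
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def pick_body(article, prefer_html=False):
    """Return the HTML body when requested and present, otherwise the plain text."""
    if prefer_html and article.get("html"):
        return article["html"]
    return article.get("text", "")


article = sample["article"]
published = parse_pub_date(article.get("pub_date"))
body = pick_body(article)
print(published, len(body))
```

Plain text is usually the simpler choice for indexing or analysis, while the HTML body is useful when headings and links need to survive.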

5

API Options

| Parameter | Description | Default |
|---|---|---|
| url | The article URL to extract | Required |
| js | Enable JavaScript rendering | false |
| html | Return cleaned HTML content | true |
| text | Extract clean text content | true |
| images | Extract image URLs from the article | true |
| author | Extract author name | true |
| pub_date | Extract publication date | true |
| quick_mode | Faster extraction (30-60% faster, less thorough) | false |
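
The options above can be passed as query parameters. The sketch below builds such a parameter dict; `build_extract_params` is a hypothetical helper, and sending booleans as lowercase `true`/`false` strings is an assumption based on the `js=true` spelling used in this guide. Python's `requests` would otherwise serialize `True` as `True`.

```python
def build_extract_params(url, **options):
    """Build a query-parameter dict, encoding booleans as 'true'/'false'.

    Assumes the API expects lowercase boolean strings; option names are
    taken from the parameter table, not validated here.
    """
    params = {"url": url}
    for name, value in options.items():
        if isinstance(value, bool):
            params[name] = "true" if value else "false"
        else:
            params[name] = str(value)
    return params


params = build_extract_params(
    "https://example.com/blog/article-title",
    js=True,          # render JavaScript before extraction
    html=False,       # skip the cleaned HTML body
    quick_mode=True,  # trade thoroughness for speed
)
print(params)
```

The resulting dict can be handed to `requests.get(..., params=params)` alongside the `ApiKey` header.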

Bulk Extraction Example

Python
import requests
import time

def extract_articles(urls):
    """Extract content from multiple articles."""
    articles = []

    for url in urls:
        response = requests.get(
            "https://api.ujeebu.com/extract",
            headers={"ApiKey": "YOUR_API_KEY"},
            params={"url": url},
            timeout=60,  # avoid hanging indefinitely on a slow page
        )

        if response.status_code == 200:
            data = response.json()
            article = data["article"]
            articles.append({
                'url': url,
                'title': article.get('title'),
                'author': article.get('author'),
                'pub_date': article.get('pub_date'),
                'text': article.get('text'),
                'language': article.get('language')
            })

        time.sleep(1)  # Rate limiting

    return articles

# Extract from multiple URLs
urls = [
    "https://blog.example.com/article-1",
    "https://blog.example.com/article-2",
    "https://news.example.com/story"
]

articles = extract_articles(urls)
print(f"Extracted {len(articles)} articles")

6

Best Practices

01

Use for Articles Only

The Extract API is optimized for articles. For product pages or other structured data, use the Scrape API instead.

Important
02

Enable JS When Needed

Some sites render their content with JavaScript. Set js=true if the content isn't extracted properly.

Recommended
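
One way to apply this is to try a plain extraction first and retry with `js=true` only when no text comes back. The sketch below assumes a hypothetical `fetch` callable that takes a parameter dict and returns the decoded JSON response; a stand-in is used here so the example runs without network access.

```python
def extract_with_js_fallback(url, fetch):
    """Try a plain extraction first; retry with js=true if no text came back.

    `fetch` is a hypothetical callable: params dict in, decoded JSON dict out.
    Returns (article, used_js).
    """
    data = fetch({"url": url})
    article = data.get("article") or {}
    if (article.get("text") or "").strip():
        return article, False

    # Empty text is treated as a sign that the page needs JavaScript rendering.
    data = fetch({"url": url, "js": "true"})
    return data.get("article") or {}, True


def stub_fetch(params):
    """Stand-in for a real API call: only returns content when js=true."""
    if params.get("js") == "true":
        return {"article": {"title": "Rendered", "text": "Body loaded by JavaScript."}}
    return {"article": {"title": "", "text": ""}}


article, used_js = extract_with_js_fallback("https://example.com/spa-post", stub_fetch)
print(used_js, article["title"])
```

In real use, `fetch` would wrap `requests.get` against the extract endpoint with the `ApiKey` header. Trying without JavaScript first keeps the common case fast.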
03

Cache Results

Published articles rarely change. Cache extracted content to reduce API calls and improve performance.

Performance
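
A minimal in-memory cache might look like the sketch below. `ArticleCache`, its one-day default time-to-live, and the injected `fetch` callable are all illustrative assumptions; a production setup would more likely use a persistent store.

```python
import time


class ArticleCache:
    """In-memory cache keyed by URL, with a time-to-live in seconds."""

    def __init__(self, fetch, ttl=24 * 3600, clock=time.time):
        self._fetch = fetch    # hypothetical callable: url -> article dict
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested
        self._store = {}       # url -> (stored_at, article)

    def get(self, url):
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # fresh entry: no API call
        article = self._fetch(url)
        self._store[url] = (now, article)
        return article


calls = []


def counting_fetch(url):
    """Stand-in for a real API call that records how often it is invoked."""
    calls.append(url)
    return {"title": "Cached title", "url": url}


cache = ArticleCache(counting_fetch, ttl=3600)
cache.get("https://blog.example.com/article-1")
cache.get("https://blog.example.com/article-1")
print(len(calls))
```

The second lookup is served from memory, so only one call reaches the stand-in fetcher.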
04

Validate Output

Check that title and text are extracted. Some pages may block extraction or have unusual structures.

Best Practice
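
A simple validation step could be sketched as follows. `validate_article` is a hypothetical helper, and the 200-character minimum is an arbitrary threshold chosen for illustration, not a value from the API.

```python
def validate_article(article, min_text_chars=200):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not (article.get("title") or "").strip():
        problems.append("missing title")

    text = (article.get("text") or "").strip()
    if not text:
        problems.append("missing text")
    elif len(text) < min_text_chars:
        problems.append(f"text shorter than {min_text_chars} characters")
    return problems


good = {"title": "How to Build a Successful Startup", "text": "x" * 500}
bad = {"title": "", "text": "Too short"}

print(validate_article(good))  # → []
print(validate_article(bad))   # → ['missing title', 'text shorter than 200 characters']
```

Pages that fail these checks are candidates for a retry with `js=true` or for manual review.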
