Scraping Hacker News Feed

1

Overview

Hacker News (news.ycombinator.com) is a popular social news website focusing on computer science and entrepreneurship. This tutorial demonstrates how to scrape the front page stories to build news aggregators, monitor trending topics, or analyze discussion patterns.

What You'll Extract

Story title & URL
Points (upvotes)
Author username
Comment count
Time posted
Domain/source

Why Hacker News?

HN has a clean, minimal HTML structure with consistent class names, making it an excellent learning example for web scraping. The page is also server-rendered, so its content does not depend on client-side JavaScript.

2

Analyze the HTML Structure

Hacker News uses a table-based layout. Each story consists of two table rows: one for the story link and another for the metadata (points, author, time, comments).

Data Point        Selector / Method                                Type
Story Container   tr.athing                                        obj
Rank              .rank                                            text
Title             .titleline > a                                   text
URL               .titleline > a                                   link
Domain            .sitestr                                         text
Points            $parent.nextElementSibling → .score              fn
Author            $parent.nextElementSibling → .hnuser             fn
Time              $parent.nextElementSibling → .age                fn
Comments          $parent.nextElementSibling → last .subtext a     fn

Pro Tip

HN stores story metadata (points, author, time) in a sibling <tr> row. Because the child selectors of an obj rule are scoped to the matched tr.athing element, they cannot reach that adjacent row. Instead, we use fn type rules with $parent.nextElementSibling to step over to it. This requires js: true.
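To make the two-row layout concrete, here is a minimal local sketch using only Python's standard library. It is not how the Scrape API works internally; it simply shows that a metadata row belongs to the story row just before it. The sample HTML is a simplified stand-in for the real page, and `StoryRowPairer` is a hypothetical name chosen for illustration.

```python
from html.parser import HTMLParser

# Simplified stand-in for HN markup (an assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr class="athing" id="101"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
  <tr class="athing" id="102"><td><span class="titleline"><a href="https://example.com/b">Job post</a></span></td></tr>
  <tr><td class="subtext"><span class="age">1 hour ago</span></td></tr>
</table>
"""


class StoryRowPairer(HTMLParser):
    """Attach .score text from a metadata row to the story row before it."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._capture_score = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "tr" and "athing" in classes:
            # A new story row starts; metadata seen next belongs to it.
            self.stories.append({"story_id": attr_map.get("id"), "points": None})
        elif tag == "span" and "score" in classes:
            self._capture_score = True

    def handle_data(self, data):
        # The text right after <span class="score"> is the points string.
        if self._capture_score and self.stories:
            self.stories[-1]["points"] = data.strip()
        self._capture_score = False


pairer = StoryRowPairer()
pairer.feed(SAMPLE_HTML)
stories = pairer.stories
```

The second sample story has no .score element, so its points stay None. This mirrors the `|| null` fallback in the fn rules.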

3

Build Extract Rules

We'll use an object-type extract rule to capture each story with all its nested data points:

JSON - Extract Rules
{
  "stories": {
    "selector": "tr.athing",
    "type": "obj",
    "multiple": true,
    "children": {
      "rank": { "selector": ".rank", "type": "text" },
      "title": { "selector": ".titleline > a", "type": "text" },
      "url": { "selector": ".titleline > a", "type": "link" },
      "domain": { "selector": ".sitestr", "type": "text" },
      "story_id": { "selector": "", "type": "attr", "attribute": "id" },
      "points": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"
      },
      "author": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.hnuser')?.textContent || null;"
      },
      "time": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.age')?.textContent || null;"
      },
      "comments_text": {
        "type": "fn",
        "fn": "const links = Array.from($parent.nextElementSibling?.querySelectorAll('.subtext a') || []); return links.length ? links[links.length-1].textContent : null;"
      }
    }
  }
}
4

Make the API Request

Send the extract rules to the Scrape API in a POST request. Because fn type rules are needed to reach the sibling metadata row, set js: true. The example below uses a shortened rule set (title, URL, and points) to keep the command readable:

curl -X POST "https://api.ujeebu.com/scrape" \
  -H "ApiKey: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/",
    "js": true,
    "extract_rules": {
      "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": true,
        "children": {
          "title": {"selector": ".titleline > a", "type": "text"},
          "url": {"selector": ".titleline > a", "type": "link"},
          "points": {"type": "fn", "fn": "return $parent.nextElementSibling?.querySelector('"'"'.score'"'"')?.textContent || null;"}
        }
      }
    }
  }'
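
For readers who prefer Python over curl, the same request can be sketched with the standard library. The endpoint, the ApiKey header, and the payload fields are taken from the curl command above; `build_scrape_request` is a hypothetical helper name, and the request is only built here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.ujeebu.com/scrape"  # endpoint from the curl example


def build_scrape_request(api_key, target_url, extract_rules, js=True):
    """Build (but do not send) a POST request for the Scrape API."""
    payload = {"url": target_url, "js": js, "extract_rules": extract_rules}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"ApiKey": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Same shortened rule set as the curl command.
RULES = {
    "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": True,
        "children": {
            "title": {"selector": ".titleline > a", "type": "text"},
            "url": {"selector": ".titleline > a", "type": "link"},
            "points": {
                "type": "fn",
                "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;",
            },
        },
    }
}

request = build_scrape_request("YOUR_API_KEY", "https://news.ycombinator.com/", RULES)
# To send it: urllib.request.urlopen(request) returns the HTTP response.
```

Writing the rules as a Python dict avoids the shell quoting needed for the single quotes inside the fn string.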
5

Handle the Response

The API returns extracted data in JSON format under the result key:

JSON - API Response
{
  "success": true,
  "result": {
    "stories": [
      {
        "rank": "1.",
        "title": "Show HN: I built an open-source alternative to Notion",
        "url": "https://example.com/article",
        "domain": "example.com",
        "story_id": "38774523",
        "points": "342 points",
        "author": "username",
        "time": "3 hours ago",
        "comments_text": "156 comments"
      }
    ]
  }
}
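
A minimal sketch of reading this response in Python follows. It assumes a failed call is signalled by a false or missing success field, which the sample above suggests but does not confirm; `extract_stories` is a hypothetical helper name, and the sample text is a shortened copy of the response shown above.

```python
import json

# Shortened copy of the sample response from this step.
SAMPLE_RESPONSE = """
{
  "success": true,
  "result": {
    "stories": [
      {
        "rank": "1.",
        "title": "Show HN: I built an open-source alternative to Notion",
        "url": "https://example.com/article",
        "points": "342 points",
        "comments_text": "156 comments"
      }
    ]
  }
}
"""


def extract_stories(response_text):
    """Return the list of stories from a Scrape API response body."""
    body = json.loads(response_text)
    if not body.get("success"):
        # Assumption: a falsy "success" means the scrape did not succeed.
        raise ValueError("Scrape API did not report success")
    return body.get("result", {}).get("stories", [])


stories = extract_stories(SAMPLE_RESPONSE)
for story in stories:
    print(story["rank"], story["title"], story["url"])
```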
6

Best Practices

01

Use fn for Siblings

HN metadata is in a sibling row. Use fn type with $parent.nextElementSibling and js: true to access it.

Essential
02

Respect Rate Limits

Add delays between requests when scraping multiple pages. HN is a community resource.

Essential
03

Parse Data

Extract numeric values from strings like "342 points" to make data analysis easier.

Recommended
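
One way to do this parsing, sketched in Python: pull the first run of digits out of each string and fall back to None when there is none (for example, when a story has no score, or the last link carries no number). `first_int` and `normalize_story` are hypothetical helper names.

```python
import re


def first_int(text):
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def normalize_story(story):
    """Return a copy of a story with numeric points and comment counts."""
    cleaned = dict(story)
    cleaned["points"] = first_int(story.get("points"))
    cleaned["comments"] = first_int(story.get("comments_text"))
    return cleaned


example = normalize_story({"points": "342 points", "comments_text": "156 comments"})
```

Because the pattern only looks for digits, it does not matter whether the number and the word are separated by a regular space or a non-breaking one.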
04

Scrape Multiple Pages

Append ?p=2, ?p=3, and so on to the front page URL to scrape the following pages of stories.

Advanced
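
Pagination and polite delays can be combined in a short Python sketch. The `scrape_one` callable is a placeholder for whatever function sends a single Scrape API request; it, `page_urls`, and `scrape_pages` are hypothetical names, and the 2-second default delay is only an illustrative value.

```python
import time

BASE_URL = "https://news.ycombinator.com/"


def page_urls(last_page):
    """URLs for pages 1..last_page; page 1 is the plain front page."""
    return [
        BASE_URL if page == 1 else f"{BASE_URL}?p={page}"
        for page in range(1, last_page + 1)
    ]


def scrape_pages(urls, scrape_one, delay_seconds=2.0, sleep=time.sleep):
    """Call scrape_one(url) for each URL, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(scrape_one(url))
    return results


urls = page_urls(3)
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.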

Frequently Asked Questions

How often can I scrape Hacker News?

We recommend adding 2-3 second delays between requests. For frequent monitoring, consider scraping every 5-15 minutes to balance freshness with server load.

Does Hacker News require JavaScript rendering?

HN itself is server-rendered. However, because story metadata (points, author, comments) lives in a sibling row, we use fn type extract rules with $parent.nextElementSibling, which requires js: true. If you only need titles and URLs, you can skip JS for faster results.
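
The rule of thumb above can be sketched as a small Python check that walks a set of extract rules and reports whether any of them uses the fn type. `needs_js` is a hypothetical helper, not part of the API; it assumes nested rules live under a children key, as in the rules shown in this tutorial.

```python
def needs_js(rules):
    """Return True if any rule, at any nesting depth, has type 'fn'."""
    for rule in rules.values():
        if rule.get("type") == "fn":
            return True
        if needs_js(rule.get("children", {})):
            return True
    return False


# Titles and URLs only: plain selectors, so JS can stay off.
static_rules = {
    "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": True,
        "children": {
            "title": {"selector": ".titleline > a", "type": "text"},
            "url": {"selector": ".titleline > a", "type": "link"},
        },
    }
}

payload = {
    "url": "https://news.ycombinator.com/",
    "js": needs_js(static_rules),
    "extract_rules": static_rules,
}
```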

Is there an official Hacker News API?

Yes. Hacker News has an official Firebase-based API, documented at github.com/HackerNews/API. This tutorial demonstrates web scraping techniques, but for production use, consider the official API.