Hacker News Scraper - Extract Tech News Stories

1

Overview

Hacker News (news.ycombinator.com) is a popular social news website focusing on computer science and entrepreneurship. This tutorial demonstrates how to scrape the front page stories to build news aggregators, monitor trending topics, or analyze discussion patterns.

What You'll Extract

Story title & URL

Points (upvotes)

Author username

Comment count

Time posted

Domain/source

Why Hacker News?

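HN has a clean, minimal HTML structure with consistent class names, making it an excellent learning example for web scraping. The page is server-rendered, so titles and URLs can be extracted without JavaScript rendering; only the fn rules used later for the metadata row need js: true.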

2

Analyze the HTML Structure

Hacker News uses a table-based layout. Each story consists of two table rows: one for the story link and another for the metadata (points, author, time, comments).

Data Point	Selector / Method	Type
Story Container	`tr.athing`	obj
Rank	`.rank`	text
Title	`.titleline > a`	text
URL	`.titleline > a`	link
Domain	`.sitestr`	text
Story ID	`id` attribute of `tr.athing`	attr
Points	`$parent.nextElementSibling` → `.score`	fn
Author	`$parent.nextElementSibling` → `.hnuser`	fn
Time	`$parent.nextElementSibling` → `.age`	fn
Comments	`$parent.nextElementSibling` → last `.subtext a`	fn

Pro Tip

HN stores story metadata (points, author, time, comments) in the <tr> that follows each story row. Child selectors are resolved inside tr.athing and cannot reach that sibling row, so we use fn type rules with $parent.nextElementSibling to read it. fn rules require js: true.
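To make the two-row layout concrete, here is a minimal Python sketch that pairs each story row with the row after it, which is the same step $parent.nextElementSibling performs inside an fn rule. The HTML fragment is a simplified, hand-written stand-in for HN's markup (an assumption, not a copy of the live page), and find_by_class, pair_rows, and text_of are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for HN's table markup (assumption).
SAMPLE = """
<table>
  <tr class="athing" id="1001">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr>
    <td class="subtext"><span class="score">342 points</span> by <a class="hnuser">alice</a></td>
  </tr>
  <tr class="athing" id="1002">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td>
  </tr>
  <tr>
    <td class="subtext"><a class="hnuser">bob</a></td>
  </tr>
</table>
"""


def find_by_class(element, class_name):
    """Return the first element (self or descendant) carrying class_name."""
    for node in element.iter():
        if class_name in node.get("class", "").split():
            return node
    return None


def pair_rows(table):
    """Yield (story_row, metadata_row): each tr.athing plus the row after it."""
    rows = list(table)
    for index, row in enumerate(rows):
        if "athing" in row.get("class", "").split():
            sibling = rows[index + 1] if index + 1 < len(rows) else None
            yield row, sibling


def text_of(node):
    """Return the node's text, or None when the node is missing."""
    return node.text if node is not None else None


stories = []
for story_row, meta_row in pair_rows(ET.fromstring(SAMPLE)):
    score = find_by_class(meta_row, "score") if meta_row is not None else None
    stories.append({
        "id": story_row.get("id"),
        "title": text_of(find_by_class(story_row, "titleline")[0]),
        "points": text_of(score),  # None when the row has no .score element
    })

print(stories)
```

The second story has no .score element in its metadata row, so its points value comes back as None, mirroring the `|| null` fallback in the fn rules.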

3

Build Extract Rules

We'll use an object-type extract rule to capture each story with all its nested data points:

JSON - Extract Rules

{
  "stories": {
    "selector": "tr.athing",
    "type": "obj",
    "multiple": true,
    "children": {
      "rank": { "selector": ".rank", "type": "text" },
      "title": { "selector": ".titleline > a", "type": "text" },
      "url": { "selector": ".titleline > a", "type": "link" },
      "domain": { "selector": ".sitestr", "type": "text" },
      "story_id": { "selector": "", "type": "attr", "attribute": "id" },
      "points": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"
      },
      "author": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.hnuser')?.textContent || null;"
      },
      "time": {
        "type": "fn",
        "fn": "return $parent.nextElementSibling?.querySelector('.age')?.textContent || null;"
      },
      "comments_text": {
        "type": "fn",
        "fn": "const links = Array.from($parent.nextElementSibling?.querySelectorAll('.subtext a') || []); return links.length ? links[links.length-1].textContent : null;"
      }
    }
  }
}
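The four fn rules above differ only in the selector they query on the sibling row. As a rough sketch, a small helper can generate them so the JavaScript snippet is written once. sibling_text_rule is a hypothetical name introduced here for illustration; it is not part of the Ujeebu API, and the generated rules are the same dictionaries shown in the JSON above.

```python
import json


def sibling_text_rule(css_selector):
    """Build an fn rule that reads text from the row after the story row."""
    return {
        "type": "fn",
        "fn": (
            "return $parent.nextElementSibling"
            f"?.querySelector('{css_selector}')?.textContent || null;"
        ),
    }


extract_rules = {
    "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": True,
        "children": {
            "title": {"selector": ".titleline > a", "type": "text"},
            "url": {"selector": ".titleline > a", "type": "link"},
            "points": sibling_text_rule(".score"),
            "author": sibling_text_rule(".hnuser"),
            "time": sibling_text_rule(".age"),
        },
    }
}

# Serialize to inspect the rules or paste them into a request body.
print(json.dumps(extract_rules, indent=2))
```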

4

Make the API Request

Send the extract rules to the Scrape API in a POST request. Because fn type rules are needed to read the sibling metadata row, set js: true. Some of the examples below send only a subset of the rules (title, URL, and points) to stay short.

cURL - Scrape Request

curl -X POST "https://api.ujeebu.com/scrape" \
  -H "ApiKey: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/",
    "js": true,
    "extract_rules": {
      "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": true,
        "children": {
          "title": {"selector": ".titleline > a", "type": "text"},
          "url": {"selector": ".titleline > a", "type": "link"},
          "points": {"type": "fn", "fn": "return $parent.nextElementSibling?.querySelector('"'"'.score'"'"')?.textContent || null;"}
        }
      }
    }
  }'

Python - requests

import requests

extract_rules = {
    "stories": {
        "selector": "tr.athing",
        "type": "obj",
        "multiple": True,
        "children": {
            "rank": {"selector": ".rank", "type": "text"},
            "title": {"selector": ".titleline > a", "type": "text"},
            "url": {"selector": ".titleline > a", "type": "link"},
            "domain": {"selector": ".sitestr", "type": "text"},
            "points": {
                "type": "fn",
                "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"
            },
            "author": {
                "type": "fn",
                "fn": "return $parent.nextElementSibling?.querySelector('.hnuser')?.textContent || null;"
            },
            "time": {
                "type": "fn",
                "fn": "return $parent.nextElementSibling?.querySelector('.age')?.textContent || null;"
            },
            "comments_text": {
                "type": "fn",
                "fn": "const links = Array.from($parent.nextElementSibling?.querySelectorAll('.subtext a') || []); return links.length ? links[links.length-1].textContent : null;"
            }
        }
    }
}

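response = requests.post(
    "https://api.ujeebu.com/scrape",
    headers={
        "ApiKey": "YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://news.ycombinator.com/",
        "js": True,
        "extract_rules": extract_rules
    },
    timeout=60,  # illustrative value; avoids hanging forever on a stalled request
)
response.raise_for_status()  # stop early on HTTP errors

stories = response.json()["result"]["stories"]
for story in stories[:5]:
    print(f"{story['rank']} {story['title']}")
    # fn rules return null for missing fields, so the key exists with value None;
    # use "or" rather than a .get() default to fall back to 'N/A'
    print(f"   {story.get('points') or 'N/A'} by {story.get('author') or 'N/A'}")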

Node.js - axios

const axios = require('axios');

const extractRules = {
  stories: {
    selector: 'tr.athing',
    type: 'obj',
    multiple: true,
    children: {
      rank: { selector: '.rank', type: 'text' },
      title: { selector: '.titleline > a', type: 'text' },
      url: { selector: '.titleline > a', type: 'link' },
      domain: { selector: '.sitestr', type: 'text' },
      points: {
        type: 'fn',
        fn: "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"
      },
      author: {
        type: 'fn',
        fn: "return $parent.nextElementSibling?.querySelector('.hnuser')?.textContent || null;"
      }
    }
  }
};

// Top-level await is not available in CommonJS modules,
// so the request runs inside an async function.
(async () => {
  const response = await axios.post('https://api.ujeebu.com/scrape', {
    url: 'https://news.ycombinator.com/',
    js: true,
    extract_rules: extractRules
  }, {
    headers: { 'ApiKey': 'YOUR_API_KEY' }
  });

  const stories = response.data.result.stories;
  stories.slice(0, 5).forEach(story => {
    console.log(`${story.rank} ${story.title}`);
    console.log(`   ${story.points || 'N/A'} by ${story.author || 'N/A'}`);
  });
})().catch(err => console.error(err.message));

Python - Ujeebu SDK

from ujeebu_python import UjeebuClient

uj = UjeebuClient(api_key="YOUR_API_KEY")

res = uj.scrape_with_rules(
    url="https://news.ycombinator.com/",
    extract_rules={
        "stories": {
            "selector": "tr.athing",
            "type": "obj",
            "multiple": True,
            "children": {
                "title": {
                    "selector": ".titleline > a",
                    "type": "text"
                },
                "url": {
                    "selector": ".titleline > a",
                    "type": "link"
                },
                "points": {
                    "type": "fn",
                    "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"
                }
            }
        }
    },
    params={
        "js": True
    },
)
print(res.json())

Node.js - Ujeebu SDK

const { UjeebuClient } = require('@ujeebu-org/ujeebu-sdk');

const client = new UjeebuClient("YOUR_API_KEY");

(async () => {
  const res = await client.scrape("https://news.ycombinator.com/", {
    js: true,
    extract_rules: {
      stories: {
        selector: "tr.athing",
        type: "obj",
        multiple: true,
        children: {
          title: { selector: ".titleline > a", type: "text" },
          url: { selector: ".titleline > a", type: "link" },
          points: { type: "fn", fn: "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;" }
        }
      }
    }
  });
  console.log(res.data);
})();

PHP - cURL

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://api.ujeebu.com/scrape');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'ApiKey: ' . 'YOUR_API_KEY',
    'Content-Type: application/json',
]);
$payload = [
    'url' => 'https://news.ycombinator.com/',
    'js' => true,
    'extract_rules' => [
        'stories' => [
            'selector' => 'tr.athing',
            'type' => 'obj',
            'multiple' => true,
            'children' => [
                'title' => [
                    'selector' => '.titleline > a',
                    'type' => 'text'
                ],
                'url' => [
                    'selector' => '.titleline > a',
                    'type' => 'link'
                ],
                'points' => [
                    'type' => 'fn',
                    'fn' => 'return $parent.nextElementSibling?.querySelector(\'.score\')?.textContent || null;'
                ]
            ]
        ]
    ]
];
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
$response = curl_exec($ch);
curl_close($ch);
echo $response;

Go - net/http

package main

import (
    "fmt"
    "io"
    "strings"
    "net/http"
)

func main() {
    url := "https://api.ujeebu.com/scrape"
    payload := strings.NewReader(`{"url":"https://news.ycombinator.com/","js":true,"extract_rules":{"stories":{"selector":"tr.athing","type":"obj","multiple":true,"children":{"title":{"selector":".titleline > a","type":"text"},"url":{"selector":".titleline > a","type":"link"},"points":{"type":"fn","fn":"return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;"}}}}}`)
    req, _ := http.NewRequest("POST", url, payload)
    req.Header.Set("ApiKey", "YOUR_API_KEY")
    req.Header.Set("Content-Type", "application/json")
    res, err := http.DefaultClient.Do(req)
    if err != nil { panic(err) }
    defer res.Body.Close()
    body, _ := io.ReadAll(res.Body)
    fmt.Println(string(body))
}

Go - Ujeebu SDK

package main

import (
    "fmt"
    "log"

    "github.com/ujeebu/ujeebu-go"
)

func main() {
    client, err := ujeebu.NewClient("YOUR_API_KEY")
    if err != nil {
        log.Fatal(err)
    }

    res, _, err := client.Scrape(ujeebu.ScrapeParams{
        URL: "https://news.ycombinator.com/",
        ExtractRules: map[string]any{
            "stories": map[string]any{
                "selector": "tr.athing",
                "type": "obj",
                "multiple": true,
                "children": map[string]any{
                    "title": map[string]any{
                        "selector": ".titleline > a",
                        "type": "text",
                    },
                    "url": map[string]any{
                        "selector": ".titleline > a",
                        "type": "link",
                    },
                    "points": map[string]any{
                        "type": "fn",
                        "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;",
                    },
                },
            },
        },
        JS: true,
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res)
}

5

Handle the Response

The API returns extracted data in JSON format under the result key:

JSON - API Response

{
  "success": true,
  "result": {
    "stories": [
      {
        "rank": "1.",
        "title": "Show HN: I built an open-source alternative to Notion",
        "url": "https://example.com/article",
        "domain": "example.com",
        "story_id": "38774523",
        "points": "342 points",
        "author": "username",
        "time": "3 hours ago",
        "comments_text": "156 comments"
      }
    ]
  }
}
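A short sketch of reading that response defensively in Python follows. It assumes failures are signalled by a falsy success value; the tutorial only shows a successful response, so treat that check as an assumption. extract_stories is a hypothetical helper name, and sample_response is a trimmed copy of the JSON above.

```python
def extract_stories(payload):
    """Return the stories list from a response shaped like the sample above."""
    # Assumption: a failed request carries a falsy "success" value.
    if not payload.get("success"):
        raise ValueError(f"request was not successful: {payload!r}")
    stories = (payload.get("result") or {}).get("stories")
    if not isinstance(stories, list):
        raise ValueError("response has no result.stories list")
    return stories


# Trimmed copy of the sample response shown above.
sample_response = {
    "success": True,
    "result": {
        "stories": [
            {
                "rank": "1.",
                "title": "Show HN: I built an open-source alternative to Notion",
                "url": "https://example.com/article",
                "points": "342 points",
                "author": "username",
            }
        ]
    },
}

for story in extract_stories(sample_response):
    # fn fields may be None, so fall back explicitly when printing.
    print(story["rank"], story["title"], "-", story.get("points") or "N/A")
```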

6

Best Practices

01

Use fn for Siblings

HN metadata is in a sibling row. Use fn type with $parent.nextElementSibling and js: true to access it.

Essential

02

Respect Rate Limits

Add delays between requests when scraping multiple pages. HN is a community resource.

Essential

03

Parse Data

Extract numeric values from strings like "342 points" to make data analysis easier.

Recommended
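One possible way to do this in Python is to pull the first run of digits out of each string. leading_int is a hypothetical helper name; it returns None for values with no digits, such as a missing field or a comments link that reads "discuss" instead of a count.

```python
import re


def leading_int(text):
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


story = {"rank": "1.", "points": "342 points", "comments_text": "156 comments"}

parsed = {
    "rank": leading_int(story["rank"]),
    "points": leading_int(story["points"]),
    "comments": leading_int(story["comments_text"]),
}
print(parsed)
```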

04

Scrape Multiple Pages

Append ?p=2, ?p=3, and so on to the front-page URL to reach the later pages of the listing.

Advanced
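The pagination and rate-limit advice can be combined in a small loop. This is a sketch under stated assumptions: scrape_page stands for whatever function you use to call the Scrape API for one URL and return a list of stories (it is not defined by this tutorial), and the 3-second default simply follows the delay suggested in the FAQ.

```python
import time

FRONT_PAGE = "https://news.ycombinator.com/"


def page_url(page):
    """Return the front-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return FRONT_PAGE if page == 1 else f"{FRONT_PAGE}?p={page}"


def scrape_pages(scrape_page, pages, delay_seconds=3.0, sleep=time.sleep):
    """Call scrape_page(url) for pages 1..pages, pausing between requests."""
    results = []
    for page in range(1, pages + 1):
        if page > 1:
            sleep(delay_seconds)  # be polite: wait before every follow-up request
        results.extend(scrape_page(page_url(page)))
    return results
```

The sleep parameter is injectable only so the loop can be exercised in tests without actually waiting.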

Ethical Scraping

Hacker News has an official API, documented at https://github.com/HackerNews/API, which is recommended for production use. This tutorial is for educational purposes and demonstrates web scraping techniques.
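For comparison, here is a minimal sketch of fetching the same kind of data from the official API using only the Python standard library. The topstories.json and item/<id>.json endpoints and the title, url, score, by, and descendants fields come from the official API documentation; the function names and the output field names are choices made for this example.

```python
import json
from urllib.request import urlopen

API_BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def top_stories(limit=5, fetch=fetch_json):
    """Return basic fields for the first `limit` top stories."""
    ids = fetch(f"{API_BASE}/topstories.json")[:limit]
    stories = []
    for story_id in ids:
        # An item can come back as null, hence the "or {}" fallback.
        item = fetch(f"{API_BASE}/item/{story_id}.json") or {}
        stories.append({
            "title": item.get("title"),
            "url": item.get("url"),
            "points": item.get("score"),
            "author": item.get("by"),
            "comments": item.get("descendants"),
        })
    return stories
```

Calling top_stories() with no arguments makes one request for the ID list plus one per story; the fetch parameter exists so a stub can be swapped in for testing.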


Frequently Asked Questions

How often can I scrape Hacker News?

We recommend adding 2-3 second delays between requests. For frequent monitoring, consider scraping every 5-15 minutes to balance freshness with server load.

Does Hacker News require JavaScript rendering?

HN itself is server-rendered. However, because story metadata (points, author, comments) lives in a sibling row, we use fn type extract rules with $parent.nextElementSibling, which requires js: true. If you only need titles and URLs, you can skip JS for faster results.
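The trade-off can be sketched as a request builder that turns js on only when fn rules are included. build_payload is a hypothetical helper name, and sending "js": False explicitly for the selector-only case is an assumption based on the statement above that JS can be skipped for titles and URLs.

```python
def build_payload(include_metadata):
    """Build a Scrape API request body, enabling js only when fn rules are used."""
    children = {
        "title": {"selector": ".titleline > a", "type": "text"},
        "url": {"selector": ".titleline > a", "type": "link"},
    }
    if include_metadata:
        # fn rules read the sibling row and require js: true.
        children["points"] = {
            "type": "fn",
            "fn": "return $parent.nextElementSibling?.querySelector('.score')?.textContent || null;",
        }
    return {
        "url": "https://news.ycombinator.com/",
        "js": include_metadata,
        "extract_rules": {
            "stories": {
                "selector": "tr.athing",
                "type": "obj",
                "multiple": True,
                "children": children,
            }
        },
    }


fast = build_payload(include_metadata=False)   # titles and URLs only
full = build_payload(include_metadata=True)    # adds points via an fn rule
print(fast["js"], full["js"])
```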

Is there an official Hacker News API?

Yes, Hacker News has an official Firebase API at github.com/HackerNews/API. This tutorial demonstrates web scraping techniques, but for production use, consider the official API.
