Case Study · Academic research

How Cline at the University of Illinois got research-grade article extraction across hundreds of millions of pages, in days, not quarters.

Cline is a research group at UIUC. They had a years-deep archive of HTML already pulled from the Internet Archive. What they didn't have was a way to turn it into clean, structured article content at the quality and scale their research demanded. They'd already burned weeks on open-source extractors that plateaued below their accuracy bar. Auto Extract slotted in as a hosted API call, cleared the bar on first pass, and replaced an internal parsing pipeline they didn't want to build or maintain.

Last updated May 2026

Context

The data was already on disk. The bottleneck was downstream of fetching: turning raw, heterogeneous archived HTML into clean, structured articles, at the scale of hundreds of millions of pages, without losing weeks tuning per-site rules.

Client: Cline · University of Illinois
Industry: Academic research
Workload: Hundreds of millions of archived pages
Approach: Auto Extract (HTML in → JSON out)

The challenge

Cline had the raw material already. The constraint was turning it into research-grade structured data. They'd done the legwork on the open-source side and hit a wall:

Open-source plateau
Readability, trafilatura and boilerpy3 each got them part of the way, but none cleared the accuracy floor on a heterogeneous, decades-spanning archive. Weeks of tuning, same wall.
Per-site rules don't scale
Pre-CMS HTML, table-layout sites, modern SPA dumps, all in the same dataset. Selector-based extractors broke as fast as they were written.
Build vs. buy
Owning a parsing pipeline meant standing up infra, tuning models, and babysitting failures. None of that was the research.
No fetch step
The HTML was already in storage. They needed an extractor, not a scraper, and not a vendor that bundled the two.
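
To make the per-site-rules problem concrete, here is a minimal sketch (not Cline's code, and not any of the extractors named above) of a selector-style rule built on Python's standard `html.parser`. The class name `article-body` and both sample pages are invented for illustration. The rule works on a modern template and returns nothing on a table-layout page carrying the same text.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collects text inside any element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())


def extract_by_class(html, target_class):
    """Apply one hypothetical per-site rule: 'the article lives in .<target_class>'."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(chunk for chunk in parser.chunks if chunk)


# Two invented pages with the same article text, written in different eras.
modern_page = '<div class="article-body"><p>Council approves budget.</p></div>'
legacy_page = (
    '<table><tr><td><font size="2">Council approves budget.</font></td></tr></table>'
)

modern_text = extract_by_class(modern_page, "article-body")
legacy_text = extract_by_class(legacy_page, "article-body")
```

Here `modern_text` holds the sentence and `legacy_text` is empty: the rule is correct for one template and silent on the other, which is the failure mode that multiplies across a decades-spanning archive.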

The solution

Auto Extract slotted in as a hosted API call. Cline posted their stored HTML, we returned structured JSON. No fetcher, no proxies, no browser, and no internal parsing pipeline to build, ship, or maintain. First integration was working end-to-end in days; quality cleared their bar on the initial sample without per-site tuning.
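
As a rough illustration of the "HTML in, JSON out" shape, the sketch below posts a stored HTML document to an extraction endpoint using only the Python standard library. The endpoint URL, the `html` and `url` payload fields, and the bearer-token header are all placeholders assumed for this example, not the documented Auto Extract API; check the actual API reference before adapting it.

```python
import json
import urllib.request

# Placeholder endpoint, assumed for illustration only.
EXTRACT_URL = "https://api.example.com/v1/extract"


def build_extract_request(html, source_url, api_key):
    """Build (but do not send) a POST carrying already-fetched HTML.

    The payload field names are assumptions made for this sketch.
    """
    payload = json.dumps({"html": html, "url": source_url}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_article(html, source_url, api_key, timeout=30):
    """Send the request and decode the JSON body of the response."""
    request = build_extract_request(html, source_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Note what is absent: no fetcher, proxy, or browser. The caller reads HTML from its own storage and hands it to `extract_article`, which matches the no-fetch-step constraint described earlier.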

  1. Days, not quarters. Hosted endpoint, drop-in client, production-ready as soon as the integration tests passed. No ML team, no infra, no proxy fleet.
  2. Cleared the accuracy floor on first pass. Auto Extract's learned heuristics and structural analysis handled what the open-source extractors couldn't, without per-site tuning.
  3. Schema-stable across eras. Same response shape for a 2003 newspaper article and a 2024 SPA-rendered story. Downstream code stayed simple.
  4. Predictable cost, flat throughput. Deterministic extractor: no per-page LLM bill, no surprise spend as the workload grew. Same per-page cost at page one and at page one hundred million.
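
A stable response shape is what keeps downstream code simple: one parser covers every era. The sketch below shows that idea with a single dataclass and one validation function. The field names (`title`, `text`, `published`) and the two sample records are assumptions for illustration, not the actual Auto Extract response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    """One record shape for every page, regardless of when it was published."""
    title: str
    text: str
    published: Optional[str] = None  # assumed optional: old pages may lack a date


REQUIRED_FIELDS = ("title", "text")


def parse_article(record: dict) -> Article:
    """Validate a decoded JSON record and convert it to an Article."""
    missing = [key for key in REQUIRED_FIELDS if not isinstance(record.get(key), str)]
    if missing:
        raise ValueError(f"missing or non-string fields: {missing}")
    return Article(record["title"], record["text"], record.get("published"))


# Invented records standing in for a 2003 newspaper page and a 2024 SPA story.
records = [
    {"title": "Council approves budget", "text": "The council voted on Tuesday."},
    {"title": "Transit plan unveiled", "text": "A new plan.", "published": "2024-03-01"},
]
articles = [parse_article(record) for record in records]
```

Both records pass through the same code path; there is no branch on era, template, or site.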

The results

In steady state:

To production: Days
Pages structured: Hundreds of millions
Internal dev work avoided: Weeks
Schemas across the archive: 1

Cline got the clean dataset they needed without rebuilding a parsing pipeline they didn't want to own. The research team focused on the research; we kept the extractor honest. No proxy maintenance, no per-site selectors, no LLM bill, just an endpoint that returned clean JSON.

The trade they actually made
Build a parsing stack that clears the accuracy bar on a decades-spanning HTML archive, or pay per call and ship the research instead. They chose the second. Weeks of engineering went into the research, not into babysitting an extractor.

Got HTML you can't clean?

If you've got a backlog of HTML (or PDF, or messy markup) and you've already exhausted the open-source extractors, talk to us. We'll point Auto Extract at a representative sample, send back the JSON, and you can decide whether the quality clears your bar before committing to anything.