How Cline at the University of Illinois got research-grade article extraction across hundreds of millions of pages, in days, not quarters.

Cline is a research group at UIUC. They had a years-deep archive of HTML already pulled from the Internet Archive. What they didn't have was a way to turn it into clean, structured article content at the quality and scale their research demanded. They'd already burned weeks on open-source extractors that plateaued below their accuracy bar. Auto Extract slotted in as a hosted API call, cleared the bar on first pass, and replaced an internal parsing pipeline they didn't want to build or maintain.

Last updated May 2026

Context

The data was already on disk. The bottleneck was downstream of fetching: turning raw, heterogeneous archived HTML into clean, structured articles across hundreds of millions of pages, without losing weeks to tuning per-site rules.

Client

Cline · University of Illinois

Industry

Academic research

Workload

Hundreds of millions of archived pages

Approach

Auto Extract (HTML in → JSON out)

The challenge

Cline had the raw material already. The constraint was turning it into research-grade structured data. They'd done the legwork on the open-source side and hit a wall:

Open-source plateau

Readability, trafilatura and boilerpy3 each got them part of the way, but none cleared the accuracy floor on a heterogeneous, decades-spanning archive. Weeks of tuning, same wall.
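
One common way to make an accuracy floor like this concrete is token-level F1 against hand-labeled reference text. The sketch below is illustrative only: the case study does not say how Cline scored extractors, and the `token_f1` helper, the sample strings, and the 0.9 floor are all hypothetical.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extractor output and a hand-labeled reference."""
    ext_tokens = extracted.lower().split()
    ref_tokens = reference.lower().split()
    if not ext_tokens or not ref_tokens:
        # Both empty counts as a perfect match; only one empty is a total miss.
        return 1.0 if ext_tokens == ref_tokens else 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both texts.
    overlap = sum((Counter(ext_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ext_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


ACCURACY_FLOOR = 0.9  # hypothetical threshold, not a figure from the case study

reference = "Council approves new library budget"
# Extractor output that kept navigation and footer boilerplate.
noisy = "Home News Sports Council approves new library budget Subscribe"

print(token_f1(reference, reference) >= ACCURACY_FLOOR)
print(token_f1(noisy, reference) >= ACCURACY_FLOOR)
```

Leftover boilerplate lowers precision even when recall is perfect, which is one way an extractor can look close to correct and still sit below the bar.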

Per-site rules don't scale

Pre-CMS HTML, table-layout sites, modern SPA dumps, all in the same dataset. Selector-based extractors broke as fast as they were written.

Build vs. buy

Owning a parsing pipeline meant standing up infra, tuning models, and babysitting failures. None of that was the research.

No fetch step

The HTML was already in storage. They needed an extractor, not a scraper, and not a vendor that bundled the two.

The solution

Auto Extract slotted in as a hosted API call. Cline posted their stored HTML, and we returned structured JSON. No fetcher, no proxies, no browser, and no internal parsing pipeline to build, ship, or maintain. The first integration was working end-to-end in days, and quality cleared their bar on the initial sample without per-site tuning.

Days, not quarters. Hosted endpoint, drop-in client, production-ready as soon as the integration tests passed. No ML team, no infra, no proxy fleet.
Cleared the accuracy floor on first pass. Auto Extract's learned heuristics and structural analysis handled what the open-source extractors couldn't, without per-site tuning.
Schema-stable across eras. Same response shape for a 2003 newspaper article and a 2024 SPA-rendered story. Downstream code stayed simple.
Predictable cost, flat throughput. Deterministic extractor: no per-page LLM bill, no surprise spend as the workload grew. Same per-page cost at page one and page one hundred million.
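
The HTML-in, JSON-out flow and the schema-stability point can be sketched roughly as follows. Everything specific here is an assumption for illustration: the endpoint URL, the `html` and `url` request fields, the bearer-token header, and the response field names are hypothetical and not taken from the Auto Extract docs. The request is built but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and field names; check the real API docs before use.
ENDPOINT = "https://api.example.com/v1/extract"
EXPECTED_FIELDS = {"title", "text", "published_at"}


def build_request(html: str, source_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying stored HTML to the extractor."""
    body = json.dumps({"html": html, "url": source_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def has_stable_shape(article: dict) -> bool:
    """True if the response carries every expected field, whatever the page's era."""
    return EXPECTED_FIELDS.issubset(article)


req = build_request(
    "<html><body><p>Hello</p></body></html>",
    "http://example.com/2003/story.html",
    "YOUR_KEY",
)
print(req.get_method(), req.full_url)

# Stand-ins for responses to a 2003 table-layout page and a 2024 SPA dump.
old_page = {"title": "Flood", "text": "...", "published_at": "2003-05-01"}
new_page = {"title": "Launch", "text": "...", "published_at": None}
print(has_stable_shape(old_page) and has_stable_shape(new_page))
```

Because the HTML is read from storage and posted directly, nothing in this flow fetches a page. A shape check like `has_stable_shape` is the kind of guard that lets downstream code treat every era of the archive the same way.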

The results

In steady state:

Days

To production

Hundreds of millions

Pages structured

Weeks

Of internal dev work avoided

Schema across the archive

Cline got the clean dataset they needed without rebuilding a parsing pipeline they didn't want to own. The research team focused on the research; we kept the extractor honest. No proxy maintenance, no per-site selectors, no LLM bill, just an endpoint that returned clean JSON.

The trade they actually made

Build a parsing stack that clears the accuracy bar on a decades-spanning HTML archive, or pay per call and ship the research instead. They chose the second. Weeks of engineering went into the research, not into babysitting an extractor.

Got HTML you can't clean?

If you've got a backlog of HTML (or PDF, or messy markup) and you've already exhausted the open-source extractors, talk to us. We'll point Auto Extract at a representative sample, send back the JSON, and you can decide whether the quality clears your bar before committing to anything.
