Press Archiver
For digitisation professionals

Turn historical
newspapers into
structured data.

AI-powered layout analysis, OCR, and article detection — built for archive scale. From raw scan to validated METS/ALTO export in minutes.

[Image: a historical newspaper page with AI-detected layout regions overlaid in color]
§ The pipeline

From a folder of scans
to a folder of structured XML.

01

Layout

A segmentation model identifies and classifies every region: headlines, paragraphs, illustrations, tables.

02

OCR

Per-region OCR with your choice of engine. Cloud Vision for accuracy, local models for sovereignty.

03

Articles

AI groups regions into logical articles — across columns, around advertisements, in the order a reader would follow.

04

Export

Schema-validated PAGE XML or METS/ALTO. Reproducible, round-trippable, ready for downstream pipelines.

§ Capabilities

Six things it does,
each one carefully.

§ I Layout analysis

Intelligent region detection,
trained on the press.

Automatically detect and classify the regions typical of newspaper layouts: headlines, body paragraphs, captions, advertisements, tables, and more. Powered by segmentation models trained on historical press material, not generic document AI.

[Diagram: page regions labelled H, B, H, I, B, AD, C, B, T]
§ II OCR engines

Cloud or local —
you decide, per project.

Run OCR with Google Cloud Vision for maximum accuracy, or keep everything on-premise with local models. Switch engines per project without changing your workflow.

Cloud
  • Best for difficult scripts and degraded scans
  • 200+ languages supported by Google
  • Region-pinned data residency (EU / US / APAC)
  • Per-page billing, passed through at cost
On-premise
  • Air-gapped: zero data leaves your infrastructure
  • Bundled Tesseract 5 + Kraken / Calamari fine-tunes
  • Optional CUDA acceleration, ships as a Docker image
  • Per-project model selection from operator UI
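
One way to picture per-project engine selection is as a small settings record that is checked before a job starts. The sketch below is illustrative only: `OcrConfig`, its fields, and the engine identifiers are hypothetical names chosen for this example, not Press Archiver's actual configuration schema. It encodes just the constraints listed above (residency regions for the cloud engine, CUDA for local engines).

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical engine identifiers, loosely based on the engines named above.
Engine = Literal["cloud-vision", "tesseract", "kraken", "calamari"]
LOCAL_ENGINES = {"tesseract", "kraken", "calamari"}
RESIDENCY_REGIONS = {"EU", "US", "APAC"}


@dataclass(frozen=True)
class OcrConfig:
    """Hypothetical per-project OCR settings (not the product's real schema)."""
    engine: Engine
    region: Optional[str] = None  # data-residency region, cloud engine only
    use_cuda: bool = False        # optional GPU acceleration, local engines only

    def validate(self) -> None:
        """Raise ValueError if the options do not fit the chosen engine."""
        if self.engine == "cloud-vision":
            if self.region not in RESIDENCY_REGIONS:
                raise ValueError("cloud engine needs a region: EU, US, or APAC")
            if self.use_cuda:
                raise ValueError("CUDA acceleration applies to local engines only")
        elif self.engine in LOCAL_ENGINES:
            if self.region is not None:
                raise ValueError("residency region applies to the cloud engine only")
        else:
            raise ValueError(f"unknown engine: {self.engine!r}")

    @property
    def leaves_premises(self) -> bool:
        """True when page images are sent to an external service."""
        return self.engine == "cloud-vision"


# Two example projects (made-up names), each with its own engine.
projects = {
    "gazette-1890s": OcrConfig(engine="cloud-vision", region="EU"),
    "restricted-collection": OcrConfig(engine="kraken", use_cuda=True),
}
for cfg in projects.values():
    cfg.validate()
```

Keeping the choice in one record per project means the rest of a workflow can stay the same whichever engine is selected.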
§ III Manual correction

When automation isn't enough,
fix it in two keystrokes.

Correct region types and reading order at speed. The interface is keyboard-first, designed for operators who annotate hundreds of pages per day.

OPERATOR — keyboard layer
H / B / C / A   Set region type
↑ ↓             Move reading-order pointer
⌘ G             Group into article
⌥ drag          Merge into neighbour
Space           Approve & advance
↳ DESIGNED FOR HIGH-VOLUME OPERATORS
§ IV Article detection

Group text into articles —
even across columns.

Automatically group text regions into logical articles, even when they span multiple columns or wrap around advertisements. The AI understands newspaper structure, not just bounding boxes.

[Diagram: region graph → article grouping. Regions H B B H B C H B are grouped into Art. A, Art. B, and Art. C]
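
To make the idea of grouping concrete, here is a deliberately naive sketch: walk the regions in reading order and open a new article at every headline. This is not the method Press Archiver uses, which is described above as a model that also handles column spans and advertisements. The `Region` type and `group_by_headline` function are hypothetical, and the input simply reuses the region sequence shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    id: str
    kind: str  # "H" headline, "B" body, "C" caption, "AD" advertisement


def group_by_headline(regions: List[Region]) -> List[List[Region]]:
    """Split a reading-ordered region list into articles, one per headline."""
    articles: List[List[Region]] = []
    for region in regions:
        if region.kind == "AD":
            continue  # advertisements are not assigned to any article
        if region.kind == "H" or not articles:
            articles.append([])  # a headline (or the very first region) opens an article
        articles[-1].append(region)
    return articles


kinds = ["H", "B", "B", "H", "B", "C", "H", "B"]
regions = [Region(f"r{i}", kind) for i, kind in enumerate(kinds, start=1)]
articles = group_by_headline(regions)
labels = [" ".join(r.kind for r in article) for article in articles]
print(labels)  # ['H B B', 'H B C', 'H B']
```

A rule this simple breaks as soon as an article continues in another column after an unrelated headline, which is the case a structure-aware model is meant to handle.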
§ V Region vocabulary

A full taxonomy,
aligned to archival practice.

Label every element precisely: paragraphs, headlines, subheadings, ads, illustrations, tables, marginalia. The region taxonomy follows real archival workflows.

Headline · Subheading · Body · Caption · Advertisement · Illustration · Table · Marginalia · Page number · Masthead · Stock listing · Footer rule
§ VI Standards-compliant export

PAGE XML or METS/ALTO,
out of the box.

Schema-validated exports drop straight into your ingest pipeline. Reproducible, round-trippable, and aligned to the standards your institution already uses.

<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="6400" HEIGHT="8200">
      <PrintSpace>
        <ComposedBlock ID="art_A" TYPE="Article">
          <TextBlock TYPE="heading">
            <TextLine>
              <String CONTENT="The"  WC="0.99"/>
              <String CONTENT="Great" WC="0.98"/>
            </TextLine>
          </TextBlock>
          <TextBlock TYPE="paragraph"/>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>

<PcGts xmlns="http://schema.primaresearch.org/PAGE/...">
  <Page imageFilename="page.jp2">
    <TextRegion id="r1" type="heading">
      <Coords points="320,420 1840,420 1840,640 320,640"/>
      <TextLine>
        <TextEquiv conf="0.987">
          <Unicode>The Great Northern Line</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r2" type="paragraph" article="A">
      <Coords points="..."/>
    </TextRegion>
  </Page>
</PcGts>
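
On the consuming side, an export like the ALTO sample above can be read with any namespace-aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to list the words whose word confidence (`WC`) falls below a threshold. The 0.985 threshold and the `low_confidence_words` helper are illustrative assumptions, and the embedded document is a trimmed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the ALTO sample above.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

SAMPLE = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="6400" HEIGHT="8200">
      <PrintSpace>
        <ComposedBlock ID="art_A" TYPE="Article">
          <TextBlock TYPE="heading">
            <TextLine>
              <String CONTENT="The" WC="0.99"/>
              <String CONTENT="Great" WC="0.98"/>
            </TextLine>
          </TextBlock>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def low_confidence_words(alto_xml: str, threshold: float):
    """Return (article ID, word, confidence) for words with WC below threshold."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for block in root.iterfind(".//alto:ComposedBlock", ALTO_NS):
        for word in block.iterfind(".//alto:String", ALTO_NS):
            # Assumption: a missing WC attribute is treated as fully confident.
            wc = float(word.get("WC", "1.0"))
            if wc < threshold:
                flagged.append((block.get("ID"), word.get("CONTENT"), wc))
    return flagged


flagged = low_confidence_words(SAMPLE, threshold=0.985)
print(flagged)  # [('art_A', 'Great', 0.98)]
```

A list like this could feed a review queue, so that operators only revisit the words the OCR engine was least sure about.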
§ See it in action

Watch Press Archiver process
a real newspaper scan,
end to end.

From layout detection to structured export — in under five minutes, on real archival material.