AI-powered layout analysis, OCR, and article detection — built for archive scale. From raw scan to validated METS/ALTO export in minutes.
A segmentation model identifies and classifies every region: headlines, paragraphs, illustrations, tables.
Per-region OCR with your choice of engine. Cloud Vision for accuracy, local models for sovereignty.
AI groups regions into logical articles — across columns, around advertisements, in the order a reader would follow.
Schema-validated PAGE XML or METS/ALTO. Reproducible, round-trippable, ready for downstream pipelines.
Automatically detect and classify regions typical of newspaper layouts: headlines, body paragraphs, captions, advertisements, tables, and more. Powered by segmentation models trained on historical press material, not generic document AI.
Run OCR with Google Cloud Vision for maximum accuracy, or keep everything on-premises with local models. Switch engines per project without changing your workflow.
Correct region types and reading order at speed. The interface is keyboard-first, designed for operators who annotate hundreds of pages per day.
Automatically group text regions into logical articles, even when they span multiple columns or wrap around advertisements. The AI understands newspaper structure, not just bounding boxes.
Label every element precisely: paragraphs, headlines, subheadings, ads, illustrations, tables, marginalia. The region taxonomy follows real archival workflows.
Schema-validated exports drop straight into your ingest pipeline. Reproducible, round-trippable, and aligned to the standards your institution already uses.
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="6400" HEIGHT="8200">
      <PrintSpace>
        <ComposedBlock ID="art_A" TYPE="Article">
          <TextBlock TYPE="heading">
            <TextLine>
              <String CONTENT="The" WC="0.99"/>
              <String CONTENT="Great" WC="0.98"/>
            </TextLine>
          </TextBlock>
          <TextBlock TYPE="paragraph"/>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
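As a rough illustration of how a downstream pipeline might consume an export like the sample above, here is a minimal sketch using only Python's standard library. It assumes the element and attribute names exactly as they appear in the sample (including TYPE on TextBlock). The function name read_articles and the shape of the returned dictionaries are hypothetical choices made for this example, not part of the product.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample export above.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# The sample export, reproduced so the sketch is self-contained.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="6400" HEIGHT="8200">
      <PrintSpace>
        <ComposedBlock ID="art_A" TYPE="Article">
          <TextBlock TYPE="heading">
            <TextLine>
              <String CONTENT="The" WC="0.99"/>
              <String CONTENT="Great" WC="0.98"/>
            </TextLine>
          </TextBlock>
          <TextBlock TYPE="paragraph"/>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_articles(alto_xml):
    """Hypothetical helper: collect each article's regions, text, and lowest word confidence."""
    root = ET.fromstring(alto_xml)
    articles = []
    # Articles are assumed to be ComposedBlock elements with TYPE="Article", as in the sample.
    for block in root.iterfind(".//alto:ComposedBlock[@TYPE='Article']", ALTO_NS):
        regions = []
        for text_block in block.iterfind("alto:TextBlock", ALTO_NS):
            strings = text_block.findall(".//alto:String", ALTO_NS)
            words = [s.get("CONTENT", "") for s in strings]
            # WC is the per-word confidence; skip words that do not carry it.
            confidences = [float(s.get("WC")) for s in strings if s.get("WC") is not None]
            regions.append({
                "type": text_block.get("TYPE"),
                "text": " ".join(words),
                "min_wc": min(confidences) if confidences else None,
            })
        articles.append({"id": block.get("ID"), "regions": regions})
    return articles


articles = read_articles(SAMPLE)
for article in articles:
    print(article["id"], [(r["type"], r["text"]) for r in article["regions"]])
```

Tracking the lowest word confidence per region is one simple way to flag regions for manual review; the threshold to use would depend on your material.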
From layout detection to structured export — in under five minutes, on real archival material.
Tell us a bit about your archive — we'll get back within one business day.
We'll be in touch within one business day. Feel free to close this window.