
Requirements on handling METS/PAGE

OCR-D has decided to base its data exchange format on top of METS.

For layout and text recognition results, the primary exchange format is PAGE.

This document defines a set of conventions and mechanisms for using METS.

Conventions for PAGE are outlined in a separate document.

Pixel density of images must be explicit and high enough

The pixel density is the number of pixels that represent one unit of measure of the scanned object. It is typically measured in pixels per inch (PPI, a.k.a. DPI).

The original input images MUST have >= 150 ppi.

Every processing step that generates new images and changes their dimensions MUST make sure to adapt the density explicitly when serialising the image.

$> exiftool input.tif |grep 'X Resolution'
"300"

# WRONG (ppi unchanged)
$> convert input.tif -resize 50% output.tif

# RIGHT:
$> convert input.tif -resize 50% -density 150 -units PixelsPerInch output.tif

$> exiftool output.tif |grep 'X Resolution'
"150"

However, since technical metadata about pixel density is so often lost in conversion or inaccurate, processors should assume 300 ppi for images with missing or suspiciously low pixel density metadata.
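The two density rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the specification: the function names are hypothetical, and the cutoff for "suspiciously low" (72 ppi here) is an assumption, since the text does not give a number.

```python
DEFAULT_PPI = 300        # fallback density stated in the text above
MIN_PLAUSIBLE_PPI = 72   # ASSUMPTION: cutoff for "suspiciously low"; not defined by the text


def scaled_ppi(original_ppi: float, scale: float) -> float:
    """Density to record after resizing both dimensions by `scale` (0.5 means 50%)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return original_ppi * scale


def effective_ppi(metadata_ppi=None) -> float:
    """Density a processor works with when metadata may be missing or implausible."""
    if metadata_ppi is None or metadata_ppi < MIN_PLAUSIBLE_PPI:
        return DEFAULT_PPI
    return metadata_ppi
```

For instance, halving a 300 ppi image gives `scaled_ppi(300, 0.5)`, which matches the 150 ppi recorded in the shell example.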

No multi-page images

Image formats like TIFF support encoding multiple images in a single file.

Data providers MUST provide single-image TIFF files.

OCR-D processors MUST raise an exception if they encounter multi-image TIFF files.
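One way a processor could detect multi-image files is to walk the chain of image file directories (IFDs) in the TIFF header. The sketch below uses only the standard library and handles classic TIFF only (not BigTIFF); the function and exception names are hypothetical, and a real processor would more likely rely on its imaging library.

```python
import struct


class MultiImageTiffError(ValueError):
    """Hypothetical exception for TIFF files that contain more than one image."""


def count_tiff_images(data: bytes) -> int:
    """Count the chained IFDs (one per image) in a classic TIFF byte string."""
    if data[:2] == b"II":
        endian = "<"   # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    count = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("cyclic IFD chain")
        seen.add(offset)
        # Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte offset of the next IFD.
        (n_entries,) = struct.unpack_from(endian + "H", data, offset)
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * n_entries)
        count += 1
    return count


def require_single_image(data: bytes) -> None:
    """Raise if the TIFF data encodes more than one image."""
    if count_tiff_images(data) > 1:
        raise MultiImageTiffError("multi-image TIFF files are not allowed")
```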

Unique ID for the document processed

METS provided to the MP must be uniquely addressable within the global library community.

For this purpose, the METS file MUST contain a mods:identifier that must contain a globally unique identifier for the document and have a type attribute with a value of, in order of preference:

File Group USE syntax

All mets:fileGrp MUST have a unique USE attribute that hints at the provenance of the files and must be a valid xsd:ID.

It SHOULD have the structure

ID := "OCR-D-" + PREFIX? + WORKFLOW_STEP + ("-" + PROCESSOR)?
PREFIX := ("" | "GT-")
WORKFLOW_STEP := ("IMG" | "SEG" | "OCR" | "COR")
PROCESSOR := [A-Z0-9\-]{3,}

PREFIX can be GT- to indicate that these files are ground truth.

WORKFLOW_STEP can be one of:

PROCESSOR should be a mnemonic of the processor or result type in a terse, all-caps form, such as the name of the tool (KRAKEN), the organisation (CIS), the type of manipulation (CROP), or a combination of manipulation type and tool name, starting with the type of manipulation (BIN-KRAKEN).
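The recommended structure translates directly into a regular expression. The following Python sketch is one possible rendering of the grammar exactly as written above; the function name is hypothetical. Because the structure is only a SHOULD, a checker like this would reasonably warn rather than reject. Note that USE values with a step outside the four listed in WORKFLOW_STEP (for example `PRE`) do not match the grammar as written.

```python
import re

# "OCR-D-" + optional "GT-" + workflow step + optional "-" + processor mnemonic
FILEGRP_USE = re.compile(r"OCR-D-(?:GT-)?(?:IMG|SEG|OCR|COR)(?:-[A-Z0-9\-]{3,})?")


def follows_recommended_use(use: str) -> bool:
    """True if `use` matches the recommended fileGrp USE structure in full."""
    return FILEGRP_USE.fullmatch(use) is not None
```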

Examples

<mets:fileGrp USE> Type of use for OCR-D
<mets:fileGrp USE="OCR-D-IMG"> The unmanipulated source images
<mets:fileGrp USE="OCR-D-PRE-BIN"> Binarization preprocessing
<mets:fileGrp USE="OCR-D-PRE-CROP"> Cropping preprocessing
<mets:fileGrp USE="OCR-D-PRE-DESKEW"> Deskewing preprocessing
<mets:fileGrp USE="OCR-D-PRE-DESPECK"> Despeckling preprocessing
<mets:fileGrp USE="OCR-D-PRE-DEWARP"> Dewarping preprocessing
<mets:fileGrp USE="OCR-D-SEG-REGION"> Region segmentation
<mets:fileGrp USE="OCR-D-SEG-LINE"> Line segmentation
<mets:fileGrp USE="OCR-D-SEG-WORD"> Word segmentation
<mets:fileGrp USE="OCR-D-SEG-GLYPH"> Glyph segmentation
<mets:fileGrp USE="OCR-D-OCR-TESS"> Tesseract OCR
<mets:fileGrp USE="OCR-D-OCR-OCRO"> Ocropus OCR
<mets:fileGrp USE="OCR-D-COR-CIS"> CIS post-correction
<mets:fileGrp USE="OCR-D-COR-ASV"> ASV post-correction
<mets:fileGrp USE="OCR-D-GT-SEG-REGION"> Region segmentation ground truth
<mets:fileGrp USE="OCR-D-GT-SEG-LINE"> Line segmentation ground truth
<mets:fileGrp USE="OCR-D-GT-SEG-WORD"> Word segmentation ground truth
<mets:fileGrp USE="OCR-D-GT-SEG-GLYPH"> Glyph segmentation ground truth

File ID syntax

Each mets:file must have an ID attribute. The ID attribute of a mets:file SHOULD be the USE of the containing mets:fileGrp, followed by an underscore and a number zero-padded to four digits. The ID MUST be unique inside the METS file.

FILEID := ID + "_" + [0-9]{4}
ID := FILEGRP + (".IMG")?

Examples

<mets:file ID> ID of the file for OCR-D
<mets:file ID="OCR-D-IMG_0001"> The unmanipulated source image
<mets:file ID="OCR-D-PRE-BIN_0001"> PAGE encapsulating the result from binarization
<mets:file ID="OCR-D-PRE-BIN.IMG_0001"> Black-and-white image
<mets:file ID="OCR-D-PRE-CROP_0001"> PAGE encapsulating the result from (binarization and) cropping
<mets:file ID="OCR-D-PRE-CROP.IMG_0001"> Cropped black-and-white image
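A small helper can generate IDs that follow this syntax. The sketch below is illustrative only; the function name and its `image` flag (which appends the optional `.IMG` part) are assumptions, not part of the specification.

```python
def make_file_id(file_grp: str, number: int, image: bool = False) -> str:
    """Build a mets:file ID from the fileGrp USE and a page number.

    `image=True` adds the ".IMG" part used for image files that accompany a PAGE file.
    """
    if not 0 <= number <= 9999:
        raise ValueError("number must fit into four digits")
    suffix = ".IMG" if image else ""
    return f"{file_grp}{suffix}_{number:04d}"
```

Uniqueness within the METS file still has to be checked separately, for example by collecting all generated IDs in a set.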

Grouping files by page

Every METS file MUST have exactly one physical map that contains a single mets:div[@TYPE="physSequence"] which in turn must contain a mets:div[@TYPE="page"] for every page in the work.

These mets:div[@TYPE="page"] can contain an arbitrary number of mets:fptr pointers to mets:file elements to signify that all the files within a div are encodings of the same page.

Example

<mets:fileGrp USE="OCR-D-IMG">
    <mets:file ID="OCR-D-IMG_0001">...</mets:file>
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-OCR">
    <mets:file ID="OCR-D-OCR_0001">...</mets:file>
</mets:fileGrp>
<mets:structMap TYPE="PHYSICAL">
  <mets:div ID="PHYS_0000" TYPE="physSequence">
    <mets:div ID="PHYS_0001" TYPE="page">
      <mets:fptr FILEID="OCR-D-IMG_0001"/>
      <mets:fptr FILEID="OCR-D-OCR_0001"/>
    </mets:div>
  </mets:div>
</mets:structMap>
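To show how a consumer might use this structure, the following Python sketch reads the physical map and returns the file IDs grouped by page. It uses the standard library's ElementTree; the function name and the sample document are illustrative, and the sample adds the METS namespace declaration that the excerpt above omits.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page">
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
        <mets:fptr FILEID="OCR-D-OCR_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def files_by_page(mets_xml: str) -> dict:
    """Map each page div ID to the list of FILEIDs its mets:fptr elements point to."""
    root = ET.fromstring(mets_xml)
    maps = root.findall("mets:structMap[@TYPE='PHYSICAL']", NS)
    if len(maps) != 1:
        raise ValueError("expected exactly one physical structMap")
    sequences = maps[0].findall("mets:div[@TYPE='physSequence']", NS)
    if len(sequences) != 1:
        raise ValueError("expected a single physSequence div")
    pages = {}
    for page in sequences[0].findall("mets:div[@TYPE='page']", NS):
        pages[page.get("ID")] = [fptr.get("FILEID") for fptr in page.findall("mets:fptr", NS)]
    return pages
```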

Images and coordinates

Coordinates are always absolute, i.e. relative to the extent defined in the imageWidth/imageHeight attributes of the nearest <pc:Page>.

When a processor wants to access the image of a layout element like a TextRegion or TextLine, the algorithm should be:

Media Type for PAGE XML

Every <mets:file> representing a PAGE document MUST have its MIMETYPE attribute set to application/vnd.prima.page+xml.

Always use URL or relative filenames

Always use URLs, except for files located in the directory of the METS file or any of its subdirectories, which are referenced by relative filename.
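A rough check for this rule could look like the Python sketch below. It is one interpretation under stated assumptions: any reference with a URL scheme counts as a URL, absolute filesystem paths are rejected, and relative paths are accepted only if they do not climb out of the METS directory. The function name and the `example.org` URL are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def is_valid_href(href: str) -> bool:
    """Accept URLs and relative paths that stay inside the METS file's directory."""
    if not href:
        return False
    if urlparse(href).scheme:
        return True                      # ASSUMPTION: anything with a scheme is a URL
    if href.startswith("/"):
        return False                     # absolute filesystem path
    normalized = posixpath.normpath(href)
    return not (normalized == ".." or normalized.startswith("../"))
```

With this reading, `is_valid_href("foo.tif")` and `is_valid_href("https://example.org/foo.tif")` are accepted, while `is_valid_href("/tmp/foo/ws1/foo.tif")` and `is_valid_href("../foo.tif")` are rejected.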

Example

/tmp/foo/ws1
├── mets.xml
├── foo.tif
└── foo.xml

Valid mets:FLocat/@xlink:href in /tmp/foo/ws1/mets.xml:

Invalid mets:FLocat/@xlink:href in /tmp/foo/ws1/mets.xml:

If in PAGE then in METS

All URLs used in the imageFilename and filename attributes of <pc:Page>/<pc:AlternativeImage> MUST be referenced in a mets:fileGrp as the mets:FLocat/@xlink:href of a mets:file. This MUST be the same file group as the PAGE-XML that was the result of the processing step that produced the <pc:AlternativeImage>. In other words: <pc:AlternativeImage> should be written to the same mets:fileGrp as its source PAGE-XML, which in most implementations will mean the same folder.
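The first requirement can be checked mechanically by comparing the image references in a PAGE file against the hrefs in the METS. The Python sketch below is a minimal illustration that does not check the "same file group" requirement. Function names, sample documents and file paths are hypothetical; the namespace wildcard `{*}` requires Python 3.8 or later and is used because the PAGE namespace differs between schema versions.

```python
import xml.etree.ElementTree as ET

METS_FLOCAT = "{http://www.loc.gov/METS/}FLocat"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

SAMPLE_PAGE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="OCR-D-IMG/0001.tif" imageWidth="100" imageHeight="200">
    <pc:AlternativeImage filename="OCR-D-PRE-BIN/0001.png"/>
  </pc:Page>
</pc:PcGts>"""

SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def page_image_refs(page_xml: str) -> set:
    """Collect imageFilename and AlternativeImage filename values from a PAGE document."""
    root = ET.fromstring(page_xml)
    refs = set()
    for page in root.findall(".//{*}Page"):
        if page.get("imageFilename"):
            refs.add(page.get("imageFilename"))
    for alt in root.findall(".//{*}AlternativeImage"):
        if alt.get("filename"):
            refs.add(alt.get("filename"))
    return refs


def mets_hrefs(mets_xml: str) -> set:
    """Collect every mets:FLocat/@xlink:href in a METS document."""
    root = ET.fromstring(mets_xml)
    return {flocat.get(XLINK_HREF) for flocat in root.iter(METS_FLOCAT)}


def unreferenced_images(page_xml: str, mets_xml: str) -> set:
    """Image references that appear in PAGE but are missing from METS."""
    return page_image_refs(page_xml) - mets_hrefs(mets_xml)
```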

Recording processing information in METS

Processors should add information to the METS metadata header to indicate that they changed the METS. This information is mainly for human consumption to get an overview of the software agents involved in the METS file’s creation. More detailed or machine-actionable provenance information is outside the scope of the processor.

To add agent information, a processor must:

1) Locate the first mets:metsHdr M.
2) Add to M a new mets:agent A with these attributes

Example:

<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="preprocessing/optimization/binarization">
  <mets:name>ocrd_tesserocr v0.1.2</mets:name>
</mets:agent>
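The two steps can be sketched with Python's ElementTree as follows. The attribute values are taken from the example above; the function name and its parameters are hypothetical. The new agent is simply appended to the header, so a document whose header has other children may need a more careful insertion position to stay schema-valid.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)   # keep the "mets:" prefix when serialising


def add_agent(mets_root, name: str, other_role: str):
    """Append a software agent to the first mets:metsHdr and return the new element."""
    hdr = mets_root.find(f"{{{METS_NS}}}metsHdr")   # step 1: locate the first header
    if hdr is None:
        raise ValueError("METS document has no mets:metsHdr")
    agent = ET.SubElement(hdr, f"{{{METS_NS}}}agent", {   # step 2: add the agent
        "TYPE": "OTHER",
        "OTHERTYPE": "SOFTWARE",
        "ROLE": "OTHER",
        "OTHERROLE": other_role,
    })
    ET.SubElement(agent, f"{{{METS_NS}}}name").text = name
    return agent
```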