hOCR - OCR Workflow and Output embedded in HTML

ocr_page

Typesetting Elements

Required:: bbox
Recommended:: image, imagemd5, ppageno, lpageno
Allowed:: x_source, x_scanner, scan_res

The ocr_page element must be present in all hOCR documents.

3.1.2. `ocr_column`

Name: ~~ocr_column~~ (Deprecated)
Categories: Typesetting Elements

OBSOLETE

Please use ocr_carea instead

3.1.3. `ocr_carea`

ocr_carea

Typesetting Elements

Required:: bbox

"ocr content area" or "body area"

Used to be called ~~ocr_column~~

The ocr_carea elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple ocr_linear streams, then each ocr_carea must indicate which stream it belongs to.

Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the careas of the original document style cannot be recovered exactly. However, the partition of a document by ocr_carea for an individual page shall be considered correct relative to ground truth if

all the text contained in a ground truth carea is fully contained within a single ocr_carea,
no text outside a ground truth carea is contained within an ocr_carea, and
the ocr_carea elements appear in the same order as the text flow relationships between the ground truth careas.

3.1.4. `ocr_line`

ocr_line

Typesetting Elements

Required:: bbox
Allowed:: baseline, hardbreak, x_font, x_fsize, x_bboxes

In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). They are represented by the ocr_line area.

ocr_line should be in a span

3.1.5. `ocr_separator`

ocr_separator

Typesetting Elements, Float Elements

Required:: bbox

Any separator or similar element

Any noise element that isn’t part of typesetting

3.2. Float elements

Overlaid onto the page is a set of floating elements; floating elements exist outside the normal reading order. Floating elements may be introduced by the textual content, or they may be related to the page itself (anchoring is a logical property). In typesetting systems, floating elements may be anchored to the page, to paragraphs, or to the content stream. Floating elements can overlap content areas and render on top of or under content, or they can force content to flow around them. The default for floating elements in this spec is that their anchor is undefined (it is a logical property, not a typesetting property), and that text flows around them. Note that rectangular content areas combined with rectangular floats already allow a wide variety of non-rectangular text shapes to be realized.

There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.

Floats should not be nested. The following floats are defined:

Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG)

ocr_photo

Required:: bbox

Something that requires JPEG or PNG to be represented well

3.2.4. `ocr_header` and `ocr_footer`

ocr_header

Required:: bbox

ocr_footer

Required:: bbox

3.2.5. `ocr_pageno`

ocr_pageno

Required:: bbox

3.2.6. `ocr_table`

ocr_table

Required:: bbox

3.3. Logical Elements

Logical Tags/classes

The classes defined in this section for logically structuring a hOCR document have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others.

Tags must be nested as indicated by the following list, but not all tags within the hierarchy need to be present.

ocr_document
- ocr_linear
  - ocr_title
  - ocr_author
  - ocr_abstract
  - ocr_part
    - ocr_chapter
      - ocr_section ▻ ocr_subsection ▻ ocr_subsubsection
        
        ocr_display
        
        ocr_blockquote
        
        ocr_par

For all of these elements except ocr_linear, there exists a natural linear ordering defined by reading order (ocr_linear indicates that the elements contained in it have a linear ordering). At the level of ocr_linear, there may not be a single distinguished order. A common example is a newspaper: a single newspaper may contain many linear streams, but there is no unique reading order among them. OCR evaluation tools should therefore be sensitive to the order of all elements other than ocr_linear.

Textual information like section numbers and bullets must be represented as text inside the containing element.

Documents whose logical structure does not map naturally onto these logical structuring elements must not use them for other purposes.

3.3.1. `ocr_document`

Name: ocr_document
Recommended HTML Tags: div
Categories: Logical Elements

3.3.2. `ocr_title`, `ocr_author` and `ocr_abstract`

Name: ocr_title
Recommended HTML Tags: h1
Categories: Logical Elements

Name: ocr_author
Categories: Logical Elements

Name: ocr_abstract
Categories: Logical Elements

3.3.3. `ocr_part` and `ocr_chapter`

Name: ocr_part
Recommended HTML Tags: h1
Categories: Logical Elements

Name: ocr_chapter
Recommended HTML Tags: h1
Categories: Logical Elements

3.3.4. `ocr_section`, `ocr_subsection` and `ocr_subsubsection`

Name: ocr_section
Recommended HTML Tags: h2
Categories: Logical Elements

Name: ocr_subsection
Recommended HTML Tags: h3
Categories: Logical Elements

Name: ocr_subsubsection
Recommended HTML Tags: h4
Categories: Logical Elements

3.3.5. `ocr_display`, `ocr_blockquote` and `ocr_par`

ocr_display

<https://github.com/kba/hocr-spec/issues/51>

Required:: bbox

Name: ocr_blockquote
Recommended HTML Tags: blockquote
Categories: Logical Elements

Name: ocr_par
Recommended HTML Tags: p
Categories: Logical Elements

3.3.6. `ocr_linear`

Name: ocr_linear
Categories: Typesetting Elements

3.3.7. `ocr_caption`

Name: ocr_caption
Categories: Logical Elements

Image captions may be indicated using the ocr_caption element; such an element refers to the image(s) contained within the same float, or the immediately adjacent image if both the image and the ocr_caption element are in running text.

3.4. Inline Elements

There is some content that should behave and flow like text

3.4.1. Unrecognized characters and words: `ocr_glyph` and `ocr_glyphs`

Name: ocr_glyph
Categories: Inline Elements

An individual glyph represented as an image (e.g., an unrecognized character)
Must contain a single img tag, or be present on one

Name: ocr_glyphs
Categories: Inline Elements

Multiple glyphs represented as an image (e.g., an unrecognized word)
Must contain a single img tag, or be present on one

3.4.2. `ocr_dropcap`

Name: ocr_dropcap
Categories: Inline Elements

An individual glyph representing a dropcap
May contain text or an img tag; the alt of the image tag should contain the corresponding text

3.4.3. Mathematical and chemical formulas: `ocr_math` and `ocr_chem`

ocr_math

Required:: bbox

ocr_chem

Required:: bbox

Mathematical and chemical formulas that float must be put into an ocr_float section. Formulas that are in “display” mode should be put into an ocr_display section.

ocr_math must either be or contain either a single img tag or [MathML] markup

ocr_chem must either be or contain either a single img tag or [CML] markup

3.4.4. Unspecified inline content: `ocr_cinfo`

Define ocrx_cinfo

If no other layout element applies, the ocr_cinfo element may be used.
ocrx_cinfo should nest inside ocrx_line
ocrx_cinfo should contain only x_confs, x_bboxes, and cuts attributes

3.5. OCR Engine-Specific elements

OCR engines use a few intermediate abstractions that do not have a meaning that can be defined in terms of either typesetting or logical function. Representing them may be useful for capturing existing OCR output, say for workflow abstractions.

Commonly suggested engine-specific elements are:

3.5.1. `ocrx_block`

Name: ocrx_block
Categories: Inline Elements, Engine-Specific Elements

ocr_carea vs ocrx_block

any kind of "block" returned by an OCR system
engine-specific because the definition of a "block" depends on the engine

Generators should attempt to ensure the following properties:

An ocrx_block should not contain content from multiple ocr_carea.
The union of all ocrx_blocks should approximately cover all ocr_carea.
an ocrx_block should contain either a float or body text, but not both
an ocrx_block should contain either an image or text, but not both

3.5.2. `ocrx_line`

Name: ocrx_line
Categories: Inline Elements, Engine-Specific Elements

ocr_line vs ocrx_line

any kind of "line" returned by an OCR system that differs from the standard ocr_line above
might be some kind of "logical" line
an ocrx_line should correspond as closely as possible to an ocr_line

3.5.3. `ocrx_word`

Name: ocrx_word
Categories: Inline Elements, Engine-Specific Elements

any kind of "word" returned by an OCR system
engine specific because the definition of a "word" depends on the engine

4. The properties of hOCR

The properties in hOCR can be broadly categorized as follows:

General Properties: These properties can apply to most elements
Non-Recommended Properties: These properties can apply to most elements but should not be used unless there is no alternative
Inline Properties: These properties apply to content on or below the level of ocr_line / ocrx_line
Layout Properties: These properties relate to placement of elements on the page
Font Properties: These properties convey font information
Character Properties: These properties convey character level information
Page Properties: These properties convey information on the whole page
Content Flow Properties: These properties are related to the reading order and flow of content on the page
Confidence Properties: These properties are related to the confidence of the hOCR producer that the text in the element has been correctly recognized

4.1. The baseline property

baseline

property-name = "baseline"
property-value = float int

Example

baseline 0.015 -18

This property applies primarily to textlines.

The baseline is described by a polynomial of order n with the coefficients pn ... p0 with n = 1 for a linear (i.e. straight) line.

The polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin.

The hOCR output for the first line of eurotext.tif contains the following information:

<span class='ocr_line' id='line_1_1'
    title="bbox 105 66 823 113; baseline 0.015 -18">...</span>

bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at -18 and its slope angle is arctan(0.015) = 0.86°.
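As a minimal sketch of how a consumer might read this property, the following Python parses the coefficient list and evaluates the polynomial in the line's own coordinate system. The helper names parse_baseline and baseline_y are hypothetical and not part of hOCR; the sketch only assumes the coefficient order pn ... p0 stated above.

```python
import math


def parse_baseline(value: str) -> list[float]:
    """Return the polynomial coefficients, highest order first (pn ... p0)."""
    return [float(tok) for tok in value.split()]


def baseline_y(coeffs: list[float], x: float) -> float:
    """Evaluate the baseline polynomial at x using Horner's rule."""
    y = 0.0
    for c in coeffs:
        y = y * x + c
    return y


coeffs = parse_baseline("0.015 -18")

# For a straight baseline, the second-to-last coefficient is the slope.
slope_deg = math.degrees(math.atan(coeffs[-2]))

print(round(slope_deg, 2))       # slope angle in degrees
print(baseline_y(coeffs, 0.0))   # constant term, where the baseline meets the y-axis
```

Horner's rule is used so that the same function also works for higher-order polynomials without any special cases.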

4.2. The bbox property

bbox

property-name = "bbox"
property-value = uint uint uint uint

Example

bbox 0 0 100 200

The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and the lower-right corner (x1, y1).

the values are with reference to the top-left corner of the document image and measured in pixels
the order of the values is x0 y0 x1 y1 = "left top right bottom"
use x_bboxes below for character bounding boxes
do not use bbox unless the bounding box of the layout component is, in fact, rectangular
some non-rectangular layout components may have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows

<span class='ocr_line' id='line_1'
    title="bbox 10 20 160 30">...</span>

The bounding box bbox of this line is shown in blue and it is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). All coordinates are measured with reference to the top-left corner of the document image, whose border is drawn in black.
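A minimal sketch of reading bbox from a title attribute in Python follows. The helpers parse_title and parse_bbox are hypothetical names, and the sketch assumes that properties are separated by semicolons as in the examples in this document; it does not handle quoted values that themselves contain a semicolon.

```python
def parse_title(title: str) -> dict[str, str]:
    """Split a title attribute into {property-name: raw property-value}."""
    props = {}
    for clause in title.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        name, _, value = clause.partition(" ")
        props[name] = value.strip()
    return props


def parse_bbox(value: str) -> tuple[int, int, int, int]:
    """Parse 'x0 y0 x1 y1' (left top right bottom) into integers."""
    x0, y0, x1, y1 = (int(tok) for tok in value.split())
    if x1 < x0 or y1 < y0:
        raise ValueError(f"bbox corners are out of order: {value!r}")
    return x0, y0, x1, y1


props = parse_title("bbox 10 20 160 30")
x0, y0, x1, y1 = parse_bbox(props["bbox"])
width, height = x1 - x0, y1 - y0
print(width, height)
```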

4.3. The cflow property

cflow

property-name = "cflow"
property-value = delimited-string

Example

cflow "article1"

This property relates the flow between multiple ocr_carea elements, and between ocr_carea and ocr_linear elements.

The content flow on the page that this element is a part of

the property value must be a unique string for each content flow
must be present on ocr_carea and ocrx_block tags when reading order is attempted and multiple content flows are present
presence must be declared in the document meta data

4.4. The cuts property

cuts

property-name = "cuts"
property-value = +(uint *1(comma uint *1(comma nint)))

Example

cuts 9 11 7,8,-2 15 3

character segmentation cuts (see below)
there must be a bbox property relative to which the cuts can be interpreted

For left-to-right writing directions, cuts are sequences of deltas in the x and y direction; the first delta in each path is an offset in the x direction relative to the last x position of the previous path. The subsequent deltas alternate between up and right moves.

Assume a bounding box of (0,0,300,100); then

cuts("10 11 7 19") =
    [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ]
cuts("10,50,3 11,30,-3") =
    [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]

<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>

Cuts are between all codepoints contained within the element, including any whitespace and control characters. Simply use a delta of 0 (zero) for invisible codepoints.

Writing directions other than left-to-right specify cuts as if the bounding box for the element had been rotated by a multiple of 90 degrees such that the writing direction is left to right, then rotated back.

It is undefined what happens when cut paths intersect, with the exception that a delta of 0 always corresponds to an invisible codepoint.
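The expansion of a cuts value into paths can be sketched in Python as below. This is an illustration under stated assumptions, not a reference implementation: parse_cuts is a hypothetical helper, coordinates are returned relative to the bounding box origin, and each path's first delta is added to the starting x of the previous path, which is the reading that reproduces both worked examples above.

```python
def parse_cuts(value: str, bbox: tuple[int, int, int, int]) -> list[list[tuple[int, int]]]:
    """Expand a left-to-right cuts value into a list of cut paths.

    Assumption: the x offset of each path is measured from the starting x
    of the previous path, matching the worked examples in this section.
    """
    _, y0, _, y1 = bbox
    height = y1 - y0
    paths = []
    start_x = 0
    for token in value.split():
        deltas = [int(d) for d in token.split(",")]
        start_x += deltas[0]
        x, y = start_x, 0
        path = [(x, y)]
        # Remaining deltas alternate: up move, then right move.
        for i, d in enumerate(deltas[1:]):
            if i % 2 == 0:
                y += d
            else:
                x += d
            path.append((x, y))
        # Every path finishes at the far edge of the bounding box.
        path.append((x, height))
        paths.append(path)
    return paths


for path in parse_cuts("10,50,3 11,30,-3", (0, 0, 300, 100)):
    print(path)
```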

4.5. The hardbreak property

hardbreak

property-name = "hardbreak"
property-value = "0" / "1"

Default Value

hardbreak 0

a zero (default) indicates that the end of the line is not a hard (explicit) line break, but a break due to text flow
a one indicates that the end of the line is a hard (explicit) line break

Any special characters representing the desired end-of-line processing must be present inside the ocr_line element. Examples of such special characters are a soft hyphen (U+00AD), a hard line break (<br>), or whitespace (" ") for soft line breaks.

4.6. The image property

image

property-name = "image"
property-value = delimited-string

Example

image "/foo/bar.png"

image file name used as input
syntactically, must be a UNIX-like pathname or http URL (no Windows pathnames)
may be relative
cannot be resolved to the actual file in general (e.g., if the hOCR file becomes separated from the image file)
if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file

4.7. The imagemd5 property

imagemd5

property-name = "imagemd5"
property-value = doublequote 32(%x41-46 / digit) doublequote

MD5 fingerprint of the image file that this page was derived from
allows re-associating pages with source images
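A producer might compute the property value as in the following Python sketch. The function name imagemd5_property is hypothetical; the digest is upper-cased because the grammar above only admits the digits and the letters A to F.

```python
import hashlib


def imagemd5_property(image_bytes: bytes) -> str:
    """Format the MD5 fingerprint of the image data as an imagemd5 property."""
    digest = hashlib.md5(image_bytes).hexdigest().upper()
    return f'imagemd5 "{digest}"'


# In practice the bytes would come from the source image file, e.g.
# open(path, "rb").read(); an empty byte string is used here for brevity.
print(imagemd5_property(b""))
```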

4.8. The lpageno property

lpageno

property-name = "lpageno"
property-value = delimited-string / uint

Example

lpageno "IV."

the logical page number expressed on the page
may not be numerical (e.g., Roman numerals)
usually is unique
must not be present unless it has been recognized from the page and is unambiguous

4.9. The ppageno property

ppageno

property-name = "ppageno"
property-value = uint

Example

ppageno 7

the physical page number
the front cover is page number 0
should be unique
must not be present unless the pages in the document have a physical ordering
must not be present unless it is well defined and unique

4.10. The nlp property

nlp

Confidence, Character

property-name = "nlp"
property-value = +float

estimate of the negative log probabilities of each character by the recognizer
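For illustration, the values can be turned back into probabilities as sketched below in Python. The section does not state the base of the logarithm; this sketch assumes the natural logarithm, and nlp_to_probabilities is a hypothetical helper name.

```python
import math


def nlp_to_probabilities(value: str) -> list[float]:
    """Convert negative log probabilities to probabilities.

    Assumption: the recognizer used the natural logarithm.
    """
    return [math.exp(-float(tok)) for tok in value.split()]


probs = nlp_to_probabilities("1.7 2.3 3.9 2.7")
print([round(p, 3) for p in probs])
```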

4.11. The order property

order

property-name = "order"
property-value = +uint

Example

order 8

The reading order of the element (an integer)

this property must not be used unless there is no other way of representing the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order
presence must be declared in the document meta data

4.12. The poly property

poly

Layout, Non-recommended

Grammar

property-name = "poly"
property-value = 2uint 2int *(2int)

Example

poly 0 0 0 10 10 10 10 20 0 20

A closed polygon for elements with non-rectangular bounds

this property must not be used unless there is no other way of representing the layout of the page using rectangular bounding boxes, since most tools will simply not have the capability of dealing with non-rectangular layouts
note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats
documents using polygonal borders anywhere must indicate this by adding ocrp_poly to the list of ocr-capabilities (see § 6.2 Capabilities)
documents should attempt to provide a reasonable bbox equivalent as well
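One way to derive the bbox equivalent mentioned above is to take the extremes of the polygon vertices, as in this Python sketch. The helpers parse_poly and poly_to_bbox are hypothetical names; the sketch assumes the value is a flat list of x y pairs, as in the example.

```python
def parse_poly(value: str) -> list[tuple[int, int]]:
    """Parse 'x0 y0 x1 y1 ...' into a list of (x, y) vertices."""
    nums = [int(tok) for tok in value.split()]
    if len(nums) < 4 or len(nums) % 2 != 0:
        raise ValueError(f"poly needs at least two complete points: {value!r}")
    return list(zip(nums[0::2], nums[1::2]))


def poly_to_bbox(points: list[tuple[int, int]]) -> tuple[int, int, int, int]:
    """Smallest axis-aligned box (x0, y0, x1, y1) that encloses all vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


points = parse_poly("0 0 0 10 10 10 10 20 0 20")
print(poly_to_bbox(points))
```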

4.13. The scan_res property

scan_res

property-name = "scan_res"
property-value = 2(uint)

Example

scan_res 300 300

The scanning resolution in DPI
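Together with bbox, the resolution lets a consumer convert pixel measurements into physical units, as in this Python sketch. The helper names are hypothetical, and the sketch assumes the first value is the horizontal and the second the vertical resolution, which this section does not state.

```python
MM_PER_INCH = 25.4


def parse_scan_res(value: str) -> tuple[int, int]:
    """Parse 'scan_res' into (x_dpi, y_dpi); the axis order is an assumption."""
    x_dpi, y_dpi = (int(tok) for tok in value.split())
    return x_dpi, y_dpi


def pixels_to_mm(pixels: int, dpi: int) -> float:
    """Convert a pixel distance to millimetres at the given resolution."""
    return pixels / dpi * MM_PER_INCH


x_dpi, y_dpi = parse_scan_res("300 300")
# A line that is 150 pixels wide and 10 pixels high:
print(pixels_to_mm(150, x_dpi), pixels_to_mm(10, y_dpi))
```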

4.14. The textangle property

textangle

property-name = "textangle"
property-value = float

Example

textangle 7.32

The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero). Rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script. Note that this is different from reading order, which should be indicated using standard HTML properties.

4.15. The x_bboxes property

x_bboxes

property-name = "x_bboxes"
property-value = 1*(4uint)

Example

x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...
x_bboxes 0 0 10 10 0 10 20 20

OCR-engine specific boxes associated with each codepoint contained in the element
note that the bbox property is a property for the bounding box of a layout element, not of individual characters
in particular, use <span class="ocr_cinfo" title="x_bboxes ....">, not <span class="ocr_cinfo" title="bbox ...">
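Splitting the flat value into one box per codepoint can be sketched in Python as follows. The helper name parse_x_bboxes is hypothetical; the grouping into fours follows the grammar above.

```python
def parse_x_bboxes(value: str) -> list[tuple[int, int, int, int]]:
    """Group a flat x_bboxes value into (x0, y0, x1, y1) tuples."""
    nums = [int(tok) for tok in value.split()]
    if not nums or len(nums) % 4 != 0:
        raise ValueError(f"x_bboxes needs a multiple of four values: {value!r}")
    return [
        (nums[i], nums[i + 1], nums[i + 2], nums[i + 3])
        for i in range(0, len(nums), 4)
    ]


boxes = parse_x_bboxes("0 0 10 10 0 10 20 20")
print(boxes)
```

A consumer would typically also check that the number of boxes matches the number of codepoints in the element before pairing them up.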

4.16. The x_font property

x_font

property-name = "x_font"
property-value = delimited-string

Example

x_font "Comic Sans MS"

x_font is an OCR-engine specific font name (a string).

4.17. The x_fsize property

x_fsize

property-name = "x_fsize"
property-value = uint

Example

x_fsize 12

x_fsize is the OCR-engine specific font size (an unsigned integer).

4.18. The x_confs property

x_confs