1. Introduction
The purpose of this document is to define an open standard for representing document layout analysis and OCR results as a subset of HTML. The goal is to reuse as much existing technology as possible, and to arrive at a representation that makes it easy to store, share, process and display OCR results.
This specification defines many features that can represent a variety of OCR-related information. However, being built on top of HTML, hOCR is designed to make it easy to start simple and gradually use more complex constructs when necessary.
Suppose you have an HTML document that encodes a book: wrapping the page elements in <div class="ocr_page"> tags will convey the page boundaries to hOCR-capable agents and turn the HTML document into an hOCR document.
2. Terminology and Representation
2.1. Reusing HTML
Reusing HTML: Some text is missing in the first paragraph <https://github.com/kba/hocr-spec/issues/96>
This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define a set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text,
However, we are not actually using a new XML format for the representation; instead, we embed the representation in XHTML (or HTML) because [XHTML1] and XHTML processing already define many aspects of OCR output representation that would otherwise need additional, separate and ad-hoc definitions. These aspects include:
- standard representations for common logical structuring elements, including section headings, citations, tables, emphasis, line breaks, quotations, and preformatted text
- standard representations for fonts, embedded images, embedded vector graphics, tables, languages, writing direction, colors
- standard representations for geometric layout and positioning
- output files that are understood without any further modification by widely used viewers (browsers), editors, conversion tools, and indexing tools
- libraries for parsing and generating the content
- support for document metadata
We are embedding this information inside HTML by encoding it within valid tags and attributes inside HTML. We are going to use the terms elements and properties for referring to embedded markup.
2.2. Definitions
2.2.1. "element"
An hOCR element (in this spec simply referred to as an element) is any HTML tag with a class attribute that contains exactly one class name that starts with ocr_ or ocrx_. Non-OCR related HTML content must not use class names that begin with ocr_ or ocrx_.
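The class-name rule above can be sketched in a few lines of Python. This is only an illustration, not part of the specification; the function name hocr_class is hypothetical.

```python
def hocr_class(class_attribute):
    """Return the hOCR class name of a tag, or None if it is not an hOCR element.

    Hypothetical helper: a tag counts as an hOCR element only when its class
    attribute holds exactly one name starting with "ocr_" or "ocrx_".
    """
    names = [name for name in class_attribute.split()
             if name.startswith(("ocr_", "ocrx_"))]
    return names[0] if len(names) == 1 else None
```

For example, a class attribute of "ocr_line highlight" yields ocr_line, while "ocr_line ocrx_word" is rejected because it carries two hOCR class names.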
Note: When referring to an HTML tag with class ocr_page, this spec uses the notation ocr_page.
If an HTML tag is an hOCR element, then its title attribute must not be used for any other purpose than to define hOCR properties and must adhere to the properties format.
For some elements, this spec recommends using specific HTML tags. This is entirely optional; it may not be possible or desirable to actually choose those tags (e.g., when adding hOCR information to an existing HTML output routine).
2.2.2. "property"
hOCR properties are a set of key-value pairs that convey OCR-specific information related to specific elements. They are serialized using a specific format in the title attribute of the element they refer to.
Note: When referring to a property bbox, this spec uses the notation bbox.
The name of a property must only consist of lowercase letters and numbers. Property names must be either from those defined in § 4 The properties of hOCR or begin with x_ to denote implementation-specific extensions.
Properties may define a default value. For those elements for which the property is not disallowed but not explicitly specified, the property is assigned to the element with the default value.
2.2.3. "capability"
The presence of elements and properties must be explicitly stated as a capability. The rationale is that if an hOCR producer is capable of producing certain elements and properties, it should inform hOCR consumers that they may encounter those elements/properties. If a producer is not capable of producing certain elements/properties, consumers need not look for them.
Note: When referring to a capability ocrp_poly, this spec uses the notation ocrp_poly.
The mechanism for declaring capabilities is described in § 6.2 Capabilities.
2.3. Relationship between elements, properties
2.3.1. element - property
There are four levels of association between any element and any property:
- Disallowed Property
-
The element MUST NOT contain the property
Unless defined otherwise, all properties are disallowed for any element.
- Required Property
-
The element MUST contain the property
- Recommended Property
-
The element SHOULD contain the property
- Allowed Property
-
The element MAY contain the property
2.3.2. property - property
A property present on an element can have one of the following relations to any other property:
- Independent Property
-
The presence of property A has no influence on the presence of property B
Unless otherwise defined, properties are always independent
- Implied Property
-
If property A is present, property B must also be present
- Conflicting Property
-
If property A is present, property B must not be present
- Related Property
-
Property B is related to property A
2.4. Properties Grammar
The properties format is as follows, expressed in the ABNF notation of [RFC5234]:
digit = %x30-39
uint = +digit
int = *1"-" uint
nint = "-" uint
fraction = "." uint
float = *uint fraction
whitespace = +%x20 ; one or more spaces ' '
comma = %x2C ; comma ','
semicolon = %x3B ; semicolon ';'
doublequote = %x22 ; double quote '"'
lowercase-letter = %x61-7A
alnum-word = +(lowercase-letter / digit)
ascii-word = +(%x21-7E - semicolon) ; printable w/o space/semicolon
ascii-string = +(%x20-7E - doublequote) ; printable ascii without doublequote
delimited-string = doublequote ascii-string doublequote
properties-format = key-value-pair *(*whitespace semicolon *whitespace key-value-pair)
spec-property-name = ("bbox" / "baseline" / "cflow" / "cuts" / "hardbreak" / "image" / "imagemd5" / "lpageno" / "nlp" / "order" / "poly" / "ppageno" / "scan_res" / "textangle" / "x_bboxes" / "x_confs" / "x_font" / "x_fsize" / "x_scanner" / "x_source" / "x_wconf" )
engine-property-name = "x_" alnum-word
key-value-pair = property-name whitespace property-value
property-name = spec-property-name / engine-property-name
property-value = (ascii-word / delimited-string) *(whitespace (ascii-word / delimited-string))
This is just the general grammar; the individual properties define the exact property grammar that overrides property-name and property-value.
<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
    <div class="ocr_par" id="par_7">...</div>
    <div class="ocr_par" id="par_19">...</div>
  </div>
</div>
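To illustrate how a consumer might read the general properties format, here is a minimal Python sketch that splits a title attribute into property names and values. It is a lenient reader rather than a validating parser of the grammar above, and the name parse_properties is hypothetical.

```python
import re

# A token is either a double-quoted (delimited) string or a run of characters
# containing no whitespace, semicolon or double quote.
_TOKEN = re.compile(r'"[^"]*"|[^\s;"]+')
# A key-value pair ends at a semicolon that is not inside a delimited string.
_PAIR = re.compile(r'(?:"[^"]*"|[^;])+')

def parse_properties(title):
    """Split an hOCR title attribute into {property-name: [values]}."""
    properties = {}
    for pair in _PAIR.findall(title):
        tokens = _TOKEN.findall(pair)
        if not tokens:
            continue  # tolerate empty segments such as a trailing semicolon
        name, values = tokens[0], tokens[1:]
        properties[name] = [value.strip('"') for value in values]
    return properties
```

Applied to the ocr_carea element above, parse_properties('bbox 313 324 733 1922') gives a single bbox entry with four string values. Semicolons inside a delimited string are kept as part of the value.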
3. The elements of hOCR
The elements in hOCR can be broadly categorized as follows:
- Typesetting Elements
-
Elements that describe those areas of a page that nest but don’t generally overlap
- Float Elements
-
Elements that describe those areas of a page that are not part of the flow but are positioned
- Logical Elements
-
These elements describe a page and its components in traditional typesetting.
- Inline elements
-
These elements describe content beyond the level of text lines
- Engine-Specific elements
-
Elements whose semantics are engine-specific
3.1. Typesetting Elements
The following typesetting related elements are based on a typesetting model as found in most typesetting systems, including XSL:FO, (La)TeX, LibreOffice, and Microsoft Word.
In those systems, each page is divided into a number of areas. Each area can be a part of the body text (or of multiple body texts, in the case of newspaper layouts). The content of the areas derives from a linear stream of textual content, which flows into the areas, filling them linewise in their preferred directions.
3.1.1. ocr_page
- Name
- ocr_page
- Categories
- Typesetting Elements
- Properties
-
- Required:
- bbox
- Recommended:
- image, imagemd5, ppageno, lpageno
- Allowed:
- x_source, x_scanner, scan_res
The ocr_page element must be present in all hOCR documents.
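As a rough illustration of this requirement, the following Python sketch collects the ocr_page elements of a document using only the standard library. The class and function names are hypothetical, and checking for at least one page is just one possible sanity check a consumer could apply.

```python
from html.parser import HTMLParser

class PageCollector(HTMLParser):
    """Collects the id attribute of every tag whose class list includes ocr_page."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None for valueless attributes.
        classes = (attributes.get("class") or "").split()
        if "ocr_page" in classes:
            self.pages.append(attributes.get("id"))

def page_ids(html):
    """Return the ids of all ocr_page elements, in document order."""
    collector = PageCollector()
    collector.feed(html)
    return collector.pages
```

A document for which page_ids returns an empty list does not satisfy the requirement above.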
3.1.2. ocr_column
- Name
- ocr_column (Deprecated)
- Categories
- Typesetting Elements
3.1.3. ocr_carea
- Name
- ocr_carea
- Categories
- Typesetting Elements
- Properties
-
- Required:
- bbox
"ocr content area" or "body area". Used to be called ocr_column.
The ocr_carea elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple ocr_linear streams, then each ocr_carea must indicate which stream it belongs to.
Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the careas of the original document style cannot be recovered exactly. However, the partition of a document by ocr_carea for an individual page shall be considered correct relative to ground truth if
- all the text contained in a ground truth carea is fully contained within a single ocr_carea,
- no text outside a ground truth carea is contained within an ocr_carea, and
- the ocr_carea elements appear in the same order as the text flow relationships between the ground truth careas.
3.1.4. ocr_line
- Name
- ocr_line
- Categories
- Typesetting Elements
- Properties
-
- Required:
- bbox
- Allowed:
- baseline, hardbreak, x_font, x_fsize, x_bboxes
In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). They are represented by the ocr_line area.
3.1.5. ocr_separator
- Name
- ocr_separator
- Categories
- Typesetting Elements, Float Elements
- Properties
-
- Required:
- bbox
Any separator or similar element
3.1.6. ocr_noise
- Name
- ocr_noise
- Categories
- Inline Elements
Any noise element that isn’t part of typesetting
3.2. Float elements
Overlaid onto the page is a set of floating elements; floating elements exist outside the normal reading order. Floating elements may be introduced by the textual content, or they may be related to the page itself (anchoring is a logical property). In typesetting systems, floating elements may be anchored to the page, to paragraphs, or to the content stream. Floating elements can overlap content areas and render on top of or under content, or they can force content to flow around them. The default for floating elements in this spec is that their anchor is undefined (it is a logical property, not a typesetting property), and that text flows around them. Note that with rectangular content areas and rectangular floats, already a wide variety of non-rectangular text shapes can be realized.
There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.
Floats should not be nested. The following floats are defined:
3.2.1. ocr_float
- Name
- ocr_float
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
3.2.2. ocr_textfloat and ocr_textimage
- Name
- ocr_textfloat
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
- Name
- ocr_textimage
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
3.2.3. ocr_image, ocr_linedrawing and ocr_photo
- Name
- ocr_image
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
- Name
- ocr_linedrawing
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG)
- Name
- ocr_photo
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
Something that requires JPEG or PNG to be represented well
3.2.4. ocr_header and ocr_footer
- Name
- ocr_header
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
- Name
- ocr_footer
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
3.2.5. ocr_pageno
- Name
- ocr_pageno
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
3.2.6. ocr_table
- Name
- ocr_table
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
3.3. Logical Elements
The classes defined in this section for logically structuring an hOCR document have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others.
Tags must be nested as indicated by the following list, but not all tags within the hierarchy need to be present.
For all of these elements except ocr_linear, there exists a natural linear ordering defined by reading order (ocr_linear indicates that the elements contained in it have a linear ordering). At the level of ocr_linear, there may not be a single distinguished order. A common example of ocr_linear is a newspaper, in which a single newspaper may contain many linear streams, but there is no unique reading order for the different linear streams. OCR evaluation tools should therefore be sensitive to the order of all elements other than ocr_linear.
Textual information like section numbers and bullets must be represented as text inside the containing element.
Documents whose logical structure does not map naturally onto these logical structuring elements must not use them for other purposes.
3.3.1. ocr_document
- Name
- ocr_document
- Recommended HTML Tags
- div
- Categories
- Logical Elements
3.3.2. ocr_title, ocr_author and ocr_abstract
- Name
- ocr_title
- Recommended HTML Tags
- h1
- Categories
- Logical Elements
- Name
- ocr_author
- Categories
- Logical Elements
- Name
- ocr_abstract
- Categories
- Logical Elements
3.3.3. ocr_part and ocr_chapter
- Name
- ocr_part
- Recommended HTML Tags
- h1
- Categories
- Logical Elements
- Name
- ocr_chapter
- Recommended HTML Tags
- h1
- Categories
- Logical Elements
3.3.4. ocr_section, ocr_subsection and ocr_subsubsection
- Name
- ocr_section
- Recommended HTML Tags
- h2
- Categories
- Logical Elements
- Name
- ocr_subsection
- Recommended HTML Tags
- h3
- Categories
- Logical Elements
- Name
- ocr_subsubsection
- Recommended HTML Tags
- h4
- Categories
- Logical Elements
3.3.5. ocr_display, ocr_blockquote and ocr_par
- Name
- ocr_display
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
- Name
- ocr_blockquote
- Recommended HTML Tags
- blockquote
- Categories
- Logical Elements
- Name
- ocr_par
- Recommended HTML Tags
- p
- Categories
- Logical Elements
3.3.6. ocr_linear
- Name
- ocr_linear
- Categories
- Typesetting Elements
3.3.7. ocr_caption
- Name
- ocr_caption
- Categories
- Logical Elements
Image captions may be indicated using the ocr_caption element; such an element refers to the image(s) contained within the same float, or the immediately adjacent image if both the image and the ocr_caption element are in running text.
3.4. Inline Elements
There is some content that should behave and flow like text.
3.4.1. Unrecognized characters and words: ocr_glyph and ocr_glyphs
- Name
- ocr_glyph
- Categories
- Inline Elements
- An individual glyph represented as an image (e.g., an unrecognized character)
- Must contain a single img tag, or be present on one
- Name
- ocr_glyphs
- Categories
- Inline Elements
- Multiple glyphs represented as an image (e.g., an unrecognized word)
- Must contain a single img tag, or be present on one
3.4.2. ocr_dropcap
- Name
- ocr_dropcap
- Categories
- Inline Elements
- An individual glyph representing a dropcap
- May contain text or an img tag; the alt of the image tag should contain the corresponding text
3.4.3. Mathematical and chemical formulas: ocr_math and ocr_chem
- Name
- ocr_math
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
- Name
- ocr_chem
- Categories
- Float Elements
- Properties
-
- Required:
- bbox
Mathematical and chemical formulas that float must be put into an ocr_float section. Formulas that are in “display” mode should be put into an ocr_display section.
ocr_math must be, or contain, either a single img tag or [MathML] markup.
ocr_chem must be, or contain, either a single img tag or [CML] markup.
3.4.4. Unspecified inline content: ocr_cinfo
- If no other layout element applies, the ocr_cinfo element may be used.
- ocrx_cinfo should nest inside ocrx_line
- ocrx_cinfo should contain only x_confs, x_bboxes, and cuts attributes
3.5. OCR Engine-Specific elements
A few abstractions are used as intermediate abstractions in OCR engines, although they do not have a meaning that can be defined either in terms of typesetting or logical function. Representing them may be useful to represent existing OCR output, say for workflow abstractions.
Commonly suggested engine-specific markup includes:
3.5.1. ocrx_block
- Name
- ocrx_block
- Categories
- Inline Elements, Engine-Specific Elements
- any kind of "block" returned by an OCR system
- engine-specific because the definition of a "block" depends on the engine
Generators should attempt to ensure the following properties:
- An ocrx_block should not contain content from multiple ocr_carea.
- The union of all ocrx_blocks should approximately cover all ocr_carea.
- An ocrx_block should contain either a float or body text, but not both.
- An ocrx_block should contain either an image or text, but not both.
3.5.2. ocrx_line
- Name
- ocrx_line
- Categories
- Inline Elements, Engine-Specific Elements
- any kind of "line" returned by an OCR system that differs from the standard ocr_line above
- might be some kind of "logical" line
- an ocrx_line should correspond as closely as possible to an ocr_line
3.5.3. ocrx_word
- Name
- ocrx_word
- Categories
- Inline Elements, Engine-Specific Elements
- any kind of "word" returned by an OCR system
- engine-specific because the definition of a "word" depends on the engine
4. The properties of hOCR
The properties in hOCR can be broadly categorized as follows:
- General Properties
-
These properties can apply to most elements
- Non-Recommended Properties
-
These properties can apply to most elements but should not be used unless there is no alternative
- Inline Properties
-
These properties apply to content on or below the level of ocr_line / ocrx_line
- Layout Properties
-
These properties relate to placement of elements on the page
- Font Properties
-
These properties convey font information
- Character Properties
-
These properties convey character level information
- Page Properties
-
These properties convey information on the whole page
- Content Flow Properties
-
These properties are related to the reading order and flow of content on the page
- Confidence Properties
-
These properties are related to the confidence of the hOCR producer that the text in the element has been correctly recognized
4.1. The baseline property
- Name
-
baseline
- Categories
- Grammar
-
property-name = "baseline" property-value = float int
- Example
-
baseline 0.015 -18
This property applies primarily to textlines.
The baseline is described by a polynomial of order n with the coefficients pn ... p0, with n = 1 for a linear (i.e. straight) line.
The polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin.
The hOCR output for the first line of eurotext.tif contains the following information:
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18">...</span>
bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at -18 and its slope angle is arctan(0.015) = 0.86°.
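The relationship between bbox and baseline can be sketched as follows. This is an illustrative reading, not a normative one: it assumes that the y axis of the line coordinate system points downwards like the image y axis, so that a constant term of -18 places the baseline 18 pixels above the bottom of the bounding box. The function names are hypothetical.

```python
import math

def baseline_y(bbox, coefficients, x):
    """Return the image y coordinate of the baseline at image x coordinate x.

    bbox is (x0, y0, x1, y1); coefficients are pn ... p0, highest order first.
    """
    x0, _y0, _x1, y1 = bbox
    dx = x - x0  # horizontal offset from the bottom-left corner of the bbox
    value = 0.0
    for coefficient in coefficients:  # Horner's scheme
        value = value * dx + coefficient
    # Assumption: y grows downwards, as in image coordinates.
    return y1 + value

def slope_angle_degrees(slope):
    """Angle of a straight baseline with the given slope, in degrees."""
    return math.degrees(math.atan(slope))
```

For the eurotext line above, baseline_y((105, 66, 823, 113), [0.015, -18], 105) evaluates to 95.0 under this assumption, i.e. 18 pixels above the bottom edge at the left end of the line.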
4.2. The bbox property
- Name
-
bbox
- Categories
- Grammar
-
property-name = "bbox" property-value = uint uint uint uint
- Example
-
bbox 0 0 100 200
The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and the lower-right corner (x1, y1).
- the values are with reference to the top-left corner of the document image and measured in pixels
- the order of the values is x0 y0 x1 y1 = "left top right bottom"
- use x_bboxes below for character bounding boxes
- do not use bbox unless the bounding box of the layout component is, in fact, rectangular
- some non-rectangular layout components may have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows
<span class='ocr_line' id='line_1' title="bbox 10 20 160 30">...</span>
The bounding box bbox of this line is shown in blue and is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). All coordinates are measured with reference to the top-left corner of the document image, whose border is drawn in black.
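A small Python sketch of reading a bbox value and deriving the box size, given purely as an illustration; the function names are hypothetical, and the ordering check is an extra sanity check rather than something the grammar enforces.

```python
def parse_bbox(value):
    """Parse the value of a bbox property ("x0 y0 x1 y1") into four integers."""
    x0, y0, x1, y1 = (int(number) for number in value.split())
    if x1 < x0 or y1 < y0:
        # Extra sanity check: "left top right bottom" implies this ordering.
        raise ValueError("bbox corners are not in left-top-right-bottom order")
    return x0, y0, x1, y1

def bbox_size(bbox):
    """Return (width, height) of a bounding box in pixels."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0
```

For the line above, the value "10 20 160 30" describes a box that is 150 pixels wide and 10 pixels high.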
4.3. The cflow property
- Name
-
cflow
- Categories
- Grammar
-
property-name = "cflow" property-value = delimited-string
- Example
-
cflow "article1"
This property relates the flow between multiple ocr_carea elements, and between ocr_carea and ocr_linear elements.
The content flow on the page that this element is a part of
- the value must be a unique string for each content flow
- must be present on ocr_carea and ocrx_block tags when reading order is attempted and multiple content flows are present
- presence must be declared in the document metadata
4.4. The cuts property
- Name
-
cuts
- Categories
- Related
- Implied
- Grammar
-
property-name = "cuts" property-value = +(uint *1(comma uint *1(comma nint)))
- Example
-
cuts 9 11 7,8,-2 15 3
- character segmentation cuts (see below)
- there must be a bbox property relative to which the cuts can be interpreted
For left-to-right writing directions, cuts are sequences of deltas in the x and y direction; the first delta in each path is an offset in the x direction relative to the last x position of the previous path. The subsequent deltas alternate between up and right moves.
Assume a bounding box of (0,0,300,100); then
cuts("10 11 7 19") = [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ]
cuts("10,50,3 11,30,-3") = [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]
<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>
Cuts are between all codepoints contained within the element, including any whitespace and control characters. Simply use a delta of 0 (zero) for invisible codepoints.
Writing directions other than left-to-right specify cuts as if the bounding box for the element had been rotated by a multiple of 90 degrees such that the writing direction is left to right, then rotated back.
It is undefined what happens when cut paths intersect, with the exception that a delta of 0 always corresponds to an invisible codepoint.
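The decoding of cuts can be sketched in Python as follows. This sketch is written to reproduce the two worked examples above, which means it measures the first delta of each path from the starting x position of the previous path; that reading is an assumption taken from the examples. Coordinates are relative to the bounding box, and the function name decode_cuts is hypothetical.

```python
def decode_cuts(value, height):
    """Decode a cuts value into a list of paths of (x, y) points.

    value  -- the property value, e.g. "10,50,3 11,30,-3"
    height -- height of the element's bounding box
    """
    paths = []
    start_x = 0
    for group in value.split():
        deltas = [int(delta) for delta in group.split(",")]
        # Assumption (from the worked examples): the first delta is added to
        # the starting x position of the previous path.
        start_x += deltas[0]
        x, y = start_x, 0
        path = [(x, y)]
        for index, delta in enumerate(deltas[1:]):
            if index % 2 == 0:
                y += delta  # "up" move
            else:
                x += delta  # "right" move (a negative delta moves left)
            path.append((x, y))
        if y != height:
            path.append((x, height))  # extend the cut to the opposite edge
        paths.append(path)
    return paths
```

With height 100, decode_cuts("10 11 7 19", 100) produces four straight vertical cuts at x = 10, 21, 28 and 47, matching the first example.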
4.5. The hardbreak property
- Name
-
hardbreak
- Categories
- Grammar
-
property-name = "hardbreak" property-value = "0" / "1"
- Default Value
-
hardbreak 0
- a zero (default) indicates that the end of the line is not a hard (explicit) line break, but a break due to text flow
- a one indicates that the end of the line is a hard (explicit) line break
Any special characters representing the desired end-of-line processing must be present inside the ocr_line element. Examples of such special characters are a soft hyphen (U+00AD), a hard line break (<br>), or whitespace for soft line breaks.
4.6. The image property
- Name
-
image
- Categories
- Related
- Grammar
-
property-name = "image" property-value = delimited-string
- Example
-
image "/foo/bar.png"
- image file name used as input
- syntactically, must be a UNIX-like pathname or http URL (no Windows pathnames)
- may be relative
- cannot be resolved to the actual file in general (e.g., if the hOCR file becomes separated from the image file)
- if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file
4.7. The imagemd5 property
- Name
-
imagemd5
- Categories
- Implied
- Grammar
-
property-name = "imagemd5" property-value = doublequote 32(%x41-46 / digit) doublequote
- MD5 fingerprint of the image file that this page was derived from
- allows re-associating pages with source images
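As an illustration, the fingerprint could be computed with Python's hashlib module as sketched below. The helper name is hypothetical; the digest is upper-cased because the grammar above only admits the hexadecimal letters A to F in upper case.

```python
import hashlib

def imagemd5_property(image_bytes):
    """Return an imagemd5 key-value pair for the raw bytes of an image file."""
    # hexdigest() returns lower-case hex; the grammar uses %x41-46 (A-F).
    digest = hashlib.md5(image_bytes).hexdigest().upper()
    return f'imagemd5 "{digest}"'
```

In practice the bytes would come from reading the source image file in binary mode.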
4.8. The lpageno property
- Name
-
lpageno
- Categories
- Related
- Grammar
-
property-name = "lpageno" property-value = delimited-string / uint
- Example
-
lpageno "IV."
- the logical page number expressed on the page
- may not be numerical (e.g., Roman numerals)
- usually is unique
- must not be present unless it has been recognized from the page and is unambiguous
4.9. The ppageno property
- Name
-
ppageno
- Categories
- Related
- Grammar
-
property-name = "ppageno" property-value = uint
- Example
-
ppageno 7
- the physical page number
- the front cover is page number 0
- should be unique
- must not be present unless the pages in the document have a physical ordering
- must not be present unless it is well defined and unique
4.10. The nlp property
- Name
-
nlp
- Categories
- Related
- Implied
- Grammar
-
property-name = "nlp" property-value = +float
- estimate of the negative log probabilities of each character by the recognizer
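A consumer that prefers probabilities can invert the negative logarithm, as in the sketch below. The base of the logarithm is not stated here, so the natural logarithm is assumed; the function name is hypothetical.

```python
import math

def nlp_to_probabilities(value):
    """Convert an nlp value such as "1.7 2.3 3.9 2.7" into probabilities.

    Assumption: the recognizer used the natural logarithm.
    """
    return [math.exp(-float(number)) for number in value.split()]
```

Under this assumption a value of 0 corresponds to certainty, and larger values correspond to less likely characters.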
4.11. The order property
- Name
-
order
- Categories
- Grammar
-
property-name = "order" property-value = +uint
- Example
-
order 8
The reading order of the element (an integer)
- this property must not be used unless there is no other way of representing the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order
- presence must be declared in the document metadata
4.12. The poly property
- Name
-
poly
- Categories
- Grammar
-
property-name = "poly" property-value = 2uint 2int *(2int)
- Example
-
poly 0 0 0 10 10 10 10 20 0 20
A closed polygon for elements with non-rectangular bounds
- this property must not be used unless there is no other way of representing the layout of the page using rectangular bounding boxes, since most tools will simply not have the capability of dealing with non-rectangular layouts
- note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats
- documents using polygonal borders anywhere must indicate this by adding ocrp_poly to the list of ocr-capabilities (see § 6.2 Capabilities)
- documents should attempt to provide a reasonable bbox equivalent as well
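One way to derive a bbox equivalent from a poly value is to take the extremes of its coordinates, as in this Python sketch. The function name is hypothetical, and the axis-aligned hull is only one reasonable choice of equivalent.

```python
def poly_to_bbox(value):
    """Return the smallest axis-aligned box (x0, y0, x1, y1) enclosing a poly value."""
    numbers = [int(number) for number in value.split()]
    if len(numbers) < 4 or len(numbers) % 2:
        raise ValueError("poly needs an even number of coordinates, at least two points")
    xs, ys = numbers[0::2], numbers[1::2]  # x and y coordinates alternate
    return min(xs), min(ys), max(xs), max(ys)
```

For the example value "0 0 0 10 10 10 10 20 0 20" this yields the box 0 0 10 20.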
4.13. The scan_res property
- Name
-
scan_res
- Categories
- Related
- Grammar
-
property-name = "scan_res" property-value = 2(uint)
- Example
-
scan_res 300 300
The scanning resolution in DPI
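The resolution lets a consumer convert pixel measurements into physical units. The sketch below assumes that the first scan_res value is the horizontal and the second the vertical resolution, which the text above does not state; the function name is hypothetical.

```python
MM_PER_INCH = 25.4

def bbox_size_mm(bbox, scan_res):
    """Return (width, height) of a bbox in millimetres.

    bbox     -- (x0, y0, x1, y1) in pixels
    scan_res -- (horizontal_dpi, vertical_dpi); the order is an assumption
    """
    x0, y0, x1, y1 = bbox
    horizontal_dpi, vertical_dpi = scan_res
    width_mm = (x1 - x0) / horizontal_dpi * MM_PER_INCH
    height_mm = (y1 - y0) / vertical_dpi * MM_PER_INCH
    return width_mm, height_mm
```

At 300 DPI, a line that is 150 pixels wide measures half an inch, or 12.7 mm.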
4.14. The textangle property
- Name
-
textangle
- Categories
- Grammar
-
property-name = "textangle" property-value = float
- Example
-
textangle 7.32
The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script; note that this is different from reading order, which should be indicated using standard HTML properties
4.15. The x_bboxes property
- Name
-
x_bboxes
- Categories
- Related
- Grammar
-
property-name = "x_bboxes" property-value = 1*(4uint)
- Example
-
x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...
x_bboxes 0 0 10 10 0 10 20 20
- OCR-engine specific boxes associated with each codepoint contained in the element
- note that the bbox property is a property for the bounding box of a layout element, not of individual characters
- in particular, use <span class="ocr_cinfo" title="x_bboxes ....">, not <span class="ocr_cinfo" title="bbox ...">
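Reading x_bboxes amounts to grouping the numbers into sets of four, one box per codepoint. A minimal Python sketch, with a hypothetical function name:

```python
def parse_x_bboxes(value):
    """Split an x_bboxes value into a list of (x0, y0, x1, y1) tuples."""
    numbers = [int(number) for number in value.split()]
    if len(numbers) % 4:
        raise ValueError("x_bboxes must contain a multiple of four numbers")
    return [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]
```

The example value "0 0 10 10 0 10 20 20" therefore describes two character boxes.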
4.16. The x_font property
- Name
-
x_font
- Categories
- Related
- Grammar
-
property-name = "x_font" property-value = delimited-string
- Example
-
x_font "Comic Sans MS"
x_font is an OCR-engine specific font name (a string).
4.17. The x_fsize property
- Name
-
x_fsize
- Categories
- Related
- Grammar
-
property-name = "x_fsize" property-value = uint
- Example
-
x_fsize 12
x_fsize is the OCR-engine specific font size (an unsigned integer).
4.18. The x_confs property
- Name
-
x_confs
- Categories
- Grammar
-
property-name = "x_confs" property-value = +float
- Example
-
x_confs 37.3 51.23 1 100
- OCR-engine specific character confidences
- values must be numbers
- higher values should express higher confidences
- if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)
4.19. The x_scanner property
- Name
-
x_scanner
- Categories
- Related
- Grammar
-
property-name = "x_scanner" property-value = delimited-string
- Example
-
x_scanner "Canon Lide 220"
A representation of the scanner
4.20. The x_source property
- Name
-
x_source
- Categories
- Related
- Grammar
-
property-name = "x_source" property-value = 1*delimited-string
- Example
-
x_source "/gfs/cc/clean/012345678911" "17"
x_source "http://pageserver/012345678911&page=17"
- an implementation-dependent representation of the document source
- could be a URL or a /gfs/ path
- offsets within a multipage format (e.g., TIFF) may be represented using additional strings or using URL parameters or fragments
4.21. The x_wconf property
- Name
-
x_wconf
- Categories
- Grammar
-
property-name = "x_wconf" property-value = float
- Example
-
x_wconf 97.23
- OCR-engine specific confidence for the entire contained substring
- value must be a number
- higher values should express higher confidences
- if possible, convert word confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)
5. Encoding Guidelines
5.1. Recommendations for Mappings
When possible, any mapping of logical structure onto HTML should try to follow the following rules:
- the mapping should be "natural" -- similar to what an author of the document might have entered into a WYSIWYG content creation tool
- text should be in reading order
- all tags should be used for the intended purpose (and only for the intended purpose) as defined in the [HTML401] spec.
- floats are contained in div elements with a style that includes a float attribute
- repeating floating page elements (header/footer) should be repeated and occur in their natural location in reading order (e.g., between pages)
- embedded images and SVG should be contained in files in the same directory (no / in the URL) and embedded with img and embed tags, respectively
Specifically:
- em and strong should represent emphasis, and are preferred to b, i, and u
- b, i, and u should represent a change in the corresponding attribute for the current font (but an OCR font specification must still be given)
- p should represent paragraph breaks
- br should represent explicit linebreaks (not linebreaks that happen because of text flow)
- h1, ..., h6 should represent the logical nesting structure (if any) of the document
- a should represent hyperlinks and references within the document
- blockquote should represent indented quotations, but not other uses of indented text.
- table should represent tables, including correct use of the th tag
If necessary, the markup may use the following non-standard tags:
- nobr to indicate that line breaking is not permitted for the enclosed content
- wbr to indicate that line breaking is permitted at that location
5.2. Styling hOCR with CSS
OCR information and presentation information can be separated by putting the CSS info related to the OCR in an outer element with an ocr_ or ocrx_ class, and then overriding it for the presentation by nesting another span with the actual presentation information inside that:
<span class="ocr_cinfo" style="ocr style"><span style="presentation style">...</span></span>
5.3. Language, Writing Direction
OCR-generated font and text color information is encoded using standard HTML and CSS attributes on elements whose class starts with ocr_ or ocrx_.
Language and writing direction should be indicated using the HTML standard attributes lang and dir.
Furigana and similar constructs must be represented using their correct Unicode encoding.
The HTML &lrm; and &rlm; entities (indicating writing direction) must not be used; all writing direction changes must be indicated with new tags with an appropriate dir attribute.
The CSS3 text layout attributes can be used when necessary. For example, CSS supports writing-mode, direction, glyph-orientation, [ISO15924]-based script (list of codes), text-indent, etc.
5.4. Superscript and Subscript
Superscripts and subscripts, when not in ocr_math or ocr_chem formulas, must be represented using the HTML sup and sub tags, even if special Unicode characters are available.
5.5. Whitespace
Non-breaking spaces must be represented using the HTML &nbsp; entity.
Different space widths should be indicated using the HTML entities &ensp;, &emsp;, &thinsp;, &zwnj;, and &zwj;.
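As an illustration, a writer could map the relevant Unicode characters to their named entities on output. The following is a minimal sketch, not part of the specification; it assumes Python 3, and the helper name encode_spaces and the exact set of code points are assumptions made for this example. The entity names come from the standard library's html.entities table.

```python
from html.entities import codepoint2name

# Assumed set of characters to emit as named entities:
# nbsp, ensp, emsp, thinsp, zwnj, zwj.
SPACE_CODEPOINTS = (0x00A0, 0x2002, 0x2003, 0x2009, 0x200C, 0x200D)

# Translation table: code point -> "&name;"
ENTITY_TABLE = {cp: "&%s;" % codepoint2name[cp] for cp in SPACE_CODEPOINTS}


def encode_spaces(text):
    """Replace special space characters with their HTML named entities."""
    return text.translate(ENTITY_TABLE)


print(encode_spaces("a\u00a0b\u2009c"))
```

Ordinary ASCII spaces are left untouched, so the function can be applied to text content just before serialization.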
5.6. Hyphenation
How to handle hyphens? <https://github.com/kba/hocr-spec/issues/7>
Non Linear Hyphens <https://github.com/altoxml/schema/issues/41>
Soft hyphens must be represented using the HTML &shy; entity.
5.7. Alternative Segmentations / Readings
Alternative segmentations and readings are indicated by a span with class="alternatives". It must contain ins and del elements. The first contained element should be ins and represent the most probable interpretation; the subsequent ones should be del. Each ins and del element should have class="alt" and a property of either nlp or x_cost. These span, ins, and del tags can nest arbitrarily.
<span class="alternatives">
  <ins class="alt" title="nlp 0.3">hello</ins>
  <del class="alt" title="nlp 1.1">hallo</del>
</span>
Whitespace within the span but outside the contained ins/del elements is ignored and should be inserted to improve readability of the HTML when viewed in a browser.
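A consumer might read such a span as sketched below. This is only an illustration under stated assumptions: the markup is well-formed XML without namespaces, alternatives are not nested, and the helper names parse_title and read_alternatives are hypothetical. Properties in the title attribute are assumed to be separated by semicolons.

```python
import xml.etree.ElementTree as ET

SNIPPET = (
    '<span class="alternatives">'
    ' <ins class="alt" title="nlp 0.3">hello</ins>'
    ' <del class="alt" title="nlp 1.1">hallo</del>'
    ' </span>'
)


def parse_title(title):
    """Split a title attribute into {property name: [values]}."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


def read_alternatives(span):
    """Return (text, property name, value) for each ins/del child, in order."""
    readings = []
    for child in span:
        if child.tag not in ("ins", "del"):
            continue
        props = parse_title(child.get("title", ""))
        for name in ("nlp", "x_cost"):
            if name in props:
                text = "".join(child.itertext())
                readings.append((text, name, float(props[name][0])))
                break
    return readings


span = ET.fromstring(SNIPPET)
readings = read_alternatives(span)
best_text = readings[0][0]  # the first element (ins) is the preferred reading
print(readings)
print(best_text)
```

Whitespace between the ins and del elements ends up in the tail of each element and is never read, which matches the rule that such whitespace is ignored.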
5.8. Grouped Elements and Multiple Hierarchies
The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; for example, a single ocr_page may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and using the same identifier.
Grouped elements should be logically consistent with the markup they represent; for example, it is probably not sensible to use grouped elements to interleave parts of two different chapters. Therefore, grouped elements should usually be adjacent in the markup.
Applications using hOCR may choose to manipulate grouped elements directly, but the simplest way of dealing with them is to transform a document with grouped elements into one without grouped elements prior to further processing by first removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all the physical layout information, and then collapsing all split ocr_chapter elements into single ocr_chapter elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or JavaScript.
The presence of grouped elements does not need to be indicated in the header; when it affects their operations, hOCR processors should check for the presence of grouped elements in the output and fail with an error message if they cannot correctly process the hOCR information.
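The collapsing step described in this section could look roughly like the sketch below. It is not a reference implementation: it assumes well-formed XML input, assumes the tags that are not of interest have already been removed so that the parts of a group are siblings, and discards any text that directly follows a merged part (normally just whitespace). The function name collapse_groups is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches a "groupid identifier" property inside a title attribute.
GROUPID_RE = re.compile(r"\bgroupid\s+([^\s;]+)")


def collapse_groups(parent):
    """Merge sibling elements that share tag, class and groupid (recursively)."""
    first_seen = {}
    for child in list(parent):
        match = GROUPID_RE.search(child.get("title", ""))
        if not match:
            continue
        key = (child.tag, child.get("class"), match.group(1))
        if key not in first_seen:
            first_seen[key] = child
            continue
        target = first_seen[key]
        # Append the leading text of the later part to the first part.
        if child.text:
            if len(target):
                target[-1].tail = (target[-1].tail or "") + child.text
            else:
                target.text = (target.text or "") + child.text
        # Move the children over, then drop the now redundant element.
        target.extend(list(child))
        parent.remove(child)
    for child in parent:
        collapse_groups(child)


DOC = (
    '<div>'
    '<div class="ocr_chapter" title="groupid ch1"><p>first part</p></div>'
    '<div class="ocr_chapter" title="groupid ch1"><p>second part</p></div>'
    '</div>'
)
root = ET.fromstring(DOC)
collapse_groups(root)
print(ET.tostring(root, encoding="unicode"))
```

After the call, the two ocr_chapter parts form a single element that contains both paragraphs in their original order.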
6. Metadata
The creator of the hOCR document can indicate the following information using meta tags in the head section.
- ocr-system
-
Indicates software and version that generated the hOCR document
Every hOCR document must have exactly one ocr-system metadata field
- ocr-capabilities
-
Features consumers of the hOCR document can expect
See § 6.2 Capabilities for possible values
Every hOCR document must have exactly one ocr-capabilities metadata field
- ocr-number-of-pages
-
The number of ocr_page elements in the document
- ocr-langs
-
Use ISO 639-1 codes
Value may be unknown
- ocr-scripts
-
Use ISO 15924 letter codes
Value may be unknown
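The "exactly one" requirements above are easy to check mechanically. The following sketch assumes the document is well-formed XML; the function name check_metadata and the wording of the messages are made up for this example.

```python
import xml.etree.ElementTree as ET

# Fields that must occur exactly once according to the list above.
REQUIRED_ONCE = ("ocr-system", "ocr-capabilities")


def check_metadata(root):
    """Return a list of problems found in the hOCR meta fields."""
    counts = {}
    for el in root.iter():
        # Strip a possible XHTML namespace prefix such as "{...}meta".
        if el.tag.rsplit("}", 1)[-1] == "meta":
            name = el.get("name")
            counts[name] = counts.get(name, 0) + 1
    problems = []
    for name in REQUIRED_ONCE:
        found = counts.get(name, 0)
        if found != 1:
            problems.append("%s: expected exactly 1, found %d" % (name, found))
    return problems


DOC = (
    '<html><head>'
    '<meta name="ocr-system" content="tesseract v3.03" />'
    '</head><body></body></html>'
)
print(check_metadata(ET.fromstring(DOC)))
```

An empty list means both required fields are present exactly once.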
6.1. Document metadata
For document meta information, use the Dublin Core Embedding into HTML. See also Citation Guidelines for Dublin Core.
6.2. Capabilities
Any program generating files in this output format must indicate in the document metadata what kind of markup it is capable of generating. This includes listing the exact set of markup sections that the system could have generated, even if it did not actually generate them for the particular document.
If a document lists a certain capability but no element or attribute is found that corresponds to that capability, users of the document may infer that the content is absent in the source document. If a capability is not listed, the corresponding element or attribute must not be present in the document.
The capability to generate specific properties is given by the prefix ocrp_...; the important properties are:
- ocrp_lang
-
Capable of generating lang attributes
- ocrp_dir
-
Capable of generating dir attributes
- ocrp_poly
-
Capable of generating polygonal bounds
- ocrp_font
-
Capable of generating font information (standard font information)
- ocrp_nlp
-
Capable of generating nlp confidences
- ocr_embeddedformat_<formatname>
-
The capability to generate other specific embedded formats is given by the prefix ocr_embeddedformat_<formatname>.
- ocr_<tag>_unordered
-
If an OCR engine represents a particular tag but cannot determine reading order for that tag, it must specify a capability of ocr_<tag>_unordered.
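The rule that unlisted capabilities must not appear in the document can be checked with a few lines of code. The sketch below is an assumption-laden illustration: it only looks at element classes that start with ocr_, ignores ocrx_ classes and ocrp_ properties, and expects well-formed XML. The function name unlisted_classes is hypothetical.

```python
import xml.etree.ElementTree as ET


def unlisted_classes(root):
    """Return ocr_* classes used in the document but not listed as capabilities."""
    listed = set()
    used = set()
    for el in root.iter():
        is_meta = el.tag.rsplit("}", 1)[-1] == "meta"
        if is_meta and el.get("name") == "ocr-capabilities":
            listed.update((el.get("content") or "").split())
        for cls in (el.get("class") or "").split():
            if cls.startswith("ocr_"):
                used.add(cls)
    return sorted(used - listed)


DOC = (
    '<html><head>'
    '<meta name="ocr-capabilities" content="ocr_page ocr_line" />'
    '</head><body>'
    '<div class="ocr_page"><p class="ocr_par">'
    '<span class="ocr_line">text</span></p></div>'
    '</body></html>'
)
print(unlisted_classes(ET.fromstring(DOC)))
```

Here ocr_par is reported because it is used in the body but missing from the ocr-capabilities field.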
6.3. Profiles - Restricting hOCR markup
hOCR provides standard means of marking up information, but it does not mandate the presence or absence of particular kinds of information. For example, an hOCR file may contain only logical markup, only physical markup, or only engine-specific markup. As a result, merely knowing that OCR output is hOCR compliant doesn’t tell us whether that file is actually useful for subsequent processing.
OCR systems can use hOCR in various different ways internally, but we will eventually define some common profiles that mandate what kinds of information need to be present in particular kinds of output.
Of particular importance are:
-
physical layout profile: OCR output in XHTML format with a defined set of common physical layout markup capabilities (page, carea, floats, line). Logical layout may be present as well, but the document tree structure must represent the physical layout structure, with logical layout elements split and grouped as needed.
-
logical layout profile: OCR output in XHTML format with a defined set of common logical layout markup capabilities (linear, chapter, section, subsection). Physical layout may be present as well, but the document tree structure must represent the logical layout structure, with physical layout elements split and grouped as needed.
Other possible profiles might be defined for specific engines or specific document classes:
-
common commercial OCR output (e.g., Abbyy)
-
book target: all logical structuring elements (as applicable), except ocr_linear
-
newspaper target: all logical structuring elements (as applicable); articles map onto ocr_linear
6.4. Formats: Restricting HTML Markup
The HTML-based markup is orthogonal to the hOCR-based markup; that is, both can be chosen independently of one another. The only thing that needs to be consistent between the two markups is the text contained within the tags. hOCR and other embedded format tags can be put on HTML tags, or they can be put on their own div/span tags.
There are many different choices possible and reasonable for the HTML markup, depending on the use and further processing of the document. Each such choice must be indicated in the metadata for the document.
Many mappings derived from existing tools are quite similar, and most follow the restrictions and recommendations below already without further modifications.
Depending on the particular HTML markup used in the document, the document is suitable for different kinds of processing and use. The formats have the following intents:
- html_none (see § 6.4.1 HTML without logical markup)
-
Straightforward equivalent of Goodoc or [XDOC]
- html_simple
-
Target format for convenient on-line viewing and intermediate format for indexing
- html_xytable_absolute, html_xytable_relative
-
Target format for layout-preserving on-screen document viewing
- Formats defined in § 6.4.3 HTML produced by OCR engines
-
Straightforward recording of commercial OCR system output
- Formats defined in § 6.4.4 HTML with absolute positioning
-
Target format for services like Google’s View as HTML
As long as a format contains the hOCR information, it can be reprocessed by layout analysis software and converted into one of the other formats. In particular, we envision layout analysis tools for converting any hOCR document into html_absolute, html_xytable_absolute, and html_simple. Furthermore, internally, a layout analysis system might use html_xytable_absolute as an intermediate format for converting hOCR into html_simple.
6.4.1. HTML without logical markup
The html_none format contains no logical markup at all; it is simply a collection of div and span elements with associated hOCR information. Note that such documents can still be rendered visually through the use of CSS.
6.4.2. HTML with limited logical elements
The html_simple format follows the restrictions and recommendations above, and only uses the following tags:
-
b, i, and u for appearance changes (bold, italic, underline)
-
font for any other appearance changes
-
div with a float style for floats
-
table for tables
-
img for images
-
all SVG must be externally embedded with the embed tag
-
the use of other embedded formats is permitted
-
all other uses of div, span, ins, and del only for hOCR tags or other embedded formats (hCard, …)
6.4.3. HTML produced by OCR engines
HTML markup produced by default by the OCR engine for the given document must follow the template html_ocr_<engine>.
Examples of possible values are:
- html_ocr_unknown
-
The HTML was generated by some OCR engine, but it’s unknown which one
- html_ocr_finereader_8
- html_ocr_textbridge_11
6.4.4. HTML with absolute positioning
- html_absolute
-
The HTML represents absolute positioning of elements on each page.
Possible subformats are:
- html_absolute_cols
-
absolute positioning of cols
- html_absolute_pars
-
absolute positioning of paragraphs
- html_absolute_lines
-
absolute positioning of lines
- html_absolute_words
-
absolute positioning of words
- html_absolute_chars
-
absolute positioning of characters
The "View as HTML" feature of Google Search for PDF files uses html_absolute_lines; this is probably the most reasonable choice for approximating the appearance of the original document.
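One way a converter might derive absolute positioning for a line is to turn its bbox property into an inline CSS style. This is a rough sketch only: the specification does not prescribe how html_absolute output is produced, the mapping of one image pixel to one CSS pixel is an assumption, and the function name absolute_style is hypothetical.

```python
import re

# Matches the bbox property (x0 y0 x1 y1) inside a title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def absolute_style(title):
    """Build an absolute-positioning CSS style from an hOCR bbox property."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError("no bbox property in title: %r" % title)
    x0, y0, x1, y1 = (int(v) for v in match.groups())
    # Assumption: one image pixel corresponds to one CSS pixel.
    return "position:absolute; left:%dpx; top:%dpx; width:%dpx; height:%dpx" % (
        x0, y0, x1 - x0, y1 - y0)


print(absolute_style("bbox 10 20 110 50"))
```

The resulting string can be placed in the style attribute of the element that carries the ocr_line class, inside a page container that uses relative positioning.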
6.4.5. HTML as table
- html_xytable
-
The HTML is a table that gives the XY-cut layout segmentation structure of the page in tabular form.
Note that in this format, text order does not necessarily correspond to reading order.
The format must contain one table of class ocr_xycut representing each page. The markup of the content of the table itself is as in html_simple.
Possible subformats are:
- html_xytable_absolute
-
The table structure must represent the absolute size of the original page element.
- html_xytable_relative
-
Table element sizes are expressed relatively (as percentages).
6.4.6. HTML from word processors
The HTML represents markup that follows the mappings of the given document processor to HTML.
Note that the document doesn’t actually need to have been constructed in the processor and that the processor doesn’t need to have been used to generate the HTML. For example, the html_latex2html tag merely indicates that, say, a scanned and ocr’ed article uses the same conventions for logical markup tags that an equivalent article actually written in LaTeX and actually converted to HTML would have used.
- html_latex2html
- html_msword
-
HTML mapping generated by “Save As HTML”
- html_ooffice
-
HTML mapping generated by “Save As HTML”
- html_docbook_xsl
-
HTML mapping generated by official XSL style sheets
6.5. Example
<html>
  <head>
    <meta name="ocr-system" content="tesseract v3.03" />
    <meta name="ocr-capabilities" content="ocr_page ocr_line ocrp_lang" />
    <meta name="ocr-langs" content="aa la zu" />
    <meta name="ocr-scripts" content="Arab Khmr" />
    <meta name="ocr-number-of-pages" content="112" />
    ...
  </head>
  ...
</html>
Indicate that the work this hOCR file represents:
Appendix A: Revision History
hOCR was originally developed by Thomas Breuel.
See the releases and full commit history for a revision history.
Appendix B: Sample Usage
See also the hocr-tools for more samples.
The HTML format described here may seem fairly complicated and difficult to parse, but because there are lots of tools for manipulating HTML documents, hOCR files are actually pretty easy to manipulate. Here are some examples:
import os
import re

import libxml2

# convert the HTML to XHTML (if necessary)
os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null")

# parse the XML
doc = libxml2.parseFile('page.xhtml')

# search all nodes having a class of ocr_line
lines = doc.xpathEval("//*[@class='ocr_line']")

# a function for extracting the text from a node
def get_text(node):
    textnodes = node.xpathEval(".//text()")
    s = " ".join([t.getContent() for t in textnodes])
    return re.sub(r'\s+', ' ', s)

# a function for extracting the bbox property from a node
# note that the title= attribute on a node with an ocr_ class must
# conform with the OCR spec
def get_bbox(node):
    data = node.prop('title')
    bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')
    return [int(x) for x in bboxre.search(data).groups()]

# this extracts and prints all the bounding boxes and the text they contain
# it doesn't matter what other markup the line node may contain
for line in lines:
    print(get_bbox(line), get_text(line))
Note that the OCR markup, basic HTML markup, and semantic markup can co-exist within the same HTML file without interfering with one another.
Appendix C: IANA Considerations
Media Type
In accordance with [RFC4289]
- MIME media type name
-
text
- MIME subtype name:
-
vnd.hocr+html
- Required parameters:
- Optional parameters:
- Encoding considerations:
-
hOCR documents should be encoded as UTF-8
- Security considerations:
- Interoperability considerations:
- Applications which use this media type:
- File extension(s):
-
*.html, *.hocr