ocrd_validators.page_validator module
API for validating OcrdPage.
-
exception ocrd_validators.page_validator.ConsistencyError(tag, ID, file_id, actual, expected)
Bases: Exception
Exception representing a consistency error in textual transcription across levels of a PAGE-XML. (Element text strings must be the concatenation of their children’s text strings, joined by white space.)
Construct a new ConsistencyError.
- Parameters
tag (string) – Level of the inconsistent element (parent)
ID (string) – ID of the inconsistent element (parent)
file_id (string) – mets:id of the PAGE file
actual (string) – Value of parent’s TextEquiv[0]/Unicode
expected (string) – Concatenated values of children’s TextEquiv[0]/Unicode, joined by white-space
-
exception ocrd_validators.page_validator.CoordinateConsistencyError(tag, ID, file_id, outer, inner)
Bases: Exception
Exception representing a consistency error in coordinate confinement across levels of a PAGE-XML. (Element coordinate polygons must be properly contained in their parents’ coordinate polygons.)
Construct a new CoordinateConsistencyError.
- Parameters
tag (string) – Level of the offending element (child)
ID (string) – ID of the offending element (child)
file_id (string) – mets:id of the PAGE file
outer (string) – Coordinate points of the parent
inner (string) – Coordinate points of the child
-
exception ocrd_validators.page_validator.CoordinateValidityError(tag, ID, file_id, points, reason='unknown')
Bases: Exception
Exception representing a validity error of an element’s coordinates in PAGE-XML. (Element coordinate polygons must have at least 3 points, and must not self-intersect, be non-contiguous, or contain negative coordinates.)
Construct a new CoordinateValidityError.
- Parameters
tag (string) – Level of the offending element (child)
ID (string) – ID of the offending element (child)
file_id (string) – mets:id of the PAGE file
points (string) – Coordinate points
reason (string) – Description of the problem
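Two of the validity conditions named above (at least 3 points, no negative coordinates) can be illustrated with a minimal sketch. The helper names `parse_points` and `basic_validity_problem` are hypothetical and not part of this module, and the sketch assumes integer coordinates in the usual PAGE-XML `x1,y1 x2,y2 ...` points notation. It does not test for self-intersection or non-contiguity.

```python
def parse_points(points):
    """Parse a points string such as "0,0 10,0 10,10" into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(',')) for pair in points.split()]


def basic_validity_problem(points):
    """Return a short reason string, or None if neither basic check fails.

    Hypothetical helper: covers only the point-count and negativity checks.
    """
    coords = parse_points(points)
    if len(coords) < 3:
        return 'too few points'
    if any(x < 0 or y < 0 for x, y in coords):
        return 'negative coordinates'
    return None
```

A reason string like the ones returned here is the kind of value that could be passed as `reason` when constructing a CoordinateValidityError.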
-
ocrd_validators.page_validator.compare_without_whitespace(a, b)
Compare two strings, ignoring all whitespace.
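One way such a comparison could be written is sketched below. The function name `equal_ignoring_whitespace` is hypothetical, and this is not necessarily how the module implements it.

```python
def equal_ignoring_whitespace(a, b):
    """Return True if a and b are equal once all whitespace is removed."""
    # str.split() without arguments splits on any run of whitespace,
    # so joining the pieces back together drops every whitespace character.
    return ''.join(a.split()) == ''.join(b.split())
```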
-
ocrd_validators.page_validator.page_get_reading_order(ro, rogroup)
Add all elements from the given reading order group to the given dictionary.
Given a dict ro from layout element IDs to ReadingOrder element objects, and an object rogroup with additional ReadingOrder element objects, add all references to the dict, traversing the group recursively.
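The recursive traversal described above can be sketched with simplified stand-in classes. `RegionRef`, `Group`, and `collect_reading_order` are hypothetical names for illustration only; the actual ReadingOrder element objects have a richer interface.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRef:
    """Hypothetical stand-in for a leaf reference to a layout element."""
    regionRef: str


@dataclass
class Group:
    """Hypothetical stand-in for a group that can hold nested references."""
    regionRef: str
    children: list = field(default_factory=list)


def collect_reading_order(ro, rogroup):
    """Add every reference under rogroup to the dict ro, keyed by element ID."""
    for elem in rogroup.children:
        ro[elem.regionRef] = elem
        if isinstance(elem, Group):
            # Nested groups are traversed recursively.
            collect_reading_order(ro, elem)


top = Group('page', [RegionRef('r1'), Group('r2', [RegionRef('r2_1')])])
ro = {}
collect_reading_order(ro, top)
```

After the call, `ro` maps the IDs `r1`, `r2`, and `r2_1` to their reference objects.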
-
ocrd_validators.page_validator.make_poly(polygon_points)
Instantiate a Polygon from a list of point pairs, or return an error string.
-
ocrd_validators.page_validator.make_line(line_points)
Instantiate a LineString from a list of point pairs, or return an error string.
-
ocrd_validators.page_validator.validate_consistency(node, page_textequiv_consistency, page_textequiv_strategy, check_baseline, check_coords, report, file_id, joinRelations=None, readingOrder=None, textLineOrder=None, readingDirection=None)
Check whether the text results on an element are consistent with its child elements’ text results, and whether the coordinates of an element are fully within its parent element’s coordinates.
-
ocrd_validators.page_validator.concatenate(nodes, concatenate_with, page_textequiv_strategy, joins=None)
Concatenate nodes textually according to https://ocr-d.github.io/page#consistency-of-text-results-on-different-levels
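As a rough illustration of the underlying rule (a parent’s text must equal its children’s texts joined by a separator), here is a simplified sketch that works on plain strings. The names `join_child_texts` and `is_consistent` are hypothetical, and the sketch ignores the `joins` and strategy handling of the real function.

```python
def join_child_texts(child_texts, separator):
    """Join the children's text strings with the given separator."""
    return separator.join(child_texts)


def is_consistent(parent_text, child_texts, separator=' '):
    """Return True if the parent text equals the joined child texts."""
    return parent_text == join_child_texts(child_texts, separator)
```

For example, a line with the text "Hello world" is consistent with the words "Hello" and "world" joined by a single space, but not with the same words joined by a newline.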
-
ocrd_validators.page_validator.get_text(node, page_textequiv_strategy='first')
Get the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, return the string of the highest scoring result. For the strategy first, return the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, return the empty string.
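The selection rules described above can be sketched on a simplified data model. `TextResult` and `pick_text` are hypothetical names, and the sketch assumes each result carries an optional index and an optional confidence score; the real function operates on PAGE element objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextResult:
    """Hypothetical stand-in for one text result of an element."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text(results, strategy='first'):
    """Select a text string from results according to the strategy."""
    if not results:
        return ''  # no results: empty string
    if strategy == 'best':
        scored = [r for r in results if r.conf is not None]
        # Highest score wins; without any scores, fall back to the first result.
        chosen = max(scored, key=lambda r: r.conf) if scored else results[0]
    else:
        indexed = [r for r in results if r.index is not None]
        # Lowest index wins; without any indexes, fall back to the first result.
        chosen = min(indexed, key=lambda r: r.index) if indexed else results[0]
    return chosen.text
```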
-
ocrd_validators.page_validator.set_text(node, text, page_textequiv_strategy)
Set the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, set the string of the highest scoring result. For the strategy first, set the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, add a new one.
-
class ocrd_validators.page_validator.PageValidator
Bases: object
Validator for OcrdPage <../ocrd_models/ocrd_models.ocrd_page.html>.
-
static validate(filename=None, ocrd_page=None, ocrd_file=None, page_textequiv_consistency='strict', page_textequiv_strategy='first', check_baseline=True, check_coords=True)
Validates a PAGE file for consistency, given either a filename, an OcrdFile, or an OcrdPage passed directly.
- Parameters
filename (string) – Path to PAGE
ocrd_page (OcrdPage) – OcrdPage instance
ocrd_file (OcrdFile) – OcrdFile instance wrapping OcrdPage
page_textequiv_consistency (string) – ‘strict’, ‘lax’, ‘fix’ or ‘off’
page_textequiv_strategy (string) – Currently only ‘first’
check_baseline (bool) – whether Baseline must be fully within TextLine/Coords
check_coords (bool) – whether *Region/TextLine/Word/Glyph must each be fully contained within Border/*Region/TextLine/Word, resp.
- Returns
report (ValidationReport) – Report on the validity
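A call by filename might look like the following sketch, which uses only the parameters documented above. The wrapper name `validate_page_file` and the path in the usage note are hypothetical, and the import is placed inside the function so that the sketch can be loaded even where the OCR-D packages are not installed.

```python
def validate_page_file(path):
    """Validate the PAGE file at path and return the resulting report."""
    # Lazy import: requires the OCR-D validators package at call time only.
    from ocrd_validators.page_validator import PageValidator

    return PageValidator.validate(
        filename=path,
        page_textequiv_consistency='strict',
        page_textequiv_strategy='first',
        check_baseline=True,
        check_coords=True,
    )
```

Calling `validate_page_file('page.xml')` with a real PAGE file would return a ValidationReport that can then be inspected for problems.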
-
static