ocrd.processor.base module
Processor base class and helper functions.

class ocrd.processor.base.Processor(workspace, ocrd_tool=None, parameter=None, input_file_grp='INPUT', output_file_grp='OUTPUT', page_id=None, show_resource=None, list_resources=False, show_help=False, show_version=False, dump_json=False, version=None)
Bases: object
A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing. That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or requested physical pages of the input fileGrp(s), and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
Instantiate, but do not process. Unless list_resources or show_resource or show_help or show_version or dump_json is true, set up for processing (parsing and validating parameters, entering the workspace directory).
Parameters
workspace (Workspace) – The workspace to process. Can be None even for processing (esp. on multiple workspaces), but then needs to be set before running.
Keyword Arguments
ocrd_tool (string) – JSON of the ocrd-tool description for that processor. Can be None for processing, but needs to be set before running.
parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.
input_file_grp (string) – comma-separated list of METS fileGrps used for input.
output_file_grp (string) – comma-separated list of METS fileGrps used for output.
page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages).
show_resource (string) – If not None, then instead of processing, resolve given resource by name and print its contents to stdout.
list_resources (boolean) – If true, then instead of processing, find all installed resource files in the search paths and print their path names.
show_help (boolean) – If true, then instead of processing, print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.
show_version (boolean) – If true, then instead of processing, print information on this processor’s version and OCR-D version. Exit afterwards.
dump_json (boolean) – If true, then instead of processing, print ocrd_tool on stdout.
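Several of the arguments above are plain strings that encode structured values: comma-separated lists for fileGrps and page IDs, and a JSON object for the parameters. The following is a minimal sketch of how such strings could be decoded with the standard library only. The helper names `split_list` and `decode_parameter` are hypothetical and not part of `ocrd`; the actual constructor additionally validates the parameters against the ocrd-tool description.

```python
import json


def split_list(value):
    """Split a comma-separated string into a list, dropping empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def decode_parameter(parameter):
    """Decode a JSON object string into a dict (hypothetical helper)."""
    if parameter is None:
        return {}
    decoded = json.loads(parameter)
    if not isinstance(decoded, dict):
        raise ValueError("parameter JSON must be an object")
    return decoded


# Example values; the fileGrp names and the parameter are made up.
input_file_grps = split_list("OCR-D-IMG,OCR-D-SEG")
page_ids = split_list("")  # an empty string stands for "all pages"
params = decode_parameter('{"level": "line"}')
```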

process()
Process the workspace from the given input_file_grp to the given output_file_grp for the given page_id under the given parameter.
(This contains the main functionality and needs to be overridden by subclasses.)

add_metadata(pcgts)
Add PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to PcGtsType pcgts.

resolve_resource(val)
Resolve a resource name to an absolute file path with the algorithm in https://ocr-d.de/en/spec/ocrd_tool#file-parameters
Parameters
val (string) – resource value to resolve

property input_files
List the input files (for single-valued input_file_grp).
For each physical page:
- If there is a single PAGE-XML for the page, take it (and forget about all other files for that page).
- Else if there is a single image file, take it (and forget about all other files for that page).
- Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page).
Algorithm: https://github.com/cisocrgroup/ocrd_cis/pull/57#issuecomment-656336593
Returns
A list of ocrd_models.ocrd_file.OcrdFile objects.
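The per-page selection rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `File` is a hypothetical stand-in for `ocrd_models.ocrd_file.OcrdFile`, `select_input_files` is a made-up name, and the three rules are applied literally in the order given.

```python
from collections import namedtuple

# Hypothetical stand-in for ocrd_models.ocrd_file.OcrdFile (assumption).
File = namedtuple("File", ["page_id", "mimetype", "url"])

# MIME type used for PAGE-XML files in OCR-D workspaces.
PAGE_MIMETYPE = "application/vnd.prima.page+xml"


def select_input_files(files):
    """Pick one file per page: a single PAGE-XML, else a single image."""
    by_page = {}
    for f in files:
        by_page.setdefault(f.page_id, []).append(f)
    selected = []
    for page_id, candidates in by_page.items():
        page_xml = [f for f in candidates if f.mimetype == PAGE_MIMETYPE]
        images = [f for f in candidates if f.mimetype.startswith("image/")]
        if len(page_xml) == 1:
            # Rule 1: a single PAGE-XML wins, other files are ignored.
            selected.append(page_xml[0])
        elif len(images) == 1:
            # Rule 2: otherwise a single image is taken.
            selected.append(images[0])
        else:
            # Rule 3: no unique choice for this page.
            raise ValueError("no unique input file for page %s" % page_id)
    return selected


# Example data; the file names are made up.
files = [
    File("p1", PAGE_MIMETYPE, "p1.xml"),
    File("p1", "image/tiff", "p1.tif"),
    File("p2", "image/png", "p2.png"),
]
chosen = select_input_files(files)
```

Here page `p1` resolves to its PAGE-XML file and page `p2` to its only image; a page with two images and no PAGE-XML would raise.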

zip_input_files(require_first=True, mimetype=None, on_error='skip')
List tuples of input files (for multi-valued input_file_grp).
Processors that expect or need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.
Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.
Single-page multiple-file errors are handled according to on_error:
- if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)
- if first, then the first matching file for the page will be silently selected (as if the first was the only match)
- if last, then the last matching file for the page will be silently selected (as if the last was the only match)
- if abort, then an exception will be raised.
Multiple matches for PAGE-XML will always raise an exception.
Keyword Arguments
require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.
mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
Returns
A list of ocrd_models.ocrd_file.OcrdFile tuples.
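The alignment and the `on_error` policies described above can be illustrated with a simplified sketch. It deliberately leaves out the PAGE-versus-image preference and the `mimetype` filter, and it models each fileGrp as a plain dict from page ID to candidate file names. The names `pick` and `zip_files` are hypothetical, and representing a missing file as `None` in the tuple is an assumption of this sketch.

```python
def pick(matches, on_error="skip"):
    """Resolve the candidates of one page in one fileGrp to one file or None."""
    if len(matches) <= 1:
        return matches[0] if matches else None
    if on_error == "skip":
        return None          # treat the page as if nothing had matched
    if on_error == "first":
        return matches[0]
    if on_error == "last":
        return matches[-1]
    # "abort" (or any unknown policy) raises
    raise ValueError("multiple matching files: %r" % (matches,))


def zip_files(groups, require_first=True, on_error="skip"):
    """Align files across fileGrps, one tuple per page.

    groups: one dict per fileGrp, mapping page ID to a list of candidates.
    """
    # Collect page IDs in order of first appearance across all groups.
    page_ids = []
    for group in groups:
        for page_id in group:
            if page_id not in page_ids:
                page_ids.append(page_id)
    tuples = []
    for page_id in page_ids:
        row = tuple(pick(g.get(page_id, []), on_error) for g in groups)
        if require_first and row[0] is None:
            continue  # page unavailable in the first fileGrp
        tuples.append(row)
    return tuples


# Example data; the page IDs and file names are made up.
groups = [
    {"p1": ["a1"], "p2": ["a2", "a2b"]},
    {"p1": ["b1"], "p3": ["b3"]},
]
default_result = zip_files(groups)
```

With the defaults, only `p1` survives: `p2` is ambiguous in the first group and is skipped there, and `p3` is absent from the first group.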

ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None)
Generate a string describing the full CLI of this processor including params.
Parameters
ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json
processor_instance (object, optional) – the processor implementation (for adding any module/class/function docstrings)

ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None)
Open a workspace and run a processor on the command line.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Run the processor CLI executable on the workspace, passing:
- the workspace,
- page_id
- input_file_grp
- output_file_grp
- parameter (after applying any parameter_override settings)
(Will create output files and update the METS in the filesystem).
Parameters
executable (string) – Executable name of the module processor.
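To make the argument passing concrete, the sketch below assembles an argument list for a processor executable from the same inputs. It is not the body of `run_cli`; `build_cli_args` is a hypothetical helper, the long option names follow the OCR-D command-line interface specification as understood here, and encoding a parameter dict with `json.dumps` is an assumption of this sketch.

```python
import json


def build_cli_args(executable, mets_url, working_dir=None, page_id=None,
                   overwrite=False, log_level=None, input_file_grp=None,
                   output_file_grp=None, parameter=None):
    """Assemble an argv list for a processor CLI (hypothetical helper)."""
    args = [executable, "--mets", mets_url]
    if working_dir:
        args += ["--working-dir", working_dir]
    if log_level:
        args += ["--log-level", log_level]
    if page_id:
        args += ["--page-id", page_id]
    if input_file_grp:
        args += ["--input-file-grp", input_file_grp]
    if output_file_grp:
        args += ["--output-file-grp", output_file_grp]
    if overwrite:
        args += ["--overwrite"]
    if parameter:
        # Parameters are passed as a JSON object string.
        args += ["--parameter", json.dumps(parameter)]
    return args


# Example call; the executable and fileGrp names are only examples.
argv = build_cli_args("ocrd-dummy", "mets.xml",
                      input_file_grp="OCR-D-IMG",
                      output_file_grp="OCR-D-OUT",
                      overwrite=True)
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; building a list rather than a shell string avoids quoting problems with the JSON parameter value.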

ocrd.processor.base.run_processor(processorClass, ocrd_tool=None, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, show_resource=None, list_resources=False, parameter=None, parameter_override=None, working_dir=None)
Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Instantiate a Python object for processorClass, passing:
- the workspace,
- ocrd_tool
- page_id
- input_file_grp
- output_file_grp
- parameter (after applying any parameter_override settings)
Run the processor on the workspace (creating output files in the filesystem).
Finally, write back the workspace (updating the METS in the filesystem).
Parameters
processorClass (object) – Python class of the module processor.
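The sequence described for run_processor (reuse or open a workspace, instantiate, process, save) can be sketched with stand-in classes so that it runs without `ocrd` installed. Everything here is hypothetical: `StubWorkspace`, `StubProcessor`, `merge_parameters` and `run` are illustrative names, and treating `parameter_override` as a list of (name, value) pairs that take precedence over `parameter` is an assumption, not documented behaviour.

```python
class StubWorkspace:
    """Hypothetical stand-in for a workspace; records files and saving."""

    def __init__(self):
        self.files = []
        self.saved = False

    def save_mets(self):
        self.saved = True


class StubProcessor:
    """Hypothetical stand-in for a Processor subclass."""

    def __init__(self, workspace, parameter=None, input_file_grp="INPUT",
                 output_file_grp="OUTPUT", page_id=None):
        self.workspace = workspace
        self.parameter = parameter or {}
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.page_id = page_id

    def process(self):
        # Pretend to write one output file into the output fileGrp.
        self.workspace.files.append(self.output_file_grp)


def merge_parameters(parameter, parameter_override):
    """Apply (name, value) overrides on top of the base parameters."""
    merged = dict(parameter or {})
    for name, value in parameter_override or []:
        merged[name] = value
    return merged


def run(processor_class, workspace=None, parameter=None,
        parameter_override=None, **kwargs):
    """Reuse or open a workspace, run the processor, then save."""
    if workspace is None:
        workspace = StubWorkspace()  # stands in for opening from a METS URL
    processor = processor_class(
        workspace,
        parameter=merge_parameters(parameter, parameter_override),
        **kwargs)
    processor.process()
    workspace.save_mets()
    return processor


# Example run; the parameter name and fileGrp names are made up.
result = run(StubProcessor, parameter={"dpi": 300},
             parameter_override=[("dpi", 600)],
             input_file_grp="A", output_file_grp="B")
```

Saving happens only after `process()` returns, so in this sketch a failing processor leaves the workspace unsaved.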