COCRID: A Challenging Optical Character Recognition Image Dataset

Robert Grosso

August 2023 | Technical Memorandum NTIA TM-23-569

Robert Grosso and Margaret H. Pinson

Abstract: This memorandum provides technical details for the image quality experiment COCRID: A Challenging Optical Character Recognition Image Dataset. The design goals of the COCRID dataset are (1) to train no-reference metrics that track the quality of recognized text, (2) to understand characteristics of images that are particularly difficult for optical character recognition (OCR) algorithms, and (3) to develop a metric that responds strongly to the effects of impaired text. The experiment simulates the environment of a mobile scanning application, using five environment scenarios and a control to replicate challenging conditions where OCR might be used. Source material is photographed under a variety of lighting and capture impairments to create a high-noise environment. COCRID contains 984 impaired images and 41 control images. Each image is processed by an OCR algorithm, and the resulting string of recognized text is compared with the original text to compute a character error rate metric. The lessons learned from this dataset will help researchers design datasets for other computer vision algorithms.
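
The comparison between recognized text and original text can be illustrated with a short sketch. The abstract does not state the exact formula used for COCRID, so the code below assumes the common definition of character error rate: the Levenshtein edit distance between the two strings divided by the length of the original (reference) string. The function names `levenshtein` and `character_error_rate` are hypothetical and chosen only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Edit distance is symmetric, so keep the shorter string in the
    # inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, recognized: str) -> float:
    """Assumed definition: edit distance divided by reference length.
    The value can exceed 1.0 when the OCR output is much longer
    than the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, recognized) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings, not taken from the dataset.
    print(character_error_rate("hello", "hallo"))
```

Under this assumed definition, a value of 0.0 means the OCR output matches the original exactly, and larger values indicate more character-level errors. Real pipelines often normalize whitespace or letter case before comparing; whether COCRID does so is not stated here.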

Keywords: image quality; camera capture; no-reference (NR) metric; optical character recognition (OCR)

For technical information concerning this report, contact:

Margaret H. Pinson
Institute for Telecommunication Sciences
(720) 601-7314
mpinson@ntia.gov

Disclaimer:

Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.

For questions or information on this or any other NTIA scientific publication, contact the ITS Publications Office at ITSinfo@ntia.gov or 303-497-3572.
