August 2023 | Technical Memorandum TM-23-569
COCRID: A Challenging Optical Character Recognition Image Dataset
Cite This Publication
Robert Grosso and Margaret H. Pinson, “COCRID: A Challenging Optical Character Recognition Image Dataset,” Technical Memorandum TM-23-569, U.S. Department of Commerce, National Telecommunications and Information Administration, Institute for Telecommunication Sciences, August 2023.
Robert Grosso and Margaret H. Pinson
Abstract: This memorandum provides technical details for the image quality experiment COCRID: A Challenging Optical Character Recognition Image Dataset. The design goals of the COCRID dataset are (1) to train no-reference metrics that track the quality of recognized text, (2) to understand characteristics of images that are particularly difficult for optical character recognition (OCR) algorithms, and (3) to develop a metric that responds strongly to the effects of impaired text. The experiment simulates a mobile scanning application, using five environment scenarios and a control to replicate challenging conditions where OCR might be used. Source material is photographed under a variety of lighting and capture impairments to create a high-noise environment. COCRID contains 984 impaired images and 41 control images. Each image is processed by an OCR algorithm, and the resulting string of recognized text is compared with the original text to compute a character error rate metric. The lessons learned from this dataset will help researchers design datasets for other computer vision algorithms.
Keywords: image quality; camera capture; no reference (NR) metric; optical character recognition (OCR)
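The character error rate mentioned in the abstract can be illustrated with a minimal sketch. This page does not give the memorandum's exact formula, so the code below assumes a common definition: the Levenshtein (edit) distance between the recognized string and the original text, divided by the length of the original. The function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from the memorandum.

```python
def levenshtein(reference: str, recognized: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `recognized` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the recognized string.
    prev = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from OCR output
                curr[j - 1] + 1,     # extra character in OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, recognized: str) -> float:
    """Assumed definition: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, recognized) / len(reference)


if __name__ == "__main__":
    # Hypothetical example strings, not drawn from the dataset.
    print(character_error_rate("kitten", "sitting"))
```

Under this assumed definition the rate can exceed 1.0 when the OCR output contains many spurious characters, so an implementation may choose to clip the value or normalize differently.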
For technical information concerning this report, contact:
Margaret H. Pinson
Institute for Telecommunication Sciences
(303) 497-3579
mpinson@ntia.doc.gov
Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.
For questions or information on this or any other NTIA scientific publication, contact the ITS Publications Office at ITSinfo@ntia.gov or 303-497-3572.