August 2023 | NTIA Technical Memo TM-23-569
Robert Grosso; Margaret H. Pinson
Abstract: This memorandum provides technical details for the image quality experiment COCRID: A Challenging Optical Character Recognition Dataset. The design goals of the COCRID dataset are (1) to train no-reference metrics that track the quality of recognized text, (2) to understand characteristics of images that are particularly difficult for Optical Character Recognition (OCR) algorithms, and (3) to develop a metric that responds strongly to the effects of impaired text. The experiment simulates the environment of a mobile scanning application, using five environment scenarios and a control to replicate challenging conditions where OCR might be used. Source material is photographed under a variety of lighting and capture impairments to create a high-noise environment. COCRID contains 984 impaired images and 41 control images. Each image is processed by an OCR algorithm, and the resulting string of recognized text is compared with the original text to compute a character error rate. The lessons learned from this dataset will help researchers design datasets for other computer vision algorithms.
Keywords: image quality; camera capture; no-reference (NR) metric; optical character recognition (OCR)
For technical information concerning this report, contact:
Margaret H. Pinson
Institute for Telecommunication Sciences
Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.
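The abstract describes comparing the OCR output string with the original text to obtain a character error rate. As a minimal sketch of how such a metric is commonly computed, the Python below divides the Levenshtein edit distance by the length of the reference text. This normalization is an assumption based on the usual definition of character error rate, and the function names `edit_distance` and `character_error_rate` are illustrative. Neither is taken from the memorandum.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition a perfect transcription scores 0, and the value can exceed 1 when the OCR output contains many spurious characters, because insertions are counted against the length of the reference text.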