Evaluating the Impact of OCR Errors on Topic Modeling

Main Authors: Mutuvi, Stephen; Doucet, Antoine; Odeo, Moses; Jatowt, Adam
Format: Book publication-section Journal
Language: eng
Published: Springer, Cham, 2018
Subjects:
Online Access: https://zenodo.org/record/2542539
Table of Contents:
  • Historical documents pose a challenge for character recognition for several reasons, including font disparities across materials, the lack of orthographic standards (the same word may be spelled differently), material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of such documents often yields unsatisfactory accuracy, leaving the digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus of historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.
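
The effect described in the abstract can be illustrated with a small, simplified sketch. It is not the authors' experimental setup: the character confusion table, the error-rate parameter, and the use of Jaccard overlap between top-term sets as a rough stand-in for stability are all assumptions made for illustration. The idea is to corrupt clean text with OCR-like substitutions and see how much the set of most frequent terms changes.

```python
import random
from collections import Counter

# Hypothetical OCR confusion pairs (assumed for illustration only).
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "s": "5"}


def add_ocr_noise(text, error_rate, seed=0):
    """Replace confusable characters with probability `error_rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def top_terms(text, k):
    """Return the set of the k most frequent whitespace-separated tokens."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    clean = "the press reported the election and the press reported the war"
    for rate in (0.0, 0.2, 0.5):
        noisy = add_ocr_noise(clean, rate, seed=1)
        overlap = jaccard(top_terms(clean, 3), top_terms(noisy, 3))
        print(f"error_rate={rate}: top-term overlap={overlap:.2f}")
```

In a fuller experiment, the top-term sets would come from the topics of a trained topic model rather than raw token counts, and coherence would be measured separately.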