Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Main Authors: Boroş, Emanuela, Hamdi, Ahmed, Linhares Pontes, Elvys, Cabrera-Diego, Luis-Adrián, Moreno, José G., Sidere, Nicolas, Doucet, Antoine
Format: Proceeding Journal
Language: eng
Published: 2021
Online Access: https://zenodo.org/record/4680697
ctrlnum 4680697
fullrecord <?xml version="1.0"?> <dc schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><creator>Boro&#x15F;, Emanuela</creator><creator>Hamdi, Ahmed</creator><creator>Linhares Pontes, Elvys</creator><creator>Cabrera-Diego, Luis-Adri&#xE1;n</creator><creator>Moreno, Jos&#xE9; G.</creator><creator>Sidere, Nicolas</creator><creator>Doucet, Antoine</creator><date>2021-04-12</date><description>This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.</description><identifier>https://zenodo.org/record/4680697</identifier><identifier>10.18653/v1/2020.conll-1.35</identifier><identifier>oai:zenodo.org:4680697</identifier><language>eng</language><relation>info:eu-repo/grantAgreement/EC/H2020/825153/</relation><relation>url:https://zenodo.org/communities/embeddia</relation><rights>info:eu-repo/semantics/openAccess</rights><rights>https://creativecommons.org/licenses/by/4.0/legalcode</rights><title>Alleviating Digitization Errors in Named Entity Recognition for Historical Documents</title><type>Journal:Proceeding</type><type>Journal:Proceeding</type><recordID>4680697</recordID></dc>
language eng
format Journal:Proceeding
Journal
Journal:Journal
author Boroş, Emanuela
Hamdi, Ahmed
Linhares Pontes, Elvys
Cabrera-Diego, Luis-Adrián
Moreno, José G.
Sidere, Nicolas
Doucet, Antoine
title Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
publishDate 2021
url https://zenodo.org/record/4680697
contents This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
id IOS16997.4680697
institution ZAIN Publications
institution_id 7213
institution_type library:special
library
library Cognizance Journal of Multidisciplinary Studies
library_id 5267
collection Cognizance Journal of Multidisciplinary Studies
repository_id 16997
subject_area Multidisciplinary
city Stockholm
province INTERNATIONAL
shared_to_ipusnas_str 1
repoId IOS16997
first_indexed 2022-06-06T03:11:48Z
last_indexed 2022-06-06T03:11:48Z
recordtype dc