Staff View: Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Main Authors:	Boroş, Emanuela, Hamdi, Ahmed, Linhares Pontes, Elvys, Cabrera-Diego, Luis-Adrián, Moreno, José G., Sidere, Nicolas, Doucet, Antoine
Format:	Proceeding Journal
Language:	eng
Published:	2021
Online Access:	https://zenodo.org/record/4680697

ctrlnum	4680697
fullrecord	<?xml version="1.0"?> <dc schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><creator>Boroş, Emanuela</creator><creator>Hamdi, Ahmed</creator><creator>Linhares Pontes, Elvys</creator><creator>Cabrera-Diego, Luis-Adrián</creator><creator>Moreno, José G.</creator><creator>Sidere, Nicolas</creator><creator>Doucet, Antoine</creator><date>2021-04-12</date><description>This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.</description><identifier>https://zenodo.org/record/4680697</identifier><identifier>10.18653/v1/2020.conll-1.35</identifier><identifier>oai:zenodo.org:4680697</identifier><language>eng</language><relation>info:eu-repo/grantAgreement/EC/H2020/825153/</relation><relation>url:https://zenodo.org/communities/embeddia</relation><rights>info:eu-repo/semantics/openAccess</rights><rights>https://creativecommons.org/licenses/by/4.0/legalcode</rights><title>Alleviating Digitization Errors in Named Entity Recognition for Historical Documents</title><type>Journal:Proceeding</type><type>Journal:Proceeding</type><recordID>4680697</recordID></dc>
language	eng
format	Journal:Proceeding Journal Journal:Journal
author	Boroş, Emanuela Hamdi, Ahmed Linhares Pontes, Elvys Cabrera-Diego, Luis-Adrián Moreno, José G. Sidere, Nicolas Doucet, Antoine
title	Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
publishDate	2021
url	https://zenodo.org/record/4680697
contents	This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
id	IOS16997.4680697
institution	ZAIN Publications
institution_id	7213
institution_type	library:special library
library	Cognizance Journal of Multidisciplinary Studies
library_id	5267
collection	Cognizance Journal of Multidisciplinary Studies
repository_id	16997
subject_area	Multidisciplinary
city	Stockholm
province	INTERNATIONAL
shared_to_ipusnas_str	1
repoId	IOS16997
first_indexed	2022-06-06T03:11:48Z
last_indexed	2022-06-06T03:11:48Z
recordtype	dc
_version_	1734897577655009280
score	17.610363
