DiSCourse Seminar with Antoine Doucet
13 September 2024, 12:00 (CEST), hybrid
Digital Science Center, Innrain 15, 1st floor, Open Space Area, or online via BigBlueButton
DiSCourse* - The Digital Science Seminar Series on:
Information Extraction from Noisy Text Output
Many documents can only be made accessible for automatic analysis in the form of digitised images. This is particularly true of historical and handwritten documents, but also of many born-digital documents that have been converted into image form for various reasons (e.g. file conversion, or a passage through analogue form to add a handwritten signature or to be sent by post).
Analysing the textual content of such digitised documents requires a conversion phase from the captured image to a textual representation, a key part of which is optical character recognition (OCR) or handwritten text recognition (HTR). The resulting text is often imperfect, to a degree that correlates in particular with the quality of the original medium (which may be stained, folded, aged, etc.) and with the quality of the image taken of it.
This lecture will present recent advances in AI and natural language processing that make it possible to analyse this type of corpus in a way that is robust to OCR errors. For example, we will show how, as part of the NewsEye project, we set the state of the art in cross-lingual recognition and disambiguation of named entities (place names, but also names of people and organisations) in historical newspaper corpora written in four languages between 1850 and 1950, despite the particularly degraded state of these corpora. Results of this kind pave the way for indexing at an advanced semantic level, as well as for large-scale analysis that can, in particular, cross linguistic borders.
*featuring a distinguished guest: Antoine Doucet, La Rochelle University
Antoine Doucet is a full professor at the L3i laboratory at La Rochelle University, where he heads the "Images and Content" research team (around 50 people). Working at the intersection of information retrieval, natural language processing, text mining and artificial intelligence, he focuses on developing methods that scale to very large document collections and apply to documents of all types written in any language: from press articles to social networks, and from digitised manuscripts to born-digital documents. Until 2022, he was the principal investigator of H2020 NewsEye (a digital investigator for historical newspapers), developing cutting-edge approaches for the robust processing of noisy, multilingual natural language. He also led the semantic enrichment effort for low-resource languages in H2020 Embeddia. He currently chairs the steering committee of the TPDL (Theory and Practice of Digital Libraries) conference series.
He will be visiting the University of Innsbruck in August and September 2024 as part of the LFUI Guest Professorship, which is supported by the Circle of Supporters (Förderkreis 1669) and International Services.