Weakly Supervised Information Extraction from Semi-Structured Document Images

Fabian Wolf, Oliver Tueselmann and Gernot A. Fink
Family History Technology Workshop, 2024.

Salt Lake City, UT, USA


Abstract

Throughout the 20th century, the modern state produced semi-structured files and forms to organize and handle rapidly expanding amounts of data with increasing speed and efficiency. In Germany, large card file indices, often created to store person-related data, spread and grew into massive and ubiquitous datafication systems. As historical sources, the surviving archival holdings of semi-structured mass data have long been used for qualitative research in small samples or case studies. Making large collections of thousands or even millions of forms or index cards machine-readable was not feasible by manual transcription. Recent developments in automated recognition technologies that can recognize machine-written and handwritten text in semi-structured historical files promise to change that and open up entire collections to systematic data extraction and in-depth indexing. While traditional approaches rely on the manual creation of labeled training sets, we investigate an annotation-free process that avoids this manual labeling effort.