Academic Paper

Improving OCR of historical newspapers and journals published in Finland
Document Type
Conference
Source
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pp. 97-102
Subject
Language
English
Abstract
This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is set in both Blackletter and Antiqua fonts. We investigate how much training data is needed to train high-accuracy models, and we attempt to train a joint model for both languages and all fonts. So far we have not obtained a single model that is best for every case, but the mixed model gives the best results on the Finnish test set, with a character accuracy rate (CAR) of 95 %, which clearly surpasses previous results on this data set.
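The abstract reports its results as CAR. As a minimal sketch, assuming CAR is computed as one minus the character-level edit distance divided by the length of the ground-truth text (a common definition, not confirmed by this record), the metric could be calculated as follows. The function names `edit_distance` and `character_accuracy_rate` are hypothetical and chosen for illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy_rate(ref: str, hyp: str) -> float:
    """Assumed definition: CAR = 1 - edit_distance / len(ground truth)."""
    if not ref:
        raise ValueError("ground-truth text must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four gives a CAR of 0.75.
    print(character_accuracy_rate("abcd", "abxd"))
```

Under this definition, a CAR of 95 % corresponds to roughly five character errors per hundred ground-truth characters.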
