Huge amounts of handwritten historical documents are being published by online digital libraries worldwide. However, for these raw digital images to be really useful, they need to be annotated with informative content. The tranScriptorium project aims to develop innovative, cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using Handwritten Text Recognition (HTR) technology. High-quality handwritten text transcription will be approached in tranScriptorium by means of interactive techniques which minimize user intervention. The resulting technologies will be integrated into a set of HTR tools that will be implemented both in a content provider portal and in a specialized HTR web portal for crowdsourcing transcription. These implementations will ensure uptake and validation of the HTR technology.
Existing Optical Character Recognition (OCR) systems are not suitable for handwritten text recognition, because characters in images of handwritten text cannot be reliably isolated automatically. Adequately transcribing this type of text image requires holistic approaches, often referred to as “segmentation-free off-line HTR”. In these approaches, all text elements (sentences, words and characters) are recognized as a whole, without any prior segmentation of the image into these elements. Recent HTR technology borrows concepts and methods from the field of Automatic Speech Recognition (ASR), such as Hidden Markov Models and N-grams. Applying this technology to handwritten document images requires document image processing steps such as layout analysis and text line extraction. Given the large amount of data, these steps must be automated as much as possible. This is a challenge because, so far, standard rules for handwritten documents have been lacking. Once the text line images are available, the HTR models can be obtained fully automatically using powerful, well-known training techniques, which require only the (whole) transcripts of a reasonably small number of these (unsegmented) line images.
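As a rough illustration of the N-gram idea mentioned above, the following Python sketch trains a bigram language model with add-one smoothing and uses it to compare two candidate transcriptions of a line. It is not the tranScriptorium implementation: the function names, the toy corpus and the choice of smoothing are assumptions made for this example, and a real HTR system would combine such a language model with optical (for instance HMM-based) scores computed from the line image.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenised sentences.

    "<s>" and "</s>" are sentence boundary markers (an assumption of this sketch).
    """
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent token pairs
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab


def log_prob(words, model):
    """Log-probability of a word sequence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    v = len(vocab)
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total


if __name__ == "__main__":
    # Toy training transcripts (hypothetical data).
    corpus = [
        ["the", "old", "letter"],
        ["the", "old", "book"],
        ["an", "old", "letter"],
    ]
    model = train_bigram(corpus)
    for candidate in (["the", "old", "letter"], ["letter", "old", "the"]):
        print(candidate, log_prob(candidate, model))
```

The candidate whose word order matches the training transcripts receives the higher score, which is how a language model helps a recognizer prefer plausible word sequences over implausible ones.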
Despite recent significant improvements, currently available HTR technologies are still far from offering fully automated solutions for transcription. In this project, we will turn HTR into a mature technology by addressing the following objectives:
- Enhancing HTR technology for efficient transcription. Starting from state-of-the-art HTR approaches, tranScriptorium will capitalize on interactive-predictive techniques for effective and user-friendly computer-assisted transcription. In addition, the automatic or semi-automatic production of partial transcripts, which can serve as relevant document metadata, will be explored. We aim to demonstrate that useful results can be obtained on real document collections with affordable manual transcription and supervision effort.
- Bringing the HTR technology to users. Expected users of the HTR technology belong mainly to two groups: a) individual researchers with experience in handwritten document transcription who are interested in transcribing specific documents. These users will benefit greatly from the assistance of an interactive HTR tool in the transcription task. For this kind of user, the HTR tools will be available through handwritten text image content provider portals, as an added service; b) volunteers who collaborate in large transcription projects. For this kind of user, the HTR tools will be available through a specialized crowdsourcing web portal which provides support for structured collaborative work, including expert supervision.
- Integrating the HTR results in public web portals. The HTR technology will support the digitization of handwritten materials. Most digital libraries nowadays attach the output of modern OCR to the digitized pages of text documents; however, so far, only printed text images can be presented in this way. In a similar way, the outcomes of the tranScriptorium tools will be attached to the published handwritten document images. This includes not only full, correct transcriptions produced with the interactive HTR transcription techniques, but also partially correct transcriptions and other kinds of automatically produced metadata, useful for indexing and searching based on Key Word Spotting (KWS) techniques. These (meta-)data will give visitors and users conventional full-text search, copy/paste and text print options with which to exploit the available handwritten resources.
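The interactive-predictive idea named in the first objective above can be sketched in a few lines of Python. The sketch assumes the recognizer has already produced a scored list of alternative transcriptions for a line (an n-best list); the user validates or corrects a prefix, and the system proposes the best continuation consistent with that prefix. The function name, the data and the use of a fixed n-best list are all assumptions made for illustration: an actual system would more likely constrain or repeat the recognizer's own search rather than filter a fixed list.

```python
def predict_suffix(nbest, prefix):
    """Return the continuation of the best-scoring hypothesis that starts with `prefix`.

    nbest  : list of (words, score) pairs, higher score meaning more likely
    prefix : list of words the user has already validated or corrected
    """
    consistent = [
        (score, words)
        for words, score in nbest
        if words[:len(prefix)] == prefix
    ]
    if not consistent:
        return []  # no hypothesis agrees with the user; nothing to propose
    _, best = max(consistent, key=lambda item: item[0])
    return best[len(prefix):]


if __name__ == "__main__":
    # Hypothetical recognizer output for one text line.
    nbest = [
        (["clear", "sir", "I", "write"], -1.8),
        (["dear", "sir", "I", "write"], -2.1),
        (["dear", "sin", "I", "write"], -3.0),
    ]
    # Initial proposal, before any user input.
    print(predict_suffix(nbest, []))
    # The user corrects the first word; the remainder is re-predicted.
    print(predict_suffix(nbest, ["dear"]))
```

Each user correction extends the validated prefix, so the user only needs to fix the first wrong word at each step instead of retyping the whole line.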
Within the span of the tranScriptorium project, we intend to apply the developed HTR technology to historical documents in cursive handwriting, for which only HTR technology can offer appropriate solutions. Preference will be given to documents for which basic metadata or other useful resources are at least partially available: vocabularies, (partial) transcripts from other sources, transcribed text from the same writer, etc. tranScriptorium will focus on four languages: Spanish, German, English and Dutch. This choice is intended not only to illustrate the applicability of the technology to different languages, but also to stimulate the uptake and validation of the technology by a wider audience.