Lemmatising Early Modern English: an LLM-assisted experiment

Thursday 7 May 2026 10:15-11:30,
Hulda Garborgs hus,
HG N-105.

A Språkforum talk by Anna Cichosz and Piotr Pęzik, University of Łódź.

In this talk we present an experiment in which we developed an LLM-assisted procedure for lemmatising the Penn–Helsinki Parsed Corpus of Early Modern English (PPCEME; c. 1.7 million words; Kroch et al. 2004), and Early Modern English (EME) texts in general. First, we engineered a set of annotation guidelines for lemmatising word forms found in PPCEME on the basis of context and morphosyntactic tags. All the lemmatisations were then checked manually by a team of annotators; the overall accuracy of the procedure, based on GPT-4o-mini, was between 95% and 97% with some bypasses applied. We will present an in-depth quantitative and qualitative analysis of the results, identifying the main strengths and weaknesses of the model. The LLM-based procedure will also be compared to VARD, a popular tool for normalising EME spelling. Finally, we will show how we adapted the prompt to work without morphosyntactic annotation and fine-tuned a smaller, specialised model for lemmatising EME texts, for which we report the highest accuracy values of 96-97% on held-out evaluation datasets.

Anna Cichosz is associate professor at the Institute of English Studies, University of Łódź (Poland). Her research focuses on historical linguistics, in particular the history of English and other Germanic languages, Old English syntax and phraseology (especially word and constituent order) and the influence of Latin on Old English.

Piotr Pęzik is associate professor at the Institute of English Studies, University of Łódź (Poland), where he runs the Department of Corpus and Computational Linguistics. His research interests include general and English linguistics, phraseology, corpus linguistics, computational linguistics and natural language processing, information retrieval and information extraction.