
Shagi / Steps

ISSN 2412-9410, 2782-1765

The Russian Presidential Academy of National Economy and Public Administration

10.22394/2412-9410-2021-7-1-183-198

steps-571

Research Article

Articles

Identifying Latin authors through maximum-likelihood Dirichlet inference: A contribution to model-based stylometry

Dmitry S. Nikolaev

dnikolaev@fastmail.com

Mikhail V. Shumilin

mvlshumilin@gmail.com

Stockholm University

A. M. Gorky Institute of World Literature of the Russian Academy of Sciences

Vol. 7, No. 1 (2021), pp. 183-198


Nikolaev D. S., Shumilin M. V.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://steps.ranepa.ru/jour/article/view/571

The article proposes a new algorithm for identifying the authors of Latin prose texts, based on Burrows's Delta and the Dirichlet distribution. To demonstrate the algorithm's effectiveness, we analyze text fragments by 36 authors of the classical and medieval periods. Our algorithm achieves results comparable to those obtained with Random Forest, one of the most powerful general-purpose classification algorithms. The advantage of our algorithm is that it requires very little time and computational resources to train, is easy to implement in any general-purpose programming language, and is trivial to parallelize. Moreover, because the algorithm is based on an explicit model of text generation, the parameters of the trained model are interpretable: the precision of a distribution (the sum of its parameters) corresponds directly to the stylistic homogeneity of the corresponding author's texts.

The last two decades have seen a dramatic increase in the number of papers published on stylometry, which is often narrowly understood as the task of identifying the author of a particular text fragment from its stylistic properties. We present a new lightweight algorithm for stylometric identification of authors of Latin prose texts. It is based on Burrows’s Delta, computed over the relative frequencies of 244 manually selected genre- and topic-neutral words, and on the Dirichlet distribution, whose parameters we estimate using an iterative maximum-likelihood algorithm. To demonstrate the effectiveness of the method, we present a case study of 3000-word fragments of texts by 36 classical and medieval authors and show that our method performs on par with Random Forest, a powerful general-purpose classification algorithm. We provide summary statistics of our algorithm’s performance together with confusion matrices demonstrating the pairwise discriminability of texts by different authors. The advantages of our method are that it is very simple to implement, very quick to train and to run inference with, and, being model-based, highly interpretable: the precision of each fitted Dirichlet distribution directly corresponds to the stylistic homogeneity of the texts by the corresponding author. This makes it possible to use the algorithm as a general research tool in Latin stylistics.
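The inference step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each fragment is represented as a vector of strictly positive relative frequencies summing to 1, and that one Dirichlet parameter vector has already been fitted per author. The function names, author labels, and parameter values below are hypothetical, and the maximum-likelihood fitting step is omitted.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at x, a point on the probability simplex.

    All entries of x must be strictly positive, since their logarithms are taken.
    """
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    # Log of the normalizing constant: Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def precision(alpha):
    """Precision of a Dirichlet distribution: the sum of its parameters."""
    return sum(alpha)


def most_likely_author(x, models):
    """Return the author whose Dirichlet model assigns x the highest log density."""
    return max(models, key=lambda name: dirichlet_log_pdf(x, models[name]))


# Hypothetical parameters for two authors over three word categories.
# author_a has high precision (stylistically homogeneous texts under the model),
# author_b has low precision (more variable texts).
models = {
    "author_a": [40.0, 30.0, 30.0],
    "author_b": [2.0, 6.0, 2.0],
}

# Hypothetical relative frequencies for one fragment (they sum to 1).
fragment = [0.42, 0.29, 0.29]

print(most_likely_author(fragment, models))
print({name: precision(alpha) for name, alpha in models.items()})
```

Because each author's model is scored independently, the loop over authors can be distributed across workers without coordination, which is consistent with the claim that the method is easy to parallelize.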

stylometry, Latin literature, Dirichlet distribution, Burrows's Delta, random forest, text attribution, stylistic analysis, machine learning


The authors declare that there are no conflicts of interest present.