About me

I am an applied linguist and researcher specializing in the design and evaluation of language resources and technologies, with a focus on their application in corpus linguistic research.

I currently split my time between the Centre for Language Resources and Technologies at the University of Ljubljana and CLARIN.SI at the Jožef Stefan Institute.

For more info, see my full CV here, or view my profiles on Google Scholar and ResearchGate.

Research areas

Corpus annotation and analysis
Language technology and evaluation
Language description and variation

A common thread in my work is the use of state-of-the-art resources and tools for data-driven exploration of how language functions in different communicative settings. While the topics have ranged widely (from morphology, syntax, and lexis to semantics, discourse, and formulaic language), they share a focus on empirical analysis grounded in richly annotated corpora. I also apply this expertise in the development and evaluation of language technologies of various kinds.

Current projects

LLM4DH: Large Language Models for Digital Humanities (ARIS-Gravity, 2024‒2027)
- T2.3 - Advanced grammatical analysis of multilingual corpora
AI4DH: Centre of Excellence in Artificial Intelligence for Digital Humanities (Horizon Europe ERA Chair, 2025-2030)
- WP2 - Infrastructure and Research Challenges
UniDive: Universality, diversity and idiosyncrasy in language technology (CA21167 COST Action, 2022-2026)
- WG1 - Corpus Annotation

Selected past projects

SPOT: Treebank-Driven Approach to the Study of Spoken Slovenian (PI, 2022‒2025)
SLOKIT: CLARIN.SI tool for corpus data analysis and summarization (2022-2023)
DSDE: Development of Slovene in a Digital Environment (2020-2023)
ELEXIS: European Lexicographic Infrastructure (2020-2023)
SLED: Monitor Corpus for Slovene and Related Resources (2021-2022)
NSSS: New grammar of contemporary standard Slovene (2017-2020)
Language Technology Seminars for Teachers (2013-2014)

Publications

For a full list, please see the SICRIS database.

News archive

October 2024: Excited to announce that SyntaxFest 2025 will take place in Ljubljana in August 2025, bringing together five workshops (TLT, UDW, DepLing, IWPT, and Quasy) and two UniDive pre-conference events.

July 2024: Release of STARK v3 – a significantly enhanced version of this versatile tool for bottom-up linguistic analysis and comparison of UD treebanks.

October 2023: Honoured to give an invited talk on 'Cross-lingually Harmonized Approaches to Spoken Data Annotation' at SPELLL 2023.

July 2023: Join us at ESSLLI 2023, the European Summer School in Logic, Language, and Information, hosted by the University of Ljubljana, where I'll be serving as the Local PC Chair.

October 2022: Very excited to learn that my postdoctoral project proposal 'A Treebank-Driven Approach to the Study of Spoken Slovenian' has been selected for funding.

September 2022: Kick-off meeting of the UniDive COST Action on universality, diversity, and idiosyncrasy in language technology. I am honoured to have been elected co-leader of WG1 on Corpus Annotation.

May 2022: Looking forward to LREC 2022 in Marseille, where I will be presenting a paper on spoken language treebanks (main conference) and a paper on the SSJ treebank extension (LAW workshop).

March 2022: I was invited as a speaker at the ESFRI 20th anniversary conference to present the CLARIN infrastructure and its impact on my research work. The presentation was also featured as a CLARIN Impact Story.

October 2021: Kick-off meeting for project SLED: Monitor Corpus for Slovene and Related Language Resources.

July 2021: Launch of the DSDE Universal Dependencies annotation campaign aiming at 5,000 new manually parsed sentences for Slovenian.

April 2021: I co-organized the EACL 2021 Language Diversity Games as part of the Language Diversity Panel and Games event at EACL 2021.

March 2021: I joined the Development of Slovene in a Digital Environment project to work on the SSJ UD treebank extension, the evaluation of the CLASSLA-Stanza pipeline, and the GOS spoken corpus concordancer.

Kaja Dobrovoljc