This repository contains the code, processing scripts, and resources for the project “Tool-Augmented Language Models for Precision Analysis of Ancient/Koine Greek Texts Using a Graph-based Lexicon”.
The project introduces a novel methodology that enhances Large Language Models (LLMs) with a dedicated, real-time lexical tool to achieve high-precision morphosyntactic and functional grammar analysis of Ancient Greek texts.
An interactive website to explore the parsed results is available here: https://wmotte.github.io/llm_tool_greek_lexicon/docs/
Computational analysis of Ancient Greek has traditionally been hampered by the limitations of existing methods.
This project bridges this methodological gap by developing a system capable of exact lemma identification and subsequent linguistic analysis, moving beyond purely rule-based and similarity-based approaches.
We present a tool-augmented modeling approach that integrates LLMs with real-time, dynamic access to a structured lexical database. This architecture moves beyond simple context-stuffing (RAG) and allows the LLM to actively query a specialized tool for precise information when needed.
The core components of our system are:
The basis of our tool is a machine-readable knowledge graph built from an up-to-date Greek-Dutch Lexicon (SvBKR), a scholarly dictionary covering Greek vocabulary from Homer to the second century CE.
The construction process involved several key steps:
LexicalEntry node with properties for grammar, glosses, and source info.
This graph structure enables efficient querying of lemma frequency, contextual usage, and intertextual relationships. The processing scripts used to build the graph are available in this repository to support replication and adaptation.
The LLM interacts with the knowledge graph using Cypher queries. This allows for precise, efficient retrieval of lexical data.
1. Single Lemma Lookup (Exact Match)
This query finds a lemma by matching its normalized, accent-free form.
// This query locates the exact lemma λόγος by matching its normalized form ‘logos’
MATCH (l:Lemma)-[:HAS_ENTRY]->(e:Entry)-[:BELONGS_TO]->(d:Dictionary)
WHERE l.text_no_accents = "logos"
AND d.name = "SvBKR"
RETURN l.text, e.text
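The lookup above matches on a normalized form such as "logos" rather than on the accented Greek spelling. The following is a minimal sketch of how such a form could be derived, not the repository's actual normalization code. The function name `normalize_lemma` and the transliteration table `GREEK_TO_LATIN` are hypothetical; the table was chosen only so that it reproduces the four forms shown in this README ("logos", "eimi", "theos", "anthropos"), and the real scheme may treat letters such as η, υ, or rough breathings differently.

```python
import unicodedata

# Hypothetical Greek-to-Latin table (an assumption, not taken from the repository).
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def normalize_lemma(lemma: str) -> str:
    """Return a lowercase, accent-free Latin form of a Greek lemma."""
    # NFD splits each letter from its accents and breathing marks.
    decomposed = unicodedata.normalize("NFD", lemma.lower())
    # Drop every combining mark (accents, breathings, iota subscript).
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Transliterate letter by letter; unknown characters pass through unchanged.
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in bare)
```

Under these assumptions, `normalize_lemma("λόγος")` yields `"logos"`, which could then be used as the value compared against `l.text_no_accents`.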
2. Batch Processing for Multiple Lemmas
This demonstrates the efficiency of retrieving multiple entries in a single call.
// This batch query simultaneously retrieves lexical entries for multiple lemmas
MATCH (l:Lemma)-[:HAS_ENTRY]->(e:Entry)-[:BELONGS_TO]->(d:Dictionary)
WHERE l.text_no_accents IN ["eimi", "logos", "theos", "anthropos"]
AND d.name = "SvBKR"
RETURN l.text, l.text_no_accents, e.text
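When the list of lemmas is produced at run time, the batch query can be written with Cypher parameters (`$forms`, `$dictionary`) instead of inlined literals. The sketch below only builds the query text and its parameter dictionary; it does not open a database connection. The helper name `build_batch_lookup` and both parameter names are illustrative assumptions, and the node labels and properties are copied from the queries above.

```python
from typing import Any, Dict, Iterable, Tuple

# Same pattern as the batch query above, with literals replaced by parameters.
BATCH_QUERY = """
MATCH (l:Lemma)-[:HAS_ENTRY]->(e:Entry)-[:BELONGS_TO]->(d:Dictionary)
WHERE l.text_no_accents IN $forms
  AND d.name = $dictionary
RETURN l.text, l.text_no_accents, e.text
""".strip()


def build_batch_lookup(
    forms: Iterable[str], dictionary: str = "SvBKR"
) -> Tuple[str, Dict[str, Any]]:
    """Return (query, parameters) for one batch lookup of normalized lemmas."""
    # Trim and lowercase each form, skipping empty strings.
    cleaned = (form.strip().lower() for form in forms if form.strip())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_forms = list(dict.fromkeys(cleaned))
    return BATCH_QUERY, {"forms": unique_forms, "dictionary": dictionary}
```

The returned pair can be handed to whichever graph client is in use. Passing values as parameters keeps lemma strings out of the query text, so quotes or unusual characters in a form cannot alter the query.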
The system’s performance was systematically evaluated using three distinct and challenging Greek texts, each chosen for a specific purpose.
χωρὶς θεοῦ vs. χάριτι θεοῦ) to evaluate the model’s ability to handle subtle but critical input changes, a misguided attention test against the “Einstellung effect”.
The results, which can be explored on the interactive website, demonstrate that the tool-augmented approach provides good performance in both morphosyntactic parsing and functional grammar analysis across all test cases.
If you use this work in your research, please cite the following manuscript:
Otte, W. M., van Wieringen, A. L. H. M., & Koet, B. J. (in preparation). Tool-Augmented Language Models for Precision Analysis of Ancient/Koine Greek Texts Using a Graph-based Lexicon.
This project is licensed under the CC0 1.0 Universal License. See the LICENSE file for details.