While large language models (LLMs) are now the state of the art in natural language processing (NLP) worldwide, most languages in South Africa are not only severely under-resourced but also present unique modelling challenges due to their agglutinating morphology, with the result that LLMs remain out of reach for these languages. Modern rule-based approaches are uniquely positioned to “stand in the gap”, as it were, and make it possible to include our languages in the digital space despite these challenges.
In this workshop, we give a brief overview of the current state of the art in rule-based NLP as applied to the South African Bantu languages. Rule-based NLP is introduced within the broader NLP context, and the linguistic features of the South African Bantu languages are described in order to highlight the challenges they pose for NLP. Finite-state transducers are presented as an effective formalism for modelling Bantu language morphology, with examples drawn from the implementation of isiZulu morphology in the FOMA framework.
In order to model both morphology and syntax, perhaps more accurately referred to simply as morphosyntax, Grammatical Framework (GF) is introduced: a formalism and programming language for developing multilingual computational grammars. The workshop includes the systematic development of a computational grammar that implements a small language fragment, in order to show how agreement and inflection/derivation are modelled using GF. The final two lectures are dedicated to a group project in which participants contribute implementations of the same fragment in their own language or a Bantu language of their choice, resulting in a multilingual Bantu language grammar for simple sentences.
This workshop is aimed at anyone who wishes to learn about and gain hands-on experience in rule-based NLP within the South African context. While programming in GF (a functional programming language) will form part of the last two lectures of the workshop, programming skills are not a prerequisite. Participants with either a computational or a linguistic background are invited to attend the workshop.
The workshop will take the form of ten 30-minute lectures, with the final two lectures
dedicated to a group project. Lectures may be grouped in pairs if necessary.
Each participant who wishes to contribute to the group project should have access to a
laptop. If possible, participants should have Python 3 installed, although help will be provided
in this regard by the presenters. An internet connection will also be required to install the
“pgf” Python package via pip, and GitHub (or a similar platform) will be used to coordinate
the group project.
Laurette Marais is a senior researcher in the Voice Computing Research Group at the CSIR. Her background is in theoretical computer science, having obtained her PhD in Computer Science from Stellenbosch University in the field of descriptional complexity in 2018. However, she has been involved in NLP research for the South African languages since 2009, when she attended the first ever GF Summer School in Gothenburg, Sweden. She is co-developer of the Afrikaans and isiZulu GF Resource Grammars, and is currently leading a research and development project, called Ngiyaqonda!, wherein GF application grammars are utilised alongside speech technology to support literacy development in foundation phase learners.
Laurette Pretorius is emeritus (Unisa) and Extraordinary Professor of Computer Science at Stellenbosch University. She holds postgraduate qualifications in computer science, pure mathematics and applied mathematics from the universities of Stellenbosch, South Africa, Pretoria and North-West. Her research mainly concerns the natural language processing (NLP) of the resource-scarce South African languages. She is co-developer of ZulMorph and other finite-state morphological analysers for the Nguni languages and Setswana, as well as co-developer of the Afrikaans and isiZulu GF Resource Grammars. She is a collaborator on various NLP projects for the South African languages with the Voice Computing Research Group of the CSIR.