The Ada Stemmer Library provides several stemming algorithms that can be used in natural language analysis to find the base or root form of a word.
Ada Stemmer Library
By Stephane Carrez, 2020-05-16 07:55:00
Stemming is not new: it was first introduced in 1968 by Julie Beth Lovins, a computational linguist who created the first such algorithm, known today as the Lovins stemming algorithm. Her algorithm significantly influenced later algorithms such as the Porter stemmer, which is now a common stemming algorithm for English words. These algorithms are specific to the English language and will not work for French, Greek or Russian.
To support several natural languages, it is necessary to have several algorithms. The Snowball stemming algorithms project provides such support through a specific string processing language, a compiler and a set of algorithms for various natural languages. The Snowball compiler has been adapted to generate Ada code (See Snowball Ada on GitHub).
The Ada Stemmer Library integrates stemming algorithms for: English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian. The Snowball compiler provides several other algorithms but they are not integrated yet: their integration is left as an exercise to the reader.
Stemmer Overview
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. A Snowball script describes a set of rules which are applied and checked on an input word, or some portion of it, in order to eliminate or replace some terms. The stemmer will usually transform a plural into a singular form, reduce the multiple forms of a verb, find the noun from an adverb and so on. Romance, Germanic and Scandinavian languages share some common rules, but each language needs its own Snowball algorithm. The Snowball project provides a detailed list of stemming algorithms for various natural languages. This list is available at: https://snowballstem.org/algorithms/
The Snowball compiler reads the Snowball script and generates the stemmer implementation for a given target programming language such as Ada, C, C#, Java, JavaScript, Go, Python, Rust. The Ada Stemmer Library contains the generated algorithms for several natural languages. The generated stemmers are not able to recognize the natural language and it is necessary to tell the stemmer library which natural language you wish to use.
The Ada Stemmer Library supports only UTF-8 strings, which simplifies both the implementation and the API. The library only uses the Ada String type to handle strings.
Setup
To use the library, you should run the following commands:
git clone https://github.com/stcarrez/ada-stemmer.git
cd ada-stemmer
make build install
This will fetch, compile and install the library. You can then add the following line in your GNAT project file:
with "stemmer";
Stemming examples
Each stemmer algorithm works on a single word at a time. The Ada Stemmer Library does not split words. You have to give it one word at a time to stem and it returns either the word itself or its stem. The Stemmer.Factory package is the multi-language entry point. The stemmer algorithm is created for each call. The following simple code:
with Stemmer.Factory; use Stemmer.Factory;
with Ada.Text_IO; use Ada.Text_IO;
...
Put_Line (Stem (L_FRENCH, "chienne"));
will print the string:
chien
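Since the library does not split text into words, the caller has to do it. The following is a minimal sketch of one way to do that, assuming words are separated by single spaces and that no punctuation needs to be removed. The procedure name Stem_Sentence and the sample sentence are hypothetical; only the Stem function and the L_FRENCH value come from the example above.

```ada
with Ada.Text_IO;
with Stemmer.Factory;

procedure Stem_Sentence is
   use Stemmer.Factory;

   --  Hypothetical sample input: words separated by single spaces.
   Sentence : constant String := "les chiennes aboyaient";

   --  Index of the first character of the word being scanned.
   First    : Positive := Sentence'First;
begin
   --  Scan one position past the end so that the last word is flushed too.
   for Pos in Sentence'First .. Sentence'Last + 1 loop
      --  A word ends at a space or at the end of the sentence.
      if Pos > Sentence'Last or else Sentence (Pos) = ' ' then
         if Pos > First then
            --  Stem the word with the multi-language entry point.
            Ada.Text_IO.Put_Line
              (Stem (L_FRENCH, Sentence (First .. Pos - 1)));
         end if;
         First := Pos + 1;
      end if;
   end loop;
end Stem_Sentence;
```

Real text usually needs a more careful tokenizer (punctuation, apostrophes, letter case), which is outside the scope of the library.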
When multiple words must be stemmed, it may be better to declare an instance of the stemmer and use the same instance to stem several words. The Stem_Word procedure can be called with each word and it returns a boolean that indicates whether the word was stemmed or not. The result is obtained by calling the Get_Result function. For example,
with Stemmer.English;
with Ada.Text_IO; use Ada.Text_IO;
..
Ctx : Stemmer.English.Context_Type;
Stemmed : Boolean;
..
Ctx.Stem_Word ("zealously", Stemmed);
if Stemmed then
   Put_Line (Ctx.Get_Result);
end if;
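The fragment above can be turned into a complete program that reuses one context for several words. This is only a sketch: the procedure names Stem_Many and Show and the extra sample words are hypothetical, while Context_Type, Stem_Word and Get_Result are used exactly as in the fragment.

```ada
with Ada.Text_IO;
with Stemmer.English;

procedure Stem_Many is
   --  A single stemmer instance shared by all the calls below.
   Ctx : Stemmer.English.Context_Type;

   --  Stem one word with the shared context and print the outcome.
   procedure Show (Word : in String) is
      Stemmed : Boolean;
   begin
      Ctx.Stem_Word (Word, Stemmed);
      if Stemmed then
         Ada.Text_IO.Put_Line (Word & " -> " & Ctx.Get_Result);
      else
         Ada.Text_IO.Put_Line (Word & " (not stemmed)");
      end if;
   end Show;
begin
   --  Hypothetical sample words.
   Show ("zealously");
   Show ("running");
   Show ("cats");
end Stem_Many;
```

Wrapping the call in a local procedure keeps the Stemmed flag next to the call that sets it, so Get_Result is only read after a successful Stem_Word.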
Integrating a new Stemming algorithm
Integrating a new stemming algorithm is quite easy, but it requires installing the Snowball Ada compiler:
git clone --branch ada-support https://github.com/stcarrez/snowball
cd snowball
make
The Snowball compiler needs the path of the stemming algorithm, the target programming language, the name of the Ada child package that will contain the generated algorithm and the target path. For example, to generate the Lithuanian stemmer, the following command can be used:
./snowball algorithms/lithuanian.sbl -ada -P Lithuanian -o stemmer-lithuanian
You will then get two files: stemmer-lithuanian.ads and stemmer-lithuanian.adb. After integration of the generated files in your project, you can access the generated stemmer with:
with Stemmer.Lithuanian;
..
Ctx : Stemmer.Lithuanian.Context_Type;
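A generated stemmer would then presumably be used like the English one. The sketch below assumes that the generated Stemmer.Lithuanian.Context_Type supports Stem_Word and Get_Result in the same way as Stemmer.English.Context_Type; the procedure name and the sample word are hypothetical.

```ada
with Ada.Text_IO;
with Stemmer.Lithuanian;

procedure Stem_Lithuanian is
   Ctx     : Stemmer.Lithuanian.Context_Type;
   Stemmed : Boolean;

   --  Hypothetical sample word, chosen only for illustration.
   Word    : constant String := "namai";
begin
   --  Assumption: the generated context supports Stem_Word and
   --  Get_Result like Stemmer.English.Context_Type does.
   Ctx.Stem_Word (Word, Stemmed);
   if Stemmed then
      Ada.Text_IO.Put_Line (Ctx.Get_Result);
   else
      --  Fall back to the original word when nothing was stemmed.
      Ada.Text_IO.Put_Line (Word);
   end if;
end Stem_Lithuanian;
```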
Conclusion
Thanks to the Snowball compiler and its algorithms, it is possible to do some natural language analysis. With version 1.0 of the Ada Stemmer Library now available on GitHub, you can start doing natural language analysis in Ada!