Ada Stemmer Library

By Stephane Carrez

The Ada Stemmer Library provides several stemming algorithms that can be used in natural language analysis to find the base or root form of a word.

Stemming is not new: it was first introduced in 1968 by Julie Beth Lovins, a computational linguist who created the first stemming algorithm, known today as the Lovins stemming algorithm. Her algorithm has significantly influenced later ones such as the Porter stemmer, which is now a common stemming algorithm for English words. These algorithms are specific to the English language and will not work for French, Greek or Russian.

To support several natural languages, it is necessary to have several algorithms. The Snowball stemming algorithms project provides such support through a specific string processing language, a compiler and a set of algorithms for various natural languages. The Snowball compiler has been adapted to generate Ada code (See Snowball Ada on GitHub).

The Ada Stemmer Library integrates stemming algorithms for: English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian. The Snowball compiler provides several other algorithms but they are not integrated yet: their integration is left as an exercise to the reader.

Stemmer Overview

Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. A Snowball script describes a set of rules which are applied and checked on an input word, or some portion of it, in order to eliminate or replace some terms. The stemmer will usually transform a plural into a singular form, reduce the multiple forms of a verb, find the noun from an adverb and so on. Romance, Germanic and Scandinavian languages share some common rules, but each language needs its own Snowball algorithm. The Snowball project provides a detailed list of stemming algorithms for various natural languages. This list is available at: https://snowballstem.org/algorithms/

The Snowball compiler reads the Snowball script and generates the stemmer implementation for a given target programming language such as Ada, C, C#, Java, JavaScript, Go, Python, Rust. The Ada Stemmer Library contains the generated algorithms for several natural languages. The generated stemmers are not able to recognize the natural language and it is necessary to tell the stemmer library which natural language you wish to use.

The Ada Stemmer Library supports only UTF-8 strings which simplifies both the implementation and the API. The library only uses the Ada String type to handle strings.
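As an illustration of what this means for callers, here is a minimal sketch that assumes the text is initially held in a Wide_Wide_String and must be encoded to UTF-8 before being given to the library. The procedure name Stem_Utf8 and the sample word are hypothetical. The encoding step uses the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings, and the call to Stemmer.Factory.Stem is the one described in the stemming examples.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Stemmer.Factory;

procedure Stem_Utf8 is
   use Ada.Strings.UTF_Encoding;

   --  The French word "creations" with an e-acute (U+00E9), built
   --  without a non-ASCII literal so that the example does not depend
   --  on the encoding of the source file.
   Word : constant Wide_Wide_String :=
     "cr" & Wide_Wide_Character'Val (16#E9#) & "ations";

   --  Encode to UTF-8: the result is a plain String holding bytes.
   Utf8 : constant UTF_8_String := Wide_Wide_Strings.Encode (Word);
begin
   --  The word has 9 characters but its UTF-8 form has 10 bytes,
   --  because the e-acute is encoded on two bytes.
   Ada.Text_IO.Put_Line (Natural'Image (Word'Length)
                         & Natural'Image (Utf8'Length));

   --  The UTF-8 String can be passed to the stemmer as is.
   Ada.Text_IO.Put_Line
     (Stemmer.Factory.Stem (Stemmer.Factory.L_FRENCH, Utf8));
end Stem_Utf8;
```

One consequence worth keeping in mind is that the length of such a String counts bytes, not characters, so any code that slices the input before stemming should cut on separators rather than on fixed positions.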

Setup

To use the library, you should run the following commands:

  git clone https://github.com/stcarrez/ada-stemmer.git
  cd ada-stemmer
  make build install

This will fetch, compile and install the library. You can then add the following line in your GNAT project file:

  with "stemmer";

Stemming examples

Each stemmer algorithm works on a single word at a time. The Ada Stemmer Library does not split words. You have to give it one word at a time to stem and it returns either the word itself or its stem. The Stemmer.Factory is the multi-language entry point. The stemmer algorithm is created for each call. The following simple code:

  with Stemmer.Factory; use Stemmer.Factory;
  with Ada.Text_IO; use Ada.Text_IO;
    ...
    Put_Line (Stem (L_FRENCH, "chienne"));

will print the string:

 chien
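Since the library does not split words, the caller has to do it. The following is a minimal sketch of one way to do that, assuming words are separated by single spaces only (real text would also need punctuation handling). The procedure name Stem_Sentence and the sample sentence are hypothetical. Find_Token comes from the standard Ada.Strings.Fixed package (the variant with a From parameter requires Ada 2012).

```ada
with Ada.Text_IO;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Stemmer.Factory;

procedure Stem_Sentence is
   use Stemmer.Factory;

   --  Hypothetical input: the library expects one word per call,
   --  so the sentence is split on spaces before stemming.
   Sentence   : constant String := "les chiennes mangent";
   Separators : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set (" ");
   Pos   : Positive := Sentence'First;
   First : Positive;
   Last  : Natural;
begin
   while Pos <= Sentence'Last loop
      --  Locate the next run of characters that are not separators.
      Ada.Strings.Fixed.Find_Token
        (Source => Sentence,
         Set    => Separators,
         From   => Pos,
         Test   => Ada.Strings.Outside,
         First  => First,
         Last   => Last);

      --  Last is 0 when no further word is found.
      exit when Last = 0;

      --  Stem the word and print the result on its own line.
      Ada.Text_IO.Put_Line (Stem (L_FRENCH, Sentence (First .. Last)));
      Pos := Last + 1;
   end loop;
end Stem_Sentence;
```

Because Stem creates the stemmer for each call, this form is convenient for a handful of words but is probably not the best choice for large inputs.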

When multiple words must be stemmed, it may be better to declare an instance of the stemmer and reuse it for every word. The Stem_Word procedure is called with each word and sets a Boolean out parameter that indicates whether the word was stemmed or not. The result is obtained by calling the Get_Result function. For example:

  with Stemmer.English;
  with Ada.Text_IO; use Ada.Text_IO;
  ..
    Ctx : Stemmer.English.Context_Type;
    Stemmed : Boolean;
    ..
    Ctx.Stem_Word ("zealously", Stemmed);
    if Stemmed then
       Put_Line (Ctx.Get_Result);
    end if;
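The fragment above can be expanded into a complete program. This is only a sketch: the procedure names Stem_Words and Print_Stem and the list of words are hypothetical, and falling back to the original word when Stemmed is False is an assumption about what a caller would usually want.

```ada
with Ada.Text_IO;
with Stemmer.English;

procedure Stem_Words is
   --  A single context is declared once and reused for every word.
   Ctx     : Stemmer.English.Context_Type;
   Stemmed : Boolean;

   --  Stem one word and print the stem, or the word itself when
   --  Stem_Word reports that no stemming took place.
   procedure Print_Stem (Word : in String) is
   begin
      Ctx.Stem_Word (Word, Stemmed);
      if Stemmed then
         Ada.Text_IO.Put_Line (Ctx.Get_Result);
      else
         Ada.Text_IO.Put_Line (Word);
      end if;
   end Print_Stem;
begin
   Print_Stem ("zealously");
   Print_Stem ("running");
   Print_Stem ("stemmers");
end Stem_Words;
```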

Integrating a new Stemming algorithm

Integrating a new stemming algorithm is quite easy but requires installing the Snowball Ada compiler first:

  git clone --branch ada-support https://github.com/stcarrez/snowball
  cd snowball
  make

The Snowball compiler needs the path of the stemming algorithm, the target programming language, the name of the Ada child package that will contain the generated algorithm and the target path. For example, to generate the Lithuanian stemmer, the following command can be used:

  ./snowball algorithms/lithuanian.sbl -ada -P Lithuanian -o stemmer-lithuanian

You will then get two files: stemmer-lithuanian.ads and stemmer-lithuanian.adb. After integration of the generated files in your project, you can access the generated stemmer with:

  with Stemmer.Lithuanian;
  ..
    Ctx : Stemmer.Lithuanian.Context_Type;

Conclusion

Thanks to the Snowball compiler and its algorithms, it is possible to do some natural language analysis. Now that version 1.0 of the Ada Stemmer Library is available on GitHub, you can start doing natural language analysis in Ada!
