Porter Stemming Algorithm — Basic Intro
In linguistics (the study of language and its structure), a stem is the part of a word that is common to all of its inflected variants.
Words such as CONNECTED, CONNECTING and CONNECTION are inflected variants of CONNECT. Hence, CONNECT is a stem. To this stem, we can add different suffixes to form different words.
The process of reducing such inflected (or sometimes derived) words to their word stem is known as Stemming. For example, CONNECTED, CONNECTION and CONNECTING can be reduced to the stem CONNECT.
The Porter stemming algorithm (or Porter stemmer) removes suffixes from an English word to obtain its stem, which is very useful in the field of Information Retrieval (IR). Stemming reduces the number of distinct terms an IR system has to keep, which saves both space and processing time. The algorithm was developed by the British computer scientist Martin F. Porter. You can visit the official home page of the Porter stemming algorithm for further information.
First, a few terms and notations are introduced to make the explanation easier.
Consonants and Vowels
A consonant is a letter other than A, E, I, O or U, and other than the letter “Y” when it is preceded by a consonant. So in “TOY”, the consonants are “T” and “Y”, and in “SYZYGY” they are “S”, “Z” and “G”.
If a letter is not a consonant it is a vowel.
A consonant will be denoted by c and a vowel by v.
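The consonant rule can be sketched in a few lines of Python. This is only an illustrative sketch, not Porter's reference implementation: the names `is_consonant` and `cv_pattern` are made up for this example, the input is assumed to contain only English letters, and a "Y" at the very start of a word is treated as a consonant because no consonant precedes it.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under the definition above."""
    ch = word[i].lower()
    if ch in VOWELS:
        return False
    if ch == "y":
        # "Y" preceded by a consonant counts as a vowel.
        # At position 0 nothing precedes it, so it is treated as a consonant
        # (an assumption of this sketch).
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Map every letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join(
        "c" if is_consonant(word, i) else "v" for i in range(len(word))
    )


print(cv_pattern("TOY"))     # cvc
print(cv_pattern("SYZYGY"))  # cvcvcv
```

In “TOY” the “Y” follows the vowel “O”, so it is marked c. In “SYZYGY” every “Y” follows a consonant, so each one is marked v, leaving “S”, “Z” and “G” as the consonants.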
A list of one or more consecutive consonants (ccc…) will be denoted by C, and a list of one or more consecutive vowels (vvv…) will be denoted by V. Any word, or part of a word, therefore has one of the four forms given below.
- CVCV … C → collection, management
- CVCV … V → conclude, revise
- VCVC … C → entertainment, illumination
- VCVC … V → illustrate, abundance
All of these forms can be represented using a single form as,