http://www.nltk.org/howto/stem.html

http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Stemmers remove morphological affixes from words, leaving only the word stem.

The three major stemming algorithms:

Porter: The most commonly used stemmer without a doubt, and also one of the gentlest. It is one of the few stemmers with actual Java support, which is a plus, though it is also the most computationally intensive of the three (granted, not by a significant margin). It is also the oldest stemming algorithm by a wide margin.

Snowball: Nearly universally regarded as an improvement over Porter, and for good reason; Porter himself admits that Snowball is better than his original algorithm. It has slightly faster computation time than Porter, and a fairly large community around it.

Lancaster: A very aggressive stemming algorithm, sometimes to a fault. With Porter and Snowball, the stemmed representations are usually fairly intuitive to a reader; not so with Lancaster, where many shorter words become totally obfuscated. It is the fastest of the three and will reduce your working set of words hugely, but if you want more distinction it is not the tool you would want. (A side-by-side comparison follows below.)

(from Quora)
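
A quick way to see the difference in character is to run the same word through all three stemmers. A minimal sketch (the word is illustrative; all three classes are importable from nltk.stem). Note how Lancaster's output is the least readable:

>>> from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
>>> for s in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
...     print(type(s).__name__, s.stem("having"))
...
PorterStemmer have
SnowballStemmer have
LancasterStemmer hav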

Snowball stemmer

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")  # or SnowballStemmer("english", ignore_stopwords=True); see below
>>> print(stemmer.stem("having"))
have
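
The ignore_stopwords flag noted in the comment above makes the stemmer pass stopwords through unchanged; the NLTK howto linked at the top demonstrates it with "having", which is a stopword:

>>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer2.stem("having"))
having

SnowballStemmer.languages lists the languages the stemmer supports.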

Lemmatization

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized', 'v')
'fantasize'
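
The second argument is the WordNet part-of-speech tag; it defaults to noun ('n'), so without the 'v' the verb form passes through unchanged:

>>> lmtzr.lemmatize('fantasized')
'fantasized'

To lemmatize running text you therefore need a POS tagger in front. A minimal sketch of the usual glue code (the helper name get_wordnet_pos is ours, not NLTK's; word_tokenize and pos_tag need the punkt and averaged_perceptron_tagger data from nltk.download):

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags from pos_tag() to WordNet POS constants.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # noun is also lemmatize()'s own default

lmtzr = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("The cats were fantasizing about mice"))
print([lmtzr.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged])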