- source code: https://github.com/nltk/nltk
- http://www.nltk.org/howto/collocations.html
- http://www.nltk.org/_modules/nltk/collocations.html
Overview
Collocations: expressions of multiple words that commonly co-occur.
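Measurement: Pointwise Mutual Information (PMI), which scores how much more often two words occur together than would be expected if they were independent.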
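For reference, the usual definition of PMI for a bigram can be written as below. The count symbols are notation chosen for these notes, not taken from the pages linked above: n_ii is the count of the bigram (w1, w2), n_ix the count of w1, n_xi the count of w2, and N the total number of words.

```latex
\mathrm{PMI}(w_1, w_2)
  = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
  = \log_2 \frac{n_{ii}\, N}{n_{ix}\, n_{xi}}
```

A score of 0 means the two words co-occur exactly as often as chance predicts; larger positive scores mean a stronger association.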
Steps to find collocations:
- Count the frequency of each word and of its occurrences in the context of other words.
- Score each ngram with an association measure to estimate how likely it is to be a collocation.
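The two steps above can be sketched in plain Python, without NLTK. This is only an illustration under simple assumptions (adjacent word pairs, PMI as the association measure); the function name `pmi_bigrams`, its `min_freq` parameter, and the toy sentence are made up for this example and are not part of NLTK.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_freq=1):
    """Rank adjacent word pairs by PMI (hypothetical helper, not an NLTK API)."""
    # Step 1: count single words and adjacent word pairs.
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    # Step 2: score each bigram with log2(n_ii * N / (n_ix * n_xi)).
    scores = {}
    for (w1, w2), n_ii in bigram_counts.items():
        if n_ii < min_freq:
            continue  # skip pairs seen fewer than min_freq times
        scores[(w1, w2)] = math.log2(
            n_ii * total / (word_counts[w1] * word_counts[w2])
        )

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


tokens = "the cat sat on the mat the cat ran".split()
ranked = pmi_bigrams(tokens)
best_pair, best_score = ranked[0]
```

On this toy input the top pair is ("sat", "on"), which occurs only once. PMI tends to reward rare pairs like this, which is one reason to apply a frequency filter before ranking.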
>>> import nltk
>>> from nltk.collocations import *
>>> # Association measures used as scoring functions
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> # Build a BigramCollocationFinder over all bigrams; window_size defaults to 2
>>> # (it can be larger than 2, to also count pairs of non-adjacent words)
>>> finder = BigramCollocationFinder.from_words(
... nltk.corpus.genesis.words('english-web.txt'))
>>> # Return the 10 bigrams with the highest PMI scores
>>> finder.nbest(bigram_measures.pmi, 10)
Apply filters
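>>> # Ignore bigrams that occur fewer than 3 times
>>> finder.apply_freq_filter(3)
>>> # Ignore bigrams containing stopwords or words shorter than 3 characters
>>> ignored_words = nltk.corpus.stopwords.words('english')
>>> finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
>>> finder.nbest(bigram_measures.pmi, 10)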
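To try the filters and a larger window without downloading any corpus, a small sketch like the following may help. The token list and the `toy_stopwords` set are invented for illustration; `toy_stopwords` is only a stand-in for `nltk.corpus.stopwords.words('english')`.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token list (made up for this example) so no corpus download is needed.
tokens = (
    "new york is big . new york is busy . "
    "i love new york . the city is big"
).split()

# Hypothetical stand-in for the English stopword list.
toy_stopwords = {"is", "the", "i", "."}

bigram_measures = BigramAssocMeasures()

# Default window_size=2: only adjacent word pairs are counted.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times
finder.apply_word_filter(lambda w: w.lower() in toy_stopwords)
top = finder.nbest(bigram_measures.pmi, 10)

# window_size=3: each word is also paired with the word two positions ahead.
wide_finder = BigramCollocationFinder.from_words(tokens, window_size=3)
new_is_count = wide_finder.ngram_fd[("new", "is")]
```

With these toy inputs, only ("new", "york") survives both filters. The wider finder also counts the non-adjacent pair ("new", "is"), which the default window never sees.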