Overview

Collocations: expressions of multiple words that commonly co-occur.

Measurement: Pointwise Mutual Information (PMI), which compares how often words co-occur with how often they would if they were independent.
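A minimal sketch of PMI computed from raw counts, to make the measure concrete. The function name pmi, its parameter names, and the counts are made up for illustration; base-2 logarithms are assumed.

```python
import math

def pmi(pair_count, left_count, right_count, total):
    """Base-2 PMI of a word pair, computed from raw counts (hypothetical helper)."""
    p_pair = pair_count / total      # P(w1, w2)
    p_left = left_count / total      # P(w1)
    p_right = right_count / total    # P(w2)
    return math.log2(p_pair / (p_left * p_right))

# Made-up counts: the pair occurs 16 times more often than independence predicts.
score = pmi(pair_count=8, left_count=16, right_count=32, total=1024)
print(score)        # 4.0

# Made-up counts where the pair occurs exactly as often as independence predicts.
independent = pmi(pair_count=2, left_count=32, right_count=64, total=1024)
print(independent)  # 0.0
```

A score of 0 means no association; larger positive scores mean the words appear together more often than chance would suggest.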

Steps to find collocations:

  1. Count how often each word occurs and how often it appears in the context of other words.
  2. Score each ngram with an association measure to estimate the relative likelihood that it is a collocation.
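The two steps can be sketched on a toy token list without NLTK. The sentence, the helper name score_bigrams, and the choice of the token count as the total are assumptions for illustration, not a description of NLTK's internals.

```python
import math
from collections import Counter

def score_bigrams(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair (hypothetical helper)."""
    # Step 1: count the words and the adjacent pairs.
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    # Step 2: score each pair with an association measure (base-2 PMI here).
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        expected = word_counts[w1] * word_counts[w2] / total
        scores[(w1, w2)] = math.log2(n_pair / expected)
    return scores

# Made-up example text.
tokens = "new york is big and new york is old and the city is big".split()
scores = score_bigrams(tokens)

# Rank pairs from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

The NLTK session that follows does the same counting and scoring on a real corpus.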
>>> import nltk
>>> from nltk.collocations import *

>>> # Association measures that serve as scoring functions
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()

>>> # Construct a BigramCollocationFinder over all bigrams; window_size defaults to 2
>>> # (window_size can be larger than 2, which also pairs non-adjacent words)
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))

>>> # Return the 10 highest-scoring bigrams under PMI
>>> finder.nbest(bigram_measures.pmi, 10)
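A small sketch of what a larger window_size changes, using a made-up four-token list rather than the Genesis corpus. It assumes NLTK is installed; the variable names are illustrative.

```python
from nltk.collocations import BigramCollocationFinder

tokens = "a b c d".split()  # made-up tokens

# Default window_size=2: only adjacent words are paired.
adjacent = BigramCollocationFinder.from_words(tokens)

# window_size=3: each word is also paired with the word two positions ahead.
windowed = BigramCollocationFinder.from_words(tokens, window_size=3)

print(sorted(adjacent.ngram_fd))
print(sorted(windowed.ngram_fd))
```

With the wider window, pairs such as ('a', 'c') become candidates even though the two words are never adjacent.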

Apply filters:

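>>> # Ignore bigrams that occur fewer than 3 times
>>> finder.apply_freq_filter(3)
>>> # Ignore bigrams that contain a stopword
>>> ignored_words = nltk.corpus.stopwords.words('english')
>>> finder.apply_word_filter(lambda w: w.lower() in ignored_words)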

>>> finder.nbest(bigram_measures.pmi, 10)
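Why filter: PMI tends to reward rare pairs, so a bigram seen once between two words seen once can outrank a pair that recurs. A small sketch on a made-up token list (an assumption, not the Genesis corpus), using the same finder calls as the session above; it assumes NLTK is installed.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up example text.
tokens = "new york is big and new york is old and the city is big".split()
measures = BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(tokens)
before = finder.nbest(measures.pmi, 1)   # best pair with no filtering

finder.apply_freq_filter(2)              # drop pairs seen fewer than 2 times
after = finder.nbest(measures.pmi, 1)    # best pair among the remaining ones

print(before, after)
```

Before filtering, the one-off pair ('the', 'city') ranks first; after the frequency filter, the recurring pair ('new', 'york') does.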

Finders