```
(The), (quick), (brown), ...
(The, quick), (quick, brown), (brown, fox), ...
(The, quick, brown), (quick, brown, fox), (brown, fox, jumps), ...
(The, quick, brown, fox), (quick, brown, fox, jumps), (brown, fox, jumps, over), ...
```
The bigram

```
(brown, quick)
```

is very likely grammatically incorrect, because we normally encounter those two words the other way around. Applying n-gram analysis to text is a very simple yet powerful technique, used frequently in language modelling problems like the one we just saw, and as such it is often the foundation of more advanced NLP applications (some of which we'll explore in this series).

The `brown` directory contains a lot of `.txt` files:

```sh
$ ll brown
total 10588
-rw-r--r-- 1 nathan nathan 20187 Dec 3 2008 ca01
-rw-r--r-- 1 nathan nathan 20357 Dec 3 2008 ca02
-rw-r--r-- 1 nathan nathan 20214 Dec 3 2008 ca03
(...)
```
The `each_cons` method is defined in the `Enumerable` module, and does exactly what we need by returning every possible set of `n` consecutive elements. It's effectively a built-in method for what we want, once we have the input data in `Array` format.
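As a quick illustration of `each_cons` (standard Ruby, nothing assumed here):

```ruby
# each_cons(n) yields every run of n consecutive elements, in order.
words = %w[The quick brown fox]

p words.each_cons(2).to_a
# => [["The", "quick"], ["quick", "brown"], ["brown", "fox"]]
```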
We take a `target` string to work on, and allow people to optionally define how we break apart the given target before generating the n-grams (it defaults to splitting into words, but you could change it to split into characters, for example). We also define `unigrams`, `bigrams` and `trigrams` helper methods for the most common cases.
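Putting that together, a minimal sketch of such an `Ngram` class (the method names come from the description above; the internals, including the `split:` option, are an illustrative reconstruction, not the author's exact code):

```ruby
# A sketch of an Ngram class over a target string.
class Ngram
  # split: how to break the target apart; defaults to splitting into words.
  def initialize(target, split: ->(s) { s.split(" ") })
    @target = target
    @split = split
  end

  def ngrams(n)
    @split.call(@target).each_cons(n).to_a
  end

  def unigrams; ngrams(1); end
  def bigrams;  ngrams(2); end
  def trigrams; ngrams(3); end
end
```

For example, `Ngram.new("The quick brown fox").bigrams` yields the bigrams shown earlier, and passing `split: ->(s) { s.chars }` switches to character n-grams.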
The corpus files contain tagged sentences that look like this:

```
no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
```

The tags describe the sort of word we are looking at, for example a "noun" or a "verb", and often the tense and role of the word, for example "past tense" or "plural".

We call `strip` just to clean off any leading or trailing white-space we may have, and then we split the sentence up into words using the space character as a delimiter.
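To make the strip-then-split step concrete, here is what it does to one tagged line (tag handling is left out of this sketch):

```ruby
# A raw line from the corpus, with stray surrounding white-space.
line = "  no/at evidence/nn that/cs any/dti irregularities/nns took/vbd place/nn ./. "

# strip removes the leading/trailing white-space; split breaks on spaces.
words = line.strip.split(" ")

p words.first(2)
# => ["no/at", "evidence/nn"]
```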
The interesting method here is `each_with_object`. This comes courtesy of the `Enumerable` module, and is effectively a specialised version of `inject`. Here's an example without using either of those methods:
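A sketch of the manual version, with `raw_sentences` standing in for the corpus data described in the text:

```ruby
# Manual accumulation: we create and fill `sentences` ourselves.
raw_sentences = ["  The cat sat  ", "No evidence was found "]

sentences = []
raw_sentences.each do |sentence|
  sentences << sentence.strip.split(" ")
end

p sentences
# => [["The", "cat", "sat"], ["No", "evidence", "was", "found"]]
```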
Notice how we have to create and then return the `sentences` variable ourselves. The `inject` method gets us one step closer:
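The same sketch with `inject`, under the same `raw_sentences` assumption:

```ruby
raw_sentences = ["  The cat sat  ", "No evidence was found "]

sentences = raw_sentences.inject([]) do |acc, sentence|
  acc << sentence.strip.split(" ")
  acc # we must remember to return the accumulator from the block
end

p sentences
# => [["The", "cat", "sat"], ["No", "evidence", "was", "found"]]
```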
With `inject`, we have to be careful to always return `acc` (short for "accumulator", which is a common name for the object one is building when using `inject` or other fold-style variants). If we don't, the `acc` variable on the next iteration will become whatever we returned from the last one. Thus, `each_with_object` was born, which takes care of always returning the object you are building without you having to worry about it.
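The same sketch again with `each_with_object`, which removes the explicit return:

```ruby
raw_sentences = ["  The cat sat  ", "No evidence was found "]

sentences = raw_sentences.each_with_object([]) do |sentence, acc|
  acc << sentence.strip.split(" ")
  # no explicit return needed: the accumulator is always passed along
end

p sentences
# => [["The", "cat", "sat"], ["No", "evidence", "was", "found"]]
```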
One gotcha when switching between `inject` and `each_with_object` is that the `acc` parameter is passed to the block in the opposite position: `inject` yields `(acc, element)`, whereas `each_with_object` yields `(element, acc)`. Be sure to check the documentation before using these methods, to make sure you've got it the right way around!
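The argument-order difference in miniature:

```ruby
# inject yields (accumulator, element)...
sum = [1, 2, 3].inject(0) { |acc, n| acc + n }

# ...while each_with_object yields (element, accumulator).
doubled = [1, 2, 3].each_with_object([]) { |n, acc| acc << n * 2 }

p sum     # => 6
p doubled # => [2, 4, 6]
```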
The constructor takes a glob (here, one matching the `brown` folder that you extracted previously), and a class to which each file that is found will be passed. If you're not familiar with glob patterns, they're often used in the terminal to wildcard things: e.g. `*.txt` is a glob pattern that will find all files ending in `.txt`.

Next come the `files` and `sentences` methods. The former uses the glob to find all matching files, loops over them, and creates a new instance of the class we passed to the constructor for each one. The latter calls the `files` method, loops over the created instances calling `sentences` on each, and flattens the result into an `Array` one level deep.
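A sketch of the `Corpus` class just described (the class and method names come from the text; the internals are an illustrative reconstruction, not the author's exact code):

```ruby
# A corpus is a glob of files plus a file-format class to wrap each file in.
class Corpus
  # glob  - e.g. "brown/c*", matching the corpus files
  # klass - the file-format class each found file is handed to
  def initialize(glob, klass)
    @glob  = glob
    @klass = klass
  end

  # Finds every file matching the glob and wraps it in klass.
  def files
    @files ||= Dir[@glob].map { |path| @klass.new(path) }
  end

  # Asks every wrapped file for its sentences, flattened one level deep.
  def sentences
    files.map(&:sentences).flatten(1)
  end
end
```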
Next up is the `Ngram` class. The `ngrams` method calls `sentences` and returns an array of n-grams for those sentences. Note that when we call `flatten`, we ask it to flatten only one level, otherwise we'd lose the n-gram `Array`s themselves. The `unigrams`, `bigrams` and `trigrams` methods are just helpers to make things read more nicely.
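The core of that `ngrams` method might look like the following sketch (here written as a standalone function, with `sentences` standing in for the method described above that returns an `Array` of word `Array`s):

```ruby
# Generating n-grams across every sentence of a corpus.
def ngrams(sentences, n)
  # flatten(1) collapses the per-sentence grouping while keeping
  # each n-gram Array intact.
  sentences.map { |sentence| sentence.each_cons(n).to_a }.flatten(1)
end

sentences = [
  %w[The quick brown fox],
  %w[No evidence was found],
]

p ngrams(sentences, 2).first(3)
# => [["The", "quick"], ["quick", "brown"], ["brown", "fox"]]
```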
We instantiate the `Corpus` class we wrote, and tell it that the corpus we are looking at uses the `BrownCorpusFile` format.

Looking at the first file of the corpus (`ca01`), there are lots of proper nouns (including places and names) preceded by the word "of". An example is the sentence "Janet Jossy of North Plains [...]". Most proper nouns will likely be two words or fewer, so let's loop over all of the sentences as trigrams, looking for any trigram whose first member is "of" and whose second member starts with a capital letter. We'll include the third member of the trigram in the result if it also starts with a capital letter.
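A sketch of that loop (`trigrams` stands in for the trigrams generated from the corpus; a few are inlined here for illustration):

```ruby
# Illustrative sketch of the proper-noun search described above.
trigrams = [
  %w[of North Plains],
  %w[of North Plains],
  %w[Jossy of North],
  %w[of the county],
]

results = Hash.new(0)

trigrams.each do |first, second, third|
  # Only consider trigrams of the form ("of", Capitalised, ...).
  next unless first == "of" && second.match?(/\A[A-Z]/)

  # Include the third word only if it is also capitalised.
  noun = third.match?(/\A[A-Z]/) ? [second, third] : [second]
  results[noun.join(" ")] += 1
end

p results
# => {"North Plains"=>2}
```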
When we're done, we'll be left with a `Hash` mapping each proper noun to the number of times it occurred.
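Concretely, the result takes a shape like this (the counts are made up for illustration; the names come from the example sentence above):

```ruby
results = {
  "North Plains" => 2,
  "Janet Jossy"  => 1,
}
```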
First we create the results `Hash`, using `Hash.new(0)`, which makes `0` the default value for every key. This allows us to increment the value of each key without ever having to worry about whether the key was created and set before.
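A quick demonstration of the `Hash.new(0)` trick:

```ruby
counts = Hash.new(0) # missing keys read as 0
counts["North Plains"] += 1
counts["North Plains"] += 1

p counts["North Plains"] # => 2
p counts["unseen"]       # => 0
```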
If the third member of the trigram starts with a capital letter, we build an `Array` containing the second and third members; otherwise, we build an `Array` with just the second member in it. We then join that `Array` with a space character and store the result in the results `Hash`, incrementing the value of that key by one.

Finally, we're left with a `Hash` populated with key/value pairs. The keys are the proper nouns we've found, and the values are the number of times we've seen each proper noun in the corpus. Here's the top 10 after running that code over the entire Brown Corpus: