# Ngram Language Model

[Tests](https://github.com/pharo-ai/NgramModel/actions/workflows/test.yml) · [Coverage](https://coveralls.io/github/pharo-ai/NgramModel?branch=master) · [License](https://raw.githubusercontent.com/pharo-ai/NgramModel/master/LICENSE)

The `NgramModel` package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. It includes the `AINgram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is a simple utility for splitting texts into sequences of words. The project also provides an n-gram language model (`AINgramModel`) and a simple text generator (`AINgramTextGenerator`); see the example at the end of this README.

## Installation

To install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

```Smalltalk
Metacello new
  baseline: 'AINgramModel';
  repository: 'github://pharo-ai/NgramModel/src';
  load
```

## How to depend on it?

If you want to add a dependency on this project to your own project, include the following lines in your baseline method:

```Smalltalk
spec
  baseline: 'AINgramModel'
  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
```

If you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on the Pharo Wiki.

## What are n-grams?

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams - sequences of n words. Consider the following text:

```
I do not like green eggs and ham
```

We can split it into **unigrams** (n-grams with n=1):

```
(I), (do), (not), (like), (green), (eggs), (and), (ham)
```

Or **bigrams** (n-grams with n=2):

```
(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
```

Or **trigrams** (n-grams with n=3):

```
(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
```

And so on (tetragrams, pentagrams, etc.).

### Applications

N-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). This package implements such an n-gram language model; see the text generation example below.

### Structure of n-gram

Each n-gram can be separated into:

* **last word** - the last element in the sequence;
* **history** (context) - an n-gram of order n-1 with all words except the last one.

Such separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words, which is exactly what the n-gram language model in this package does.

## AINgram class

The core class of this package is `AINgram`. It models the n-gram.

### Instance creation

You can create an n-gram from any `SequenceableCollection`:

```Smalltalk
trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.
```

Or by explicitly providing the history (an n-gram of lower order) and the last element:

```Smalltalk
hist := #(green eggs and) asNgram.
w := 'ham'.
ngram := AINgram withHistory: hist last: w.
```

You can also create a zerogram - an n-gram of order 0. It is an empty sequence with no history and no last word:

```Smalltalk
AINgram zerogram.
```

### Accessing

You can access the order of an n-gram, as well as its history and last element:

```Smalltalk
tetragram.         "n-gram(green eggs and ham)"
tetragram order.   "4"
tetragram history. "n-gram(green eggs and)"
tetragram last.    "ham"
```

## String extensions

> TODO
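Until this section is written, here is a hypothetical sketch of how splitting a text into n-grams could look. The selector `ngramsOfOrder:` is an assumption for illustration only and may differ from the actual protocol; check the package's extension methods for the real selector:

```Smalltalk
"Hypothetical sketch: assumes an ngramsOfOrder: extension method on String
 and SequenceableCollection. The actual selector may differ."
'I do not like green eggs and ham' ngramsOfOrder: 2.
"expected: a collection of bigrams such as n-gram(I do), n-gram(do not), ..."

#(I do not like green eggs and ham) ngramsOfOrder: 3.
"expected: a collection of trigrams such as n-gram(I do not), n-gram(do not like), ..."
```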
"n-gram(green eggs and ham)" tetragram order. "4" tetragram history. "n-gram(green eggs and)" tetragram last. "ham" ``` ## String extensions > TODO ## Example of text generation #### 1. Loading Brown corpus ```Smalltalk file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference. brown := file contents. ``` #### 2. Training a 2-gram language model on the corpus ```Smalltalk model := AINgramModel order: 2. model trainOn: brown. ``` #### 3. Generating text of 100 words At each step the model selects top 5 words that are most likely to follow the previous words and returns the random word from those five (this randomnes ensures that the generator does not get stuck in a cycle). ```Smalltalk generator := AINgramTextGenerator new model: model. generator generateTextOfSize: 100. ``` ## Result: #### 100 words generated by a 2-gram model trained on Brown corpus ``` educator cannot describe and edited a highway at private time `` Fallen Figure Technique tells him life pattern more flesh tremble with neither my God `` Hit ) landowners began this narrative and planted , post-war years Josephus Daniels was Virginia years Congress with confluent , jurisdiction involved some used which he''s something the Lyle Elliott Carter officiated and edited and portents like Paradise Road in boatloads . Shipments of Student Movement itself officially shifted religions of fluttering soutane . Coolest shade which reasonably . Coolest shade less shaky . Doubts thus preventing them proper bevels easily take comfort was ``` #### 100 words generated by a 3-gram model trained on Brown corpus ``` The Fulton County purchasing departments do to escape Nicolas Manas . But plain old bean soup , broth , hash , and cultivated in himself , back straight , black sheepskin hat from Texas A & I College and operates the institution , the antipathy to outward ceremonies hailed by modern plastic materials -- a judgment based on displacement of his arrival spread through several stitches along edge to her paper for further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! ! Kizzie turned to similar approaches . When Mrs. Coolidge for ``` #### 100 words generated by a 3-gram model trained on Pharo source code corpus This model was trained on the corpus composed from the source code of [85,000 Pharo methods tokenized at the subtoken level](https://github.com/pharo-ai/NgramModel/blob/master/Corpora/pharo_source.txt) (composite names like `OrderedCollection` were split into subtokens: `ordered`, `collection`) ``` super initialize value holders . ( aggregated series := ( margins if nil if false ) text styler blue style table detect : [ uniform drop list input . export csv label : suggested file name < a parametric function . | phase <num> := bit thing basic size >= desired length ) ascii . space width + bounds top - an event character : d bytes : stream if absent put : answers ) | width of text . status value := dual value at last : category string := value cos ) abs raised to n number of ``` ## Warning Training the model on the entire Pharo corpus and generating 100 words can take over 10 minutes. So start with a smaller exercise: train a 2-gram model on a Brown corpus (it is the smallest one) and generate 10 words.