The `Ngram` package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. It includes the `Ngram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. In short, it is a simple utility for splitting texts into sequences of words.

This project also provides

## Installation

To install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

```Smalltalk
Metacello new
  baseline: 'NgramModel';
  repository: 'github://pharo-ai/NgramModel/src';
  load
```

## How to depend on it?

If you want to add a dependency on this project to your own project, include the following lines in your baseline method:
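
A minimal sketch of such lines, assuming the standard Metacello `baseline:with:` syntax and the same baseline name and repository as in the installation script above:

```Smalltalk
"Inside the baseline: method of your own BaselineOf class (assumed context)"
spec
  baseline: 'NgramModel'
  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
```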

If you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on Pharo Wiki.

## What are n-grams?

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams, that is, sequences of n consecutive words. Consider the following text:
```
I do not like green eggs and ham
```
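We can split it into **unigrams** (n-grams with n=1), that is, single words: (I), (do), (not), (like), (green), (eggs), (and), (ham). Or into **bigrams** (n-grams with n=2):
```
(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
```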
Or **trigrams** (n-grams with n=3):
```
(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
```
And so on (tetragrams, pentagrams, etc.).
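
The splitting shown above amounts to sliding a window of n words over the text. A minimal sketch using only plain collections, not the package's own API (the variable names are illustrative):

```Smalltalk
| words n ngrams |
"Split the text into words on whitespace"
words := 'I do not like green eggs and ham' substrings.
n := 3.

"Slide a window of n words over the text, one position at a time"
ngrams := (1 to: words size - n + 1) collect: [ :i |
    words copyFrom: i to: i + n - 1 ].

ngrams size.   "6 trigrams for a text of 8 words"
ngrams first.  "the first trigram: 'I' 'do' 'not'"
```

If the text has fewer than n words, the interval is empty and no n-grams are produced.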

### Applications

N-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). For example, take a look at the implementation of an [n-gram language model](https://github.com/olekscode/NgramModel) in Pharo.

### Structure of n-gram

Each n-gram can be separated into:

* **last word** - the last element in a sequence;
* **history** (context) - n-gram of order n-1 with all words except the last one.

Such separation is useful in probabilistic modeling when we want to estimate the probability of a word given the n-1 previous words (see [n-gram language model](https://github.com/olekscode/NgramModel)).
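
As a rough illustration of why this separation helps, the probability of a word given its history can be estimated by counting: the number of times the full n-gram occurs divided by the number of times its history occurs. The sketch below uses plain collections and a `Bag` rather than the `Ngram` class, and the example text and variable names are made up for illustration:

```Smalltalk
| words trigrams ngramCounts historyCounts probability |
words := 'I do not like green eggs and ham I do not like them' substrings.

"All trigrams of the text, as arrays of three words"
trigrams := (1 to: words size - 2) collect: [ :i |
    words copyFrom: i to: i + 2 ].

"Count each trigram and each history (the trigram without its last word)"
ngramCounts := Bag new.
historyCounts := Bag new.
trigrams do: [ :each |
    ngramCounts add: each.
    historyCounts add: each allButLast ].

"Estimate P(green | not like) = count(not like green) / count(not like)"
probability := (ngramCounts occurrencesOf: #('not' 'like' 'green'))
    / (historyCounts occurrencesOf: #('not' 'like')).
```

Here the estimate is 1/2, because `not like` is followed once by `green` and once by `them`.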

## Ngram class

This package provides only one class - `Ngram`. It models an n-gram.

### Instance creation

You can create an n-gram from any `SequenceableCollection`:

```Smalltalk
trigram := Ngram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.
```

Or by explicitly providing the history (an n-gram of lower order) and the last element:

```Smalltalk
hist := #(green eggs and) asNgram.
w := 'ham'.

ngram := Ngram withHistory: hist last: w.
```

You can also create a zerogram, an n-gram of order 0. It is an empty sequence with no history and no last word:

```Smalltalk
Ngram zerogram.
```

### Accessing

You can access the order of an n-gram, its history, and its last element:
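
A minimal sketch of such access, reusing the `tetragram` from the instance creation example. The selectors `order`, `history`, and `last` are assumptions inferred from the sentence above and from `withHistory:last:`; check the `Ngram` class in your image for the actual protocol:

```Smalltalk
"Assumed selectors: order, history, last (not confirmed by this text)"
tetragram := #(green eggs and ham) asNgram.

tetragram order.    "expected to answer the order of the n-gram, here 4"
tetragram history.  "expected to answer the trigram (green eggs and)"
tetragram last.     "expected to answer the last element, #ham"
```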