Commit 6a1476d

Merge pull request #26 from pharo-ai/merge-ngram
Merge ngram in NgramModel
2 parents ac8041c + 5c1a6e0 commit 6a1476d

11 files changed: +1158 -13 lines changed

.github/workflows/test.yml

Lines changed: 4 additions & 2 deletions
@@ -8,14 +8,16 @@ env:
 on:
   push:
     branches:
-      - master
+      - '**'
+  pull_request:
+    types: [assigned, opened, synchronize, reopened]
 
 jobs:
   build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        smalltalk: [ Pharo64-9.0, Pharo64-10 ]
+        smalltalk: [ Pharo64-9.0, Pharo64-10, Pharo64-11 ]
     name: ${{ matrix.smalltalk }}
     steps:
       - uses: actions/checkout@v2

README.md

Lines changed: 92 additions & 1 deletion
@@ -4,6 +4,9 @@
 [![Coverage Status](https://coveralls.io/repos/github/pharo-ai/NgramModel/badge.svg?branch=master)](https://coveralls.io/github/pharo-ai/NgramModel?branch=master)
 [![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/pharo-ai/NgramModel/master/LICENSE)
 
+`Ngram` package provides basic [n-gram](https://en.wikipedia.org/wiki/N-gram) functionality for Pharo. This includes `Ngram` class as well as `String` and `SequenceableCollection` extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is just a simple utility for splitting texts into sequences of words.
+This project also provides
+
 ## Installation
 
 To install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press Do-it button or Ctrl+D):
@@ -12,9 +15,97 @@ To install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Ph
 Metacello new
   baseline: 'NgramModel';
   repository: 'github://pharo-ai/NgramModel/src';
-  load.
+  load
+```
+
+## How to depend on it?
+
+If you want to add a dependency to this project to your own project, include the following lines into your baseline method:
+
+```Smalltalk
+spec
+  baseline: 'NgramModel'
+  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].
+```
+
+If you are new to baselines and Metacello, check out the [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on Pharo Wiki.
+
+## What are n-grams?
+
+[N-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n elements, usually words. Number n is called the order of n-gram. The concept of n-grams is widely used in [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing). A text can be split into n-grams - sequences of n words. Consider the following text:
+```
+I do not like green eggs and ham
+```
+We can split it into **unigrams** (n-grams with n=1):
+```
+(I), (do), (not), (like), (green), (eggs), (and), (ham)
+```
+Or **bigrams** (n-grams with n=2):
+```
+(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)
+```
+Or **trigrams** (n-grams with n=3):
+```
+(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)
+```
+And so on (tetragrams, pentagrams, etc.).
+
+### Applications
+
+N-grams are widely applied in [language modeling](https://en.wikipedia.org/wiki/Language_model). For example, take a look at the implementation of [n-gram language model](https://github.com/olekscode/NgramModel) in Pharo.
+
+### Structure of n-gram
+
+Each n-gram can be separated into:
+
+* **last word** - the last element in a sequence;
+* **history** (context) - n-gram of order n-1 with all words except the last one.
+
+Such separation is useful in probabilistic modeling when we want to estimate the probability of word given n-1 previous words (see [n-gram language model](https://github.com/olekscode/NgramModel)).
+
+## Ngram class
+
+This package provides only one class - `Ngram`. It models the n-gram.
+
+### Instance creation
+
+You can create n-gram from any `SequenceableCollection`:
+
+```Smalltalk
+trigram := Ngram withElements: #(do not like).
+tetragram := #(green eggs and ham) asNgram.
+```
+
+Or by explicitly providing the history (n-gram of lower order) and last element:
+
+```Smalltalk
+hist := #(green eggs and) asNgram.
+w := 'ham'.
+
+ngram := Ngram withHistory: hist last: w.
 ```
 
+You can also create a zerogram - n-gram of order 0. It is an empty sequence with no history and no last word:
+
+```Smalltalk
+Ngram zerogram.
+```
+
+### Accessing
+
+You can access the order of n-gram, its history and last element:
+
+```Smalltalk
+tetragram. "n-gram(green eggs and ham)"
+tetragram order. "4"
+tetragram history. "n-gram(green eggs and)"
+tetragram last. "ham"
+```
+
+## String extensions
+
+> TODO
+
 ## Example of text generation
 
 #### 1. Loading Brown corpus
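
Note on the README hunk above: the sliding-window split it describes (text into unigrams, bigrams, trigrams) can be tried in a Playground along the following lines. This is a minimal sketch and not code from this commit. It assumes the `Ngram` package is loaded and uses only `asNgram`, `order`, `history` and `last` as the added README text documents them, plus standard collection messages; the variable names `words`, `n` and `bigrams` are made up for illustration, and the `String` extensions (still marked TODO in the README) are deliberately not used.

```Smalltalk
"Split the sentence into words with the standard String>>substrings."
words := 'I do not like green eggs and ham' substrings.

"Order of the n-grams to extract (2 = bigrams)."
n := 2.

"Slide a window of n words over the text; each window becomes an Ngram.
asNgram is the SequenceableCollection extension shown in the README."
bigrams := (1 to: words size - n + 1) collect: [ :start |
    (words copyFrom: start to: start + n - 1) asNgram ].

bigrams size.          "7, one per entry in the README's bigram list"
bigrams first order.   "order of the first bigram"
bigrams first history. "its history: the n-gram of order n-1 without the last word"
bigrams first last.    "its last word"
```

Changing `n` to 1 or 3 should give the unigram and trigram lists from the README, since the window count is always `words size - n + 1`.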

src/BaselineOfNgramModel/BaselineOfNgramModel.class.st

Lines changed: 9 additions & 10 deletions
@@ -6,16 +6,15 @@ Class {
 
 { #category : #baselines }
 BaselineOfNgramModel >> baseline: spec [
+
     <baseline>
-    spec for: #common do: [
-        "External dependencies"
+    spec for: #common do: [
         spec
-            baseline: 'Ngram'
-            with: [ spec repository: 'github://pharo-ai/Ngram/src' ].
-
-        "Packages"
-        spec
-            package: 'NgramModel' with: [ spec requires: #('Ngram') ];
-            package: 'NgramModel-Tests' with: [ spec requires: #('NgramModel') ];
-            package: 'NgramModelTextGenerator' with: [ spec requires: #('NgramModel') ] ].
+            package: 'Ngram';
+            package: 'NgramModel' with: [ spec requires: #( 'Ngram' ) ];
+            package: 'NgramModelTextGenerator' with: [ spec requires: #( 'NgramModel' ) ];
+            package: 'Ngram-Tests' with: [ spec requires: #( 'Ngram' ) ];
+            package: 'NgramModel-Tests' with: [ spec requires: #( 'NgramModel' ) ].
+
+        spec group: 'Ngram-core' with: #( 'Ngram' 'Ngram-Tests' ) ]
 ]
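
Note on the baseline hunk above: because `Ngram` is now a package of this repository and the new `Ngram-core` group bundles it with `Ngram-Tests`, it should be possible to load only the n-gram utility without the language model. The sketch below assumes Metacello's standard `load:` message, which accepts a group or package name; it is not part of this commit and has not been run against it.

```Smalltalk
"Load only the 'Ngram-core' group (packages 'Ngram' and 'Ngram-Tests')
defined in BaselineOfNgramModel, instead of the default full load."
Metacello new
    baseline: 'NgramModel';
    repository: 'github://pharo-ai/NgramModel/src';
    load: 'Ngram-core'
```

A dependent project's baseline could presumably request the same group with Metacello's `loads:` message:

```Smalltalk
"In the baseline method of a hypothetical dependent project."
spec
    baseline: 'NgramModel'
    with: [ spec
        repository: 'github://pharo-ai/NgramModel/src';
        loads: #( 'Ngram-core' ) ].
```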
