
Commit 683f048

Columbo FTW!
1 parent 22b8aff commit 683f048

File tree

2 files changed: +45 -35 lines changed


vignettes-src/examples/nlp/text_classification_from_scratch.Rmd

Lines changed: 1 addition & 2 deletions
@@ -22,7 +22,6 @@ word splitting & indexing.
 ## Setup

 ```{r}
-options(conflicts.policy = "strict")
 library(tensorflow, exclude = c("shape", "set_random_seed"))
 library(tfdatasets, exclude = "shape")
 library(keras3)
@@ -57,7 +56,7 @@ The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each
 which represents one review (either positive or negative):

 ```{r, warning=FALSE}
-writeLines(strwrap(readLines("datasets/aclImdb/train/pos/6248_7.txt")))
+writeLines(strwrap(readLines("datasets/aclImdb/train/pos/4229_10.txt")))
 ```

 We are only interested in the `pos` and `neg` subfolders, so let's delete the other subfolder that has text files in it:

vignettes/examples/nlp/text_classification_from_scratch.Rmd

Lines changed: 44 additions & 33 deletions
@@ -23,8 +23,7 @@ word splitting & indexing.
 ## Setup


-```r
-options(conflicts.policy = "strict")
+``` r
 library(tensorflow, exclude = c("shape", "set_random_seed"))
 library(tfdatasets, exclude = "shape")
 library(keras3)
@@ -36,7 +35,7 @@ use_virtualenv("r-keras")
 Let's download the data and inspect its structure.


-```r
+``` r
 if (!dir.exists("datasets/aclImdb")) {
   dir.create("datasets")
   download.file(
@@ -52,7 +51,7 @@ if (!dir.exists("datasets/aclImdb")) {
 The `aclImdb` folder contains a `train` and `test` subfolder:


-```r
+``` r
 head(list.files("datasets/aclImdb/test"))
 ```

@@ -61,7 +60,7 @@ head(list.files("datasets/aclImdb/test"))
 ## [5] "urls_pos.txt"
 ```

-```r
+``` r
 head(list.files("datasets/aclImdb/train"))
 ```

@@ -74,18 +73,30 @@ The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each
 which represents one review (either positive or negative):


-```r
-cat(readLines("datasets/aclImdb/train/pos/6248_7.txt"))
+``` r
+writeLines(strwrap(readLines("datasets/aclImdb/train/pos/4229_10.txt")))
 ```

 ```
-## Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" on the line, an old couple who has an almost mathematical daily cycle (she is the "official replacement" of his ex wife), a couple that has just divorced and has the ex husband suffer under the acts of his former wife obviously having a relationship with her masseuse and finally a crazy hitchhiker who asks her drivers the most unusual questions and stretches their nerves by just being super-annoying.<br /><br />After having seen it you feel almost nothing. You're not even shocked, sad, depressed or feel like doing anything... Maybe that's why I gave it 7 points, it made me react in a way I never reacted before. If that's good or bad is up to you!
+## Don't waste time reading my review. Go out and see this
+## astonishingly good episode, which may very well be the best Columbo
+## ever written! Ruth Gordon is perfectly cast as the scheming yet
+## charming mystery writer who murders her son-in-law to avenge his
+## murder of her daughter. Columbo is his usual rumpled, befuddled and
+## far-cleverer-than-he-seems self, and this particular installment
+## features fantastic chemistry between Gordon and Falk. Ironically,
+## this was not written by heralded creators Levinson or Link yet is
+## possibly the densest, most thoroughly original and twist-laden
+## Columbo plot ever. Utterly satisfying in nearly every department
+## and overflowing with droll and witty dialogue and thinking. Truly
+## unexpected and inventive climax tops all. 10/10...seek this one out
+## on Netflix!
 ```

 We are only interested in the `pos` and `neg` subfolders, so let's delete the other subfolder that has text files in it:


-```r
+``` r
 unlink("datasets/aclImdb/train/unsup", recursive = TRUE)
 ```

@@ -109,7 +120,7 @@ random seed, or to pass `shuffle=FALSE`, so that the validation & training split
 get have no overlap.


-```r
+``` r
 batch_size <- 32

 raw_train_ds <- text_dataset_from_directory(
@@ -126,7 +137,7 @@ raw_train_ds <- text_dataset_from_directory(
 ## Using 20000 files for training.
 ```

-```r
+``` r
 raw_val_ds <- text_dataset_from_directory(
   "datasets/aclImdb/train",
   batch_size = batch_size,
@@ -141,7 +152,7 @@ raw_val_ds <- text_dataset_from_directory(
 ## Using 5000 files for validation.
 ```

-```r
+``` r
 raw_test_ds <- text_dataset_from_directory(
   "datasets/aclImdb/test",
   batch_size = batch_size
@@ -152,23 +163,23 @@ raw_test_ds <- text_dataset_from_directory(
 ## Found 25000 files belonging to 2 classes.
 ```

-```r
+``` r
 cat("Number of batches in raw_train_ds:", length(raw_train_ds), "\n")
 ```

 ```
 ## Number of batches in raw_train_ds: 625
 ```

-```r
+``` r
 cat("Number of batches in raw_val_ds:", length(raw_val_ds), "\n")
 ```

 ```
 ## Number of batches in raw_val_ds: 157
 ```

-```r
+``` r
 cat("Number of batches in raw_test_ds:", length(raw_test_ds), "\n")
 ```

@@ -179,7 +190,7 @@ cat("Number of batches in raw_test_ds:", length(raw_test_ds), "\n")
 Let's preview a few samples:


-```r
+``` r
 # It's important to take a look at your raw data to ensure your normalization
 # and tokenization will work as expected. We can do that by taking a few
 # examples from the training set and looking at them.
@@ -196,7 +207,7 @@ str(batch)
 ## $ :<tf.Tensor: shape=(32), dtype=int32, numpy=…>
 ```

-```r
+``` r
 c(text_batch, label_batch) %<-% batch
 for (i in 1:3) {
   print(text_batch[i])
@@ -218,7 +229,7 @@ for (i in 1:3) {
 In particular, we remove `<br />` tags.


-```r
+``` r
 # Having looked at our data above, we see that the raw text contains HTML break
 # tags of the form '<br />'. These tags will not be removed by the default
 # standardizer (which doesn't strip HTML). Because of this, we will need to
@@ -269,7 +280,7 @@ There are 2 ways we can use our text vectorization layer:
 strings, like this:


-```r
+``` r
 text_input <- keras_input(shape = c(1L), dtype = "string", name = 'text')
 x <- text_input |>
   vectorize_layer() |>
@@ -289,7 +300,7 @@ strings as input, like in the code snippet for option 1 above. This can be done
 training. We do this in the last section.


-```r
+``` r
 vectorize_text <- function(text, label) {
   text <- text |>
     op_expand_dims(-1) |>
@@ -319,7 +330,7 @@ test_ds <- test_ds |>
 We choose a simple 1D convnet starting with an `Embedding` layer.


-```r
+``` r
 # A integer input for vocab indices.
 inputs <- keras_input(shape = c(NA), dtype = "int64")

@@ -372,7 +383,7 @@ summary(model)
 ##  Non-trainable params: 0 (0.00 B)
 ```

-```r
+``` r
 # Compile the model with binary crossentropy loss and an adam optimizer.
 model |> compile(loss = "binary_crossentropy",
                  optimizer = "adam",
@@ -382,7 +393,7 @@ model |> compile(loss = "binary_crossentropy",
 ## Train the model


-```r
+``` r
 epochs <- 3

 # Fit the model using the train and test datasets.
@@ -391,30 +402,30 @@ model |> fit(train_ds, validation_data = val_ds, epochs = epochs)

 ```
 ## Epoch 1/3
-## 625/625 - 5s - 8ms/step - accuracy: 0.6944 - loss: 0.5248 - val_accuracy: 0.8624 - val_loss: 0.3150
+## 625/625 - 6s - 10ms/step - accuracy: 0.6909 - loss: 0.5300 - val_accuracy: 0.8658 - val_loss: 0.3229
 ## Epoch 2/3
-## 625/625 - 2s - 2ms/step - accuracy: 0.9046 - loss: 0.2403 - val_accuracy: 0.8730 - val_loss: 0.3135
+## 625/625 - 2s - 3ms/step - accuracy: 0.9047 - loss: 0.2412 - val_accuracy: 0.8742 - val_loss: 0.3202
 ## Epoch 3/3
-## 625/625 - 2s - 2ms/step - accuracy: 0.9524 - loss: 0.1275 - val_accuracy: 0.8716 - val_loss: 0.3424
+## 625/625 - 2s - 3ms/step - accuracy: 0.9573 - loss: 0.1237 - val_accuracy: 0.8704 - val_loss: 0.3551
 ```

 ## Evaluate the model on the test set


-```r
+``` r
 model |> evaluate(test_ds)
 ```

 ```
-## 782/782 - 1s - 2ms/step - accuracy: 0.8608 - loss: 0.3672
+## 782/782 - 1s - 2ms/step - accuracy: 0.8594 - loss: 0.3818
 ```

 ```
 ## $accuracy
-## [1] 0.86084
+## [1] 0.85936
 ##
 ## $loss
-## [1] 0.3671538
+## [1] 0.381799
 ```

 ## Make an end-to-end model
@@ -423,7 +434,7 @@ If you want to obtain a model capable of processing raw strings, you can simply
 create a new model (using the weights we just trained):


-```r
+``` r
 # A string input
 inputs <- keras_input(shape = c(1), dtype = "string")
 # Turn strings into vocab indices
@@ -444,12 +455,12 @@ end_to_end_model |> evaluate(raw_test_ds)
 ```

 ```
-## 782/782 - 3s - 4ms/step - accuracy: 0.8608 - loss: 0.0000e+00
+## 782/782 - 3s - 4ms/step - accuracy: 0.8594 - loss: 0.0000e+00
 ```

 ```
 ## $accuracy
-## [1] 0.86084
+## [1] 0.85936
 ##
 ## $loss
 ## [1] 0

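For quick reference, here is a sketch of the net effect of this commit, reconstructed from the hunks above (it assumes the IMDB data has already been downloaded into `datasets/aclImdb`, as the vignette does): the Setup chunk no longer sets `options(conflicts.policy = "strict")`, and the sample-review preview now reads `4229_10.txt` (the Columbo review) through `writeLines(strwrap(...))` instead of `cat(readLines(...))`.

``` r
# Setup chunk as it reads after this commit (conflicts.policy option dropped)
library(tensorflow, exclude = c("shape", "set_random_seed"))
library(tfdatasets, exclude = "shape")
library(keras3)
use_virtualenv("r-keras")

# Review preview after this commit: new sample file, wrapped for readable output
writeLines(strwrap(readLines("datasets/aclImdb/train/pos/4229_10.txt")))
```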