
A bug in dataset.py #110

@sweetchild222

Description

The bug is at line 114 in https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py

The original code generates abnormal data with a wrong is_next_label.
For example, when the corpus is as below:

Welcome to the  \t the jungle\n
I can stay  \t  here all night\n

the original code generates t1, t2, and is_next_label like below:

| t1 | t2 | is_next_label |
|---|---|---|
| Welcome to the | the jungle | 1 (correct) |
| I can stay | here all night | 1 (correct) |
| Welcome to the | here all night | 0 (correct) |
| I can stay | the jungle | 0 (correct) |
| Welcome to the | the jungle | 0 (wrong) |
| I can stay | here all night | 0 (wrong) |

It's hard to notice this problem when the corpus is huge, because the abnormal data is only a small portion of the total. But when the corpus is small, the loss will not decrease, because the abnormal data becomes a large portion of the total. This problem is related to #32.
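
For context, the on_memory branch at that line draws a random index over all lines, so it can return the current item's own second sentence. Below is a minimal sketch of the problem; the sampling is paraphrased from dataset.py, not the exact source:

    import random

    # Paraphrased on_memory sampling: the index is drawn over *all* lines,
    # so it can collide with the item the NotNext pair is being built for.
    def get_random_line_original(lines):
        return lines[random.randrange(len(lines))][1]

    lines = [("Welcome to the", "the jungle"),
             ("I can stay", "here all night")]

    # Building a NotNext (is_next_label = 0) pair for item 0:
    t1 = lines[0][0]
    t2 = get_random_line_original(lines)  # may return "the jungle" again -> wrong label 0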

So, to solve the problem, I fixed the code as below:

    def get_random_line(self, exclude_index):
        if self.on_memory:
            # Resample until the index differs from the current item's index,
            # so a NotNext pair can never reuse the item's own second sentence.
            while True:
                find_index = random.randrange(len(self.lines))

                if find_index != exclude_index:
                    return self.lines[find_index][1]
        # ... (non-on_memory branch omitted)
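
Because get_random_line now takes exclude_index, its caller also has to pass the current item's index. A minimal sketch of the adjusted caller, assuming it follows the random_sent pattern in dataset.py (structure paraphrased):

    def random_sent(self, index):
        t1, t2 = self.get_corpus_line(index)

        # 50%: keep the real next sentence (IsNext, label 1);
        # 50%: sample a random second sentence (NotNext, label 0),
        #      now excluding the current index so the pair can never be a real continuation.
        if random.random() > 0.5:
            return t1, t2, 1
        else:
            return t1, self.get_random_line(exclude_index=index), 0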

I tested the fixed code with a small corpus that includes only 5 lines:

Welcome to the  \t the jungle\n
I can stay   \t  here all night\n
You need to apologize to   \t her and need to do it right away\n
Hundreds of soldiers ate in  \t silence around their campfires\n
I asked twenty people to  \t my party but not all of them came\n

The loss decreased. This is the training log:

(training log screenshot)
