After training process all CATS_SCORE and SCORE has 0.00 #12167

Shiyinq · 2023-01-24T09:41:40Z


As a first step, I wrote code to convert a CSV file to spaCy's binary format. The code looks like this:

import pandas as pd
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def convert(lang: str, input_path, output_path):
    train_data = pd.read_csv(input_path)
    train_data.dropna(axis=0, how='any', inplace=True)

    data = tuple(zip(train_data['tweet'].tolist(),
                     train_data['label'].tolist()))

    data_label = {}
    labels = set(train_data["label"])
    for dl in labels:
        data_label[dl] = 0.0

    nlp = spacy.blank(lang)
    db = DocBin()
    for text, label in tqdm(data, total=len(data)):
        data_label[label] = 1.0

        doc = nlp.make_doc(text)
        doc.cats = data_label

        db.add(doc)
        print(doc.cats[label])
        data_label[label] = 0.0

    db.to_disk(output_path)

my_path = "train"
train_data = f"{my_path}/dataset/train.csv"
train_eval = f"{my_path}/dataset/valid.csv"

output_train_data = f"{my_path}/spacy_docs/data_train.spacy"
output_train_eval = f"{my_path}/spacy_docs/data_valid.spacy"

convert("id", train_data, output_train_data)
convert("id", train_eval, output_train_eval)

Then I ran the conversion script and the training command from the CLI:
python process.py
python -m spacy train train/config.cfg --verbose --output train/models

But in the results, every CATS_SCORE and SCORE value is 0.00.

Am I missing something in the code?

I adapted the code above from a spaCy project:

def convert(lang: str, input_path: Path, output_path: Path):
    nlp = spacy.blank(lang)
    db = DocBin()
    for line in srsly.read_jsonl(input_path):
        doc = nlp.make_doc(line["text"])
        doc.cats = line["cats"]
        db.add(doc)
    db.to_disk(output_path)

Answered by danieldk

Jan 24, 2023


danieldk · 2023-01-24T10:40:42Z


You are reusing the same dictionary for every instance:

    data_label = {}

    # ...

    nlp = spacy.blank(lang)
    db = DocBin()
    for text, label in tqdm(data, total=len(data)):
        data_label[label] = 1.0 # <- HERE

So, eventually all labels will be set to a probability of 1.0 and all documents will have the same label dict. So, each doc will have a probability of 1.0 for all labels. You could fix this issue by making a copy of the label dict for each doc and setting the label in the copy.

4 replies

Shiyinq Jan 24, 2023
Author

@danieldk thank you for the response.

I use a dictionary to avoid hard-coding if/else branches,
but I already set the value back to 0.0:

    for text, label in tqdm(data, total=len(data)):
        data_label[label] = 1.0

        doc = nlp.make_doc(text)
        doc.cats = data_label

        db.add(doc)
        print(doc.cats[label])
        data_label[label] = 0.0 # -> HERE

And this is what the result looks like when I print:
print(doc.cats)

danieldk Jan 24, 2023

Ah, right, missed that. But the underlying issue is still the same. Every doc gets the same dict, so when you do

data_label[label] = 0.0 # -> HERE

you are also resetting the label in the doc you just added to the DocBin container. Two iterations of your loop unrolled:

>>> db = DocBin()
>>> labels = {"foo": 0.0, "bar": 0.0}
>>> labels["foo"] = 1.0
>>> doc = nlp.make_doc("hello world")
>>> doc.cats = labels
>>> db.add(doc)
>>> labels["foo"] = 0.0
>>> labels["bar"] = 1.0
>>> doc2 = nlp.make_doc("yet another doc")
>>> doc2.cats = labels
>>> db.add(doc2)
>>> labels["bar"] = 0.0
>>> docs = list(db.get_docs(nlp.vocab))
>>> docs[0].cats
{'foo': 0.0, 'bar': 0.0}
>>> docs[1].cats
{'foo': 0.0, 'bar': 0.0}
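
The same effect can be reproduced without spaCy. The sketch below is only a plain-Python analogy: `stored` is a hypothetical stand-in for the DocBin, and it assumes, as the session above suggests, that the container keeps a reference to each dict rather than a copy.

```python
# Plain-Python analogy of the loop above. `stored` is a hypothetical
# stand-in for the DocBin, assumed to hold references, not copies.
labels = {"foo": 0.0, "bar": 0.0}
stored = []

for label in ["foo", "bar"]:
    labels[label] = 1.0
    stored.append(labels)   # the same dict object is appended every time
    labels[label] = 0.0     # this also resets the entry that was just stored

# Every stored entry is the very same object, so all of them show
# whatever state the shared dict was left in after the loop.
all_same_object = all(entry is labels for entry in stored)
snapshot = [dict(entry) for entry in stored]

print(all_same_object)
print(snapshot)
```

Because each entry is just another name for `labels`, resetting the value after storing it wipes the label from everything stored so far.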

Shiyinq Jan 25, 2023
Author

@danieldk thank you for the explanation, I understand now.
I changed the code to look like this, and it works:

    nlp = spacy.blank(lang)
    db = DocBin()
    for text, label in tqdm(data, total=len(data)):

        data_label = {}
        for dl in labels:
            data_label[dl] = 0.0

        data_label[label] = 1.0

        doc = nlp.make_doc(text)
        doc.cats = data_label

        db.add(doc)
        # data_label[label] = 0.0

    docs = list(db.get_docs(nlp.vocab))
    print(docs[0].cats)

    db.to_disk(output_path)

But do you have any suggestions for improving this code?

danieldk Jan 25, 2023

Rather than reinitializing the dict every iteration, you could also do it once and copy it over. E.g. (untested):

        doc = nlp.make_doc(text)
        doc.cats = {**data_label, label: 1.0}
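
A slightly fuller sketch of that idea, written in plain Python so it runs without spaCy. The helper name `one_hot_cats` and the example label set are hypothetical, and the sketch assumes every label in the data also appears in the base dict.

```python
def one_hot_cats(base, label):
    """Return a new dict with `label` set to 1.0 and every other key unchanged."""
    if label not in base:
        # Assumption: an unseen label is an error rather than a new category.
        raise KeyError(f"unknown label: {label!r}")
    return {**base, label: 1.0}  # builds a fresh dict; `base` is not modified


labels = ["negative", "neutral", "positive"]  # hypothetical label set
data_label = dict.fromkeys(labels, 0.0)        # built once, never mutated

cats_a = one_hot_cats(data_label, "positive")
cats_b = one_hot_cats(data_label, "negative")

print(cats_a)
print(cats_b)
print(data_label)
```

Inside the conversion loop this would be used as `doc.cats = one_hot_cats(data_label, label)`, so each doc gets its own dict and nothing has to be reset afterwards.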
