UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte #10221

@Sooraj577

Description

How to reproduce the behaviour

I am trying to build a custom NER model. As a reference, I ran the code below, which generated a demo_train.spacy file.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./spacy3/demo_train.spacy")

After the demo_train.spacy file was created, I ran debug data on it:

!python -m spacy debug data /home/sooraj/rough/doccano/spacy3/demo_train.spacy

The command failed with the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/spacy/cli/debug_data.py", line 65, in debug_data_cli
    debug_data(
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/spacy/cli/debug_data.py", line 89, in debug_data
    cfg = util.load_config(config_path, overrides=config_overrides)
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/spacy/util.py", line 549, in load_config
    return config.from_disk(
  File "/home/sooraj/.virtualenvs/spacy3/lib/python3.8/site-packages/thinc/config.py", line 454, in from_disk
    text = file_.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte

I used the example from the spaCy website to generate the .spacy file. Why does this error occur?
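
One possible reading of the traceback: the error is raised inside util.load_config, which suggests that the path given to debug data was opened as a config file and read as UTF-8 text. The sketch below uses only the Python standard library to show that zlib-compressed bytes start with 0x78 0x9c and fail to decode as UTF-8 with this same message. That the .spacy file is zlib-compressed is an assumption here, not something the traceback confirms, and the payload and variable names are made up for illustration.

```python
import zlib

# Stand-in for the contents of a binary file; the payload is arbitrary.
compressed = zlib.compress(b"not a config file")

# With the default compression level, a zlib stream starts with 0x78 0x9c.
header = compressed[:2]
print(header)

message = ""
try:
    # Reading binary data as UTF-8 text is what a config loader would attempt.
    compressed.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)

print(message)
```

If that reading is right, passing a config file as the first argument to debug data, and supplying the .spacy path through an override such as --paths.train, would likely avoid the error.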

Your Environment

  • Operating System: Ubuntu 18.04
  • Python Version Used: 3.8
  • spaCy Version Used: 3.1.0
  • Environment Information:
