UnicodeDecodeError when using TextLoader #1

@harshalekkalaarjun

Description:
I'm encountering a UnicodeDecodeError when loading a text file with TextLoader from langchain. The error occurs even when I specify encoding="utf-8". The traceback is as follows:
```
UnicodeDecodeError Traceback (most recent call last)
File c:\Users\BOSON-229\Music\lanchain\env\Lib\site-packages\langchain_community\document_loaders\text.py:43, in TextLoader.lazy_load(self)
42 with open(self.file_path, encoding=self.encoding) as f:
---> 43 text = f.read()
44 except UnicodeDecodeError as e:

File C:\Program Files\Python312\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 198278: character maps to <undefined>

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
Cell In[29], line 3
1 file_path = "tagged_description.txt"
2 loader = TextLoader(file_path) # 👈 specify encoding
----> 3 raw_documents = loader.load()
5 text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0,)
6 documents = text_splitter.split_documents(raw_documents)

File c:\Users\BOSON-229\Music\lanchain\env\Lib\site-packages\langchain_core\document_loaders\base.py:32, in BaseLoader.load(self)
30 def load(self) -> list[Document]:
...
---> 56 raise RuntimeError(f"Error loading {self.file_path}") from e
57 except Exception as e:
58 raise RuntimeError(f"Error loading {self.file_path}") from e

RuntimeError: Error loading tagged_description.txt
```
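
The second frame of the traceback is inside `cp1252.py`, which suggests the file was decoded with the Windows default code page rather than UTF-8 in the run that produced it. A minimal standard-library sketch of that failure mode follows. The sample string and the `try_decode` helper are assumptions made for illustration, chosen because the UTF-8 encoding of U+201D ends in byte 0x9d, which cp1252 leaves undefined:

```python
# Hypothetical sample text: U+201D (right double quotation mark) is encoded
# in UTF-8 as the three bytes E2 80 9D.
sample = "He said \u201dhello\u201d"
data = sample.encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a 'failed: ...' message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


# cp1252 has no character assigned to byte 0x9d, so this decode fails.
cp1252_result = try_decode(data, "cp1252")
# The same bytes decode cleanly as UTF-8.
utf8_result = try_decode(data, "utf-8")

print(cp1252_result)
print(utf8_result)
```

If this matches what happens with `tagged_description.txt`, the file is probably valid UTF-8 and the failure comes from the codec used to read it.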
To Reproduce:

Steps to reproduce the behavior:

Place a text file (tagged_description.txt) with non-ASCII characters in the working directory.

Run the following code:

    from langchain.document_loaders import TextLoader

    loader = TextLoader("tagged_description.txt")  # or TextLoader("tagged_description.txt", encoding="utf-8")
    raw_documents = loader.load()

Possible solution:

    file_path = "tagged_description.txt"
    loader = TextLoader(file_path, encoding="utf-8")  # 👈 specify encoding
    raw_documents = loader.load()
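
If the file's encoding is not known in advance, one way to check it outside of TextLoader is to try a short list of candidate encodings with plain `open()`. The sketch below uses only the standard library. `read_text_with_fallback` and its candidate list are hypothetical names introduced here, not part of langchain, and the temporary file stands in for the real `tagged_description.txt`:

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read(), "utf-8 (with replacement)"


# Usage: write a small UTF-8 file containing non-ASCII characters, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tagged_description.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("caf\u00e9 \u201dquoted\u201d")
    text, used = read_text_with_fallback(path)

print(used)
```

The encoding reported by the helper can then be passed to `TextLoader(file_path, encoding=...)`. Recent `langchain_community` releases also appear to accept an `autodetect_encoding` argument on TextLoader, but check the installed version before relying on it.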
