Why does UnstructuredMarkdownLoader strip out formatting? #26976
UnstructuredMarkdownLoader strips out formatting by default because, in its default mode="single", it combines the different "elements" of the markdown file into a single document. This is the intended behavior for basic usage. To keep markdown formatting for RAG (Retrieval-Augmented Generation) applications, here is how you can do it (note the mode="elements" argument):

from langchain_community.document_loaders import UnstructuredMarkdownLoader

markdown_path = "./example_data/example.md"
# mode="elements" returns one Document per parsed element instead of one merged Document
loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")
data = loader.load()
print(f"Number of documents: {len(data)}\n")
# Inspect the first two documents
for document in data[:2]:
    print(f"{document}\n")

Additionally, if you are loading a whole directory with DirectoryLoader, you can pass UnstructuredMarkdownLoader as loader_cls:

from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

# Recursively load every .md file under ./example_data with the markdown loader
loader = DirectoryLoader("./example_data", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
docs = loader.load()

This configuration ensures that ...
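If the goal is to keep the literal markdown syntax (headings, emphasis, lists, code fences) inside each chunk, one option is to skip element parsing and read the files as raw text. The following is only a minimal standard-library sketch of that idea, not LangChain or unstructured code: the names `split_markdown_by_heading` and `load_markdown_dir` and the dictionary shape of each chunk are hypothetical, chosen for illustration. It assumes UTF-8 files and ATX-style (`#`) headings.

```python
import re
from pathlib import Path

# Matches ATX headings such as "# Title" or "### Section" at the start of a line.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(text: str) -> list[dict]:
    """Split raw markdown into one chunk per heading, keeping all markup."""
    chunks = []
    heading, lines = None, []
    in_fence = False
    for line in text.splitlines():
        # Track fenced code blocks so "# comment" lines inside them
        # are not mistaken for headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            # A new heading closes the chunk collected so far.
            if lines:
                chunks.append((heading, lines))
            heading, lines = match.group(2).strip(), []
        lines.append(line)
    if lines:
        chunks.append((heading, lines))

    result = []
    for chunk_heading, chunk_lines in chunks:
        content = "\n".join(chunk_lines).strip()
        if content:  # drop chunks that contain only blank lines
            result.append({"heading": chunk_heading, "content": content})
    return result


def load_markdown_dir(directory: str) -> list[dict]:
    """Read every .md file under `directory` and chunk it by heading."""
    docs = []
    for path in sorted(Path(directory).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for chunk in split_markdown_by_heading(text):
            docs.append({"source": str(path), **chunk})
    return docs
```

Each chunk's `content` is the untouched markdown for that section, so it can be embedded or passed to a prompt with its formatting intact. Within LangChain itself, a comparable route would presumably be a plain-text loader combined with a markdown-aware splitter, but check the current documentation before relying on that.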
Description
The issue is that I am using DirectoryLoader, with or without loader_cls, on a directory containing markdown files, and the parsed content comes back as essentially raw text: all formatting is deleted. Is this the intended behaviour? If so, and I want to keep the markdown formatting for use in a RAG application, what should I use instead?
System Info
(.venv) vscode ➜ $ pip freeze | grep langchain
langchain==0.3.0
langchain-anthropic==0.2.1
langchain-chroma==0.1.4
langchain-cli==0.0.31
langchain-community==0.3.0
langchain-core==0.3.5
langchain-openai==0.2.0
langchain-text-splitters==0.3.0
langchain-unstructured==0.1.4
(.venv) vscode ➜ $ pip freeze | grep langchain
langchain-unstructured==0.1.4
unstructured==0.15.13
unstructured-client==0.25.9
unstructured-inference==0.7.36
unstructured.pytesseract==0.3.13