Introduce MarkdownDocumentReader #1106
Closed
Motivation
@markpollack and @tzolov, maybe you are interested in the new `MarkdownDocumentReader`, which can read structured Markdown documents. As @markpollack wrote in #105, it could be valuable. I agree. So, I've prepared a simple implementation of that `DocumentReader`.
Description
For parsing Markdown documents, I've used the commonmark/commonmark-java library.
Document dividing
By default, all documents are divided by headers. This includes all header types from 1 to 6. For a simple document like:
Six documents will be generated. Each of these documents will have entries in the metadata as follows:
- `category` => `header_X`, where `X` is the number of the header
- `title` => `<header title>`, e.g. `BBB` from the example

There is also an option to divide the Markdown document by horizontal lines. This is not the default option, but it can be turned on through configuration.
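To make the header-based division concrete, here is a minimal, self-contained sketch. It is not the implementation from this PR, which parses with commonmark-java; it only recognizes ATX headers (`#` to `######`) with a plain regex, and the names `HeaderSplitSketch`, `Section`, and `split` are hypothetical. It assumes Java 16+ for records and would misread `#` lines inside code blocks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the PR itself relies on commonmark-java.
class HeaderSplitSketch {

    // Hypothetical stand-in for a document: its text plus a metadata map.
    record Section(String text, Map<String, Object> metadata) {}

    // ATX headers only: one to six '#' characters, whitespace, then the title.
    private static final Pattern HEADER = Pattern.compile("^(#{1,6})\\s+(.*?)\\s*$");

    static List<Section> split(String markdown) {
        List<Section> sections = new ArrayList<>();
        Map<String, Object> metadata = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        boolean started = false; // true once the first header has been seen

        for (String line : markdown.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A header closes the previous section and opens a new one.
                if (started || !body.toString().isBlank()) {
                    sections.add(new Section(body.toString().strip(), metadata));
                }
                metadata = new LinkedHashMap<>();
                metadata.put("category", "header_" + m.group(1).length());
                metadata.put("title", m.group(2));
                body = new StringBuilder();
                started = true;
            } else {
                body.append(line).append('\n');
            }
        }
        // Emit the last open section, or leading text that had no header.
        if (started || !body.toString().isBlank()) {
            sections.add(new Section(body.toString().strip(), metadata));
        }
        return sections;
    }
}
```

Calling `HeaderSplitSketch.split(text)` returns one `Section` per header, each carrying the `category` and `title` entries described above. Text that appears before the first header becomes a section with empty metadata, which is a choice of this sketch rather than documented behavior.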
Blockquotes and Code Blocks support
All blockquotes and code blocks are treated as separate documents. For code blocks where the language can be determined, it is included in the `lang` metadata entry. This behavior can be changed by setting options.
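The code block handling can be illustrated with a small sketch as well. Again, this is not the PR's commonmark-based code: it scans for backtick fences by hand, ignores blockquotes and indented code blocks, and the names `CodeBlockSketch`, `Block`, and `extractCodeBlocks` are hypothetical. The only metadata key taken from the description is `lang`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the PR itself relies on commonmark-java.
class CodeBlockSketch {

    // Hypothetical stand-in for a document: its text plus a metadata map.
    record Block(String text, Map<String, Object> metadata) {}

    // Three backticks, built with repeat() to keep this listing easy to embed.
    private static final String FENCE = "`".repeat(3);

    static List<Block> extractCodeBlocks(String markdown) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder code = null; // non-null while inside a fenced block
        String lang = "";

        for (String line : markdown.split("\\R")) {
            String trimmed = line.strip();
            if (trimmed.startsWith(FENCE)) {
                if (code == null) {
                    // Opening fence: the info string, if any, names the language.
                    code = new StringBuilder();
                    lang = trimmed.substring(FENCE.length()).strip();
                } else {
                    // Closing fence: emit the block as its own document.
                    Map<String, Object> metadata = new LinkedHashMap<>();
                    if (!lang.isEmpty()) {
                        metadata.put("lang", lang);
                    }
                    blocks.add(new Block(code.toString(), metadata));
                    code = null;
                }
            } else if (code != null) {
                code.append(line).append('\n');
            }
        }
        return blocks;
    }
}
```

A fence without an info string produces a block with no `lang` entry, matching the "where the language can be determined" wording above.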
Additional metadata
The Markdown reader configuration also provides support for additional metadata, which may be set for all processed documents. It contains fixed values that offer more context about the created document, such as the service name that provides the document, or the environment in which it was created.
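As a rough sketch of what "additional metadata" could mean in practice, the snippet below merges a fixed map into each document's own metadata. The class and method names are hypothetical, the `service` key is made up for illustration, and the rule that per-document entries win on a key clash is an assumption, not something this PR states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; names and precedence rule are assumptions.
class AdditionalMetadataSketch {

    static Map<String, Object> withAdditional(Map<String, Object> documentMetadata,
                                              Map<String, Object> additional) {
        // Start from the fixed, reader-wide values ...
        Map<String, Object> merged = new LinkedHashMap<>(additional);
        // ... then let per-document entries override them on key clashes (assumption).
        merged.putAll(documentMetadata);
        return merged;
    }
}
```

For example, merging `{title=BBB}` with a fixed `{service=docs-service}` yields a document whose metadata contains both entries.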
TODO
- Add `MarkdownDocumentReader` to the documentation