Contributor

@piotrooo commented Jul 24, 2024

Motivation

@markpollack and @tzolov, you may be interested in a new MarkdownDocumentReader that reads structured Markdown documents. As @markpollack wrote in #105, it could be valuable, and I agree.

So, I've prepared a simple implementation of that DocumentReader.

Description

For parsing Markdown documents, I've used the commonmark/commonmark-java library.

Document dividing

By default, documents are split at headers of every level, 1 through 6. For a simple document like:

# AAA
content 1

## BBB
content 2

### CCC
content 3

#### DDD
content 4

##### EEE
content 5

###### FFF
content 6

Six documents will be generated. Each of these documents will have entries in the metadata as follows:

  • category => header_X, where X is the number of the header
  • title => <header title>, e.g.: BBB from the example
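To make the splitting rule concrete, here is a rough, dependency-free sketch of the idea. It is not the code in this PR (which parses with commonmark-java); `HeaderSplitSketch`, `Chunk`, and `split` are hypothetical names, and the line-based regex is a simplification that would, for example, misread `#` lines inside a fenced code block.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of header-based splitting; not the reader from this PR. */
final class HeaderSplitSketch {

    /** Hypothetical stand-in for a document: text plus metadata. */
    record Chunk(String text, Map<String, Object> metadata) {}

    // ATX header: one to six '#' characters, whitespace, then the title.
    private static final Pattern HEADER = Pattern.compile("^(#{1,6})\\s+(.*)$");

    static List<Chunk> split(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        Map<String, Object> metadata = new LinkedHashMap<>();
        StringBuilder text = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A header closes the chunk being built and starts a new one.
                flush(chunks, text, metadata);
                metadata = new LinkedHashMap<>();
                metadata.put("category", "header_" + m.group(1).length());
                metadata.put("title", m.group(2).trim());
            } else if (!line.isBlank()) {
                // Join the lines of a section with single spaces.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(line.trim());
            }
        }
        flush(chunks, text, metadata);
        return chunks;
    }

    // Emits the pending chunk, if it has any text, and resets the buffer.
    private static void flush(List<Chunk> chunks, StringBuilder text, Map<String, Object> metadata) {
        if (text.length() > 0) {
            chunks.add(new Chunk(text.toString(), metadata));
            text.setLength(0);
        }
    }
}
```

Calling `HeaderSplitSketch.split(...)` on the six-header sample above yields six chunks; the second one carries `category=header_2` and `title=BBB`. In this sketch a header with no content under it produces no chunk, which is an assumption rather than documented behavior of the reader.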

There is also an option to divide the Markdown document by horizontal lines. This is not the default option, but it can be turned on through configuration.
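As a rough illustration of that option, the sketch below splits text at horizontal rules only when a flag is set. Everything here is hypothetical (`HorizontalRuleSplitSketch`, `split`, and the `splitOnHorizontalRule` flag are not names from this PR), and the regex only approximates the CommonMark rule; for instance, it ignores the fact that `---` directly under a paragraph is a setext header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of opt-in splitting at horizontal rules; not the reader from this PR. */
final class HorizontalRuleSplitSketch {

    // Three or more of the same '-', '*' or '_' character, optionally separated by spaces.
    private static final Pattern RULE =
            Pattern.compile("^ {0,3}([-*_])(?:[ \\t]*\\1){2,}[ \\t]*$");

    static List<String> split(String markdown, boolean splitOnHorizontalRule) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            if (RULE.matcher(line).matches()) {
                // The rule itself carries no text; it only marks a boundary when enabled.
                if (splitOnHorizontalRule) {
                    flush(parts, current);
                }
                continue;
            }
            if (!line.isBlank()) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line.trim());
            }
        }
        flush(parts, current);
        return parts;
    }

    // Emits the pending part, if it has any text, and resets the buffer.
    private static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }
}
```

With the flag off, the rule is skipped and the surrounding text stays in one part, which mirrors the "not the default" behavior described above.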

Blockquotes and Code Blocks support

All blockquotes and code blocks are treated as separate documents. For code blocks where the language can be determined, it is included in the lang metadata entry.

This behavior can be changed by setting options.
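A simplified sketch of how blockquotes and fenced code blocks could end up as their own documents. This is not the PR's implementation: `BlockSplitSketch` and `Chunk` are hypothetical, the `category` values (`text`, `blockquote`, `code_block`) are assumptions made for the example, and only the `lang` key comes from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: blockquotes and fenced code blocks become separate chunks. */
final class BlockSplitSketch {

    /** Hypothetical stand-in for a document: text plus metadata. */
    record Chunk(String text, Map<String, Object> metadata) {}

    // A code fence marker: three backtick characters.
    private static final String FENCE = "`".repeat(3);

    static List<Chunk> split(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String kind = "text"; // "text", "blockquote" or "code_block" (assumed labels)
        String lang = "";

        for (String line : markdown.split("\\R")) {
            if (kind.equals("code_block")) {
                if (line.startsWith(FENCE)) {
                    // Closing fence: emit the code block and return to plain text.
                    flush(chunks, text, kind, lang);
                    kind = "text";
                    lang = "";
                } else {
                    text.append(line).append('\n'); // keep code lines verbatim
                }
                continue;
            }
            if (line.startsWith(FENCE)) {
                // Opening fence: whatever follows the marker is taken as the language.
                flush(chunks, text, kind, lang);
                kind = "code_block";
                lang = line.substring(FENCE.length()).trim();
                continue;
            }
            String next = line.startsWith(">") ? "blockquote" : "text";
            if (!next.equals(kind)) {
                flush(chunks, text, kind, lang);
                kind = next;
            }
            String content = next.equals("blockquote") ? line.substring(1).trim() : line.trim();
            if (!content.isEmpty()) {
                text.append(content).append('\n');
            }
        }
        flush(chunks, text, kind, lang);
        return chunks;
    }

    // Emits the pending chunk, if it has any text, and resets the buffer.
    private static void flush(List<Chunk> chunks, StringBuilder text, String kind, String lang) {
        if (text.length() > 0) {
            Map<String, Object> metadata = new LinkedHashMap<>();
            metadata.put("category", kind);
            if (!lang.isEmpty()) {
                metadata.put("lang", lang); // only set when the fence names a language
            }
            chunks.add(new Chunk(text.toString().stripTrailing(), metadata));
            text.setLength(0);
        }
    }
}
```

The sketch has no switch for folding these blocks back into the surrounding text; in the reader that choice is left to the options mentioned above.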

Additional metadata

The Markdown reader configuration also supports additional metadata, which is applied to every processed document. These are fixed values that give more context about the created document, such as the name of the service that provides it or the environment in which it was created.
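The effect can be pictured as a merge of the fixed entries into each document's own metadata. The sketch below is illustrative only: `AdditionalMetadataSketch`, `Chunk`, and `withAdditionalMetadata` are hypothetical names, and letting per-document entries win on a key clash is an assumption, not something this PR specifies.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: fixed metadata merged into every produced chunk. */
final class AdditionalMetadataSketch {

    /** Hypothetical stand-in for a document: text plus metadata. */
    record Chunk(String text, Map<String, Object> metadata) {}

    static List<Chunk> withAdditionalMetadata(List<Chunk> chunks, Map<String, Object> additional) {
        return chunks.stream()
                .map(chunk -> {
                    // Start from the fixed entries, then layer the chunk's own entries on top,
                    // so per-chunk values win on a key clash (an assumption of this sketch).
                    Map<String, Object> merged = new LinkedHashMap<>(additional);
                    merged.putAll(chunk.metadata());
                    return new Chunk(chunk.text(), merged);
                })
                .toList();
    }
}
```

For example, passing `Map.of("service", "docs-api", "environment", "staging")` (made-up values) adds both entries to every chunk while leaving entries such as `title` untouched.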

TODO

@markpollack
Member

Hi, I'm super happy to see this. I'll review asap. I have a version of this on my machine from way back when I started the project. Markdown is a great "lingua franca" for document ETL.

@markpollack added this to the 1.0.0-M2 milestone Jul 24, 2024
@markpollack self-assigned this Jul 24, 2024
@piotrooo
Contributor Author

@markpollack great to hear it!

Any ideas or enhancements are more than appreciated.

I also have some future ideas about handling tables, but first things first. Baby steps.

@markpollack
Member

I've added docs. I haven't tried it in anger yet but it looks great. Merged in 56e678c

@piotrooo deleted the feature/introduce-markdown-etl branch August 22, 2024 20:07