Introduce MarkdownDocumentReader #1106

piotrooo · 2024-07-24T07:27:45Z

Motivation

@markpollack and @tzolov, maybe you are interested in the new MarkdownDocumentReader, which can read structured Markdown documents. As @markpollack wrote in #105, it could be valuable. I agree.

So, I've prepared a simple implementation of that DocumentReader.

Description

For parsing Markdown documents, I've used the commonmark/commonmark-java library.

Document dividing

By default, all documents are divided by headers. This includes all header types from 1 to 6. For a simple document like:

# AAA
content 1

## BBB
content 2

### CCC
content 3

#### DDD
content 4

##### EEE
content 5

###### FFF
content 6

Six documents will be generated. Each of these documents will have entries in the metadata as follows:

category => header_X, where X is the number of the header
title => <header title>, e.g.: BBB from the example
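
To make the header-based division concrete, here is a minimal, dependency-free sketch. It is not the code from this PR: the PR parses with commonmark-java, while this illustration uses a regular expression for ATX headings. `HeaderSplitSketch`, `Chunk`, and `splitByHeaders` are hypothetical names, and skipping sections that have no body text is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the PR parses Markdown with commonmark-java, not with a regex.
class HeaderSplitSketch {

    // Hypothetical stand-in for a document: text plus metadata.
    record Chunk(String content, Map<String, Object> metadata) {}

    // ATX heading: one to six '#', whitespace, then the title.
    private static final Pattern HEADING = Pattern.compile("^(#{1,6})\\s+(.*?)\\s*$");

    static List<Chunk> splitByHeaders(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        Map<String, Object> metadata = new LinkedHashMap<>();
        StringBuilder content = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            Matcher heading = HEADING.matcher(line);
            if (heading.matches()) {
                // A heading closes the previous chunk and opens a new one.
                flush(chunks, content, metadata);
                metadata = new LinkedHashMap<>();
                metadata.put("category", "header_" + heading.group(1).length());
                metadata.put("title", heading.group(2));
            } else {
                content.append(line).append('\n');
            }
        }
        flush(chunks, content, metadata);
        return chunks;
    }

    // Emits the buffered text as a chunk; sections without body text are skipped (a simplification).
    private static void flush(List<Chunk> chunks, StringBuilder content, Map<String, Object> metadata) {
        String text = content.toString().strip();
        if (!text.isEmpty()) {
            chunks.add(new Chunk(text, metadata));
        }
        content.setLength(0);
    }

    public static void main(String[] args) {
        String markdown = "# AAA\ncontent 1\n\n## BBB\ncontent 2\n";
        for (Chunk chunk : splitByHeaders(markdown)) {
            System.out.println(chunk.metadata() + " -> " + chunk.content());
        }
    }
}
```

A line-based regex like this would also treat a `#` comment inside a code block as a heading, which is one reason a real parser is the better choice for the actual reader.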

There is also an option to divide the Markdown document by horizontal lines. This is not the default option, but it can be turned on through configuration.
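
The opt-in nature of horizontal-line splitting can be sketched as follows. This is a simplified illustration and not the PR's implementation: `HorizontalRuleSplitSketch` and the `splitOnHorizontalRule` flag are hypothetical names, and the regex only approximates CommonMark thematic breaks (it ignores, for example, the case where `---` directly under a paragraph is a setext heading underline).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a flag decides whether horizontal rules start a new document.
class HorizontalRuleSplitSketch {

    // Three or more matching '-', '*' or '_' characters, optionally separated by spaces.
    private static final Pattern RULE =
            Pattern.compile("^ {0,3}([-*_])(?:[ \\t]*\\1){2,}[ \\t]*$");

    static List<String> split(String markdown, boolean splitOnHorizontalRule) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            if (splitOnHorizontalRule && RULE.matcher(line).matches()) {
                // The rule itself is dropped; it only marks a boundary.
                addIfNotBlank(parts, current);
            } else {
                current.append(line).append('\n');
            }
        }
        addIfNotBlank(parts, current);
        return parts;
    }

    private static void addIfNotBlank(List<String> parts, StringBuilder current) {
        String text = current.toString().strip();
        if (!text.isEmpty()) {
            parts.add(text);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String markdown = "first part\n\n---\n\nsecond part\n";
        System.out.println(split(markdown, false).size()); // flag off: rule stays in the text
        System.out.println(split(markdown, true).size());  // flag on: rule separates documents
    }
}
```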

Blockquotes and Code Blocks support

All blockquotes and code blocks are treated as separate documents. For code blocks where the language can be determined, it is included in the lang metadata entry.

This behavior can be changed by setting options.
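
As a rough illustration of the code block handling, the sketch below pulls fenced code blocks into their own chunks and records the language in a `lang` entry. It is an assumption-laden approximation, not the PR's code: `CodeBlockSplitSketch` and the `separateCodeBlocks` flag are hypothetical, blockquotes are left out for brevity, and treating the first word of the fence's info string as the language follows the usual CommonMark convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: fenced code blocks become separate chunks with a "lang" entry.
class CodeBlockSplitSketch {

    // Hypothetical stand-in for a document: text plus metadata.
    record Chunk(String content, Map<String, Object> metadata) {}

    // Built at runtime so this example contains no literal fence line.
    static final String FENCE = "`".repeat(3);

    static List<Chunk> split(String markdown, boolean separateCodeBlocks) {
        List<Chunk> chunks = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        StringBuilder code = new StringBuilder();
        boolean inCode = false;
        String lang = "";

        for (String line : markdown.split("\\R")) {
            if (line.startsWith(FENCE)) {
                if (!inCode) {
                    // Opening fence: the first word of the info string names the language.
                    lang = line.substring(FENCE.length()).strip().split("\\s+")[0];
                    if (separateCodeBlocks) {
                        flush(chunks, text, Map.of());
                    }
                } else if (separateCodeBlocks) {
                    // Closing fence: emit the code as its own chunk.
                    flush(chunks, code, langMetadata(lang));
                }
                inCode = !inCode;
            } else {
                StringBuilder target = (inCode && separateCodeBlocks) ? code : text;
                target.append(line).append('\n');
            }
        }
        flush(chunks, code, langMetadata(lang)); // covers an unclosed fence
        flush(chunks, text, Map.of());
        return chunks;
    }

    private static Map<String, Object> langMetadata(String lang) {
        if (lang.isEmpty()) {
            return Map.of();
        }
        return Map.of("lang", lang);
    }

    // strip() also trims indentation on the first code line; acceptable for a sketch.
    private static void flush(List<Chunk> chunks, StringBuilder content, Map<String, Object> metadata) {
        String body = content.toString().strip();
        if (!body.isEmpty()) {
            chunks.add(new Chunk(body, metadata));
        }
        content.setLength(0);
    }

    public static void main(String[] args) {
        String markdown = "intro\n\n" + FENCE + "java\nint x = 1;\n" + FENCE + "\n\noutro\n";
        split(markdown, true).forEach(c -> System.out.println(c.metadata() + " -> " + c.content()));
    }
}
```

With the flag off, the sketch keeps the code inline with the surrounding text, which is one plausible reading of "can be changed by setting options".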

Additional metadata

The Markdown reader configuration also provides support for additional metadata, which may be set for all processed documents. It contains fixed values that offer more context about the created document, such as the service name that provides the document, or the environment in which it was created.
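
Conceptually, the additional metadata is a fixed map that gets merged into the metadata of every produced document. The sketch below shows that merge in isolation; `AdditionalMetadataSketch`, `withAdditional`, and the `service` and `environment` keys are hypothetical, and letting document-specific entries win on a key clash is an assumption, not something this PR specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: fixed, reader-wide metadata merged into each document's own metadata.
class AdditionalMetadataSketch {

    static Map<String, Object> withAdditional(Map<String, Object> documentMetadata,
                                              Map<String, Object> additionalMetadata) {
        Map<String, Object> merged = new LinkedHashMap<>(additionalMetadata);
        // Assumption: entries derived from the document override the fixed ones on a clash.
        merged.putAll(documentMetadata);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> additional = new LinkedHashMap<>();
        additional.put("service", "docs-service");   // hypothetical value
        additional.put("environment", "staging");    // hypothetical value

        Map<String, Object> fromHeader = new LinkedHashMap<>();
        fromHeader.put("category", "header_1");
        fromHeader.put("title", "AAA");

        System.out.println(withAdditional(fromHeader, additional));
    }
}
```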

TODO

Add MarkdownDocumentReader to the documentation
Update the ETL class diagram

markpollack · 2024-07-24T14:14:14Z

Hi, I'm super happy to see this. I'll review ASAP. I have a version of this on my machine from way back when I started the project. Markdown is a great "lingua franca" for document ETL.

piotrooo · 2024-07-24T15:31:20Z

@markpollack great to hear it!

Any ideas or enhancements are more than appreciated.

I also have some future ideas about handling tables, but first things first. Baby steps.

markpollack · 2024-08-22T17:37:29Z

I've added docs. I haven't tried it in anger yet but it looks great. Merged in 56e678c

piotrooo added 11 commits July 23, 2024 18:13

Start working with Markdown document reader

6dc825e

Add documents content for text with formatting

b18d91a

Handle horizontal rules

412c302

Handle hard line break

bf606c8

Handle hard line break - refactor

b3303d0

Handle hard line break - refactor

90ed6f3

Handle inline and block codes

aeae2ad

Handle blockquote

28634d9

Handle ordered and unordered lists

f90cde4

Add JavaDocs

5a2b967

Introduce additional metadata

4a71e01

markpollack added this to the 1.0.0-M2 milestone Jul 24, 2024

markpollack added the ETL label Jul 24, 2024

markpollack self-assigned this Jul 24, 2024

markpollack closed this Aug 22, 2024

piotrooo deleted the feature/introduce-markdown-etl branch August 22, 2024 20:07
