Commit 68e78df

Add Markdown/Text Parser (#1381)
* add markdown parser
* add example file in markdown
* fix formatting and linting errors
* rename company in markdown example file
* rename markdownparser to textparser
* fix typo
* remove empty space from test case
* add txt support + fix import
* Apply suggestions from code review

---------

Co-authored-by: Pamela Fox <[email protected]>
1 parent 3d86d24 commit 68e78df

4 files changed: +141 -0 lines changed

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Contoso Electronics
+
+*Disclaimer: This content is generated by AI and may not accurately represent factual information about any real entity. Use this information with caution and verify details from reliable sources.*
+
+## History
+
+Contoso Electronics, a pioneering force in the tech industry, was founded in 1985 by visionary entrepreneurs with a passion for innovation. Over the years, the company has played a pivotal role in shaping the landscape of consumer electronics.
+
+| Year | Milestone |
+|------|-----------|
+| 1985 | Company founded with a focus on cutting-edge technology |
+| 1990 | Launched the first-ever handheld personal computer |
+| 2000 | Introduced groundbreaking advancements in AI and robotics |
+| 2015 | Expansion into sustainable and eco-friendly product lines |
+
+## Company Overview
+
+At Contoso Electronics, we take pride in fostering a dynamic and inclusive workplace. Our dedicated team of experts collaborates to create innovative solutions that empower and connect people globally.
+
+### Core Values
+
+- **Innovation:** Constantly pushing the boundaries of technology.
+- **Diversity:** Embracing different perspectives for creative excellence.
+- **Sustainability:** Committed to eco-friendly practices in our products.
+
+## Vacation Perks
+
+We believe in work-life balance and understand the importance of well-deserved breaks. Our vacation perks are designed to help our employees recharge and return with renewed enthusiasm.
+
+| Vacation Tier | Duration | Additional Benefits |
+|---------------|----------|---------------------|
+| Standard | 2 weeks | Health and wellness stipend |
+| Senior | 4 weeks | Travel vouchers for a dream destination |
+| Executive | 6 weeks | Luxury resort getaway with family |
+
+## Employee Recognition
+
+Recognizing the hard work and dedication of our employees is at the core of our culture. Here are some ways we celebrate achievements:
+
+- Monthly "Innovator of the Month" awards
+- Annual gala with awards for outstanding contributions
+- Team-building retreats for high-performing departments
+
+## Join Us!
+
+Contoso Electronics is always on the lookout for talented individuals who share our passion for innovation. If you're ready to be part of a dynamic team shaping the future of technology, check out our [careers page](http://www.contoso.com) for exciting opportunities.
+
+[Learn more about Contoso Electronics!](http://www.contoso.com)

scripts/prepdocs.py

Lines changed: 3 additions & 0 deletions
@@ -29,6 +29,7 @@
 from prepdocslib.parser import Parser
 from prepdocslib.pdfparser import DocumentAnalysisParser, LocalPdfParser
 from prepdocslib.strategy import DocumentAction, SearchInfo, Strategy
+from prepdocslib.textparser import TextParser
 from prepdocslib.textsplitter import SentenceTextSplitter, SimpleTextSplitter
 
 
@@ -87,6 +88,8 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         ".tiff": FileProcessor(doc_int_parser, sentence_text_splitter),
         ".bmp": FileProcessor(doc_int_parser, sentence_text_splitter),
         ".heic": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".md": FileProcessor(TextParser(), sentence_text_splitter),
+        ".txt": FileProcessor(TextParser(), sentence_text_splitter),
     }
     use_vectors = not args.novectors
     embeddings: Optional[OpenAIEmbeddings] = None
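
The hunk above registers `.md` and `.txt` in a table keyed by file extension. As a rough illustration of how such a table can be consulted, here is a minimal sketch. The `FileProcessor` dataclass, the `FILE_PROCESSORS` table, and `pick_processor` are hypothetical stand-ins written for this example, not the project's code, and lowercasing the extension is an assumption rather than confirmed behavior.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileProcessor:
    """Hypothetical stand-in: pairs a parser name with a splitter name."""

    parser: str
    splitter: str


# Hypothetical table mirroring the two entries added in this commit.
FILE_PROCESSORS: Dict[str, FileProcessor] = {
    ".md": FileProcessor("TextParser", "SentenceTextSplitter"),
    ".txt": FileProcessor("TextParser", "SentenceTextSplitter"),
}


def pick_processor(filename: str) -> Optional[FileProcessor]:
    """Return the processor registered for the file's extension, or None."""
    _, ext = os.path.splitext(filename)
    # Assumption: extensions are matched case-insensitively.
    return FILE_PROCESSORS.get(ext.lower())


print(pick_processor("handbook.md"))
print(pick_processor("handbook.docx"))
```

Both new extensions share one parser class because, in this commit, Markdown is treated as plain text rather than rendered or stripped of its markup.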

scripts/prepdocslib/textparser.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+import re
+from typing import IO, AsyncGenerator
+
+from .page import Page
+from .parser import Parser
+
+
+def cleanup_data(data: str) -> str:
+    """Cleans up the given content using regexes
+    Args:
+        data (str): The data to clean up.
+    Returns:
+        str: The cleaned up data.
+    """
+    # match two or more consecutive newlines and replace them with one newline
+    output = re.sub(r"\n{2,}", "\n", data)
+    # match two or more whitespace characters other than newlines and replace them with one space
+    output = re.sub(r"[^\S\n]{2,}", " ", output)
+
+    return output.strip()
+
+
+class TextParser(Parser):
+    """Parses simple text into a Page object."""
+
+    async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
+        data = content.read()
+        decoded_data = data.decode("utf-8")
+        text = cleanup_data(decoded_data)
+        yield Page(0, 0, text=text)
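
To make the effect of the two regexes concrete, the sketch below copies `cleanup_data` from the diff into a standalone script and runs it on a small sample. The sample string is invented for illustration; only the function body comes from the commit.

```python
import re


def cleanup_data(data: str) -> str:
    # Same two substitutions as in the diff above.
    output = re.sub(r"\n{2,}", "\n", data)  # runs of newlines -> one newline
    output = re.sub(r"[^\S\n]{2,}", " ", output)  # runs of other whitespace -> one space
    return output.strip()


# Invented sample: blank lines, repeated spaces, repeated tabs, an indented line.
sample = "# Title\n\n\nfirst   line\t\tsecond\n\n  indented"
cleaned = cleanup_data(sample)
print(repr(cleaned))  # '# Title\nfirst line second\n indented'
```

Two consequences of the patterns are worth keeping in mind. A single tab is left alone, because the second pattern only fires on runs of two or more characters. A line that contains only spaces is not treated as blank by the first pattern, so `"a\n  \nb"` becomes `"a\n \nb"` rather than `"a\nb"`.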

tests/test_textparser.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+import io
+
+import pytest
+
+from scripts.prepdocslib.textparser import TextParser
+
+
+@pytest.mark.asyncio
+async def test_textparser_remove_new_lines():
+    file = io.BytesIO(
+        b"""
+        # Text Example with multiple empty lines
+        this is paragraph 1
+
+
+
+        and this is paragraph 2
+        """
+    )
+    parser = TextParser()
+    pages = [page async for page in parser.parse(file)]
+    assert len(pages) == 1
+    assert pages[0].page_num == 0
+    assert pages[0].offset == 0
+    assert pages[0].text == "# Text Example with multiple empty lines\n this is paragraph 1\n and this is paragraph 2"
+
+
+@pytest.mark.asyncio
+async def test_textparser_remove_white_spaces():
+    file = io.BytesIO(b" Test multiple white spaces ")
+    parser = TextParser()
+    pages = [page async for page in parser.parse(file)]
+    assert pages[0].text == "Test multiple white spaces"
+
+
+@pytest.mark.asyncio
+async def test_textparser_full():
+    file = io.BytesIO(
+        b"""
+        # Text Example
+        Some short text here, with bullets:
+        * write code
+        * test code
+        * merge code
+
+
+        ## Subheading
+        Some more text here with a link to Azure. Here's a the link to [Azure](https://azure.microsoft.com/).
+        """
+    )
+    file.name = "test.md"
+    parser = TextParser()
+    pages = [page async for page in parser.parse(file)]
+    assert len(pages) == 1
+    assert pages[0].page_num == 0
+    assert pages[0].offset == 0
+    assert (
+        pages[0].text
+        == "# Text Example\n Some short text here, with bullets:\n * write code\n * test code\n * merge code\n ## Subheading\n Some more text here with a link to Azure. Here's a the link to [Azure](https://azure.microsoft.com/)."
+    )
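
The tests consume `parse` with an async list comprehension because it is an async generator. Outside pytest, the same pattern can be driven with `asyncio.run`. The sketch below is self-contained and assumes simplified stand-ins: this `Page` dataclass and this `TextParser` (which does not inherit from the project's `Parser`) are written for illustration and only mirror the field names and behavior visible in the diff. `collect_pages` is a hypothetical helper.

```python
import asyncio
import io
import re
from dataclasses import dataclass
from typing import IO, AsyncGenerator, List


@dataclass
class Page:
    """Stand-in with the three fields the tests read."""

    page_num: int
    offset: int
    text: str


class TextParser:
    """Stand-in parser: decodes bytes as UTF-8 and yields one cleaned Page."""

    async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
        text = content.read().decode("utf-8")
        text = re.sub(r"\n{2,}", "\n", text)
        text = re.sub(r"[^\S\n]{2,}", " ", text)
        yield Page(0, 0, text=text.strip())


async def collect_pages(raw: bytes) -> List[Page]:
    """Hypothetical helper: drain the async generator into a list."""
    parser = TextParser()
    return [page async for page in parser.parse(io.BytesIO(raw))]


pages = asyncio.run(collect_pages(b"# Notes\n\n\nfirst  item\n"))
print(pages[0].text)
```

Whatever the input size, the parser yields exactly one page with `page_num` and `offset` both set to 0, which is why the tests assert `len(pages) == 1`.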
