This repository was archived by the owner on Dec 9, 2022. It is now read-only.

Commit a4659ec

Update README.md
1 parent 9252a45 commit a4659ec

1 file changed

+20
-2
lines changed

README.md

Lines changed: 20 additions & 2 deletions
@@ -4,8 +4,26 @@ GitHub contains a large corpus data that is amenable for NLP, in the form of Iss
44

55
## 1. Insert custom field indicators
66

7-
This is so markdown information is not lost. For example, a list block is enclosed with `xxxlistB` and `xxxlistE` and a code block is enclosed with `xxxcdb` and `xxxcde`.
7+
This is so markdown information is not lost. For example, a list block is enclosed with `xxxlistB` and `xxxlistE` and a code block is enclosed with `xxxcdb` and `xxxcde`. Other notable examples:
8+
9+
- @mentions: xxxatmention (the handle is removed and replaced by just this indicator)
10+
- quote blocks: xxxqb/xxxqe
11+
- strikethrough: xxxdelb/xxxdele
12+
- horizontal rule: xxxhr
13+
- {large, medium, small} headers: annotated with xxxh{l,m,s}. H1=large, H2-3=medium, H4-6=small.
14+
-
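
The indicators listed above can be illustrated with a minimal sketch. This is not the library's implementation: the function names `header_indicator` and `annotate_line`, the regular expressions, and the spacing around indicators are all assumptions made for illustration. Only the indicator strings and the header size mapping come from the README text.

```python
import re


def header_indicator(level: int) -> str:
    """Map a markdown header level (1-6) to a size indicator.

    H1 = large, H2-3 = medium, H4-6 = small, as the README describes.
    """
    if level == 1:
        return "xxxhl"
    if level in (2, 3):
        return "xxxhm"
    return "xxxhs"


def annotate_line(line: str) -> str:
    """Hypothetical helper: annotate one line of markdown with indicators."""
    # @mentions: the handle is dropped and only the indicator remains.
    line = re.sub(r"(?<!\w)@[A-Za-z0-9-]+", "xxxatmention", line)
    # Strikethrough: wrap the struck text in begin/end indicators
    # (the surrounding spaces are an assumption).
    line = re.sub(r"~~(.+?)~~", r"xxxdelb \1 xxxdele", line)
    # Horizontal rule: a line made only of three or more dashes.
    if re.fullmatch(r"\s*-{3,}\s*", line):
        return "xxxhr"
    # Headers: replace the leading '#' run with the size indicator.
    match = re.match(r"(#{1,6})\s+(.*)", line)
    if match:
        return f"{header_indicator(len(match.group(1)))} {match.group(2)}"
    return line


print(annotate_line("## Setup"))
print(annotate_line("thanks @octocat for ~~nothing~~ this"))
```

Replacing the handle rather than keeping it presumably stops a model from memorizing individual usernames while still recording that a mention occurred.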
815

916
## 2. Discard superfluous information
1017

11-
Documentation TBD
18+
GitHub issues often contain a large stack trace or a large table of data. This library comes equipped with sensible defaults to surface the most relevant information and discard what would otherwise be lots of characters for a machine learning algorithm to handle:
19+
20+
- Code Blocks: only first two and last two rows are kept
21+
- Tables: only table headers are kept
22+
- URLs: only the host is kept. For example, www.google.com/search is reformatted to www.google.com
23+
- Images: the image is discarded, but the file extension and metadata about the image (available to screen readers) are extracted.
24+
- IP addresses and extremely long numbers are marked as xxunk
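
A few of the discard rules above can be sketched as follows. This is a rough illustration under stated assumptions, not the library's code: the helper names (`truncate_code_block`, `table_header_only`, `url_host_only`, `mask_numbers`) are hypothetical, and the 10-digit threshold for an "extremely long" number is a guess, since the README does not state one.

```python
import re
from urllib.parse import urlparse


def truncate_code_block(rows, keep=2):
    """Keep only the first `keep` and last `keep` rows of a code block."""
    if len(rows) <= 2 * keep:
        return list(rows)
    return list(rows[:keep]) + list(rows[-keep:])


def table_header_only(table_rows):
    """Keep only the header row of a table given as a list of rows."""
    return list(table_rows[:1])


def url_host_only(url):
    """Reduce a URL to its host, e.g. www.google.com/search -> www.google.com."""
    # urlparse only fills `netloc` when a scheme or a leading '//' is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.netloc


# Assumed patterns: a dotted IPv4 address, and a run of 10 or more digits.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
LONG_NUMBER_RE = re.compile(r"\b\d{10,}\b")


def mask_numbers(text):
    """Replace IP addresses and very long numbers with the xxunk marker."""
    text = IP_RE.sub("xxunk", text)
    return LONG_NUMBER_RE.sub("xxunk", text)


print(truncate_code_block(["a", "b", "c", "d", "e", "f"]))
print(url_host_only("www.google.com/search"))
print(mask_numbers("ping 192.168.0.1 failed with id 12345678901234"))
```

Keeping the head and tail of a code block tends to preserve the opening context and the final error line of a stack trace, which is usually where the signal is.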
25+
26+
27+
# Examples
28+
29+
See [/notebooks/Demo.ipynb](/notebooks/Demo.ipynb) for an example of the transformations this parser does on a markdown file.
