Commit 445f7ee

Revert "removing unnecessary files"
This reverts commit 32b8b74.
1 parent 32b8b74 commit 445f7ee

62 files changed, +2317 -0 lines changed

# Chatbot parser

`chatbot_parser.py` is a script that transforms the markdown source files into a structured directory that serves as input for a chatbot.

## Usage

The script can be run in a shell environment with the following command:

```shell
python chatbot_parser.py
```

This command accepts the following options:

```shell
chatbot_parser.py [-h] -src SOURCE -dst DESTINATION [-st] [-pl MIN_PARAGRAPH_LENGTH] [-td MAX_TITLE_DEPTH] [-l] [-dd]
```

### Options

#### `h`/`help`

Display the help message.

#### `src`/`source`

This is a required option that specifies the source directory of the input files for the script. This location is also used to look for jinja templates when using jinja to parse the source files (such as the `macros` directory within `vsc_user_docs/mkdocs/docs/HPC`).

#### `dst`/`destination`

This is a required option that specifies where the output of the script should be written. The script also generates intermediate subdirectories, so the destination directory shouldn't already contain subdirectories named `parsed_mds`, `copies` or `if_mangled_files`. If any of these names pose a problem, the names of the intermediate subdirectories can be changed in the macros at the top of the script.

#### `st`/`split_on_titles`

Including this option splits the source files based on the titles and subtitles in the markdown text. Without this option, the text is split into paragraphs of a certain minimum length.

#### `pl`/`min_paragraph_length`

This option configures the minimum length of a paragraph. Some deviations from this minimum length are possible (for example at the end of a file). The default value is 512 tokens. This option only works if `split_on_titles` is not enabled.

#### `td`/`max_title_depth`

This option configures the maximum "title depth" (the number of `#` characters in front of a title) to be used as a border between sections if `split_on_titles` is enabled. The default value is 4.

#### `l`/`links`

Some of the source files might contain links. Including this option retains the links in the plaintext. If this option is not included, the links are dropped from the plaintext.

#### `dd`/`deep_directories`

Including this option makes the script generate a "deep directory" in which every title encountered becomes a subdirectory of its parent title (for example, a title with three `#`s becomes a subdirectory of the most recent title with two `#`s). This option only works if `split_on_titles` is enabled.
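
For reference, the options above could be declared with `argparse` roughly as follows. This is only a sketch based on the usage line and the defaults documented here; the function name `build_parser`, the `--` prefix of the long option names and the `int` types are assumptions, and the actual script may declare its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="chatbot_parser.py")
    # Required input and output locations.
    parser.add_argument("-src", "--source", required=True,
                        help="source directory of the input files")
    parser.add_argument("-dst", "--destination", required=True,
                        help="directory the output is written to")
    # Splitting behaviour; the defaults are the ones documented above.
    parser.add_argument("-st", "--split_on_titles", action="store_true",
                        help="split on titles instead of on paragraphs")
    parser.add_argument("-pl", "--min_paragraph_length", type=int, default=512,
                        help="minimum paragraph length in tokens")
    parser.add_argument("-td", "--max_title_depth", type=int, default=4,
                        help="maximum title depth used as section border")
    # Output options.
    parser.add_argument("-l", "--links", action="store_true",
                        help="retain links in the plaintext")
    parser.add_argument("-dd", "--deep_directories", action="store_true",
                        help="nest one subdirectory per title")
    return parser


args = build_parser().parse_args(["-src", "docs", "-dst", "out", "-st", "-td", "3"])
```

With a parser like this, the argument list `-src docs -dst out -st -td 3` enables title splitting with a maximum title depth of 3 and leaves the other options at their defaults.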

## Generated file structure

The generated directory structure is written as a subdirectory of `parsed_mds`. In `parsed_mds`, two subdirectories can be found:

- `generic` contains the parts of the markdown sources that are not OS-specific
- `os_specific` contains the parts of the markdown sources that are OS-specific

Within `os_specific`, a further distinction is made between the three operating systems covered by the documentation.

Both the `generic` directory and each of the three OS-specific directories then contain a directory for each source file.

If the option `deep_directories` is not enabled, all paragraphs of the source file and their corresponding metadata are saved in this directory. The (processed) plaintext of each paragraph is written to a `.txt` file and its metadata is written to a `.json` file.

If the option `deep_directories` is enabled, the directory of each source file contains a subdirectory structure that mirrors the structure of the subtitles at different levels in the source file. Each subtitle in the source file corresponds to a directory nested in the directory of its parent title (for example, a title with three `#`s becomes a subdirectory of the most recent title with two `#`s).

Finally, each of these subtitle-specific subdirectories contains a `.txt` file with the (processed) plaintext of that section and a `.json` file with the metadata of that section.
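
The layout described above can be illustrated with a small path-building helper. This is a hypothetical sketch, not code from the script: the function name `section_directory`, the use of the file stem as directory name, the OS directory name `linux` and the placement of `parsed_mds` directly under the destination are all assumptions.

```python
from pathlib import Path


def section_directory(destination, source_file, title_chain,
                      os_name=None, deep_directories=False):
    """Return the directory a section would be written to (illustrative)."""
    root = Path(destination) / "parsed_mds"
    if os_name is None:
        # Content that is not OS-specific.
        root = root / "generic"
    else:
        # One subdirectory per operating system (names are assumed).
        root = root / "os_specific" / os_name
    # Assumption: one directory per source file, named after the file stem.
    file_dir = root / Path(source_file).stem
    if deep_directories:
        # Every title becomes a subdirectory of its parent title.
        return file_dir.joinpath(*title_chain)
    return file_dir


flat = section_directory("out", "docs/account.md", ["Getting started", "Keys"])
deep = section_directory("out", "docs/account.md", ["Getting started", "Keys"],
                         os_name="linux", deep_directories=True)
```

Here `flat` points at the directory of the source file itself, while `deep` ends in one nested directory per title in `title_chain`.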

## Requirements

- The required Python packages are listed in `requirements.txt`

## Restrictions on source-files

Due to the nature of the script, the markdown files it uses as input are subject to some restrictions.

### Nested if structures

The script uses the if-structures in the source files to split the documentation into general documentation and OS-specific documentation. As such, it needs to keep track of which type of if-structure (OS-related or non-OS-related) it is reading from. Certain nested if-structures cause problems here. The supported nested if-structures are determined by the macros `NON_OS_IF`, `NON_OS_IF_IN_OS_IF`, `OS_IF` and `OS_IF_IN_OS_IF`. These stand for, respectively, a non-OS-related if-structure, a non-OS-related if-structure nested in an OS-related one, an OS-related if-structure, and an OS-related if-structure nested in another OS-related if-structure. All of these may be nested in any number of non-OS-related if-structures, but no non-OS-related if-structures should be nested inside them. It is also not allowed to nest any of the allowed structures in additional OS-related if-structures.

#### Examples of valid and invalid if-structures

##### Allowed

###### non-os-related in os-related

This is an example of one of the basic allowed if-structures (`NON_OS_IF_IN_OS_IF`).

```
if OS == windows:
    if site == Gent:
        ...
    endif
endif
```

###### os-related in os-related in non-os-related

This is an example of the basic allowed if-structure `OS_IF_IN_OS_IF` nested in a non-os-specific if.

```
if site == Gent:
    if OS == windows:
        ...
    else:
        if OS == Linux:
            ...
        endif
    endif
endif
```
114+
##### Not allowed
115+
116+
###### non-os-related in os-related in os-related
117+
118+
This is an example of a non-os-related if-structure nested in one of the basic allowed if-structures (`OS_IF_IN_OS_IF`).
119+
120+
```
121+
if OS != windows:
122+
if OS == Linux:
123+
if site == Gent:
124+
...
125+
endif
126+
endif
127+
endif
128+
```
129+
130+
This will result in the parser "forgetting" it opened an os-specific if-statement with OS != windows and not properly closing it.
131+
132+
###### os-related in non-os-related in os-related
133+
134+
This is an example of the basic allowed if-structure `OS_IF` (indirectly) nested in an os-specific if-structure.
135+
136+
```
137+
if OS != windows:
138+
if site == Gent:
139+
if OS == Linux:
140+
...
141+
endif
142+
endif
143+
endif
144+
```
145+
146+
This will also result in the parser "forgetting" it opened an os-specific if-statement with OS != windows and not properly closing it.
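
One way to summarise the nesting rules is as a pattern over the chain of open if-structures, from outermost to innermost. The sketch below encodes each level as `N` (non-OS-related) or `O` (OS-related); this encoding and the function name `is_supported_nesting` are illustrative assumptions and reflect one reading of the rules above, not the script's actual bookkeeping.

```python
import re

# Any number of non-OS-related ifs, followed by exactly one of the four
# basic structures: NON_OS_IF (N), OS_IF (O), NON_OS_IF_IN_OS_IF (ON)
# or OS_IF_IN_OS_IF (OO).
ALLOWED_NESTING = re.compile(r"N*(?:N|O|ON|OO)")


def is_supported_nesting(chain):
    """Check a nesting chain such as "NOO" (outermost level first)."""
    return ALLOWED_NESTING.fullmatch(chain) is not None


# The four examples from this section, in order.
results = [is_supported_nesting(chain) for chain in ("ON", "NOO", "OON", "ONO")]
```

Under this reading, the two allowed examples (`ON` and `NOO`) are accepted and the two disallowed examples (`OON` and `ONO`) are rejected.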

### Non OS-related if-statements

Due to the way jinja parses the source files, the script slightly alters non-OS-specific if-statements as well. It expects if-statements of the following form:

```
{%- if site == gent %}
{% if site != (gent or brussel) %}
```

All spaces and the dash are optional. City names don't need to be fully lowercase, since the parser capitalizes them properly anyway.
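
As an illustration of the expected form, the following sketch recognises such if-statements with a regular expression and capitalizes the city names. The pattern and the name `parse_site_if` are assumptions made for this example; the script's own pattern may differ, and this sketch still requires whitespace between `if` and `site` and around `or`.

```python
import re

# Optional dash after "{%", optional spaces, "==" or "!=", and one or more
# site names joined by "or", optionally wrapped in parentheses.
SITE_IF = re.compile(
    r"\{%-?\s*if\s+site\s*(?P<op>==|!=)\s*"
    r"\(?\s*(?P<sites>\w+(?:\s+or\s+\w+)*)\s*\)?\s*%\}",
    re.IGNORECASE,
)


def parse_site_if(line):
    """Return (operator, capitalized site names), or None if nothing matches."""
    match = SITE_IF.search(line)
    if match is None:
        return None
    names = re.split(r"\s+or\s+", match["sites"], flags=re.IGNORECASE)
    return match["op"], [name.capitalize() for name in names]


single = parse_site_if("{%- if site == gent %}")
multiple = parse_site_if("{% if site != (gent or brussel) %}")
```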

### html syntax

The input shouldn't contain any HTML syntax. While some failsafes are in place, the script wasn't written with HTML input in mind.

### Comments

Any comments within the markdown files (for example TODOs) should be limited to one line and use the following syntax:

```
<!--your comment-->
```

Comments can be written in such a way that the script keeps them as input for the bot. To do so, put the marker `INPUT_FOR_BOT` in front of the content of the comment:

```
<!--INPUT_FOR_BOT: your comment for the bot-->
```

This will be reworked to

```
your comment for the bot
```

in the final output.
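
The comment handling described above can be sketched as follows. This is a minimal illustration rather than the script's implementation: the name `process_comments` is hypothetical, and it assumes that the marker is followed by a colon and that comments without the marker are dropped entirely.

```python
import re

# One-line comments only, as required above.
COMMENT = re.compile(r"<!--(.*?)-->")
MARKER = "INPUT_FOR_BOT:"


def process_comments(line):
    """Drop comments, but keep the content of INPUT_FOR_BOT comments."""
    def replace(match):
        content = match.group(1).strip()
        if content.startswith(MARKER):
            # Keep only the text after the marker.
            return content[len(MARKER):].strip()
        return ""

    return COMMENT.sub(replace, line)


kept = process_comments("<!--INPUT_FOR_BOT: your comment for the bot-->")
dropped = process_comments("<!--TODO: rewrite this section-->")
```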

### Long filepaths

Due to the nature of this script, it can generate large directories with very long names if `deep_directories` is enabled. Depending on the operating system, this can result in filepaths that are too long, so that files cannot be opened. A possible fix is to make sure the path to the location of the script is not too long. Other solutions are lowering `max_title_depth` or disabling `deep_directories`.

### Markdown lists

The parser is designed to detect lists and not split them over multiple paragraphs. It can detect lists with the markers `-`, `+` and `*`, as well as lists indexed with numbers or letters (one letter per list entry). It can handle list entries that are spread over multiple lines if the continuation lines are indented by at least two spaces. It can also handle list entries consisting of multiple paragraphs in this way, as long as the indentation is kept.
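
A rough sketch of this kind of list detection is shown below. The patterns and the names `is_list_item` and `is_continuation` are assumptions for illustration only; in particular, the sketch assumes that numbered and lettered entries end in `.` or `)`, which is not specified here.

```python
import re

# A bullet (-, +, *), a number, or a single letter, followed by whitespace
# and the entry text. Assumption: indexes are terminated by "." or ")".
LIST_ITEM = re.compile(r"\s*(?:[-+*]|\d+[.)]|[A-Za-z][.)])\s+\S")
# A continuation line is indented by at least two spaces.
CONTINUATION = re.compile(r" {2,}\S")


def is_list_item(line):
    """True if the line starts a new list entry."""
    return LIST_ITEM.match(line) is not None


def is_continuation(line):
    """True if the line can continue the previous list entry."""
    return CONTINUATION.match(line) is not None


lines = ["- first entry", "  spread over two lines", "2. numbered", "b) lettered"]
flags = [is_list_item(line) or is_continuation(line) for line in lines]
```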

### Links

Links are part of the metadata generated by the parser. For the links to be built correctly, links to external sites should always start with either `https://` or `http://`.
