|
7 | 7 | * `summarize`: means the input will be passed through a summarization prompt. |
8 | 8 | * `summarize_then_query`: summarize the text, then open the prompt to allow querying the source document directly. |
9 | 9 |
|
10 | | -* `--filetype`: str, default `infer` |
| 10 | +* `--filetype`: str, default `auto` |
11 | 11 | * the type of input. Depending on the value, other parameters |
12 | 12 | are required. If json_entries is used, each line of the input file can contain |
13 | 13 | any of those parameters, as long as the line is valid json. You can find |
14 | | - an example of json_entries file in `DocToolsLLM/docs/json_entries_example.txt` |
| 14 | + an example of json_entries file in `WinstonDoc/docs/json_entries_example.txt` |
15 | 15 |
|
16 | 16 | * Supported values: |
17 | | - * `infer`: will guess the appropriate filetype based on `--path`. |
| 17 | + * `auto`: will guess the appropriate filetype based on `--path`. |
18 | 18 | Irrelevant for some filetypes, e.g. if `--filetype=anki` |
19 | 19 | * `youtube`: `--path` must link to a youtube video |
20 | 20 | * `youtube_playlist`: `--path` must link to a youtube playlist |
21 | 21 | * `pdf`: `--path` is path to pdf |
22 | | - * `txt`: `--path` is path to txt |
| 22 | + * `text`: `--path` is path to a .txt file |
23 | 23 | * `url`: `--path` must be a valid http(s) link |
24 | 24 | * `anki`: must be set: `--anki_profile`. Optional: `--anki_deck`, |
25 | 25 | `--anki_notetype`, `--anki_template`, `--anki_tag_filter`. |
|
35 | 35 | be downloaded. Possible arguments are `--onlinemedia_url_regex`, |
36 | 36 | `--onlinemedia_resourcetype_regex`. Then arguments of `local_audio`. |
37 | 37 |
|
38 | | - * `json_entries`: `--path` is path to a txt file that contains a json |
| 38 | + * `json_entries`: `--path` is path to a text file that contains one json |
39 | 39 | object per line, each with at least a filetype and a path key/value, |
40 | 40 | but which can contain any of the parameters described here |
41 | 41 | * `recursive_paths`: `--path` is the starting path, `--pattern` is the globbing |
|
118 | 118 | if contains `hyde` but modelname contains `testing` then `hyde` will |
119 | 119 | be removed. |
120 | 120 |
|
121 | | -* `--query_eval_modelname`: str, default `"openrouter/anthropic/claude-3.5-sonnet:beta"` |
| 121 | +* `--query_eval_modelname`: str, default `"openai/gpt4o-mini"` |
122 | 122 | * Cheaper and quicker model than modelname. Used for intermediate |
123 | 123 | steps in the RAG, not used in other tasks. |
124 | 124 | If the value is not part of the model list of litellm, will use |
125 | 125 | fuzzy matching to find the best match. |
126 | 126 | None to disable. |
127 | 127 |
|
128 | | -* `--query_eval_check_number`: int, default `1` |
| 128 | +* `--query_eval_check_number`: int, default `4` |
129 | 129 | * number of passes to run with the eval llm to check if the document |
130 | 130 | is indeed relevant to the question. The document will not |
131 | 131 | be processed if all answers from the eval llm are 0, and will |
|
137 | 137 | * threshold under which a document cannot be considered relevant by |
138 | 138 | embeddings alone. |
139 | 139 |
|
140 | | -* `--query_condense_question`: bool, default `True` |
141 | | - * if True, will not use a special LLM call to reformulate the question |
142 | | - when task is `query`. Otherwise, the query will be reformulated as |
143 | | - a standalone question. Useful when you have multiple questions in |
144 | | - a row. |
145 | | - Disabled if using a testing model. |
146 | | - |
147 | 140 | --- |
148 | 141 |
|
149 | 142 | * `--summary_n_recursion`: int, default `1` |
|
187 | 180 | can be used for example to send notification on your phone |
188 | 181 | using ntfy.sh to get summaries. |
189 | 182 |
|
190 | | -* `--memoryless`: bool, default `False` |
191 | | - * if False, will remember the messages across a given chat exchange. |
192 | | - Disabled if using a testing model. |
193 | | - |
194 | 183 | * `--disable_llm_cache`: bool, default `False` |
195 | 184 | * WARNING: The cache is temporarily ignored in non-openai llm |
196 | 185 | generations because of an error with langchain's ChatLiteLLM. |
197 | 186 | Basically if you don't use `--private` and use an llm from openai, |
198 | | - DocToolsLLM will use ChatOpenAI with regular caching, otherwise |
| 187 | + WinstonDoc will use ChatOpenAI with regular caching, otherwise |
199 | 188 | we use ChatLiteLLM with LLM caching disabled. |
200 | 189 | More at https://github.com/langchain-ai/langchain/issues/22389 |
201 | 190 |
|
|
243 | 232 | to a loader. They apply depending on the value of `--filetype`. |
244 | 233 | An unexpected argument for a given filetype will result in a crash. |
245 | 234 |
|
246 | | -* `--path`: str |
| 235 | +* `--path`: str or PosixPath |
247 | 236 | * Used by most loaders. For example for `--filetype=youtube` the path |
248 | 237 | must point to a youtube video. |
249 | 238 |
|
|
311 | 300 | Either 'youtube', 'whisper' or 'deepgram'. |
312 | 301 | Default is 'youtube'. |
313 | 302 | * If 'youtube': will take the youtube transcripts as text content. |
314 | | - * If 'whisper': DocToolsLLM will download |
| 303 | + * If 'whisper': WinstonDoc will download |
315 | 304 | the audio from the youtube link, and whisper will be used to turn the audio into text. whisper_prompt and whisper_lang will be used if set. |
316 | 305 | * If 'deepgram': will download |
317 | 306 | the audio from the youtube link, and deepgram will be used to turn the audio into text. `--deepgram_kwargs` will be used if set. |
318 | 307 |
|
319 | 308 | * `--include`: str |
320 | | - * Only active if `--filetype` is one of 'json_entries', 'recursive_paths', |
321 | | - 'link_file', 'youtube_playlist'. |
| 309 | + * Only active if `--filetype` is 'recursive_paths'. |
322 | 310 | `--include` can be a list of regex that must be present in the |
323 | 311 | document PATH (not content!) |
324 | 312 | `--exclude` can be a list of regex that if present in the PATH |
|
329 | 317 |
|
330 | 318 | # Other specific arguments |
331 | 319 |
|
332 | | -* `--out_file`: str, default `None` |
333 | | - * If doctools must create a summary, if out_file given the summary will |
| 320 | +* `--out_file`: str or PosixPath, default `None` |
| 321 | + * If WinstonDoc must create a summary and out_file is given, the summary will |
334 | 322 | be written to this file. Note that the file is not erased and |
335 | | - Doctools will simply append to it. |
| 323 | + WinstonDoc will simply append to it. |
336 | 324 | * If `--summary_n_recursion` is used, additional files will be |
337 | 325 | created with the name `{out_file}.n.md` with n being the n-1th recursive |
338 | 326 | summary. |
|
379 | 367 | each document instead of the metadata. |
380 | 368 | Syntax: `[+-]your_regex` |
381 | 369 | Example: |
382 | | - * Keep only the document that contain `doctools` |
383 | | - `--filter_content=+.*doctools.*` |
384 | | - * Discard the document that contain `DOCTOOLS` |
385 | | - `--filter_content=-.*DOCTOOLS.*` |
| 370 | + * Keep only the documents that contain `winstondoc` |
| 371 | + `--filter_content=+.*winstondoc.*` |
| 372 | + * Discard the documents that contain `winstondoc` |
| 373 | + `--filter_content=-.*winstondoc.*` |
386 | 374 |
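To make the `[+-]your_regex` syntax of `--filter_content` concrete, here is a minimal Python sketch of how such a filter could behave. It is an illustration only, not WinstonDoc's actual implementation: the function name `apply_content_filter` is hypothetical, and the use of `re.search` with case-sensitive matching is an assumption.

```python
import re


def apply_content_filter(documents, content_filter):
    """Hypothetical helper: filter document texts with a `[+-]regex` rule.

    A leading `+` keeps only the documents whose content matches the regex,
    a leading `-` discards the documents whose content matches it.
    """
    if not content_filter or content_filter[0] not in ("+", "-"):
        raise ValueError("filter must start with '+' or '-'")
    sign, pattern = content_filter[0], content_filter[1:]
    # Assumption: case-sensitive search anywhere in the content.
    regex = re.compile(pattern)
    keep_on_match = sign == "+"
    return [doc for doc in documents if bool(regex.search(doc)) == keep_on_match]


docs = ["intro to winstondoc usage", "unrelated notes"]
kept = apply_content_filter(docs, "+.*winstondoc.*")
discarded = apply_content_filter(docs, "-.*winstondoc.*")
```

In this sketch `kept` holds only the first document and `discarded` holds only the second, mirroring the two examples listed under `--filter_content`.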
|
387 | 375 | * `--embed_instruct`: bool, default `None` |
388 | 376 | * when loading an embedding model using HuggingFace or |
|
436 | 424 |
|
437 | 425 | # Runtime flags |
438 | 426 |
|
439 | | -* `DOCTOOLS_TYPECHECKING` |
| 427 | +* `WINSTONDOC_TYPECHECKING` |
440 | 428 | * Setting for runtime type checking. Default value is `warn`. |
441 | 429 | The typing is checked using [beartype](https://beartype.readthedocs.io/en/latest/) so it shouldn't slow down the runtime. Possible values: |
442 | 430 | * `disabled`: disable typechecking. |
|