To learn more about WhyHow and our projects, visit our [website](https://whyhow.ai/).
## Table of Contents
@@ -102,11 +100,13 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be
4. **Install the dependencies:**
For basic installation:
+
```sh
pip install .
```
For installation with development tools:
+
```sh
pip install .[dev]
```
@@ -180,6 +180,7 @@ To set up the project for development:
black .
isort .
```
+
---
## Features
@@ -189,12 +190,12 @@ To set up the project for development:
- **Chunk Linking** - Link raw source text chunks to the answers for traceability and provenance.
- **Extract with natural language** - Use natural language queries to extract structured data from unstructured documents.
- **Customizable extraction rules** - Define rules to guide the extraction process and ensure data quality.
-- **Custom formatting** - Control the output format of your extracted data.
+- **Custom formatting** - Control the output format of your extracted data. Knowledge Table currently supports text, list of text, number, list of numbers, and boolean formats.
- **Filtering** - Filter documents based on metadata or extracted data.
- **Exporting as CSV or Triples** - Download extracted data as CSV or graph triples.
- **Chained extraction** - Reference previous columns in your extraction questions using `@`, e.g. "What are the treatments for `@disease`?".
- **Split Cell Into Rows** - Split a single cell containing a List of Numbers or List of Values into individual rows to enable more complex Chained Extraction.
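
As a rough illustration of how Chained Extraction and Split Cell Into Rows could fit together, here is a minimal Python sketch. The helper names `resolve_references` and `split_cell_into_rows`, the row layout, and the sample values are all hypothetical and are not taken from the Knowledge Table codebase.

```python
import re


def resolve_references(question: str, row: dict[str, str]) -> str:
    """Replace each @name (optionally wrapped in backticks) with row[name]."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown references untouched rather than guessing a value.
        return row.get(name, match.group(0))

    return re.sub(r"`?@(\w+)`?", substitute, question)


def split_cell_into_rows(row: dict, column: str) -> list[dict]:
    """Turn one row with a list-valued cell into one row per list item."""
    return [{**row, column: item} for item in row[column]]


# Hypothetical row: one document whose "disease" column holds a list.
row = {"document": "report.pdf", "disease": ["asthma", "eczema"]}
for single in split_cell_into_rows(row, "disease"):
    print(resolve_references("What are the treatments for `@disease`?", single))
```

Splitting first means each follow-up question refers to exactly one value, which is one plausible reason the two features are described together.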
-
+
---
## Concepts
@@ -211,6 +212,15 @@ Each **document** is an unstructured data source (e.g., a contract, article, or
A **Question** is the core mechanism for guiding extraction. It defines what data you want to extract from a document.
+### Rule
+
+A **Rule** guides the LLM's extraction. You can add rules at the column level or at the global level. Currently, the following rule types are supported:
+
+- **May Return** rules give the LLM examples of acceptable answers to guide the extraction. They are a good way to tell the LLM what kinds of values to look for.
+- **Must Return** rules give the LLM an exhaustive list of answers that it is allowed to return. They act as guardrails that ensure only certain terms are returned.
+- **Allowed # of Responses** rules provide guardrails when there may be a range of potential ‘grey-area’ answers and you want to restrict the output to a certain number of the top responses.
+- **Resolve Entity** rules allow you to resolve values to a specific entity. This is useful for ensuring output conforms to a specific entity type. For example, you can write rules that ensure "blackrock", "Blackrock, Inc.", and "Blackrock Corporation" all resolve to the same entity, "Blackrock".
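
To make the four rule types more concrete, the following Python sketch applies them as a post-processing step over candidate answers. This is only one possible reading: the text above does not say whether Knowledge Table enforces rules in the prompt, after the LLM responds, or both. The `ColumnRules` class, its field names, and the "Vanguard" and "Fidelity" sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnRules:
    # Hypothetical field names mirroring the four rule types described above.
    may_return: list[str] = field(default_factory=list)   # example answers; prompt hints only
    must_return: list[str] = field(default_factory=list)  # exhaustive allow-list
    max_responses: Optional[int] = None                   # "Allowed # of Responses"
    resolve_entity: dict[str, str] = field(default_factory=dict)  # alias -> canonical entity


def apply_rules(answers: list[str], rules: ColumnRules) -> list[str]:
    """Post-process candidate answers, assumed to be ordered best-first."""
    aliases = {alias.lower(): entity for alias, entity in rules.resolve_entity.items()}
    resolved = [aliases.get(answer.lower(), answer) for answer in answers]
    if rules.must_return:
        allowed = {value.lower() for value in rules.must_return}
        resolved = [answer for answer in resolved if answer.lower() in allowed]
    # Entity resolution can create duplicates; keep the first occurrence of each.
    unique = list(dict.fromkeys(resolved))
    if rules.max_responses is not None:
        unique = unique[: rules.max_responses]
    return unique


rules = ColumnRules(
    must_return=["Blackrock", "Vanguard"],
    max_responses=2,
    resolve_entity={
        "blackrock": "Blackrock",
        "Blackrock, Inc.": "Blackrock",
        "Blackrock Corporation": "Blackrock",
    },
)
print(apply_rules(["Blackrock, Inc.", "blackrock", "Fidelity", "Vanguard"], rules))
# prints ['Blackrock', 'Vanguard']
```

In this sketch `may_return` is deliberately not enforced, since the description above presents it as guidance for the LLM rather than a constraint on the output.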
+
---
## Practical Usage
@@ -225,6 +235,7 @@ Once you've set up your questions, rules, and documents, the Knowledge Table pro
- **Metadata Generation**: Classify and tag information about your documents and files by running targeted questions against the files (e.g. "What project is this email thread about?").
---
+
## Export to Triples
To create the schema and the triples, we use an LLM that considers the Entity Type of the column, the question that was used to generate the cells, and the values themselves. The document name is inserted as a node property. The vector chunk IDs are also included in the JSON file of the triples and tied to the triples created.
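
The exact JSON layout of the export is not shown here, so the following Python sketch only illustrates the idea of a triple that carries the document name as a node property and keeps its source chunk IDs. Every key name, the `build_triple` helper, and the sample values are hypothetical, and the schema (entity types and relation name) is passed in by hand rather than produced by an LLM.

```python
import json


def build_triple(head: str, head_type: str, relation: str, tail: str, tail_type: str,
                 document: str, chunk_ids: list[str]) -> dict:
    """Assemble one triple record; every key name here is an assumption."""
    return {
        "head": {"label": head_type, "name": head, "properties": {"document": document}},
        "relation": relation,
        "tail": {"label": tail_type, "name": tail, "properties": {"document": document}},
        # Chunk IDs tie the triple back to the source text it was extracted from.
        "chunk_ids": chunk_ids,
    }


triple = build_triple(
    "asthma", "Disease", "treated_by", "albuterol", "Treatment",
    document="report.pdf", chunk_ids=["chunk-12"],
)
print(json.dumps(triple, indent=2))
```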
@@ -263,8 +274,7 @@ To use the Unstructured API integration:
When the `UNSTRUCTURED_API_KEY` is set, Knowledge Table will automatically use the Unstructured API for document processing. If the key is not set or if there's an issue with the Unstructured API, the system will fall back to the default document loaders.
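
The fallback behaviour described above can be sketched in Python as follows. Only the `UNSTRUCTURED_API_KEY` variable name comes from this README; `load_document` and the two loader callables are hypothetical stand-ins, not the project's actual functions.

```python
import os
from typing import Callable, Mapping


def load_document(
    path: str,
    unstructured_loader: Callable[[str, str], str],
    default_loader: Callable[[str], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer the Unstructured API when a key is set; otherwise use the default loader."""
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if api_key:
        try:
            return unstructured_loader(path, api_key)
        except Exception:
            # Any problem with the API falls through to the default loader.
            pass
    return default_loader(path)
```

Passing the loaders and the environment as parameters keeps the sketch testable without network access.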
-Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
----
+## Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.