* Remove text unit group_by_columns
* Semver
* Fix default token split test
* Fix models in config test samples
* Fix token length in context sort test
* Fix document sort
docs/index/default_dataflow.md (1 addition, 3 deletions)
@@ -59,9 +59,7 @@ flowchart TB
The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source references by extracted knowledge items, enabling breadcrumbs and provenance from concepts back to their original source text.

-The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
-
-The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).
+The chunk size (counted in tokens) is user-configurable. By default this is set to 1200 tokens. Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
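
For context on what token-based chunking at the new 1200-token default looks like, here is a minimal sketch. It is not the library's actual chunking code: the `tiktoken` encoding choice and the `overlap` value are illustrative assumptions.

```python
# Minimal sketch of token-based text chunking; not the project's actual implementation.
# Assumes tiktoken is installed; encoding name and overlap are illustrative choices.
import tiktoken


def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens with a sliding window."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    # Each window is chunk_size tokens; the last window may be shorter.
    return [enc.decode(tokens[i : i + chunk_size]) for i in range(0, len(tokens), step)]
```

With these settings, a 3,000-token document would yield three chunks of roughly 1200, 1200, and 800 tokens, which matches the note below that the final chunk's `n_tokens` is often shorter than `chunk_size`.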
docs/index/outputs.md (1 addition, 1 deletion)
@@ -102,7 +102,7 @@ List of all text chunks parsed from the input documents.
| ----------------- | ----- | ----------- |
| text | str | Raw full text of the chunk. |
| n_tokens | int | Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter. |
-| document_ids | str[] | List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents. |
+| document_id | str | ID of the document the chunk came from. |
| entity_ids | str[] | List of entities found in the text unit. |
| relationships_ids | str[] | List of relationships found in the text unit. |
| covariate_ids | str[] | Optional list of covariates found in the text unit. |
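
As a quick illustration of how a consumer of this table is affected by the change, here is a sketch using pandas. The output path and file name are assumptions about a default parquet output layout, not confirmed by this diff.

```python
# Sketch only: "output/text_units.parquet" is an assumed location for the text units table.
import pandas as pd

text_units = pd.read_parquet("output/text_units.parquet")

# Columns documented in the table above; after this change each chunk carries a single
# document_id string rather than a document_ids list.
print(text_units[["text", "n_tokens", "document_id", "entity_ids"]].head())

# With a scalar document_id, counting chunks per source document is a plain groupby,
# with no list explosion needed.
print(text_units.groupby("document_id").size().sort_values(ascending=False).head())
```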