|
| 1 | +--- |
| 2 | +title: Named entity recognition (NER) |
| 3 | +--- |
| 4 | + |
| 5 | +After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER). |
| 6 | + |
| 7 | +NER is performed by one of the following models: |
| 8 | + |
| 9 | +- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI. |
| 10 | +- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic. |
| 11 | + |
| 12 | +The following example shows the entities and types that GPT-4o recognized in a single element. Note the `entities` field that is added to the element's `metadata`. |
| 13 | + |
| 14 | +```json |
| 15 | +{ |
| 16 | + "type": "CompositeElement", |
| 17 | + "element_id": "bc8333ea0d374670ff0bd03c6126e70d", |
| 18 | + "text": "SECTION. 3\n\nThe Senate of the United States shall be composed of two Senators from each State, |
| 19 | + [chosen by the Legislature there- of,]* for six Years; and each Senator shall have one Vote.\n\n |
| 20 | + Immediately after they shall be assembled in Consequence of the first Election, they shall be divided |
| 21 | + as equally as may be into three Classes. The Seats of the Senators of the first Class shall be vacated |
| 22 | + at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of |
| 23 | + the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year; |
| 24 | + [and if Vacan- cies happen by Resignation, or otherwise, during the Recess of the Legislature of any |
| 25 | + State, the Executive thereof may make temporary Appointments until the next Meeting of the Legislature, |
| 26 | + which shall then fill such Vacancies.]*\n\nC O N S T I T U T I O N O F T H E U N I T E D S T A T E S", |
| 27 | + "metadata": { |
| 28 | + "filename": "constitution.pdf", |
| 29 | + "filetype": "application/pdf", |
| 30 | + "languages": [ |
| 31 | + "eng" |
| 32 | + ], |
| 33 | + "page_number": 2, |
| 34 | + "entities": [ |
| 35 | + { |
| 36 | + "entity": "Senate", |
| 37 | + "type": "ORGANIZATION" |
| 38 | + }, |
| 39 | + { |
| 40 | + "entity": "United States", |
| 41 | + "type": "LOCATION" |
| 42 | + }, |
| 43 | + { |
| 44 | + "entity": "Senators", |
| 45 | + "type": "PERSON" |
| 46 | + }, |
| 47 | + { |
| 48 | + "entity": "State", |
| 49 | + "type": "LOCATION" |
| 50 | + }, |
| 51 | + { |
| 52 | + "entity": "Legislature", |
| 53 | + "type": "ORGANIZATION" |
| 54 | + }, |
| 55 | + { |
| 56 | + "entity": "six Years", |
| 57 | + "type": "DATE" |
| 58 | + }, |
| 59 | + { |
| 60 | + "entity": "first Election", |
| 61 | + "type": "EVENT" |
| 62 | + }, |
| 63 | + { |
| 64 | + "entity": "second Year", |
| 65 | + "type": "DATE" |
| 66 | + }, |
| 67 | + { |
| 68 | + "entity": "fourth Year", |
| 69 | + "type": "DATE" |
| 70 | + }, |
| 71 | + { |
| 72 | + "entity": "sixth Year", |
| 73 | + "type": "DATE" |
| 74 | + }, |
| 75 | + { |
| 76 | + "entity": "Executive", |
| 77 | + "type": "PERSON" |
| 78 | + }, |
| 79 | + { |
| 80 | + "entity": "C O N S T I T U T I O N O F T H E U N I T E D S T A T E S", |
| 81 | + "type": "ARTIFACT" |
| 82 | + } |
| 83 | + ] |
| 84 | + } |
| 85 | +} |
| 86 | +``` |
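
As a minimal sketch of how downstream code might consume this output, the following Python groups the recognized entities by type. It assumes the element has already been loaded as JSON with the shape shown above; the `text` value and the entity list are shortened here for illustration, and `group_entities_by_type` is a hypothetical helper name, not part of any Unstructured library. The fallback to an empty list for elements without an `entities` field is also an assumption.

```python
import json
from collections import defaultdict

# Abbreviated copy of the element shown above. The "text" value and the
# entity list are shortened purely for illustration.
element_json = """
{
  "type": "CompositeElement",
  "element_id": "bc8333ea0d374670ff0bd03c6126e70d",
  "text": "SECTION. 3 (shortened)",
  "metadata": {
    "filename": "constitution.pdf",
    "entities": [
      {"entity": "Senate", "type": "ORGANIZATION"},
      {"entity": "United States", "type": "LOCATION"},
      {"entity": "Senators", "type": "PERSON"},
      {"entity": "six Years", "type": "DATE"}
    ]
  }
}
"""


def group_entities_by_type(element):
    """Return a dict mapping each entity type to the entity strings found.

    Hypothetical helper: assumes entities live under metadata.entities and
    treats a missing field as "no entities" (an assumption, not documented
    behavior).
    """
    grouped = defaultdict(list)
    for item in element.get("metadata", {}).get("entities", []):
        grouped[item["type"]].append(item["entity"])
    return dict(grouped)


element = json.loads(element_json)
entities_by_type = group_entities_by_type(element)
for entity_type, names in entities_by_type.items():
    print(f"{entity_type}: {', '.join(names)}")
```

Grouping by `type` makes it straightforward to filter for, say, only `ORGANIZATION` entities before indexing or search.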
| 87 | + |
| 88 | +# Generate a list of entities and their types |
| 89 | + |
| 90 | +To generate a list of recognized entities and their types, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following: |
| 91 | + |
| 92 | +<Note> |
| 93 | + You can change a workflow's NER settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. |
| 94 | + |
| 95 | + Entities are recognized only when the **Partitioner** node in the workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning). |
| 96 | +</Note> |
| 97 | + |
| 98 | +1. Select **Named Entity Recognition (NER)**. By default, OpenAI's GPT-4o will follow a default set of instructions (called a _prompt_) to perform NER using a set of predefined entity types. |
| 99 | +2. To use Anthropic's Claude 3.5 Sonnet to perform NER instead, or to customize the prompt, click **Edit**. |
| 100 | +3. To switch to using Anthropic's Claude 3.5 Sonnet, click **Anthropic (Claude 3.5 Sonnet)**. |
| 101 | +4. To experiment with running the default prompt against some sample data, click **Run Prompt**. The selected **Model** uses the |
| 102 | + **Prompt** to run NER on the **Input sample** and shows the results in the **Output**. Look specifically at the `response_json` field for the |
| 103 | + entities that were recognized and their types. |
| 104 | +5. To customize the prompt, change the contents of **Prompt**. |
| 105 | + |
| 106 | + <Note> |
| 107 | + For best results, Unstructured strongly recommends that you limit your changes to the following portions of the default prompt: |
| 108 | + |
| 109 | + - Adding, renaming, or deleting items in the list of predefined types (such as `PERSON`, `ORGANIZATION`, `LOCATION`, and so on). |
| 110 | + - As needed, adding any clarifying instructions only between these two lines: |
| 111 | + |
| 112 | + ```text |
| 113 | + ... |
| 114 | + Provide the entities and their corresponding types as a structured JSON response. |
| 115 | + |
| 116 | + (Add any clarifying instructions here only.) |
| 117 | + |
| 118 | + [START OF TEXT] |
| 119 | + ... |
| 120 | + ``` |
| 121 | + |
| 122 | + Changing any other portion of the default prompt could produce unexpected results. |
| 123 | + </Note> |
| 124 | + |
| 125 | +6. To experiment with different data, change the contents of **Input sample**. For best results, Unstructured strongly recommends that the JSON structure in **Input sample** be preserved. |
| 126 | +7. When you are satisfied with the **Model** and **Prompt** that you want to use, click **Save**. |
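
If you keep a copy of the prompt as text outside the UI, the placement rule from step 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: `add_clarifying_instructions` is a hypothetical helper, the sample prompt is a stand-in (the real default prompt is not reproduced here), and the example instruction is invented. The only strings taken from this page are the two marker lines.

```python
# The two marker lines quoted in step 5.
RESPONSE_LINE = (
    "Provide the entities and their corresponding types "
    "as a structured JSON response."
)
START_MARKER = "[START OF TEXT]"


def add_clarifying_instructions(prompt: str, instructions: str) -> str:
    """Insert instructions between the response line and the start marker.

    Hypothetical helper. Raises ValueError if either marker line is missing
    or the markers are out of order, so the rest of the prompt is never
    touched by accident.
    """
    lines = prompt.splitlines()
    try:
        response_idx = next(
            i for i, line in enumerate(lines) if line.strip() == RESPONSE_LINE
        )
        start_idx = next(
            i
            for i, line in enumerate(lines)
            if i > response_idx and line.strip() == START_MARKER
        )
    except StopIteration:
        raise ValueError("prompt does not contain both marker lines") from None

    head = lines[: response_idx + 1]
    tail = lines[start_idx:]
    # Keep any non-blank lines already sitting between the two markers.
    between = [line for line in lines[response_idx + 1 : start_idx] if line.strip()]
    middle = between + [instructions.strip()]
    return "\n".join(head + [""] + middle + [""] + tail)


# Stand-in prompt: everything except the two marker lines is a placeholder.
sample_prompt = "\n".join(
    [
        "(earlier part of the prompt, elided)",
        RESPONSE_LINE,
        "",
        START_MARKER,
        "(remainder of the prompt, elided)",
    ]
)

# Invented example instruction, for illustration only.
updated = add_clarifying_instructions(sample_prompt, "Treat job titles as PERSON.")
print(updated)
```

The result can then be pasted into **Prompt** and checked with **Run Prompt** before you click **Save**.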
| 127 | + |