I want to use ContextGem to semantically segment a piece of text #51

maxin9966 · 2025-07-30T03:57:54Z

Objective:

I want to use the ContextGem project to semantically segment a piece of text, preferably with structured output.

The core requirement is segmentation—for example, if the input is a book, the system should extract elements like the book title, table of contents, chapters, subsections, etc.

Text Definition:

The input text could be:

Web-scraped content
A book
An academic paper
Subtitles from videos or podcasts
A collection of articles from different online sources (potentially on disparate topics)

The goal is to semantically split such unstructured or mixed-content text into meaningful, coherent segments.

Problem:

Since the uploaded document content is unknown in advance, I need a generic semantic segmentation logic to preprocess the text. This will facilitate downstream tasks like summarization and aggregation for individual segments.

SergiiShcherbak · 2025-08-02T23:57:59Z

Maintainer

@maxin9966 ContextGem provides an Aspects API designed to extract text segments (topics, sections) from a document. Using it directly, however, requires knowing in advance which aspects to extract, since each Aspect instance requires a name and a description. If the aspects are unknown, you can first extract the section titles using a StringConcept, and then create Aspect instances from the extracted titles.

Quick example:

from contextgem import Aspect, Document, DocumentLLM, StringConcept


# Configure your LLM
llm = DocumentLLM(...)

# Create a document (article, book, contract, etc.)
document = Document(raw_text="document-text")

# Define a string concept that will hold extracted section titles
section_concept = StringConcept(
    name="Section and sub-section titles",
    description="Section and sub-section titles in the document",
)

# Extract the concept items
document.add_concepts([section_concept])
extracted_section_concept = llm.extract_concepts_from_document(document)[0]
extracted_section_titles = list(
    dict.fromkeys([i.value for i in extracted_section_concept.extracted_items])
)  # remove section title duplicates, if any

# Create aspects from extracted section titles
aspects = []
for section_title in extracted_section_titles:
    aspects.append(
        Aspect(
            name=section_title,
            description=f"Section or sub-section titled '{section_title}'",
        )
    )

# Extract the aspects
document.add_aspects(aspects)
extracted_aspects = llm.extract_aspects_from_document(document)
for aspect in extracted_aspects:
    print(aspect.name)
    for item in aspect.extracted_items:
        print("- ", item.value)
    print("-"*100)
    print()
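
Since the question asks for structured output, the printed results above could also be collected into plain records. The following is only a minimal sketch: the helper name `aspects_to_records` and the record keys `title` and `segments` are hypothetical, and the stand-in objects merely mimic the `name`, `extracted_items`, and `value` attributes used in the example above so that the snippet runs without an LLM.

```python
import json
from types import SimpleNamespace


def aspects_to_records(aspects):
    """Convert aspect-like objects into JSON-serializable records.

    Each object is expected to expose `name` and `extracted_items`,
    and each item a `value`, as in the printing loop above.
    """
    return [
        {
            "title": aspect.name,
            "segments": [item.value for item in aspect.extracted_items],
        }
        for aspect in aspects
    ]


# Stand-in data for illustration only; in practice, pass the list
# returned by llm.extract_aspects_from_document(document).
demo_aspects = [
    SimpleNamespace(
        name="Introduction",
        extracted_items=[
            SimpleNamespace(value="First paragraph."),
            SimpleNamespace(value="Second paragraph."),
        ],
    )
]

records = aspects_to_records(demo_aspects)
print(json.dumps(records, indent=2))
```

Records of this shape can then be passed one segment at a time to downstream summarization or aggregation steps.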

Since you are working with long documents (books, academic papers, etc.), you will need to configure the extraction parameters to account for the content length. Check out the Dealing with Long Documents guide. For instance, you will probably need to adjust the max_paragraphs_to_analyze_per_call and max_items_per_call parameters, and optionally enable concurrency with the use_concurrency flag.
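
As a rough illustration of the parameters just mentioned, the sketch below bundles them into keyword arguments. The helper name `long_document_params` and the numeric values are hypothetical placeholders, not recommendations, and it assumes the extraction methods accept these three names as keyword arguments; check the Dealing with Long Documents guide for the actual signatures and sensible values.

```python
def long_document_params(
    paragraphs_per_call: int = 50,
    items_per_call: int = 5,
    concurrent: bool = True,
) -> dict:
    """Bundle the long-document settings into keyword arguments.

    The defaults are illustrative placeholders only.
    """
    if paragraphs_per_call < 0 or items_per_call < 0:
        raise ValueError("per-call limits must be non-negative")
    return {
        "max_paragraphs_to_analyze_per_call": paragraphs_per_call,
        "max_items_per_call": items_per_call,
        "use_concurrency": concurrent,
    }


params = long_document_params()

# With a configured `llm` and `document` as in the example above
# (assumption: the method accepts these keyword arguments):
# extracted_aspects = llm.extract_aspects_from_document(document, **params)
```

Smaller per-call limits mean more LLM calls over less text each, which is why enabling concurrency is suggested alongside them.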

Hope this helps.
