Centroid Filter and representative docs #1381

NewsSoup · 2023-06-30T10:15:53Z


I built a centroid filter as an alternative to a PageRank/TextRank algorithm for ranking documents. I use it to create statistical summaries of news articles before sending them to a language model for abstractive summarisation. Combining the centroid filter with sorting by date produces strikingly coherent summaries when you group the sentences of several articles into a single corpus and then cluster them: the content is preserved in order, and only the outliers fall away.
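As a rough illustration of the date-sorted variant, here is a minimal self-contained sketch on toy data. The column names, the 2-d embeddings, and the helper name `centroid_filter_by_date` are made up for the example; real embeddings would come from a sentence-embedding model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy corpus: one row per sentence, deliberately out of date order.
# Sentence "B" points in a different direction from the rest, so it is the outlier.
sentences = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-03", "2023-06-01", "2023-06-04", "2023-06-02"]),
    "text": ["C", "A", "D", "B"],
    "embeddings": [
        np.array([0.8, 0.2]),
        np.array([0.9, 0.1]),
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
    ],
})


def centroid_filter_by_date(df: pd.DataFrame, k: int) -> list:
    """Keep the k sentences closest to the centroid, returned in date order."""
    ordered = df.sort_values("date").reset_index(drop=True)
    emb = np.vstack(ordered["embeddings"].to_list())  # shape: (n_sentences, dim)
    centroid = emb.mean(axis=0)                       # shape: (dim,)
    scores = emb @ centroid                           # dot-product similarity
    top = np.argsort(-scores)[:k]                     # k highest-scoring positions
    keep = np.sort(top)                               # restore chronological order
    return ordered["text"].iloc[keep].to_list()


print(centroid_filter_by_date(sentences, 3))  # ['A', 'C', 'D']
```

The outlier "B" drops out, and the remaining sentences come back in chronological order rather than in score order.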

Tangent aside, CentroidFilter does in practice what your get_representative_docs() does. For some reason, though, I can't get representative docs from KMeans clusters, so the methods must differ. I am proposing the code below as a filter function that you could add as a utility. It can also return representative docs for a KMeans cluster if the DataFrame you pass in is first filtered to the queried cluster.

from typing import List, Literal
from torch import Tensor
import pandas as pd
import numpy as np



def CentroidFilter(
    df: pd.DataFrame,
    num_results: int|float|str = 3, 
        # Target number of results: an int count, a float fraction in (0, 1),
        # or a percentage string such as "25%"
    text_col: str = "text", 
    embedding_col: str = "embeddings", 
    ) -> List[str]:


    #* Find centroid (mean embedding across all documents)
    embeddings_list: List[Tensor] # Type hint <- SentenceTransformer embedding outputs
    embeddings_list = df[embedding_col].to_list()
    centroid = np.mean(np.asarray(embeddings_list), axis=0)  # shape: (embedding_dim,)

    text_list = df[text_col].to_list()
    num_docs = len(text_list)

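    #* If number is a percentage (string such as "25%") or a fraction (float between 0 and 1)
    #  Convert to int: round(x, 0) returns a float, which cannot be used as a slice bound
    if isinstance(num_results, str):
        num_results = int(round(num_docs * float(num_results.strip().rstrip("%")) / 100))
    elif 0 < num_results < 1:
        num_results = int(round(num_docs * num_results))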


    #* Score each document by dot-product similarity to the centroid (higher = closer)
    scores = [
        np.dot(centroid, embeddings_list[i])
        for i in range(num_docs)
    ]

    #* Extract sentences
    top_doc_indices = sorted(
        range(len(scores)), 
        key=lambda i: scores[i], 
        reverse=True,
        )[:num_results]

    #* Preserve sentence order in output
    doc_list = [text_list[i] for i in sorted(top_doc_indices)]
        # Sorting the selected indices restores the input order and avoids
        # returning extra rows when the same text appears more than once
        # The above can be adapted if desired return type is DataFrame
        #? makes a difference when documents are sorted by relation, such as time or hierarchical cluster position


    return doc_list
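
To get representative docs for one KMeans cluster, the idea is to filter the DataFrame to that cluster before scoring. Below is a minimal self-contained sketch, assuming a hypothetical `cluster` column that holds the labels (for example, from a fitted KMeans model's `labels_`). The helper name `representative_docs_for_cluster` and the toy embeddings are illustrative only and are not part of any library.

```python
import numpy as np
import pandas as pd


def representative_docs_for_cluster(
    df: pd.DataFrame,
    cluster_id: int,
    n: int = 2,
    cluster_col: str = "cluster",      # hypothetical column name
    text_col: str = "text",
    embedding_col: str = "embeddings",
) -> list:
    """Return the n docs of one cluster closest to that cluster's centroid."""
    subset = df[df[cluster_col] == cluster_id]
    emb = np.vstack(subset[embedding_col].to_list())  # shape: (n_docs, dim)
    centroid = emb.mean(axis=0)                       # centroid of this cluster only
    scores = emb @ centroid                           # dot-product similarity
    top = np.sort(np.argsort(-scores)[:n])            # best n, back in input order
    return subset[text_col].iloc[top].to_list()


# Toy data: cluster labels are hard-coded here instead of coming from KMeans.
docs = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "embeddings": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([0.9, 0.1]),
        np.array([0.1, 0.9]),
        np.array([0.5, 0.5]),
    ],
    "cluster": [0, 1, 0, 1, 0],
})

print(representative_docs_for_cluster(docs, cluster_id=0, n=2))  # ['a', 'c']
print(representative_docs_for_cluster(docs, cluster_id=1, n=1))  # ['b']
```

Because the centroid is computed from the filtered rows only, each cluster is scored against its own mean rather than the mean of the whole corpus.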
