I built a centroid filter to use in place of a PageRank/TextRank algorithm for organising documents into order. I use it to create statistical summaries of news articles before sending those to a language model for abstractive summarisation. Combining the centroid filter with sorting by date produces strikingly coherent summaries when you group the sentences of various articles into a single corpus and then cluster them, because the content is preserved in order and only the outliers fall away.
Tangent aside, the CentroidFilter function does in practice exactly what your get_representative_docs() does, but for some reason I can't get representative docs from KMeans clusters, so the method must differ. I am proposing the code below as a filter function that you could add as a utility; it can also return representative docs for KMeans clusters if the DataFrame that you pass in is first filtered to the queried cluster.
from typing import List, Literal

from torch import Tensor
import pandas as pd
import numpy as np


def CentroidFilter(
    df: pd.DataFrame,
    num_results: int | float | Literal["xx.x%"] = 3,
    # Target number of results, float or string for percentage
    text_col: str = "text",
    embedding_col: str = "embeddings",
) -> List[str]:
    #* Find centroid
    embeddings_list: List[Tensor]  # Type hint <- SentenceTransformer embedding outputs
    embeddings_list = df[embedding_col].to_list()
    # Stack into a (num_docs, dim) array so the centroid is a 1-D vector
    embeddings = np.stack([np.asarray(e) for e in embeddings_list])
    centroid = np.average(embeddings, axis=0)
    text_list = df[text_col].to_list()
    num_docs = len(text_list)
    #* If number is a percent (string or decimal)
    if isinstance(num_results, str):
        num_results = round(num_docs * float(num_results[:-1].strip()) / 100)
    elif 0 < num_results < 1:
        num_results = round(num_docs * num_results)
    num_results = int(num_results)  # Slice bounds must be integers
    #* Score each document by dot-product similarity to the centroid
    scores = embeddings @ centroid
    #* Extract sentences
    top_doc_indices = sorted(
        range(num_docs),
        key=lambda i: scores[i],
        reverse=True,
    )[:num_results]
    #* Preserve sentence order in output
    doc_list = [text_list[i] for i in sorted(top_doc_indices)]
    # The above can be adapted if desired return type is DataFrame
    #? makes a difference when documents are sorted by relation, such as time or hierarchical cluster position
    return doc_list
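To illustrate the per-cluster use mentioned above, here is a minimal sketch. It assumes the cluster labels (from KMeans or anything else) are already stored in a column that I have called `cluster`; that column name and the helper `representative_docs` are hypothetical and not part of any library. It uses the same dot-product scoring against the centroid and keeps the selected rows in their original order.

```python
import numpy as np
import pandas as pd


def representative_docs(df, cluster_id, n=2, text_col="text",
                        embedding_col="embeddings", cluster_col="cluster"):
    """Return the n docs of one cluster that score highest against its centroid."""
    # Keep only the rows of the queried cluster (labels assumed precomputed).
    subset = df[df[cluster_col] == cluster_id]
    embeddings = np.stack([np.asarray(e, dtype=float) for e in subset[embedding_col]])
    centroid = embeddings.mean(axis=0)
    # Dot-product similarity of every document to the cluster centroid.
    scores = embeddings @ centroid
    # Take the n best-scoring rows, then re-sort so the input order is preserved.
    top = np.sort(np.argsort(-scores, kind="stable")[:n])
    return subset[text_col].iloc[top].tolist()


# Toy data: two-dimensional "embeddings" and hand-assigned cluster labels.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "embeddings": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1], [0.1, 0.9]],
    "cluster": [0, 0, 1, 0, 1],
})
result = representative_docs(df, cluster_id=0, n=2)
print(result)  # ['a', 'd']
```

Because the score is a raw dot product, embeddings with a larger norm are favoured; normalising the embeddings first would turn it into cosine similarity, if that is what you want.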