Commit 08a048d

Merge pull request #86 from haesleinhuepf/wordclouds
Wordclouds
2 parents 6a34d9d + dcbc269 commit 08a048d

8 files changed: 889 additions and 3 deletions

README.md

Lines changed: 7 additions & 0 deletions
@@ -305,6 +305,13 @@ stackview.sliceplot(df, images, column_x="UMAP0", column_y="UMAP1")
 
 ![](https://raw.githubusercontent.com/haesleinhuepf/stackview/main/docs/images/sliceplot.gif)
 
+
+### Wordcloudplot
+
+If you have a pandas DataFrame with a column containing text and additional numeric columns related to that text, you can use the `wordcloudplot` function to visualize selected texts as a word cloud.
+
+![img.png](https://raw.githubusercontent.com/haesleinhuepf/stackview/main/docs/images/wordcloudplot.png)
+
 ### Interact
 
 Exploration of the parameter space of image processing functions is available using `interact`:
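Referring back to the `### Wordcloudplot` README section added in this hunk, here is a minimal usage sketch. The call follows the signature of `wordcloudplot` in `stackview/_wordcloudplot.py` from this commit; the toy DataFrame, its values, and the helper name `show_wordcloudplot` are invented for illustration. It assumes a Jupyter environment with stackview 0.12.1 or later installed.

```python
import pandas as pd

# Hypothetical toy data: one text column plus two numeric columns
# (for example 2D embedding coordinates). All values are invented.
df = pd.DataFrame({
    "sentence": [
        "Image segmentation separates objects from background.",
        "Deconvolution can sharpen blurry microscopy images.",
        "Word clouds summarize frequent terms in a text.",
    ],
    "UMAP0": [0.1, 0.4, 2.3],
    "UMAP1": [1.0, 1.2, -0.7],
})


def show_wordcloudplot(data):
    """Return the interactive widget.

    stackview is imported lazily so that building the DataFrame above
    also works in environments where stackview is not installed.
    """
    import stackview
    return stackview.wordcloudplot(
        data, column_text="sentence", column_x="UMAP0", column_y="UMAP1"
    )

# In a notebook cell, evaluate: show_wordcloudplot(df)
```

Selecting points in the scatter plot should then regenerate the word cloud from the selected rows of the `sentence` column.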

docs/data/sentence_embeddings.csv

Lines changed: 598 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+Extracted from: https://arxiv.org/abs/2204.07547 licensed CC-BY 4.0 by:
+Robert Haase, Elnaz Fazeli, David Legland, Michael Doube, Siân Culley, Ilya Belevich, Eija Jokitalo, Martin Schorb, Anna Klemm, Christian Tischer
+
+Used embedding: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1

docs/images/wordcloudplot.png

869 KB

docs/wordcloudplots.ipynb

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "20a04b0c-06cb-4f29-8381-a6a0d4a20ccd",
+   "metadata": {},
+   "source": [
+    "# Wordcloud plots\n",
+    "When exploring text data, it can help to visualize each text as a data point and interact with the resulting plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "8d301701-368f-4365-b555-dae6f06d8bea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import stackview\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a991df41-86ab-4188-af47-e6e0cf6d7b32",
+   "metadata": {},
+   "source": [
+    "Here we reuse a list of sentences and a [UMAP](https://umap-learn.readthedocs.io/en/latest/) computed from their text embeddings. The sentences are taken from [Haase et al. 2022](https://arxiv.org/abs/2204.07547), which is licensed [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "bb75ed7e-aa83-4015-a74b-8dfb9405ecf1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Unnamed: 0</th>\n",
+       "      <th>sentence</th>\n",
+       "      <th>UMAP0</th>\n",
+       "      <th>UMAP1</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0</td>\n",
+       "      <td>A Hitchhiker’s Guide through the Bio-image Ana...</td>\n",
+       "      <td>-2.863276</td>\n",
+       "      <td>8.680281</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1</td>\n",
+       "      <td>Modern research in the life sciences is unthin...</td>\n",
+       "      <td>-3.731295</td>\n",
+       "      <td>7.875060</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2</td>\n",
+       "      <td>In the past decade, we observed a dramatic inc...</td>\n",
+       "      <td>-4.748690</td>\n",
+       "      <td>6.128065</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>3</td>\n",
+       "      <td>As it is increasingly difficult to keep track ...</td>\n",
+       "      <td>-4.183692</td>\n",
+       "      <td>6.847530</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>4</td>\n",
+       "      <td>We give guidance on which aspects to consider ...</td>\n",
+       "      <td>-4.912832</td>\n",
+       "      <td>6.691180</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   Unnamed: 0                                           sentence     UMAP0  \\\n",
+       "0           0  A Hitchhiker’s Guide through the Bio-image Ana... -2.863276   \n",
+       "1           1  Modern research in the life sciences is unthin... -3.731295   \n",
+       "2           2  In the past decade, we observed a dramatic inc... -4.748690   \n",
+       "3           3  As it is increasingly difficult to keep track ... -4.183692   \n",
+       "4           4  We give guidance on which aspects to consider ... -4.912832   \n",
+       "\n",
+       "      UMAP1  \n",
+       "0  8.680281  \n",
+       "1  7.875060  \n",
+       "2  6.128065  \n",
+       "3  6.847530  \n",
+       "4  6.691180  "
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = pd.read_csv(\"data/sentence_embeddings.csv\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5a69daa-2282-47d6-a020-cf3f8a5539fe",
+   "metadata": {},
+   "source": [
+    "A word cloud plot is an interactive plot: you select texts in the scatter plot, and a word cloud is generated from your selection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "99720165-0a6c-4c0a-8922-e6350b5a70f3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b30463a9357f4846827c31acb06fc0bc",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "VBox(children=(HBox(children=(HBox(children=(VBox(children=(VBox(children=(HBox(children=(VBox(children=(Image…"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "stackview.wordcloudplot(df, column_text=\"sentence\", column_x=\"UMAP0\", column_y=\"UMAP1\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "39c71ab5-43f3-4768-9e99-973905082950",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
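The `Unnamed: 0` column in the `df.head()` output of the notebook is what pandas produces when a CSV's first header field is empty, which commonly happens when a DataFrame was saved together with its index. Whether `data/sentence_embeddings.csv` was written that way is an assumption. A minimal sketch, using a made-up in-memory CSV instead of the real file, of how `index_col=0` avoids the extra column:

```python
import io
import pandas as pd

# Made-up stand-in for a CSV that was saved together with its index;
# note the empty first header field.
csv_text = """,sentence,UMAP0,UMAP1
0,First example sentence.,-2.8,8.6
1,Second example sentence.,-3.7,7.8
"""

# Default read: the unnamed first column shows up as "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either form works with `wordcloudplot`, since only the text, x, and y columns are referenced by name.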

setup.py

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name="stackview",
-    version="0.12.0",
+    version="0.12.1",
     author="Robert Haase",
     author_email="robert.haase@uni-leipzig.de",
     description="Interactive image stack viewing in jupyter notebooks",
@@ -14,7 +14,7 @@
     url="https://github.com/haesleinhuepf/stackview/",
     packages=setuptools.find_packages(),
     include_package_data=True,
-    install_requires=["numpy!=1.19.4", "ipycanvas", "ipywidgets", "scikit-image", "ipyevents", "toolz", "matplotlib", "ipykernel", "imageio", "ipympl"],
+    install_requires=["numpy!=1.19.4", "ipycanvas", "ipywidgets", "scikit-image", "ipyevents", "toolz", "matplotlib", "ipykernel", "imageio", "ipympl", "wordcloud"],
     python_requires='>=3.6',
     classifiers=[
         "Programming Language :: Python :: 3",
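Because this hunk adds `wordcloud` to `install_requires`, an existing environment may need the new package before `stackview.wordcloudplot` can be used. A small sketch, using only the standard library, of how one might check for it; the helper name `missing_modules` is hypothetical and not part of stackview:

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# "wordcloud" is the dependency added to install_requires in this commit.
print(missing_modules(["wordcloud"]))
```

An empty list means the module is importable; otherwise installing or upgrading stackview with pip should pull in the missing dependency.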

stackview/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "0.12.0"
+__version__ = "0.12.1"
 
 from ._static_view import jupyter_displayable_output, insight
 from ._utilities import merge_rgb
@@ -22,5 +22,6 @@
 from ._grid import grid
 from ._clusterplot import clusterplot
 from ._sliceplot import sliceplot
+from ._wordcloudplot import wordcloudplot
 
 

stackview/_wordcloudplot.py

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+def wordcloudplot(df, column_x: str = "x", column_y: str = "y", column_text: str = "text",
+                  column_selection: str = "selection",
+                  figsize=(4, 4), markersize=4, width=400, height=400):
+    """
+    Visualizes a scatter plot of columns in a given dataframe next to a word cloud.
+    Per default, the dataframe should contain a column "text".
+
+    Parameters
+    ----------
+    df: pandas.DataFrame
+        The dataframe to plot
+    column_x: str, optional
+        The column to use for the x-axis
+    column_y: str, optional
+        The column to use for the y-axis
+    column_text: str, optional
+        The column containing the texts that make up the word cloud
+    column_selection: str, optional
+        The column to use for the selection
+    figsize: tuple, optional
+        The size of the scatter plot figure
+    markersize: int, optional
+        The size of the markers
+    width: int, optional
+        The width of the word cloud
+    height: int, optional
+        The height of the word cloud
+
+    Returns
+    -------
+    An ipywidgets widget
+    """
+    import numpy as np
+    from ._grid import grid
+    from ._curtain import curtain
+    from ._slice import slice
+    from ._scatterplot import scatterplot
+    import functools
+    from wordcloud import WordCloud
+
+    if column_selection in df.columns:
+        selected_texts = df[df[column_selection] == 1][column_text]  # was hard-coded to 'selection'
+        text = "\n".join(selected_texts)
+    else:
+        selected_texts = df[column_text]
+        text = "\n".join(selected_texts)
+
+    wordcloud = WordCloud(colormap="twilight", background_color="white", width=width, height=height).generate(text)
+    image = wordcloud.to_image()
+    selected_image = np.array(image)
+
+    image_display = slice(selected_image)
+
+    def update(selection, df, column_text, selected_image, widget):
+        selected_texts = df[column_text][list(selection)]
+        text = "\n".join(selected_texts)
+
+        if len(text) == 0:
+            text = "empty wordcloud"
+
+        wordcloud = WordCloud(colormap="twilight", background_color="white", width=width, height=height).generate(text)
+        image = wordcloud.to_image()
+        temp = np.array(image)
+
+        # overwrite the pixels in the given image
+        np.copyto(selected_image, temp.astype(selected_image.dtype))
+
+        # redraw the visualization
+        widget.update()
+
+    update_selection = functools.partial(update, df=df, column_text=column_text, selected_image=selected_image,
+                                         widget=image_display)
+
+    scatterplot = scatterplot(df, column_x, column_y, column_selection, figsize=figsize,
+                              selection_changed_callback=update_selection, markersize=markersize)
+
+    return grid([[
+        image_display,
+        scatterplot,
+
+    ]])
+
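The `update` callback in `_wordcloudplot.py` relies on two techniques: `functools.partial` binds every argument except the selection, and `np.copyto` overwrites the displayed array in place so the viewer's existing reference sees the new pixels. The following is a sketch of that pattern in isolation, not stackview code: `FakeViewer`, `shared_image`, and the gray-level "rendering" are hypothetical stand-ins for the image widget, `selected_image`, and the word cloud rendering.

```python
import functools
import numpy as np

# Buffer shared with a (hypothetical) viewer; it stands in for `selected_image`.
shared_image = np.zeros((2, 2, 3), dtype=np.uint8)


class FakeViewer:
    """Stand-in for the image widget: keeps a reference and counts redraws."""

    def __init__(self, image):
        self.image = image  # same array object, not a copy
        self.redraws = 0

    def update(self):
        self.redraws += 1


def update(selection, image, widget):
    # "Render" something from the selection (here: a constant gray level).
    new_pixels = np.full(image.shape, 10 * sum(selection), dtype=np.int64)
    # Overwrite the existing buffer in place so the viewer's reference stays valid.
    np.copyto(image, new_pixels.astype(image.dtype))
    # Ask the viewer to redraw with the new pixel values.
    widget.update()


viewer = FakeViewer(shared_image)

# Bind everything except `selection`, as a selection-changed callback needs.
on_selection_changed = functools.partial(update, image=shared_image, widget=viewer)

on_selection_changed([True, False, True])
```

Rebinding the name (`image = new_pixels`) instead of copying in place would leave the viewer holding the old array, which is presumably why the original code uses `np.copyto`.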
