Commit bcd99f8

Added JOSS paper
1 parent f667563 commit bcd99f8

6 files changed: +853 -1390 lines changed


paper/Reilly_SemanticDistance_JOSS.Rmd

Lines changed: 16 additions & 9 deletions
@@ -55,6 +55,10 @@ journal: JOSS

 ---

+```{r, include=FALSE}
+options(tinytex.verbose = TRUE)
+```
+

 # Summary

@@ -70,34 +74,34 @@ One of the specific things listeners do in the process of understanding a conver

 # Description

-The `SemanticDistance` package is available from:
+The `SemanticDistance` R package is available from:

 https://github.com/Reilly-ConceptsCognitionLab/SemanticDistance

-Consider, for example, a researcher interested in quantifying the distance between *wolf* and *dog* in a unidimensional semantic space constrained by perceived threat. A simple subtraction of the respective threat ratings for wolf and dog would yield an empirical index of the distance between these two concepts in 'threat' space. In practice, most researchers interested in modeling semantic relationships do so using multidimensional semantic spaces. This approach involves quantifying the salience of target words across many unique psychological dimensions (e.g., color, sound, threat, etc.) or in the case of word embedding models across a series of hyperparameters.
+Consider, for example, a researcher interested in quantifying the distance between *wolf* and *dog* in a unidimensional semantic space constrained by perceived threat. A simple subtraction of the respective threat ratings for *wolf* and *dog* would yield an empirical index of the distance between these two concepts in 'threat' space. In practice, most researchers interested in modeling semantic relationships do so using multidimensional semantic spaces. This approach involves quantifying the salience of target words across many unique psychological dimensions (e.g., color, sound, threat, etc.) or in the case of word embedding models across a series of hyperparameters.

-`SemanticDistance` will append distance values between each pair of elements specified by the user (e.g., word-to-word, ngram-to-word). These distance values are derived from two large lookup databases in the package with fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GLOVE word embedding model (300 hyperparameters per word) [@Pennington2014]. `CodDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence).
+`SemanticDistance` appends distance values between each pair of elements specified by the user (e.g., word-to-word, ngram-to-word, ngram-to-ngram). These distance values are derived from two large lookup databases in the package with fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GLOVE word embedding model (300 hyperparameters per word) [@Pennington2014]. `CodDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence).

 Users specify an ngram window size. This window rolls successively over a language sample to compute a semantic distance value for each new word relative to the n-words (ngram size) before it. A 1-gram distance computes the distance from word-to-word; a 2-gram would compute the distance from a pair of words to the next pair, and so on.
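
Note on the rolling window described above: the following is a minimal base-R sketch of the idea, not the package's implementation. It assumes that the n preceding word vectors are averaged before the cosine distance to the next word is taken; the lookup table `toy_vectors`, its values, and the helper names `cosine_distance` and `rolling_ngram_distance` are all hypothetical.

```r
# Toy lookup table: rows are words, columns are semantic dimensions.
# The values are invented for illustration, not taken from the package.
toy_vectors <- rbind(
  wolf  = c(0.9, 0.2, 0.8),
  dog   = c(0.8, 0.3, 0.2),
  bone  = c(0.3, 0.1, 0.1),
  piano = c(0.0, 0.9, 0.0)
)

# Cosine distance = 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# For each word from position n + 1 onward, average the vectors of the
# preceding n words (an assumption of this sketch) and compute the
# distance from that average to the current word. Earlier words get NA.
rolling_ngram_distance <- function(words, vectors, n = 2) {
  dists <- rep(NA_real_, length(words))
  if (length(words) <= n) return(dists)
  for (i in (n + 1):length(words)) {
    window_mean <- colMeans(vectors[words[(i - n):(i - 1)], , drop = FALSE])
    dists[i] <- cosine_distance(window_mean, vectors[words[i], ])
  }
  dists
}

sample_words <- c("wolf", "dog", "bone", "piano")
data.frame(
  word = sample_words,
  dist_to_previous_2 = rolling_ngram_distance(sample_words, toy_vectors, n = 2)
)
```

The first n rows have no complete window behind them, so the sketch leaves them as `NA`. Averaging over a larger window dampens the influence of any single word, which matches the intuition that larger ngram sizes give a smoother distance series.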

-This model of computing distance is illustrated in the figure. The larger the specified ngram size, the smoother the semantic vector will be over the provided language sample.
-

 ## Preparation of text

-Before using `SemanticDistance`, figure out what format your transcript is in and what you want to measure. `SemanticDistance` offers many possible options with some default arguments. For example, the package requires users to clean and prepare the data. You can choose to omit stopwords, lemmatize, split strings, and so on. Or, you can decide to leave your data alone and split the transcript into a one-word-per-row format. The prepared dataframe should nominally contain a text column and a speaker/talker column.
+Before using `SemanticDistance`, users need to decide what format their text is in and what they want to measure. `SemanticDistance` offers many possible options with some default arguments. For example, the package requires users to clean and prepare the data. Users can choose to omit stopwords, lemmatize, split strings, and so on. Or, users can decide to leave their data alone and split the transcript into a one-word-per-row format. The prepared dataframe should minimally contain a text column and a speaker/talker column.

 ```{r, message=FALSE}
 library(SemanticDistance)

-Monologue_Cleaned <- clean_monologue(dat=Monologue_Structured, wordcol='mytext', clean=TRUE, omit_stops=TRUE, split_strings=TRUE)
+Monologue_Cleaned <- clean_monologue(dat=Monologue_Structured,
+    wordcol='mytext', clean=TRUE,
+    omit_stops=TRUE, split_strings=TRUE)
 head(Monologue_Cleaned, n=8)
 ```


 ## Semantic distance estimates

-The included function averages the semantic vectors for all content words in a turn, then computes the distance to the average of the semantic vectors of the content words in the subsequent turn. It averages across the semantic vectors of all words within a turn and then computes cosine distance to all the words in the next turn. A user simply needs to feed it a transcript formatted with `clean_dialogue`. `dist_dialogue` will return a summary dataframe with distance values aggregated by talker and turn (`id_turn`).
+The included function averages the semantic vectors for all content words in a turn, then computes the distance to the average of the semantic vectors of the content words in the subsequent turn. It averages across the semantic vectors of all words within a turn and then computes cosine distance to all the words in the next turn. A user simply needs to feed it a transcript formatted with `clean_dialogue`. The function `dist_dialogue` returns a summary dataframe with distance values aggregated by talker and turn (`id_turn`).
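
Note on the turn-to-turn computation described above: the sketch below illustrates the averaging step in base R under stated assumptions and is not the code behind `dist_dialogue`. The toy dialogue, the `word` and `talker` column names, the `toy_vectors` table, and the `cosine_distance` helper are hypothetical; only `id_turn` is a name taken from the text.

```r
# Toy cleaned dialogue, one content word per row. Column names other than
# id_turn, and all values, are hypothetical.
dialogue <- data.frame(
  id_turn = c(1, 1, 2, 2, 3),
  talker  = c("A", "A", "B", "B", "A"),
  word    = c("wolf", "dog", "bone", "dog", "piano"),
  stringsAsFactors = FALSE
)

# Invented semantic vectors (rows = words, columns = dimensions).
toy_vectors <- rbind(
  wolf  = c(0.9, 0.2, 0.8),
  dog   = c(0.8, 0.3, 0.2),
  bone  = c(0.3, 0.1, 0.1),
  piano = c(0.0, 0.9, 0.0)
)

# Cosine distance = 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# One averaged vector per turn (rows = turns, in order of appearance).
turn_ids <- unique(dialogue$id_turn)
turn_means <- t(sapply(turn_ids, function(id) {
  colMeans(toy_vectors[dialogue$word[dialogue$id_turn == id], , drop = FALSE])
}))

# Distance from each turn's average to the next turn's average.
# The final turn has no successor, so its value is NA.
n_turns <- length(turn_ids)
turn_dist <- data.frame(
  id_turn = turn_ids,
  talker  = dialogue$talker[match(turn_ids, dialogue$id_turn)],
  dist_to_next_turn = c(
    vapply(seq_len(n_turns - 1),
           function(i) cosine_distance(turn_means[i, ], turn_means[i + 1, ]),
           numeric(1)),
    NA_real_
  )
)
turn_dist
```

The result has one row per turn, which mirrors the summary-by-talker-and-turn shape described in the text. Whether the package reports the distance on the earlier or the later turn of each pair is not stated in the source, so the row alignment here is a choice made for the sketch.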


 ```{r, message=FALSE}
@@ -109,7 +113,10 @@ head(Ngram2Ngram_Dist1)

 ## Visualization

-`SemanticDistance` allows several visualizations of the data...
+`SemanticDistance` allows several visualizations of the data. These include cluster and dendrogram visualizations of how words in a text sample relate to one another.
+
+
+<!-- ![This description will be the figure caption.](figures/cluster.png) -->



paper/Reilly_SemanticDistance_JOSS.html

Lines changed: 0 additions & 563 deletions
This file was deleted.
