Commit f667563

Added paper folder for JOSS manuscript
1 parent ed580af commit f667563

8 files changed

+4247
-0
lines changed

.Rbuildignore

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@
 ^_pkgdown\.yml$
 ^docs$
 ^pkgdown$
+^paper$
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
1+
---
2+
title: 'SemanticDistance: An R package for computing semantic relationships in language samples'
3+
4+
5+
tags:
6+
- R
7+
- psychology
8+
- semantic memory
9+
- natural language processing
10+
authors:
11+
- name: Jamie Reilly
12+
orcid: 0000-0002-0891-438X
13+
equal-contrib: false
14+
corresponding: true
15+
affiliation: "1, 2"
16+
- name: Emily B. Myers
17+
orcid: 0000-0000-0000-0000
18+
equal-contrib: false
19+
corresponding: false
20+
affiliation: "3, 4"
21+
- name: Hannah R. Mechtenberg
22+
orcid: 0000-0003-1436-1846
23+
equal-contrib: false
24+
corresponding: false
25+
affiliation: 4
26+
- name: Jonathan E. Peelle
27+
orcid: 0000-0001-9194-854X
28+
equal-contrib: false
29+
corresponding: false
30+
affiliation: "5, 6, 7" # (Multiple affiliations must be quoted)
31+
32+
affiliations:
33+
- name: Department of Communication Sciences and Disorders, Temple University, United States
34+
index: 1
35+
- name: Department of Psychology and Neuroscience, Temple University, United States
36+
index: 2
37+
- name: Department of Speech, Language, and Hearing Sciences, University of Connecticut, United States
38+
index: 3
39+
- name: Department of Psychological Sciences, University of Connecticut, United States
40+
index: 4
41+
- name: Institute for Cognitive and Brain Health, Northeastern University, United States
42+
index: 5
43+
- name: Department of Communication Sciences and Disorders, Northeastern University, United States
44+
index: 6
45+
- name: Department of Psychology, Northeastern University, United States
46+
index: 7
47+
48+
date: "`r format(Sys.Date(), '%e %B %Y')`"
49+
bibliography: paper.bib
50+
51+
output: rticles::joss_article
52+
csl: apa-single-spaced.csl
53+
journal: JOSS
54+
55+
56+
---
57+
58+
59+
# Summary
60+
61+
Word meanings can be represented as vectors in n-dimensional semantic space. The distance between successive items in a list, sentence, or story is important for understanding the type of information being conveyed. `SemanticDistance` is an R package for flexibly estimating the distance between words or groups of words in a text.
62+
63+
64+
# Statement of Need
65+
66+
One of the specific things listeners do in the process of understanding a conversation or story is to relate incoming words to what they have previously heard. *Semantic distance* [@Reilly2023] corresponds to the dissimilarity between two or more concepts within an n-dimensional space, typically derived from analyzing word co-occurrence in large corpora of texts [@Landauer1997]. To our knowledge, no existing packages implement these calculations.
67+
68+
69+
70+
71+
# Description
72+
73+
The `SemanticDistance` package is available from:
74+
75+
https://github.com/Reilly-ConceptsCognitionLab/SemanticDistance
76+
77+
Consider, for example, a researcher interested in quantifying the distance between *wolf* and *dog* in a unidimensional semantic space defined by perceived threat. Subtracting the threat rating for dog from the threat rating for wolf would yield an empirical index of the distance between these two concepts in 'threat' space. In practice, most researchers interested in modeling semantic relationships do so using multidimensional semantic spaces. This approach involves quantifying the salience of target words across many unique psychological dimensions (e.g., color, sound, threat) or, in the case of word embedding models, across a set of latent embedding dimensions.
78+
79+
`SemanticDistance` appends distance values between each pair of elements specified by the user (e.g., word-to-word, ngram-to-word). These distance values are derived from two large lookup databases in the package with fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GloVe word embedding model (300 dimensions per word) [@Pennington2014]. `CosDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence).
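
As a rough illustration of the metric named above, the following base R sketch computes cosine distance between two vectors, assuming the common definition of one minus cosine similarity. The helper `cosine_distance` and the three-dimensional vectors are hypothetical and are not taken from the package or its lookup databases.

```r
# Cosine distance between two numeric vectors, assuming the common
# definition: 1 - cosine similarity. Hypothetical helper, not package code.
cosine_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 3-dimensional vectors (illustrative values only).
wolf <- c(0.9, 0.2, 0.4)
dog  <- c(0.3, 0.2, 0.5)

cosine_distance(wolf, wolf)  # same direction: 0 (up to rounding)
cosine_distance(wolf, dog)   # positive; grows as the vectors diverge
```

Under this definition, vectors pointing in the same direction have distance 0 and orthogonal vectors have distance 1, regardless of vector length.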
80+
81+
Users specify an ngram window size. This window rolls successively over a language sample to compute a semantic distance value for each new word relative to the n words (the ngram size) before it. With an ngram size of 1, distance is computed from word to word; with an ngram size of 2, distance is computed from one pair of words to the next pair, and so on.
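
The rolling window can be sketched in base R as follows. This is a minimal illustration, assuming that the vectors of the n preceding words are averaged before the distance to the current word is taken; the function `rolling_ngram2word` and the toy two-dimensional vectors are hypothetical and do not reproduce the package internals.

```r
# Hypothetical helper: cosine distance as 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# vecs: numeric matrix with one row per word, in running order.
# Returns one value per word; the first n entries are NA because those
# words do not have a full window of n preceding words.
rolling_ngram2word <- function(vecs, n = 2) {
  out <- rep(NA_real_, nrow(vecs))
  if (nrow(vecs) <= n) return(out)
  for (i in (n + 1):nrow(vecs)) {
    # Assumption: the window is summarized by averaging its word vectors.
    window_mean <- colMeans(vecs[(i - n):(i - 1), , drop = FALSE])
    out[i] <- cosine_distance(window_mean, vecs[i, ])
  }
  out
}

# Hypothetical 2-dimensional vectors for a five-word sample.
toy <- rbind(
  c(1, 0),
  c(1, 0),
  c(1, 0),
  c(0, 1),
  c(0, 1)
)
rolling_ngram2word(toy, n = 2)
```

In this toy sample the distance jumps at the fourth word, where the topic shifts, and falls again at the fifth word because the window now contains one word from each topic.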
82+
83+
This model of computing distance is illustrated in the figure. The larger the specified ngram size, the smoother the semantic vector will be over the provided language sample.
84+
85+
86+
## Preparation of text
87+
88+
Before using `SemanticDistance`, determine what format your transcript is in and what you want to measure. `SemanticDistance` offers many options, each with default arguments. Users must first clean and prepare the data: you can choose to omit stopwords, lemmatize, split strings, and so on, or you can leave the text as is and simply split the transcript into a one-word-per-row format. The prepared dataframe should contain, at minimum, a text column and a speaker/talker column.
89+
90+
```{r, message=FALSE}
91+
library(SemanticDistance)
92+
93+
Monologue_Cleaned <- clean_monologue(dat=Monologue_Structured, wordcol='mytext', clean=TRUE, omit_stops=TRUE, split_strings=TRUE)
94+
head(Monologue_Cleaned, n=8)
95+
```
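
To make the one-word-per-row format concrete, the base R sketch below lowercases a toy transcript, splits it into words, and drops a few stopwords. It does not show how `clean_monologue` works internally; the data frame `raw`, the column `talker`, and the three-word stopword list are hypothetical.

```r
# Hypothetical raw transcript with one utterance per row.
raw <- data.frame(
  talker = c("A", "B"),
  mytext = c("The wolf ran to the forest", "A dog barked"),
  stringsAsFactors = FALSE
)

# Tiny illustrative stopword list (an assumption, not the package's list).
stops <- c("the", "a", "to")

# Split each utterance into lowercase words, drop stopwords, and keep
# the talker label alongside every remaining word.
rows <- lapply(seq_len(nrow(raw)), function(i) {
  words <- strsplit(tolower(raw$mytext[i]), "[^a-z']+")[[1]]
  words <- words[nzchar(words) & !(words %in% stops)]
  data.frame(
    talker = rep(raw$talker[i], length(words)),
    word = words,
    stringsAsFactors = FALSE
  )
})
one_per_row <- do.call(rbind, rows)
one_per_row
```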
96+
97+
98+
## Semantic distance estimates
99+
100+
The included `dist_dialogue` function averages the semantic vectors of all content words in a turn, then computes the cosine distance to the average of the semantic vectors of the content words in the subsequent turn. A user simply needs to supply a transcript formatted with `clean_dialogue`. `dist_dialogue` returns a summary dataframe with distance values aggregated by talker and turn (`id_turn`).
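
The turn-level computation described here can be sketched in base R. This is only an illustration of the averaging logic under stated assumptions: the toy data frame, the columns `talker`, `d1`, and `d2`, and the output column `dist_to_next` are hypothetical, and the code does not call or reproduce `dist_dialogue`.

```r
# Hypothetical helper: cosine distance as 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical transcript: one content word per row, with a turn index
# and a toy 2-dimensional semantic vector in columns d1 and d2.
dialogue <- data.frame(
  id_turn = c(1, 1, 2, 2, 3),
  talker  = c("A", "A", "B", "B", "A"),
  d1      = c(1, 1, 0, 0, 1),
  d2      = c(0, 0, 1, 1, 1)
)

# Average the word vectors within each turn (one row per turn),
# then put the turns back in running order.
turn_means <- aggregate(cbind(d1, d2) ~ id_turn + talker,
                        data = dialogue, FUN = mean)
turn_means <- turn_means[order(turn_means$id_turn), ]

# Distance from each turn's mean vector to the next turn's mean vector.
# The final turn has no successor, so its value stays NA.
m <- as.matrix(turn_means[, c("d1", "d2")])
turn_means$dist_to_next <- NA_real_
for (i in seq_len(nrow(m) - 1)) {
  turn_means$dist_to_next[i] <- cosine_distance(m[i, ], m[i + 1, ])
}
turn_means
```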
101+
102+
103+
```{r, message=FALSE}
104+
Ngram2Ngram_Dist1 <- dist_ngram2ngram(dat=Monologue_Cleaned, ngram=2)
105+
head(Ngram2Ngram_Dist1)
106+
```
107+
108+
109+
110+
## Visualization
111+
112+
`SemanticDistance` allows several visualizations of the data...
113+
114+
115+
116+
# Acknowledgements
117+
118+
This work was supported in part by grants R01 DC013063, R01 DC013064, and R01 DC019507 from the US National Institutes of Health.
119+
120+
121+
# References
122+

paper/Reilly_SemanticDistance_JOSS.html

Lines changed: 563 additions & 0 deletions
Large diffs are not rendered by default.
