| 1 | +--- |
| 2 | +title: 'SemanticDistance: An R package for computing semantic relationships in language samples' |
| 3 | + |
| 4 | + |
| 5 | +tags: |
| 6 | + - R |
| 7 | + - psychology |
| 8 | + - semantic memory |
| 9 | + - natural language processing |
| 10 | +authors: |
| 11 | + - name: Jamie Reilly |
| 12 | + orcid: 0000-0002-0891-438X |
| 13 | + equal-contrib: false |
| 14 | + corresponding: true |
| 15 | + affiliation: "1, 2" |
| 16 | + - name: Emily B. Myers |
| 17 | + orcid: 0000-0000-0000-0000 |
| 18 | + equal-contrib: false |
| 19 | + corresponding: false |
| 20 | + affiliation: "3, 4" |
| 21 | + - name: Hannah R. Mechtenberg |
| 22 | + orcid: 0000-0003-1436-1846 |
| 23 | + equal-contrib: false |
| 24 | + corresponding: false |
| 25 | + affiliation: 4 |
| 26 | + - name: Jonathan E. Peelle |
| 27 | + orcid: 0000-0001-9194-854X |
| 28 | + equal-contrib: false |
| 29 | + corresponding: false |
| 30 | + affiliation: "5, 6, 7" # (Multiple affiliations must be quoted) |
| 31 | + |
| 32 | +affiliations: |
| 33 | + - name: Department of Communication Sciences and Disorders, Temple University, United States |
| 34 | + index: 1 |
| 35 | + - name: Department of Psychology and Neuroscience, Temple University, United States |
| 36 | + index: 2 |
| 37 | + - name: Department of Speech, Language, and Hearing Sciences, University of Connecticut, United States |
| 38 | + index: 3 |
| 39 | + - name: Department of Psychological Sciences, University of Connecticut, United States |
| 40 | + index: 4 |
| 41 | + - name: Institute for Cognitive and Brain Health, Northeastern University, United States |
| 42 | + index: 5 |
| 43 | + - name: Department of Communication Sciences and Disorders, Northeastern University, United States |
| 44 | + index: 6 |
| 45 | + - name: Department of Psychology, Northeastern University, United States |
| 46 | + index: 7 |
| 47 | + |
| 48 | +date: "`r format(Sys.Date(), '%e %B %Y')`" |
| 49 | +bibliography: paper.bib |
| 50 | + |
| 51 | +output: rticles::joss_article |
| 52 | +csl: apa-single-spaced.csl |
| 53 | +journal: JOSS |
| 54 | + |
| 55 | + |
| 56 | +--- |
| 57 | + |
| 58 | + |
| 59 | +# Summary |
| 60 | + |
| 61 | +Word meanings can be represented as vectors in n-dimensional semantic space. The distance between successive items in a list, sentence, or story is important for understanding the type of information being conveyed. `SemanticDistance` is an R package for flexibly estimating the distance between words or groups of words in a text. |
| 62 | + |
| 63 | + |
| 64 | +# Statement of Need |
| 65 | + |
| 66 | +To understand a conversation or story, listeners must relate incoming words to what they have already heard. *Semantic distance* [@Reilly2023] corresponds to the dissimilarity between two or more concepts within an n-dimensional space, typically derived from analyzing word co-occurrence in large corpora of texts [@Landauer1997]. Currently, no existing packages implement these calculations. |
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | + |
| 71 | +# Description |
| 72 | + |
| 73 | +The `SemanticDistance` package is available from: |
| 74 | + |
| 75 | +https://github.com/Reilly-ConceptsCognitionLab/SemanticDistance |
| 76 | + |
| 77 | +Consider, for example, a researcher interested in quantifying the distance between *wolf* and *dog* in a unidimensional semantic space constrained by perceived threat. Subtracting the threat rating for *dog* from the threat rating for *wolf* would yield an empirical index of the distance between these two concepts in 'threat' space. In practice, most researchers interested in modeling semantic relationships use multidimensional semantic spaces. This approach involves quantifying the salience of target words across many unique psychological dimensions (e.g., color, sound, threat) or, in the case of word embedding models, across a set of latent dimensions. |
| 78 | + |
| 79 | +`SemanticDistance` appends distance values between each pair of elements specified by the user (e.g., word-to-word, ngram-to-word). These distance values are derived from two large lookup databases in the package with fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GloVe word embedding model (300 dimensions per word) [@Pennington2014]. `CosDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence). |
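
As a rough illustration of the cosine distance underlying both metrics, the following base R sketch computes one minus the cosine similarity of two vectors. The function name `cosine_distance` and the three-dimensional toy vectors are assumptions made for this example; they are not part of the package and do not reflect its actual lookup values.

```r
# Minimal sketch (not the package's implementation):
# cosine distance = 1 - cosine similarity of two numeric vectors.
cosine_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 3-dimensional vectors (illustrative values only)
wolf <- c(0.9, 0.2, 0.7)
dog  <- c(0.8, 0.3, 0.2)

cosine_distance(wolf, dog)          # small positive distance for similar vectors
cosine_distance(wolf, wolf)         # approximately 0 for identical vectors
cosine_distance(c(1, 0), c(0, 1))   # 1 for orthogonal vectors
```

Because cosine distance depends only on the angle between vectors, it is insensitive to overall vector length, which is one reason it is commonly used with word embeddings.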
| 80 | + |
| 81 | +Users specify an ngram window size. This window rolls successively over a language sample to compute a semantic distance value for each new word relative to the n words (the ngram size) before it. A 1-gram distance computes the distance from word to word; a 2-gram distance computes the distance from one pair of words to the next pair, and so on. |
| 82 | + |
| 83 | +This model of computing distance is illustrated in the figure. The larger the specified ngram size, the smoother the semantic vector will be over the provided language sample. |
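
The rolling window can be sketched in base R as follows. This is a simplified illustration, not the package's code: it assumes that the vectors of the n preceding words are averaged before computing cosine distance to the current word, and the names `rolling_ngram_distance` and `toy`, along with the vector values, are hypothetical.

```r
# Cosine distance helper (1 - cosine similarity)
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# vecs: numeric matrix, one row per word (in order of occurrence),
#       one column per semantic dimension
# n:    window size (number of preceding words to average)
rolling_ngram_distance <- function(vecs, n = 1) {
  n_words <- nrow(vecs)
  out <- rep(NA_real_, n_words)       # first n words have no full window
  if (n_words <= n) return(out)
  for (i in (n + 1):n_words) {
    # Assumption: the preceding n word vectors are averaged
    context <- colMeans(vecs[(i - n):(i - 1), , drop = FALSE])
    out[i] <- cosine_distance(context, vecs[i, ])
  }
  out
}

# Hypothetical word vectors (illustrative values only)
toy <- rbind(
  dog   = c(0.8, 0.3, 0.2),
  wolf  = c(0.9, 0.2, 0.7),
  cat   = c(0.7, 0.4, 0.1),
  piano = c(0.1, 0.9, 0.3)
)

rolling_ngram_distance(toy, n = 1)  # word-to-word; first value is NA
rolling_ngram_distance(toy, n = 2)  # two preceding words; first two values are NA
```

Positions without a complete preceding window are returned as `NA` in this sketch; how the package itself handles the start of a sample is not specified here.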
| 84 | + |
| 85 | + |
| 86 | +## Preparation of text |
| 87 | + |
| 88 | +Before using `SemanticDistance`, determine the format of your transcript and what you want to measure. `SemanticDistance` offers many options, some with default arguments. The package requires users to clean and prepare the data: you can choose to omit stopwords, lemmatize, split strings, and so on, or you can leave your data as-is and simply split the transcript into a one-word-per-row format. The prepared dataframe should nominally contain a text column and a speaker/talker column. |
| 89 | + |
| 90 | +```{r, message=FALSE} |
| 91 | +library(SemanticDistance) |
| 92 | + |
| 93 | +Monologue_Cleaned <- clean_monologue(dat=Monologue_Structured, wordcol='mytext', clean=TRUE, omit_stops=TRUE, split_strings=TRUE) |
| 94 | +head(Monologue_Cleaned, n=8) |
| 95 | +``` |
| 96 | + |
| 97 | + |
| 98 | +## Semantic distance estimates |
| 99 | + |
| 100 | +For dialogue transcripts, `dist_dialogue` averages the semantic vectors of all content words within a turn and then computes the cosine distance to the average of the semantic vectors of the content words in the subsequent turn. Users supply a transcript formatted with `clean_dialogue`, and `dist_dialogue` returns a summary dataframe with distance values aggregated by talker and turn (`id_turn`). The example below instead applies `dist_ngram2ngram` to the cleaned monologue using a 2-gram window. |
| 101 | + |
| 102 | + |
| 103 | +```{r, message=FALSE} |
| 104 | +Ngram2Ngram_Dist1 <- dist_ngram2ngram(dat=Monologue_Cleaned, ngram=2) |
| 105 | +head(Ngram2Ngram_Dist1) |
| 106 | +``` |
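
The turn-by-turn averaging described in this section can also be sketched in base R. This is a conceptual illustration under stated assumptions, not the package's implementation: the toy dataframe, its `talker` and `d1` to `d3` columns, and the `dist_to_next` column are hypothetical, and only `id_turn` is a name taken from the package's output.

```r
# Cosine distance helper (1 - cosine similarity)
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical word vectors for a short two-talker exchange,
# one row per content word (illustrative values only)
words <- data.frame(
  id_turn = c(1, 1, 2, 2, 3),
  talker  = c("A", "A", "B", "B", "A"),
  d1 = c(0.8, 0.9, 0.1, 0.2, 0.7),
  d2 = c(0.3, 0.2, 0.9, 0.8, 0.4),
  d3 = c(0.2, 0.7, 0.3, 0.1, 0.1)
)
dims <- c("d1", "d2", "d3")

# Step 1: average the word vectors within each turn
turn_means <- aggregate(words[dims], by = list(id_turn = words$id_turn), FUN = mean)

# Step 2: cosine distance from each turn's mean vector to the next turn's
n_turns <- nrow(turn_means)
turn_means$dist_to_next <- NA_real_   # last turn has no successor
for (i in seq_len(n_turns - 1)) {
  turn_means$dist_to_next[i] <- cosine_distance(
    unlist(turn_means[i, dims]),
    unlist(turn_means[i + 1, dims])
  )
}

turn_means[c("id_turn", "dist_to_next")]
```

Averaging within a turn before computing distance yields one value per turn transition rather than one value per word.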
| 107 | + |
| 108 | + |
| 109 | + |
| 110 | +## Visualization |
| 111 | + |
| 112 | +`SemanticDistance` allows several visualizations of the data... |
| 113 | + |
| 114 | + |
| 115 | + |
| 116 | +# Acknowledgements |
| 117 | + |
| 118 | +This work was supported in part by grants R01 DC013063, R01 DC013064, and R01 DC019507 from the US National Institutes of Health. |
| 119 | + |
| 120 | + |
| 121 | +# References |
| 122 | + |