
Commit cca76f0

updated vignettes
1 parent ef932f1 commit cca76f0

File tree

6 files changed: +289 −26 lines

vignettes/CA_Intro.Rmd

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
title: ConversationAlign Intro
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r format(Sys.Date(), '%B %d, %Y')`"
show_toc: true
slug: ConversationAlign
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ConversationAlign Intro}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Intro to ConversationAlign
A good conversation is a cooperative endeavor in which both parties adapt the form and content of their own production to match each other, a phenomenon known as alignment. People align across many dimensions, including the words they choose and the affective tenor of their prosody. ``ConversationAlign`` measures the dynamics of lexical use between conversation partners across more than 40 semantic, lexical, phonological, and affective dimensions. Before launching into your analyses, there are some important caveats to consider.

## Caveats for Using ConversationAlign
- Language analyses often proliferate in complexity. Spend the extra time to keep careful records and a logical organization system (e.g., smart filenaming, variable keys, a formalized processing pipeline).
- ``ConversationAlign`` only works on dyadic language transcripts (i.e., 2-person dialogues).
- ``ConversationAlign`` does NOT parse turns automatically. The software will aggregate all words produced by one speaker across sentences and rows until a switch occurs in the 'speaker' column.
- ``ConversationAlign`` will strip punctuation and other special characters automatically.
- ``ConversationAlign`` will split/vectorize your text into a one-word-per-row format, retaining all variable labels.
- ``ConversationAlign`` will retain all metadata throughout text processing (e.g., timestamps, grouping variables). ``ConversationAlign`` is pretty good at detecting and repairing unconventional font encodings, but it will not catch everything. You will find all sorts of hidden junk when you copy/paste interview transcripts from random websites or from YouTube. Inspect your transcripts to make sure they are what you think they are before launching into a complex computational analysis.
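The turn-aggregation rule above can be sketched in base R. This is an illustrative stand-in, not ConversationAlign's internal code; the column names and helper logic are assumptions.

```r
# Illustrative sketch (NOT ConversationAlign internals): consecutive rows by
# the same speaker collapse into a single turn until the speaker switches.
transcript <- data.frame(
  speaker = c("A", "A", "B", "A"),
  text    = c("Hello there.", "How are you?", "Fine, thanks.", "Great.")
)

# Label each run of consecutive identical speakers, then paste text within runs
runs <- rle(transcript$speaker)
transcript$turn <- rep(seq_along(runs$lengths), runs$lengths)

turns <- aggregate(text ~ turn + speaker, data = transcript,
                   FUN = paste, collapse = " ")
turns <- turns[order(turns$turn), ]
# Speaker A's first two rows merge into a single turn; 3 turns total
```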

## Prepare your Transcripts Outside of the Package
- Save each conversation transcript as a separate file (``*.txt``, ``*.csv``, or Otter ``*.ai``).
- Be deliberate about your filenaming convention. Each transcript's filename will become its unique document identifier (Event_ID) when imported into ``ConversationAlign``. <br>
- Store all the transcripts you want to analyze in the same folder (e.g., '/my_transcripts').
- Your transcript folder and analysis scripts should ideally exist within the same directory.
- Each raw conversation transcript **MUST** nominally contain at least two columns: talker/interlocutor and text.
- Name your talker/interlocutor column 'Interlocutor', 'Speaker', or 'Participant' (not case sensitive).
- Name your text column 'Text', 'Utterance', or 'Turn' (not case sensitive).
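As a quick sanity check on the requirements above, here is a minimal transcript with the two required columns. The file path and contents are hypothetical.

```r
# Hypothetical minimal transcript satisfying the two-column requirement
tmp <- file.path(tempdir(), "my_conversation.csv")
write.csv(
  data.frame(
    Speaker = c("A", "B", "A"),
    Text    = c("Hi, how are you?", "Good, thanks. You?", "Doing well.")
  ),
  tmp, row.names = FALSE
)

dat <- read.csv(tmp)
# The filename (minus extension) would become this transcript's Event_ID
```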

# Installation
Install and load the development version of `ConversationAlign` from [GitHub](https://github.com/) using the `devtools` package.
```{r, message=FALSE, warning=FALSE}
# Check if devtools is installed; if not, install it
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools
library(devtools)

# Check if ConversationAlign is installed; if not, install from GitHub
if (!require("ConversationAlign", quietly = TRUE)) {
  devtools::install_github("Reilly-ConceptsCognitionLab/ConversationAlign")
}

# Load ConversationAlign
library(ConversationAlign)
```

## Calibration Transcripts Included in `ConversationAlign`
``ConversationAlign`` contains two sample conversation transcripts that are pre-loaded when you call the package. These are:
**MaronGross_2013:** Interview transcript of Marc Maron and Terry Gross on NPR (2013). <br>
**NurseryRhymes:** Three nursery rhymes looping the same phrases, formatted as conversations, cleaned, and aligned to illustrate how the formatting pipeline reshapes conversation transcripts.

### NurseryRhymes
```{r, eval=TRUE, message=FALSE, warning=FALSE}
knitr::kable(head(NurseryRhymes, 20), format = "simple")
str(NurseryRhymes)
```

### Maron-Gross Interview
Here's one from a 2013 NPR interview (USA) between Marc Maron and Terry Gross, titled [Marc Maron: A Life Fueled By 'Panic And Dread'](https://www.npr.org/transcripts/179014321).<br>
```{r}
knitr::kable(head(MaronGross_2013, 20), format = "simple")
str(MaronGross_2013)
```

# Caveat emptor
Any analysis of language comes with assumptions and potential bias. For example, in some instances a researcher might care about morphemes and grammatical elements such as 'the', 'a', 'and', etc. The default for ConversationAlign is to omit these as stopwords and to average across all open-class words (e.g., nouns, verbs) in each turn by interlocutor. There are some specific cases where this can all go wrong. Here are some things to consider: <br>

1. <span style="color:red;">Stopwords</span>: ``ConversationAlign`` omits stopwords by default, applying a customized stopword list, ``Temple_Stopwords25``. [CLICK HERE](https://osf.io/dc5k7) to inspect the list. This stopword list includes greetings, idioms, filler words, numerals, and pronouns.

2. <span style="color:red;">Lemmatization</span>: The package will lemmatize your language transcripts by default. Lemmatization transforms inflected forms (e.g., standing, stands) into their root or dictionary entry (e.g., stand). This helps when yoking offline values (e.g., happiness, concreteness) to each word and also performs what NLP folks refer to as 'term aggregation'. However, sometimes you might NOT want to lemmatize. You can easily change this option by supplying the argument "lemmatize=FALSE" to the clean_dyads function below. <br>

3. <span style="color:red;">Sample Size Issue 1: Exchange Count</span>: The program derives correlations and AUC for each dyad as metrics of alignment. For very brief conversations (<30 turns), the likelihood of unstable or unreliable estimates is high. <br>

4. <span style="color:red;">Sample Size Issue 2: Matching to the lookup database</span>: ConversationAlign works by yoking values from a lookup database to each word in your language transcript. Some variables have values characterizing many English words. Other variables (e.g., age of acquisition) only cover about 30k words. When a word in your transcript does not have a match in the lookup database, ConversationAlign will return an NA, which will not go into the average of the words for that interlocutor and turn. This can be dangerous when there are many missing values. Beware! <br>

5. <span style="color:red;">Compositionality</span>: ConversationAlign is a caveman in its complexity. It matches a value to each word as if that word were an island. Phenomena like polysemy (e.g., bank) and the modulation of one word by an intensifier (e.g., very terrible) are not handled. This is a problem for many of the affective measures but not for lexical variables like word length. <br>
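The lookup-matching caveat can be illustrated with a toy join. The variable names and norm values below are fabricated, and this is not the package's internal merge code.

```r
# Toy illustration of norm-yoking: words absent from the lookup get NA
words  <- data.frame(word = c("dog", "run", "zyzzyva"))
lookup <- data.frame(word = c("dog", "run"),
                     concreteness = c(4.9, 3.1))  # fabricated toy values

# Left join: keep every transcript word, matched or not
yoked <- merge(words, lookup, by = "word", all.x = TRUE)
# 'zyzzyva' has no match, so its concreteness is NA and would drop out of
# any turn-level average computed with na.rm = TRUE
```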

# Background and Supporting Materials

1. **Preprint** <br>
Our PsyArXiv preprint describing the method(s) in greater detail is referenced as:
Sacks, B., Ulichney, V., Duncan, A., Helion, C., Weinstein, S., Giovannetti, T., ... Reilly, J. (2025, March 12). *ConversationAlign: Open-Source Software for Analyzing Patterns of Lexical Use and Alignment in Conversation Transcripts*. [Click Here](https://osf.io/preprints/psyarxiv/7xqnp_v1) to read our preprint. It was recently invited for revision at Behavior Research Methods. We will update when/if it is eventually accepted there! <br>

2. **Methods for creating the internal lookup database** <br>
ConversationAlign contains a large, internal lexical lookup database. [Click Here](https://reilly-lab.github.io/ConversationAlign_LookupDatabaseCreation.html) to see how we created this by merging other offline psycholinguistic databases into one. <br>

3. **Variable Key for ConversationAlign** <br>
ConversationAlign currently allows users to compute alignment dynamics across 30 different lexical, affective, and semantic dimensions. [Click Here](https://reilly-lab.github.io/ConversationAlign_VariableLookupKey.pdf) to link to a variable key. <br>

# References
Lewis, D. D., et al. (2004). RCV1: A new benchmark collection for text categorization research. *Journal of Machine Learning Research*, 5, 361–397.

vignettes/CA_Step1_Read.Rmd

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: CA Step 1 Read_Dyads
subtitle: Read and Format Data for ConversationAlign
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r Sys.Date()`"
show_toc: true
slug: ConversationAlign Read
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{CA Step 1 Read_Dyads}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Installation
Install and load the development version of `ConversationAlign` from [GitHub](https://github.com/) using the `devtools` package.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Check if devtools is installed; if not, install it
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools
library(devtools)

# Check if ConversationAlign is installed; if not, install from GitHub
if (!require("ConversationAlign", quietly = TRUE)) {
  devtools::install_github("Reilly-ConceptsCognitionLab/ConversationAlign")
}

# Load ConversationAlign
library(ConversationAlign)
```

# Prep_Data

vignettes/CA_Step2_Prep.Rmd

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
---
title: CA Step 2 Prep Data
subtitle: prep_dyads()
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r format(Sys.Date(), '%B %d, %Y')`"
show_toc: true
slug: ConversationAlign Prep
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: |
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ConversationAlign Step_2 Prep_Dyads}
  %\VignetteEncoding{UTF-8}
---

Install and load the development version of `ConversationAlign` from [GitHub](https://github.com/) using the `devtools` package.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Check if devtools is installed; if not, install it
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools
library(devtools)

# Check if ConversationAlign is installed; if not, install from GitHub
if (!require("ConversationAlign", quietly = TRUE)) {
  devtools::install_github("Reilly-ConceptsCognitionLab/ConversationAlign")
}

# Load ConversationAlign
library(ConversationAlign)
```

## prep_dyads()
``prep_dyads()`` uses numerous regular expressions to clean and format the data you just read into R in the previous step. ``ConversationAlign`` applies an ordered sequence of cleaning steps on the road toward vectorizing your original text into a one-word-per-row format. These steps include converting all text to lowercase, expanding contractions, and omitting all non-alphabetic characters (e.g., numbers, punctuation, line breaks). In addition to text cleaning, users control options for stopword removal and lemmatization. During formatting, ``prep_dyads()`` will prompt you to select up to three variables to compute alignment on. This works by joining values from a large internal lookup database to each word in your language transcript. ``prep_dyads()`` is customizable via the following arguments.

### Stopword removal
There are two important arguments regarding stopword removal. ``omit_stops`` specifies whether or not to remove stopwords. ``which_stopwords`` specifies which stopword list to apply, the default being ``Temple_stops25``. The full list of choices is: ``none``, ``SMART_stops``, ``CA_orig_stops``, ``MIT_stops``, and ``Temple_stops25``. Stopword removal is an important, yet also controversial, step in text cleaning. <br>
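Conceptually, stopword removal is just a filter over the vectorized tokens. A minimal sketch follows; the tiny list below is a stand-in, as the real ``Temple_stops25`` list is far larger.

```r
# Tiny stand-in for a stopword list such as Temple_stops25
stops  <- c("the", "a", "and", "to")
tokens <- c("the", "dog", "ran", "to", "the", "park")

# Keep only tokens that are not on the stopword list
content_words <- tokens[!tokens %in% stops]
# content_words: "dog" "ran" "park"
```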

### Lemmatization
``ConversationAlign`` calls the ``textstem`` package as a dependency to lemmatize your language transcript. This converts morphological derivatives to their root forms. The default is ``lemmatize=TRUE``. Sometimes you may want to retain language output in its native form. If so, change the argument in clean_dyads to ``lemmatize=FALSE``. ``clean_dyads()`` outputs word-count metrics pre/post cleaning by dyad and interlocutor. This can be useful if you are interested in whether one person just doesn't produce many words or produces a great deal of empty utterances. <br>
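Conceptually, lemmatization maps inflected forms onto a shared root. A dictionary-lookup sketch is shown below; the package actually delegates this to ``textstem``, and the tiny lemma map here is fabricated for illustration.

```r
# Minimal lemma dictionary; real lemmatizers use large lookup tables and rules
lemma_map <- c(standing = "stand", stands = "stand", ran = "run")
tokens    <- c("standing", "stands", "walk")

# Map known inflections to their root; leave unknown tokens unchanged
lemmas <- unname(ifelse(tokens %in% names(lemma_map),
                        lemma_map[tokens], tokens))
# lemmas: "stand" "stand" "walk"
```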

### Dimension Selection
This is where the magic happens. ``prep_dyads()`` will yoke published norms for up to 45 possible dimensions to every content word in your transcript. This join is executed by merging your vectorized conversation transcript with a huge internal lexical database containing norms spanning over 100k English words. ``prep_dyads()`` will prompt you to select anywhere from 1 to 3 target dimensions at a time. Enter the number corresponding to each dimension of interest separated by spaces and then hit enter (e.g., 10 14 19). ``ConversationAlign`` will append a published norm, if available (e.g., concreteness, word length), to every running word in your transcript. These quantitative values are used in the subsequent ``summarize_dyads()`` step to compute alignment. <br>
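After norms are yoked, downstream steps average each dimension within interlocutor and turn, and words that lacked a match (NA) drop out of the mean. A toy sketch with fabricated values, not the package's internal code:

```r
# Fabricated yoked data: one NA from a word missing in the lookup database
yoked <- data.frame(
  turn         = c(1, 1, 2, 2, 2),
  speaker      = c("A", "A", "B", "B", "B"),
  concreteness = c(4.5, 3.5, NA, 4.0, 5.0)
)

# Rows with NA are dropped before averaging (aggregate's default na.action)
turn_means <- aggregate(concreteness ~ turn + speaker, data = yoked,
                        FUN = mean)
# turn 1 (speaker A): 4.0; turn 2 (speaker B): 4.5
```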

vignettes/CA_Step3_Summarize.Rmd

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
---
title: CA Step 3 Summarize Dyads
subtitle: summarize_dyads()
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r format(Sys.Date(), '%B %d, %Y')`"
show_toc: true
slug: ConversationAlign Summarize
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: |
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{CA Step 3 Summarize Dyads}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Check if devtools is installed; if not, install it
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools
library(devtools)

# Check if ConversationAlign is installed; if not, install from GitHub
if (!require("ConversationAlign", quietly = TRUE)) {
  devtools::install_github("Reilly-ConceptsCognitionLab/ConversationAlign")
}

# Load ConversationAlign
library(ConversationAlign)
```

## summarize_dyads()
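The vignette body is truncated here, but per the package description, ``summarize_dyads()`` computes dyad-level alignment metrics such as correlations between interlocutors' turn series. The sketch below is purely illustrative with fabricated numbers, and is not the package's actual computation:

```r
# Fabricated per-turn means on one dimension for each interlocutor
speaker_A <- c(3.2, 3.5, 3.9, 4.1, 4.0)
speaker_B <- c(3.0, 3.4, 3.8, 4.2, 4.1)

# One simple alignment index: correlation of the two turn series.
# Values near 1 suggest the interlocutors are tracking each other.
alignment_r <- cor(speaker_A, speaker_B)
```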

vignettes/CA_Step4_Analytics.Rmd

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: CA Step 4 Corpus Analytics
subtitle: corpus_analytics()
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r format(Sys.Date(), '%B %d, %Y')`"
show_toc: true
slug: ConversationAlign Analytics
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{CA Step 4 Corpus Analytics}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Check if devtools is installed; if not, install it
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools
library(devtools)

# Check if ConversationAlign is installed; if not, install from GitHub
if (!require("ConversationAlign", quietly = TRUE)) {
  devtools::install_github("Reilly-ConceptsCognitionLab/ConversationAlign")
}

# Load ConversationAlign
library(ConversationAlign)
```

# Generate corpus analytics
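This section is truncated in the commit. As a stand-in, descriptive corpus analytics commonly include counts such as token totals and the type-token ratio; the sketch below illustrates those generic measures on a toy token vector and is not the output of ``corpus_analytics()``.

```r
# Toy token vector; real input would be a vectorized transcript
tokens <- c("the", "dog", "chased", "the", "cat",
            "and", "the", "dog", "barked")

n_tokens <- length(tokens)          # total running words
n_types  <- length(unique(tokens))  # distinct word forms
ttr      <- n_types / n_tokens      # type-token ratio (lexical diversity)
```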

vignettes/my-vignette.Rmd

Lines changed: 0 additions & 26 deletions
This file was deleted.
