This directory contains user-friendly tutorial notebooks for the ALIGN (Analyzing Linguistic Alignment) package.
Tutorial 1: Preprocessing (`tutorial_1_preprocessing.ipynb`)
Purpose: Learn how to prepare raw conversational transcripts for alignment analysis
What's Included:
- Step-by-step preprocessing workflow
- Using different POS taggers (NLTK, spaCy, Stanford)
- Setup instructions for optional taggers
- Input/output format validation
- Sample data inspection
Time to Complete: ~10-15 minutes (plus download time for optional taggers)
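To illustrate the kind of cleanup the preprocessing step performs, here is a minimal stdlib-only sketch (not the ALIGN implementation, whose pipeline also handles POS tagging): lowercase an utterance, strip punctuation, and tokenize on whitespace.

```python
import re

def preprocess_utterance(text):
    """Minimal cleanup: lowercase, strip punctuation, split into tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)  # keep word chars, spaces, apostrophes
    return text.split()

print(preprocess_utterance("Hello there, how are you?"))
# ['hello', 'there', 'how', 'are', 'you']
```

The real preprocessing step additionally attaches POS tags from NLTK, spaCy, or Stanford, which the alignment analyzers need for the syntactic measures.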
Tutorial 2: Alignment Analysis (`tutorial_2_alignment.ipynb`)
Purpose: Learn how to analyze linguistic alignment in preprocessed conversations
What's Included:
- Lexical-syntactic alignment (word and grammar similarity)
- Semantic alignment with FastText
- Semantic alignment with BERT (optional)
- Conversation-level alignment (aggregate repertoire overlap)
- Multi-analyzer comprehensive analysis
- Visualization and interpretation
- Correlation analysis between alignment types
Time to Complete: ~15-20 minutes (plus download time for FastText on first run)
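To make the lexical side of this concrete, the sketch below (plain Python, not ALIGN's API) computes cosine similarity between the bigram count vectors of two adjacent utterances. This is the intuition behind n-gram-based lexical-syntactic alignment: the more n-grams a target utterance shares with the preceding one, the higher the score.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_overlap(a, b):
    """Cosine similarity between two n-gram count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

prime = "i am doing well today".split()
target = "i am doing great today".split()
print(cosine_overlap(ngrams(prime, 2), ngrams(target, 2)))  # 0.5: half the bigrams match
```

Semantic alignment (FastText, BERT) follows the same comparison logic but replaces n-gram count vectors with dense embedding vectors.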
Tutorial 3: Baseline Analysis (`tutorial_3_baseline.ipynb`)
Purpose: Learn how to establish baseline alignment levels using surrogate conversations
What's Included:
- Understanding surrogate/baseline analysis
- Generating surrogate conversation pairs (cross-role pairing)
- Controlling surrogate sample sizes with `num_surrogates`
- Multi-party conversation support via dyadic decomposition
- Analyzing alignment in surrogate data
- Comparing real vs. baseline alignment
- Statistical significance testing
- Interpreting results
Why This Matters:
- Establishes what alignment occurs "by chance"
- Allows statistical testing of real alignment
- Essential for research and publication
- Helps interpret whether observed alignment is meaningful
Time to Complete: ~20-30 minutes (generates many surrogate pairs)
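The cross-role pairing idea can be sketched as follows. Everything here is hypothetical illustration (the `surrogate_pairs` helper and dyad names are made up); only the `num_surrogates` parameter name comes from the package. Each surrogate pairs a speaker from one real dyad with a speaker from a *different* real dyad, producing conversations that never actually happened.

```python
from itertools import combinations
import random

def surrogate_pairs(dyads, num_surrogates=None, seed=42):
    """Cross-pair speakers from different real dyads to build surrogate conversations."""
    pairs = list(combinations(dyads, 2))  # every cross-dyad pairing
    if num_surrogates is not None and num_surrogates < len(pairs):
        rng = random.Random(seed)         # seeded for reproducibility
        pairs = rng.sample(pairs, num_surrogates)
    return pairs

print(surrogate_pairs(["dyad1", "dyad2", "dyad3"]))
# [('dyad1', 'dyad2'), ('dyad1', 'dyad3'), ('dyad2', 'dyad3')]
```

Because surrogate partners never heard each other, any "alignment" measured in these pairs reflects chance overlap, which is exactly the baseline the real conversations are tested against.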
```bash
git clone https://github.com/your-username/align2-linguistic-alignment.git
cd align2-linguistic-alignment
pip install -r requirements.txt
pip install -e .
jupyter notebook tutorial_1_preprocessing.ipynb
```

💡 Tip: You can also open and run these notebooks in Visual Studio Code! VS Code has excellent Jupyter notebook support with features like IntelliSense, debugging, and variable inspection. Just open the `.ipynb` file in VS Code and click "Run All" or run cells individually.
- View the notebook on GitHub to see expected outputs
- Download and run locally to process your own data
- Use included CHILDES sample data to learn
```bash
jupyter notebook tutorial_2_alignment.ipynb
```

- Use preprocessed data from Tutorial 1
- Compute alignment metrics
- Visualize and interpret results
```bash
jupyter notebook tutorial_3_baseline.ipynb
```

- Generate surrogate conversation pairs
- Compute baseline alignment levels
- Test if real alignment is statistically significant
- Publish with confidence!
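One simple way to frame the significance test, sketched here with made-up numbers (the tutorial's own statistics may differ): express the observed alignment as a z-score against the distribution of surrogate baseline scores.

```python
import statistics

# Hypothetical numbers for illustration only; substitute your own results.
real_alignment = 0.42                    # observed mean alignment in real dyads
baseline = [0.21, 0.25, 0.19, 0.23,
            0.27, 0.22, 0.24, 0.20]      # mean alignment in surrogate pairs

mu = statistics.mean(baseline)
sd = statistics.stdev(baseline)
z = (real_alignment - mu) / sd           # standard deviations above the baseline
print(f"z = {z:.2f}")
```

A large positive z-score indicates the real conversations align far beyond what chance overlap produces; in practice you would pair this with a formal test (e.g. a t-test or permutation test) as the tutorial demonstrates.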
```
tutorial_output/
├── preprocessed_nltk/          # NLTK-only (fastest)
├── preprocessed_spacy/         # NLTK + spaCy (recommended)
└── preprocessed_stanford/      # NLTK + Stanford (highest accuracy)
```
```
tutorial_output/alignment_results/
├── lexsyn/                     # Lexical-syntactic alignment results
│   ├── lexsyn_alignment_ngram2_lag1_noDups_noAdd.csv
│   ├── lexsyn_alignment_ngram2_lag1_noDups_withSpacy.csv
│   └── convo_lexsyn_alignment_ngram2_noDups_noAdd.csv   # Conversation-level
├── fasttext/                   # FastText semantic alignment
│   ├── semantic_alignment_fasttext_lag1_sd3_n1.csv
│   └── convo_semantic_alignment_fasttext_sd3_n1.csv     # Conversation-level
├── bert/                       # BERT semantic alignment (optional)
│   └── semantic_alignment_bert-base-uncased_lag1.csv
├── merged/                     # Combined multi-analyzer results
│   └── merged-lag1-ngram2-noAdd-noDups-sd3-n1.csv
└── cache/                      # Model caches (FastText, BERT)
```
```
tutorial_output/baseline_results/
├── surrogates/                 # Generated surrogate conversation pairs
│   └── surrogate_run-{timestamp}/
│       ├── SurrogatePair-dyad1-dyad2-cond1.txt
│       ├── SurrogatePair-dyad1-dyad3-cond1.txt
│       └── ... (one file per surrogate pair)
├── lexsyn/                     # Baseline alignment results
│   └── baseline_alignment_lexsyn_ngram2_lag1_noDups_noAdd.csv
├── fasttext/                   # Baseline semantic alignment
│   └── baseline_alignment_fasttext_lag1_sd3_n1.csv
└── comparison/                 # Real vs. baseline comparisons
    └── alignment_comparison_lexsyn.csv
```
- Tab-delimited text files (`.txt`)
- Required columns: `participant`, `content`
- UTF-8 encoding
- One utterance per row
```
participant	content
Speaker1	Hello there
Speaker2	Hi how are you
Speaker1	I am doing well
```
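A quick stdlib check that a transcript matches this format (a hypothetical snippet, not part of ALIGN): parse it as tab-delimited and confirm the required columns are present.

```python
import csv
import io

# A minimal tab-delimited transcript in the expected format.
sample = "participant\tcontent\nSpeaker1\tHello there\nSpeaker2\tHi how are you\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
assert {"participant", "content"} <= set(reader.fieldnames)  # required columns
rows = list(reader)
print(len(rows))  # one utterance per row
```

For real files, open them with `encoding="utf-8"` to match the UTF-8 requirement above.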
- Tutorial 1: Change `INPUT_DIR` to your data directory, then run preprocessing
- Tutorial 2: Update `INPUT_DIR_NLTK` to your preprocessed output, then run alignment analysis
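For example, a small sanity check before running the notebooks (the paths below are hypothetical placeholders; `INPUT_DIR` and `INPUT_DIR_NLTK` are the variable names used in the tutorials):

```python
from pathlib import Path

# Hypothetical paths; point these at your own directories.
INPUT_DIR = Path("my_transcripts")                           # raw .txt files (Tutorial 1)
INPUT_DIR_NLTK = Path("tutorial_output/preprocessed_nltk")   # Tutorial 2 input

for d in (INPUT_DIR, INPUT_DIR_NLTK):
    print(d, "exists" if d.exists() else "missing")
```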