This repository contains all the code and data accompanying our paper:
Rubehn, A., Montemagni, S., & Nerbonne, J. (2024). Extracting Tuscan phonetic correspondences from dialect pronunciations automatically. Language Dynamics and Change, 14(1), 1-33. https://doi.org/10.1163/22105832-bja10034
If you (re-)use parts of this code, please cite our work accordingly.
Clone into this repository on your machine:
$ git clone https://github.com/arubehn/TuscanSoundCorrespondences
$ cd TuscanSoundCorrespondencesIt is highly recommended to set up a fresh virtual environment.
In order to set up the project and install all required dependencies, simply run
$ pip install -e .Once the project is set up correctly, all analyses discussed in the paper can be replicated by executing the run.py script.
$ python3 run.pyYou can analyze your own data with just a few lines of code -- you just need to define the varieties of interest (V) and your data filepath. Input data is expected to be present either as a CLDF repository [1], or as a TSV file with headers according to the EDICTOR specifications. [2]
from shibboleth.calculate import ShibbolethCalculator
data = "YOUR/DATA/FILE/PATH"
varieties = ["your", "set", "of", "varieties"] # needs to match the language id given in the data exactly
sc = ShibbolethCalculator(varieties, data)
charac, repr, dist = sc.calculate_metrics(normalize=True) # normalize=True calculates normalized PMI; otherwise plain PMI is calculatedIf your data already Multiple Sequence Alignments (MSA), those are used by default. Otherwise, alignments are generated using the SCA algorithm [3] implemented in LingPy. [4]
The dataset used in this study is derived from the Atlante Lessicale Toscano (ALT). [5] Our derived dataset was published within the Lexibank collection under https://github.com/lexibank/alt/. This repository contains a condensed version of this Lexibank dataset (/resources/data/alt.tsv) -- for details on how this file was produced, please check the README in the resources/data folder.
[1] Forkel, R. et al. (2018). Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific data, 5(1), 1-10.
[2] List, J.-M. and van Dam, K. P. (2024): EDICTOR 3. A web-based tool for Computer-Assisted Language Comparison [Software Tool, Version 3.0]. MCL Chair at the University of Passau: Passau. URL: https://edictor.org.
[3] List, J.-M. (2012). SCA: Phonetic alignment based on sound classes. In Marija Slavkovik and Dan Lassiter (eds.), New Directions in Logic, Language, and Computation, 32–51. Berlin and Heidelberg: Springer. https://doi.org/10.1007/978-3-642-31467-4_3.
[4] List, J.-M. and Forkel, R. (2021). LingPy: A Python library for historical linguistics, version 2.6.9. Leipzig: Max Planck Institute for Evolutionary Anthropology. URL: https://lingpy.org, DOI: https://zenodo.org/badge/latestdoi/5137/lingpy/lingpy.
[5] Giacomelli, G. et al. (eds.) (2000). Atlante lessicale toscano. Rome: Lexis Progetti Editoriali.