Skip to content

arubehn/TuscanSoundCorrespondences

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TuscanSoundCorrespondences

This repository contains all the code and data accompanying our paper:

Rubehn, A., Montemagni, S., & Nerbonne, J. (2024). Extracting Tuscan phonetic correspondences from dialect pronunciations automatically. Language Dynamics and Change, 14(1), 1-33. https://doi.org/10.1163/22105832-bja10034

If you (re-)use parts of this code, please cite our work accordingly.

Installation

Clone into this repository on your machine:

$ git clone https://github.com/arubehn/TuscanSoundCorrespondences
$ cd TuscanSoundCorrespondences

It is highly recommended to set up a fresh virtual environment.

In order to set up the project and install all required dependencies, simply run

$ pip install -e .

Replication

Once the project is set up correctly, all analyses discussed in the paper can be replicated by executing the run.py script.

$ python3 run.py

Analyzing new data

You can analyze your own data with just a few lines of code -- you just need to define the varieties of interest (V) and your data filepath. Input data is expected to be present either as a CLDF repository [1], or as a TSV file with headers according to the EDICTOR specifications. [2]

from shibboleth.calculate import ShibbolethCalculator

data = "YOUR/DATA/FILE/PATH"
varieties = ["your", "set", "of", "varieties"]  # needs to match the language id given in the data exactly

sc = ShibbolethCalculator(varieties, data)
charac, repr, dist = sc.calculate_metrics(normalize=True) # normalize=True calculates normalized PMI; otherwise plain PMI is calculated

If your data already Multiple Sequence Alignments (MSA), those are used by default. Otherwise, alignments are generated using the SCA algorithm [3] implemented in LingPy. [4]

Dataset

The dataset used in this study is derived from the Atlante Lessicale Toscano (ALT). [5] Our derived dataset was published within the Lexibank collection under https://github.com/lexibank/alt/. This repository contains a condensed version of this Lexibank dataset (/resources/data/alt.tsv) -- for details on how this file was produced, please check the README in the resources/data folder.

References

[1] Forkel, R. et al. (2018). Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific data, 5(1), 1-10.

[2] List, J.-M. and van Dam, K. P. (2024): EDICTOR 3. A web-based tool for Computer-Assisted Language Comparison [Software Tool, Version 3.0]. MCL Chair at the University of Passau: Passau. URL: https://edictor.org.

[3] List, J.-M. (2012). SCA: Phonetic alignment based on sound classes. In Marija Slavkovik and Dan Lassiter (eds.), New Directions in Logic, Language, and Computation, 32–51. Berlin and Heidelberg: Springer. https://doi.org/10.1007/978-3-642-31467-4_3.

[4] List, J.-M. and Forkel, R. (2021). LingPy: A Python library for historical linguistics, version 2.6.9. Leipzig: Max Planck Institute for Evolutionary Anthropology. URL: https://lingpy.org, DOI: https://zenodo.org/badge/latestdoi/5137/lingpy/lingpy.

[5] Giacomelli, G. et al. (eds.) (2000). Atlante lessicale toscano. Rome: Lexis Progetti Editoriali.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors