This repository contains all code and materials used for the empirical component of my master’s thesis, as well as the LaTeX files used to compile the final submission PDF.
The general workflow follows these steps:
raw data → data cleaning → feature generation → feature analysis → result visualization
For any questions regarding the code, feel free to contact me via email at: A.Schied@campus.lmu.de
To replicate the sentiment indexing, place the shared data in the data folder and run plots_were_made_here.R.
To replicate the empirical analysis, please set up a Python environment with the required dependencies.
You can do this using Conda as follows:
conda env create --name "schied_replication" --file "environment.yml"
conda activate schied_replication
python analysis_stepname.py

For steps involving an LLM, Ollama needs to be available locally.
The pipeline processes input data and manages the execution of analysis steps.
- Input: Separate CSV files located in the input directory.
- The pipeline automatically scans the input folder, loads all CSVs into memory using Polars (a faster alternative to Pandas), and tracks processed files by checking the output directory.
- Output: Results are saved in the format
inputfile_intermediatestep_finalstep.parquet
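The scan-and-skip behaviour described above can be sketched as follows (function and suffix names are hypothetical; the actual logic lives in pipeline.py):

```python
from pathlib import Path

def pending_inputs(input_dir: Path, output_dir: Path, step_suffix: str) -> list[Path]:
    """Return input CSVs that do not yet have a matching
    <inputfile>_<step_suffix>.parquet in the output directory.

    Illustrative sketch only; names do not come from the repository.
    """
    # Stems of all result files already written, e.g. "myfile_clean_final"
    done = {p.stem for p in output_dir.glob("*.parquet")}
    return [
        csv for csv in sorted(input_dir.glob("*.csv"))
        if f"{csv.stem}_{step_suffix}" not in done
    ]
```

Re-running the pipeline after an interruption then only touches the files returned by this check.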
Each pipeline run requires:
- A worker class (defining the analysis logic)
- A configuration dataclass (defining directories and parameters)
All analysis steps are implemented as classes in workers.py.
Each worker class must contain a .run() method, which serves as the main entry point for the pipeline.
Requirements for worker classes:
- .run() must accept only a Polars DataFrame (and self) as input.
- Additional helper methods can be defined within the same class and accessed via self.
- The .run() method name must remain unchanged for the pipeline manager to execute correctly.
Configuration classes are dataclasses that store all hardcoded parameters required by both the pipeline and worker classes — such as:
- Input and output directory paths
- File naming conventions
- Analysis-specific constants or thresholds
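A sketch of such a configuration dataclass (all field names and values are illustrative, not the actual contents of config.py or config_bt.py):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class AnalysisConfig:
    """Hypothetical example; real configs live in config.py / config_bt.py."""
    input_dir: Path = Path("data/raw")          # where input CSVs are read
    output_dir: Path = Path("data/results")     # where parquet results go
    output_suffix: str = "clean_final"          # file naming convention
    sentiment_threshold: float = 0.5            # analysis-specific constant
```

Keeping all hardcoded parameters in one frozen dataclass means a worker only ever receives the config object, never loose constants.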
masterthesis/
│
├── scripts/
│ ├── notebooks/ ← jupyter notebooks
│ ├── R/ ← R scripts for fixed effects model
│ ├── streamlit/ ← contains the browser app used for data labeling
│ │ ├── app.py ← main structure of the app (cd into the streamlit folder first, then start with "streamlit run app.py")
│ │ └── pages/ ← contains the webpages of the app
│ │ └── ...
│ ├── bt_analysis_ri.py ← runs the analysis pipeline (text cleaning, keyword matching) on the raw input data (Bundestag speeches, SpeakGer)
│ ├── bt_analysis_if.py ← runs the analysis (LLM-, BERT-, and dictionary-based sentiment scoring, embeddings, named entity recognition) on the preprocessed data (relevant cleaned speeches)
│ │
│ ├── analysis_raw_inter.py ← runs the analysis pipeline (text cleaning, keyword matching) on the raw input data (Genios articles)
│ ├── analysis_inter_final.py ← runs the analysis (LLM-, BERT-, and dictionary-based sentiment scoring, embeddings, named entity recognition) on the preprocessed data (relevant cleaned articles)
│ ├── data/ ← public data
│ ├── .../ ← all raw inputs (Genios, Bundestag speeches)
│ │ └── results/ ← contains analysis-ready dataframes and analysis results
│ ├── src/ ← main python code for analysis
│ │ ├── config.py ← parameters for the classes in workers.py for the Genios WISO data (prompts, BERT models, ...)
│ │ ├── config_bt.py ← parameters for the classes in workers.py for the Bundestag speeches
│ │ ├── workers.py ← contains all NLP classes for feature generation (LLM scoring, Embeddings, ...)
│ │ ├── analysis.py ← contains all steps analyzing the created features (Index aggregation, Event Studies, Semantic Deduplication, ...)
│ │ ├── plots.py ← contains classes to create the final visualizations of the findings
│ │ └── pipeline.py ← pipeline reading/writing data from/to the worker classes (built to fit VM-specific requirements, such as raw data in CSV form)
│ └── setup/ ← shell scripts for model setup
│
├── tex/
│ ├── main.tex ← the structure of the main pdf
│ ├── beamer.tex ← the final presentation (unfinished)
│ ├── chapters/ ← individual chapters per section
│ ├── figures/ ← figures
│ ├── tables/ ← tables
│ └── static/ ← bibliography, pictures, preamble, beamer class file
│
├── environment.yml ← relevant packages for replication
│
└── README.md