This document provides a comprehensive overview of the work done to build and refine a linguistic analysis pipeline for financial documents.
The project started with a set of Python scripts designed to perform psycholinguistic feature engineering on corporate communications. The initial structure was functional but fragmented, with separate mock data files and a monolithic feature engineering script that mixed different analysis types (core linguistic features, catalyst event scores).
The primary goal was to integrate a new data source: the "Management’s Discussion and Analysis" (MD&A) section from SEC filings, and refactor the project into a clean, scalable pipeline.
After an initial exploration, we decided on a complete refactoring to streamline the workflow. The approved plan was as follows:
Plan: Consolidate Data and Refactor Pipeline
This plan streamlines the entire project by adopting a new, unified data structure. It involves cleaning up obsolete files, creating a single source of mock data, and refactoring the feature engineering and backtesting scripts to work with the new format.
- Clean Up Project Structure:
  - Delete obsolete files and directories (output/, old mock data, old feature scripts, run_pipeline.py, scraper/).
- Create New Consolidated Mock Data:
  - Create a single new mock data file: data/mock_sec_data.json.
  - Populate this file with data following a specific structure, with distinct entries for "MD&A" and "Risk_Factors" sections.
- Create a New Unified Feature Engineering Script:
  - Create a new, single analysis/feature_engineering.py script.
  - This script contains one primary function, calculate_features_by_section, to apply logic based on the "section" field in the data.
- Simplify and Update the Backtesting Script:
  - Update analysis/backtesting.py to directly load the consolidated JSON, call the new feature engineering function, and run the correlation analysis.
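As a rough illustration of the consolidated data and the section-based dispatch described in the plan, the sketch below uses a hypothetical record layout. Only the "section" field and its two values come from this document; the ticker, date, and text fields, the two placeholder feature functions, and the body of calculate_features_by_section are assumptions made for the example and are not the project's actual implementation.

```python
import json
from collections import defaultdict

# Hypothetical mock records. Only "section" and its two values are taken
# from the plan; "ticker", "date", and "text" are assumed field names.
MOCK_JSON = """
[
  {"ticker": "AAA", "date": "2023-02-01", "section": "MD&A",
   "text": "We expect revenue growth. We believe margins will improve."},
  {"ticker": "AAA", "date": "2023-02-01", "section": "Risk_Factors",
   "text": "Demand may decline. Litigation could harm results."}
]
"""

def mdna_features(text):
    """Placeholder MD&A features (illustrative only)."""
    return {"word_count": len(text.split())}

def risk_features(text):
    """Placeholder Risk Factors features (illustrative only)."""
    words = [w.strip(".,;") for w in text.lower().split()]
    hedges = {"may", "could", "might"}
    return {"word_count": len(words),
            "hedge_count": sum(w in hedges for w in words)}

# Map each section label to the function that handles it.
SECTION_HANDLERS = {"MD&A": mdna_features, "Risk_Factors": risk_features}

def calculate_features_by_section(records):
    """Apply section-specific logic, then collect one wide row per filing."""
    wide = defaultdict(dict)
    for rec in records:
        handler = SECTION_HANDLERS.get(rec["section"])
        if handler is None:
            continue  # unknown sections are skipped in this sketch
        key = (rec["ticker"], rec["date"])
        for name, value in handler(rec["text"]).items():
            # Prefix each feature with its section to keep columns distinct.
            wide[key][f'{rec["section"]}_{name}'] = value
    return [{"ticker": t, "date": d, **feats}
            for (t, d), feats in sorted(wide.items())]

rows = calculate_features_by_section(json.loads(MOCK_JSON))
```

Each element of rows is a plain dictionary keyed by ticker and date, so the list could be handed to a DataFrame constructor if a tabular result is wanted. Adding another section would mean adding one handler function and one entry in SECTION_HANDLERS.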
The plan was implemented successfully. The project is now organized as a three-stage pipeline:
- Data Source: A single source of truth for mock data exists at data/mock_sec_data.json. All new data, including MD&A and Risk Factors, should be added here in the specified JSON format.
- Feature Engineering: The analysis/feature_engineering.py script contains a single, unified function that:
  - Reads the consolidated data.
  - Applies the correct feature logic based on the section field ("MD&A" or "Risk_Factors").
  - Pivots the data to create a clean, wide-format DataFrame where each row represents a single filing (ticker + date) and the columns represent the linguistic features from each section.
- Backtesting: The analysis/backtesting.py script:
  - Loads the JSON data and calls the feature engineering function.
  - Fetches historical stock price data from Yahoo Finance.
  - Calculates the future stock price volatility for the 90-day period following each filing.
  - Runs a correlation analysis to measure the relationship between the linguistic features and the future volatility.
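The volatility and correlation steps of the backtest can be sketched as follows. This is a minimal example under stated assumptions, not the code in analysis/backtesting.py: it uses synthetic prices instead of downloaded history, treats the 90-day period as 90 consecutive price steps, defines volatility as the annualized standard deviation of daily log returns, and uses Pearson correlation. The function names forward_volatility and pearson are hypothetical.

```python
import math
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumption, used only for annualizing

def forward_volatility(prices, filing_index, window=90):
    """Annualized standard deviation of daily log returns over the
    `window` price steps that follow `filing_index`."""
    segment = prices[filing_index:filing_index + window + 1]
    if len(segment) < 3:
        return None  # too few prices for a sample standard deviation
    returns = [math.log(nxt / cur) for cur, nxt in zip(segment, segment[1:])]
    return statistics.stdev(returns) * math.sqrt(TRADING_DAYS_PER_YEAR)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic closing prices standing in for real price history.
calm = [100 * 1.001 ** i for i in range(120)]                    # steady drift
choppy = [100 * (1.03 if i % 2 else 0.97) for i in range(120)]   # alternating

vol_calm = forward_volatility(calm, filing_index=0)
vol_choppy = forward_volatility(choppy, filing_index=0)

# Made-up feature and volatility values, purely to exercise pearson().
example_feature = [1, 2, 3, 4]
example_volatility = [0.10, 0.15, 0.22, 0.30]
r = pearson(example_feature, example_volatility)
```

In a real run, one volatility value would be computed per filing and paired with that filing's feature row before the correlation is taken. Returning None for short price windows keeps filings that are too recent out of the analysis rather than producing a misleading number.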
This refactored structure is minimal, clean, and easily extensible for future work, such as integrating audio transcript data.