Predicting the market capitalization growth of listed companies is a complex time-series regression task. This project aims to predict future company valuation targets using historical fundamental financial indicators, addressing challenges like missing data, outliers, and temporal dependencies.
- Modular Pipeline: Clean separation of concerns (Data -> Features -> Models) for scalability.
- Config-Driven: All hyperparameters and paths controlled via `config/config.yaml`.
- Deep Learning Architectures:
- MLP: Baseline feed-forward network with embeddings.
- LSTM: Captures temporal trends in financial history.
- Encoder-Decoder: Advanced sequence-to-vector modeling.
- Robust Preprocessing:
- KNN Imputation for missing values.
- RobustScaler for handling financial outliers.
- PCA for dimensionality reduction.
- Expanding Window Validation: Realistic backtesting that retrains models on all available history for each test year.
- Automated Logging: Timestamped, model-specific logs saved to `logs/` for full auditability.
- Results Tracking: Automated storage of validation metrics (RMSE) to `metrics.json`.
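As a minimal sketch of the robust preprocessing steps listed above, using scikit-learn (the toy data and parameter values here are illustrative, not the project's configured settings):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# Toy fundamentals matrix: 4 company-year rows, 2 indicators, one missing value
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 180.0],
              [100.0, 210.0]])  # 100.0 plays the role of a financial outlier

X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)   # fill NaNs from nearest rows
X_scaled = RobustScaler().fit_transform(X_imputed)       # median/IQR scaling resists outliers
X_reduced = PCA(n_components=1).fit_transform(X_scaled)  # project onto the top component
```

RobustScaler centers on the median and scales by the interquartile range, so a single extreme row distorts the scaling far less than with standard z-scoring.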
For a deeper dive into the project background and technical implementation, please refer to:
- Problem Statement PDF: Original challenge description and requirements.
- Presentation Deck (PPT): Project overview slideshow.
- Implementation Details: Comprehensive technical guide explaining the "How" and "Why" of every pipeline step (Data Cleaning, Engineering, PCA, Model Architectures).
FidelFolio_Project/
├── config/
│ └── config.yaml # Hyperparameters & settings
├── data/
│ └── FidelFolio_Dataset.csv # [REQUIRED] Place your dataset here
├── experiments/ # Original Jupyter Notebooks
├── src/ # Source code
│ ├── data/ # Loading & Preprocessing
│ ├── features/ # Feature Engineering & Sequences
│ ├── models/ # MLP, LSTM, Encoder-Decoder Architectures
│ └── utils/ # Utilities
├── main.py # CLI Entry Point
├── pyproject.toml # Build configuration
└── setup.py # Setup script
- Install Dependencies:
  pip install -e .
- Data Setup: Place your `FidelFolio_Dataset.csv` file into the `data/` directory.
Run the training pipeline using main.py:
# Run the pipeline (uses config/config.yaml by default)
python main.py
# Run with a specific configuration file (e.g. for testing)
python main.py --config config/test_config.yaml

To switch models (MLP / LSTM / Encoder-Decoder), edit `model_type` in `config/config.yaml`.
Modify config/config.yaml to adjust:
- preprocessing: Imputation neighbors, outlier capping thresholds, PCA parameters.
- models: Layer sizes, dropout rates, embedding dimensions.
- training: Epochs, batch size, learning rate.
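A hypothetical sketch of what `config/config.yaml` might look like — apart from `model_type`, the key names and values below are illustrative assumptions, so check the shipped file for the actual schema:

```yaml
# Illustrative config sketch — key names are assumptions, not the project's actual schema
model_type: lstm            # mlp | lstm | encoder_decoder
preprocessing:
  knn_neighbors: 5          # neighbors used for KNN imputation
  outlier_cap_quantile: 0.99
  pca:
    enabled: true
    n_components: 10
models:
  dropout: 0.2
  embedding_dim: 8
training:
  epochs: 50
  batch_size: 64
  learning_rate: 0.001
```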
A traditional feed-forward network that flattens time-series data into a single vector, combined with learned company embeddings.
graph TD
subgraph Inputs
A["Sequence Input (Time x Feats)"] --> B[Flatten]
C["Company ID"] --> D[Embedding]
D --> E[Flatten Embedding]
end
B --> F[Concatenate]
E --> F
subgraph MLP_Layers
F --> G["Dense Layer 1 (ReLU)"]
G --> H[Dropout]
H --> I["Dense Layer 2 (ReLU)"]
I --> J[Dropout]
end
J --> K["Output (Regression)"]
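Assuming a TensorFlow/Keras implementation (the layer sizes, dropout rate, and embedding dimension below are illustrative defaults, not the project's configured values), the MLP diagram above could be sketched as:

```python
from tensorflow.keras import Model, layers

def build_mlp(n_timesteps=5, n_feats=10, n_companies=100, emb_dim=8):
    """Flatten the (Time x Feats) window, concatenate a learned company
    embedding, and regress through two dense+dropout blocks."""
    seq_in = layers.Input(shape=(n_timesteps, n_feats), name="sequence")
    id_in = layers.Input(shape=(1,), name="company_id")
    x = layers.Flatten()(seq_in)                                   # Time x Feats -> vector
    e = layers.Flatten()(layers.Embedding(n_companies, emb_dim)(id_in))
    h = layers.Concatenate()([x, e])
    h = layers.Dropout(0.2)(layers.Dense(64, activation="relu")(h))
    h = layers.Dropout(0.2)(layers.Dense(32, activation="relu")(h))
    return Model([seq_in, id_in], layers.Dense(1)(h))              # regression head
```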
A Recurrent Neural Network (RNN) designed to capture temporal dependencies in financial data.
graph TD
subgraph Inputs
A["Sequence Input"] --> B[Masking]
C["Company ID"] --> D[Embedding]
D --> E[Flatten Embedding]
end
subgraph LSTM_Stack
B --> F["LSTM Layer 1 (return_seq=True)"]
F --> G[Dropout]
G --> H["LSTM Layer 2 (return_seq=False)"]
H --> I[Dropout]
end
I --> J[Concatenate]
E --> J
subgraph Prediction
J --> K["Dense Layer (ReLU)"]
K --> L[Dropout]
L --> M["Output"]
end
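A matching Keras sketch of the stacked-LSTM design above (hyperparameters are again illustrative, not the project's configured values):

```python
from tensorflow.keras import Model, layers

def build_lstm(n_timesteps=5, n_feats=10, n_companies=100, emb_dim=8):
    """Two stacked LSTMs over the masked sequence; the final hidden state is
    fused with the company embedding before the dense prediction head."""
    seq_in = layers.Input(shape=(n_timesteps, n_feats), name="sequence")
    id_in = layers.Input(shape=(1,), name="company_id")
    x = layers.Masking(mask_value=0.0)(seq_in)                     # skip padded timesteps
    x = layers.Dropout(0.2)(layers.LSTM(64, return_sequences=True)(x))
    x = layers.Dropout(0.2)(layers.LSTM(32)(x))                    # last hidden state only
    e = layers.Flatten()(layers.Embedding(n_companies, emb_dim)(id_in))
    h = layers.Concatenate()([x, e])
    h = layers.Dropout(0.2)(layers.Dense(32, activation="relu")(h))
    return Model([seq_in, id_in], layers.Dense(1)(h))
```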
Uses an LSTM as an encoder to compress the time-series context into a hidden state, which is then passed to a Dense decoder for prediction.
graph TD
subgraph Encoder
A["Sequence Input"] --> B[Masking]
B --> C["LSTM Encoder"]
C -- "Extract Context (State H)" --> D[Context Vector]
end
subgraph Context_Fusion
E["Company ID"] --> F[Embedding]
F --> G[Flatten]
D --> H[Concatenate]
G --> H
end
subgraph Decoder
H --> I["Dense Decoder (ReLU)"]
I --> J[Dropout]
J --> K["Output"]
end
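The encoder-decoder variant could be sketched in Keras as follows — the key difference from the stacked LSTM is that only the encoder's final hidden state (state H) is extracted as the context vector (sizes are illustrative assumptions):

```python
from tensorflow.keras import Model, layers

def build_encoder_decoder(n_timesteps=5, n_feats=10, n_companies=100, emb_dim=8):
    """LSTM encoder compresses the sequence into its hidden state; a dense
    decoder predicts from that context fused with the company embedding."""
    seq_in = layers.Input(shape=(n_timesteps, n_feats), name="sequence")
    id_in = layers.Input(shape=(1,), name="company_id")
    x = layers.Masking(mask_value=0.0)(seq_in)
    _, state_h, _ = layers.LSTM(64, return_state=True)(x)          # context vector = state H
    e = layers.Flatten()(layers.Embedding(n_companies, emb_dim)(id_in))
    h = layers.Concatenate()([state_h, e])
    h = layers.Dropout(0.2)(layers.Dense(32, activation="relu")(h))  # dense decoder
    return Model([seq_in, id_in], layers.Dense(1)(h))
```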
graph TD
A["Start: main.py"] --> B{"Load Config"}
B --> C["Load Config.yaml"]
C --> D["Load Data"]
D --> E["Data Cleaning"]
E --> F["Feature Engineering: YoY Diffs"]
F --> G["Preprocessing (Impute, Scale, Cap)"]
G --> H{"PCA Enabled?"}
H -- Yes --> I["Apply PCA"]
H -- No --> J["Skip PCA"]
I --> K["Encode Company IDs"]
J --> K
K --> L["Start Loop (Expanding Window)"]
L --> M["Generate Sequences"]
M --> N["Train Model (MLP/LSTM/Enc-Dec)"]
N --> O["Predict & Evaluate"]
O --> P{"Next Year?"}
P -- Yes --> L
P -- No --> Q["Finish"]
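The expanding-window loop at the heart of the flow above can be sketched in plain Python (the year values and the minimum-history threshold are illustrative):

```python
def expanding_window_splits(years, min_train_years=3):
    """Yield (train_years, test_year) pairs: each test year is predicted by a
    model retrained on all years that precede it, so the training window
    expands by one year per iteration."""
    for i in range(min_train_years, len(years)):
        yield years[:i], years[i]

for train_years, test_year in expanding_window_splits([2015, 2016, 2017, 2018, 2019]):
    print(f"train on {train_years}, test on {test_year}")
```

This mirrors realistic backtesting: no test year's data ever leaks into its own training set.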