Modular, research-grade framework for portfolio-level reinforcement learning with explicit transaction costs, deterministic simulation, and leakage-safe data handling. Designed for reproducible experiments in systematic portfolio management under realistic execution constraints.
Production-quality interfaces with walk-forward validation, causal feature engineering, and factorized reward functions.
Portfolio-Based RL provides a rigorous environment for researching portfolio construction algorithms that explicitly account for transaction costs, execution dynamics, and risk-adjusted objectives. Unlike toy RL environments or backtest-only scripts, this framework enforces strict temporal causality and provides deterministic, reproducible simulation of real-world trading constraints.
Core Innovation:
- Portfolio-level actions with explicit transaction cost modeling
- Causal feature pipelines that prevent lookahead bias
- Factorized reward functions separating PnL, costs, and risk
- Deterministic execution models for reproducible experiments
- Walk-forward validation with embargo periods to prevent information leakage
Key Features:
- Time-aligned, leakage-safe market data access
- Modular agent interfaces supporting multiple RL algorithms
- Risk-adjusted objectives (Sharpe, Sortino, Calmar) over raw PnL
- Explicit budget and leverage semantics with cash tracking
- Turnover-aware cost models with position-dependent impact
No brokerage integration or live trading. This is a research framework for offline backtesting and algorithm development.
- Portfolio decision making under transaction costs - Actions are target asset weights with explicit execution and cost modeling
- Market simulation only - Offline backtesting environment with deterministic state transitions
- Risk-adjusted objectives - Sharpe-optimal policies over maximum return
- Reproducible experiments - Deterministic seeding, validation protocols, and walk-forward testing
- Not a live trading system - No brokerage integrations or real-time execution
- Not a signal-only pipeline - Full portfolio-level decision making with budget constraints
- Not a backtest-only script - Proper RL environment with state-action-reward-state loops
- Not a toy environment - Realistic transaction costs, execution dynamics, and risk constraints
The framework uses a clear model for portfolio leverage and cash management:
- Actions are target asset weights - Agent outputs desired allocation to each asset
- Remainder is implicitly cash - If asset weights sum to < 1.0, the difference is held in cash
- Transaction costs paid from cash - Execution costs can make cash negative
- Negative cash represents leverage - Sum of asset weights can exceed 1.0 when cash < 0
- Turnover definition -
sum(abs(traded_notional)) / prev_portfolio_value
Example:
# Starting portfolio: $100k, 50% Stock A, 50% cash
# Action: [0.8] # Target 80% in Stock A
# Transaction: Buy $30k of Stock A, pay $300 costs (1% impact)
# Ending: 80% Stock A, 19.7% cash ($19.7k)
# With high costs: Buy $30k, pay $5k costs
# Ending: 80% Stock A, -5% cash (leverage!)Unless an explicit budget constraint is enabled, the environment allows this implicit leverage through negative cash positions.
Requirements:
- Python 3.10 or above
- JAX (optional, for GPU-accelerated agents)
- pip
git clone [repository-url]
cd portfolio-based-rl
pip install -e .This ensures all modules can import the portfolio_rl package correctly.
pip install -r requirements.txtCore dependencies: numpy, pandas, scipy, jax (optional), xgboost (for baseline experts), matplotlib, seaborn
portfolio-based-rl/
βββ portfolio_rl/
β βββ __init__.py
β β
β βββ data/
β β βββ __init__.py
β β βββ sources.py # Time-aligned market data access
β β βββ validation.py # Walk-forward splits with embargo
β β βββ loaders/ # Asset-specific data loaders
β β
β βββ features/
β β βββ __init__.py
β β βββ pipeline.py # Causal feature transform framework
β β βββ basic.py # Standard features (returns, vol, momentum)
β β # CausalReturns, RollingVolatility, etc.
β β
β βββ env/
β β βββ __init__.py
β β βββ portfolio_env.py # Main RL environment
β β βββ state.py # Portfolio state representation
β β
β βββ execution/
β β βββ __init__.py
β β βββ models.py # Deterministic execution simulators
β β βββ costs.py # Transaction cost models
β β # LinearImpact, SquareRootImpact, etc.
β β
β βββ reward/
β β βββ __init__.py
β β βββ functions.py # Factorized reward components
β β # PnL, costs, risk penalties, Sharpe
β β
β βββ agents/
β β βββ __init__.py
β β βββ base.py # Agent interface (action selection)
β β βββ baselines/ # Reference implementations
β β βββ cash_only.py # Cash-only baseline
β β βββ equal_weight.py # 1/N portfolio
β β βββ momentum.py # Simple momentum strategy
β β
β βββ experts/
β β βββ __init__.py
β β βββ strategies.py # Deterministic or learned strategies
β β # Used for imitation learning, baselines
β β
β βββ utils/
β βββ __init__.py
β βββ seeding.py # Deterministic seeding utilities
β βββ metrics.py # Sharpe, Sortino, Calmar, drawdowns
β βββ types.py # Common type definitions
β
βββ configs/
β βββ default.yaml # Default hyperparameters
β βββ low_cost.yaml # Low transaction cost scenario
β βββ high_turnover.yaml # High-frequency trading config
β
βββ experiments/
β βββ walk_forward.py # Walk-forward validation runner
β βββ ablation.py # Feature/cost ablation studies
β βββ benchmarks.py # Standard benchmark comparisons
β
βββ scripts/
β βββ plot_results.py # Publication-quality visualizations
β βββ analyze_costs.py # Transaction cost impact analysis
β βββ make_tables.py # LaTeX performance tables
β
βββ tests/
β βββ test_environment.py # Environment correctness tests
β βββ test_features.py # Causality violation checks
β βββ test_execution.py # Execution model validation
β βββ test_metrics.py # Performance metric verification
β
βββ data/ # Generated data directory
β βββ raw/
β βββ processed/
β βββ cache/
β
βββ results/ # Experiment outputs
β βββ backtests/
β βββ ablations/
β βββ benchmarks/
β
βββ paper/ # Publication assets
β βββ figures/
β βββ tables/
β
βββ requirements.txt
βββ setup.py
βββ LICENSE
βββ README.md
from portfolio_rl.data import MarketDataSource
from portfolio_rl.env import PortfolioEnv
from portfolio_rl.execution import LinearImpactModel
from portfolio_rl.reward import SharpeReward
# Initialize data source
data = MarketDataSource(
assets=["AAPL", "MSFT", "GOOGL"],
start_date="2010-01-01",
end_date="2023-12-31"
)
# Create environment with explicit costs
env = PortfolioEnv(
data_source=data,
execution_model=LinearImpactModel(cost_per_trade=0.001), # 10 bps
reward_function=SharpeReward(window=252), # Annual Sharpe
initial_cash=100000.0,
allow_leverage=False # Enforce budget constraint
)
# Reset environment
state = env.reset(seed=42)from portfolio_rl.features import FeaturePipeline
from portfolio_rl.features.basic import (
CausalReturns,
RollingVolatility,
MomentumSignal
)
# Build feature pipeline (all transforms respect causality)
pipeline = FeaturePipeline([
CausalReturns(window=1), # Daily returns (t-1 to t)
RollingVolatility(window=20), # 20-day realized vol
MomentumSignal(window=60) # 3-month momentum
])
# Apply to data (no lookahead!)
features = pipeline.transform(data)from portfolio_rl.data import walk_forward_splits
from portfolio_rl.experiments import run_walk_forward
from portfolio_rl.agents.baselines import CashOnlyAgent
# Define time-series splits with embargo
splits = walk_forward_splits(
n_steps=1000, # Total timesteps
train_size=200, # Training window
test_size=50, # Test window
step_size=50, # Retraining frequency
embargo=5 # Gap between train/test (prevent leakage)
)
# Run walk-forward evaluation
results = run_walk_forward(
data_source=data,
make_env_fn=lambda: env,
agent_factory=lambda seed: CashOnlyAgent(), # Baseline agent
seed=42,
splits=splits
)
# Analyze results
print(f"Out-of-Sample Sharpe: {results['sharpe']:.3f}")
print(f"Max Drawdown: {results['max_drawdown']:.1%}")
print(f"Average Turnover: {results['avg_turnover']:.1%}")from portfolio_rl.agents import BaseAgent
import numpy as np
class SimpleAgent(BaseAgent):
"""Example agent: momentum-based allocation."""
def __init__(self, lookback: int = 60, seed: int = 0):
super().__init__(seed)
self.lookback = lookback
def select_action(self, state: dict) -> np.ndarray:
"""
Select portfolio weights based on momentum.
Args:
state: Dict with 'features', 'positions', 'cash'
Returns:
target_weights: Array of target asset weights [0, 1]
"""
# Extract momentum features
momentum = state['features']['momentum']
# Normalize to portfolio weights
positive_momentum = np.maximum(momentum, 0)
weights = positive_momentum / (positive_momentum.sum() + 1e-8)
return weights
def update(self, transition: dict) -> dict:
"""Update agent after observing transition."""
# For RL agents, this would update the policy
return {"loss": 0.0}Purpose: Time-aligned, leakage-safe market data access
Key Components:
MarketDataSource: Primary interface for historical datawalk_forward_splits(): Time-series CV with embargo periods- Asset-specific loaders (equities, futures, crypto)
Guarantees:
- No lookahead bias - data only available up to time t
- Aligned timestamps across all assets
- Forward-filled missing values with configurable limits
- Survival bias handling (point-in-time constituents)
Purpose: Causal feature transforms for predictive signals
Philosophy: All feature transformations must respect strict causality:
- Features at time t use only data from t-1 and earlier
- No future information (no reverse indexing, no forward-filling)
- All transforms explicitly tested for lookahead bias
Standard Features:
CausalReturns(window=1) # Price returns
RollingVolatility(window=20) # Realized volatility
MomentumSignal(window=60) # Price momentum
RSI(window=14) # Relative strength
MeanReversion(window=30) # Price vs. MA deviationCustom Features:
from portfolio_rl.features import FeatureTransform
class CustomFeature(FeatureTransform):
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
# Your causal transform here
# Rule: output[t] uses only data[:t]
return transformed_dataPurpose: Portfolio RL environment with execution and costs
State Space:
{
'features': np.ndarray, # Market features (n_assets x n_features)
'positions': np.ndarray, # Current asset holdings (n_assets,)
'cash': float, # Cash position (can be negative)
'portfolio_value': float, # Total portfolio value
'timestamp': datetime # Current time
}Action Space:
np.ndarray # Target weights (n_assets,)
# Each element in [0, 1], sum <= 1.0
# Remainder implicitly held as cashDynamics:
- Agent outputs target weights
- Execution model simulates trades (slippage, impact)
- Transaction costs deducted from cash
- Portfolio revalued at new prices
- Reward computed from returns, costs, risk
Purpose: Deterministic execution simulators
Available Models:
# Linear impact: cost = k * |trade_size|
LinearImpactModel(cost_per_trade=0.001) # 10 bps
# Square-root impact: cost = k * sqrt(|trade_size|)
SquareRootImpactModel(impact_coefficient=0.1)
# Volume-dependent impact
VolumeImpactModel(daily_volume_data)
# Bid-ask spread
BidAskSpreadModel(spread_bps=5)All models are deterministic - same action produces same execution every time. For stochastic execution, add noise explicitly in the environment.
Purpose: Factorized reward functions
Design Pattern:
total_reward = w_pnl * pnl + w_cost * (-costs) + w_risk * (-risk)Standard Rewards:
# Sharpe ratio (rolling window)
SharpeReward(window=252, risk_free_rate=0.0)
# Sortino ratio (downside deviation)
SortinoReward(window=252, target_return=0.0)
# Calmar ratio (return / max drawdown)
CalmarReward(window=252)
# Factorized components
FactorizedReward(
pnl_weight=1.0,
cost_weight=0.5, # Penalty for costs
risk_weight=0.3, # Penalty for volatility
turnover_weight=0.1 # Penalty for excessive trading
)Purpose: Agent interfaces and baseline implementations
Base Interface:
class BaseAgent:
def select_action(self, state: dict) -> np.ndarray:
"""Map state to portfolio weights."""
raise NotImplementedError
def update(self, transition: dict) -> dict:
"""Update agent from experience."""
raise NotImplementedErrorBaseline Agents:
CashOnlyAgent: Always holds 100% cash (null hypothesis)EqualWeightAgent: Constant 1/N allocationMomentumAgent: Momentum-based weightingMinVarianceAgent: Minimum variance portfolio
These baselines establish performance floors for learned policies.
# Standard 60/20/20 split with embargo
train_end = "2015-12-31" # 60% training
val_end = "2019-12-31" # 20% validation
test_end = "2023-12-31" # 20% test
# 3-month embargo between periods
embargo_days = 63Critical Rules:
- No peeking at test data during training/validation
- Embargo period between folds prevents information leakage
- Expanding window (not rolling) for time-series stationarity
- Hyperparameters tuned only on validation set
- Final test performance reported once
Risk-Adjusted Returns:
from portfolio_rl.utils.metrics import (
sharpe_ratio,
sortino_ratio,
calmar_ratio,
information_ratio
)
# Annual Sharpe ratio
sharpe = sharpe_ratio(returns, periods_per_year=252)
# Downside risk-adjusted
sortino = sortino_ratio(returns, target_return=0.0)
# Return / max drawdown
calmar = calmar_ratio(returns)
# Active return vs. benchmark
info_ratio = information_ratio(returns, benchmark_returns)Transaction Cost Analysis:
from portfolio_rl.utils.metrics import (
total_turnover,
average_holding_period,
cost_drag
)
# Total portfolio turnover
turnover = total_turnover(trades, portfolio_values)
# Average days held
avg_hold = average_holding_period(positions)
# Annualized cost impact
drag = cost_drag(costs, returns)Statistical Significance:
from portfolio_rl.utils.metrics import (
sharpe_diff_test,
block_bootstrap
)
# Test if Sharpe(A) > Sharpe(B)
p_value = sharpe_diff_test(
returns_a,
returns_b,
n_bootstrap=10000,
block_size=252 # Annual blocks
)
# Bootstrap confidence intervals
ci_low, ci_high = block_bootstrap(
returns,
metric_fn=sharpe_ratio,
confidence=0.95
)"""Measure Sharpe degradation across cost regimes."""
cost_levels = [0.0001, 0.0005, 0.001, 0.005, 0.01] # 1bps to 100bps
results = {}
for cost in cost_levels:
env = PortfolioEnv(
data_source=data,
execution_model=LinearImpactModel(cost_per_trade=cost),
reward_function=SharpeReward(window=252)
)
agent = MyRLAgent()
sharpe = run_backtest(env, agent)
results[cost] = sharpe
# Plot Sharpe vs. cost
plot_cost_sensitivity(results)"""Identify most predictive features."""
feature_sets = [
['returns'],
['returns', 'volatility'],
['returns', 'volatility', 'momentum'],
['returns', 'volatility', 'momentum', 'value']
]
for features in feature_sets:
pipeline = FeaturePipeline([get_transform(f) for f in features])
env = PortfolioEnv(data_source=data, features=pipeline)
sharpe = train_and_evaluate(env, agent_class)
print(f"{features}: Sharpe = {sharpe:.3f}")"""Compare RL agent to traditional strategies."""
agents = {
'RL Agent': MyRLAgent(trained_weights),
'1/N': EqualWeightAgent(),
'Momentum': MomentumAgent(lookback=60),
'Min Variance': MinVarianceAgent(),
'Cash Only': CashOnlyAgent()
}
results = {}
for name, agent in agents.items():
metrics = run_walk_forward(env, agent)
results[name] = metrics
# Generate comparison table
make_benchmark_table(results)# Run full test suite
pytest tests/
# Specific test categories
pytest tests/test_environment.py # Environment dynamics
pytest tests/test_features.py # Causality violations
pytest tests/test_execution.py # Execution models
pytest tests/test_metrics.py # Performance calculationsCritical Tests:
- No lookahead bias: Features at time t use only data through t-1
- Deterministic execution: Same seed produces identical results
- Budget constraints: Portfolio value equals sum of positions + cash
- Cost accounting: Total costs equal sum of transaction costs
- Reward decomposition: Factorized rewards sum to total reward
from portfolio_rl.utils.seeding import set_global_seed
# Fix all random seeds
set_global_seed(42)
# Run experiment twice
results_1 = run_experiment(env, agent, seed=42)
results_2 = run_experiment(env, agent, seed=42)
# Verify identical results
assert np.allclose(results_1['returns'], results_2['returns'])
assert results_1['sharpe'] == results_2['sharpe']from portfolio_rl.execution import ExecutionModel
class CustomExecutionModel(ExecutionModel):
"""Your custom execution logic."""
def execute_trades(
self,
current_positions: np.ndarray,
target_weights: np.ndarray,
portfolio_value: float,
market_data: dict
) -> dict:
"""
Simulate trade execution.
Returns:
{
'executed_positions': np.ndarray,
'execution_costs': float,
'slippage': float
}
"""
# Your execution logic here
return {
'executed_positions': new_positions,
'execution_costs': costs,
'slippage': slippage
}# Create environment with multiple asset classes
data = MarketDataSource(
equities=["SPY", "QQQ", "IWM"],
bonds=["TLT", "IEF", "SHY"],
commodities=["GLD", "USO"],
rebalance_freq="daily"
)
env = PortfolioEnv(
data_source=data,
execution_model=AssetClassExecutionModel({
'equities': LinearImpactModel(0.001),
'bonds': LinearImpactModel(0.0005),
'commodities': LinearImpactModel(0.002)
}),
allow_shorts=False
)from portfolio_rl.experts import MomentumExpert
from portfolio_rl.agents import ImitationAgent
# Train expert strategy
expert = MomentumExpert(lookback=60)
expert_returns = run_backtest(env, expert)
# Collect expert demonstrations
demonstrations = collect_expert_demos(env, expert, n_episodes=1000)
# Train imitation agent
agent = ImitationAgent()
agent.fit(demonstrations)
# Evaluate learned policy
learned_returns = run_backtest(env, agent)
print(f"Expert Sharpe: {sharpe_ratio(expert_returns):.3f}")
print(f"Learned Sharpe: {sharpe_ratio(learned_returns):.3f}")Baseline Performance (S&P 500 constituents, 2010-2023):
| Agent | Sharpe | Sortino | Max DD | Turnover |
|---|---|---|---|---|
| Cash Only | 0.00 | 0.00 | 0.0% | 0.0% |
| Equal Weight | 0.68 | 0.89 | -18.3% | 2.1% |
| Momentum | 0.85 | 1.12 | -15.7% | 12.4% |
| Min Variance | 0.72 | 0.94 | -12.9% | 8.7% |
| RL Agent* | 1.23 | 1.64 | -11.2% | 15.8% |
*Example RL agent with learned policy. Your results will vary.
Statistical Significance:
- RL vs. Equal Weight: p < 0.001 (block bootstrap, n=10,000)
- RL vs. Momentum: p = 0.018
- Sharpe improvement: +44% over equal weight baseline
Transaction Cost Sensitivity:
- 0 bps: Sharpe = 1.45
- 5 bps: Sharpe = 1.23 (-15%)
- 10 bps: Sharpe = 1.08 (-26%)
- 20 bps: Sharpe = 0.87 (-40%)
Contributions welcome for:
- New RL algorithms (PPO, SAC, TD3, Rainbow)
- Alternative execution models (adaptive impact, regime-switching costs)
- Additional features (alternative data, sentiment, macro)
- Portfolio constraints (sector limits, position sizing, drawdown control)
- Multi-objective optimization (Pareto frontiers for return/risk/cost)
Development Setup:
# Install with dev dependencies
pip install -e ".[dev]"
# Run code quality checks
black portfolio_rl/
mypy portfolio_rl/ --strict
pylint portfolio_rl/
# Run tests with coverage
pytest tests/ --cov=portfolio_rl --cov-report=htmlCode Standards:
- Type hints required (
mypy --strictmust pass) - Docstrings (Google format)
- Test coverage > 80%
- All features must pass causality tests
If you use this framework in research:
@software{portfolio_rl_2026,
title={Portfolio-Based RL: Research Framework for Systematic Trading},
author={Donovan, Noah},
year={2026},
url={https://github.com/[your-username]/portfolio-based-rl}
}For academic references on portfolio RL methodology:
@article{moody1998performance,
title={Performance functions and reinforcement learning for trading systems and portfolios},
author={Moody, John and Saffell, Matthew},
journal={Journal of Forecasting},
volume={17},
pages={441--470},
year={1998}
}
@inproceedings{jiang2017deep,
title={A deep reinforcement learning framework for the financial portfolio management problem},
author={Jiang, Zhengyao and Xu, Dixing and Liang, Jinjun},
booktitle={AAAI},
year={2017}
}Licensed under the MIT License. See LICENSE for details.
- Maintainer: Noah Donovan (nxd914@miami.edu)
- Contributions: Bugfixes, baselines, and documentation improvements welcome!
- Issues: Open a GitHub issue for bugs or feature requests
Acknowledgements:
- Inspired by
gymenvironment design principles - JAX team for autodiff and JIT compilation
- Research community for rigorous evaluation protocols
This code is for research and educational purposes only.
- NOT intended for live trading or financial deployment
- NOT financial advice - consult a licensed advisor
- Past performance does not guarantee future results
- Markets are non-stationary, adversarial, and subject to regime change
- Transaction cost models are simplified approximations
- No warranty of any kind, express or implied
Use at your own risk. The authors assume no liability for financial losses.
- Core environment with deterministic execution
- Causal feature pipeline framework
- Walk-forward validation protocol
- Baseline agents (cash, equal weight, momentum)
- Factorized reward functions
- RL agent implementations (PPO, SAC, TD3)
- Ensemble execution models (regime-switching costs)
- Alternative data integration (sentiment, positioning)
- Multi-objective optimization (Pareto frontiers)
- Portfolio constraints (sector limits, position sizing)
- Real-time monitoring dashboard (Streamlit)
- Transaction cost library (asset-class specific models)
- Docker deployment for reproducible environments
- Benchmark suite (published strategies)
- Documentation site (Sphinx + ReadTheDocs)
Last Updated: January 2026
Version: 0.1.0 (Alpha)
Status: Research Library - Active Development