Backtest hour min lookup error #975

Open
davidlatte wants to merge 3 commits into dev from backtest_hour_min_lookup_error

Collaborator

davidlatte commented Mar 13, 2026

This pull request introduces a configurable mechanism for controlling whether minute-level cached data can satisfy day-bar lookup requests in backtesting data sources, addressing inconsistencies across providers like Polygon and ThetaData. The changes add an allow_day_resampling flag to the PandasData class and its derivatives, update the timestep-matching logic, and provide comprehensive tests and documentation for this behavior.

Key changes:

Data source configuration and logic

  • Added an allow_day_resampling parameter (defaulting to True) to PandasData and its subclasses, allowing each data source to specify whether minute data may be resampled to fulfill day-bar requests. This is set to True for Polygon and base PandasData, and False for ThetaData to enforce provider-specific normalization rules. [1] [2] [3]
  • Updated the _accepts_timestep method in PandasData to use the new allow_day_resampling flag, with detailed comments explaining the rationale and differences between data sources. This ensures that day requests are only satisfied by minute data when appropriate. [1] [2]
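
The flag-gated matching described above can be sketched roughly as follows. This is a minimal, self-contained stand-in rather than lumibot's actual code: `TimestepMatcher`, `PolygonLike` and `ThetaLike` are hypothetical names, and the exact-match rule for non-day requests is an assumption made only to keep the sketch complete. Only the `allow_day_resampling` flag, its default of `True`, and the day/minute comparison come from this PR.

```python
class TimestepMatcher:
    """Hypothetical stand-in for the timestep check in PandasData."""

    def __init__(self, allow_day_resampling: bool = True):
        # Mirrors the PR's default: resampling is enabled unless turned off.
        self.allow_day_resampling = allow_day_resampling

    def accepts_timestep(self, requested_unit: str, data_ts: str) -> bool:
        """Return True if cached data at `data_ts` may serve `requested_unit`."""
        if requested_unit == "day":
            if not self.allow_day_resampling:
                # Strict mode: only native day data satisfies a day request.
                return data_ts == "day"
            # Permissive mode: minute data may be resampled up to day bars.
            return data_ts in {"day", "minute"}
        # Assumption for this sketch: other units need an exact match.
        return data_ts == requested_unit


class PolygonLike(TimestepMatcher):
    """Hypothetical source that permits minute -> day resampling."""

    def __init__(self):
        super().__init__(allow_day_resampling=True)


class ThetaLike(TimestepMatcher):
    """Hypothetical source that pins day requests to native day data."""

    def __init__(self):
        super().__init__(allow_day_resampling=False)


polygon, theta = PolygonLike(), ThetaLike()
polygon_hit = polygon.accepts_timestep("day", "minute")
theta_hit = theta.accepts_timestep("day", "minute")
print(polygon_hit, theta_hit)
```

Keeping the flag as an instance attribute means each backend states its policy once in its constructor, and the matching logic itself stays free of provider-specific branches.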

Testing and regression coverage

  • Added a new regression test class (TestGetHistoricalPricesMinuteToDayRegression) that exercises day-bar requests against minute-only data, especially for stocks versus crypto assets. The tests record the previously buggy behavior alongside the expected behavior, giving a baseline for future fixes.
  • Refactored and improved test utilities in test_pandas_data_find_asset_timestep_match.py to support the new configuration, ensuring that tests accurately reflect the new timestep-matching logic. [1] [2]
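
The "15m request, then 1d request" sequence these tests target can be illustrated with a small stand-in. Everything here is hypothetical (the `find_cached` helper, the `(symbol, timestep)` cache key, the `SPY` symbol); it is not the lookup that `_find_asset_in_data_store_cache` performs, only a sketch of why a minute-only cache misses a day request when resampling is not allowed.

```python
from typing import Dict, Optional, Tuple


def find_cached(
    cache: Dict[Tuple[str, str], str],
    symbol: str,
    requested_unit: str,
    allow_day_resampling: bool,
) -> Optional[str]:
    """Return the timestep of the cache entry that can serve the request."""
    # An exact timestep match always wins.
    if (symbol, requested_unit) in cache:
        return requested_unit
    # Otherwise a day request may fall back to minute data, if permitted.
    if (
        requested_unit == "day"
        and allow_day_resampling
        and (symbol, "minute") in cache
    ):
        return "minute"
    return None


def day_lookup_after_15m_request(allow_day_resampling: bool) -> Optional[str]:
    """Simulate a 15-minute request followed by a 1-day request."""
    cache: Dict[Tuple[str, str], str] = {}
    # Assumption: the 15-minute request warms the cache with minute bars only.
    cache[("SPY", "minute")] = "minute bars"
    return find_cached(cache, "SPY", "day", allow_day_resampling)


hit_when_allowed = day_lookup_after_15m_request(allow_day_resampling=True)
miss_when_strict = day_lookup_after_15m_request(allow_day_resampling=False)
print(hit_when_allowed, miss_when_strict)
```

In strict mode the miss is intentional: the caller is expected to fetch native day data rather than reuse the minute cache.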

These changes make the data source behavior more explicit and configurable, prevent silent bypassing of provider-specific normalization, and improve test coverage and documentation for this critical aspect of the backtesting engine.

Summary by CodeRabbit

  • New Features

    • Added a configurable option to control whether minute-level data may be resampled to satisfy day-bar requests (defaults to enabled).
    • Polygon backtesting now permits on-demand resampling from minute to day data.
    • ThetaData backtesting enforces exact-timestep day matching (resampling disabled).
  • Tests

    • Expanded tests covering minute-to-day resampling behavior across asset types and access patterns.
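
For readers unfamiliar with what "resampling minute data to day bars" involves, here is a standard-library-only sketch of the aggregation. It is not how `Data.get_bars()` is implemented: the tuple layout and the `resample_minute_to_day` helper are assumptions, bars are grouped by naive calendar date, and trading sessions and time zones are ignored. With pandas, the same aggregation would normally be expressed through `DataFrame.resample`.

```python
from datetime import datetime


def resample_minute_to_day(bars):
    """Aggregate (timestamp, open, high, low, close, volume) bars by date."""
    days = {}
    # Sort by timestamp so "open" is the first bar and "close" the last.
    for ts, o, h, l, c, v in sorted(bars, key=lambda bar: bar[0]):
        day = ts.date()
        if day not in days:
            days[day] = {"open": o, "high": h, "low": l, "close": c, "volume": v}
        else:
            d = days[day]
            d["high"] = max(d["high"], h)
            d["low"] = min(d["low"], l)
            d["close"] = c
            d["volume"] += v
    return days


# Made-up sample bars, purely for illustration.
minute_bars = [
    (datetime(2026, 3, 12, 9, 30), 100.0, 101.0, 99.5, 100.5, 1000),
    (datetime(2026, 3, 12, 9, 31), 100.5, 102.0, 100.0, 101.5, 1500),
    (datetime(2026, 3, 13, 9, 30), 101.0, 101.2, 100.1, 100.3, 800),
]
daily = resample_minute_to_day(minute_bars)
for day, bar in daily.items():
    print(day, bar)
```

A day bar built this way can differ from a provider's native day bar (for example after split adjustments), which is the reason the flag can be switched off for a given provider.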

…ice requests demonstrating a bug with querying for 15m then 1d prices.

coderabbitai bot commented Mar 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aad69a9d-aee7-415b-bf21-f99e9a6b124d

📥 Commits

Reviewing files that changed from the base of the PR and between 3300ea4 and 7ff70d8.

📒 Files selected for processing (2)
  • tests/test_pandas_data.py
  • tests/test_pandas_data_find_asset_timestep_match.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_pandas_data_find_asset_timestep_match.py
  • tests/test_pandas_data.py

Walkthrough

Adds an allow_day_resampling flag controlling whether minute-level data may satisfy day-bar requests: PolygonDataBacktesting sets it True, ThetaDataBacktestingPandas sets it False, and PandasData gains a True-by-default parameter plus conditional timestep-acceptance logic.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Backtesting Data Source Configuration: lumibot/backtesting/polygon_backtesting.py, lumibot/backtesting/thetadata_backtesting_pandas.py | Introduce allow_day_resampling instance attribute: set to True in Polygon backend and False in ThetaData backend to control day-resampling behavior. |
| Core Data Source Logic: lumibot/data_sources/pandas_data.py | Add allow_day_resampling: bool = True parameter to PandasData.__init__, store self.allow_day_resampling, and modify _accepts_timestep to conditionally allow minute data to satisfy day requests when the flag is True. |
| Test Coverage — Minute/Day Resampling: tests/test_pandas_data.py | Add TestGetHistoricalPricesMinuteToDayRegression with helpers and tests covering minute→day lookup behaviors, sequence interactions, and crypto vs stock cases. |
| Test Coverage — Timestep Matching Utilities: tests/test_pandas_data_find_asset_timestep_match.py | Refactor tests to use real PandasData constructor, add helpers and expanded cases validating allow_day_resampling behavior across minute/day native data and different flag settings. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A little flag hopped into the stack,
Minute bars may stretch or stay back.
Polygon nibble, Theta stands firm,
PandasData learns a flexible term.
Hooray for choices — a rabbit's small perk! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is vague and uses generic phrasing that doesn't clearly convey the main change—the addition of the configurable allow_day_resampling parameter. | Use a more descriptive title that captures the core change, such as 'Add allow_day_resampling flag to control minute-to-day data resampling' or similar. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.5)
tests/test_pandas_data_find_asset_timestep_match.py

************* Module pylintrc
pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: 'pylintrc', line: 1
'known-third-party=lumibot' (config-parse-error)
[
{
"type": "convention",
"module": "tests.test_pandas_data_find_asset_timestep_match",
"obj": "",
"line": 76,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "tests/test_pandas_data_find_asset_timestep_match.py",
"symbol": "line-too-long",
"message": "Line too long (102/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "tests.test_pandas_data_find_asset_timestep_match",
"obj": "",
"line": 169,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "tests/test_pandas_data_find_asset_timestep_match.py",
"symbol": "line-too-long",
"message": "Line too long (102/100)",
"message-id":

... [truncated 7859 characters] ...

ue",
"line": 421,
"column": 4,
"endLine": 421,
"endColumn": 33,
"path": "tests/test_pandas_data_find_asset_timestep_match.py",
"symbol": "import-outside-toplevel",
"message": "Import outside toplevel (datetime.timezone)",
"message-id": "C0415"
},
{
"type": "convention",
"module": "tests.test_pandas_data_find_asset_timestep_match",
"obj": "",
"line": 5,
"column": 0,
"endLine": 5,
"endColumn": 29,
"path": "tests/test_pandas_data_find_asset_timestep_match.py",
"symbol": "wrong-import-order",
"message": "standard import "datetime.datetime" should be placed before third party imports "pytz", "pandas"",
"message-id": "C0411"
}
]

tests/test_pandas_data.py

************* Module pylintrc
pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: 'pylintrc', line: 1
'known-third-party=lumibot' (config-parse-error)
[
{
"type": "convention",
"module": "tests.test_pandas_data",
"obj": "",
"line": 1,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "tests/test_pandas_data.py",
"symbol": "missing-module-docstring",
"message": "Missing module docstring",
"message-id": "C0114"
},
{
"type": "error",
"module": "tests.test_pandas_data",
"obj": "",
"line": 3,
"column": 0,
"endLine": 3,
"endColumn": 19,
"path": "tests/test_pandas_data.py",
"symbol": "import-error",
"message": "Unable to import 'pandas'",
"message-id": "E0401"
},
{
"type": "error",
"module": "tests.test_pandas_data",
"obj"

... [truncated 6261 characters] ...

 "obj": "TestGetHistoricalPricesMinuteToDayRegression.test_1day_request_after_15m_request_same_asset",
    "line": 243,
    "column": 31,
    "endLine": 243,
    "endColumn": 65,
    "path": "tests/test_pandas_data.py",
    "symbol": "protected-access",
    "message": "Access to a protected member _find_asset_in_data_store_cache of a client class",
    "message-id": "W0212"
},
{
    "type": "warning",
    "module": "tests.test_pandas_data",
    "obj": "",
    "line": 10,
    "column": 0,
    "endLine": 10,
    "endColumn": 46,
    "path": "tests/test_pandas_data.py",
    "symbol": "unused-import",
    "message": "Unused pandas_data_fixture imported from tests.fixtures",
    "message-id": "W0611"
}

]
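
The F0011 message in both runs uses the same wording as Python's `configparser.MissingSectionHeaderError`, which suggests the `pylintrc` begins with an option line (`known-third-party=lumibot`) before any `[SECTION]` header. The sketch below reproduces that failure with the standard library alone. Placing the option under `[IMPORTS]` is an assumption about where pylint expects it and should be checked against the pylint documentation for the version in use.

```python
import configparser


def parse_ini(text: str) -> configparser.ConfigParser:
    """Parse INI text the way a configparser-based tool would."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


# What the log implies the file looks like: an option with no section.
broken = "known-third-party=lumibot\n"
# Assumed fix: put the option under a section header.
fixed = "[IMPORTS]\nknown-third-party=lumibot\n"

try:
    parse_ini(broken)
    broken_error = None
except configparser.MissingSectionHeaderError as exc:
    broken_error = exc

value = parse_ini(fixed)["IMPORTS"]["known-third-party"]
print(type(broken_error).__name__, value)
```

Until the configuration parses, the findings listed above come from pylint's defaults rather than the project's intended settings, so they may not reflect the rules the project actually wants enforced.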



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lumibot/data_sources/pandas_data.py`:
- Around line 427-447: The day-resolution lookup logic currently treats allowed
resampling as only minute→day by checking data_ts in {"day", "minute"}, which
ignores warmed hourly caches; update the condition in the pandas_data resolution
branch (the block using self.allow_day_resampling, requested_unit and data_ts)
to also accept "hour" when requested_unit == "day" so hourly cached data can be
resampled to daily, and add a regression test alongside the new minute→day tests
that primes an "hour" cache then requests "day" to assert the hour data is
accepted/resampled.

In `@tests/test_pandas_data.py`:
- Around line 123-139: The test class is still using the old behavior and fails
because the __new__ fixtures don't set the new allow_day_resampling flag; update
the fixtures (the __new__ methods) that construct PandasData to set
allow_day_resampling=True (or the intended default) so the new day-lookup branch
can run without AttributeError, and then update the assertions that check
result_day (and any checks mentioning PandasData behavior) to expect minute→day
resampling to be allowed (i.e., change assertions that expect result_day is None
to expect a valid result or remove the obsolete class and fold its cases into
the new flag-driven tests in
tests/test_pandas_data_find_asset_timestep_match.py), applying the same changes
to the other affected blocks referenced (lines ~165-174, 193-261, 276-279).
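
The `AttributeError` mentioned for the `__new__`-based fixtures follows from how Python constructs objects: calling `Class.__new__(Class)` allocates an instance without running `__init__`, so any attribute that `__init__` would have set is missing. A minimal sketch, using a hypothetical `Source` class in place of `PandasData`:

```python
class Source:
    """Hypothetical stand-in for a data source with a new constructor flag."""

    def __init__(self, allow_day_resampling: bool = True):
        self.allow_day_resampling = allow_day_resampling

    def accepts_day_from_minute(self) -> bool:
        # Reads the flag, so it fails if __init__ never ran.
        return self.allow_day_resampling


# Fixture style that skips __init__: the instance has no flag attribute.
bare = Source.__new__(Source)
try:
    bare.accepts_day_from_minute()
    bare_failed = False
except AttributeError:
    bare_failed = True

# Option 1: set the attribute explicitly on the bare instance.
patched = Source.__new__(Source)
patched.allow_day_resampling = True

# Option 2: use the real constructor so defaults are applied.
constructed = Source()

print(bare_failed, patched.accepts_day_from_minute(), constructed.accepts_day_from_minute())
```

Option 2 is what the refactored tests in tests/test_pandas_data_find_asset_timestep_match.py are described as doing, and it avoids repeating this fix whenever another constructor attribute is added.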

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43cf81a3-b1fc-4383-acb0-5ba9513c231d

📥 Commits

Reviewing files that changed from the base of the PR and between c962eb9 and 3300ea4.

📒 Files selected for processing (5)
  • lumibot/backtesting/polygon_backtesting.py
  • lumibot/backtesting/thetadata_backtesting_pandas.py
  • lumibot/data_sources/pandas_data.py
  • tests/test_pandas_data.py
  • tests/test_pandas_data_find_asset_timestep_match.py

Comment on lines 427 to 447
  if requested_unit == "day":
-     # IMPORTANT:
-     # Keep explicit day requests pinned to native day datasets.
+     # IMPORTANT — two conflicting philosophies exist across data sources:
+     #
+     # allow_day_resampling=False (ThetaData):
+     # Keep explicit day requests pinned to native day datasets.
+     # ThetaData stores minute and day data under separate canonical keys
+     # (asset, quote, "minute") vs (asset, quote, "day"). Allowing minute
+     # data to satisfy day requests would silently bypass ThetaData's
+     # split-spike repair / split-adjustment normalisation and could trigger
+     # expensive re-fetch churn in daily-cadence backtests.
+     #
-     # Allowing minute datasets to satisfy day requests can silently bypass provider-
-     # specific day-bar normalization (for example split-spike repair/timestamp
-     # alignment in IBKR helpers), and can trigger expensive minute fetch churn in
-     # daily-cadence backtests.
-     if requested_asset_type in {"stock", "index"}:
+     # allow_day_resampling=True (Polygon, base PandasData — the default):
+     # Polygon's _update_pandas_data always tries to obtain the finest
+     # granularity available and relies on Data.get_bars() to resample
+     # minute → day on demand. If only minute data is cached for a stock,
+     # the day request must be allowed to reach Data.get_bars() so the
+     # resampling path fires. The same applies to user-provided minute
+     # CSV data in the plain PandasData source.
+     if not self.allow_day_resampling:
          return data_ts == "day"
      return data_ts in {"day", "minute"}

⚠️ Potential issue | 🟠 Major

Day lookups still reject warmed hourly caches.

With allow_day_resampling=True, this branch now re-enables minute→day reuse, but it still excludes hour. A prior 1 hour fetch followed by 1 day will still miss here, even though lumibot/backtesting/polygon_backtesting.py already preserves hourly caches for that path. Please include hour here and add a matching regression alongside the new minute→day cases.

♻️ Suggested fix
-                return data_ts in {"day", "minute"}
+                return data_ts in {"day", "hour", "minute"}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if requested_unit == "day":
# IMPORTANT — two conflicting philosophies exist across data sources:
#
# allow_day_resampling=False (ThetaData):
# Keep explicit day requests pinned to native day datasets.
# ThetaData stores minute and day data under separate canonical keys
# (asset, quote, "minute") vs (asset, quote, "day"). Allowing minute
# data to satisfy day requests would silently bypass ThetaData's
# split-spike repair / split-adjustment normalisation and could trigger
# expensive re-fetch churn in daily-cadence backtests.
#
# allow_day_resampling=True (Polygon, base PandasData — the default):
# Polygon's _update_pandas_data always tries to obtain the finest
# granularity available and relies on Data.get_bars() to resample
# minute → day on demand. If only minute data is cached for a stock,
# the day request must be allowed to reach Data.get_bars() so the
# resampling path fires. The same applies to user-provided minute
# CSV data in the plain PandasData source.
if not self.allow_day_resampling:
return data_ts == "day"
return data_ts in {"day", "hour", "minute"}
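
To make the effect of the suggestion concrete, the sketch below compares which cached timesteps satisfy a day request under the current set and under the suggested one. The `accepts_day` helper and the `CURRENT` and `SUGGESTED` names are hypothetical; only the set contents come from the diff above.

```python
def accepts_day(data_ts: str, allow_day_resampling: bool, resamplable: set) -> bool:
    """Return True if cached data at `data_ts` may serve a day request."""
    if not allow_day_resampling:
        # Strict sources accept native day data only.
        return data_ts == "day"
    return data_ts == "day" or data_ts in resamplable


CURRENT = {"minute"}            # behavior in this PR
SUGGESTED = {"minute", "hour"}  # behavior with the suggested change

rows = {
    ts: (accepts_day(ts, True, CURRENT), accepts_day(ts, True, SUGGESTED))
    for ts in ("day", "hour", "minute")
}
for ts, (now, proposed) in rows.items():
    print(f"{ts:>6}: current={now} suggested={proposed}")
```

Only the `hour` row changes, and sources with `allow_day_resampling=False` are unaffected by either set.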