
Conversation


@m-misiura m-misiura commented Jun 2, 2025

What does this PR do?

Added initial implementation of the /data/download and /data/upload endpoints

Quick example of data upload / download

To upload / download data, start a server and then run the example requests below, e.g.:

mkdir tmp_data
export STORAGE_DATA_FOLDER="tmp_data"
uv run uvicorn src.main:app --host 0.0.0.0 --port 8080
curl -X POST http://localhost:8080/data/upload \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gaussian-credit-model",
    "data_tag": "TRAINING",
    "request": {
      "inputs": [
        {
          "name": "credit_inputs",
          "shape": [2, 4],
          "datatype": "FP64",
          "data": [
            [
              47.45380690750797,
              478.6846214843319,
              13.462184703540503,
              20.764525303373535
            ],
            [
              47.468246185717554,
              575.6911203538863,
              10.844143722475575,
              14.81343667761101
            ]
          ]
        }
      ]
    },
    "response": {
      "outputs": [
        {
          "name": "predict",
          "datatype": "FP32",
          "shape": [2, 1],
          "data": [
            0.19013395683309373,
            0.2754730253205645
          ]
        }
      ]
    }
  }'
curl -X 'POST' \
  'http://localhost:8080/data/download' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "modelId": "gaussian-credit-model"
}'
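
The download endpoint also accepts optional row filters (matchAll / matchAny / matchNone). For illustration, a hypothetical filtered request from Python -- the column name in the matcher is an assumption, while the field names follow the RowMatcher / DataRequestPayload models described further down:

import requests

payload = {
    "modelId": "gaussian-credit-model",
    "matchAll": [
        # columnName here is illustrative; "EQUALS" is one of the supported operations
        {"columnName": "credit_inputs-0", "operation": "EQUALS", "values": [47.45380690750797]},
    ],
}
resp = requests.post("http://localhost:8080/data/download", json=payload)
print(resp.json()["dataCSV"])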

Tests

You can run the accompanying tests, e.g.:

  • for the download endpoint:
pytest tests/endpoints/test_download_endpoint.py
  • for the upload endpoint:
pytest tests/endpoints/test_upload_endpoint.py

Summary by Sourcery

Implement complete data ingestion and retrieval endpoints for models, backed by a unified storage interface with payload validation, ground truth handling, and filtering logic.

New Features:

  • Add /data/upload endpoint to ingest regular and ground truth model data with validation and persistent storage
  • Add /data/download endpoint to export stored model data as CSV with support for AND/OR/NOT filtering

Enhancements:

  • Introduce utility modules for tensor processing, data tag validation, ground truth matching, and HDF5-based storage operations

Tests:

  • Add comprehensive endpoint tests covering varied data shapes, types, tag rules, filtering scenarios, and error conditions

Contributor

sourcery-ai bot commented Jun 2, 2025

Reviewer's Guide

This PR delivers complete implementations for the /data/upload and /data/download endpoints by adding request parsing, validation, storage interactions, ground-truth handling, and DataFrame-based filtering, and bolsters them with extensive tests covering normal flows, edge cases, and error conditions.

Entity Relationship Diagram for Stored Model Data Components

erDiagram
    MODEL {
        string model_id PK "e.g., gaussian-credit-model"
    }

    MODEL_INPUT_DATA {
        string model_id FK
        string execution_id "Correlates with metadata"
        array input_features "Stored as NumPy array"
        array feature_names
    }

    MODEL_OUTPUT_DATA {
        string model_id FK
        string execution_id "Correlates with metadata"
        array output_values "Stored as NumPy array"
        array output_names
    }

    MODEL_METADATA {
        string model_id FK
        string execution_id PK "Unique ID for an inference transaction"
        datetime timestamp "Timestamp of the transaction"
        string tag "User-defined tag"
    }

    GROUND_TRUTH_DATA {
        string model_id FK
        string execution_id FK "References MODEL_METADATA.execution_id"
        array ground_truth_values "Stored as NumPy array"
        array ground_truth_names
    }

    MODEL ||--|{ MODEL_INPUT_DATA : "has inputs stored as"
    MODEL ||--|{ MODEL_OUTPUT_DATA : "has outputs stored as"
    MODEL ||--|{ MODEL_METADATA : "has metadata stored as"
    MODEL_METADATA }o--|| MODEL_INPUT_DATA : "corresponds to"
    MODEL_METADATA }o--|| MODEL_OUTPUT_DATA : "corresponds to"
    MODEL_METADATA ||--o{ GROUND_TRUTH_DATA : "can have associated"
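
To make the storage side concrete, a purely illustrative h5py sketch of appending rows for one model under the HDF5-based storage mentioned above; the file name, group/dataset names, and append convention are assumptions rather than the PR's actual schema:

import h5py
import numpy as np

def append_model_data(path: str, model_id: str, inputs: np.ndarray, outputs: np.ndarray) -> None:
    with h5py.File(path, "a") as f:
        grp = f.require_group(model_id)
        for name, arr in (("inputs", inputs), ("outputs", outputs)):
            if name in grp:
                ds = grp[name]
                ds.resize(ds.shape[0] + arr.shape[0], axis=0)  # grow along the row axis
                ds[-arr.shape[0]:] = arr
            else:
                # maxshape=(None, ...) keeps the dataset resizable so later uploads can append
                grp.create_dataset(name, data=arr, maxshape=(None,) + arr.shape[1:])

# e.g. the two rows from the gaussian-credit-model example above
append_model_data("tmp_data/example.hdf5", "gaussian-credit-model",
                  inputs=np.random.rand(2, 4), outputs=np.random.rand(2, 1))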

Class Diagram for Data Transfer Objects (DTOs)

classDiagram
    class UploadPayload {
        <<DTO (src/endpoints/data/data_upload.py)>>
        +model_name: str
        +data_tag: Optional[str]
        +is_ground_truth: bool
        +request: Dict[str, Any]
        +response: Dict[str, Any]
    }

    class RowMatcher {
        <<DTO (src/service/utils/download.py)>>
        +columnName: str
        +operation: str
        +values: List[Any]
    }

    class DataRequestPayload {
        <<DTO (src/service/utils/download.py)>>
        +modelId: str
        +matchAny: Optional[List[RowMatcher]]
        +matchAll: Optional[List[RowMatcher]]
        +matchNone: Optional[List[RowMatcher]]
    }
    DataRequestPayload --> "*" RowMatcher : uses

    class DataResponsePayload {
        <<DTO (src/service/utils/download.py)>>
        +dataCSV: str
    }

File-Level Changes

Change Details Files
Implement /data/upload endpoint with full parsing, validation, and storage logic
  • Standardize model ID and validate optional data tag
  • Parse and convert input/output tensors to NumPy arrays
  • Validate tensor shapes and unique names
  • Branch for ground truth: sanitize IDs, validate against stored data, and write GT outputs and metadata
  • Branch for regular upload: generate or reuse execution IDs, flatten arrays, build metadata rows with timestamps and tags, and save inputs/outputs/metadata
src/endpoints/data/data_upload.py
src/service/utils/upload.py
Implement /data/download endpoint with DataFrame loading and matcher-based filtering
  • Load inputs, outputs, and metadata into a pandas DataFrame
  • Apply matchAll (AND), matchNone (NOT), and matchAny (OR) filters via reusable matcher functions
  • Handle special cases for timestamp and index columns
  • Convert filtered DataFrame to CSV and wrap in a response payload
  • Propagate HTTP errors for invalid columns, operations, or parsing
src/endpoints/data/data_download.py
src/service/utils/download.py
Add extensive endpoint tests covering functionality and error scenarios
  • Parameterized upload tests for varying dimensions, datatypes, and tags
  • Tests for multi-tensor uploads, tag restrictions, and ground truth consistency/mismatch cases
  • Mock-based download tests for numeric and text data, all filter combinations, and invalid filter conditions
  • Validation of CSV output, storage contents, and exception messages
tests/endpoints/test_upload_endpoint.py
tests/endpoints/test_download_endpoint.py


Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @m-misiura - I've reviewed your changes - here's some feedback:

  • Consider refactoring the large upload handler into smaller service functions or helpers to separate ground-truth and standard flows for better readability and maintainability.
  • Replace the raw Dict[str, Any] fields for inputs/outputs in UploadPayload with dedicated Pydantic models to leverage automatic validation of tensor schemas (shape, datatype, execution IDs).
  • Avoid mutable default lists in DataRequestPayload (e.g. matchAny: Optional[List[RowMatcher]] = []) by using Field(default_factory=list) or Optional[...] = None to prevent shared state across requests.
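
For the mutable-defaults point, a minimal sketch of the Field(default_factory=...) approach, assuming pydantic v2 and the field names from this PR's DTOs:

from typing import Any, List
from pydantic import BaseModel, Field

class RowMatcher(BaseModel):
    columnName: str
    operation: str
    values: List[Any]

class DataRequestPayload(BaseModel):
    modelId: str
    # default_factory builds a fresh list per request instead of sharing one mutable default
    matchAny: List[RowMatcher] = Field(default_factory=list)
    matchAll: List[RowMatcher] = Field(default_factory=list)
    matchNone: List[RowMatcher] = Field(default_factory=list)
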
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟡 Testing: 2 issues found
  • 🟢 Documentation: all looks good


MODEL_ID = "example1"


def generate_payload(n_rows, n_input_cols, n_output_cols, datatype, tag, input_offset=0, output_offset=0):
Contributor

suggestion (testing): Consider testing scenarios where execution IDs are provided for non-ground-truth uploads.

Please add a test that uploads non-ground-truth data with explicit execution IDs and verifies these IDs are stored in the metadata, ensuring user-supplied IDs are handled correctly.

Suggested implementation:

MODEL_ID = "example1"
def generate_payload(n_rows, n_input_cols, n_output_cols, datatype, tag, input_offset=0, output_offset=0):
    """Generate a test payload with specific dimensions and data types."""
    model_name = f"{MODEL_ID}_{uuid.uuid4().hex[:8]}"
    input_data = []
    for i in range(n_rows):
        if n_input_cols == 1:
            input_data.append(i + input_offset)
        else:
            row = [i + j + input_offset for j in range(n_input_cols)]
            input_data.append(row)
    output_data = []
    for i in range(n_rows):
        if n_output_cols == 1:

def test_upload_non_ground_truth_with_explicit_execution_ids():
    """Test uploading non-ground-truth data with explicit execution IDs and verify they are stored in metadata."""
    import uuid

    n_rows = 3
    n_input_cols = 2
    n_output_cols = 1
    datatype = "float"
    tag = "test-non-gt-execid"
    execution_ids = [f"execid-{uuid.uuid4().hex[:8]}" for _ in range(n_rows)]

    payload = {
        "model": f"{MODEL_ID}_{uuid.uuid4().hex[:8]}",
        "inputs": [[i, i+1] for i in range(n_rows)],
        "outputs": [[i * 2.0] for i in range(n_rows)],
        "datatype": datatype,
        "tag": tag,
        "ground_truth": False,
        "execution_ids": execution_ids,
    }

    response = client.post("/upload", json=payload)
    assert response.status_code == 200, f"Unexpected status code: {response.status_code}, {response.text}"
    data = response.json()
    assert "metadata" in data, "No metadata in response"
    meta = data["metadata"]
    assert "execution_ids" in meta, "No execution_ids in metadata"
    assert meta["execution_ids"] == execution_ids, f"Execution IDs in metadata do not match: {meta['execution_ids']} vs {execution_ids}"

MODEL_ID = "example1"


def test_download_data():
Contributor

suggestion (testing): Add test case for requesting data from a non-existent model ID.

Please add a test to ensure that requesting a download for a non-existent modelId returns a 404 status and the correct error message, as expected from load_model_dataframe.

Suggested implementation:

# Test constants
MODEL_ID = "example1"
NONEXISTENT_MODEL_ID = "nonexistent_model"


def test_download_data():
    """equivalent of Java downloadData() test"""
    dataframe = DataframeGenerators.generate_random_dataframe(1000)
    mock_storage.save_dataframe(dataframe, MODEL_ID)

    payload = {
        "modelId": MODEL_ID,
        "matchAll": [
            {"columnName": "gender", "operation": "EQUALS", "values": [0]},
            {"columnName": "race", "operation": "EQUALS", "values": [0]},
            {"columnName": "income", "operation": "EQUALS", "values": [0]},
        ],
        "matchAny": [
    }
    # ... rest of the test ...



def test_download_data_nonexistent_model(client):
    """Test that requesting data for a non-existent model ID returns 404 and correct error message."""
    payload = {
        "modelId": NONEXISTENT_MODEL_ID,
        "matchAll": [],
        "matchAny": [],
    }
    response = client.post("/download", json=payload)
    assert response.status_code == 404
    assert "not found" in response.json["error"].lower()
  • If your test client fixture is not named client, adjust the function argument accordingly.
  • If the error message from load_model_dataframe is more specific, update the assertion to match the exact message.
  • Ensure that the /download endpoint and error response structure match your actual API.

"""equivalent of Java downloadTextDataInternalColumnIndex() test"""
dataframe = DataframeGenerators.generate_random_text_dataframe(1000)
mock_storage.save_dataframe(dataframe, MODEL_ID)
expected_rows = dataframe.iloc[0:10].copy()
Contributor

suggestion (code-quality): Replace a[0:x] with a[:x] and a[x:len(a)] with a[x:] (remove-redundant-slice-index)

Suggested change

- expected_rows = dataframe.iloc[0:10].copy()
+ expected_rows = dataframe.iloc[:10].copy()

model_name = f"{MODEL_ID}_{uuid.uuid4().hex[:8]}"
input_tensors = []
for col_idx in range(n_input_cols):
    tensor_data = []
Contributor

issue (code-quality): We've found these issues:

Comment on lines +163 to +200
if isinstance(id_val, np.ndarray):
    ids.append(str(id_val))
else:
    ids.append(str(id_val))
Contributor

suggestion (code-quality): Hoist repeated code outside conditional statement (hoist-statement-from-if)

Suggested change

- if isinstance(id_val, np.ndarray):
-     ids.append(str(id_val))
- else:
-     ids.append(str(id_val))
+ ids.append(str(id_val))

Comment on lines +293 to +340
payload1 = generate_payload(n_payload1, 10, 1, "INT64", tag1)
payload1["model_name"] = model_name
post_test(payload1, 200, [f"{n_payload1} datapoints"])
payload2 = generate_payload(n_payload2, 10, 1, "INT64", tag2)
payload2["model_name"] = model_name
post_test(payload2, 200, [f"{n_payload2} datapoints"])
Contributor

issue (code-quality): Extract duplicate code into function (extract-duplicate-method)
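
For illustration, the repeated block could be pulled into a small helper (the helper name is hypothetical; generate_payload and post_test are the test utilities shown above):

def upload_tagged(model_name, n_rows, tag):
    payload = generate_payload(n_rows, 10, 1, "INT64", tag)
    payload["model_name"] = model_name
    post_test(payload, 200, [f"{n_rows} datapoints"])

upload_tagged(model_name, n_payload1, tag1)
upload_tagged(model_name, n_payload2, tag2)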


codecov-commenter commented Jun 3, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 81.75313% with 102 lines in your changes missing coverage. Please review.

Project coverage is 62.58%. Comparing base (fffe1ca) to head (34e127b).
Report is 45 commits behind head on main.

Files with missing lines Patch % Lines
src/service/utils/download.py 77.27% 50 Missing ⚠️
src/service/utils/upload.py 84.15% 48 Missing ⚠️
src/endpoints/data/data_download.py 86.66% 2 Missing ⚠️
src/endpoints/data/data_upload.py 90.47% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #24       +/-   ##
===========================================
+ Coverage   48.13%   62.58%   +14.45%     
===========================================
  Files          15       26       +11     
  Lines        1498     2371      +873     
===========================================
+ Hits          721     1484      +763     
- Misses        777      887      +110     

☔ View full report in Codecov by Sentry.

@m-misiura m-misiura force-pushed the download_and_upload_endpoints branch from 8a0eb28 to fa2a88e Compare June 3, 2025 10:54
if len(metadata_data) > 0 and isinstance(metadata_data[0], bytes):
    deserialized_metadata = []
    for row in metadata_data:
        deserialized_row = pickle.loads(row)
Contributor

unsure why this is necessary -- is the metadata returned from storage.read_data ever serialized? If so, that's a bug in the storage code

Member

In addition to what @RobGeada said, even if it's necessary, we shouldn't be using raw pickle deserialisation. If some re-design of this part is needed, I'm fine with leaving it to another PR so we minimise conflicts with other PRs, but to be addressed before the final version.
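
For context, a hedged sketch of what a pickle-free encoding of the metadata rows could look like (illustrative only, not this PR's code):

import json

def encode_metadata_row(row: dict) -> str:
    # JSON round-trips plain dicts/lists/strings/numbers and cannot execute code on load
    return json.dumps(row)

def decode_metadata_row(stored) -> dict:
    if isinstance(stored, bytes):
        stored = stored.decode("utf-8")
    return json.loads(stored)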

return row_data


def process_tensors(
Contributor

Can we re-use tensor parsing logic from the KServe inference parser?
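
Not the repo's parser, but for context a minimal sketch of turning a KServe V2-style tensor (name / shape / datatype / data, as in the upload payload above) into a NumPy array:

import numpy as np

# Assumed mapping from a few V2 datatype strings to NumPy dtypes; unknown types fall back to object
DTYPE_MAP = {"FP32": np.float32, "FP64": np.float64, "INT32": np.int32, "INT64": np.int64, "BOOL": np.bool_}

def tensor_to_array(tensor: dict) -> np.ndarray:
    dtype = DTYPE_MAP.get(tensor["datatype"], object)
    return np.asarray(tensor["data"], dtype=dtype).reshape(tensor["shape"])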

@ruivieira ruivieira linked an issue Jun 15, 2025 that may be closed by this pull request
@ruivieira ruivieira changed the title RHOAIENG-21050 -- Endpoints: /data/download and /data/upload feat(RHOAIENG-21050) Endpoints: /data/download and /data/upload Jun 15, 2025
matching_dfs.append(matched_df)
# Union all results
if matching_dfs:
    df = pd.concat(matching_dfs, ignore_index=True).drop_duplicates()
Member

Perhaps for a separate PR, but we should revisit this. I'm concerned this might not scale well for large numbers of matchers and large DFs. concat, drop_duplicates (and even the filtering, in some situations) involve DF copying.

Author

thanks -- very good food for thought; I tried to refactor it to avoid some of these operations
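
For illustration, one way to sidestep the concat/drop_duplicates round trip is to OR together per-matcher boolean masks and index the DataFrame once (a sketch of the idea, not necessarily the refactor applied here; only an EQUALS-style matcher is shown):

import pandas as pd

def match_any(df: pd.DataFrame, matchers: list[dict]) -> pd.DataFrame:
    mask = pd.Series(False, index=df.index)
    for m in matchers:
        # each matcher contributes one boolean mask; OR-ing them avoids duplicate rows entirely
        mask |= df[m["columnName"]].isin(m["values"])
    return df[mask]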

]
for eid in exec_ids
]
metadata = np.array(metadata_rows, dtype="<U100")
Member

I would move the character limit to a constant with an explanatory comment (perhaps make it configurable in a separate PR, if it makes sense)
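
e.g. (the constant name is a suggestion, not code from this PR):

# Maximum characters kept per metadata string field; values longer than this are silently
# truncated by the fixed-width unicode dtype, so the limit should be documented (or made configurable).
METADATA_MAX_CHARS = 100

metadata = np.array(metadata_rows, dtype=f"<U{METADATA_MAX_CHARS}")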

# TODO: Implement
return {"status": "success", "message": "Data uploaded successfully"}
# Get fresh storage interface for each request
storage = get_storage_interface()
Member

I'm unsure if we should get a storage instance per request, or once globally for the lifespan of the service.
An advantage of a global interface is that some interfaces are already thread-safe (e.g. PVC). This could be a singleton either at the module level or as a FastAPI app state. wdyt?

Author

Very good question! I assumed (perhaps incorrectly) that upload operations via the endpoint would be infrequent and would benefit from fault isolation. Creating a storage interface per request ensures that each upload operation is independent, so if one fails, it doesn't affect the others.

I do not have a strong opinion on what is best here and happy to follow your recommendation :)

Contributor

Yeah - we probably want to use the global storage_interface. You can access it from model_data.storage_interface, but we should expose it via function
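
A small sketch of what that could look like (the accessor name and import path are assumptions; model_data.storage_interface is the attribute mentioned above):

from src.service.data import model_data  # hypothetical import path

def get_global_storage_interface():
    """Return the single storage interface created for the lifetime of the service."""
    return model_data.storage_interface

# endpoint code would then call get_global_storage_interface() instead of building
# a new storage interface per request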

@ruivieira ruivieira added the enhancement New feature or request label Jun 15, 2025
@ruivieira ruivieira moved this to In Review in TrustyAI planning Jun 15, 2025
Comment on lines +150 to +380
for i, arr in enumerate(input_arrays[1:], 1):
    if arr.shape[0] != first_dim:
        errors.append(
            f"Input tensor '{input_names[i]}' has first dimension {arr.shape[0]}, "
            f"which doesn't match the first dimension {first_dim} of '{input_names[0]}'"
        )
if errors:
    return ". ".join(errors) + "."
return None
Member

sourcery-ai, why not the simpler

errors = [
    f"Input tensor '{input_names[i]}' has first dimension {arr.shape[0]}, which doesn't match the first dimension {first_dim} of '{input_names[0]}'"
    for i, arr in enumerate(input_arrays[1:], 1)
    if arr.shape[0] != first_dim
]
return ". ".join(errors) + "." if errors else None

@m-misiura m-misiura closed this Jun 18, 2025
@m-misiura m-misiura force-pushed the download_and_upload_endpoints branch from 5ac8b6b to 9023359 Compare June 18, 2025 11:03
@m-misiura m-misiura reopened this Jun 18, 2025
@@ -0,0 +1,310 @@
import logging
import numbers
import pickle

Check notice

Code scanning / Bandit

Consider possible security implications associated with pickle module.
@m-misiura m-misiura force-pushed the download_and_upload_endpoints branch from 89f3ff9 to 34e127b Compare June 18, 2025 12:34
@m-misiura m-misiura requested review from RobGeada and ruivieira June 18, 2025 12:41
Contributor

RobGeada commented Oct 2, 2025

duplicate of #47

@RobGeada RobGeada closed this Oct 2, 2025

Labels

enhancement New feature or request

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add data upload endpoint support

4 participants