
Feature/flex data schema #36

Merged
ThiagoAVicente merged 14 commits into dev from feature/flex-data-schema
Feb 22, 2026

Conversation

@ThiagoAVicente
Collaborator

Summary

This PR implements a config-driven, schema-agnostic data storage system for NWDAF that can handle multiple analytics types (latency, anomaly detection, etc.) without requiring code changes.

Key Changes

Configuration System

  • Added YAML config files in confs/ for defining data fields (RAW):
    • core_fields.yml - Required fields (timestamp, cell_index)
    • extra_fields.yml - Optional metrics
    • tag_fields.yml - InfluxDB indexed tags
  • New src/configs/schema_conf.py - Config loader with type parsing (YAML strings → Python types)
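A minimal sketch of the type-parsing idea, assuming a simple name-to-type map (the actual loader in src/configs/schema_conf.py may differ; the allowed type names here are assumptions):

```python
# Illustrative sketch: map YAML type strings to Python types after
# yaml.safe_load() has already produced a {field_name: type_name} dict.
_TYPE_MAP: dict[str, type] = {"int": int, "float": float, "str": str, "bool": bool}

def parse_fields(raw: dict[str, str]) -> dict[str, type]:
    """Convert a loaded YAML mapping of field -> type string to field -> type."""
    try:
        return {name: _TYPE_MAP[type_name] for name, type_name in raw.items()}
    except KeyError as exc:
        raise ValueError(f"unsupported type name in config: {exc}") from exc

fields = parse_fields({"timestamp": "int", "cell_index": "int", "latency_ms": "float"})
# fields == {"timestamp": int, "cell_index": int, "latency_ms": float}
```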

Raw Data Pipeline (InfluxDB)

  • Refactored src/models/raw.py to a plain class
  • Dynamic field validation against config
  • Configurable InfluxDB tags
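The validation step above can be sketched as follows, assuming hardcoded field maps that would in practice come from the schema config (names and error types are illustrative, not the project's actual code):

```python
# Hypothetical sketch of config-driven validation for a plain Raw class.
# In the real code these maps would be loaded from confs/*.yml.
CORE_FIELDS: dict[str, type] = {"timestamp": int, "cell_index": int}
EXTRA_FIELDS: dict[str, type] = {"latency_ms": float}

class Raw:
    def __init__(self, data: dict):
        # Every core field must be present and of the configured type.
        for name, typ in CORE_FIELDS.items():
            if name not in data:
                raise ValueError(f"missing required field: {name}")
            if not isinstance(data[name], typ):
                raise TypeError(f"{name} must be {typ.__name__}")
        # Extra fields are optional but type-checked when present.
        for name, typ in EXTRA_FIELDS.items():
            if name in data and not isinstance(data[name], typ):
                raise TypeError(f"{name} must be {typ.__name__}")
        self.data = data
```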

Processed Data Pipeline (ClickHouse)

  • Generic analytics.processed table with nullable columns for all metric types
  • Dynamic transformer that flattens nested metrics and validates against schema
  • Schema caching via DESCRIBE query on connect
  • Uses column_names parameter for dynamic inserts
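The flattening step can be illustrated with a small recursive helper (a sketch under the assumption that nested metric names are joined with underscores; the real transformer may use a different convention):

```python
# Hypothetical sketch of flattening nested metric dicts into the flat
# column namespace of a generic analytics.processed table.
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten {"latency": {"avg": 1.2}} into {"latency_avg": 1.2}."""
    flat: dict = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested metric groups, extending the prefix.
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

row = flatten({"cell_index": 7, "latency": {"avg": 1.2, "p95": 3.4}})
# row == {"cell_index": 7, "latency_avg": 1.2, "latency_p95": 3.4}
```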

API Changes

  • Renamed /processed/latency/ to /processed/
  • Removed ProcessedLatency model - endpoints now return plain dicts
  • Dynamic /example endpoint that generates an example payload from the actual schema
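A sketch of the schema-driven example generation (placeholder values per type are assumptions; the real endpoint may choose different defaults):

```python
# Hypothetical sketch: derive an /example payload from the configured
# schema instead of hardcoding it, so new fields appear automatically.
_EXAMPLE_VALUES: dict[type, object] = {int: 0, float: 0.0, str: "example", bool: False}

def build_example(schema: dict[str, type]) -> dict:
    """Produce a sample payload with one placeholder value per configured field."""
    return {name: _EXAMPLE_VALUES[typ] for name, typ in schema.items()}

example = build_example({"timestamp": int, "cell_index": int, "latency_ms": float})
# example == {"timestamp": 0, "cell_index": 0, "latency_ms": 0.0}
```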

Docker Setup

  • Moved Dockerfiles to docker/ directory
  • ClickHouse init via /docker-entrypoint-initdb.d (removed separate init container)
  • Config files copied to container

How it was tested

The whole PEI project was deployed locally, and the services were confirmed to integrate correctly with the data storage. The changes resolve the targeted integration issues without breaking existing behavior.

@ThiagoAVicente ThiagoAVicente self-assigned this Feb 19, 2026
@ThiagoAVicente ThiagoAVicente added the enhancement New feature or request label Feb 19, 2026
@ThiagoAVicente ThiagoAVicente marked this pull request as ready for review February 20, 2026 15:00
@ThiagoAVicente
Copy link
Collaborator Author

The remaining errors were not introduced by this PR.

Copilot AI left a comment

Pull request overview

This PR implements a flexible, configuration-driven data storage system for NWDAF that replaces hardcoded Pydantic models with a dynamic schema system based on YAML configuration files. The changes enable the system to handle multiple analytics types without code modifications.

Changes:

  • Introduced YAML-based schema configuration system (SchemaConf) that defines data fields, types, and InfluxDB tags
  • Refactored Raw model from Pydantic to plain Python class with dynamic field validation
  • Replaced ProcessedLatency Pydantic model with dict-based approach and generic ClickHouse table
  • Updated API endpoints from /processed/latency/ to /processed/ to reflect generic nature
  • Simplified Docker setup by using ClickHouse's native initialization mechanism

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 24 comments.

Show a summary per file
File | Description
src/configs/schema_conf.py | New configuration loader for YAML-based schema definitions with type parsing
src/configs/conf.py | Added generic load() method to base configuration class
src/configs/__init__.py | Added load_all() function to initialize all configurations
src/models/raw.py | Refactored from Pydantic to plain class with dynamic field validation against schema config
src/models/processed_latency.py | Deleted - replaced with dict-based approach
src/services/clickhouse.py | Dynamic data transformation that queries schema at runtime and supports flexible fields
src/services/clickhouse_query.py | Simplified query to use SELECT * for generic table
src/routers/v1/processed.py | New generic router replacing latency-specific endpoint
src/routers/v1/latency_router.py | Deleted - replaced by generic processed router
src/routers/v1/__init__.py | Updated router registration with generic "data" tags
src/routers/v1/raw_router.py | Code formatting improvements
sql/01_create_processed_table.sql | Renamed table to generic "processed" with nullable fields for all metric types
docker/Dockerfile | Added confs/ directory copy for runtime configuration
docker/Dockerfile.clickhouse | New Dockerfile using native ClickHouse initialization
docker-compose.yml | Removed separate init container, simplified ClickHouse setup
init-clickhouse.sh | Deleted - replaced by native ClickHouse init
Dockerfile.clickhouse-init | Deleted - no longer needed
confs/*.yml.example | Example configuration files for core fields, extra fields, and tags
.gitignore | Added confs/*.yml to ignore actual config files
.env.example | New example environment file
main.py | Added load_all() call to initialize schema configurations
tests/* | Updated tests to work with dict-based models and mock schema
Comments suppressed due to low confidence (3)

sql/01_create_processed_table.sql:40

  • The table schema has been changed to make critical fields like "network" and temporal fields nullable or moved to the end. The cell_index is at the top but window_start_time, window_end_time, and window_duration_seconds (which are part of the ORDER BY) are now at the bottom. While this doesn't affect functionality, it's unconventional to have ORDER BY columns at the end of the schema. Consider grouping related fields together for better readability.

sql/01_create_processed_table.sql:43

  • The data_type field was added to distinguish between different analytics types (latency, anomaly, etc.) but it's not included in the ORDER BY clause. If queries frequently filter by data_type, this could lead to poor query performance. Consider adding data_type to the ORDER BY clause as: ORDER BY (cell_index, data_type, window_start_time) for better query performance when filtering by analytics type.

src/services/clickhouse_query.py:10

  • The query uses SELECT * which will return all columns including the new data_type field. However, there's no filter for data_type in the WHERE clause. This means the query will return mixed analytics types (latency, anomaly, etc.) if they exist in the same table. Consider adding an optional data_type parameter to allow filtering by analytics type, or document that this query returns all types.
    processed = """
    SELECT
        *
    FROM analytics.processed
    WHERE cell_index = {cell_index:Int32}
      AND window_duration_seconds = {window_duration_seconds:Int32}
      AND toUnixTimestamp(window_start_time) >= {start_time:Int64}
      AND toUnixTimestamp(window_end_time) <= {end_time:Int64}
    ORDER BY window_end_time DESC


Comment on lines 43 to 67
def get_processed_latency(
start_time: int = Query(
..., description="Window start time (Unix timestamp in seconds)"
),
end_time: int = Query(
..., description="Window end time (Unix timestamp in seconds)"
),
cell_index: int = Query(..., description="Cell index (required)"),
window_duration_seconds: int = Query(
..., description="Duration of the target windows"
),
offset: int = Query(0, ge=0, description="Number of records to skip"),
limit: int = Query(
100, ge=1, le=1000, description="Maximum number of records to return"
),
):
"""
Query processed data with various filters.

Returns aggregated statistics over time windows including:
- Signal quality metrics
- Performance metrics
- Network information
- Statistical measures for each metric
"""
Copilot AI Feb 20, 2026

The endpoint function is still named "get_processed_latency" even though the endpoint is now generic and handles all types of processed data, not just latency. The docstring also still references "latency" specifically. This should be renamed to "get_processed_data" and the docstring updated to reflect the generic nature of the endpoint.

Comment on lines 23 to 45
@@ -41,29 +43,13 @@ services:
timeout: 3s
retries: 5

Copilot AI Feb 20, 2026

The clickhouse-init service has been removed, which simplifies the architecture. However, ensure that the new approach using /docker-entrypoint-initdb.d/ in the ClickHouse image properly waits for ClickHouse to be ready before executing the SQL scripts. The removed init-clickhouse.sh had explicit health checks and retries. The standard ClickHouse entrypoint should handle this, but it's worth verifying in testing.

from_attributes=True
)
# Do not change the following if statement if timestamp is not ensured to be present in data
# The current approach allows core features to fe configurable but influxdb always needs timestamp
Copilot AI Feb 20, 2026

There is a typo in the comment on line 27: "fe" should be "be". The comment reads "allows core features to fe configurable" but should read "allows core features to be configurable".

Suggested change
# The current approach allows core features to fe configurable but influxdb always needs timestamp
# The current approach allows core features to be configurable but influxdb always needs timestamp

@@ -0,0 +1,3 @@
FROM clickhouse/clickhouse-server:latest

Copilot AI Feb 20, 2026

The SQL files are copied to /docker-entrypoint-initdb.d/ which is the standard ClickHouse initialization directory. However, there's no guarantee about the execution order of multiple SQL files. The file is named with a "01_" prefix suggesting ordering, but this should be documented or verified that ClickHouse processes files in alphanumeric order. If there are dependencies between SQL files, this could cause initialization failures.

Suggested change
# NOTE: The ClickHouse Docker entrypoint processes files in /docker-entrypoint-initdb.d/
# in lexicographical (alphanumeric) order. SQL files in sql/ should be named with
# appropriate numeric prefixes (e.g., 01_, 02_) to ensure dependent scripts run in sequence.

Comment on lines +14 to +16
core_fields: dict[str, type] = {}
extra_fields: dict[str, type] = {}
tags: set[str] = set()
Copilot AI Feb 20, 2026

The SchemaConf class variables (core_fields, extra_fields, tags) are class-level and never cleared. In a long-running service, if the configuration is reloaded multiple times, this is fine. However, in test environments, this can cause test pollution if tests don't properly reset the state. The test file test_singleton_databases.py resets database singletons but doesn't reset SchemaConf state, which could lead to test interdependencies.
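One way to address this (a sketch using a stand-in class, not the project's actual SchemaConf) is an explicit reset hook that test teardown, e.g. an autouse pytest fixture, can call:

```python
# Hypothetical stand-in for src/configs/schema_conf.SchemaConf, showing a
# reset() classmethod that restores pristine class-level state between tests.
class SchemaConf:
    core_fields: dict[str, type] = {}
    extra_fields: dict[str, type] = {}
    tags: set[str] = set()

    @classmethod
    def reset(cls) -> None:
        """Rebind fresh containers so no test sees another test's config."""
        cls.core_fields = {}
        cls.extra_fields = {}
        cls.tags = set()
```

In a pytest suite this reset would typically run from an autouse fixture's teardown, alongside the existing database-singleton resets in test_singleton_databases.py.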

Comment on lines +104 to +107
column_names = list(transformed_list[0].keys())

values = [list(d.values()) for d in transformed_list]

Copilot AI Feb 20, 2026

The write_batch method assumes all items in transformed_list have the same keys and uses the keys from the first item for column_names. If different records in the batch have different fields (which is possible in a flexible schema system), this will cause data corruption or insertion errors. The values will be extracted in dictionary iteration order which may differ between items, causing values to be written to wrong columns.

Suggested change
column_names = list(transformed_list[0].keys())
values = [list(d.values()) for d in transformed_list]
# Build a unified, deterministic column list across all transformed records
all_columns: set[str] = set()
for item in transformed_list:
all_columns.update(item.keys())
column_names = sorted(all_columns)
# For each record, align values to the unified column order
values = [[item.get(col) for col in column_names] for item in transformed_list]

Comment on lines +14 to +16
core_fields: dict[str, type] = {}
extra_fields: dict[str, type] = {}
tags: set[str] = set()
Copilot AI Feb 20, 2026

The class attributes core_fields, extra_fields, and tags are initialized with mutable defaults (empty dict and set). In Python, mutable default arguments are shared across all instances. While this might be intentional for a singleton-like pattern, it's safer to initialize these in load_yml() or use None as defaults and check for None before returning them. This prevents potential issues if the class is instantiated multiple times or if tests don't properly clean up state.
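A minimal, self-contained demonstration of the pitfall (illustrative classes, not the project's code):

```python
# Demonstration of the shared-mutable-class-attribute pitfall described above.
class Conf:
    # Class-level mutable default: ONE dict shared by every instance.
    fields: dict[str, type] = {}

class SafeConf:
    def __init__(self) -> None:
        # Instance-level state: each object gets its own dict.
        self.fields: dict[str, type] = {}

a, b = Conf(), Conf()
a.fields["x"] = int
# b.fields is the very same dict object, so it now also contains "x".

c, d = SafeConf(), SafeConf()
c.fields["x"] = int
# d.fields is still empty, because each SafeConf owns its dict.
```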

Comment on lines +338 to +340
# Should return False because sample_count is a required field
# The transform function will raise ValueError for missing required fields
assert result is False
Copilot AI Feb 20, 2026

The test expectation has changed from "assert result is True" to "assert result is False" with a comment explaining that sample_count is a required field. However, this assumption may not be accurate - the actual behavior depends on whether the transform function in the ClickHouse sink catches and handles the ValueError. The test should verify the actual exception or error handling behavior rather than assuming a False return value.

v1_router = APIRouter()
v1_router.include_router(latencyR, prefix="/processed", tags=["v1", "latency"])
v1_router.include_router(rawR, prefix="/raw", tags=["v1", "latency"])
v1_router.include_router(latencyR, prefix="/processed", tags=["v1", "data"])
Copilot AI Feb 20, 2026

The tags in the router registration have changed from "latency" to "data" but the function is still named "get_processed_latency". This naming inconsistency could be confusing. Consider renaming the function to "get_processed_data" to match the generic nature of the endpoint and the tag name.

@@ -0,0 +1,3 @@
FROM clickhouse/clickhouse-server:latest
Copilot AI Feb 20, 2026

The Dockerfile.clickhouse uses the mutable base image tag clickhouse/clickhouse-server:latest, which introduces a supply chain risk because future builds may automatically pull a compromised or incompatible image. An attacker who compromises the upstream image registry could gain code execution inside your ClickHouse container and access or exfiltrate analytics data. To reduce this risk, pin the base image to a specific, trusted version or immutable digest and update it deliberately as part of your release process.

@JPSP9547 JPSP9547 left a comment

LGTM, but the tests are not working.

@ThiagoAVicente ThiagoAVicente merged commit 4a9026a into dev Feb 22, 2026
1 of 2 checks passed
@ThiagoAVicente ThiagoAVicente deleted the feature/flex-data-schema branch February 22, 2026 02:41
