Replies: 1 comment
This is an excellent initiative. Thanks houkawa2. Per our discussion, I would recommend that the DCR design proposal cover more about:
As this proposal is more about relying on Cloudberry to build a data clean room solution (instead of improving the kernel), we may consider moving future design and code discussion to https://github.com/cloudberry-contrib/. Please keep up the good work!!
Proposers
houkawa2
Proposal Status
Under Discussion
Abstract
This proposal presents the construction of a Data Clean Room (DCR) platform based on Cloudberry. The platform adopts Spring Boot as its core framework and integrates Kafka with Python-based Multi-Party Computation (MPC) modules to achieve high-concurrency and privacy-preserving collaborative data processing.
Each participant can upload their own raw data and define fine-grained access control policies — such as which columns can be shared, row-level filters, and join keys. These policies are applied to the user-generated encrypted views. The views are then imported into the DCR database, where data from different users is strictly isolated and inaccessible to others.
During collaborative analysis, the system uses MPC to generate a cross-party matching table, and then leverages the SQL engine to execute queries under the constraints of the privacy policies. Before results are returned, differential privacy noise is added automatically to prevent sensitive information from being inferred.
To ensure data security, the platform implements a fine-grained role-based access control mechanism: each user has a private database role for managing their own data (invisible to others), and a dedicated staging role to upload encrypted views to the DCR. Analytical and matching operations within the DCR are executed by backend services using specific analysis and matching roles. The final query results are stored under a result role dedicated to each user, ensuring that only the data owner can access them.
By combining a layered database schema with strict role control, the platform achieves its core goal: enabling joint data analysis without exposing raw data between parties — thereby meeting the demands of data collaboration with strong privacy protection, policy transparency, and regulatory compliance.
Motivation
In today's data-driven landscape, enterprises frequently need to collaborate on sensitive datasets to unlock mutual value. As a Massively Parallel Processing (MPP) database, Cloudberry provides a robust SQL engine and the scalability needed to support a DCR platform.
Operating Mode
DCRs on platforms like Snowflake and AWS enable secure multi-party data collaboration by leveraging cloud-based distributed storage and compute resources. Typically following a provider-consumer model, one party creates the DCR and invites others to join. Each participant contributes data alongside strict privacy rules (e.g., allowed join keys, filters, differential privacy). Users can then submit SQL queries, which are executed against protected, policy-enforced views. Results are returned only to authorized users, ensuring collaborative analysis without exposing raw data.
Implementation
Framework
Spring Boot Logic Layer -- Responsible for API entry, authentication and authorization, request validation, notifications, and DCR backend logic implementation. Spring Cloud Config Server or Vault will be used for managing role login passwords.
Python MPC Logic Layer -- Responsible for implementing higher-order MPC functionality for table joins.
Kafka Message Queue -- Responsible for secure communication between Spring Boot and Python MPC services, and providing better concurrency performance.
Cloudberry Data Layer -- Responsible for data storage, database role authorization, row and column-level access control, aggregation analysis, join, and other DCR analysis functionalities.
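Since the proposal does not specify the wire format of the Kafka messages exchanged between the Spring Boot backend and the Python MPC service, the following is only a minimal sketch of what such a message could look like; all field names (job_id, dcr_id, participant_ids, hashed_key_column) are assumptions, not part of the design.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative schema for a Spring Boot -> Python MPC matching request.
# Every field name here is an assumption; the proposal only states that
# the two services communicate through Kafka.
@dataclass
class MpcJoinRequest:
    job_id: str
    dcr_id: str
    participant_ids: list[str]   # users whose staged views take part in the match
    hashed_key_column: str       # name of the pre-hashed join-key column

def to_kafka_value(req: MpcJoinRequest) -> bytes:
    """Serialize the request as UTF-8 JSON, a common Kafka value format."""
    return json.dumps(asdict(req), sort_keys=True).encode("utf-8")

req = MpcJoinRequest("job-1", "dcr-42", ["u1", "u2"], "member_id_hash")
payload = to_kafka_value(req)
```

A JSON value keeps the Java and Python sides decoupled; a schema registry (e.g., Avro) could replace this once the message contract stabilizes.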
Cloudberry Data Layer
The data layer primarily consists of two parts: the user's own database and the created DCR database.
User's Own Database: Used for storing uploaded raw data. Initial row and column-level access restrictions, as well as hashing of sensitive join columns and IDs, are also performed within the user's own database. Only encrypted views are transferred from the user's database to the DCR database.
DCR Database: The DCR database has four layers (schemas): data-staging-schema, data-matching-schema, data-analysis-schema, data-result-schema.
-- data-staging-schema: Used for storing views input from user databases. In this layer, different users' views are logically isolated, preventing users from seeing each other's data views.
-- data-matching-schema: Used for storing the unified cross-reference-table for DCR participants, output from the MPC layer.
-- data-analysis-schema: Used for table joins and backend view generation when users execute SQL queries in the frontend.
-- data-result-schema: Used for storing users' SQL query results.
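The hashing of sensitive join columns and IDs in the user's own database could be sketched as follows. This assumes a keyed hash with a per-DCR shared secret so that equal values match across parties; the secret-distribution mechanism and column names are illustrative, not specified in the proposal.

```python
import hashlib
import hmac

# Illustrative sketch of hashing a sensitive join column before the view is
# exported to the DCR's data-staging-schema. The per-DCR secret is an
# assumption; the proposal only states that join keys and IDs are hashed
# inside the user's own database.
def hash_join_key(value: str, dcr_secret: bytes) -> str:
    """Keyed hash (HMAC-SHA256): equal raw values produce equal digests
    across parties, while the raw value stays unrecoverable without the
    shared secret."""
    return hmac.new(dcr_secret, value.encode("utf-8"), hashlib.sha256).hexdigest()

secret = b"per-dcr-shared-secret"   # distributed out of band to participants
row = {"email": "alice@example.com", "spend": 120}
staged_row = {
    "email_hash": hash_join_key(row["email"], secret),
    "spend": row["spend"],
}
```

A keyed hash (rather than plain SHA-256) prevents dictionary attacks on low-entropy join keys such as email addresses.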
Spring Boot Logic Layer
Role Summary
super_admin_role: This is the highest management role in the system, used to manage all users' data and the DCR database. It possesses SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, and other permissions on all databases.
User's Personal DB Management Related Roles:
user_{userId}_data_extractor_role: This is the individual user's management role. The user has SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, and DROP TABLE permissions for the databases and tables they create. No one can view or modify personal data without authorization.
DCR DB Related Roles:
dcr_admin_role: This is the DCR database management role, managed by the Spring Boot backend. It has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE permissions for all schemas within the DCR database.
dcr_data_staging_{userId}_role: This role is used by individual users to upload views into the DCR database. Each user has their own specific role, and users cannot access each other's data. It possesses INSERT and SELECT permissions on their own data tables within the data-staging-schema.
dcr_data_matching_role: Primarily used by the Spring Boot backend to manage and maintain the cross-reference-table within the data-matching-schema. It is responsible for receiving the unified cross-reference-table data output from the MPC layer and ensuring its correct storage and availability for subsequent analysis. It has SELECT, INSERT, UPDATE, and DELETE permissions for this schema, but cannot CREATE TABLE; tables must first be created by dcr_admin_role.
dcr_data_analysis_role: Primarily used by the Spring Boot backend to manage and maintain data within the data-analysis-schema. When joining tables, it fetches data from the data-staging-schema and data-matching-schema and generates the relevant joined views. It also uses this role to place SQL query results into the data-result-schema. Therefore, this role holds:
-- SELECT permissions on data-staging-schema and data-matching-schema.
-- CREATE VIEW, ALTER VIEW, DROP VIEW, SELECT, INSERT, UPDATE, and DELETE permissions on data-analysis-schema.
-- CREATE VIEW, ALTER VIEW, DROP VIEW, INSERT, UPDATE, and DELETE permissions on data-result-schema.
dcr_data_analyst_role: A role used by the Spring Boot backend specifically for executing user SQL queries. It has SELECT permissions on data within the data-analysis-schema.
dcr_data_result_{userId}_role: Users utilize this role to access their respective results. It has SELECT permission on their own results within the data-result-schema.
Spring Boot Internal Database
The Spring Boot internal database is used to store associated data required for frontend display.

There is a many-to-many relationship between the User entity and the CleanRoom entity, so the UserCleanRoom entity is introduced as a junction table. Within UserCleanRoom, the userStatus attribute indicates the role of that specific User within that CleanRoom (e.g., PROVIDER or CONSUMER).
A User has their own Database, which contains Tables, and Tables contain Columns. Since users need to manually select visible Columns when configuring row- and column-level access policies, a separate Column entity has been established. There is a many-to-many relationship between Column and Policy (a Column can be used in multiple Policy settings, and a single Policy's row/column access configuration can involve multiple Columns), so a PolicyColumn junction table is created. In PolicyColumn, the isAllowed attribute indicates whether a specific Column is permitted to be shared and visible within that Policy; the isJoinKey attribute indicates whether this Column is designated as a join column within that Policy. The detailed policy content is stored as a JSON-formatted policyRules attribute within the Policy instance.
In my design, Policy, View, and Notification all depend on the existence of a DCR; if a DCR does not exist, these three entities do not exist either. Furthermore, within the same DCR there can be different Users, and each User can construct different Views, Policys, and Notifications (as the sender of the notification differs, and the DCR joined also differs). Therefore, I have connected these three entities to the UserCleanRoom entity.
A View is constructed from a User's Table according to the Policy set by that User; View has a many-to-one relationship with both Table and Policy.
The Notification entity needs to be connected to the User entity because a User must receive the notification, so the relationship from Notification to User is many-to-one. The status attribute of Notification represents the state of the notification, such as whether it has been sent, whether the recipient has agreed to join the DCR, and other enumeration states.
Required Third-Party Libraries/Plugins
Function Points
User Registration & Login -- setup & login
User Database Creation
User Uploads CSV/JSON Files: Do we need a temporary storage path in the cloud accessible by the backend server? Cloudberry will fetch CSV/JSON files from this reachable address and load them into Cloudberry via gpload.
Create DCR
Join DCR: When the initiator of a DCR invites someone, FCM sends a notification to that user; after accepting the invitation, the user joins the DCR.
Row- and Column-Level Access Control: Use Cloudberry to enforce the row- and column-level access rules defined in the user's policy.
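Assuming Cloudberry inherits PostgreSQL-style row-level security (CREATE POLICY) and view-based column projection from its PostgreSQL lineage, the backend could render the access-control DDL from a user's policy roughly as follows. The helper and all table, role, and column names are illustrative only.

```python
# Illustrative helpers rendering access-control DDL from a user's policy.
# Assumes PostgreSQL-style RLS is available in Cloudberry; names are examples.
def render_rls_policy(table: str, role: str, row_filter: str) -> str:
    """Row-level control: restrict which rows `role` may SELECT."""
    return (
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;\n"
        f"CREATE POLICY {table.replace('.', '_')}_share ON {table}\n"
        f"    FOR SELECT TO {role}\n"
        f"    USING ({row_filter});"
    )

def render_view(view: str, table: str, allowed_columns: list[str]) -> str:
    """Column-level control: the exported view projects only allowed columns."""
    cols = ", ".join(allowed_columns)
    return f"CREATE VIEW {view} AS SELECT {cols} FROM {table};"

ddl = render_rls_policy("orders", "dcr_data_analysis_role", "region = 'EU'")
view_sql = render_view("orders_shared", "orders", ["email_hash", "spend"])
```

In production the filter expression would have to be validated rather than interpolated directly, to avoid SQL injection through policy definitions.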
Join
Users configure the join columns in the frontend. Based on these parameters, the backend hashes and encrypts the join-column values specified in the policy, along with the IDs of the respective tables.
Subsequently, by switching to the dcr_data_staging_{userId}_role, the hashed data is ingested into the data-staging-schema of the DCR database using the same gpload method as for user-uploaded CSV/JSON files.
Spring Boot then transfers this data from dcr_data_staging_{userId}_role to the MPC layer through the message queue.
After the cross-reference-table is generated by the MPC module, Spring Boot switches to the dcr_data_matching_role to store the cross-reference-table in the data-matching-schema.
Upon receiving the user's query SQL from the frontend, a third-party SQL parser plugin extracts the table names involved in the user's join query. Spring Boot then switches to the dcr_data_analysis_role and executes dynamically generated join SQL to join the target tables via the cross-reference-table and the views in the data-staging-schema, storing the result in the data-analysis-schema.
Next, Spring Boot switches to the dcr_data_analyst_role and performs a SELECT query on the target view in the data-analysis-schema based on the dynamically generated SQL.
Finally, Spring Boot switches to the dcr_data_result_{userId}_role to store the generated result in the data-result-schema.
Aggregation: The backend logic checks whether the user's SQL query contains aggregation functions that are not allowed. Only compliant SQL queries can be executed by the SQL engine.
User SQL Query
The DCR will provide users with pre-defined queryable tables, columns, and aggregation rules. Users can then submit custom SQL query requests through the DCR frontend based on these established rules.
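Once such a query has been validated and executed, the proposal adds differential-privacy noise to the aggregated results before returning them. A minimal Laplace-mechanism sketch follows; the sensitivity and epsilon values are illustrative, as the proposal does not fix concrete parameters.

```python
import math
import random

# Minimal Laplace-mechanism sketch of the differential-privacy step the
# proposal applies to aggregated query results. Parameter choices here
# are illustrative, not part of the design.
def add_laplace_noise(value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Add Laplace(0, sensitivity/epsilon) noise via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    u = max(u, -0.5 + 1e-12)   # avoid log(0) at the boundary
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return value + noise

# Perturb an aggregate (e.g., a COUNT query has sensitivity 1) before returning it.
noisy_count = add_laplace_noise(1000.0, sensitivity=1.0, epsilon=1.0,
                                rng=random.Random())
```

Smaller epsilon gives stronger privacy at the cost of noisier results; a real deployment would also track a per-user privacy budget across repeated queries.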
JSqlParser is used to parse the user's query SQL, verifying whether the aggregation functions used in the SQL (such as SUM, COUNT, AVG) are within the list permitted by the DCR's policies.
Once query validation succeeds, the Spring Boot backend uses the permissions of the dcr_data_analyst_role to execute the query within the DCR database. User queries are performed against existing analytical views within the data-analysis-schema; these views already encapsulate the join logic with the cross-reference-table and the view data from the various parties (originating from the data-staging-schema). The database then returns the aggregated results to the Spring Boot backend. After the query completes, the backend applies differential privacy (adding noise) to these aggregated results to further protect against inference of data details.
The Spring Boot backend, by switching to the permissions of the dcr_data_analysis_role, stores these results in the user's dedicated result table (or view) within the data-result-schema, making them available for subsequent on-demand access.
Rollout/Adoption Plan
No response
Are you willing to submit a PR?