feat: implement comprehensive dataset management functionality (#29)
Patrick-Ehimen merged 3 commits into main
Conversation
Reviewer's Guide

This PR introduces end-to-end dataset management support by extending the SDK, service layer, and MCP server tooling. The SDK wrapper gains create/update/get/list/delete dataset methods with progress tracking and versioning; the real and mock LighthouseService implementations are updated with dataset storage, caching, and conversion between SDK and service models; new MCP tools are registered and implemented for dataset operations; and corresponding types and test scaffolding are added.

Sequence diagram for dataset creation via MCP tool and service:

sequenceDiagram
actor User
participant MCPServer
participant LighthouseCreateDatasetTool
participant ILighthouseService
participant LighthouseAISDK
User->>MCPServer: Request to create dataset
MCPServer->>LighthouseCreateDatasetTool: execute(args)
LighthouseCreateDatasetTool->>ILighthouseService: createDataset(params)
ILighthouseService->>LighthouseAISDK: createDataset(filePaths, options)
LighthouseAISDK-->>ILighthouseService: DatasetInfo
ILighthouseService-->>LighthouseCreateDatasetTool: Dataset
LighthouseCreateDatasetTool-->>MCPServer: ToolResult
MCPServer-->>User: Dataset creation result
Sequence diagram for updating a dataset via MCP tool and service:

sequenceDiagram
actor User
participant MCPServer
participant LighthouseUpdateDatasetTool
participant ILighthouseService
participant LighthouseAISDK
User->>MCPServer: Request to update dataset
MCPServer->>LighthouseUpdateDatasetTool: execute(args)
LighthouseUpdateDatasetTool->>ILighthouseService: updateDataset(params)
ILighthouseService->>LighthouseAISDK: updateDataset(datasetId, options)
LighthouseAISDK-->>ILighthouseService: DatasetInfo
ILighthouseService-->>LighthouseUpdateDatasetTool: Dataset
LighthouseUpdateDatasetTool-->>MCPServer: ToolResult
MCPServer-->>User: Dataset update result
Class diagram for new and updated dataset types:

classDiagram
class DatasetOptions {
+string name
+string description
+boolean encrypt
+Record<string, any> metadata
+string[] tags
+onProgress(progress: ProgressInfo)
}
class DatasetInfo {
+string id
+string name
+string description
+string[] files
+string version
+Date createdAt
+Date updatedAt
+boolean encrypted
+Record<string, any> metadata
+string[] tags
+number totalSize
+number fileCount
}
class ListDatasetsResponse {
+DatasetInfo[] datasets
+number total
+boolean hasMore
+string cursor
}
DatasetOptions --> ProgressInfo
ListDatasetsResponse --> DatasetInfo
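Rendered as TypeScript, the shapes in the diagram above might look like the following sketch; which fields are optional (for example `cursor` and `onProgress`) is an assumption, since the diagram does not mark optionality.

```typescript
// Sketch of the dataset types from the class diagram; optionality and
// the exact shape of ProgressInfo are assumptions for illustration.
export interface ProgressInfo {
  loaded: number; // assumed: bytes transferred so far
  total: number;  // assumed: total bytes to transfer
}

export interface DatasetOptions {
  name: string;
  description?: string;
  encrypt?: boolean;
  metadata?: Record<string, any>;
  tags?: string[];
  onProgress?: (progress: ProgressInfo) => void;
}

export interface DatasetInfo {
  id: string;
  name: string;
  description: string;
  files: string[]; // CIDs of member files
  version: string;
  createdAt: Date;
  updatedAt: Date;
  encrypted: boolean;
  metadata: Record<string, any>;
  tags: string[];
  totalSize: number;
  fileCount: number;
}

export interface ListDatasetsResponse {
  datasets: DatasetInfo[];
  total: number;
  hasMore: boolean;
  cursor?: string;
}
```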
Class diagram for the ILighthouseService interface (updated):

classDiagram
class ILighthouseService {
+createDataset(params): Promise<Dataset>
+updateDataset(params): Promise<Dataset>
+getDataset(datasetId): Promise<Dataset | undefined>
+listDatasets(params): Promise<DatasetsListResult>
+deleteDataset(datasetId, deleteFiles): Promise<void>
+clear(): void
}
class Dataset {
+string id
+string name
+string description
+UploadResult[] files
+Record<string, any> metadata
+string version
+Date createdAt
+Date updatedAt
+boolean encrypted
+AccessCondition[] accessConditions
}
class DatasetsListResult {
+Dataset[] datasets
+number total
+boolean hasMore
}
ILighthouseService --> Dataset
ILighthouseService --> DatasetsListResult
DatasetsListResult --> Dataset
Class diagram for LighthouseAISDK dataset management methods:

classDiagram
class LighthouseAISDK {
+createDataset(filePaths, options): Promise<DatasetInfo>
+updateDataset(datasetId, options): Promise<DatasetInfo>
+getDataset(datasetId): Promise<DatasetInfo>
+listDatasets(limit, offset): Promise<ListDatasetsResponse>
+deleteDataset(datasetId, deleteFiles): Promise<void>
}
LighthouseAISDK --> DatasetInfo
LighthouseAISDK --> ListDatasetsResponse
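A hypothetical call sequence against the methods listed above might look like this; the SDK instance is taken as a structurally typed parameter so the sketch does not assume constructor details, and the option names mirror DatasetOptions from the earlier diagram.

```typescript
// Hypothetical usage sketch; SdkLike is an assumed structural subset of
// LighthouseAISDK, not the SDK's real type.
type SdkLike = {
  createDataset(filePaths: string[], options: Record<string, unknown>): Promise<{ id: string }>;
  listDatasets(limit: number, offset: number): Promise<{ datasets: unknown[]; hasMore: boolean }>;
};

async function createAndList(sdk: SdkLike): Promise<{ id: string; firstPage: number }> {
  const info = await sdk.createDataset(["./data/train.csv"], {
    name: "training-data",
    description: "example dataset",
    encrypt: false,
  });
  // First page of up to 10 datasets (limit = 10, offset = 0).
  const page = await sdk.listDatasets(10, 0);
  return { id: info.id, firstPage: page.datasets.length };
}
```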
Class diagram for the new MCP dataset tools:

classDiagram
class LighthouseCreateDatasetTool {
+execute(args): Promise<ProgressAwareToolResult>
+static getDefinition(): MCPToolDefinition
}
class LighthouseListDatasetsTool {
+execute(args): Promise<ProgressAwareToolResult>
+static getDefinition(): MCPToolDefinition
}
class LighthouseGetDatasetTool {
+execute(args): Promise<ProgressAwareToolResult>
+static getDefinition(): MCPToolDefinition
}
class LighthouseUpdateDatasetTool {
+execute(args): Promise<ProgressAwareToolResult>
+static getDefinition(): MCPToolDefinition
}
LighthouseCreateDatasetTool --> MCPToolDefinition
LighthouseListDatasetsTool --> MCPToolDefinition
LighthouseGetDatasetTool --> MCPToolDefinition
LighthouseUpdateDatasetTool --> MCPToolDefinition
Hey there - I've reviewed your changes - here's some feedback:
- There’s a lot of duplicated logic mapping SDK’s DatasetInfo to your internal Dataset type across the service layer—consider extracting a shared converter to reduce repetition and prevent drift.
- I noticed you registered create, list, get, and update dataset tools but did not add a deleteDataset tool to the registry—adding that would round out the full CRUD support.
- You’re generating dataset IDs and version strings inline in multiple methods; pulling that into a shared utility or factory would keep things consistent and make future changes easier.
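The shared converter suggested in the first point might look like the sketch below; the field names follow the diagrams earlier in this review, while the exact service-side file shape is an assumption.

```typescript
// Hypothetical shared converter from the SDK's DatasetInfo to the
// service-layer Dataset, so every service method reuses one mapping.
interface SdkDatasetInfo {
  id: string; name: string; description: string;
  files: string[]; version: string;
  createdAt: Date; updatedAt: Date; encrypted: boolean;
  metadata: Record<string, any>;
}

interface ServiceDataset {
  id: string; name: string; description: string;
  files: { cid: string; size: number; encrypted: boolean }[];
  metadata: Record<string, any>; version: string;
  createdAt: Date; updatedAt: Date; encrypted: boolean;
}

export function toServiceDataset(info: SdkDatasetInfo): ServiceDataset {
  return {
    id: info.id,
    name: info.name,
    description: info.description || "",
    // Size is unknown at this layer; 0 is a documented placeholder.
    files: info.files.map((cid) => ({ cid, size: 0, encrypted: info.encrypted })),
    metadata: info.metadata,
    version: info.version,
    createdAt: info.createdAt,
    updatedAt: info.updatedAt,
    encrypted: info.encrypted,
  };
}
```

With this in place, create, update, and get can each call `toServiceDataset` instead of repeating the mapping inline.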
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There’s a lot of duplicated logic mapping SDK’s DatasetInfo to your internal Dataset type across the service layer—consider extracting a shared converter to reduce repetition and prevent drift.
- I noticed you registered create, list, get, and update dataset tools but did not add a deleteDataset tool to the registry—adding that would round out the full CRUD support.
- You’re generating dataset IDs and version strings inline in multiple methods; pulling that into a shared utility or factory would keep things consistent and make future changes easier.
## Individual Comments
### Comment 1
<location> `apps/mcp-server/src/services/LighthouseService.ts:399-401` </location>
<code_context>
+ id: datasetInfo.id,
+ name: datasetInfo.name,
+ description: datasetInfo.description || "",
+ files: datasetInfo.files.map((hash) => ({
+ cid: hash,
+ size: 0, // Would need to fetch individual file info
+ encrypted: datasetInfo.encrypted,
+ accessConditions: params.accessConditions,
</code_context>
<issue_to_address>
**suggestion:** File size is set to 0 for all dataset files, which may mislead consumers.
If file size is needed, fetch or calculate it during dataset creation. Otherwise, indicate in documentation or types that size is unavailable.
Suggested implementation:
```typescript
files: datasetInfo.files.map((hash) => ({
cid: hash,
// File size is currently unavailable; set to 0 as a placeholder.
// Update this logic to fetch actual file size if/when available.
size: 0,
encrypted: datasetInfo.encrypted,
accessConditions: params.accessConditions,
tags: params.tags,
uploadedAt: datasetInfo.createdAt,
originalPath: "",
hash: hash,
})),
```
If you have a type definition for the file object (e.g., `DatasetFile`), update its documentation to clarify that `size` may be unavailable and is set to 0 as a placeholder. For example:
```typescript
// In types/DatasetFile.ts or similar
/**
* File size in bytes. May be 0 if unavailable.
*/
size: number;
```
If you later implement logic to fetch the actual file size, replace the placeholder with the real value.
</issue_to_address>
### Comment 2
<location> `apps/mcp-server/src/services/LighthouseService.ts:403` </location>
<code_context>
+ cid: hash,
+ size: 0, // Would need to fetch individual file info
+ encrypted: datasetInfo.encrypted,
+ accessConditions: params.accessConditions,
+ tags: params.tags,
+ uploadedAt: datasetInfo.createdAt,
</code_context>
<issue_to_address>
**suggestion:** Access conditions are passed to the Dataset object but not handled in updateDataset.
If updates to access conditions are needed, add support in updateDataset; otherwise, document that access conditions cannot be changed after creation.
Suggested implementation:
```typescript
/**
* Updates a dataset.
* Note: accessConditions can be updated via this function.
*/
async function updateDataset(datasetId: string, updateParams: Partial<Dataset>) {
// Fetch the existing dataset
const dataset = await getDatasetById(datasetId);
// Update allowed fields
if (updateParams.name !== undefined) dataset.name = updateParams.name;
if (updateParams.description !== undefined) dataset.description = updateParams.description;
if (updateParams.tags !== undefined) dataset.tags = updateParams.tags;
if (updateParams.metadata !== undefined) dataset.metadata = updateParams.metadata;
// Support updating accessConditions
if (updateParams.accessConditions !== undefined) {
dataset.accessConditions = updateParams.accessConditions;
// If files also need accessConditions updated, update them as well
if (dataset.files) {
dataset.files = dataset.files.map(file => ({
...file,
accessConditions: updateParams.accessConditions
}));
}
}
// Save the updated dataset
await saveDataset(dataset);
return dataset;
}
```
```typescript
/**
* Updates a dataset.
* Note: accessConditions cannot be changed after creation.
*/
async function updateDataset(datasetId: string, updateParams: Partial<Dataset>) {
// Fetch the existing dataset
const dataset = await getDatasetById(datasetId);
// Update allowed fields
if (updateParams.name !== undefined) dataset.name = updateParams.name;
if (updateParams.description !== undefined) dataset.description = updateParams.description;
if (updateParams.tags !== undefined) dataset.tags = updateParams.tags;
if (updateParams.metadata !== undefined) dataset.metadata = updateParams.metadata;
// Do NOT allow updating accessConditions after creation
// If updateParams.accessConditions is present, ignore it and document this behavior
// Save the updated dataset
await saveDataset(dataset);
return dataset;
}
```
If you choose Option 1, ensure that the `Dataset` type/interface includes `accessConditions` as an updatable field.
If you choose Option 2, you may want to throw an error or log a warning if `updateParams.accessConditions` is provided, to make the restriction explicit.
</issue_to_address>
### Comment 3
<location> `apps/mcp-server/src/services/MockLighthouseService.ts:479-481` </location>
<code_context>
+
+ // Optionally delete associated files
+ if (deleteFiles) {
+ for (const file of dataset.files) {
+ this.fileStore.delete(file.cid);
+ this.currentStorageSize -= file.size;
+ }
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Deleting files from fileStore may result in negative currentStorageSize.
Add a check to ensure currentStorageSize does not drop below zero when deleting files, accounting for possible inaccuracies in file sizes or repeated deletions.
</issue_to_address>
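A minimal guard along these lines could address the issue; `fileStore` and `currentStorageSize` mirror the fields in the excerpt above, and the class wrapper is only scaffolding for the sketch.

```typescript
// Sketch: delete dataset files while keeping currentStorageSize non-negative.
class MockStorage {
  fileStore = new Map<string, { size: number }>();
  currentStorageSize = 0;

  deleteFiles(files: { cid: string; size: number }[]): void {
    for (const file of files) {
      // Map.delete returns false if the entry was already gone, so a
      // repeated deletion never adjusts the counter; the clamp also
      // absorbs any inaccuracies in recorded file sizes.
      if (this.fileStore.delete(file.cid)) {
        this.currentStorageSize = Math.max(0, this.currentStorageSize - file.size);
      }
    }
  }
}
```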
### Comment 4
<location> `apps/mcp-server/src/tools/LighthouseCreateDatasetTool.ts:128` </location>
<code_context>
+ return "filePaths is required and must be a non-empty array";
+ }
+
+ if (params.filePaths.length > 50) {
+ return "Maximum 50 files allowed per dataset";
+ }
</code_context>
<issue_to_address>
**suggestion:** Maximum file count per dataset is hardcoded to 50.
Consider moving this limit to a constant or configuration file if it may change in the future.
Suggested implementation:
```typescript
const MAX_FILES_PER_DATASET = 50;
if (params.filePaths.length > MAX_FILES_PER_DATASET) {
return `Maximum ${MAX_FILES_PER_DATASET} files allowed per dataset`;
}
```
If you have a central configuration file for such constants, consider moving `MAX_FILES_PER_DATASET` there and importing it instead of defining it locally.
</issue_to_address>
### Comment 5
<location> `apps/mcp-server/src/tools/LighthouseListDatasetsTool.ts:136-152` </location>
<code_context>
+ description: dataset.description,
+ version: dataset.version,
+ fileCount: dataset.files.length,
+ totalSize: dataset.files.reduce((sum, file) => sum + file.size, 0),
+ encrypted: dataset.encrypted,
+ tags: dataset.metadata?.keywords || [],
</code_context>
<issue_to_address>
**suggestion:** Total size calculation may be inaccurate if file sizes are not set.
If file sizes are missing or zero, either exclude totalSize from the response or clearly document its limitations.
```suggestion
datasets: response.datasets.map((dataset) => {
// Check if all files have a valid size
const allFilesHaveValidSize = dataset.files.every(
(file) => typeof file.size === 'number' && file.size > 0
);
const datasetResponse: any = {
id: dataset.id,
name: dataset.name,
description: dataset.description,
version: dataset.version,
fileCount: dataset.files.length,
encrypted: dataset.encrypted,
tags: dataset.metadata?.keywords || [],
createdAt: dataset.createdAt.toISOString(),
updatedAt: dataset.updatedAt.toISOString(),
metadata: {
author: dataset.metadata?.author,
license: dataset.metadata?.license,
category: dataset.metadata?.category,
},
};
// Only include totalSize if all files have a valid size
if (allFilesHaveValidSize) {
// Note: totalSize is only accurate if all file sizes are set and non-zero
datasetResponse.totalSize = dataset.files.reduce((sum, file) => sum + file.size, 0);
}
return datasetResponse;
}),
```
</issue_to_address>
### Comment 6
<location> `apps/mcp-server/src/__tests__/dataset-tools.test.ts:4` </location>
<code_context>
+/**
+ * Dataset Tools Tests
+ */
+
+import { describe, it, expect, beforeEach, afterEach, vi } from "vitest";
+import { Logger } from "@lighthouse-tooling/shared";
</code_context>
<issue_to_address>
**issue (testing):** No test cases found for dataset management tools.
Please add thorough unit and integration tests for all new dataset management tools and service methods, covering core functionality, edge cases, and error handling, especially for file operations, metadata, and access control.
</issue_to_address>
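The requested coverage might start from the validation rules quoted elsewhere in this review (non-empty `filePaths`, at most 50 files). `validateCreateArgs` below is a hypothetical stand-in for the tool's real validator, shown with plain assertions so the sketch stays self-contained; in the repo these would become vitest `describe`/`it` cases.

```typescript
// Hypothetical validator mirroring the create-dataset rules in the review.
const MAX_FILES_PER_DATASET = 50;

function validateCreateArgs(params: { filePaths?: string[] }): string | null {
  if (!params.filePaths || params.filePaths.length === 0) {
    return "filePaths is required and must be a non-empty array";
  }
  if (params.filePaths.length > MAX_FILES_PER_DATASET) {
    return `Maximum ${MAX_FILES_PER_DATASET} files allowed per dataset`;
  }
  return null; // valid input
}

// Behaviors a real test suite should pin down:
console.assert(validateCreateArgs({}) !== null, "missing filePaths rejected");
console.assert(validateCreateArgs({ filePaths: [] }) !== null, "empty array rejected");
console.assert(validateCreateArgs({ filePaths: ["a.csv"] }) === null, "valid input accepted");
console.assert(validateCreateArgs({ filePaths: Array(51).fill("f") }) !== null, "over-limit rejected");
```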
Description
This PR implements comprehensive dataset management functionality for the Lighthouse AI integration system. The feature enables AI agents to create, manage, and organize collections of files with metadata, versioning, and access control through the MCP protocol.
Key Features
SDK Wrapper Enhancements
New MCP Tools
Service Layer Improvements
The implementation maintains backward compatibility while adding powerful new dataset management capabilities for AI agents.
Summary by Sourcery
Implement comprehensive dataset management support across the SDK, service layer, and MCP tools, enabling AI agents to manage collections of files with rich metadata, pagination, versioning, and access control.