Commit c7a0893

FEAT: Add GuepardDataFrame for automated version tracking and rollback functionality
1 parent b64f438 commit c7a0893

2 files changed, +84 -0 lines changed

guepard_pandas/guepard_dataframe.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
import io
from datetime import datetime

import pandas as pd
import requests


class GuepardDataFrame(pd.DataFrame):
    # Custom attributes that pandas should carry over to derived frames.
    _metadata = ["api_url", "dataset_id"]

    def __init__(self, *args, dataset_id="default", **kwargs):
        # dataset_id is a separate keyword so it is not forwarded to
        # pd.DataFrame.__init__, which rejects unknown keyword arguments.
        super().__init__(*args, **kwargs)
        self.api_url = "https://api.guepard.com"
        self.dataset_id = dataset_id

    @property
    def _constructor(self):
        # Keep the subclass when pandas operations build a new frame.
        return GuepardDataFrame

    def commit(self, message=""):
        version_id = self._generate_version_id()
        # With no path argument, to_parquet() returns the file contents as bytes.
        data = self.to_parquet()
        response = requests.post(
            f"{self.api_url}/datasets/{self.dataset_id}/versions",
            files={"data": data},
            data={"message": message, "version_id": version_id},
        )
        response.raise_for_status()
        return version_id

    def list_versions(self):
        response = requests.get(f"{self.api_url}/datasets/{self.dataset_id}/versions")
        response.raise_for_status()
        return response.json()

    def rollback(self, version_id):
        response = requests.get(f"{self.api_url}/datasets/{self.dataset_id}/versions/{version_id}")
        response.raise_for_status()
        # read_parquet expects a path or file-like object, so wrap the raw bytes.
        df = pd.read_parquet(io.BytesIO(response.content))
        # Re-initialise in place, keeping this frame's dataset_id.
        self.__init__(df, dataset_id=self.dataset_id)

    def next_version(self):
        return self.commit()

    def _generate_version_id(self):
        return datetime.now().strftime("%Y%m%d_%H%M%S")
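
`_generate_version_id` has one-second resolution, so two commits made within the same second would receive the same ID. Below is a minimal sketch of one way to avoid that. It assumes the Guepard API accepts arbitrary string IDs, which this commit does not confirm. `generate_version_id` is a hypothetical standalone helper, and the switch to UTC is also an assumption rather than something the commit does.

```python
import uuid
from datetime import datetime, timezone


def generate_version_id(now=None):
    """Return a timestamp-based ID with a random suffix (hypothetical helper)."""
    # Assumption: UTC is used so IDs created on different machines sort consistently.
    now = now or datetime.now(timezone.utc)
    # Same "%Y%m%d_%H%M%S" prefix as the commit, plus 8 random hex characters.
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
```

IDs produced this way still start with the timestamp, so they sort chronologically, but they no longer match the bare `20240326_123456` form exactly.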

guepard_pandas/readme.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Guepard-Pandas Wrapper

## Introduction
The Guepard-Pandas Wrapper extends the Pandas DataFrame with Guepard’s data versioning capabilities. Data engineers keep using DataFrames as usual while the wrapper tracks versions, supports rollback, and maintains historical snapshots.

## Features
- Automated version tracking for DataFrames.
- Rollback to previous states.
- Integration with Guepard for efficient storage and retrieval.

## Example Usage
```python
import pandas as pd
from guepard_pandas.guepard_dataframe import GuepardDataFrame

# Load a DataFrame
df = GuepardDataFrame(pd.read_csv("data.csv"), dataset_id="1234")

# Modify it
df["new_col"] = df["existing_col"] * 2

# Commit the changes
df.commit("Added new column")

# List versions
print(df.list_versions())

# Roll back to an older version
df.rollback(version_id="20240326_123456")
```

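`rollback` needs a version ID, and the IDs produced by this commit are timestamps in `%Y%m%d_%H%M%S` form. Below is a sketch of choosing the most recent version at or before a given moment. It assumes `list_versions()` yields plain ID strings, although the response shape is not specified in this commit. `latest_version_before` is a hypothetical helper.

```python
from datetime import datetime

VERSION_FORMAT = "%Y%m%d_%H%M%S"  # format used by _generate_version_id


def latest_version_before(version_ids, cutoff):
    """Return the newest ID not later than cutoff, or None if there is none."""
    candidates = []
    for version_id in version_ids:
        try:
            created = datetime.strptime(version_id, VERSION_FORMAT)
        except ValueError:
            continue  # skip IDs that do not follow the timestamp format
        if created <= cutoff:
            candidates.append((created, version_id))
    # Tuples compare by timestamp first, so max() picks the newest candidate.
    return max(candidates)[1] if candidates else None
```

The result could then be passed to `df.rollback(...)`, after checking that it is not `None`.
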
## Implementation Plan
1. Prototype Development
   - Extend `pd.DataFrame` with versioning methods.
   - Implement basic version storage using Parquet or Pickle.

2. Integration with Guepard API
   - Store versions directly in Guepard’s data management system.
   - Optimize performance for large DataFrames.

3. Testing & Optimization
   - Benchmark storage and retrieval performance.
   - Validate Pandas compatibility.

45+
## Conclusion
46+
This wrapper offers an elegant solution to integrate version control within Pandas using Guepard, enhancing data engineering workflows while maintaining full compatibility with Pandas.
47+
48+
Next Steps: Review feedback and develop a proof of concept.
