cellarr-frame provides a high-level, Pandas-like interface for interacting with TileDB DataFrames.
pip install cellarr-frame

You can create a new persistent CellArrayFrame directly from a Pandas DataFrame.
import pandas as pd
import shutil
from cellarr_frame import CellArrayFrame
# Prepare some data
df = pd.DataFrame({
    "name": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "expression": [12.5, 0.0, 5.2, 8.1],
    "category": ["coding", "non-coding", "coding", "coding"],
})
df.index.name = "row_id"
# Create the TileDB array at the specified URI
uri = "./my_cellarr_frame"
# Remove any existing array at this URI so the example starts clean
shutil.rmtree(uri, ignore_errors=True)
# Create with sparse=True to allow flexible appending and querying
CellArrayFrame.create(uri, df, sparse=True, full_domain=True)

Open the frame and slice rows using standard Python syntax.
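
The printed result of `cf[0:2]` in this section's example holds rows 0 and 1, which matches Python's half-open slice convention. For readers used to pandas, the sketch below contrasts that with label-based slicing, which includes the stop label. It runs on an in-memory DataFrame only; it says nothing about how cellarr-frame implements slicing internally.

```python
import pandas as pd

# In-memory copy of part of the example data; no TileDB array is involved.
df = pd.DataFrame(
    {
        "name": ["GeneA", "GeneB", "GeneC", "GeneD"],
        "expression": [12.5, 0.0, 5.2, 8.1],
    }
)
df.index.name = "row_id"

# Positional slice: half-open, so the stop position (2) is excluded.
positional = df.iloc[0:2]

# Label slice: closed, so the stop label (2) is included.
by_label = df.loc[0:2]

print(len(positional), len(by_label))  # prints "2 3"
```

The two-row result is the one that lines up with the `cf[0:2]` output shown in this document.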
cf = CellArrayFrame(uri=uri)
# Slice the first 2 rows
# Returns a Pandas DataFrame
print(cf[0:2])
#          name  expression    category
# row_id
# 0       GeneA        12.5      coding
# 1       GeneB         0.0  non-coding

Optimize performance by selecting only specific columns.
# Select only 'name' and 'expression' for the first row
print(cf[0:1, ["name", "expression"]])

Filter data using string conditions. Filtering happens at the storage layer, so only matching rows are returned to Python, which keeps queries efficient on large datasets.
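
The condition strings used in this section read much like pandas `DataFrame.query` expressions. Whether cellarr-frame accepts exactly the same grammar is an assumption that is not confirmed here. The sketch below only shows which rows the two conditions select from an in-memory copy of the example data. Unlike a storage-layer filter, pandas evaluates the condition after the whole table has been loaded.

```python
import pandas as pd

# In-memory copy of the example data; no TileDB array is involved.
df = pd.DataFrame(
    {
        "name": ["GeneA", "GeneB", "GeneC", "GeneD"],
        "expression": [12.5, 0.0, 5.2, 8.1],
        "category": ["coding", "non-coding", "coding", "coding"],
    }
)
df.index.name = "row_id"

# Rows whose expression value is strictly greater than 5.0.
high_expr = df.query("expression > 5.0")

# Filter on a string column, then keep only the 'name' column.
coding_names = df.query("category == 'coding'")[["name"]]

print(high_expr)
print(coding_names)
```

With the four example genes, both conditions select GeneA, GeneC, and GeneD: GeneB has an expression of 0.0 and is the only non-coding entry.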
# Select all rows where expression is greater than 5.0
high_expr = cf["expression > 5.0"]
print(high_expr)
# Combine queries with column selection
# Get names of all 'coding' genes
coding_genes = cf["category == 'coding'", ["name"]]
print(coding_genes)

Append new batches of data to the existing array.
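
Each appended batch needs `row_id` values that continue from the rows already stored. A minimal sketch of a helper for that, using only pandas; `with_continued_index` and its parameters are hypothetical names for illustration and are not part of cellarr-frame:

```python
import pandas as pd


def with_continued_index(batch, n_existing, index_name="row_id"):
    """Return a copy of `batch` whose index starts at `n_existing`.

    Hypothetical helper: `n_existing` is the number of rows already stored,
    supplied by the caller.
    """
    out = batch.copy()
    out.index = pd.RangeIndex(n_existing, n_existing + len(out), name=index_name)
    return out


new_data = pd.DataFrame(
    {"name": ["GeneE"], "expression": [99.9], "category": ["coding"]}
)

# Four rows (0..3) were written earlier, so the next free row_id is 4.
batch = with_continued_index(new_data, n_existing=4)
print(batch.index.tolist())  # prints "[4]"
```

The helper returns a copy, so the caller's DataFrame keeps its original index. A stored-row count could be passed as `n_existing` instead of the literal 4, assuming the frame exposes one that reflects the rows written so far.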
new_data = pd.DataFrame({
    "name": ["GeneE"],
    "expression": [99.9],
    "category": ["coding"],
})
# Ensure the index continues correctly
new_data.index = [4]
new_data.index.name = "row_id"
# Append to the array
cf.write_batch(new_data)
# Verify the new total count
print(f"Total rows: {cf.shape[0]}")

This project has been set up using BiocSetup and PyScaffold.