Skip to content
Open
Show file tree
Hide file tree
Changes from 35 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
b422b80
webdataset schema finished, test cases not working
valerie-cal Mar 10, 2025
8184a1e
Merge branch '2.0' of https://github.com/nsaadhvi/codebase-deltacat i…
valerie-cal Mar 13, 2025
44e32a8
wds schema pytest and wds schema logic
valerie-cal Mar 13, 2025
fe348f8
edits to building wds schema
valerie-cal Mar 15, 2025
f8bf30f
Initial implementation of from_webdataset
valerie-cal Mar 31, 2025
7c5df3e
demo to load web data set
nsaadhvi Apr 3, 2025
7c8f880
fixed error with reading tar files and added sample tar files
valerie-cal Apr 3, 2025
dc348f3
finished pytests
valerie-cal Apr 4, 2025
1ee3ba2
Merge branch 'wds-schema' of https://github.com/nsaadhvi/codebase-del…
nsaadhvi Apr 4, 2025
0111b62
inconsistent jsons pytest added
valerie-cal Apr 4, 2025
124a3f6
comment out failing test
valerie-cal Apr 4, 2025
6ebcf18
restored failing test case
valerie-cal Apr 7, 2025
a843e07
Merge branch 'wds-schema' of https://github.com/nsaadhvi/codebase-del…
nsaadhvi Apr 7, 2025
5a060ac
create datasets in tmp_path
valerie-cal Apr 7, 2025
d8af0b4
added optional user batch_size input
nsaadhvi Apr 24, 2025
79c5cda
Merge branch 'wds-schema' of https://github.com/nsaadhvi/codebase-del…
nsaadhvi Apr 24, 2025
3511e5f
bird classification web data set demo
nsaadhvi Apr 24, 2025
4e642a0
webdataset schema finished, test cases not working
valerie-cal Mar 10, 2025
925fa67
wds schema pytest and wds schema logic
valerie-cal Mar 13, 2025
3c6ced0
edits to building wds schema
valerie-cal Mar 15, 2025
4900233
Initial implementation of from_webdataset
valerie-cal Mar 31, 2025
0e74062
fixed error with reading tar files and added sample tar files
valerie-cal Apr 3, 2025
d383439
finished pytests
valerie-cal Apr 4, 2025
a7e5b32
inconsistent jsons pytest added
valerie-cal Apr 4, 2025
2accfe9
comment out failing test
valerie-cal Apr 4, 2025
4b239e6
restored failing test case
valerie-cal Apr 7, 2025
ccc5440
create datasets in tmp_path
valerie-cal Apr 7, 2025
c328f2b
add image binary column to dataset and schema
valerie-cal May 1, 2025
acbddd4
resolved merge conflicts with 2.0 branch, wds tests passing
nsaadhvi May 1, 2025
1795a19
Throw error for image binary and batch row mismatch
nsaadhvi May 1, 2025
d154e17
edit demo to process images with binary column
nsaadhvi May 2, 2025
5fc1ca3
normalize function and media instead of image
valerie-cal Jun 15, 2025
0a2aaf1
merge
valerie-cal Jun 15, 2025
6bd5f1f
cleanup changes
valerie-cal Jun 15, 2025
5aac9c1
Update wds_demo.py with comments
NeeralBhalgat Jun 16, 2025
ca0f846
Add data read/write tests and remove internal attribute tests
valerie-cal Aug 17, 2025
0e3623d
Merge remote-tracking branch 'origin/2.0' into wds-schema
valerie-cal Aug 17, 2025
860a9d2
Fix accidental executable bit changes
valerie-cal Aug 17, 2025
784b650
Fix accidental executable bit changes
valerie-cal Aug 17, 2025
b8e819d
Manually resolve remaining merge conflicts
valerie-cal Aug 17, 2025
6a5b7df
Addressing comments, to be continued
valerie-cal Aug 18, 2025
ec52d2f
Improvements, to be continued
valerie-cal Aug 18, 2025
b5d638d
Cleaned and upgraded webdataset test suites
025rhu Aug 18, 2025
0f270bd
Removed static tar files used in old WebDataset test suite
025rhu Aug 18, 2025
1be6b29
Removed unnecessary set up code in TestFromWebDataset
025rhu Aug 18, 2025
787cf3f
Using WDS reader for tests, added intra-batch schema merge handling l…
025rhu Aug 21, 2025
c654e68
Make a WebDatasetReader class for Dataset's from_webdataset, and adde…
025rhu Sep 3, 2025
f3401d2
Upgraded inconsistent schema handling test, and added non-lossy promo…
025rhu Sep 3, 2025
90ae6e0
Cleaned code and update a doc string
025rhu Sep 3, 2025
bb86af5
Passing linter
025rhu Sep 3, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
Empty file modified .flake8
100644 → 100755
Empty file.
Empty file modified .github/pull_request_template.md
100644 → 100755
Empty file.
Empty file modified .github/workflows/ci.yml
100644 → 100755
Empty file.
Empty file modified .github/workflows/publish-to-pypi.yml
100644 → 100755
Empty file.
Empty file modified .gitignore
100644 → 100755
Empty file.
Empty file modified .isort.cfg
100644 → 100755
Empty file.
Empty file modified .pre-commit-config.yaml
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified MANIFEST.in
100644 → 100755
Empty file.
Empty file modified Makefile
100644 → 100755
Empty file.
Empty file modified README-development.md
100644 → 100755
Empty file.
Empty file modified README.md
100644 → 100755
Empty file.
Empty file modified deltacat/__init__.py
100644 → 100755
Empty file.
Empty file modified deltacat/aws/clients.py
100644 → 100755
Empty file.
Empty file modified deltacat/aws/constants.py
100644 → 100755
Empty file.
Empty file modified deltacat/aws/s3u.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/README.md
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/__init__.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/benchmark_engine.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/benchmark_parquet_reads.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/benchmark_report.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/benchmark_suite.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/conftest.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/data/__init__.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/data/random_row_generator.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/data/row_generator.py
100644 → 100755
Empty file.
Empty file modified deltacat/benchmarking/test_benchmark_pipeline.py
100644 → 100755
Empty file.
Empty file modified deltacat/catalog/__init__.py
100644 → 100755
Empty file.
153 changes: 153 additions & 0 deletions deltacat/catalog/catalog_properties.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
from __future__ import annotations
from typing import Optional

import pyarrow
from deltacat.constants import DELTACAT_CATALOG_PROPERTY_ROOT

from deltacat.utils.filesystem import resolve_path_and_filesystem

"""
Property catalog is configured globally (at the interpreter level)

Ray has limitations around serialized class size. For this reason, larger files like catalog impl and
storage impl need to be a flat list of functions rather than a stateful class initialized with properties.
For more details see - README-development.md

These classes will fetch the globally configured CatalogProperties, OR allow injection of a custom
CatalogProperties in kwargs

Example: injecting custom CatalogProperties
catalog.namespace_exists("my_namespace", catalog_properties=CatalogProperties(root="..."))

Example: explicitly initializing global CatalogProperties
from deltacat.catalog import initialize_properties
initialize_properties(root="...")
catalog.namespace_exists("mynamespace")

By default, catalog properties are initialized automatically, and fall back to defaults/env variables.
Example: using env variables
os.environ["DELTACAT_ROOT"]="..."
catalog.namespace_exists("mynamespace")
"""
CATALOG_PROPERTIES: CatalogProperties = None
_INITIALIZED = False


def initialize_properties(
root: Optional[str] = None, *args, force: bool = False, **kwargs
) -> CatalogProperties:
"""
Initialize a Catalog state, if not already initialized.

If environment variables are present, will check the following environment variables to configure catalog:
DELTACAT_ROOT: maps to "root" parameter

Environment variables will be overridden if explicit parameters are provided

Args:
root: filesystem URI for catalog root
force: if True, will re-initialize even if global catalog exists. If False, will return global catalog
"""
global _INITIALIZED, CATALOG_PROPERTIES

if _INITIALIZED and not force:
return CATALOG_PROPERTIES

# Check environment variables
# This is set or defaulted in constants.py
env_root = DELTACAT_CATALOG_PROPERTY_ROOT
if env_root is None:
raise ValueError(
"Expected environment variable DELTACAT_ROOT to be set or defaulted"
)

# Environment variables are overridden by explicit parameters
if root is None:
root = env_root

# Initialize the catalog properties
CATALOG_PROPERTIES = CatalogProperties(root=root, **kwargs)

_INITIALIZED = True
return CATALOG_PROPERTIES


def get_catalog_properties(**kwargs) -> CatalogProperties:
"""
Helper function to get the appropriate CatalogProperties instance.

If 'catalog_properties' is provided in kwargs, it will be used.
Otherwise, it will use the global catalog, initializing it if necessary.

Args:
**kwargs: Keyword arguments that might contain 'properties'

Returns:
CatalogProperties: The catalog properties to use
"""
properties = kwargs.get("catalog_properties")
if properties is not None and isinstance(properties, CatalogProperties):
return properties
elif properties is not None and not isinstance(properties, CatalogProperties):
raise ValueError(
"Expected kwarg catalog_properties to be instance of CatalogProperties"
)

# Use the global catalog, initializing if necessary
if not _INITIALIZED:
initialize_properties()

return CATALOG_PROPERTIES


class CatalogProperties:
"""
This holds all configuration for a DeltaCAT catalog.

CatalogProperties can be configured at the interpreter level by calling initialize_properties, or provided with
the kwarg catalog_properties. We expect functions to plumb through kwargs throughout, so only when a property needs to be fetched does a function
need to retrieve the property catalog. Property catalog must be retrieved through get_property_catalog, which will
hierarchically check kwargs then the global value.

Specific properties are configurable via env variable.

Be aware that parallel code (e.g. parallel tests) may overwrite the catalog properties defined global at the interpreter level
In this case, you must explicitly provide the kwarg catalog_properties rather than declare it globally with initialize_catalog_properties

Attributes:
root (str): URI string The root path where catalog metadata and data files are stored. If none provided,
will be initialized as .deltacat/ relative to current working directory

filesystem (pyarrow.fs.FileSystem): pyarrow filesystem implementation used for
accessing files. If not provided, will be inferred via root
"""

def __init__(
self,
root: str,
*args,
filesystem: Optional[pyarrow.fs.FileSystem] = None,
**kwargs,
):
"""
Initialize a CatalogProperties instance.

Args:
root (str, optional): Root path for the catalog storage. If None, will be resolved later.
filesystem (pyarrow.fs.FileSystem, optional): FileSystem implementation to use.
If None, will be resolved based on the root path.
"""
resolved_root, resolved_filesystem = resolve_path_and_filesystem(
path=root,
filesystem=filesystem,
)
self._root = resolved_root
self._filesystem = resolved_filesystem

@property
def root(self) -> str:
return self._root

@property
def filesystem(self) -> Optional[pyarrow.fs.FileSystem]:
return self._filesystem
Loading