
Hello-Imagine/cs349d_project


Stanford CS 349D - Optimizing UDFs with Early Filters

Overview

This project aims to reduce the computational cost of calling user-defined functions (UDFs) in data queries by using probabilistic predicates (PPs) as early filters. A PP is a lightweight model that runs before the UDF and discards inputs that are unlikely to satisfy the query, so the UDF is called on fewer inputs. We construct a PP for each simple clause and incorporate the PPs into a custom query optimizer.
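
The early-filter idea can be illustrated with a minimal sketch. It is not the project's implementation: the function filter_then_apply, the toy UDF, the toy PP, and the threshold value are all hypothetical stand-ins chosen for illustration. The sketch assumes a PP exposes a score per row and that rows scoring below a threshold are skipped without calling the UDF.

```python
from typing import Callable, Iterable, List, Tuple


def filter_then_apply(
    rows: Iterable[object],
    pp_score: Callable[[object], float],
    udf: Callable[[object], bool],
    threshold: float,
) -> Tuple[List[object], int]:
    """Return the rows accepted by the UDF and the number of UDF calls made."""
    matches: List[object] = []
    udf_calls = 0
    for row in rows:
        # Cheap check first: skip rows the PP considers unlikely to match.
        if pp_score(row) < threshold:
            continue
        # Only the surviving rows pay the cost of the expensive UDF.
        udf_calls += 1
        if udf(row):
            matches.append(row)
    return matches, udf_calls


# Hypothetical stand-ins: the "UDF" accepts multiples of 10, and the "PP"
# is a cheap proxy that only looks at parity.
def toy_udf(x: int) -> bool:
    return x % 10 == 0


def toy_pp(x: int) -> float:
    return 1.0 if x % 2 == 0 else 0.0


matches, udf_calls = filter_then_apply(range(100), toy_pp, toy_udf, threshold=0.5)
print(len(matches), udf_calls)
```

In this toy case the filter never rejects a row the UDF would accept, so the result is unchanged while the UDF runs on half of the rows. A real PP is a learned model and can reject true matches, which is why its threshold has to be chosen with care.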

Getting Started

Initialize your environment by following these steps:

conda create --name cs349d python
conda activate cs349d
pip install -r requirements.txt
git submodule init
git submodule update

Directory Structure

  • data/: Datasets used to train and test the PP models.
  • dataloader/: Scripts that load and preprocess the data into the format the models expect for training.
  • ml_udf/: The user-defined functions (UDFs) for image datasets that the probabilistic predicates optimize, including YOLOv5 and FastRCNN.
  • pp_models/: The machine learning models that serve as probabilistic predicates. Each model predicts whether a UDF needs to be executed on an input, so it can act as an early filter.
  • pp_params/: Parameter files for loading and tuning the PP models. Adjusting these parameters changes the models' accuracy and efficiency.
  • query_optimizer/: The query optimizer module, which integrates probabilistic predicates into query processing.
  • qo_tests/: Test scripts that run the query optimizer on each simple clause.
  • query_test/: Scripts that test the ML UDFs on simple queries.
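
The effect of PP parameters on accuracy and efficiency, mentioned under pp_params/, can be made concrete with a small sketch. It assumes, hypothetically, that the tuned parameter is a score threshold and that it is chosen on labeled validation data to keep a target fraction of the rows the UDF would accept. The actual contents of pp_params/ are not described here, and pick_threshold and reduction are illustrative names only.

```python
import math
from typing import Sequence


def pick_threshold(
    scores: Sequence[float], labels: Sequence[bool], target_recall: float
) -> float:
    """Largest threshold t such that keeping rows with score >= t retains
    at least target_recall of the positive rows."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positive rows that must survive the filter.
    keep = max(1, math.ceil(target_recall * len(positives)))
    return positives[keep - 1]


def reduction(scores: Sequence[float], threshold: float) -> float:
    """Fraction of rows the filter discards before the UDF runs."""
    return sum(s < threshold for s in scores) / len(scores)


# Made-up validation scores and UDF outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [True, True, False, True, False, False, True, False]

for target in (1.0, 0.75):
    t = pick_threshold(scores, labels, target)
    print(target, t, reduction(scores, t))
```

On this made-up data, requiring every positive row to pass gives a threshold of 0.1 and discards one row in eight. Accepting the loss of one positive in four raises the threshold to 0.4 and discards half of the rows.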

About

Final project for the Stanford CS 349D class.
