This plugin also supports inference via the powerful and flexible NVIDIA Triton server.
This allows a client to request that images stored in a remote Cassandra database be run through inference on a different remote, GPU-powered server.
The plugin provides two operators to be used with Triton:

- cassandra_interactive: this operator expects a batch of UUIDs as input, represented as pairs of uint64, and produces as output a batch containing the raw images stored as BLOBs in the database, possibly paired with the corresponding labels.
- cassandra_decoupled: the decoupled version of the operator splits the input UUIDs (which, in this case, can form a very long list) into mini-batches and requests the images from the database, using prefetching to increase throughput and hide network latency.
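Since the operators take UUIDs encoded as pairs of uint64, it may help to see how a 128-bit UUID maps to such a pair. Below is a minimal sketch; the (high, low) word ordering is an assumption here, and the bundled client scripts define the actual convention used by the plugin:

```python
import uuid


def uuid_to_uint64_pair(u: uuid.UUID) -> tuple[int, int]:
    """Split a 128-bit UUID into two unsigned 64-bit words.

    NOTE: (high, low) ordering is an assumption; check the plugin's
    client scripts for the exact convention.
    """
    return (u.int >> 64, u.int & 0xFFFFFFFFFFFFFFFF)


def uint64_pair_to_uuid(hi: int, lo: int) -> uuid.UUID:
    """Recombine the two 64-bit words into the original UUID."""
    return uuid.UUID(int=(hi << 64) | lo)


u = uuid.uuid4()
hi, lo = uuid_to_uint64_pair(u)
assert uint64_pair_to_uuid(hi, lo) == u
```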
The models directory contains the following subdirectories,
with examples of pipelines using both cassandra_interactive and
cassandra_decoupled:
- dali_cassandra_interactive: retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the fn.crs4.cassandra_interactive operator.
- dali_cassandra_interactive_stress: retrieves the raw data from the database and returns the first byte of every BLOB. It uses the fn.crs4.cassandra_interactive operator.
- dali_cassandra_decoupled: retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the fn.crs4.cassandra_decoupled operator.
- dali_cassandra_decoupled_stress: retrieves the raw data from the database and returns the first byte of every BLOB. It uses the fn.crs4.cassandra_decoupled operator.
- classification_resnet: performs inference with a pre-trained ResNet50 for ImageNet classification, downloaded in advance by the runme.py script.
- cass_to_inference: an ensemble model that connects dali_cassandra_interactive and classification_resnet to load and preprocess images from the database and perform inference on them.
- cass_to_inference_decoupled: an ensemble model that connects dali_cassandra_decoupled and classification_resnet to load and preprocess images from the database and perform inference on them.
The most convenient way to test the cassandra-dali-plugin with Triton is via the provided docker-compose.triton.yml, which runs a Cassandra container and a second container (derived from the NVIDIA Triton Inference Server NGC image) containing our plugin, NVIDIA Triton, NVIDIA DALI, and the Cassandra C++ and Python drivers. To build and run the containers, use the following commands:
docker compose -f docker-compose.triton.yml up --build -d
docker compose exec dali-cassandra fish

Once the Docker containers are set up, it is possible to populate the database with images from the imagenette dataset using the provided script:
./fill-db.sh  # might take a few minutes

After the database is populated, we can start the Triton server with:
./start-triton.sh

You can leave this shell open; it will display the logs of the Triton server.
To run the clients, start a new shell in the container with the following command:

docker compose exec dali-cassandra fish

Now, within the container, run the following commands to test the inference:
python3 client-http-stress.py
python3 client-grpc-stress.py
python3 client-grpc-stream-stress.py
python3 client-http-ensemble.py
python3 client-grpc-ensemble.py
python3 client-grpc-stream-ensemble.py

You can also benchmark the inference performance using NVIDIA's perf_analyzer. For example:
perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000
perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000 -i grpc
perf_analyzer -m dali_cassandra_decoupled_stress --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
perf_analyzer -m cass_to_inference --input-data uuids.json -b 256 --concurrency-range 16 -i grpc
perf_analyzer -m cass_to_inference_decoupled --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
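For reference, a minimal gRPC client might look like the sketch below. It assumes the model exposes a UINT64 input named UUID of shape (batch, 2), as the --shape UUID:2048,2 flag above suggests, and that the (high, low) word ordering matches the plugin's; the bundled client-*.py scripts are the authoritative examples:

```python
import uuid

import numpy as np


def make_uuid_batch(uuids):
    """Pack UUIDs into a (batch, 2) uint64 array of 64-bit word pairs.

    NOTE: (high, low) ordering is an assumption; see the bundled
    client scripts for the exact convention.
    """
    pairs = [(u.int >> 64, u.int & 0xFFFFFFFFFFFFFFFF) for u in uuids]
    return np.array(pairs, dtype=np.uint64)


if __name__ == "__main__":
    # Requires a running Triton server (see ./start-triton.sh).
    import tritonclient.grpc as grpcclient

    batch = make_uuid_batch([uuid.uuid4() for _ in range(256)])
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    inp = grpcclient.InferInput("UUID", batch.shape, "UINT64")
    inp.set_data_from_numpy(batch)
    response = client.infer(
        model_name="dali_cassandra_interactive", inputs=[inp]
    )
    # Inspect the results with response.as_numpy(<output name>).
```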