
Commit 94d902d

Merge pull request #2 from datafusion-contrib/ntran/cleanup
Rename all datafusion-distributed to distributed-datafusion and remov…
2 parents 24f7ff2 + 3c9c003 commit 94d902d

File tree

9 files changed: +88 -38 lines changed


Cargo.toml

Lines changed: 3 additions & 3 deletions
@@ -16,8 +16,8 @@
 # under the License.

 [package]
-name = "datafusion-distributed"
-description = "DataFusion distributed execution framework"
+name = "distributed-datafusion"
+description = "Distributed DataFusion execution framework"
 homepage = "https://github.com/datafusion-contrib/datafusion-distributed"
 repository = "https://github.com/datafusion-contrib/datafusion-distributed"
 authors = ["DataFusion Contributors <[email protected]>"]
@@ -29,7 +29,7 @@ rust-version = "1.70"
 build = "build.rs"

 [[bin]]
-name = "datafusion-distributed"
+name = "distributed-datafusion"
 path = "src/main.rs"

 [dependencies]

README.md

Lines changed: 35 additions & 13 deletions
@@ -1,4 +1,4 @@
-# DataFusion Distributed
+# Distributed DataFusion

 [![Apache licensed][license-badge]][license-url]

@@ -7,7 +7,7 @@
 ## Overview

-DataFusion Distributed is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run in a distributed fashion. This project provides the infrastructure to scale DataFusion workloads across multiple nodes in a cluster.
+Distributed DataFusion is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run in a distributed fashion. This project provides the infrastructure to scale DataFusion workloads across multiple nodes in a cluster.

 This is an open source version of the distributed DataFusion prototype, extracted from DataDog's internal implementation and made available to the community.

@@ -100,20 +100,42 @@ protoc --version
 ./build.sh --release
 ```

+**Clean Rebuild**: If you need to completely clean and rebuild (removes all build artifacts):
+
+```bash
+# Clean rebuild in debug mode
+./clean_and_build.sh
+
+# Clean rebuild in release mode (optimized)
+./clean_and_build.sh --release
+```
+
 #### Using Cargo Directly

-To build the project in debug mode:
+You can also build the project directly with Cargo (the build.rs script automatically handles Protocol Buffer compilation):

 ```bash
+# Build in debug mode
 cargo build
 ```

-To build the project in release mode (optimized):
+```bash
+# Build in release mode (optimized)
+cargo build --release
+```
+
+**Clean Build Artifacts**: To clean previous build artifacts before rebuilding:

 ```bash
+# Clean all build artifacts (removes target/ directory contents)
+cargo clean
+
+# Then rebuild
 cargo build --release
 ```

+**Note**: Both the `build.sh` script and `cargo` automatically invoke `build.rs`, which handles Protocol Buffer compilation before building the main crate. The main advantage of `./build.sh` is the user-friendly output and usage examples it prints.
+
 ### Running Tests

 Run all tests:
@@ -203,10 +225,10 @@ In separate terminal windows, start two workers:

 ```bash
 # Terminal 1 - Start first worker
-DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/datafusion-distributed --mode worker --port 20201
+DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/distributed-datafusion --mode worker --port 20201

 # Terminal 2 - Start second worker
-DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/datafusion-distributed --mode worker --port 20202
+DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/distributed-datafusion --mode worker --port 20202
 ```

 **Step 2: Start Proxy**
@@ -215,17 +237,17 @@ In another terminal, start the proxy connecting to both workers:

 ```bash
 # Terminal 3 - Start proxy connected to workers
-DATAFUSION_RAY_LOG_LEVEL=trace DFRAY_WORKER_ADDRESSES=worker1/localhost:20201,worker2/localhost:20202 ./target/release/datafusion-distributed --mode proxy --port 20200
+DATAFUSION_RAY_LOG_LEVEL=trace DFRAY_WORKER_ADDRESSES=worker1/localhost:20201,worker2/localhost:20202 ./target/release/distributed-datafusion --mode proxy --port 20200
 ```

 To make your cluster aware of specific table schemas, you'll need to define a new environment variable, DFRAY_TABLES, when starting each worker and proxy. This variable should specify tables whose data is stored in Parquet files. For example, the following setup registers two tables, customer and nation, along with their corresponding data sources.

 ```bash
-DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/datafusion-distributed --mode worker --port 20201
+DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/distributed-datafusion --mode worker --port 20201

-DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/datafusion-distributed --mode worker --port 20202
+DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace ./target/release/distributed-datafusion --mode worker --port 20202

-DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace DFRAY_WORKER_ADDRESSES=worker1/localhost:20201,worker2/localhost:20202 ./target/release/datafusion-distributed --mode proxy --port 20200
+DFRAY_TABLES=customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet DATAFUSION_RAY_LOG_LEVEL=trace DFRAY_WORKER_ADDRESSES=worker1/localhost:20201,worker2/localhost:20202 ./target/release/distributed-datafusion --mode proxy --port 20200
 ```

 #### Manual Client Setup
@@ -271,16 +293,16 @@ For development or testing, you can run individual components:

 ```bash
 # Single worker (no distributed queries)
-./target/release/datafusion-distributed --mode worker --port 20201
+./target/release/distributed-datafusion --mode worker --port 20201

 # Proxy without workers (limited functionality)
-./target/release/datafusion-distributed --mode proxy --port 20200
+./target/release/distributed-datafusion --mode proxy --port 20200
 ```

 View Available Options

 ```bash
-./target/release/datafusion-distributed --help
+./target/release/distributed-datafusion --help
 ```

 The system supports various configuration options through environment variables:
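The two environment variables shown above follow simple textual conventions: DFRAY_TABLES is a comma-separated list of `name:format:path` triples, and DFRAY_WORKER_ADDRESSES is a comma-separated list of `name/host:port` pairs. The following std-only Rust sketch shows one way such values could be decoded; it is illustrative only (the helper names are made up, and this is not the crate's actual parsing code):

```rust
/// One `name:format:path` entry from a DFRAY_TABLES-style value.
#[derive(Debug, PartialEq)]
struct TableSpec {
    name: String,
    format: String,
    path: String,
}

/// Parse a DFRAY_TABLES-style value, e.g.
/// "customer:parquet:/tmp/tpch_s1/customer.parquet,nation:parquet:/tmp/tpch_s1/nation.parquet".
/// Hypothetical helper, not the project's real parser.
fn parse_tables(value: &str) -> Vec<TableSpec> {
    value
        .split(',')
        .filter(|s| !s.is_empty())
        .filter_map(|entry| {
            // splitn(3, ':') keeps any ':' inside the path intact.
            let mut parts = entry.splitn(3, ':');
            Some(TableSpec {
                name: parts.next()?.to_string(),
                format: parts.next()?.to_string(),
                path: parts.next()?.to_string(),
            })
        })
        .collect()
}

/// Parse a DFRAY_WORKER_ADDRESSES-style value, e.g.
/// "worker1/localhost:20201,worker2/localhost:20202", into (name, addr) pairs.
/// Also a hypothetical helper.
fn parse_workers(value: &str) -> Vec<(String, String)> {
    value
        .split(',')
        .filter_map(|entry| {
            let (name, addr) = entry.split_once('/')?;
            Some((name.to_string(), addr.to_string()))
        })
        .collect()
}

fn main() {
    let tables = parse_tables("customer:parquet:/tmp/tpch_s1/customer.parquet");
    assert_eq!(tables[0].name, "customer");
    assert_eq!(tables[0].path, "/tmp/tpch_s1/customer.parquet");

    let workers = parse_workers("worker1/localhost:20201,worker2/localhost:20202");
    assert_eq!(workers.len(), 2);
    assert_eq!(workers[1].0, "worker2");
    println!("parsed {} table(s), {} worker(s)", tables.len(), workers.len());
}
```

Splitting the table entry with `splitn(3, ':')` matters because the path component may itself be extended later; a plain `split(':')` would break on any extra separator.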

build.sh

Lines changed: 5 additions & 5 deletions
@@ -8,16 +8,16 @@ if [ "$1" = "--release" ] || [ "$1" = "-r" ]; then
     echo "Building in release mode..."
     cargo build --release
     echo "✅ Build completed successfully!"
-    echo "Binary available at: ./target/release/datafusion-distributed"
+    echo "Binary available at: ./target/release/distributed-datafusion"
 else
     echo "Building in debug mode..."
     cargo build
     echo "✅ Build completed successfully!"
-    echo "Binary available at: ./target/debug/datafusion-distributed"
+    echo "Binary available at: ./target/debug/distributed-datafusion"
 fi

 echo ""
 echo "Usage:"
-echo "  ./target/debug/datafusion-distributed --help                     # View help"
-echo "  ./target/debug/datafusion-distributed --mode proxy --port 20200  # Start proxy"
-echo "  ./target/debug/datafusion-distributed --mode worker --port 20201 # Start worker"
+echo "  ./target/debug/distributed-datafusion --help                     # View help"
+echo "  ./target/debug/distributed-datafusion --mode proxy --port 20200  # Start proxy"
+echo "  ./target/debug/distributed-datafusion --mode worker --port 20201 # Start worker"

clean_and_build.sh

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#!/bin/bash
+set -e
+
+echo "🧹 Cleaning old build artifacts..."
+
+# Remove old target directory containing the old binary name
+if [ -d "target" ]; then
+    echo "  Removing target/ directory..."
+    rm -rf target/
+    echo "  ✅ Old target/ directory removed"
+else
+    echo "  No target/ directory found"
+fi
+
+echo ""
+echo "🔨 Building project with new binary name 'distributed-datafusion'..."
+
+# Build the project (this will create a new target directory)
+if [ "$1" = "--release" ] || [ "$1" = "-r" ]; then
+    echo "Building in release mode..."
+    cargo build --release
+    echo "✅ Build completed successfully!"
+    echo "New binary available at: ./target/release/distributed-datafusion"
+else
+    echo "Building in debug mode..."
+    cargo build
+    echo "✅ Build completed successfully!"
+    echo "New binary available at: ./target/debug/distributed-datafusion"
+fi
+
+echo ""
+echo "🎉 Cleanup and rebuild complete!"
+echo ""
+echo "Usage:"
+echo "  ./target/debug/distributed-datafusion --help                     # View help"
+echo "  ./target/debug/distributed-datafusion --mode proxy --port 20200  # Start proxy"
+echo "  ./target/debug/distributed-datafusion --mode worker --port 20201 # Start worker"

scripts/launch_tpch_cluster.sh

Lines changed: 5 additions & 5 deletions
@@ -67,7 +67,7 @@ if [ "$NUM_WORKERS" -lt 1 ]; then
 fi

 # Check if the binary exists, build if not
-if [ ! -f "./target/release/datafusion-distributed" ]; then
+if [ ! -f "./target/release/distributed-datafusion" ]; then
     echo "Binary not found, building release version..."
     echo "This may take a few minutes on first run..."
     if [ -f "./build.sh" ]; then
@@ -77,8 +77,8 @@ if [ ! -f "./target/release/datafusion-distributed" ]; then
     fi

     # Verify the build was successful
-    if [ ! -f "./target/release/datafusion-distributed" ]; then
-        echo "Error: Failed to build datafusion-distributed binary"
+    if [ ! -f "./target/release/distributed-datafusion" ]; then
+        echo "Error: Failed to build distributed-datafusion binary"
         exit 1
     fi
     echo "✅ Build completed successfully!"
@@ -146,7 +146,7 @@ for ((i=0; i<NUM_WORKERS; i++)); do
     WORKER_NAME="worker$((i+1))"
     LOG_FILE="${LOG_DIR}/${WORKER_NAME}.log"
     echo "  Starting $WORKER_NAME on port $PORT..."
-    env DATAFUSION_RAY_LOG_LEVEL="$DATAFUSION_RAY_LOG_LEVEL" DFRAY_TABLES="$DFRAY_TABLES" ./target/release/datafusion-distributed --mode worker --port $PORT > "$LOG_FILE" 2>&1 &
+    env DATAFUSION_RAY_LOG_LEVEL="$DATAFUSION_RAY_LOG_LEVEL" DFRAY_TABLES="$DFRAY_TABLES" ./target/release/distributed-datafusion --mode worker --port $PORT > "$LOG_FILE" 2>&1 &
    WORKER_PIDS[$i]=$!
    WORKER_ADDRESSES[$i]="${WORKER_NAME}/localhost:${PORT}"
 done
@@ -162,7 +162,7 @@ WORKER_ADDRESSES_STR=$(IFS=,; echo "${WORKER_ADDRESSES[*]}")
 echo "Starting proxy on port 20200..."
 echo "Connecting to workers: $WORKER_ADDRESSES_STR"
 PROXY_LOG="${LOG_DIR}/proxy.log"
-env DATAFUSION_RAY_LOG_LEVEL="$DATAFUSION_RAY_LOG_LEVEL" DFRAY_TABLES="$DFRAY_TABLES" DFRAY_WORKER_ADDRESSES="$WORKER_ADDRESSES_STR" ./target/release/datafusion-distributed --mode proxy --port 20200 > "$PROXY_LOG" 2>&1 &
+env DATAFUSION_RAY_LOG_LEVEL="$DATAFUSION_RAY_LOG_LEVEL" DFRAY_TABLES="$DFRAY_TABLES" DFRAY_WORKER_ADDRESSES="$WORKER_ADDRESSES_STR" ./target/release/distributed-datafusion --mode proxy --port 20200 > "$PROXY_LOG" 2>&1 &
 PROXY_PID=$!

 echo
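The script above builds each worker's entry as `${WORKER_NAME}/localhost:${PORT}` and then comma-joins the array into the DFRAY_WORKER_ADDRESSES value passed to the proxy. A minimal Rust sketch of the same assembly, assuming (as in the README examples) workers named worker1..workerN on consecutive ports starting at 20201; this mirrors the shell logic for illustration and is not project code:

```rust
/// Build a DFRAY_WORKER_ADDRESSES-style value the way the launch script does:
/// one "workerN/localhost:PORT" entry per worker, comma-joined.
/// Assumes consecutive ports, which the diff hunks do not show explicitly.
fn worker_addresses(num_workers: usize, first_port: u16) -> String {
    (0..num_workers)
        .map(|i| format!("worker{}/localhost:{}", i + 1, first_port + i as u16))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let addrs = worker_addresses(2, 20201);
    // Matches the value used in the README's proxy example.
    assert_eq!(addrs, "worker1/localhost:20201,worker2/localhost:20202");
    println!("{addrs}");
}
```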

src/codec.rs

Lines changed: 0 additions & 6 deletions
@@ -7,7 +7,6 @@ use datafusion::{
     execution::FunctionRegistry,
     physical_plan::{ExecutionPlan, displayable},
 };
-// DataDog-specific dependencies removed for open source version
 use datafusion_proto::{
     physical_plan::{
         DefaultPhysicalExtensionCodec,
@@ -96,7 +95,6 @@ impl PhysicalExtensionCodec for DFRayCodec {
                 )))
             }
         }
-        // DataDog-specific NumpangExec and ContextExec removed for open source version
         Payload::NumpangExec(_) => {
             Err(internal_datafusion_err!(
                 "NumpangExec not supported in open source version"
@@ -146,7 +144,6 @@ impl PhysicalExtensionCodec for DFRayCodec {
             };
             Payload::MaxRowsExec(pb)
         } else if let Some(_exec) = node.as_any().downcast_ref::<DataSourceExec>() {
-            // DataDog-specific DataSourceExec encoding removed for open source version
             return internal_err!("DataSourceExec encoding not supported in open source version");
         } else {
             return internal_err!("Not supported node to encode to proto");
@@ -175,7 +172,6 @@ mod test {
         physical_plan::{Partitioning, displayable},
         prelude::SessionContext,
     };
-    // DataDog-specific NumpangFileSource removed for open source version
     use datafusion_proto::physical_plan::AsExecutionPlan;

     use super::*;
@@ -252,8 +248,6 @@ mod test {
         verify_round_trip(exec);
     }

-    // DataDog-specific numpang tests removed for open source version
-
     #[test]
     fn max_rows_and_reader_round_trip() {
         let schema = create_test_schema();

src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ fn setup_logging() {
     let dfr_env = env::var("DATAFUSION_RAY_LOG_LEVEL").unwrap_or("WARN".to_string());
     let rust_log_env = env::var("RUST_LOG").unwrap_or("WARN".to_string());

-    let combined_env = format!("{rust_log_env},datafusion_distributed={dfr_env}");
+    let combined_env = format!("{rust_log_env},distributed_datafusion={dfr_env}");

     env_logger::Builder::new()
         .parse_filters(&combined_env)
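This one-line change matters because env_logger filter strings are target-scoped: the global RUST_LOG level comes first, followed by a crate-specific override whose target must match the renamed crate's identifier. A small std-only sketch of the same string construction, with the env lookup factored out so the behavior is easy to check (illustrative, not the project's code):

```rust
use std::env;

/// Build the env_logger filter string the way setup_logging() does:
/// "<global level>,distributed_datafusion=<crate level>".
/// The crate target uses underscores, since that is the crate's identifier.
fn combined_filter(rust_log: &str, crate_level: &str) -> String {
    format!("{rust_log},distributed_datafusion={crate_level}")
}

fn main() {
    // Same defaults as setup_logging(): both levels fall back to WARN when unset.
    let dfr = env::var("DATAFUSION_RAY_LOG_LEVEL").unwrap_or("WARN".to_string());
    let rust_log = env::var("RUST_LOG").unwrap_or("WARN".to_string());
    let filter = combined_filter(&rust_log, &dfr);
    // e.g. "WARN,distributed_datafusion=trace" when DATAFUSION_RAY_LOG_LEVEL=trace
    println!("{filter}");
    assert!(filter.contains("distributed_datafusion="));
}
```

Had the old target `datafusion_distributed` been left in place after the rename, the crate-level override would silently match nothing and only the global level would apply.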

src/main.rs

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
 use anyhow::Result;
 use clap::Parser;
-use datafusion_distributed::{
+use distributed_datafusion::{
     friendly::new_friendly_name,
     processor_service::DFRayProcessorService,
     proxy_service::DFRayProxyService,
     setup,
 };

 #[derive(Parser)]
-#[command(name = "datafusion-distributed")]
+#[command(name = "distributed-datafusion")]
 #[command(about = "A distributed execution engine for DataFusion", long_about = None)]
 struct Args {
     /// Port number for the service to listen on
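Note the asymmetry in this rename: the Cargo package, binary, and clap command name use a hyphen (`distributed-datafusion`), while the `use` path and the log-filter target use underscores (`distributed_datafusion`). Cargo derives the in-code crate identifier by mapping hyphens to underscores, which this trivial sketch makes explicit:

```rust
/// Cargo exposes a package to Rust code under an identifier formed by
/// replacing hyphens with underscores, since '-' is not legal in Rust
/// identifiers. That is why src/main.rs imports distributed_datafusion
/// even though Cargo.toml names the package distributed-datafusion.
fn crate_ident(package_name: &str) -> String {
    package_name.replace('-', "_")
}

fn main() {
    assert_eq!(crate_ident("distributed-datafusion"), "distributed_datafusion");
    println!("{}", crate_ident("distributed-datafusion"));
}
```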

src/planning.rs

Lines changed: 0 additions & 3 deletions
@@ -28,7 +28,6 @@ use datafusion::{
     },
     prelude::{SQLOptions, SessionConfig, SessionContext},
 };
-// DataDog-specific table functions removed for open source version
 use futures::TryStreamExt;
 use itertools::Itertools;
 use prost::Message;
@@ -129,8 +128,6 @@ async fn make_state() -> Result<SessionState> {
         .with_config(config)
         .build();

-    // DataDog-specific table functions registration removed for open source version
-
     add_tables_from_env(&mut state)
         .await
         .context("Failed to add tables from environment")?;
