A simple project for practicing Apache Spark (PySpark) locally.
This project uses Docker to run Spark applications in a local environment. No Spark installation required.
- Docker
- Just command runner (`brew install just` on macOS)
    spark-practice-python/
    ├── Justfile      # Commands for running Spark applications
    ├── output/       # CSV output from Spark applications (auto-created)
    └── src/          # Spark Python scripts
        ├── spark-app.py    # Main Spark application
        └── eda/
            └── simple_analysis.py  # Example EDA script
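The `Justfile` is what ties the pieces together. The actual recipes aren't shown here; as a hedged sketch (recipe names come from the commands documented below, while the mount paths and `docker run` flags are assumptions), it might look something like:

```just
# Sketch only — the real Justfile may differ.

# List available scripts under src/
list:
    find src -name '*.py' | sed 's|^src/||'

# Run a script inside the bitnami/spark container,
# mapping /output back to output/{script} on the host
run script:
    mkdir -p output/{{script}}
    docker run --rm \
      -v "$(pwd)/src:/src:ro" \
      -v "$(pwd)/output/{{script}}:/output" \
      bitnami/spark:latest \
      spark-submit /src/{{script}}
```

Mounting `output/{{script}}` to `/output` is what makes each script's results land in a host directory matching the script's path.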
List available scripts:

`just list`

Run a specific Spark script:

`just run spark-app.py`

Or run a script in a subdirectory:

`just run eda/simple_analysis.py`

View output from a script:

`just view spark-app.py`

Clean all output directories:

`just clean`

Create new Python scripts in the `src` directory. Each script should:
- Create a Spark session
- Perform data processing
- Save results to the `/output/` directory (this maps to `output/{script_path}/` on your host)
CSV output files from each script are automatically saved to a directory matching the script's path.
The project uses the `bitnami/spark:latest` Docker image, which includes:
- Spark with local execution mode
- Python with PySpark
No need to build custom images for basic Spark exploration.
The project includes sample scripts:
- `src/spark-app.py`: Basic Spark operations with random data
- `src/eda/simple_analysis.py`: Simple exploratory data analysis