A simple environment for data processing in home or study projects.
- Spark
- Jupyter with PySpark
- PostgreSQL
- Hive Metastore
- Minio
- Kafka
- NiFi
- Zookeeper
- TrinoDB
- Metabase
- Debezium
- MageAI (temporarily in dc-bkp.yml)
- Rapids (NVIDIA GPU support for machine learning)
- Airflow (temporarily in dc-bkp.yml)
The Jupyter image is the only one that needs a custom build, since it depends on a custom Python version; the corresponding Dockerfile is in the images folder.
You don't need to build it yourself: on the first docker compose up, Docker Compose will build it automatically. But if you need to force the build for some reason, you can use this command:
docker build images/jupyter/. -t 3.4.1
For all other applications, only the docker-compose.yml is needed.
As every container uses a mounted volume for data persistence, there's a .env file in this repository that sets the BASE_PATH variable. You'll need to change it to the path of the cloned repository on your computer.
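For example, the .env file would look like this (the path below is purely illustrative; use your own clone location):

```shell
# .env — BASE_PATH must point to where you cloned this repository (example path)
BASE_PATH=/home/youruser/projects/data-env
```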
Most of the saved data can be too large to push to GitHub. Keep that in mind if you need to move your environment somewhere else.
Most of the containers will work fine on the first run, but you'll have to do the following configurations manually (on the first run only):
After the Minio container is up, you will be able to access the UI by opening http://localhost:9001 in a web browser. Use the credentials set in the docker-compose.yml file to access the management panel:
MINIO_ROOT_USER: minioadmin
MINIO_ROOT_PASSWORD: minioadmin
After that, in the Administrator section, go to Users in the Identity tab. Create a new user as follows:
User Name: minio
Password: minioadmin
Assign Policies:
[x] consoleAdmin
[x] diagnostics
[x] readonly
[x] readwrite
Once the user is created, you will be able to access the Minio S3 API with these credentials.
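If you prefer the command line, the same user can be created with the MinIO client (mc) instead of the UI. This is only a sketch: it assumes the S3 API is on port 9000, uses a hypothetical alias name `local`, and the policy-attach syntax differs between mc versions (older releases use `mc admin policy set` instead of `attach`), so it needs a live server to verify:

```shell
# point mc at the local MinIO using the root credentials from docker-compose.yml
mc alias set local http://localhost:9000 minioadmin minioadmin

# create the user with the same credentials as in the UI steps above
mc admin user add local minio minioadmin

# attach the same policies, one per command to stay version-agnostic
mc admin policy attach local consoleAdmin --user minio
mc admin policy attach local readwrite --user minio
```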
As this JAR is way too large to be pushed here, you will need to download it from the Maven repository at this link:
And save it in the following paths:
data/hive-metastore/lib/
data/spark/jars/
Remember that you can't commit large files to GitHub, so this JAR is ignored by the .gitignore file.
Contact me: https://www.linkedin.com/in/luiz-henrique-mm/
Install the lib into a specific directory like this:
mkdir libs
pip install --target=./libs unidecode
cd libs
zip -r ../unidecode.zip .
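The packaging above works because Python can import directly from a zip archive placed on sys.path, which is the same mechanism Spark relies on for files added via addPyFile. A minimal local sketch with a hypothetical toy package (mylib stands in for unidecode):

```python
import os
import sys
import tempfile
import zipfile

# build a tiny package zip, a stand-in for the unidecode.zip created above
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mylib")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("def clean(s):\n    return s.strip()\n")

zip_path = os.path.join(tmp, "mylib.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # archive layout must mirror pip's --target output: mylib/__init__.py at the root
    zf.write(os.path.join(pkg, "__init__.py"), "mylib/__init__.py")

# addPyFile effectively does this on every executor: put the zip on sys.path
sys.path.insert(0, zip_path)
import mylib
print(mylib.clean("  hi  "))  # -> hi
```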
Then move the zip into the data/spark/python-libs folder and run this:
sc = spark.sparkContext
sc.addPyFile("/home/user/python-libs/unidecode.zip")