
# OPEA Data Engineering Case

## Tech Stack

### Tools

- Python: 3.12.3
- Apache Spark: 4.1.1
- Java: 17.0.18

### Libraries

- pyspark 4.1.1
- jupyter 1.1.1
- pandas 3.0.0
- python-dotenv 1.2.1
- delta-spark 4.1.0
- boto3 1.42.53
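
Given pyspark and delta-spark in the stack, the notebooks presumably bootstrap a Delta-enabled Spark session along these lines. This is the standard delta-spark setup, not code taken from the repository; the app name and option set are assumptions, since the real session logic lives in `utils.py`:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Standard delta-spark bootstrap: register the Delta SQL extension and
# catalog, then let configure_spark_with_delta_pip add the Delta jars.
builder = (
    SparkSession.builder.appName("opea-case")  # app name is an assumption
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```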

## Project Structure

```text
├── analytics.ipynb
├── athena.ipynb
├── crawler.ipynb
├── raw_address.ipynb
├── raw_client.ipynb
├── stage_address.ipynb
├── stage_client.ipynb
├── utils.py
├── requirements.txt
├── README.md
├── data
│   └── dados_entrada 2.xlsx
└── logs
    ├── analytics_ingestion_clients.log
    ├── raw_ingestion_addresses.log
    ├── raw_ingestion_clients.log
    ├── stage_ingestion_addresses.log
    └── stage_ingestion_clients.log
```

## File Descriptions

- `utils.py`: Python module with the classes used by the .ipynb notebooks.
- `data/dados_entrada 2.xlsx`: XLSX file containing the client and address data.
- `README.md`: The file containing this description.
- `requirements.txt`: Text file with the required pip3 dependencies.

- `analytics.ipynb`: Jupyter notebook that ingests data from the Stage Layer into the Analytics Layer.
- `athena.ipynb`: Jupyter notebook that runs a query using Athena.
- `crawler.ipynb`: Jupyter notebook that creates an AWS Glue Crawler.
- `raw_address.ipynb`: Jupyter notebook that ingests the "enderecos" sheet of the XLSX into the Raw Layer.
- `raw_client.ipynb`: Jupyter notebook that ingests the "clientes" sheet of the XLSX into the Raw Layer.
- `stage_address.ipynb`: Jupyter notebook that ingests address data from the Raw Layer into the Stage Layer.
- `stage_client.ipynb`: Jupyter notebook that ingests client data from the Raw Layer into the Stage Layer.

- `logs/analytics_ingestion_clients.log`: Example of a successful execution of `analytics.ipynb`.
- `logs/raw_ingestion_addresses.log`: Example of a successful execution of `raw_address.ipynb`.
- `logs/raw_ingestion_clients.log`: Example of a successful execution of `raw_client.ipynb`.
- `logs/stage_ingestion_addresses.log`: Example of a successful execution of `stage_address.ipynb`.
- `logs/stage_ingestion_clients.log`: Example of a successful execution of `stage_client.ipynb`.

## How to Run

1. Ensure the tools listed in Tech Stack are installed.
2. Create and activate a virtual environment:
    `$ python3 -m venv .venv && source .venv/bin/activate`
3. Install the dependencies:
    `$ pip3 install -r requirements.txt`
4. Start Jupyter Notebook:
    `$ jupyter notebook`
5. Choose the .ipynb notebook to execute.
6. In the opened notebook, select Run > Run All Cells.

## Environment Variables

```text
S3_BUCKET=bkt-dev1-data-avaliacoes
AWS_REGION=sa-east-1
AWS_ACCESS_KEY_ID=<credencial_fornecida>
AWS_SECRET_ACCESS_KEY=<credencial_fornecida>
ENV=PROD
```

* P.S. 1: If `ENV` is not set, the "HOMOL" environment is assumed.
* P.S. 2: Set the variables above in a `.env` file, changing the values as needed.
* P.S. 3: It was requested to provide the AWS credentials here, but GitHub blocks pushing a file containing explicit `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values.
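
The fallback described in P.S. 1 can be sketched as a small helper. This is illustrative only; the actual logic lives in `utils.py` and may differ:

```python
import os


def resolve_env() -> str:
    """Return the target environment, defaulting to HOMOL when ENV is unset.

    Mirrors the behavior described in P.S. 1; the real implementation
    in utils.py may differ.
    """
    return os.getenv("ENV", "HOMOL").upper()


# With ENV unset, the pipeline targets the local HOMOL environment:
os.environ.pop("ENV", None)
print(resolve_env())  # HOMOL
```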

## Decisions Made

1. Jupyter Notebook was used to facilitate the analysis;
2. Reusable methods were centralized in a single Python file (`utils.py`) with four classes;
3. For the raw address ingestion, the notebook checks each `id_cliente` against the `clientes` table in the Raw Layer. So, for the sake of correctness, `raw_client.ipynb` must be run once before `raw_address.ipynb`;
4. A HOMOL environment was created to test the notebooks before executing them in the AWS environment;
5. Local directories are used in the HOMOL environment to simulate a Data Lake.
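
Decisions 4 and 5 suggest a path resolver that switches between S3 and a local directory tree. The sketch below is a hypothetical helper, not code from the repository: the local directory layout and the `s3a://` prefix are assumptions, while the bucket name comes from the environment variables above:

```python
def layer_path(env: str, layer: str, table: str) -> str:
    """Build the storage path for a table in a given layer.

    In PROD the data lives under the project's S3 bucket; in HOMOL a
    local directory tree simulates the Data Lake (decision 5). The
    local layout "./datalake/<layer>/<table>" is an assumption.
    """
    if env.upper() == "PROD":
        return f"s3a://bkt-dev1-data-avaliacoes/{layer}/{table}"
    return f"./datalake/{layer}/{table}"


print(layer_path("HOMOL", "raw", "clientes"))
# ./datalake/raw/clientes
print(layer_path("PROD", "stage", "enderecos"))
# s3a://bkt-dev1-data-avaliacoes/stage/enderecos
```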

## TODO List

1. Fix the Athena notebook:
    - Unfortunately, `athena.ipynb` is not working because of this error:
        ```text
        InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Unable to verify/create output bucket bkt-dev1-data-avaliacoes
        ```
    - At the time of writing, the solution remains in the clutches of darkness ):
2. Log the removed rows in the raw ingestions:
    - This is not a hard feature to implement; it was left out due to lack of time ):
3. Make Spark sessions more robust:
    - The Jupyter kernel may need to be restarted between executions because Spark session creation sometimes freezes.
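
Regarding TODO item 1: that Athena error is commonly raised when `StartQueryExecution` is sent without a valid `ResultConfiguration.OutputLocation`, or when the caller cannot read/write the output bucket. This is not a confirmed diagnosis for this project, but a sketch of the request shape the notebook would need; the query, database, and results prefix below are illustrative:

```python
def build_query_request(query: str, database: str, bucket: str) -> dict:
    """Build kwargs for boto3's athena.start_query_execution().

    Athena requires an OutputLocation it can write query results to;
    a missing or inaccessible location is one common cause of the
    "Unable to verify/create output bucket" error. The
    "athena-results/" prefix is an assumption.
    """
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{bucket}/athena-results/"
        },
    }


# Usage (requires valid AWS credentials and bucket permissions):
# import boto3
# athena = boto3.client("athena", region_name="sa-east-1")
# athena.start_query_execution(**build_query_request(
#     "SELECT * FROM clientes LIMIT 10", "default", "bkt-dev1-data-avaliacoes"))
```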