
# OPEA Data Engineering Case

## Tech Stack

### Tools

- Python: 3.12.3
- Apache Spark: 4.1.1
- Java: 17.0.18

### Libraries

- pyspark 4.1.1
- jupyter 1.1.1
- pandas 3.0.0
- python-dotenv 1.2.1
- delta-spark 4.1.0
- boto3 1.42.53
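
Given pyspark and delta-spark in the stack, the notebooks presumably bootstrap a Delta-enabled Spark session along these lines. This is the standard delta-spark setup, not code taken from the repository; the app name and option set are assumptions, since the real session logic lives in `utils.py`:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Standard delta-spark bootstrap: register the Delta SQL extension and
# catalog, then let configure_spark_with_delta_pip add the Delta jars.
builder = (
    SparkSession.builder.appName("opea-case")  # app name is an assumption
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```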

## Project Structure

```text
├── analytics.ipynb
├── athena.ipynb
├── crawler.ipynb
├── raw_address.ipynb
├── raw_client.ipynb
├── stage_address.ipynb
├── stage_client.ipynb
├── utils.py
├── requirements.txt
├── README.md
├── data
│   └── dados_entrada 2.xlsx
└── logs
    ├── analytics_ingestion_clients.log
    ├── raw_ingestion_addresses.log
    ├── raw_ingestion_clients.log
    ├── stage_ingestion_addresses.log
    └── stage_ingestion_clients.log
```

## File Descriptions

- `utils.py`: Python module with the classes used by the .ipynb notebooks.
- `data/dados_entrada 2.xlsx`: XLSX file containing the client and address data.
- `README.md`: The file containing this description.
- `requirements.txt`: Text file with the required pip3 dependencies.

- `analytics.ipynb`: Jupyter notebook that ingests data from the Stage Layer into the Analytics Layer.
- `athena.ipynb`: Jupyter notebook that runs a query using Athena.
- `crawler.ipynb`: Jupyter notebook that creates an AWS Glue Crawler.
- `raw_address.ipynb`: Jupyter notebook that ingests the "enderecos" sheet of the XLSX into the Raw Layer.
- `raw_client.ipynb`: Jupyter notebook that ingests the "clientes" sheet of the XLSX into the Raw Layer.
- `stage_address.ipynb`: Jupyter notebook that ingests address data from the Raw Layer into the Stage Layer.
- `stage_client.ipynb`: Jupyter notebook that ingests client data from the Raw Layer into the Stage Layer.

- `logs/analytics_ingestion_clients.log`: Example of a successful execution of `analytics.ipynb`.
- `logs/raw_ingestion_addresses.log`: Example of a successful execution of `raw_address.ipynb`.
- `logs/raw_ingestion_clients.log`: Example of a successful execution of `raw_client.ipynb`.
- `logs/stage_ingestion_addresses.log`: Example of a successful execution of `stage_address.ipynb`.
- `logs/stage_ingestion_clients.log`: Example of a successful execution of `stage_client.ipynb`.

## How to Run

1. Ensure the tools listed in Tech Stack are installed.
2. Create and activate a virtual environment:
    `$ python3 -m venv .venv && source .venv/bin/activate`
3. Install the dependencies:
    `$ pip3 install -r requirements.txt`
4. Start Jupyter Notebook:
    `$ jupyter notebook`
5. Choose the .ipynb notebook to execute.
6. In the opened notebook, select Run > Run All Cells.

## Environment Variables

```text
S3_BUCKET=bkt-dev1-data-avaliacoes
AWS_REGION=sa-east-1
AWS_ACCESS_KEY_ID=<credencial_fornecida>
AWS_SECRET_ACCESS_KEY=<credencial_fornecida>
ENV=PROD
```

* P.S. 1: If `ENV` is not set, the "HOMOL" environment is assumed.
* P.S. 2: Set the variables above in a `.env` file, changing the values as needed.
* P.S. 3: It was requested to provide the AWS credentials here, but GitHub blocks pushing a file containing explicit `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values.
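
The fallback described in P.S. 1 can be sketched as a small helper. This is illustrative only; the actual logic lives in `utils.py` and may differ:

```python
import os


def resolve_env() -> str:
    """Return the target environment, defaulting to HOMOL when ENV is unset.

    Mirrors the behavior described in P.S. 1; the real implementation
    in utils.py may differ.
    """
    return os.getenv("ENV", "HOMOL").upper()


# With ENV unset, the pipeline targets the local HOMOL environment:
os.environ.pop("ENV", None)
print(resolve_env())  # HOMOL
```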

## Decisions Made

1. Jupyter Notebook was used to facilitate the analysis;
2. Reusable methods were centralized in a single Python file (`utils.py`) with four classes;
3. For the raw address ingestion, the notebook checks each `id_cliente` against the `clientes` table in the Raw Layer. So, for the sake of correctness, `raw_client.ipynb` must be run once before `raw_address.ipynb`;
4. A HOMOL environment was created to test the notebooks before executing them in the AWS environment;
5. Local directories are used in the HOMOL environment to simulate a Data Lake.
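
Decisions 4 and 5 suggest a path resolver that switches between S3 and a local directory tree. The sketch below is a hypothetical helper, not code from the repository: the local directory layout and the `s3a://` prefix are assumptions, while the bucket name comes from the environment variables above:

```python
def layer_path(env: str, layer: str, table: str) -> str:
    """Build the storage path for a table in a given layer.

    In PROD the data lives under the project's S3 bucket; in HOMOL a
    local directory tree simulates the Data Lake (decision 5). The
    local layout "./datalake/<layer>/<table>" is an assumption.
    """
    if env.upper() == "PROD":
        return f"s3a://bkt-dev1-data-avaliacoes/{layer}/{table}"
    return f"./datalake/{layer}/{table}"


print(layer_path("HOMOL", "raw", "clientes"))
# ./datalake/raw/clientes
print(layer_path("PROD", "stage", "enderecos"))
# s3a://bkt-dev1-data-avaliacoes/stage/enderecos
```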

## TODO List

1. Fix the Athena notebook:
    - Unfortunately, `athena.ipynb` is not working because of this error:
        ```text
        InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Unable to verify/create output bucket bkt-dev1-data-avaliacoes
        ```
    - At the time of writing, the solution remains in the clutches of darkness ):
2. Log the removed rows in the raw ingestions:
    - This is not a hard feature to implement; it was left out due to lack of time ):
3. Make Spark sessions more robust:
    - The Jupyter kernel may need to be restarted between executions because Spark session creation sometimes freezes.
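
Regarding TODO item 1: that Athena error is commonly raised when `StartQueryExecution` is sent without a valid `ResultConfiguration.OutputLocation`, or when the caller cannot read/write the output bucket. This is not a confirmed diagnosis for this project, but a sketch of the request shape the notebook would need; the query, database, and results prefix below are illustrative:

```python
def build_query_request(query: str, database: str, bucket: str) -> dict:
    """Build kwargs for boto3's athena.start_query_execution().

    Athena requires an OutputLocation it can write query results to;
    a missing or inaccessible location is one common cause of the
    "Unable to verify/create output bucket" error. The
    "athena-results/" prefix is an assumption.
    """
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{bucket}/athena-results/"
        },
    }


# Usage (requires valid AWS credentials and bucket permissions):
# import boto3
# athena = boto3.client("athena", region_name="sa-east-1")
# athena.start_query_execution(**build_query_request(
#     "SELECT * FROM clientes LIMIT 10", "default", "bkt-dev1-data-avaliacoes"))
```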