Data pipelines that integrate Brazilian geographic and demographic datasets with fast‑food locations to generate an expansion strategy for McDonald's in Brazil. [see report]
This project was developed using the following tools:
- 🔶 Kedro - Framework to create reproducible, maintainable, and modular data science code
- 🐍 PuLP - Linear and mixed integer programming library for optimization problems
- 📦 UV - Ultra-fast Python package manager
- 🚀 Just - Modern command runner with powerful features
- 💅 Ruff - Lightning-fast linter and formatter
- 🧪 Pytest - Testing framework with fixtures and plugins
- 🛫 Pre-commit - Hooks to ensure code quality and adherence to standards
- 🐳 Docker - Multi-stage build and distroless image
- 🔄 GitHub Actions - CI/CD pipeline
This project uses UV as its package manager and Just as its command runner. You need both installed on your system to work on this project.

Once you have UV and Just installed, run `just dev-sync` in your terminal to create a virtual environment and install all the dependencies.
If you want to build a production environment without the development dependencies, run `just prod-sync` instead.
Finally, to install the pre-commit hooks, run `just install-hooks`.
This project uses data about the Brazilian population published by the Brazilian Institute of Geography and Statistics (IBGE) and data scraped from the Brazilian McDonald's and Subway websites.
To download all the data needed for this project, you can either access this Google Drive URL or follow these instructions to get the data from the original sources:
- IBGE data:

  a. Population data: access IBGE's Population Estimates page and download:
     - "Estimativas_2020" > `POP2020_20220905.xls`
     - "Estimativas_2021" > `POP2021_20240624.xls`

  b. Cities GDP data: access IBGE's Downloads page and download:
     - "Pib_Municipios" > "2021" > "base" > `base_de_dados_2010_2021_xlsx.zip` (then unzip the file)

  c. Brazil's shapefiles: access IBGE's Municipal Mesh page, select "Editions" > "2021" > "More on the product", and download:
     - "Municipalities" > `BR_Municipios_2021.zip`
     - "Federation Units" > `BR_UF_2021.zip`
     - "Microregions" > `BR_Microrregioes_2021.zip`
     - "Mesoregions" > `BR_Mesorregioes_2021.zip`
- Fast-food restaurants data: location data for McDonald's and Subway restaurants was scraped from their websites using the code in `notebooks/scrape_restaurants.ipynb`. However, the scraping code might stop working as their websites change over time, so I have also included a copy of the scraped data in the `notebooks/scraped_data/` folder.
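As a minimal sketch, the scraped `mcdonalds.json` file could be loaded and flattened into simple location records like this. The field names (`name`, `latitude`, `longitude`) and the sample data below are illustrative assumptions, not the actual schema produced by the scraper:

```python
import json

# Hypothetical sample mimicking the structure of the scraped mcdonalds.json;
# the real field names may differ depending on what the website returned.
SAMPLE = """
[
  {"name": "McDonald's Paulista", "latitude": -23.5614, "longitude": -46.6559},
  {"name": "McDonald's Copacabana", "latitude": -22.9711, "longitude": -43.1822}
]
"""

def load_restaurants(raw_json: str) -> list[tuple[str, float, float]]:
    """Parse scraped restaurant JSON into (name, lat, lon) tuples."""
    records = json.loads(raw_json)
    return [
        (r["name"], float(r["latitude"]), float(r["longitude"]))
        for r in records
    ]

restaurants = load_restaurants(SAMPLE)
print(restaurants[0])
```

In the project itself, loading and cleaning steps like this are handled by the Kedro pipelines rather than ad-hoc scripts.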
After downloading the necessary files, move them to the data/01_raw/ folder, which should look like this:
data/01_raw
├── BR_Mesorregioes_2021.zip
├── BR_Microrregioes_2021.zip
├── BR_Municipios_2021.zip
├── BR_UF_2021.zip
├── mcdonalds.json
├── PIB dos Municípios - base de dados 2010-2021.xlsx
├── POP2020_20220905.xls
├── POP2021_20240624.xls
└── subway.html
The final output of this project's pipelines is a report on "McDonald's Expansion Opportunities in Brazil", which can be found in `data/08_reporting/final_report.md`.
You can run the full set of Kedro pipelines in this project (`process_data`, `merge_data`, and `build_report`) with:

```shell
kedro run
```

If you want to run a specific pipeline, use the `--pipeline` option. For example, to run the `process_data` pipeline:

```shell
kedro run --pipeline process_data
```

Similarly, you can run specific nodes or tags by using the `--nodes` and/or `--tags` options followed by the name(s) of the node(s) or tag(s) you want to run.
You can visualize the datasets, nodes, and connections of the Kedro pipelines in this project by running:

```shell
kedro viz --autoreload
```

The image below shows the pipeline visualization for this project:
- Run `just format` to format your code
- Run `just lint` to run the linter
- Run `just test` to run the tests
- Run `just validate` to run all of the above (`format`, `lint`, and `test`)
You can configure Ruff by editing the `.ruff.toml` file. It is currently set to the default configuration.
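If you later want to deviate from the defaults, a `.ruff.toml` might look something like the fragment below. These particular selections are illustrative, not the project's actual settings:

```toml
line-length = 88

[lint]
# "E" = pycodestyle errors, "F" = pyflakes, "I" = isort-style import sorting
select = ["E", "F", "I"]
```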
Have a look at the file `src/tests/test_run.py` for instructions on how to write your tests. You can configure the coverage threshold in your project's `pyproject.toml` file under the `[tool.coverage.report]` section.
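As a sketch of what a test under `src/tests/` might look like, here is a pytest-style example. The helper function and its assertion are hypothetical, standing in for real pipeline logic:

```python
# Pytest collects any function whose name starts with "test_" and treats
# a failing bare `assert` as a test failure; no imports are required for
# a simple test like this one.

def population_density(population: int, area_km2: float) -> float:
    """Toy helper standing in for real pipeline logic."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return population / area_km2

def test_population_density():
    assert population_density(12_000, 4.0) == 3000.0
```

Running `just test` would discover and execute a test like this automatically.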
This project includes a multi-stage Dockerfile, which produces an image with the code and the dependencies installed. You can build the image with:

```shell
just docker-build
```

Then, you can run the Kedro pipelines inside a container with the image you just built:

```shell
just docker-run
```

The outputs of the pipelines will still be saved in your local `data/` folder, because the container mounts `data/` as a volume.
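The Dockerfile itself is not reproduced here, but a multi-stage build with UV and a distroless runtime, as described above, typically follows this shape. The image tags, paths, and entrypoint below are illustrative assumptions, not the project's actual Dockerfile:

```dockerfile
# Build stage: install dependencies into a virtual environment with UV
FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .

# Runtime stage: distroless image carrying only the venv and project code
FROM gcr.io/distroless/python3-debian12
WORKDIR /app
COPY --from=builder /app /app
ENV PATH="/app/.venv/bin:$PATH"
ENTRYPOINT ["python", "-m", "kedro", "run"]
```

The split keeps build tooling (UV, compilers) out of the final image, which is why the distroless runtime stays small.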
This project includes a GitHub Actions workflow that runs the formatters, linters, and tests on every push/PR to the main or develop branches. You can find the workflow file in `.github/workflows/format-lint-test.yml`.