
Geolocation Pipeline

Data pipelines that integrate Brazilian geographic and demographic datasets with fast‑food locations to generate an expansion strategy for McDonald's in Brazil. [see report]

This project was developed using the following tools:

  • 🔶 Kedro - Framework to create reproducible, maintainable, and modular data science code
  • 🐍 PuLP - Linear and mixed integer programming library for optimization problems
  • 📦 UV - Ultra-fast Python package manager
  • 🚀 Just - Modern command runner with powerful features
  • 💅 Ruff - Lightning-fast linter and formatter
  • 🧪 Pytest - Testing framework with fixtures and plugins
  • 🛫 Pre-commit - Hooks to ensure code quality and adherence to standards
  • 🐳 Docker - Multi-stage build and distroless image
  • 🔄 GitHub Actions - CI/CD pipeline
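PuLP handles the optimization behind the expansion strategy. As a minimal sketch of the kind of binary-choice model it supports (the city names, scores, costs, and budget below are hypothetical illustrations, not the project's actual formulation):

```python
import pulp

# Hypothetical inputs: attractiveness scores and opening costs per city.
cities = ["Sao Paulo", "Curitiba", "Recife"]
score = {"Sao Paulo": 9.0, "Curitiba": 6.5, "Recife": 7.2}
cost = {"Sao Paulo": 3, "Curitiba": 1, "Recife": 2}
budget = 4

# Maximize the total score of opened stores, subject to a budget constraint.
prob = pulp.LpProblem("store_selection", pulp.LpMaximize)
open_store = pulp.LpVariable.dicts("open", cities, cat="Binary")
prob += pulp.lpSum(score[c] * open_store[c] for c in cities)
prob += pulp.lpSum(cost[c] * open_store[c] for c in cities) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = sorted(c for c in cities if open_store[c].value() == 1)
```

The binary decision variables make this a mixed-integer program, which is exactly the class of problem PuLP (with its bundled CBC solver) is designed for.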

Setup

Environment

This project uses UV as its package manager and Just as its command runner. You need both installed on your system to work on this project.

Once you have UV and Just installed, you can run just dev-sync in your terminal to create a virtual environment and install all the dependencies.

If you want to build a production environment without the development dependencies, you can run just prod-sync instead.

Finally, to install the pre-commit hooks, run just install-hooks.

Data

This project uses data about the Brazilian population published by the Brazilian Institute of Geography and Statistics (IBGE) and data scraped from the Brazilian McDonald's and Subway websites.

To download all the data needed for this project, you can either access this Google Drive URL or follow these instructions to get the data from the original sources:

  1. IBGE data

    a. Population data: access IBGE's Population Estimates page and download:

    • "Estimativas_2020" > POP2020_20220905.xls
    • "Estimativas_2021" > POP2021_20240624.xls

    b. Cities GDP data: access IBGE's Downloads page and download:

    • "Pib_Municipios" > "2021" > "base" > base_de_dados_2010_2021_xlsx.zip (then unzip the file)

    c. Brazil's shapefiles: access IBGE's Municipal Mesh page, select "Editions" > "2021" > "More on the product", and download:

    • "Municipalities" > BR_Municipios_2021.zip
    • "Federation Units" > BR_UF_2021.zip
    • "Microregions" > BR_Microrregioes_2021.zip
    • "Mesoregions" > BR_Mesorregioes_2021.zip
  2. Fast-food restaurants data: Location data for McDonald's and Subway restaurants was scraped from their websites using the code in notebooks/scrape_restaurants.ipynb. However, the scraping code may stop working if their websites change over time, so I have also included a copy of the scraped data in the notebooks/scraped_data/ folder.
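Once downloaded, the scraped files can be parsed with standard tooling. For example, assuming a hypothetical record layout for mcdonalds.json (the real scraped schema may differ), the restaurant coordinates could be extracted like this:

```python
import json

# Hypothetical layout for mcdonalds.json; the real scraped schema may differ.
raw = """
[
  {"name": "McDonald's Paulista", "city": "Sao Paulo", "lat": -23.561, "lon": -46.655},
  {"name": "McDonald's Centro", "city": "Curitiba", "lat": -25.429, "lon": -49.271}
]
"""

restaurants = json.loads(raw)
coordinates = [(r["city"], r["lat"], r["lon"]) for r in restaurants]
```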

After downloading the necessary files, move them to the data/01_raw/ folder, which should look like this:

data/01_raw
├── BR_Mesorregioes_2021.zip
├── BR_Microrregioes_2021.zip
├── BR_Municipios_2021.zip
├── BR_UF_2021.zip
├── mcdonalds.json
├── PIB dos Municípios - base de dados 2010-2021.xlsx
├── POP2020_20220905.xls
├── POP2021_20240624.xls
└── subway.html

Usage

The final output of this project's pipelines is a report on "McDonald's Expansion Opportunities in Brazil", which can be found in data/08_reporting/final_report.md.

Running the Pipelines

You can run the full set of Kedro pipelines in this project (process_data, merge_data, and build_report) with:

kedro run

If you want to run a specific pipeline, you can use the --pipeline option. For example, to run the process_data pipeline:

kedro run --pipeline process_data

Similarly, you can run specific nodes or tags by passing the --nodes and/or --tags options followed by the name(s) of the node(s) or tag(s) you want to run, e.g. kedro run --nodes <node_name>.

Visualizing the Pipelines

You can visualize the datasets, nodes, and connections of the Kedro pipelines in this project by running the following command:

kedro viz --autoreload

The image below shows the pipeline visualization for this project:

Formatting, Linting, and Testing

  • Run just format to format your code
  • Run just lint to run the linter
  • Run just test to run the tests
  • Run just validate to run all of the above (format, lint, and test)

You can configure Ruff by editing the .ruff.toml file. It is currently set to the default configuration.

Have a look at the file src/tests/test_run.py for instructions on how to write your tests. You can configure the coverage threshold in your project's pyproject.toml file under the [tool.coverage.report] section.
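As a sketch of what a node-level test might look like (the node function here is hypothetical, not one of the project's actual nodes, which live under src/):

```python
# Hypothetical node function; the project's real nodes live under src/.
def population_density(population: int, area_km2: float) -> float:
    """Return inhabitants per square kilometre."""
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return population / area_km2


def test_population_density():
    assert population_density(1_000_000, 500.0) == 2000.0


def test_population_density_rejects_zero_area():
    try:
        population_density(100, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Running just test (or pytest directly) will collect any function whose name starts with test_.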

Docker

This project includes a multi-stage Dockerfile, which produces an image with the code and the dependencies installed. You can build the image with:

just docker-build

Then, you can run the Kedro pipelines inside a container with the image you just built by running:

just docker-run

The outputs of the pipelines will still be saved in your local data/ folder, because the Docker container mounts the data/ folder as a volume.

GitHub Actions

This project includes a GitHub Actions workflow that runs the formatters, linters, and tests on every push and pull request to the main or develop branches. You can find the workflow file in .github/workflows/format-lint-test.yml.
