This repository contains a Python application that automates the process of scraping, uploading, and storing publications from the city's official gazette.
The application is containerized using Docker and orchestrated with Docker Compose, ensuring a consistent and easy-to-manage development and execution environment. Libraries and tools aligned with best practices and the modern Python ecosystem were selected, aiming for a better development experience, greater robustness, and ease of future maintenance.
The full implementation can be accessed at the following addresses:
- The documentation with a Swagger graphical interface at: https://foakz.ddns.net/api/docs
- The endpoint to list all entries: https://foakz.ddns.net/api/gazettes/
- Filtering entries by date (e.g., month 7, year 2025): https://foakz.ddns.net/api/gazettes/?month=7&year=2025
- Paginating 10 entries: https://foakz.ddns.net/api/gazettes/?skip=0&limit=10
- Scraper: Uses Selenium to navigate the https://www.natal.rn.gov.br/dom website and download all official gazette publications from the month prior to the current month.
- Uploader: Uploads the downloaded PDF files to https://0x0.st and retrieves their public URLs. The upload service, 0x0.st, is a temporary file host known as THE NULL POINTER.
- Database Storage: Stores the public URLs and publication dates in a table in a PostgreSQL database.
- Idempotent Storage: The record creation function in the database (`crud.py`) checks for duplicates by URL, ensuring that the same publication is not stored multiple times, making the operation idempotent.
- Public API: A FastAPI application that provides endpoints to list and filter the stored official gazette records.
- Orchestrator: A main script that executes the entire end-to-end workflow, which could, for example, be executed by a cron job or an API call.
- Language: Python 3.13
- Package Manager: uv
- Web Data Collection: Selenium
- API Framework: FastAPI
- Database: PostgreSQL
- ORM: SQLAlchemy
- HTTP Client: httpx
- Containerization: Docker, Docker Compose
The project follows a src layout to separate the business logic from configuration files, maintaining organization and aiming to honor Python's packaging best practices.
/
├── docker-compose.yml
├── docker-entrypoint.sh # Entry script for the Docker container
├── Dockerfile
├── main.py # Entry point of the FastAPI application
├── pyproject.toml
├── README.md
├── .env.example # Example environment variables file
├── .python-version # Defines the Python version used by the project (e.g., for pyenv)
├── requirements.txt # Project dependencies
├── uv.lock # uv dependency lock file for reproducible builds
├── src/
│ └── gazette_scraper/
│ ├── __init__.py
│ ├── api_client.py # Client to communicate with our own API
│ ├── config.py # Manages and validates environment variables using Pydantic Settings, ensuring robustness and typing.
│ ├── crud.py # Database Read/Write operations
│ ├── database.py # DB session management
│ ├── initialize_db.py # Script to create the initial tables
│ ├── main_runner.py # Main workflow orchestrator
│ ├── models.py # SQLAlchemy table models
│ ├── schemas.py # Pydantic data models
│ ├── scraper.py # Scraping logic with Selenium
│ └── uploader.py # File upload logic
└── tests/
├── conftest.py # Pytest configurations and fixtures
└── test_api.py # Tests for the API
This project aims to adhere to good development practices, including:
- Comprehensive Docstrings: I have tried to implement clear and informative docstrings in all critical functions and methods, describing their purpose, arguments, and return values.
- Detailed Logging: The application uses a configured logging system to record important events, facilitating the tracking and debugging of the execution flow.
- Resource Management with Context Managers: The Scraper uses the context manager protocol (`with` statement) to ensure that the Selenium WebDriver is initialized and, crucially, cleanly and automatically closed, even in case of errors, preventing resource leaks.
- Concurrent Processing with Multithreading: To optimize efficiency in scraping and uploading files, concurrent processing using multithreading was implemented. This approach was chosen over asynchronous solutions due to the synchronous nature of the Selenium library. The number of workers was limited to avoid 429 (Too Many Requests) responses from the target servers, ensuring more stable operations.
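The combination of the two practices above can be sketched like this. `Scraper` and its dictionary "driver" are placeholders; the real class wraps a Selenium `webdriver` instance and calls `driver.quit()` in `__exit__`.

```python
# Sketch of the context-manager + bounded-thread-pool pattern described above.
# The dict stands in for a Selenium WebDriver so the example is self-contained.
from concurrent.futures import ThreadPoolExecutor

class Scraper:
    def __enter__(self):
        self.driver = {"open": True}  # placeholder for webdriver.Chrome()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.driver["open"] = False  # real code calls self.driver.quit()
        return False  # never swallow exceptions

    def download(self, url: str) -> str:
        return f"downloaded {url}"  # placeholder for the Selenium navigation

urls = [f"https://example.invalid/doc{i}.pdf" for i in range(5)]
with Scraper() as scraper:
    # Cap the workers so the target server does not answer with HTTP 429.
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(scraper.download, urls))
```

Even if `pool.map` raises, the outer `with` guarantees `__exit__` runs and the driver is released.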
I made the architectural decision to handle communication between the orchestrator (main_runner.py) and the FastAPI application (main.py) via HTTP calls to the API, rather than direct calls to the database layer functions. This approach prioritizes:
- Separation of Concerns: Each component (scraper, uploader, API, orchestrator) has a distinct and well-defined responsibility.
- API as a Contract: FastAPI defines a clear and consistent interface for data interaction, ensuring that all clients (including `main_runner.py`) adhere to the same rules and validations.
- Independent Scalability and Deployment: Components can be scaled and deployed independently, offering flexibility for future growth.
- Robustness and Maintainability: Changes in one component (e.g., switching the database from PostgreSQL to MySQL) do not necessarily impact others, as long as the API contract remains stable.
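What "API as a contract" means in practice can be sketched as follows: the orchestrator only knows the HTTP payload shape the API publishes, never the database layer. The field names here are illustrative; the actual schema lives in `schemas.py`.

```python
# Sketch of the orchestrator-side contract: build the JSON body that would
# be POSTed to /api/gazettes/. Field names are hypothetical examples.
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class GazetteCreate:
    url: str
    publication_date: str  # ISO 8601 string, as Pydantic would expect

def build_create_request(url: str, pub_date: date) -> bytes:
    """Serialize the request body for the create-gazette endpoint."""
    payload = GazetteCreate(url=url, publication_date=pub_date.isoformat())
    return json.dumps(asdict(payload)).encode()

body = build_create_request("https://0x0.st/abc.pdf", date(2025, 7, 1))
```

Because `main_runner.py` only depends on this serialized shape, swapping the storage backend behind the API leaves the orchestrator untouched.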
- Docker
- Docker Compose
- Clone the repository:

  ```shell
  git clone git@github.com:felixoakz/gazette_scraper.git   # via SSH
  git clone https://github.com/felixoakz/gazette_scraper.git   # via HTTPS
  cd gazette_scraper
  ```
- Create the environment file: Copy the example environment file. The default values are already configured to work with `docker-compose.yml`.

  ```shell
  cp .env.example .env
  ```
- Build and start the services: This command will build the `app` image and start the `app`, `db`, and `selenium` containers in detached mode.

  ```shell
  docker-compose up -d --build
  ```
The database tables will be created and application logging will be configured automatically on the `app` container's startup through the `docker-entrypoint.sh` script.
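The table-creation step the entrypoint performs amounts to an idempotent "create if missing" initialization, so repeated container restarts are safe. The sketch below uses stdlib `sqlite3` only to stay self-contained; the project's `initialize_db.py` does the equivalent with SQLAlchemy models against PostgreSQL, and the column names here are illustrative.

```python
# Sketch of an idempotent schema initialization: safe to run on every start.
import sqlite3

def initialize_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gazettes (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE NOT NULL,
               publication_date TEXT NOT NULL
           )"""
    )

conn = sqlite3.connect(":memory:")
initialize_db(conn)
initialize_db(conn)  # second run is a no-op, like a container restart
```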
With the Docker services running, execute the main workflow script inside the app container:
```shell
docker-compose exec app python -m src.gazette_scraper.main_runner
```

This will trigger the complete process:
- Fetching the PDFs from the website.
- Uploading the files to 0x0.st.
- Storing the resulting URLs in the PostgreSQL database via the API.
The API will be accessible locally at http://localhost:8000 (or on the APP_PORT you defined in .env).
- Interactive Documentation (Swagger UI): Access http://localhost:8000/docs to view the interactive API documentation, which includes all endpoints, their details, and a graphical interface to test requests directly in the browser.
- API Base URL: http://localhost:8000/api
`GET /api/gazettes/`: Lists all stored official gazettes.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/' \
  -H 'accept: application/json'
```

`GET /api/gazettes/?month=7&year=2025`: Filters official gazettes by month and year.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/?month=7&year=2025' \
  -H 'accept: application/json'
```
`GET /api/gazettes/?skip=0&limit=15`: Paginates the results.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/?skip=0&limit=15' \
  -H 'accept: application/json'
```
`POST /api/gazettes/`: Creates a new official gazette entry (used internally by `main_runner`).
The tests are written with pytest and use mocks to simulate database interactions, allowing them to run independently, without the Docker containers up. This approach is also simpler than using an in-memory SQLite database.
The tests/conftest.py file is necessary for pytest to locate the application modules (like main.py), resolving import issues during test execution.
- Install development dependencies: From the project root, install the main and development (`pytest`) dependencies:

  ```shell
  uv pip install -e '.[dev]'
  ```
- Run the test suite: With the dependencies installed, run `pytest`:

  ```shell
  pytest
  ```
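The mocking approach can be sketched as below: the endpoint logic is exercised with its database dependency replaced by a stub, so no running PostgreSQL is needed. The handler and CRUD names here are illustrative stand-ins, not the project's actual test code.

```python
# Sketch of testing with a mocked database dependency (no containers needed).
from unittest.mock import MagicMock

def list_gazettes(db, skip=0, limit=10):
    """Stand-in for the endpoint's logic: delegate to the CRUD layer."""
    return db.get_gazettes(skip=skip, limit=limit)

def test_list_gazettes_uses_crud_layer():
    db = MagicMock()
    db.get_gazettes.return_value = [{"url": "https://0x0.st/abc.pdf"}]
    result = list_gazettes(db, skip=0, limit=10)
    db.get_gazettes.assert_called_once_with(skip=0, limit=10)
    assert result[0]["url"] == "https://0x0.st/abc.pdf"

test_list_gazettes_uses_crud_layer()
```

The same idea scales to FastAPI's dependency overrides, where the real session provider is swapped for a mock at test time.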
This section lists improvements and features that could be considered for the future of the project:
- Enhanced Error Handling: Implement retry and circuit breaker strategies for external operations (scraping, upload, API calls), making the application more fault-tolerant, as I noticed that the city's website does not respond well to some sequential calls.
- Production Deployment: Implement a CI/CD pipeline to automate deployment to production environments according to the branch, using tools like Gunicorn with Uvicorn workers as the ASGI server.
- Test Improvement: Expand test coverage with more unit, integration, and end-to-end tests to ensure the application's robustness.
- DB Connection Management: Refactor the database session management in `main.py` using decorators or more advanced FastAPI dependencies for cleaner and more efficient control.
- Database Migrations: Adopt a database migration tool (e.g., Alembic) to manage schema changes in a controlled and versioned manner.
- API Authentication and Authorization: Implement security mechanisms (e.g., OAuth2, JWT) to protect the API endpoints if it is exposed publicly.
- Asynchronous Processing with Message Queues: Integrate message queues (e.g., RabbitMQ, Kafka) to decouple the scraping/upload process from the API, allowing for scalability and resilience.
- Monitoring and Alerting: Add monitoring tools (e.g., Prometheus, Grafana) to track the application's health and configure alerts for anomalies or critical failures.
- Optimization for Large Volume of Files: If the volume of files to be processed increases significantly, consider transitioning from batch processing to per-file processing (download, upload, insert) with internal concurrency. This would optimize resource usage (memory and disk) and allow for a more continuous flow, although it would require more granular concurrency orchestration.