This repository contains a Python application that automates the process of scraping, uploading, and storing publications from the city's official gazette.
The application is containerized using Docker and orchestrated with Docker Compose, ensuring a consistent and easy-to-manage development and execution environment. Libraries and tools aligned with best practices and the modern Python ecosystem were selected, aiming for a better development experience, greater robustness, and ease of future maintenance.
The full implementation can be accessed at the following addresses:
- The documentation with a Swagger graphical interface at: https://foakz.ddns.net/api/docs
- The endpoint to list all entries: https://foakz.ddns.net/api/gazettes/
- Filtering entries by date (e.g., month 7, year 2025): https://foakz.ddns.net/api/gazettes/?month=7&year=2025
- Paginating 10 entries: https://foakz.ddns.net/api/gazettes/?skip=0&limit=10
- Scraper: Uses Selenium to navigate the https://www.natal.rn.gov.br/dom website and download all official gazette publications from the month prior to the current month.
- Uploader: Uploads the downloaded PDF files to https://0x0.st and retrieves their public URLs. The upload service, 0x0.st, is a temporary file host known as THE NULL POINTER.
- Database Storage: Stores the public URLs and publication dates in a table in a PostgreSQL database.
- Idempotent Storage: The record creation function in the database (`crud.py`) checks for duplicates by URL, ensuring that the same publication is not stored multiple times, making the operation idempotent.
- Public API: A FastAPI application that provides endpoints to list and filter the stored official gazette records.
- Orchestrator: A main script that executes the entire end-to-end workflow, which could, for example, be executed by a cron job or an API call.
- Language: Python 3.13
- Package Manager: uv
- Web Data Collection: Selenium
- API Framework: FastAPI
- Database: PostgreSQL
- ORM: SQLAlchemy
- HTTP Client: httpx
- Containerization: Docker, Docker Compose
The project follows a src layout to separate the business logic from configuration files, maintaining organization and aiming to honor Python's packaging best practices.
/
├── docker-compose.yml
├── docker-entrypoint.sh # Entry script for the Docker container
├── Dockerfile
├── main.py # Entry point of the FastAPI application
├── pyproject.toml
├── README.md
├── .env.example # Example environment variables file
├── .python-version # Defines the Python version used by the project (e.g., for pyenv)
├── requirements.txt # Project dependencies
├── uv.lock # uv dependency lock file for reproducible builds
├── src/
│ └── gazette_scraper/
│ ├── __init__.py
│ ├── api_client.py # Client to communicate with our own API
│ ├── config.py # Manages and validates environment variables using Pydantic Settings, ensuring robustness and typing.
│ ├── crud.py # Database Read/Write operations
│ ├── database.py # DB session management
│ ├── initialize_db.py # Script to create the initial tables
│ ├── main_runner.py # Main workflow orchestrator
│ ├── models.py # SQLAlchemy table models
│ ├── schemas.py # Pydantic data models
│ ├── scraper.py # Scraping logic with Selenium
│ └── uploader.py # File upload logic
└── tests/
├── conftest.py # Pytest configurations and fixtures
└── test_api.py # Tests for the API
This project aims to adhere to good development practices, including:
- Comprehensive Docstrings: I have tried to implement clear and informative docstrings in all critical functions and methods, describing their purpose, arguments, and return values.
- Detailed Logging: The application uses a configured logging system to record important events, facilitating the tracking and debugging of the execution flow.
- Resource Management with Context Managers: The Scraper uses the context manager protocol (`with` statement) to ensure that the Selenium WebDriver is initialized and, crucially, cleanly and automatically closed, even in case of errors, preventing resource leaks.
- Concurrent Processing with Multithreading: To optimize efficiency in scraping and uploading files, concurrent processing using multithreading was implemented. This approach was chosen over asynchronous solutions due to the synchronous nature of the Selenium library. The number of workers was limited to avoid 429 (Too Many Requests) responses from the target servers, ensuring more stable operations.
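The combination of the two practices above can be sketched like this. `Scraper` and its dictionary "driver" are placeholders; the real class wraps a Selenium `webdriver` instance and calls `driver.quit()` in `__exit__`.

```python
# Sketch of the context-manager + bounded-thread-pool pattern described above.
# The dict stands in for a Selenium WebDriver so the example is self-contained.
from concurrent.futures import ThreadPoolExecutor

class Scraper:
    def __enter__(self):
        self.driver = {"open": True}  # placeholder for webdriver.Chrome()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.driver["open"] = False  # real code calls self.driver.quit()
        return False  # never swallow exceptions

    def download(self, url: str) -> str:
        return f"downloaded {url}"  # placeholder for the Selenium navigation

urls = [f"https://example.invalid/doc{i}.pdf" for i in range(5)]
with Scraper() as scraper:
    # Cap the workers so the target server does not answer with HTTP 429.
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(scraper.download, urls))
```

Even if `pool.map` raises, the outer `with` guarantees `__exit__` runs and the driver is released.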
I made the architectural decision to handle communication between the orchestrator (main_runner.py) and the FastAPI application (main.py) via HTTP calls to the API, rather than direct calls to the database layer functions. This approach prioritizes:
- Separation of Concerns: Each component (scraper, uploader, API, orchestrator) has a distinct and well-defined responsibility.
- API as a Contract: FastAPI defines a clear and consistent interface for data interaction, ensuring that all clients (including `main_runner.py`) adhere to the same rules and validations.
- Independent Scalability and Deployment: Components can be scaled and deployed independently, offering flexibility for future growth.
- Robustness and Maintainability: Changes in one component (e.g., switching the database from PostgreSQL to MySQL) do not necessarily impact others, as long as the API contract remains stable.
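What "API as a contract" means in practice can be sketched as follows: the orchestrator only knows the HTTP payload shape the API publishes, never the database layer. The field names here are illustrative; the actual schema lives in `schemas.py`.

```python
# Sketch of the orchestrator-side contract: build the JSON body that would
# be POSTed to /api/gazettes/. Field names are hypothetical examples.
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class GazetteCreate:
    url: str
    publication_date: str  # ISO 8601 string, as Pydantic would expect

def build_create_request(url: str, pub_date: date) -> bytes:
    """Serialize the request body for the create-gazette endpoint."""
    payload = GazetteCreate(url=url, publication_date=pub_date.isoformat())
    return json.dumps(asdict(payload)).encode()

body = build_create_request("https://0x0.st/abc.pdf", date(2025, 7, 1))
```

Because `main_runner.py` only depends on this serialized shape, swapping the storage backend behind the API leaves the orchestrator untouched.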
- Docker
- Docker Compose
- Clone the repository:

  ```shell
  git clone git@github.com:felixoakz/gazette_scraper.git   # via SSH
  git clone https://github.com/felixoakz/gazette_scraper.git   # via HTTPS
  cd gazette_scraper
  ```
- Create the environment file: Copy the example environment file. The default values are already configured to work with `docker-compose.yml`.

  ```shell
  cp .env.example .env
  ```
- Build and start the services: This command will build the `app` image and start the `app`, `db`, and `selenium` containers in detached mode.

  ```shell
  docker-compose up -d --build
  ```
The database tables will be created and application logging will be configured automatically on the `app` container's startup through the `docker-entrypoint.sh` script.
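The table-creation step the entrypoint performs amounts to an idempotent "create if missing" initialization, so repeated container restarts are safe. The sketch below uses stdlib `sqlite3` only to stay self-contained; the project's `initialize_db.py` does the equivalent with SQLAlchemy models against PostgreSQL, and the column names here are illustrative.

```python
# Sketch of an idempotent schema initialization: safe to run on every start.
import sqlite3

def initialize_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gazettes (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE NOT NULL,
               publication_date TEXT NOT NULL
           )"""
    )

conn = sqlite3.connect(":memory:")
initialize_db(conn)
initialize_db(conn)  # second run is a no-op, like a container restart
```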
With the Docker services running, execute the main workflow script inside the app container:
```shell
docker-compose exec app python -m src.gazette_scraper.main_runner
```

This will trigger the complete process:
- Fetching the PDFs from the website.
- Uploading the files to 0x0.st.
- Storing the resulting URLs in the PostgreSQL database via the API.
The API will be accessible locally at http://localhost:8000 (or on the APP_PORT you defined in .env).
- Interactive Documentation (Swagger UI): Access http://localhost:8000/docs to view the interactive API documentation, which includes all endpoints, their details, and a graphical interface to test requests directly in the browser.
- API Base URL: http://localhost:8000/api
`GET /api/gazettes/`: Lists all stored official gazettes.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/' \
  -H 'accept: application/json'
```

`GET /api/gazettes/?month=7&year=2025`: Filters official gazettes by month and year.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/?month=7&year=2025' \
  -H 'accept: application/json'
```
`GET /api/gazettes/?skip=0&limit=15`: Paginates the results.

```shell
curl -X 'GET' \
  'http://localhost:8000/api/gazettes/?skip=0&limit=15' \
  -H 'accept: application/json'
```
`POST /api/gazettes/`: Creates a new official gazette entry (used internally by `main_runner`).
The tests are written with pytest and use mocks to simulate database interactions, allowing them to run independently, without the Docker containers up. This approach is also simpler than using an in-memory SQLite database.
The tests/conftest.py file is necessary for pytest to locate the application modules (like main.py), resolving import issues during test execution.
- Install development dependencies: From the project root, install the main and development (`pytest`) dependencies:

  ```shell
  uv pip install -e '.[dev]'
  ```
- Run the test suite: With the dependencies installed, run `pytest`:

  ```shell
  pytest
  ```
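The mocking approach can be sketched as below: the endpoint logic is exercised with its database dependency replaced by a stub, so no running PostgreSQL is needed. The handler and CRUD names here are illustrative stand-ins, not the project's actual test code.

```python
# Sketch of testing with a mocked database dependency (no containers needed).
from unittest.mock import MagicMock

def list_gazettes(db, skip=0, limit=10):
    """Stand-in for the endpoint's logic: delegate to the CRUD layer."""
    return db.get_gazettes(skip=skip, limit=limit)

def test_list_gazettes_uses_crud_layer():
    db = MagicMock()
    db.get_gazettes.return_value = [{"url": "https://0x0.st/abc.pdf"}]
    result = list_gazettes(db, skip=0, limit=10)
    db.get_gazettes.assert_called_once_with(skip=0, limit=10)
    assert result[0]["url"] == "https://0x0.st/abc.pdf"

test_list_gazettes_uses_crud_layer()
```

The same idea scales to FastAPI's dependency overrides, where the real session provider is swapped for a mock at test time.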
This section lists improvements and features that could be considered for the future of the project:
- Enhanced Error Handling: Implement retry and circuit breaker strategies for external operations (scraping, upload, API calls), making the application more fault-tolerant, as I noticed that the city's website does not respond well to some sequential calls.
- Production Deployment: Implement a CI/CD pipeline to automate deployment to production environments according to the branch, using tools like Gunicorn with Uvicorn workers as the ASGI server.
- Test Improvement: Expand test coverage with more unit, integration, and end-to-end tests to ensure the application's robustness.
- DB Connection Management: Refactor the database session management in `main.py` using decorators or more advanced FastAPI dependencies for cleaner and more efficient control.
- Database Migrations: Adopt a database migration tool (e.g., Alembic) to manage schema changes in a controlled and versioned manner.
- API Authentication and Authorization: Implement security mechanisms (e.g., OAuth2, JWT) to protect the API endpoints if it is exposed publicly.
- Asynchronous Processing with Message Queues: Integrate message queues (e.g., RabbitMQ, Kafka) to decouple the scraping/upload process from the API, allowing for scalability and resilience.
- Monitoring and Alerting: Add monitoring tools (e.g., Prometheus, Grafana) to track the application's health and configure alerts for anomalies or critical failures.
- Optimization for Large Volume of Files: If the volume of files to be processed increases significantly, consider transitioning from batch processing to per-file processing (download, upload, insert) with internal concurrency. This would optimize resource usage (memory and disk) and allow for a more continuous flow, although it would require more granular concurrency orchestration.