
Commit a18227e

AlfredNwolisa and cka-y authored
Feat: transitFeedSyncProcessing implementation (#819)
* feat: Add Transitland feed sync processor. This commit:
  - Implements feed sync processing for Pub/Sub messages
  - Ensures database consistency during sync operations
  - Adds configuration files for feed sync settings
  - Includes comprehensive test coverage
  - Documents sync process and configuration options
* lint fix
* Refactor to use SQLAlchemy models for database operations: replaced raw SQL queries with SQLAlchemy ORM models for handling database operations in feed processing. Enhanced test coverage and updated mock configurations to align with the new ORM-based approach.
* Remove unused freeze_time import from tests
* Update functions-python/feed_sync_process_transitland/src/main.py (co-authored by cka-y)
* Refactor FeedProcessor for enhanced logging and error handling: replaced custom logger setup with unified Logger class. Improved error handling and rollback in database transactions. Added location support and refined feed ID management. Updated test cases to reflect these changes.
* Update logging and refactor feed processing: replaced direct logger calls with a unified log_message function to support both local and GCP logging. Refactored the test cases to mock enhanced logging and implemented new test scenarios to cover additional edge cases, ensuring robustness in feed processing.
* lint fix
* added pycountry to requirements.txt
* added additional test cases & included pycountry in requirements.txt
* fix
* Add detailed error handling and checks for feed creation: refactored test coverage for feed processing, publish to batch topic, and event processing scenarios.
* Refactor mocking of PublisherClient in test setup
* Update requirements: move pycountry to helpers
* Update requirements: pycountry
* Handle empty country name in get_country_code function
* Update test log message for empty country code
* fix: last test

Co-authored-by: cka-y <[email protected]>
Co-authored-by: cka-y <[email protected]>
1 parent 70f6dda commit a18227e

File tree

16 files changed: +1994 −31 lines changed

functions-python/batch_process_dataset/requirements.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -21,4 +21,4 @@ google-api-core
 google-cloud-firestore
 google-cloud-datastore
 google-cloud-bigquery
-cloudevents~=1.10.1
+cloudevents~=1.10.1
```

(The removed and re-added line are textually identical, so the change is presumably whitespace-only, e.g. a trailing newline.)
Lines changed: 9 additions & 0 deletions — new coverage configuration (content matches a coverage.py config file):

```ini
[run]
omit =
    */test*/*
    */dataset_service/*
    */helpers/*

[report]
exclude_lines =
    if __name__ == .__main__.:
```
Lines changed: 5 additions & 0 deletions — local environment template (loaded as `.env.rename_me` by `main_local_debug.py` below):

```
# Environment variables for the feed sync process function to run locally. Delete this line after renaming the file.
FEEDS_DATABASE_URL=postgresql://postgres:postgres@localhost:54320/MobilityDatabase
PROJECT_ID=mobility-feeds-dev
PUBSUB_TOPIC_NAME=my-topic
DATASET_BATCH_TOPIC_NAME=dataset_batch_topic_{env}_
```
Lines changed: 107 additions & 0 deletions — the function's `README.md`:

# TLD Feed Sync Process

Subscribed to the topic set in the `feed-sync-dispatcher` function, `feed-sync-process` is triggered for each message published. It handles the processing of feed updates, ensuring data consistency and integrity. The function performs the following operations (sketched in code after this list):

1. **Feed Status Check**: Verifies the current state of the feed in the database using `external_id` and `source`.
2. **URL Validation**: Checks whether the feed URL already exists in the database.
3. **Feed Processing**: Based on the current state:
   - If no existing feed is found, creates a new feed entry
   - If a feed exists with a different URL, creates a new feed and deprecates the old one
   - If a feed exists with the same URL, no action is taken
4. **Batch Processing Trigger**: For non-authenticated feeds, publishes events to the dataset batch topic for further processing.

The function maintains feed history through the `redirectingid` table and ensures proper status tracking with 'active' and 'deprecated' states.
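A minimal sketch of this decision flow, assuming hypothetical helper names (`get_current_feed`, `create_new_feed`, `deprecate_feed`, `publish_to_batch_topic`); the real implementation lives in `src/main.py` and works through the project's SQLAlchemy models:

```python
# Sketch only: the helper names below are illustrative, not the actual API.

def process_feed_sync(payload: dict) -> None:
    # 1. Feed status check by (external_id, source).
    current = get_current_feed(payload["external_id"], payload["source"])

    if current is None:
        # 3a. No existing feed: create a new entry.
        feed = create_new_feed(payload)
    elif current.producer_url != payload["feed_url"]:
        # 2 + 3b. URL changed: create a new feed, deprecate the old one,
        # and record the old->new mapping in `redirectingid`.
        feed = create_new_feed(payload)
        deprecate_feed(old=current, redirects_to=feed)
    else:
        # 3c. Same URL: nothing to do.
        return

    # 4. Non-authenticated feeds trigger dataset batch processing.
    if payload.get("auth_info_url") is None:
        publish_to_batch_topic(feed)
```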
# Message Format

The function expects a Pub/Sub message with the following format:

```json
{
    "message": {
        "data": {
            "external_id": "feed-identifier",
            "feed_id": "unique-feed-id",
            "feed_url": "http://example.com/feed",
            "execution_id": "execution-identifier",
            "spec": "gtfs",
            "auth_info_url": null,
            "auth_param_name": null,
            "type": null,
            "operator_name": "Transit Agency Name",
            "country": "Country Name",
            "state_province": "State/Province",
            "city_name": "City Name",
            "source": "TLD",
            "payload_type": "new|update"
        }
    }
}
```
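In transit, the `data` field arrives base64-encoded inside the Pub/Sub envelope (as the test event built in `main_local_debug.py` below illustrates); a minimal sketch of unpacking it:

```python
import base64
import json

def parse_feed_payload(event_data: dict) -> dict:
    """Decode the base64-encoded Pub/Sub message into the payload dict above."""
    raw = event_data["message"]["data"]
    return json.loads(base64.b64decode(raw).decode("utf-8"))
```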
# Function Configuration

The function is configured using the following environment variables:

- `PROJECT_ID`: The Google Cloud project ID
- `DATASET_BATCH_TOPIC_NAME`: The name of the topic for batch processing triggers
- `FEEDS_DATABASE_URL`: The URL of the feeds database
- `ENV`: [Optional] Environment identifier (e.g., 'dev', 'prod')
# Database Schema

The function interacts with the following tables:

1. `feed`: Stores feed information
   - Contains fields like id, data_type, feed_name, producer_url, etc.
   - Tracks feed status ('active' or 'deprecated')
   - Uses CURRENT_TIMESTAMP for created_at
2. `externalid`: Maps external identifiers to feed IDs
   - Links external_id and source to feed entries
   - Maintains source tracking
3. `redirectingid`: Tracks feed updates
   - Maps old feed IDs to new ones
   - Maintains update history
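For orientation, a minimal SQLAlchemy sketch of how these three tables could relate; column names beyond those described above are assumptions, and the real models are generated in `database_gen`:

```python
from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Feed(Base):
    __tablename__ = "feed"
    id = Column(String, primary_key=True)
    producer_url = Column(String)
    status = Column(String)  # 'active' or 'deprecated'

class ExternalId(Base):
    __tablename__ = "externalid"
    feed_id = Column(String, ForeignKey("feed.id"), primary_key=True)
    associated_id = Column(String, primary_key=True)  # external identifier (name assumed)
    source = Column(String)  # e.g. 'TLD'

class RedirectingId(Base):
    __tablename__ = "redirectingid"
    source_id = Column(String, ForeignKey("feed.id"), primary_key=True)  # old feed (name assumed)
    target_id = Column(String, ForeignKey("feed.id"), primary_key=True)  # new feed (name assumed)
```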
# Local development

The local development of this function follows the same steps as the other functions.

To install the Google Pub/Sub emulator, refer to the [README.md](../README.md) file for more information.

## Python requirements

- Install the requirements
```bash
pip install -r ./functions-python/feed_sync_process_transitland/requirements.txt
```

## Test locally with Google Cloud Emulators

- Execute the following command to start the Pub/Sub emulator:
```bash
gcloud beta emulators pubsub start --project=test-project --host-port='localhost:8043'
```

- Create a Pub/Sub topic in the emulator:
```bash
curl -X PUT "http://localhost:8043/v1/projects/test-project/topics/feed-sync-transitland"
```

- Start the function:
```bash
export PUBSUB_EMULATOR_HOST=localhost:8043 && ./scripts/function-python-run.sh --function_name feed_sync_process_transitland
```

- [Optional]: Create a local subscription to print published messages:
```bash
./scripts/pubsub_message_print.sh feed-sync-process-transitland
```

- Execute the function:
```bash
curl http://localhost:8080
```
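To push a test message through the emulator rather than the plain HTTP trigger, a minimal sketch using the standard `google-cloud-pubsub` client (this helper is not part of the commit; the project and topic names match the emulator commands above):

```python
import json
import os

from google.cloud import pubsub_v1

# Point the client at the emulator started above; must be set before the client is created.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8043"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("test-project", "feed-sync-transitland")

# Pub/Sub handles the base64 wrapping in the delivery envelope; publish raw JSON bytes.
payload = {"external_id": "test-feed-1", "feed_url": "https://example.com/feed", "source": "TLD"}
future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
print(f"Published message: {future.result()}")
```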
- To run/debug from your IDE, use the file `main_local_debug.py`

# Test

- Run the tests
```bash
./scripts/api-tests.sh --folder functions-python/feed_sync_process_transitland
```
Lines changed: 19 additions & 0 deletions — the function's deployment configuration (JSON):

```json
{
    "name": "feed-sync-process-transitland",
    "description": "Feed Sync process for Transitland feeds",
    "entry_point": "process_feed_event",
    "timeout": 540,
    "memory": "512Mi",
    "trigger_http": true,
    "include_folders": ["database_gen", "helpers"],
    "secret_environment_variables": [
        {
            "key": "FEEDS_DATABASE_URL"
        }
    ],
    "ingress_settings": "ALLOW_INTERNAL_AND_GCLB",
    "max_instance_request_concurrency": 20,
    "max_instance_count": 10,
    "min_instance_count": 0,
    "available_cpu": 1
}
```
Lines changed: 173 additions & 0 deletions — `main_local_debug.py` (the IDE run/debug entry point referenced in the README above):

```python
"""
Code to be able to debug locally without affecting the runtime cloud function.

Requirements:
- Google Cloud SDK installed
- Make sure to have the following environment variables set in your .env.local file:
  - PROJECT_ID
  - DATASET_BATCH_TOPIC_NAME
  - FEEDS_DATABASE_URL
- Local database in running state

Usage:
- python feed_sync_process_transitland/main_local_debug.py
"""

import base64
import json
import os
from unittest.mock import MagicMock, patch
import logging
import sys

import pytest
from dotenv import load_dotenv

# Configure local logging first
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    stream=sys.stdout,
)

logger = logging.getLogger("feed_processor")


# Mock the Google Cloud Logger
class MockLogger:
    """Mock logger class"""

    @staticmethod
    def init_logger():
        return MagicMock()

    def __init__(self, name):
        self.name = name

    def get_logger(self):
        return logger

    def addFilter(self, filter):
        pass


with patch("helpers.logger.Logger", MockLogger):
    from feed_sync_process_transitland.src.main import process_feed_event

# Load environment variables
load_dotenv(dotenv_path=".env.rename_me")


class CloudEvent:
    """Cloud Event data structure."""

    def __init__(self, attributes: dict, data: dict):
        self.attributes = attributes
        self.data = data


@pytest.fixture
def mock_pubsub():
    """Fixture to mock PubSub client"""
    with patch("google.cloud.pubsub_v1.PublisherClient") as mock_publisher:
        publisher_instance = MagicMock()

        def mock_topic_path(project_id, topic_id):
            return f"projects/{project_id}/topics/{topic_id}"

        def mock_publish(topic_path, data):
            logger.info(
                f"[LOCAL DEBUG] Would publish to {topic_path}: {data.decode('utf-8')}"
            )
            future = MagicMock()
            future.result.return_value = "message_id"
            return future

        publisher_instance.topic_path.side_effect = mock_topic_path
        publisher_instance.publish.side_effect = mock_publish
        mock_publisher.return_value = publisher_instance

        yield mock_publisher


def process_event_safely(cloud_event, description=""):
    """Process event with error handling."""
    try:
        logger.info(f"\nProcessing {description}:")
        logger.info("-" * 50)
        result = process_feed_event(cloud_event)
        logger.info(f"Process result: {result}")
        return True
    except Exception as e:
        logger.error(f"Error processing {description}: {str(e)}")
        return False


def main():
    """Main function to run local debug tests"""
    logger.info("Starting local debug session...")

    # Define test event data
    test_payload = {
        "external_id": "test-feed-1",
        "feed_id": "feed1",
        "feed_url": "https://example.com/test-feed-2",
        "execution_id": "local-debug-123",
        "spec": "gtfs",
        "auth_info_url": None,
        "auth_param_name": None,
        "type": None,
        "operator_name": "Test Operator",
        "country": "USA",
        "state_province": "CA",
        "city_name": "Test City",
        "source": "TLD",
        "payload_type": "new",
    }

    # Create cloud event
    cloud_event = CloudEvent(
        attributes={
            "type": "com.google.cloud.pubsub.topic.publish",
            "source": f"//pubsub.googleapis.com/projects/{os.getenv('PROJECT_ID')}/topics/test-topic",
        },
        data={
            "message": {
                "data": base64.b64encode(
                    json.dumps(test_payload).encode("utf-8")
                ).decode("utf-8")
            }
        },
    )

    # Set up mocks
    with patch(
        "google.cloud.pubsub_v1.PublisherClient", new_callable=MagicMock
    ) as mock_publisher, patch("google.cloud.logging.Client", MagicMock()):
        publisher_instance = MagicMock()

        def mock_topic_path(project_id, topic_id):
            return f"projects/{project_id}/topics/{topic_id}"

        def mock_publish(topic_path, data):
            logger.info(
                f"[LOCAL DEBUG] Would publish to {topic_path}: {data.decode('utf-8')}"
            )
            future = MagicMock()
            future.result.return_value = "message_id"
            return future

        publisher_instance.topic_path.side_effect = mock_topic_path
        publisher_instance.publish.side_effect = mock_publish
        mock_publisher.return_value = publisher_instance

        # Process test event
        process_event_safely(cloud_event, "test feed event")

    logger.info("Local debug session completed.")


if __name__ == "__main__":
    main()
```
Lines changed: 23 additions & 0 deletions — the function's `requirements.txt` (installed in the README step above):

```
# Common packages
functions-framework==3.*
google-cloud-logging
psycopg2-binary==2.9.6
aiohttp~=3.10.5
asyncio~=3.4.3
urllib3~=2.2.2
requests~=2.32.3
attrs~=23.1.0
pluggy~=1.3.0
certifi~=2024.8.30

# SQL Alchemy and Geo Alchemy
SQLAlchemy==2.0.23
geoalchemy2==0.14.7

# Google specific packages for this function
google-cloud-pubsub
cloudevents~=1.10.1

# Additional packages for this function
pandas
pycountry
```
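The commit message mentions a `get_country_code` helper (moved into `helpers`) that guards against empty country names; a minimal sketch of such a lookup with `pycountry`, whose exact signature and location are assumptions:

```python
from typing import Optional

import pycountry

def get_country_code(country_name: str) -> Optional[str]:
    """Return the ISO 3166-1 alpha-2 code for a country name, or None."""
    if not country_name or not country_name.strip():
        # Empty or blank names cannot be resolved.
        return None
    try:
        return pycountry.countries.lookup(country_name.strip()).alpha_2
    except LookupError:
        return None

# e.g. get_country_code("Canada") -> "CA"; get_country_code("") -> None
```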
Lines changed: 2 additions & 0 deletions — development/test requirements:

```
Faker
pytest~=7.4.3
```

functions-python/feed_sync_process_transitland/src/__init__.py

Whitespace-only changes.
