AGENTS.md - AI Agent Development Guide for tap-restcountries

This document provides guidance for AI coding agents and developers working on this Singer tap.

Project Overview

  • Project Type: Singer Tap
  • Source: RestCountries
  • Stream Type: REST
  • Authentication: Custom or N/A
  • Framework: Meltano Singer SDK

Architecture

This tap follows the Singer specification and uses the Meltano Singer SDK to extract data from RestCountries.

Key Components

  1. Tap Class (tap_restcountries/tap.py): Main entry point, defines streams and configuration
  2. Client (tap_restcountries/client.py): Handles API communication and authentication
  3. Streams (tap_restcountries/streams.py): Define data streams and their schemas

Development Guidelines for AI Agents

Understanding Singer Concepts

Before making changes, ensure you understand these Singer concepts:

  • Streams: Individual data endpoints (e.g., users, orders, transactions)
  • State: Tracks incremental sync progress using bookmarks
  • Catalog: Metadata about available streams and their schemas
  • Records: Individual data items emitted by the tap
  • Schemas: JSON Schema definitions for stream data
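
To make these concepts concrete, the sketch below prints the three Singer message types a tap writes to stdout: SCHEMA, RECORD, and STATE. The stream name `countries`, the field `cca3`, and the `bookmarks` layout of the state value are illustrative assumptions, not taken from this tap; in practice the SDK emits these messages for you.

```python
import json


def singer_messages(stream, schema, key_properties, records, bookmark):
    """Yield SCHEMA, RECORD, and STATE messages as JSON lines."""
    # A SCHEMA message describes the stream before any records are sent.
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    })
    # One RECORD message per data item.
    for record in records:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": record})
    # A STATE message carries bookmarks; the layout of "value" is an assumption.
    yield json.dumps({"type": "STATE", "value": {"bookmarks": {stream: bookmark}}})


lines = list(singer_messages(
    stream="countries",  # hypothetical stream name
    schema={"type": "object", "properties": {"cca3": {"type": "string"}}},
    key_properties=["cca3"],
    records=[{"cca3": "FRA"}, {"cca3": "JPN"}],
    bookmark={},
))
for line in lines:
    print(line)
```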

Common Tasks

Adding a New Stream

  1. Define stream class in tap_restcountries/streams.py
  2. Set name, path, primary_keys, and replication_key (set this to None if not applicable)
  3. Define schema using PropertiesList or JSON Schema
  4. Register stream in the tap's discover_streams() method

Example:

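from singer_sdk.typing import DateTimeType, PropertiesList, Property, StringType

# Assumes the base stream class is defined in client.py
from tap_restcountries.client import RestCountriesStream

class MyNewStream(RestCountriesStream):
    """Example stream; replace the name, path, and fields with real values."""

    name = "my_new_stream"
    path = "/api/v1/my_resource"
    primary_keys = ["id"]
    replication_key = "updated_at"  # set to None if not applicable

    schema = PropertiesList(
        Property("id", StringType, required=True),
        Property("name", StringType),
        Property("updated_at", DateTimeType),
    ).to_dict()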
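
Step 3 also allows a plain JSON Schema. As a rough sketch, the same three fields could be written as the dictionary below. The exact output of `to_dict()` may differ between SDK versions, so treat this as an approximation rather than the SDK's literal result.

```python
# Hand-written JSON Schema for the example stream (approximate equivalent).
my_new_stream_schema = {
    "type": "object",
    "properties": {
        "id": {"type": ["string"]},
        # Optional fields also accept null.
        "name": {"type": ["string", "null"]},
        "updated_at": {"type": ["string", "null"], "format": "date-time"},
    },
    "required": ["id"],
}

# Fields that are not required; useful when reviewing a schema change.
optional_fields = [
    name
    for name in my_new_stream_schema["properties"]
    if name not in my_new_stream_schema["required"]
]
print(optional_fields)
```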

Modifying Authentication

Handling Pagination

The SDK provides built-in pagination classes. Use these instead of overriding get_next_page_token() directly.

Built-in Paginator Classes:

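  1. SimpleHeaderPaginator: For APIs that return the next-page token in a response header

    from singer_sdk.pagination import SimpleHeaderPaginator
    
    class MyStream(RestCountriesStream):
        def get_new_paginator(self):
            # "X-Next-Page" is a placeholder; use the header your API actually sends
            return SimpleHeaderPaginator("X-Next-Page")
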
  2. HeaderLinkPaginator: For APIs with Link: <url>; rel="next" headers

    from singer_sdk.pagination import HeaderLinkPaginator
    
    class MyStream(RestCountriesStream):
        def get_new_paginator(self):
            return HeaderLinkPaginator()

  3. JSONPathPaginator: For cursor/token in response body

    from singer_sdk.pagination import JSONPathPaginator
    
    class MyStream(RestCountriesStream):
        def get_new_paginator(self):
            return JSONPathPaginator("$.pagination.next_token")

  4. SinglePagePaginator: For non-paginated endpoints

    from singer_sdk.pagination import SinglePagePaginator
    
    class MyStream(RestCountriesStream):
        def get_new_paginator(self):
            return SinglePagePaginator()

Creating Custom Paginators:

For complex pagination logic, create a custom paginator class:

from singer_sdk.pagination import BasePageNumberPaginator

class MyCustomPaginator(BasePageNumberPaginator):
    """Page-number paginator that stops when the API reports no more pages."""

    def has_more(self, response):
        """Check if there are more pages."""
        # Assumes the response body carries a boolean "has_more" field
        return response.json().get("has_more", False)

    # get_next() is inherited and returns the current page number + 1.
    # A get_next_url() hook only applies to BaseHATEOASPaginator subclasses.

# Use in stream
class MyStream(RestCountriesStream):
    def get_new_paginator(self):
        return MyCustomPaginator(start_value=1)

Common Pagination Patterns:

  • Offset-based: Extend BaseOffsetPaginator
  • Page-based: Extend BasePageNumberPaginator
  • Cursor-based: Extend BaseAPIPaginator with custom logic
  • HATEOAS/HAL: Extend BaseHATEOASPaginator and implement get_next_url(), or use JSONPathPaginator when the body carries a plain token

Only override get_next_page_token() as a last resort for very simple cases.
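
To show what a page-number paginator does conceptually, here is a minimal standalone model of the request loop. It does not use the SDK: `fetch_page` is a hypothetical stand-in for an API call, and the `has_more` flag in its response is an assumption about the API.

```python
def fetch_page(page, page_size=2, total=5):
    """Hypothetical API call: returns one page of items plus a has_more flag."""
    start = (page - 1) * page_size
    items = list(range(start, min(start + page_size, total)))
    return {"items": items, "has_more": start + page_size < total}


def iterate_pages(start_value=1):
    """Request pages until the response says there are no more."""
    page = start_value
    while True:
        body = fetch_page(page)
        yield from body["items"]
        # Mirrors a has_more() check: stop once the API reports the last page.
        if not body.get("has_more", False):
            break
        page += 1  # mirrors a page-number paginator advancing by one


all_items = list(iterate_pages())
print(all_items)
```

Records are yielded page by page, so the full dataset is never held in memory at once.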

State and Incremental Sync

  • Set replication_key to enable incremental sync (e.g., "updated_at")
  • Call get_starting_timestamp(context) to read the starting point: the bookmark from state, or the start_date setting on the first run
  • State is managed automatically by the SDK
  • Read the current stream state via get_context_state(context)
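
The bookmark idea behind incremental sync can be sketched without the SDK. The function below is a simplified model, not the SDK's implementation: it emits only records newer than the stored bookmark and then advances the bookmark. The state key `replication_key_value` and the strict "newer than" comparison are assumptions made for illustration.

```python
from datetime import datetime


def sync_incremental(records, state, replication_key="updated_at"):
    """Return (records newer than the bookmark, updated state)."""
    bookmark = state.get("replication_key_value")
    emitted = []
    # Process in replication-key order so the bookmark only moves forward.
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r[replication_key]))
    for record in ordered:
        value = record[replication_key]
        if bookmark is None or datetime.fromisoformat(value) > datetime.fromisoformat(bookmark):
            emitted.append(record)
            bookmark = value
    return emitted, {"replication_key_value": bookmark}


records = [
    {"id": "1", "updated_at": "2024-01-01T00:00:00+00:00"},
    {"id": "2", "updated_at": "2024-02-01T00:00:00+00:00"},
]
first_run, state = sync_incremental(records, {})

# A later run sees one new record and skips the two already synced.
records.append({"id": "3", "updated_at": "2024-03-01T00:00:00+00:00"})
second_run, state = sync_incremental(records, state)
```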

Schema Evolution

  • Use flexible schemas during development
  • Add new properties without breaking changes
  • Consider making fields optional when unsure
  • Use th.Property("field", th.StringType) for basic types (th is singer_sdk.typing)
  • Nest objects with th.ObjectType(...)

Testing

Run tests to verify your changes:

# Install dependencies
uv sync

# Run all tests
uv run pytest

# Run specific test
uv run pytest tests/test_core.py -k test_name

Configuration

Configuration properties are defined in the tap class via config_jsonschema. For each property, specify:

  • Whether it is required or optional (required=True)
  • Whether it is secret (passwords, tokens); mark sensitive data with secret=True
  • Its default value, if any (default=...)

Example configuration schema:

from singer_sdk import typing as th

config_jsonschema = th.PropertiesList(
    th.Property("api_url", th.StringType, required=True),
    th.Property("api_key", th.StringType, required=True, secret=True),
    th.Property("start_date", th.DateTimeType),
    th.Property("user_agent", th.StringType, default="tap-restcountries"),
).to_dict()
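
How required properties and defaults interact can be sketched in plain Python. This is a conceptual model only: the SDK performs its own validation, and the `CONFIG_SCHEMA` dictionary and `resolve_config` helper below are hypothetical names with illustrative values.

```python
# Simplified stand-in for a config schema (illustrative values).
CONFIG_SCHEMA = {
    "properties": {
        "api_url": {"type": "string"},
        "api_key": {"type": "string", "secret": True},
        "start_date": {"type": "string", "format": "date-time"},
        "user_agent": {"type": "string", "default": "tap-restcountries"},
    },
    "required": ["api_url", "api_key"],
}


def resolve_config(raw, schema=CONFIG_SCHEMA):
    """Reject configs missing required settings, then fill in defaults."""
    missing = [name for name in schema["required"] if name not in raw]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    # Start from the declared defaults, then let user-supplied values win.
    resolved = {
        name: spec["default"]
        for name, spec in schema["properties"].items()
        if "default" in spec
    }
    resolved.update(raw)
    return resolved


config = resolve_config({"api_url": "https://api.example.com", "api_key": "placeholder"})
```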

Example test with config:

tap-restcountries --config config.json --discover
tap-restcountries --config config.json --catalog catalog.json

Keeping meltano.yml and Tap Settings in Sync

When this tap is used with Meltano, the settings defined in meltano.yml must stay in sync with the config_jsonschema in the tap class. Configuration drift between these two sources causes confusion and runtime errors.

When to sync:

  • Adding new configuration properties to the tap
  • Removing or renaming existing properties
  • Changing property types, defaults, or descriptions
  • Marking properties as required or secret

How to sync:

  1. Update config_jsonschema in tap_restcountries/tap.py
  2. Update the corresponding settings block in meltano.yml
  3. Update .env.example with the new environment variable

Example - adding a new batch_size setting:

# tap_restcountries/tap.py
config_jsonschema = th.PropertiesList(
    th.Property("api_url", th.StringType, required=True),
    th.Property("api_key", th.StringType, required=True, secret=True),
    th.Property("batch_size", th.IntegerType, default=100),  # New setting
).to_dict()

# meltano.yml
plugins:
  extractors:
    - name: tap-restcountries
      settings:
        - name: api_url
          kind: string
        - name: api_key
          kind: string
          sensitive: true
        - name: batch_size  # New setting
          kind: integer
          value: 100

# .env.example
TAP_RESTCOUNTRIES_API_URL=https://api.example.com
TAP_RESTCOUNTRIES_API_KEY=your_api_key_here
TAP_RESTCOUNTRIES_BATCH_SIZE=100  # New setting
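
The variable names in .env.example follow a visible pattern: plugin name plus setting name, upper-cased, with hyphens turned into underscores. A small helper can derive them. The function name `meltano_env_var` is hypothetical, and the rule is inferred from the example above rather than from a full specification; nested setting names are not handled.

```python
def meltano_env_var(plugin_name, setting_name):
    """Derive the environment variable name for a plugin setting."""
    # Rule inferred from the .env.example above: join, swap hyphens, upper-case.
    return f"{plugin_name}_{setting_name}".replace("-", "_").upper()


print(meltano_env_var("tap-restcountries", "batch_size"))
```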

Setting kind mappings:

| Python Type  | Meltano Kind |
|--------------|--------------|
| StringType   | string       |
| IntegerType  | integer      |
| BooleanType  | boolean      |
| NumberType   | number       |
| DateTimeType | date_iso8601 |
| ArrayType    | array        |
| ObjectType   | object       |

Any properties with secret=True should be marked with sensitive: true in meltano.yml.
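
Drift between the two sources can be caught mechanically. The sketch below compares a simplified properties dictionary against a list of Meltano settings, checking names, kinds (per the mapping above), and the secret/sensitive flag. Both inputs are plain Python literals here. Loading them from tap.py and meltano.yml is left out, and the helper names are hypothetical.

```python
KIND_BY_TYPE = {
    "string": "string", "integer": "integer", "boolean": "boolean",
    "number": "number", "array": "array", "object": "object",
}


def expected_kind(prop):
    """Map a simplified JSON Schema property to the Meltano kind listed above."""
    if prop.get("format") == "date-time":
        return "date_iso8601"
    return KIND_BY_TYPE[prop["type"]]


def find_drift(schema_properties, meltano_settings):
    """Return a list of human-readable mismatches between the two sources."""
    problems = []
    settings = {s["name"]: s for s in meltano_settings}
    for name, prop in schema_properties.items():
        setting = settings.get(name)
        if setting is None:
            problems.append(f"{name}: missing from meltano.yml")
            continue
        if setting.get("kind", "string") != expected_kind(prop):
            problems.append(f"{name}: kind should be {expected_kind(prop)}")
        if bool(prop.get("secret")) != bool(setting.get("sensitive")):
            problems.append(f"{name}: secret/sensitive flags differ")
    for name in sorted(settings.keys() - schema_properties.keys()):
        problems.append(f"{name}: not defined in config_jsonschema")
    return problems


properties = {
    "api_url": {"type": "string"},
    "api_key": {"type": "string", "secret": True},
    "batch_size": {"type": "integer", "default": 100},
}
settings = [
    {"name": "api_url", "kind": "string"},
    {"name": "api_key", "kind": "string", "sensitive": True},
]
drift = find_drift(properties, settings)  # batch_size was not added to meltano.yml
```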

Best practices:

  • Always update all three files (tap.py, meltano.yml, .env.example) in the same commit
  • Use the same default values in all locations
  • Keep descriptions consistent between code docstrings and meltano.yml description fields

Note: This guidance is consistent with target and mapper templates in the Singer SDK. See the SDK documentation for canonical reference.

Common Pitfalls

  1. Rate Limiting: Implement backoff using RESTStream built-in retry logic
  2. Large Responses: Use pagination; don't load the entire dataset into memory
  3. Schema Mismatches: Validate data matches schema, handle null values
  4. State Management: Don't modify state directly, use SDK methods
  5. Timezone Handling: Use UTC, parse ISO 8601 datetime strings
  6. Error Handling: Let SDK handle retries, log warnings for data issues
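
For the timezone pitfall, a small normalization helper illustrates the idea. It parses an ISO 8601 string and returns it in UTC. Treating timestamps without an offset as UTC is an assumption that should be checked against the API's documentation, and `to_utc_iso` is a hypothetical name.

```python
from datetime import datetime, timezone


def to_utc_iso(value):
    """Parse an ISO 8601 timestamp and return it as an ISO string in UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(to_utc_iso("2024-05-01T12:00:00+02:00"))
```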

SDK Resources

Best Practices

  1. Logging: Use self.logger for structured logging
  2. Validation: Validate API responses before emitting records
  3. Documentation: Update README with new streams and config options
  4. Type Hints: Add type hints to improve code clarity
  5. Testing: Write tests for new streams and edge cases
  6. Performance: Profile slow streams, optimize API calls
  7. Error Messages: Provide clear, actionable error messages
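
The logging and validation practices can be combined in one small check. The sketch below uses the standard logging module as a stand-in for self.logger; `validate_record` and its `required_keys` parameter are hypothetical. Inside a stream, similar logic would typically live in post_process(), where returning None skips the record.

```python
import logging

logger = logging.getLogger("tap-restcountries")


def validate_record(record, required_keys=("id",)):
    """Return the record if it has all required fields, otherwise None."""
    missing = [key for key in required_keys if record.get(key) is None]
    if missing:
        # Warn with a clear, actionable message instead of failing the sync.
        logger.warning(
            "Skipping record; missing required fields: %s", ", ".join(missing)
        )
        return None
    return record


kept = validate_record({"id": "1", "name": "France"})
dropped = validate_record({"name": "no id here"})
```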

File Structure

tap-restcountries/
├── tap_restcountries/
│   ├── __init__.py
│   ├── tap.py          # Main tap class
│   ├── client.py       # API client
│   └── streams.py      # Stream definitions
├── tests/
│   ├── __init__.py
│   └── test_core.py
├── config.json         # Example configuration
├── pyproject.toml      # Dependencies and metadata
└── README.md          # User documentation

Additional Resources

Making Changes

When implementing changes:

  1. Understand the existing code structure
  2. Follow Singer and SDK patterns
  3. Test thoroughly with real API credentials
  4. Update documentation and docstrings
  5. Ensure backward compatibility when possible
  6. Run linting and type checking

Questions?

If you're uncertain about an implementation:

  • Check SDK documentation for similar examples
  • Review other Singer taps for patterns
  • Test incrementally with small changes
  • Validate against the Singer specification