Customization Guide

This guide helps you identify the key areas to customize when adapting this Arc XP ETL Migration Starter Kit for your own CMS source.

Quick Start Checklist

  • Extract - Update adapter/extract.py to connect to your CMS API
  • Identify - Modify adapter/identify.py to recognize your content types
  • Convert (content_elements) - CRITICAL - Every organization needs to customize set_content_elements() or subclass StoryConverter
  • Convert (additional fields) - Add ANS field methods for fields your organization needs
  • Convert (custom converters) - Create additional converter classes if needed
  • Register Converters - Add custom converters to CONVERTER_CLASSES in adapter_temporal/activities.py
  • Inventory - Adjust adapter/inventory.py schema if needed
  • Source System Name - Update "fake-news" references to your CMS name
  • Environment Variables - Set ORG, ENV, WEBSITE, BEARER_TOKEN
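A minimal sketch of reading the four variables from the last checklist item, assuming you want the run to fail early with a single message that lists every missing value. `MigrationSettings` and `load_settings` are hypothetical names used only for illustration; they are not part of the starter kit.

```python
import os
from dataclasses import dataclass

# Variable names taken from the checklist above
REQUIRED_VARS = ("ORG", "ENV", "WEBSITE", "BEARER_TOKEN")


@dataclass(frozen=True)
class MigrationSettings:
    """Hypothetical container for the four required variables."""
    org: str
    env: str
    website: str
    bearer_token: str


def load_settings(environ=None) -> MigrationSettings:
    """Read the required variables and report all missing ones at once."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return MigrationSettings(
        org=environ["ORG"],
        env=environ["ENV"],
        website=environ["WEBSITE"],
        bearer_token=environ["BEARER_TOKEN"],
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.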

Getting Started: Key Customization Points

Note: For detailed explanations of what each component does, see adapter/ReadMe.md. This guide focuses on how to customize each component.

Important: The points below are starting places for customization, not an exhaustive list. You will likely need to customize additional methods and areas as you work through your specific CMS requirements. See the Finding Customization Points section below for methods to locate all customization opportunities in the codebase.

1. Extract Function (adapter/extract.py)

What to customize:

  • API endpoint URL
  • Authentication method (if required)
  • Request parameters or headers
  • Response parsing logic

Example:

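import requests


def extract_cms_source():
    # TODO: Replace with your CMS API endpoint
    source_url = "https://your-cms-api.com/content"

    # TODO: Add authentication if needed
    headers = {"Authorization": "Bearer YOUR_TOKEN"}

    # A timeout keeps a stalled CMS from hanging the extract step
    response = requests.get(source_url, headers=headers, timeout=30)
    # Raise on 4xx/5xx responses instead of returning an unassigned variable
    response.raise_for_status()
    return response.json()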

2. Identify Content Type (adapter/identify.py)

What to customize:

  • Content type detection logic based on your source data structure
  • Field names that identify stories, images, videos, etc.
  • Converter selection logic
  • ANS ID generation source (what unique identifier to use)

Key areas to update:

  • Story identification - Update field names that identify stories (e.g., headline or title)
  • Image identification - Update how images are detected
  • Additional content types - Add detection logic for your custom content types
  • ANS ID generation - Update set_ans_id() to use your unique identifiers

Example:

def identify_content_type(content_item: ContentItem, org: str):
    content = content_item.content
    
    # TODO: Update these field names to match your CMS structure
    # Example: if your CMS uses "title" instead of "headline"
    if content.get("title") and content.get("canonical_url"):
        content_item.arc_xp_type = ContentTypeArcXP.STORY.value
        # ... rest of logic
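The ANS ID generation bullet above can be sketched as follows. This is one possible approach, assuming the target expects a stable 26-character base32 string and that your CMS exposes a unique identifier per item; `generate_ans_id` is a hypothetical helper, and how the kit's own set_ans_id() derives its value is not shown here.

```python
import base64
import hashlib


def generate_ans_id(org: str, source_id: str) -> str:
    """Derive a stable 26-character ID from the org name and a source identifier."""
    seed = f"{org}:{source_id}".encode("utf-8")
    digest = hashlib.blake2b(seed, digest_size=16).digest()
    # 16 bytes encode to 26 base32 characters plus "=" padding, which is stripped
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

Because the ID is a pure function of its inputs, re-running the migration for the same source item yields the same ID, which lets repeated runs update an item rather than duplicate it.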

3. Transform/Convert Functions (adapter/convert_story.py)

What to customize: The methods below are a starting point, not a complete list of what you can customize in this repository. You will likely also add new methods to transform other ANS fields that your organization requires.

Priority customization points:

  • base_ans property (StoryConverter): Update source.system from "fake-news" to your CMS name
  • base_ans property (ImageConverter): Update source.system from "fake-news" to your CMS name, and customize the image URL source field (currently uses self.source_data.get("url"))
  • set_content_elements(): CRITICAL - Every organization needs to customize this. content_elements are the body of your story and contain lists, embeds and other formatting that will be unique to your organization's prior CMS.
  • set_headline(): Map your headline field name
  • set_publish(): Adjust date format parsing if needed
  • set_section(): Map your section field and add section name mapping if needed
  • set_credits(): Map author/credit fields
  • set_canonical_url(): Map URL field

Finding customization points:

  • Look for self.source_data.get("field_name") - these are where source data is accessed
  • All set_*() methods map source fields to ANS fields
  • See adapter/ReadMe.md for a complete reference of all converter methods

Example:

def set_headline(self, text: str = ""):
    # TODO: Update "headline" to match your CMS field name
    self.ans["headlines"]["basic"] = (
        text if text else self.source_data.get("title", "")  # Changed "headline" to "title"
    )

4. Image Converter (adapter/convert_image.py)

What to customize:

  • base_ans property: Update source.system from "fake-news" to your CMS name, and customize the image URL source field - the base_ans property uses self.source_data.get("url") and self.source_data["url"] to identify the image URL. You'll need to either:
    • Update these field references to match your CMS field name (e.g., self.source_data.get("image_url") or self.source_data.get("media_url")), or
    • Create a new method to extract the image URL and use that instead
  • Image metadata mapping (alt text, caption, etc.)
  • Image processing and transformation logic

5. Inventory Database (adapter/inventory.py)

What to customize:

  • Database schema (if you need additional fields)
  • Table name (currently "inventory")
  • Field names in load_inventory() method
  • Database location/path
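A sketch of extending the inventory schema with an extra column, assuming the inventory is stored in SQLite (this guide does not state the database engine, so treat that as an assumption). The column names and both helper functions are hypothetical and only illustrate the idea.

```python
import sqlite3


def ensure_inventory_table(conn: sqlite3.Connection, table: str = "inventory") -> None:
    """Create a minimal inventory table; the column set is illustrative only."""
    # Table and column names cannot be bound as parameters, so pass trusted constants only
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "source_id TEXT PRIMARY KEY, "
        "arc_xp_type TEXT, "
        "converter_name TEXT)"
    )


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str,
                          sql_type: str = "TEXT") -> None:
    """Add a column only when it is not already present, so reruns are safe."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
```

Checking for the column first means the same setup code can run against both a fresh database and one created by an earlier version of your schema.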

6. Source System References

Search for: "fake-news" throughout the codebase

Files to update:

  • adapter/convert_story.py: "system": "fake-news" → your CMS name
  • adapter/convert_image.py: "system": "fake-news" → your CMS name

7. Temporal Activities (Optional - only if using Temporal adapter)

What to customize:

  • call_extract_api() - API endpoint (uses EXTRACT_API_URL environment variable)
  • Temporal server URL (uses TEMPORAL_SERVER_URL environment variable)
  • Add additional Temporal activities and workflows to implement extra processing, if necessary

Testing Your Customizations

  1. Start with fixtures: Create test fixtures matching your CMS data structure
  2. Test identify: Verify content type detection works
  3. Test convert: Check ANS output matches expectations
  4. Test end-to-end: Run full workflow with sample data
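The first three steps can be sketched as plain test functions. Everything here is a stand-in: `FIXTURE`, `looks_like_story`, and `convert_story` are hypothetical and exist only so the example runs on its own. In your project, import the real identify and converter code and load a fixture that matches your CMS instead.

```python
# Fixture shaped like one record from a hypothetical source CMS
FIXTURE = {
    "title": "City council approves budget",
    "canonical_url": "/news/council-budget",
    "body": "The council voted on Tuesday.",
}


def looks_like_story(content: dict) -> bool:
    """Stand-in for the story check in identify_content_type()."""
    return bool(content.get("title") and content.get("canonical_url"))


def convert_story(content: dict) -> dict:
    """Stand-in for a converter; maps three source fields into an ANS-like dict."""
    return {
        "type": "story",
        "headlines": {"basic": content.get("title", "")},
        "canonical_url": content.get("canonical_url", ""),
        "content_elements": [{"type": "text", "content": content.get("body", "")}],
    }


def test_identify_detects_story():
    assert looks_like_story(FIXTURE)
    # A record without title and canonical_url should not be treated as a story
    assert not looks_like_story({"url": "https://example.com/a.jpg"})


def test_convert_maps_headline_and_body():
    ans = convert_story(FIXTURE)
    assert ans["headlines"]["basic"] == FIXTURE["title"]
    assert ans["content_elements"][0]["content"] == FIXTURE["body"]
```

Functions named `test_*` that use bare `assert` statements are collected by pytest without extra setup, so the same file can grow as you add content types.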

Creating Custom Converters

Every organization will need to customize set_content_elements() or subclass StoryConverter to handle different content structures. Most organizations will also need to create additional custom converter classes for different content types or formats.

Why Create Custom Converters?

  • Different content formats: Your CMS may structure content differently (blocks, rich text, markdown, etc.)
  • Additional ANS fields: Your organization may need to set ANS fields not covered by the base converter
  • Content type variations: Different story types may need different processing logic

Pattern 1: Subclass StoryConverter for Content Elements

Most common customization: Override set_content_elements() to handle your content structure.

Example:

class StoryConverterCustomFormat(StoryConverter):
    """
    Custom converter for stories with structured content blocks.
    
    Every organization needs to customize content_elements handling.
    """
    
    def set_content_elements(self, story_data=None):
        """
        CUSTOMIZE: Override to handle your content structure.
        
        This method is the most commonly customized since content formats
        vary significantly between CMSs.
        """
        # Get your content data - adjust field name as needed
        data = story_data if story_data else self.source_data.get("content_blocks", [])
        
        content_elements = []
        
        # Process your structured content
        for block in data:
            if block["type"] == "paragraph":
                content_elements.append({
                    "type": "text",
                    "content": block["text"]
                })
            elif block["type"] == "heading":
                # ANS represents headings as a "header" element with a numeric level
                content_elements.append({
                    "type": "header",
                    "level": block.get("level", 1),
                    "content": block["text"]
                })
            elif block["type"] == "list":
                content_elements.append({
                    "type": "list",
                    "list_type": block.get("list_type", "unordered"),
                    "items": block["items"]
                })
            # Add more block types as needed
        
        self.ans["content_elements"] = content_elements

Pattern 2: Add New ANS Field Methods

Add methods to set additional ANS fields your organization needs:

class StoryConverterExtended(StoryConverter):
    """Converter with additional ANS fields."""
    
    def set_custom_metadata(self):
        """Set custom ANS fields specific to your organization."""
        # TODO: Map your CMS fields to ANS fields
        if self.source_data.get("custom_field"):
            # setdefault avoids a KeyError when additional_properties is not yet present
            self.ans.setdefault("additional_properties", {})["custom_field"] = \
                self.source_data["custom_field"]
    
    def build_ans(self):
        """Override build_ans to include custom methods."""
        super().build_ans()  # Call parent to set standard fields
        self.set_custom_metadata()  # Add your custom fields
        return self.ans

Pattern 3: Register Custom Converters

After creating a custom converter:

  1. Import it in adapter_temporal/activities.py:
from adapter.convert_story import StoryConverterCustomFormat
  2. Add to CONVERTER_CLASSES dictionary:
CONVERTER_CLASSES = {
    "StoryConverter": StoryConverter,
    "StoryConverterHTMLBody": StoryConverterHTMLBody,
    "StoryConverterRecipeBody": StoryConverterRecipeBody,
    "ImageConverter": ImageConverter,
    "StoryConverterCustomFormat": StoryConverterCustomFormat,  # Your custom converter
}
  3. Reference in adapter/identify.py when identifying content:
# In identify_content_type()
if content.get("custom_format_indicator"):
    content_item.converter_name = ConverterNames.CUSTOM_FORMAT.value
  4. Add to ConverterNames enum in adapter/identify.py:
class ConverterNames(str, Enum):
    STORY = "StoryConverter"
    HTML_STORY = "StoryConverterHTMLBody"
    RECEPIE = "StoryConverterRecipeBody"
    IMAGE = "ImageConverter"
    CUSTOM_FORMAT = "StoryConverterCustomFormat"  # Add your converter

Common Customization Patterns

Pattern 1: Different Field Names

If your CMS uses different field names, search and replace:

  • headline → your headline field
  • publish_date → your date field
  • section → your section field

Pattern 2: Reformatting and Concatenating Fields

You may need to reformat original data or concatenate multiple source fields to create a single ANS field.

Common scenarios:

  • Date reformatting: Convert from your CMS date format to Arc XP format
  • URL processing: Extract path from full URL, remove query parameters
  • Author parsing: Split "Name, Organization" format into separate fields
  • Headline construction: Combine multiple fields (e.g., title + subtitle or category + title)
  • Identifier generation: Create unique IDs from multiple source fields
  • Re-map section hierarchy: Map your CMS section names to Arc XP section names, especially if you're restructuring your section organization (e.g., "sports/baseball" → "/sports" or "news/local" → "/local-news")
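Several of the scenarios above can be sketched as small pure functions that your set_*() methods call. All names here are hypothetical, the source date format is an assumption about your CMS, and the section map reuses the two example mappings from the list. The sketch also assumes source timestamps are already in UTC.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Example mappings from the list above; replace with your own hierarchy
SECTION_MAP = {"sports/baseball": "/sports", "news/local": "/local-news"}


def to_utc_iso(value: str, source_format: str = "%m/%d/%Y %H:%M") -> str:
    """Parse a source timestamp (assumed to be UTC) into an ISO 8601 string."""
    parsed = datetime.strptime(value, source_format).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def url_path(url: str) -> str:
    """Keep only the path, dropping scheme, host, query string and fragment."""
    return urlsplit(url).path


def split_author(credit: str) -> dict:
    """Split "Name, Organization" into two fields; org is empty if absent."""
    name, _, org = credit.partition(",")
    return {"name": name.strip(), "org": org.strip()}


def build_headline(title: str, subtitle: str = "") -> str:
    """Combine title and subtitle, omitting the separator when there is no subtitle."""
    return f"{title}: {subtitle}" if subtitle else title


def map_section(source_section: str, default: str = "/") -> str:
    """Map a source section to its new path, ignoring case and outer slashes."""
    return SECTION_MAP.get(source_section.strip("/").lower(), default)
```

Keeping these helpers free of converter state makes them easy to unit test against real values pulled from your fixtures before wiring them into a converter.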

Pattern 3: Different Content Structure

If your content structure differs:

  • Subclass StoryConverter and override set_content_elements()
  • Update identify_content_type() to detect your structure
  • Register your new converter class

Pattern 4: Additional Content Types

To add new content types:

  1. Add to ContentTypeArcXP enum in adapter/identify.py
  2. Add detection logic in identify_content_type()
  3. Create converter class in adapter/convert_*.py
  4. Register in CONVERTER_CLASSES in adapter_temporal/activities.py
  5. Add to ConverterNames enum in adapter/identify.py

Finding Customization Points

Method 1: Search for TODO Comments

grep -r "TODO" adapter/

Method 2: Search for "fake-news" References

grep -r "fake-news" adapter/

Method 3: Look for source_data.get() Calls

These indicate where source data is accessed:

grep -r "source_data.get" adapter/

Method 4: Review Converter Methods

All set_*() methods in converter classes are customization points:

grep -r "def set_" adapter/convert_*.py
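If grep is not available, or you want all four searches in one report, the same idea can be sketched in Python. `find_customization_points` is a hypothetical helper; it does plain substring matching on the markers used by the four methods above.

```python
from pathlib import Path

# The same markers the four grep commands above search for
MARKERS = ("TODO", "fake-news", "source_data.get", "def set_")


def find_customization_points(root: str, markers=MARKERS) -> list:
    """Return (path, line_number, marker, line_text) for each marker hit in *.py files."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for marker in markers:
                if marker in line:
                    hits.append((str(path), line_number, marker, line.strip()))
    return hits
```

Calling `find_customization_points("adapter")` from the repository root returns one tuple per match, which you can print or sort by marker to plan your work.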

Extending Converters with Additional ANS Fields

Many organizations need to set additional ANS fields beyond what the base converters provide. Here's how to add them:

Adding New ANS Field Methods

  1. Create a new method in your converter class (or subclass):
class StoryConverterExtended(StoryConverter):
    """Converter with additional ANS fields."""
    
    def set_subtitle(self):
        """Set ANS subtitle field from your CMS data."""
        # TODO: Update field name to match your CMS
        subtitle = self.source_data.get("subtitle", "")
        if subtitle:
            self.ans["subheadlines"] = {"basic": subtitle}
    
    def set_distributor(self):
        """Set distributor information if your content comes from wires."""
        # TODO: Map your distributor field
        if self.source_data.get("wire_source"):
            self.ans["distributor"] = {
                "category": self.source_data["wire_source"],
                "name": self.source_data.get("wire_name", "")
            }
    
    def build_ans(self):
        """Override build_ans to include all custom methods."""
        # Call parent to set standard fields
        super().build_ans()
        # Add your custom field setters
        self.set_subtitle()
        self.set_distributor()
        return self.ans
  2. Reference the ANS documentation to understand field structure.

  3. Test with sample data to ensure fields are set correctly

Common ANS Fields to Consider Adding

  • subheadlines.basic - Subtitle
  • description.basic - Story description/abstract
  • taxonomy.tags - Content tags
  • distributor - For wire content
  • workflow - Publishing workflow status
  • promo_items.basic - For featured media in a story
  • additional_properties - Custom metadata
  • additional_properties.expiration_date - Only for images, and usually those that come in via a wires source
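A sketch of setters for three of the fields listed above, written as standalone functions on an ANS dict so the example runs by itself. The field shapes (tags as objects with a "text" key, the promo item as a reference to an image ID) reflect a reading of the ANS schema and should be verified against the ANS documentation; the function names are hypothetical.

```python
def set_description(ans: dict, summary: str) -> None:
    """Set description.basic when the source provides a summary."""
    if summary:
        ans["description"] = {"basic": summary}


def set_tags(ans: dict, tags) -> None:
    """Set taxonomy.tags from a list of strings, skipping blank entries."""
    cleaned = [tag.strip() for tag in tags if tag and tag.strip()]
    if cleaned:
        # setdefault preserves any other taxonomy keys that were set earlier
        ans.setdefault("taxonomy", {})["tags"] = [{"text": tag} for tag in cleaned]


def set_promo_image(ans: dict, image_ans_id: str) -> None:
    """Point promo_items.basic at an already migrated image by its ANS ID."""
    if image_ans_id:
        ans.setdefault("promo_items", {})["basic"] = {
            "type": "reference",
            "referent": {"type": "image", "id": image_ans_id},
        }
```

In a converter subclass, the same logic would live in set_*() methods that read from self.source_data and write to self.ans, called from build_ans() after the parent has set the standard fields.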

Recommended Customization Order

  1. Extract - Get your data flowing first
  2. Identify - Ensure content types are detected correctly
  3. Convert (basic) - Get one content type working end-to-end
  4. Convert (content_elements) - CRITICAL - Customize content_elements handling
  5. Convert (additional fields) - Add ANS fields your organization needs
  6. Convert (advanced) - Add remaining content types and edge cases
  7. Polish - Update naming, logging, error messages

Getting Help

  • Review fixture files in /fixtures to understand expected data structures
  • Check adapter/ReadMe.md for detailed explanations of each component
  • Examine test files in adapter_temporal/tests/ for usage examples