Customization Guide

This guide helps you identify the key areas to customize when adapting this Arc XP ETL Migration Starter Kit for your own CMS source.

Quick Start Checklist

Extract - Update adapter/extract.py to connect to your CMS API
Identify - Modify adapter/identify.py to recognize your content types
Convert (content_elements) - CRITICAL - Every organization needs to customize set_content_elements() or subclass StoryConverter
Convert (additional fields) - Add ANS field methods for fields your organization needs
Convert (custom converters) - Create additional converter classes if needed
Register Converters - Add custom converters to CONVERTER_CLASSES in adapter_temporal/activities.py
Inventory - Adjust adapter/inventory.py schema if needed
Source System Name - Update "fake-news" references to your CMS name
Environment Variables - Set ORG, ENV, WEBSITE, BEARER_TOKEN
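
A missing variable is easier to diagnose before the pipeline starts than halfway through a run. The following is a minimal sketch of such a check; the helper name missing_env_vars is hypothetical and is not part of the starter kit.

```python
import os

# Variable names taken from the checklist above.
REQUIRED_ENV_VARS = ("ORG", "ENV", "WEBSITE", "BEARER_TOKEN")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    # Accept a mapping so the check can be tested without touching os.environ.
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling missing_env_vars() with no argument inspects the real environment; a non-empty result lists what still needs to be set.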

Getting Started: Key Customization Points

Note: For detailed explanations of what each component does, see adapter/ReadMe.md. This guide focuses on how to customize each component.

Important: The points below are starting places for customization, not an exhaustive list. You will likely need to customize additional methods and areas as you work through your specific CMS requirements. See the Finding Customization Points section below for methods to locate all customization opportunities in the codebase.

1. Extract Function (`adapter/extract.py`)

What to customize:

API endpoint URL
Authentication method (if required)
Request parameters or headers
Response parsing logic

Example:

import requests


def extract_cms_source():
    # TODO: Replace with your CMS API endpoint
    source_url = "https://your-cms-api.com/content"

    # TODO: Add authentication if needed
    headers = {"Authorization": "Bearer YOUR_TOKEN"}

    # A timeout keeps a stalled CMS from hanging the extract step
    req = requests.get(source_url, headers=headers, timeout=30)
    # Raise on 4xx/5xx responses instead of returning an undefined variable
    req.raise_for_status()
    return req.json()

2. Identify Content Type (`adapter/identify.py`)

What to customize:

Content type detection logic based on your source data structure
Field names that identify stories, images, videos, etc.
Converter selection logic
ANS ID generation source (what unique identifier to use)

Key areas to update:

Story identification - Update field names that identify stories (e.g., headline → title)
Image identification - Update how images are detected
Additional content types - Add detection logic for your custom content types
ANS ID generation - Update set_ans_id() to use your unique identifiers

Example:

def identify_content_type(content_item: ContentItem, org: str):
    content = content_item.content
    
    # TODO: Update these field names to match your CMS structure
    # Example: if your CMS uses "title" instead of "headline"
    if content.get("title") and content.get("canonical_url"):
        content_item.arc_xp_type = ContentTypeArcXP.STORY.value
        # ... rest of logic
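
When updating set_ans_id(), the main requirement is that the same source record always yields the same ID, so re-running the migration updates content rather than duplicating it. The sketch below shows one way to derive such an ID by hashing. The function name make_ans_id and the hashing scheme are assumptions for illustration; check the kit's own set_ans_id() before replacing anything.

```python
import base64
import hashlib


def make_ans_id(source_id: str, org: str) -> str:
    """Derive a deterministic ID from the org name and a source identifier."""
    # Including the org keeps IDs distinct across organizations that
    # happen to share the same source identifiers.
    payload = f"{org}-{source_id}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=16).digest()
    # 16 bytes encode to 26 base32 characters once "=" padding is stripped.
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

Pick a source field that never changes for a given record (a database key or GUID, for example) rather than a headline or URL that editors may revise.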

3. Transform/Convert Functions (`adapter/convert_story.py`)

What to customize: The methods below are a starting selection, not a complete list of what you may customize in this repository. You will likely also add new methods to transform other ANS fields that your organization requires.

Priority customization points:

base_ans property (StoryConverter): Update source.system from "fake-news" to your CMS name
base_ans property (ImageConverter): Update source.system from "fake-news" to your CMS name, and customize the image URL source field (currently uses self.source_data.get("url"))
set_content_elements(): CRITICAL - Every organization needs to customize this. content_elements are the body of your story and contain lists, embeds and other formatting that will be unique to your organization's prior CMS.
set_headline(): Map your headline field name
set_publish(): Adjust date format parsing if needed
set_section(): Map your section field and add section name mapping if needed
set_credits(): Map author/credit fields
set_canonical_url(): Map URL field

Finding customization points:

Look for self.source_data.get("field_name") - these are where source data is accessed
All set_*() methods map source fields to ANS fields
See adapter/ReadMe.md for a complete reference of all converter methods

Example:

def set_headline(self, text: str = ""):
    # TODO: Update "headline" to match your CMS field name
    self.ans["headlines"]["basic"] = (
        text if text else self.source_data.get("title", "")  # Changed "headline" to "title"
    )

4. Image Converter (`adapter/convert_image.py`)

What to customize:

base_ans property: Update source.system from "fake-news" to your CMS name, and customize the image URL source field - the base_ans property uses self.source_data.get("url") and self.source_data["url"] to identify the image URL. You'll need to either:
- Update these field references to match your CMS field name (e.g., self.source_data.get("image_url") or self.source_data.get("media_url")), or
- Create a new method to extract the image URL and use that instead
Image metadata mapping (alt text, caption, etc.)
Image processing and transformation logic
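
If you take the second option and extract the image URL in a separate method, it might look like the sketch below. The function name get_image_url and the candidate field names are hypothetical; substitute the fields your CMS actually provides.

```python
from typing import Optional

# Hypothetical field names, checked in order of preference.
IMAGE_URL_FIELDS = ("url", "image_url", "media_url")


def get_image_url(source_data: dict) -> Optional[str]:
    """Return the first non-empty image URL among the candidate fields."""
    for field in IMAGE_URL_FIELDS:
        value = source_data.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    # No usable URL found; the caller decides whether to skip or fail.
    return None
```

Returning None instead of raising lets the converter log and skip records without an image URL rather than aborting the whole batch.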

5. Inventory Database (`adapter/inventory.py`)

What to customize:

Database schema (if you need additional fields)
Table name (currently "inventory")
Field names in load_inventory() method
Database location/path
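
As an illustration of adding a field to the schema, the sketch below assumes the inventory is a SQLite table. The column names and both helper functions are illustrative only and do not reflect the kit's actual schema or load_inventory() implementation.

```python
import sqlite3


def create_inventory(conn: sqlite3.Connection) -> None:
    """Create the inventory table with one extra, organization-specific column."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS inventory (
            source_id      TEXT PRIMARY KEY,
            arc_xp_type    TEXT,
            ans_id         TEXT,
            source_section TEXT  -- extra column added for this example
        )
        """
    )


def upsert_item(conn, source_id, arc_xp_type, ans_id, source_section=None):
    """Insert a record, replacing any earlier row with the same source_id."""
    conn.execute(
        "INSERT OR REPLACE INTO inventory VALUES (?, ?, ?, ?)",
        (source_id, arc_xp_type, ans_id, source_section),
    )
    conn.commit()
```

Passing sqlite3.connect(":memory:") as the connection is a convenient way to try schema changes without touching the real database file.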

6. Source System References

Search for: "fake-news" throughout the codebase

Files to update:

adapter/convert_story.py: "system": "fake-news" → your CMS name
adapter/convert_image.py: "system": "fake-news" → your CMS name

7. Temporal Activities (Optional - only if using Temporal adapter)

What to customize:

call_extract_api() - API endpoint (uses EXTRACT_API_URL environment variable)
Temporal server URL (uses TEMPORAL_SERVER_URL environment variable)
Add additional Temporal activities and workflows to implement extra processing, if necessary

Testing Your Customizations

Start with fixtures: Create test fixtures matching your CMS data structure
Test identify: Verify content type detection works
Test convert: Check ANS output matches expectations
Test end-to-end: Run full workflow with sample data
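
For the "Test convert" step, a small helper that reports missing fields can make assertions on ANS output easier to read. This is a sketch; the helper name missing_ans_fields and the list of required paths are assumptions, so adjust them to the fields your organization treats as mandatory.

```python
def missing_ans_fields(ans: dict, required_paths):
    """Return the dotted paths that are absent or empty in an ANS document."""
    missing = []
    for path in required_paths:
        node = ans
        # Walk the nested dictionaries one key at a time.
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        # Treat empty strings, lists, and dicts the same as a missing key.
        if node is None or node == "" or node == [] or node == {}:
            missing.append(path)
    return missing
```

In a test, convert a fixture and assert that the helper returns an empty list, for example for the paths "headlines.basic", "canonical_url", and "content_elements".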

Creating Custom Converters

Every organization will need to customize set_content_elements() or subclass StoryConverter to handle different content structures. Most organizations will also need to create additional custom converter classes for different content types or formats.

Why Create Custom Converters?

Different content formats: Your CMS may structure content differently (blocks, rich text, markdown, etc.)
Additional ANS fields: Your organization may need to set ANS fields not covered by the base converter
Content type variations: Different story types may need different processing logic

Pattern 1: Subclass StoryConverter for Content Elements

Most common customization: Override set_content_elements() to handle your content structure.

Example:

class StoryConverterCustomFormat(StoryConverter):
    """
    Custom converter for stories with structured content blocks.
    
    Every organization needs to customize content_elements handling.
    """
    
    def set_content_elements(self, story_data=None):
        """
        CUSTOMIZE: Override to handle your content structure.
        
        This method is the most commonly customized since content formats
        vary significantly between CMSs.
        """
        # Get your content data - adjust field name as needed
        data = story_data if story_data else self.source_data.get("content_blocks", [])
        
        content_elements = []
        
        # Process your structured content
        for block in data:
            if block["type"] == "paragraph":
                content_elements.append({
                    "type": "text",
                    "content": block["text"]
                })
            elif block["type"] == "heading":
                # ANS represents headings as "header" elements with a numeric level
                content_elements.append({
                    "type": "header",
                    "level": block.get("level", 1),
                    "content": block["text"]
                })
            elif block["type"] == "list":
                content_elements.append({
                    "type": "list",
                    "list_type": block.get("list_type", "unordered"),
                    "items": block["items"]
                })
            # Add more block types as needed
        
        self.ans["content_elements"] = content_elements

Pattern 2: Add New ANS Field Methods

Add methods to set additional ANS fields your organization needs:

class StoryConverterExtended(StoryConverter):
    """Converter with additional ANS fields."""
    
    def set_custom_metadata(self):
        """Set custom ANS fields specific to your organization."""
        # TODO: Map your CMS fields to ANS fields
        if self.source_data.get("custom_field"):
            # setdefault avoids a KeyError when additional_properties is not yet set
            self.ans.setdefault("additional_properties", {})["custom_field"] = \
                self.source_data["custom_field"]
    
    def build_ans(self):
        """Override build_ans to include custom methods."""
        super().build_ans()  # Call parent to set standard fields
        self.set_custom_metadata()  # Add your custom fields
        return self.ans

Pattern 3: Register Custom Converters

After creating a custom converter:

Import it in adapter_temporal/activities.py:

from adapter.convert_story import StoryConverterCustomFormat

Add to CONVERTER_CLASSES dictionary:

CONVERTER_CLASSES = {
    "StoryConverter": StoryConverter,
    "StoryConverterHTMLBody": StoryConverterHTMLBody,
    "StoryConverterRecipeBody": StoryConverterRecipeBody,
    "ImageConverter": ImageConverter,
    "StoryConverterCustomFormat": StoryConverterCustomFormat,  # Your custom converter
}

Reference in adapter/identify.py when identifying content:

# In identify_content_type()
if content.get("custom_format_indicator"):
    content_item.converter_name = ConverterNames.CUSTOM_FORMAT.value

Add to ConverterNames enum in adapter/identify.py:

class ConverterNames(str, Enum):
    STORY = "StoryConverter"
    HTML_STORY = "StoryConverterHTMLBody"
    RECEPIE = "StoryConverterRecipeBody"
    IMAGE = "ImageConverter"
    CUSTOM_FORMAT = "StoryConverterCustomFormat"  # Add your converter
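
Because a converter must appear in both ConverterNames and CONVERTER_CLASSES, it is easy to update one and forget the other. The sketch below shows a consistency check that could run in a unit test. The classes here are simplified stand-ins for the kit's real ones, and unregistered_converters is a hypothetical helper.

```python
from enum import Enum


class StoryConverter:
    """Stand-in for the kit's StoryConverter."""


class StoryConverterCustomFormat(StoryConverter):
    """Stand-in for your custom converter."""


class ConverterNames(str, Enum):
    STORY = "StoryConverter"
    CUSTOM_FORMAT = "StoryConverterCustomFormat"


CONVERTER_CLASSES = {
    "StoryConverter": StoryConverter,
    "StoryConverterCustomFormat": StoryConverterCustomFormat,
}


def unregistered_converters(names, registry):
    """Return enum values that have no matching entry in the registry."""
    return [name.value for name in names if name.value not in registry]
```

An empty list from unregistered_converters(ConverterNames, CONVERTER_CLASSES) means every name that identify.py can assign will resolve to a class.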

Common Customization Patterns

Pattern 1: Different Field Names

If your CMS uses different field names, search and replace:

headline → your headline field
publish_date → your date field
section → your section field

Pattern 2: Reformatting and Concatenating Fields

You may need to reformat original data or concatenate multiple source fields to create a single ANS field.

Common scenarios:

Date reformatting: Convert from your CMS date format to Arc XP format
URL processing: Extract path from full URL, remove query parameters
Author parsing: Split "Name, Organization" format into separate fields
Headline construction: Combine multiple fields (e.g., title + subtitle or category + title)
Identifier generation: Create unique IDs from multiple source fields
Re-map section hierarchy: Map your CMS section names to Arc XP section names, especially if you're restructuring your section organization (e.g., "sports/baseball" → "/sports" or "news/local" → "/local-news")
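
Several of the scenarios above can be handled by small, independently testable helpers. The following sketch uses only the standard library. The function names, the source date format, the assumption that source dates are in UTC, and the SECTION_MAP contents are all illustrative; confirm the target date format against the ANS schema before relying on it.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Hypothetical mapping from source section paths to Arc XP section IDs.
SECTION_MAP = {"sports/baseball": "/sports", "news/local": "/local-news"}


def to_ans_date(value: str, fmt: str = "%m/%d/%Y %H:%M") -> str:
    """Parse a source date (assumed to be UTC) into an ISO 8601 timestamp."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def url_path(url: str) -> str:
    """Keep only the path of a URL, dropping host, query, and fragment."""
    return urlsplit(url).path


def split_author(credit: str):
    """Split "Name, Organization" into a (name, organization) tuple."""
    name, _, org = credit.partition(",")
    return name.strip(), org.strip()


def map_section(source_section: str, default: str = "/") -> str:
    """Look up the Arc XP section for a source section, with a fallback."""
    return SECTION_MAP.get(source_section.strip("/"), default)
```

Keeping these as plain functions, separate from the converter class, lets you call them from any set_*() method and test them without building a full source record.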

Pattern 3: Different Content Structure

If your content structure differs:

Subclass StoryConverter and override set_content_elements()
Update identify_content_type() to detect your structure
Register your new converter class

Pattern 4: Additional Content Types

To add new content types:

Add to ContentTypeArcXP enum in adapter/identify.py
Add detection logic in identify_content_type()
Create converter class in adapter/convert_*.py
Register in CONVERTER_CLASSES in adapter_temporal/activities.py
Add to ConverterNames enum in adapter/identify.py
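
The first two steps might look like the sketch below, which adds a video type. The enum here is a stand-in whose values are assumptions, the field names stream_url, headline, canonical_url, and url are hypothetical, and detect_type works on a plain dict rather than the kit's ContentItem.

```python
from enum import Enum


class ContentTypeArcXP(str, Enum):
    """Stand-in enum; the values are assumptions for this example."""
    STORY = "story"
    IMAGE = "image"
    VIDEO = "video"  # newly added content type


def detect_type(content: dict):
    """Return an Arc XP content type for a source record, or None."""
    # Check the most specific indicator first so a video record that also
    # carries a headline is not misclassified as a story.
    if content.get("stream_url"):
        return ContentTypeArcXP.VIDEO.value
    if content.get("headline") and content.get("canonical_url"):
        return ContentTypeArcXP.STORY.value
    if content.get("url"):
        return ContentTypeArcXP.IMAGE.value
    return None
```

The order of the checks matters: place the narrowest condition first and leave the broadest one for last.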

Finding Customization Points

Method 1: Search for TODO Comments

grep -r "TODO" adapter/

Method 2: Search for "fake-news" References

grep -r "fake-news" adapter/

Method 3: Look for `source_data.get()` Calls

These indicate where source data is accessed:

grep -r "source_data.get" adapter/

Method 4: Review Converter Methods

All set_*() methods in converter classes are customization points:

grep -r "def set_" adapter/convert_*.py

Extending Converters with Additional ANS Fields

Many organizations need to set additional ANS fields beyond what the base converters provide. Here's how to add them:

Adding New ANS Field Methods

Create a new method in your converter class (or subclass):

class StoryConverterExtended(StoryConverter):
    """Converter with additional ANS fields."""
    
    def set_subtitle(self):
        """Set ANS subtitle field from your CMS data."""
        # TODO: Update field name to match your CMS
        subtitle = self.source_data.get("subtitle", "")
        if subtitle:
            self.ans["subheadlines"] = {"basic": subtitle}
    
    def set_distributor(self):
        """Set distributor information if your content comes from wires."""
        # TODO: Map your distributor field
        if self.source_data.get("wire_source"):
            self.ans["distributor"] = {
                "category": self.source_data["wire_source"],
                "name": self.source_data.get("wire_name", "")
            }
    
    def build_ans(self):
        """Override build_ans to include all custom methods."""
        # Call parent to set standard fields
        super().build_ans()
        # Add your custom field setters
        self.set_subtitle()
        self.set_distributor()
        return self.ans

Reference ANS documentation to understand field structure:
- ANS fields: https://arcxp.github.io/ans-schema/
- Each field has a specific structure that must be followed
Test with sample data to ensure fields are set correctly

Common ANS Fields to Consider Adding

subheadlines.basic - Subtitle
description.basic - Story description/abstract
taxonomy.tags - Content tags
distributor - For wire content
workflow - Publishing workflow status
promo_items.basic - For featured media in a story
additional_properties - Custom metadata
additional_properties.expiration_date - Only for images, and usually those that come in via a wires source
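
As an example of two fields from this list, the sketch below writes description.basic and taxonomy.tags into an ANS dictionary. The helper names are hypothetical, and the slug format is an assumption; verify the exact field structure against the ANS schema.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def add_description(ans: dict, summary: str) -> None:
    """Set description.basic when the source provides a summary."""
    if summary:
        ans["description"] = {"basic": summary}


def add_tags(ans: dict, tags) -> None:
    """Write tags to taxonomy.tags as objects with "text" and "slug" keys."""
    cleaned = [tag.strip() for tag in tags if tag and tag.strip()]
    if cleaned:
        # setdefault preserves any other taxonomy fields already set.
        ans.setdefault("taxonomy", {})["tags"] = [
            {"text": tag, "slug": slugify(tag)} for tag in cleaned
        ]
```

Inside a converter subclass, the same logic would live in set_*() methods that read from self.source_data and write to self.ans.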

Recommended Customization Order

Extract - Get your data flowing first
Identify - Ensure content types are detected correctly
Convert (basic) - Get one content type working end-to-end
Convert (content_elements) - CRITICAL - Customize content_elements handling
Convert (additional fields) - Add ANS fields your organization needs
Convert (advanced) - Add remaining content types and edge cases
Polish - Update naming, logging, error messages

Getting Help

Review fixture files in /fixtures to understand expected data structures
Check adapter/ReadMe.md for detailed explanations of each component
Examine test files in adapter_temporal/tests/ for usage examples
