This guide helps you identify the key areas to customize when adapting this Arc XP ETL Migration Starter Kit for your own CMS source.
- Extract - Update adapter/extract.py to connect to your CMS API
- Identify - Modify adapter/identify.py to recognize your content types
- Convert (content_elements) - CRITICAL - Every organization needs to customize set_content_elements() or subclass StoryConverter
- Convert (additional fields) - Add ANS field methods for fields your organization needs
- Convert (custom converters) - Create additional converter classes if needed
- Register Converters - Add custom converters to CONVERTER_CLASSES in adapter_temporal/activities.py
- Inventory - Adjust adapter/inventory.py schema if needed
- Source System Name - Update "fake-news" references to your CMS name
- Environment Variables - Set ORG, ENV, WEBSITE, BEARER_TOKEN
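The four environment variables in the checklist can be validated once at startup so that a missing value fails early rather than partway through a migration. The following is a minimal sketch, not code from the kit: the helper name load_settings and the fail-fast behavior are assumptions, and only the variable names come from the checklist.

```python
import os

REQUIRED_VARS = ("ORG", "ENV", "WEBSITE", "BEARER_TOKEN")


def load_settings(environ=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper: the starter kit may read these values differently.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in tests.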
Note: For detailed explanations of what each component does, see adapter/ReadMe.md. This guide focuses on how to customize each component.
Important: The points below are starting places for customization, not an exhaustive list. You will likely need to customize additional methods and areas as you work through your specific CMS requirements. See the Finding Customization Points section below for methods to locate all customization opportunities in the codebase.
What to customize:
- API endpoint URL
- Authentication method (if required)
- Request parameters or headers
- Response parsing logic
Example:

    import requests

    def extract_cms_source():
        # TODO: Replace with your CMS API endpoint
        source_url = "https://your-cms-api.com/content"
        # TODO: Add authentication if needed
        headers = {"Authorization": "Bearer YOUR_TOKEN"}
        req = requests.get(source_url, headers=headers, timeout=30)
        # Fail loudly on HTTP errors instead of returning an unset variable
        req.raise_for_status()
        return req.json()

What to customize:
- Content type detection logic based on your source data structure
- Field names that identify stories, images, videos, etc.
- Converter selection logic
- ANS ID generation source (what unique identifier to use)
Key areas to update:
- Story identification - Update field names that identify stories (e.g., headline → title)
- Image identification - Update how images are detected
- Additional content types - Add detection logic for your custom content types
- ANS ID generation - Update set_ans_id() to use your unique identifiers
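For ANS ID generation, one common approach is to derive the ID deterministically from a stable source identifier, so that re-running the migration produces the same ID for the same item. A minimal sketch, assuming a 26-character uppercase base32 ID built from a BLAKE2b hash; the function name make_ans_id and its inputs are hypothetical, so check the kit's set_ans_id() for the scheme it actually uses:

```python
import base64
import hashlib


def make_ans_id(org: str, source_id: str) -> str:
    """Derive a stable 26-character ID from the org name and a source identifier."""
    digest = hashlib.blake2b(f"{org}:{source_id}".encode("utf-8"), digest_size=16).digest()
    # 16 bytes encode to 32 base32 characters, 6 of which are "=" padding
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

Whatever source field you pass as source_id should be unique and should not change between extraction runs.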
Example:

    def identify_content_type(content_item: ContentItem, org: str):
        content = content_item.content
        # TODO: Update these field names to match your CMS structure
        # Example: if your CMS uses "title" instead of "headline"
        if content.get("title") and content.get("canonical_url"):
            content_item.arc_xp_type = ContentTypeArcXP.STORY.value
        # ... rest of logic

What to customize: This is not a complete list of the customizable methods in this repository, but a selection of the first places to start. You will likely also add new methods that transform other ANS fields your organization requires.
Priority customization points:
- base_ans property (StoryConverter): Update source.system from "fake-news" to your CMS name
- base_ans property (ImageConverter): Update source.system from "fake-news" to your CMS name, and customize the image URL source field (currently uses self.source_data.get("url"))
- set_content_elements(): CRITICAL - Every organization needs to customize this. content_elements are the body of your story and contain lists, embeds and other formatting that will be unique to your organization's prior CMS.
- set_headline(): Map your headline field name
- set_publish(): Adjust date format parsing if needed
- set_section(): Map your section field and add section name mapping if needed
- set_credits(): Map author/credit fields
- set_canonical_url(): Map URL field
Finding customization points:
- Look for self.source_data.get("field_name") - these are where source data is accessed
- All set_*() methods map source fields to ANS fields
- See adapter/ReadMe.md for a complete reference of all converter methods
Example:

    def set_headline(self, text: str = ""):
        # TODO: Update "headline" to match your CMS field name
        self.ans["headlines"]["basic"] = (
            text if text else self.source_data.get("title", "")  # Changed "headline" to "title"
        )

What to customize:
- base_ans property: Update source.system from "fake-news" to your CMS name, and customize the image URL source field. The base_ans property uses self.source_data.get("url") and self.source_data["url"] to identify the image URL. You'll need to either:
  - Update these field references to match your CMS field name (e.g., self.source_data.get("image_url") or self.source_data.get("media_url")), or
  - Create a new method to extract the image URL and use that instead
- Image metadata mapping (alt text, caption, etc.)
- Image processing and transformation logic
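If you choose to add a dedicated method for the image URL, a small helper that tries several candidate field names keeps the lookup in one place. This is a minimal sketch: the function name get_image_url and the candidate field names are hypothetical and should be replaced with whatever your CMS export actually contains.

```python
def get_image_url(source_data: dict, candidates=("image_url", "media_url", "url")) -> str:
    """Return the first non-empty URL found under the candidate field names."""
    for field in candidates:
        value = source_data.get(field)
        if value:
            return value
    raise KeyError(f"No image URL found in fields: {', '.join(candidates)}")
```

Raising when no URL is present surfaces incomplete source records during conversion instead of producing an image without a source URL.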
What to customize:
- Database schema (if you need additional fields)
- Table name (currently "inventory")
- Field names in load_inventory() method
- Database location/path
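As an illustration of adding a field to the inventory schema, the sketch below creates an "inventory" table with one extra column. It assumes a SQLite database, which this guide does not confirm; the column names and the function create_inventory_table are hypothetical, so compare against the schema in adapter/inventory.py before adapting it.

```python
import sqlite3


def create_inventory_table(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create an inventory table that includes one additional column."""
    conn = sqlite3.connect(db_path)
    # source_section is the hypothetical additional field; the other
    # column names are placeholders, not the kit's actual schema.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS inventory (
            source_id TEXT PRIMARY KEY,
            arc_xp_type TEXT,
            converter_name TEXT,
            source_section TEXT
        )
        """
    )
    conn.commit()
    return conn
```

If you add a column, remember to update the field names used in load_inventory() to match.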
Search for: "fake-news" throughout the codebase
Files to update:
- adapter/convert_story.py: "system": "fake-news" → your CMS name
- adapter/convert_image.py: "system": "fake-news" → your CMS name
What to customize:
- call_extract_api() - API endpoint (uses EXTRACT_API_URL environment variable)
- Temporal server URL (uses TEMPORAL_SERVER_URL environment variable)
- Add additional Temporal activities and workflows to implement extra processing, if necessary
- Start with fixtures: Create test fixtures matching your CMS data structure
- Test identify: Verify content type detection works
- Test convert: Check ANS output matches expectations
- Test end-to-end: Run full workflow with sample data
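The fixture-first approach above can be sketched as a small test. To keep the example self-contained, convert_story below is a stand-in function rather than the kit's converter, and the fixture field names are hypothetical; in your own tests you would import your converter class and load a fixture file instead.

```python
# Stand-in for a converter call; replace with your real converter in practice.
def convert_story(source: dict) -> dict:
    return {
        "type": "story",
        "headlines": {"basic": source.get("title", "")},
        "canonical_url": source.get("canonical_url", ""),
    }


# Fixture shaped like one item from your CMS (hypothetical field names)
FIXTURE = {"title": "Council approves budget", "canonical_url": "/news/budget"}


def test_convert_story_maps_headline_and_url():
    ans = convert_story(FIXTURE)
    assert ans["type"] == "story"
    assert ans["headlines"]["basic"] == "Council approves budget"
    assert ans["canonical_url"] == "/news/budget"
```

Written this way, the test can be collected by pytest or called directly as a plain function.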
Every organization will need to customize set_content_elements() or subclass StoryConverter to handle different content structures. Most organizations will also need to create additional custom converter classes for different content types or formats.
- Different content formats: Your CMS may structure content differently (blocks, rich text, markdown, etc.)
- Additional ANS fields: Your organization may need to set ANS fields not covered by the base converter
- Content type variations: Different story types may need different processing logic
Most common customization: Override set_content_elements() to handle your content structure.
Example:

    class StoryConverterCustomFormat(StoryConverter):
        """
        Custom converter for stories with structured content blocks.
        Every organization needs to customize content_elements handling.
        """

        def set_content_elements(self, story_data=None):
            """
            CUSTOMIZE: Override to handle your content structure.
            This method is the most commonly customized since content formats
            vary significantly between CMSs.
            """
            # Get your content data - adjust field name as needed
            data = story_data if story_data else self.source_data.get("content_blocks", [])
            content_elements = []
            # Process your structured content
            for block in data:
                if block["type"] == "paragraph":
                    content_elements.append({
                        "type": "text",
                        "content": block["text"],
                    })
                elif block["type"] == "heading":
                    # ANS represents headings as "header" elements with a numeric level
                    content_elements.append({
                        "type": "header",
                        "level": block.get("level", 1),
                        "content": block["text"],
                    })
                elif block["type"] == "list":
                    # ANS list items are themselves elements; convert plain strings if needed
                    content_elements.append({
                        "type": "list",
                        "list_type": block.get("list_type", "unordered"),
                        "items": block["items"],
                    })
                # Add more block types as needed
            self.ans["content_elements"] = content_elements

Add methods to set additional ANS fields your organization needs:

    class StoryConverterExtended(StoryConverter):
        """Converter with additional ANS fields."""

        def set_custom_metadata(self):
            """Set custom ANS fields specific to your organization."""
            # TODO: Map your CMS fields to ANS fields
            custom_value = self.source_data.get("custom_field")
            if custom_value:
                # setdefault avoids a KeyError if additional_properties is not set yet
                self.ans.setdefault("additional_properties", {})["custom_field"] = custom_value

        def build_ans(self):
            """Override build_ans to include custom methods."""
            super().build_ans()  # Call parent to set standard fields
            self.set_custom_metadata()  # Add your custom fields
            return self.ans

After creating a custom converter:
- Import it in adapter_temporal/activities.py:

    from adapter.convert_story import StoryConverterCustomFormat

- Add to CONVERTER_CLASSES dictionary:

    CONVERTER_CLASSES = {
        "StoryConverter": StoryConverter,
        "StoryConverterHTMLBody": StoryConverterHTMLBody,
        "StoryConverterRecipeBody": StoryConverterRecipeBody,
        "ImageConverter": ImageConverter,
        "StoryConverterCustomFormat": StoryConverterCustomFormat,  # Your custom converter
    }

- Reference in adapter/identify.py when identifying content:

    # In identify_content_type()
    if content.get("custom_format_indicator"):
        content_item.converter_name = ConverterNames.CUSTOM_FORMAT.value

- Add to ConverterNames enum in adapter/identify.py:

    class ConverterNames(str, Enum):
        STORY = "StoryConverter"
        HTML_STORY = "StoryConverterHTMLBody"
        RECEPIE = "StoryConverterRecipeBody"
        IMAGE = "ImageConverter"
        CUSTOM_FORMAT = "StoryConverterCustomFormat"  # Add your converter

If your CMS uses different field names, search and replace:
- headline → your headline field
- publish_date → your date field
- section → your section field
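As an alternative to editing every field reference by hand, the renames can be collected in one mapping that the converter methods consult. This is only a sketch of that idea, not how the kit is written: FIELD_MAP, get_field, and the right-hand field names are all hypothetical.

```python
# Logical field name -> field name in your CMS export (right-hand values are hypothetical)
FIELD_MAP = {
    "headline": "title",
    "publish_date": "published_at",
    "section": "category",
}


def get_field(source_data: dict, logical_name: str, default=""):
    """Look up a logical field through FIELD_MAP, falling back to the logical name."""
    return source_data.get(FIELD_MAP.get(logical_name, logical_name), default)
```

A converter method could then call get_field(self.source_data, "headline") instead of naming the CMS field directly.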
You may need to reformat original data or concatenate multiple source fields to create a single ANS field.
Common scenarios:
- Date reformatting: Convert from your CMS date format to Arc XP format
- URL processing: Extract path from full URL, remove query parameters
- Author parsing: Split "Name, Organization" format into separate fields
- Headline construction: Combine multiple fields (e.g., title + subtitle or category + title)
- Identifier generation: Create unique IDs from multiple source fields
- Re-map section hierarchy: Map your CMS section names to Arc XP section names, especially if you're restructuring your section organization (e.g., "sports/baseball" → "/sports" or "news/local" → "/local-news")
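Several of the scenarios above reduce to small pure functions that your set_*() methods can call. The helpers below are a minimal sketch: the function names, the source date format, the assumption that source dates are in UTC, and the output keys of split_author are all hypothetical, and the section mapping simply reuses the examples from the list.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

SECTION_MAP = {"sports/baseball": "/sports", "news/local": "/local-news"}


def to_arc_date(value: str, fmt: str = "%m/%d/%Y %H:%M") -> str:
    """Parse a source date (assumed to be UTC) and return an ISO 8601 string."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def url_path(url: str) -> str:
    """Keep only the path of a URL, dropping host, query string and fragment."""
    return urlsplit(url).path


def split_author(credit: str) -> dict:
    """Split "Name, Organization" into separate parts."""
    name, _, org = credit.partition(",")
    return {"name": name.strip(), "org": org.strip()}


def map_section(source_section: str, default: str = "/") -> str:
    """Translate a source section name to its Arc XP section path."""
    return SECTION_MAP.get(source_section, default)
```

Keeping these transformations outside the converter class makes them easy to unit test against fixture values before wiring them into the conversion.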
If your content structure differs:
- Subclass StoryConverter and override set_content_elements()
- Update identify_content_type() to detect your structure
- Register your new converter class
To add new content types:
- Add to ContentTypeArcXP enum in adapter/identify.py
- Add detection logic in identify_content_type()
- Create converter class in adapter/convert_*.py
- Register in CONVERTER_CLASSES in adapter_temporal/activities.py
- Add to ConverterNames enum in adapter/identify.py
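The steps above can be sketched end to end in one file. Everything here is a stand-in so the example runs on its own: the enum values, the "gallery" content type, the gallery_images field, GalleryConverter, and the identify function are hypothetical, and in the kit these pieces live in adapter/identify.py, adapter/convert_*.py, and adapter_temporal/activities.py.

```python
from enum import Enum


class ContentTypeArcXP(str, Enum):  # stand-in for the enum in adapter/identify.py
    STORY = "story"
    GALLERY = "gallery"  # hypothetical new content type


class ConverterNames(str, Enum):  # stand-in for the enum in adapter/identify.py
    STORY = "StoryConverter"
    GALLERY = "GalleryConverter"  # hypothetical new converter name


class StoryConverter:  # stand-in converter
    pass


class GalleryConverter:  # hypothetical converter for the new content type
    pass


# Registry keyed by converter name, mirroring CONVERTER_CLASSES
CONVERTER_CLASSES = {
    "StoryConverter": StoryConverter,
    "GalleryConverter": GalleryConverter,
}


def identify(content: dict):
    """Return (content type, converter name) for a source item."""
    if content.get("gallery_images"):
        return ContentTypeArcXP.GALLERY.value, ConverterNames.GALLERY.value
    return ContentTypeArcXP.STORY.value, ConverterNames.STORY.value
```

The converter name returned by detection must match a key in CONVERTER_CLASSES exactly, which is why the enum value and the registry key are the same string.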

    grep -r "TODO" adapter/
    grep -r "fake-news" adapter/

These indicate where source data is accessed:

    grep -r "source_data.get" adapter/

All set_*() methods in converter classes are customization points:

    grep -r "def set_" adapter/convert_*.py

Many organizations need to set additional ANS fields beyond what the base converters provide. Here's how to add them:
- Create a new method in your converter class (or subclass):

    class StoryConverterExtended(StoryConverter):
        """Converter with additional ANS fields."""

        def set_subtitle(self):
            """Set ANS subtitle field from your CMS data."""
            # TODO: Update field name to match your CMS
            subtitle = self.source_data.get("subtitle", "")
            if subtitle:
                self.ans["subheadlines"] = {"basic": subtitle}

        def set_distributor(self):
            """Set distributor information if your content comes from wires."""
            # TODO: Map your distributor field
            if self.source_data.get("wire_source"):
                self.ans["distributor"] = {
                    "category": self.source_data["wire_source"],
                    "name": self.source_data.get("wire_name", ""),
                }

        def build_ans(self):
            """Override build_ans to include all custom methods."""
            # Call parent to set standard fields
            super().build_ans()
            # Add your custom field setters
            self.set_subtitle()
            self.set_distributor()
            return self.ans

- Reference ANS documentation to understand field structure:
  - ANS fields: https://arcxp.github.io/ans-schema/
  - Each field has a specific structure that must be followed
- Test with sample data to ensure fields are set correctly
- subheadlines.basic - Subtitle
- description.basic - Story description/abstract
- taxonomy.tags - Content tags
- distributor - For wire content
- workflow - Publishing workflow status
- promo_items.basic - For featured media in a story
- additional_properties - Custom metadata
- additional_properties.expiration_date - Only for images, and usually those that come in via a wires source
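To show how a few of these fields nest, the sketch below builds subheadlines.basic, description.basic, and taxonomy.tags from a source record. The function name and source field names are hypothetical, and the tag shape (a list of objects with a "text" key) is an assumption that should be verified against the ANS schema before use.

```python
def build_optional_fields(source_data: dict) -> dict:
    """Build a partial ANS dict containing only the optional fields that have data."""
    ans = {}
    if source_data.get("subtitle"):
        ans["subheadlines"] = {"basic": source_data["subtitle"]}
    if source_data.get("summary"):
        ans["description"] = {"basic": source_data["summary"]}
    tags = source_data.get("tags") or []
    if tags:
        # Assumed tag shape; confirm against the ANS schema
        ans["taxonomy"] = {"tags": [{"text": tag} for tag in tags]}
    return ans
```

Fields with no source data are left out entirely rather than set to empty values, which keeps the resulting ANS document smaller and easier to diff.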
- Extract - Get your data flowing first
- Identify - Ensure content types are detected correctly
- Convert (basic) - Get one content type working end-to-end
- Convert (content_elements) - CRITICAL - Customize content_elements handling
- Convert (additional fields) - Add ANS fields your organization needs
- Convert (advanced) - Add remaining content types and edge cases
- Polish - Update naming, logging, error messages
- Review fixture files in /fixtures to understand expected data structures
- Check adapter/ReadMe.md for detailed explanations of each component
- Examine test files in adapter_temporal/tests/ for usage examples