AutoGluon Cloud Fails to Handle Multi-Partition Batch Transform Job Correctly #136

@tonyhoo

Description:

When running batch transform with autogluon.cloud using the MultiRecord strategy, AutoGluon Cloud appears to fail when the input CSV file is split into multiple partitions. Headers are not handled properly across partitions, leading to misaligned columns and prediction failures during inference.

Steps to Reproduce:

The following script can be used to reproduce the issue:

from autogluon.cloud import TabularCloudPredictor
import pandas as pd

# Load datasets
train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=['class'], inplace=True)

# Cloud Predictor Arguments
predictor_init_args = {"label": "class"}  
predictor_fit_args = {"train_data": train_data, "time_limit": 60}  

# Initialize Cloud Predictor and Fit
cloud_predictor = TabularCloudPredictor(cloud_output_path='tonyhu-autogluon')
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)

# Batch Inference with small max_payload to force multiple partitions
result = cloud_predictor.predict(test_data, backend_kwargs={"transformer_kwargs": {"max_payload": 1}})

Expected Behavior:

The batch transform job should handle multiple partitions correctly, aligning columns across the partitions and ignoring or managing headers if present in individual partitions.

Observed Behavior:

The job fails with the following error logs:

Bad HTTP status received from algorithm: 500
invalid literal for int() with base 10: '0.1': Error while type casting for column 'capital-loss'

Logs show that the columns are misaligned for certain partitions:

test_columns: [' 11th', ' Machine-op-inspct', ' Male', ' Never-married', ' Other-relative', ' Private', ' United-States', ' White', '0', '0.1', '207443', '50', '62', '7']
2024-09-13T21:56:19,062 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - train_columns: ['age', 'capital-gain', 'capital-loss', 'education', 'education-num', 'fnlwgt', 'hours-per-week', 'marital-status', 'native-country', 'occupation', 'race', 'relationship', 'sex', 'workclass']

This suggests that the CSV header is not being duplicated across partitions: a partition that arrives without a header has its first data row interpreted as column names, leading to column misalignment.
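The misalignment can be reproduced locally, independent of SageMaker. The sketch below assumes the serving code parses each payload with the default settings of `pandas.read_csv` (an assumption; the issue does not show that code). The first row is reconstructed from the `test_columns` log above, the second row is made up, and the column order is assumed.

```python
import io

import pandas as pd

# First row reconstructed from the test_columns log; second row is made up.
# The column order is an assumption about the original CSV layout.
headerless_chunk = (
    "62, Private,207443, 11th,7, Never-married, Machine-op-inspct,"
    " Other-relative, White, Male,0,0,50, United-States\n"
    "38, Private,89814, HS-grad,9, Married-civ-spouse, Farming-fishing,"
    " Husband, White, Male,0,0,50, United-States\n"
)

# With default settings, read_csv treats the first line as the header.
df = pd.read_csv(io.StringIO(headerless_chunk))

print(sorted(df.columns))  # data values show up as column names
print(len(df))             # one row is lost to the bogus header
```

The two `0` values in the first row collide as column names, so pandas renames the second one to `0.1`, which matches the `'0', '0.1'` pair in the logged `test_columns`.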

Environment:

  • autogluon==1.1.0
  • Running batch transform in SageMaker with MultiRecord strategy.
  • MaxPayloadInMB=1 is set to ensure multiple partitions.

Additional Information:

The issue seems to be that AutoGluon Cloud does not handle headers properly when batch transform splits the input into partitioned records. In a multi-partition job, not every batch includes the header row, which causes the column misalignment.
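One possible way to make the inference side tolerant of both cases is to check whether the first line of a payload is the header and, if it is not, apply the known column names. This is a minimal sketch, not the AutoGluon Cloud implementation: `parse_partition` and `expected_columns` are hypothetical names, and it assumes the original column order is available to the serving code. It uses only the standard library to keep the logic visible; the same check could precede a `pandas.read_csv` call.

```python
from __future__ import annotations

import csv
import io


def parse_partition(payload: str, expected_columns: list[str]) -> list[dict[str, str]]:
    """Parse one CSV payload that may or may not start with a header row.

    expected_columns is assumed to hold the feature names in the same
    order as the columns of the original CSV (hypothetical input).
    """
    rows = list(csv.reader(io.StringIO(payload)))
    if not rows:
        return []

    # Compare the first line against the known header, ignoring padding.
    first = [cell.strip() for cell in rows[0]]
    if first == expected_columns:
        rows = rows[1:]  # drop the header; only data rows remain

    # Attach the known names to every data row, values left untouched.
    return [dict(zip(expected_columns, row)) for row in rows]


columns = ["age", "workclass"]
with_header = parse_partition("age,workclass\n62, Private\n", columns)
without_header = parse_partition("62, Private\n", columns)
print(with_header == without_header)
```

Both payloads yield the same records, so a partition that lacks the header is no longer parsed with its first data row as column names.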
