Conversation

@sfc-gh-ajiang (Collaborator)

No description provided.

@sfc-gh-dhung (Collaborator) left a comment

Remember this is a public-facing sample, so please make sure the code quality is high. It's especially important for the code to be simple and readable, with self-documenting variable/function names and sufficient comments for non-experts to understand.

Comment on lines 17 to 22
index = int(os.environ.get("SNOWFLAKE_JOB_INDEX", 0))

# Only head node saves and returns results
if index != 0:
print(f"Worker node (index {index}) - exiting")
exit(0)

Why is this necessary? ML Job/CR takes care of multi-node management, the driver script only gets run on the head node

@sfc-gh-ajiang (Collaborator, Author)

Sometimes I get an error:

ValueError: Model is not trained yet. Please call fit first.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ajiang/PycharmProjects/sf-samples/samples/ml/ml_jobs/e2e_task_graph/src/pipeline_local.py", line 134, in <module>
    run_pipeline(
  File "/Users/ajiang/PycharmProjects/sf-samples/samples/ml/ml_jobs/e2e_task_graph/src/pipeline_local.py", line 65, in run_pipeline
    model_obj = job.result()["model_obj"]
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/_internal/telemetry.py", line 611, in wrap
    return ctx.run(execute_func_with_statement_params)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/_internal/telemetry.py", line 576, in execute_func_with_statement_params
    result = func(*args, **kwargs)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/job.py", line 288, in result
    return cast(T, self._result.get_value())
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/_interop/results.py", line 47, in get_value
    self._raise_exception(ex, wrap_exceptions)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/_interop/results.py", line 26, in _raise_exception
    raise RuntimeError(f"Job execution failed with error: {exception!r}") from exception
RuntimeError: Job execution failed with error: ValueError('Model is not trained yet. Please call fit first.')

Do you have any suggestions to resolve it?

@sfc-gh-dhung (Collaborator), Jan 21, 2026

and this doesn't happen if you remove these lines? what is the stack trace you see in the job itself?

# Load the datasets
serialized = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))

except Exception as e:

use a more specific exception type
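One way to act on this suggestion is to catch only the failure the `json.loads` call can actually produce. A minimal sketch (the function name and payload here are illustrative, not from the sample):

```python
import json


def load_dataset_info(raw: str) -> dict:
    """Parse serialized dataset info, surfacing parse errors explicitly."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # A specific exception type avoids masking unrelated bugs
        # (e.g. an AttributeError) that a bare `except Exception` would swallow.
        raise ValueError(f"Invalid dataset info payload: {e}") from e


info = load_dataset_info('{"train": "train.parquet", "test": "test.parquet"}')
```

Anything else (a missing predecessor, a typo'd key) then propagates with its real traceback instead of being silently rerouted.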

Comment on lines 33 to 37
print(f"Error loading dataset info: {e}")
parser = argparse.ArgumentParser()
parser.add_argument("--dataset-info", type=str, required=True)
args = parser.parse_args()
serialized = json.loads(args.dataset_info)

Having argparse in an except block seems like a terrible pattern
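A common alternative is to decide the input source up front rather than treating argparse as an exception fallback. A sketch, assuming the two possible sources are a CLI flag and a predecessor task's return value (`get_predecessor_value` is a hypothetical stand-in for `ctx.get_predecessor_return_value`):

```python
import argparse
import json


def resolve_dataset_info(argv, get_predecessor_value=None):
    # Prefer an explicit --dataset-info argument; otherwise fall back to the
    # predecessor task's return value. Neither path lives in an except block.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-info", type=str, default=None)
    args = parser.parse_args(argv)
    if args.dataset_info is not None:
        return json.loads(args.dataset_info)
    if get_predecessor_value is not None:
        return json.loads(get_predecessor_value("PREPARE_DATA"))
    raise ValueError("no dataset info available from CLI or predecessor task")


cli_info = resolve_dataset_info(["--dataset-info", '{"rows": 10}'])
```

This keeps control flow visible: each source is an ordinary branch, and a genuine failure in one source is not papered over by silently trying the other.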

artifact_dir = config.artifact_dir

# Load the datasets
serialized = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))

what is serialized referring to here? Can you use a more meaningful name?
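A possible rename, purely illustrative: the string returned by PREPARE_DATA is serialized dataset metadata, so a name like `dataset_info` says what the value is rather than how it arrived.

```python
import json

# Hypothetical example payload; the real value comes from
# ctx.get_predecessor_return_value("PREPARE_DATA").
prepare_data_output = '{"train_table": "TRAIN_SPLIT", "test_table": "TEST_SPLIT"}'
dataset_info = json.loads(prepare_data_output)
```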

Comment on lines 44 to 45
if not hasattr(model_obj, 'feature_weights'):
model_obj.feature_weights = None

What is this for?

Comment on lines -147 to -149
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)

One of the main points of this sample is to demonstrate how easy it is to convert a local pipeline to pushing certain steps down into ML Jobs. Needing to write a separate script file which we submit_file() just for this conversion severely weakens this story. Why can't we just keep using a @remote() decorated function? @remote(...) should convert the function into an MLJobDefinition which we can directly use in pipeline_dag without needing an explicit MLJobDefinition.register() call

@sfc-gh-ajiang (Collaborator, Author)

That's because @remote currently does not create a job definition; it creates a job directly. We have only merged the PR for phase one; phase two is in review.

@sfc-gh-dhung (Collaborator)

Let's hold off on merging this until @remote is ready then

metrics = {**train_metrics, **test_metrics}
if artifact_dir:
model_pkl = cp.dumps(model_obj)
model_path = os.path.join(config.artifact_dir, "model.pkl")

Use artifact_dir here. If the prior code changes such that artifact_dir is derived from something other than config.artifact_dir, this will break.

"model_obj": model_obj,
"metrics": metrics,
}
__return__ = result_dict

Why not set __return__ for Task initiated jobs as well?
