SNOW-2367850: task integration example update #250
base: main
Conversation
sfc-gh-dhung left a comment:
Remember this is a public-facing sample, so please make sure the code quality is high. It's especially important for the code to be simple and readable, with self-documenting variable/function names and sufficient comments for non-experts to understand.
```python
index = int(os.environ.get("SNOWFLAKE_JOB_INDEX", 0))

# Only head node saves and returns results
if index != 0:
    print(f"Worker node (index {index}) - exiting")
    exit(0)
```
Why is this necessary? ML Job/CR takes care of multi-node management, the driver script only gets run on the head node
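A sketch of the simplification this implies, assuming the driver script is indeed only executed on the head node (the `SNOWFLAKE_JOB_INDEX` variable comes from the snippet above; whether it is needed at all is the open question here):

```python
import os

# If ML Jobs only runs this driver script on the head node, the early-exit
# guard can be dropped; the index is at most useful for logging.
index = int(os.environ.get("SNOWFLAKE_JOB_INDEX", 0))
print(f"Running driver on node index {index}")
# ...continue straight into saving and returning results...
```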
Sometimes I get an error:
```
ValueError: Model is not trained yet. Please call fit first.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ajiang/PycharmProjects/sf-samples/samples/ml/ml_jobs/e2e_task_graph/src/pipeline_local.py", line 134, in <module>
    run_pipeline(
  File "/Users/ajiang/PycharmProjects/sf-samples/samples/ml/ml_jobs/e2e_task_graph/src/pipeline_local.py", line 65, in run_pipeline
    model_obj = job.result()["model_obj"]
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/_internal/telemetry.py", line 611, in wrap
    return ctx.run(execute_func_with_statement_params)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/_internal/telemetry.py", line 576, in execute_func_with_statement_params
    result = func(*args, **kwargs)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/job.py", line 288, in result
    return cast(T, self._result.get_value())
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/_interop/results.py", line 47, in get_value
    self._raise_exception(ex, wrap_exceptions)
  File "/Users/ajiang/.pyenv/versions/3.10.18/lib/python3.10/site-packages/snowflake/ml/jobs/_interop/results.py", line 26, in _raise_exception
    raise RuntimeError(f"Job execution failed with error: {exception!r}") from exception
RuntimeError: Job execution failed with error: ValueError('Model is not trained yet. Please call fit first.')
```
Do you have any suggestions to resolve it?
And this doesn't happen if you remove these lines? What is the stack trace you see in the job itself?
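One way to check that (a sketch; it assumes the `MLJob` handle returned by the submission is still in scope and that your snowflake-ml-python version exposes `get_logs()`):

```python
# Inspect the failed job itself rather than the client-side RuntimeError:
print(job.status)      # e.g. FAILED
print(job.get_logs())  # driver logs from inside the job, including the real traceback
```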
```python
    # Load the datasets
    serialized = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))

except Exception as e:
```
use a more specific exception type
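A sketch of what narrower handling could look like, assuming the failure mode being guarded against is malformed JSON from the predecessor task (the exact exception raised when no task context is available would need to be confirmed against the TaskContext API):

```python
import json

try:
    serialized = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except json.JSONDecodeError as e:
    print(f"Failed to parse dataset info from PREPARE_DATA: {e}")
    raise
```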
| print(f"Error loading dataset info: {e}") | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument("--dataset-info", type=str, required=True) | ||
| args = parser.parse_args() | ||
| serialized = json.loads(args.dataset_info) |
Having argparse in an except block seems like a terrible pattern
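One possible restructuring (a sketch; variable names are illustrative) that decides the input source up front instead of parsing CLI arguments inside an exception handler:

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset-info", type=str, default=None,
    help="JSON dataset info; omit when running inside a task graph",
)
args = parser.parse_args()

if args.dataset_info is not None:
    # Standalone invocation: dataset info is passed on the command line
    dataset_info = json.loads(args.dataset_info)
else:
    # Task-graph invocation: read the predecessor task's return value
    dataset_info = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
```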
```python
artifact_dir = config.artifact_dir

# Load the datasets
serialized = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
```
What is `serialized` referring to here? Can you use a more meaningful name?
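For example, naming the payload for what it actually holds makes the following steps self-documenting (the downstream helper here is hypothetical):

```python
dataset_info = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
train_df, test_df = load_datasets(dataset_info)  # hypothetical downstream use
```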
```python
if not hasattr(model_obj, 'feature_weights'):
    model_obj.feature_weights = None
```
What is this for?
```python
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
```
One of the main points of this sample is to demonstrate how easy it is to convert a local pipeline by pushing certain steps down into ML Jobs. Needing to write a separate script file that we submit_file() just for this conversion severely weakens that story. Why can't we just keep using a @remote() decorated function? @remote(...) should convert the function into an MLJobDefinition which we can use directly in pipeline_dag without needing an explicit MLJobDefinition.register() call.
That is, currently @remote does not create a job definition; it creates a job directly. We have only merged the PR for phase one, and phase two is still in review.
Let's hold off on merging this until @remote is ready then
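For reference, a sketch of the phase-one behavior described above: calling a @remote-decorated function submits a job immediately and returns a job handle, rather than producing a reusable job definition (function and variable names are illustrative):

```python
from snowflake.ml.jobs import remote

@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(dataset_info: dict) -> dict:
    ...  # training logic runs inside the ML Job

job = train_model(dataset_info)  # phase one: creates and submits a job directly
result = job.result()            # blocks until completion, returns the dict above
```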
```python
metrics = {**train_metrics, **test_metrics}
if artifact_dir:
    model_pkl = cp.dumps(model_obj)
    model_path = os.path.join(config.artifact_dir, "model.pkl")
```
Use `artifact_dir` here. If the code above changes such that `artifact_dir` is derived from something other than `config.artifact_dir`, this will break.
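i.e. something like (a sketch of the suggested change):

```python
if artifact_dir:
    model_pkl = cp.dumps(model_obj)
    model_path = os.path.join(artifact_dir, "model.pkl")  # reuse the local variable
```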
| "model_obj": model_obj, | ||
| "metrics": metrics, | ||
| } | ||
| __return__= result_dict No newline at end of file |
Why not set `__return__` for Task-initiated jobs as well?
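A sketch of what that might look like: build the result once and set `__return__` unconditionally, so ML Job and Task-initiated runs expose the same payload (this assumes `result_dict` is built the same way in both modes):

```python
result_dict = {
    "model_obj": model_obj,
    "metrics": metrics,
}
# Set __return__ regardless of whether the script was launched as an ML Job
# or by a task, so job.result() and task consumers see the same payload.
__return__ = result_dict
```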