Commit 3c493ed

Merge pull request #88 from ciaran28/main: Bug Fix
2 parents: 14825db + 4632b2d

File tree: 2 files changed (+11, -22 lines)

README.md

Lines changed: 4 additions & 21 deletions
@@ -168,36 +168,19 @@ Secrets in GitHub should look exactly like below. The secrets are case sensitive
 
 <img width="893" alt="image" src="https://user-images.githubusercontent.com/108273509/205954210-c123c407-4c83-4952-ab4b-cd6c485efc2f.png">
 
-- Azure Resources created (Production Environment snapshot)
+- Azure Resources created (Production Environment snapshot. For speed, all environment deployments except Sandbox are commented out; update onDeploy.yaml to deploy every environment.)
 
 <img width="1175" alt="image" src="https://user-images.githubusercontent.com/108273509/194638664-fa6e1809-809e-45b2-9655-9312f32f24bb.png">
 
 
 ---
 ---
-
 
-# Repo Guidance
+## Running Pipelines
 
-## Databricks as Infrastructure
-<details close>
-<summary>Click Dropdown... </summary>
+- The end-to-end machine learning pipeline is pre-configured in the "Workflows" section in Databricks. It uses a Job Cluster, which automatically installs the necessary dependencies contained within a Python wheel file (see the Jobs API sketch after this diff).
 
-<br>
-There are many ways that a User may create Databricks Jobs, Notebooks, Clusters, Secret Scopes etc. <br>
-<br>
-For example, they may interact with the Databricks API/CLI by using: <br>
-<br>
-i. VS Code on their local machine, <br>
-ii. the Databricks GUI online; or <br>
-iii. a YAML Pipeline deployment on a DevOps Agent (e.g. GitHub Actions or Azure DevOps etc). <br>
-<br>
-
-The programmatic way in which the first two scenarios allow us to interact with the Databricks API is akin to "Continuous **Development**", as opposed to "Continuous **Deployment**". The former is strong on flexibility, however, it is somewhat weak on governance, accountability and reproducibility. <br>
-
-In a nutshell, Continuous **Development** _is a partly manual process where developers can deploy any changes to customers by simply clicking a button, while continuous **Deployment** emphasizes automating the entire process_.
-
-</details>
+- If you wish to run the machine learning scripts from the Notebook instead, first upload the dependencies (automatic upload is in development): navigate to the Python wheel file contained within the dist/ folder and manually upload it to the cluster you wish to use for the Notebook (see the install sketch after this diff).
 
 ---
 ---
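The first bullet above describes a Databricks job whose Job Cluster installs the project wheel before the task runs. As a minimal sketch of how such a job could be registered, assuming the Databricks Jobs API 2.1 and placeholder values for the host, token, wheel path, script path, and cluster spec (none of these values come from this repo):

```python
# Hypothetical sketch: create a job whose Job Cluster installs the project
# wheel as a library before running the task (Databricks Jobs API 2.1).
# host, token, and both dbfs:/ paths are placeholders, not repo values.
import requests

host = "https://<your-databricks-instance>"
token = "<personal-access-token>"

job_spec = {
    "name": "nyc-taxi-train-register",
    "tasks": [
        {
            "task_key": "train_register",
            "new_cluster": {
                "spark_version": "11.3.x-cpu-ml-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
            # the Job Cluster installs the wheel before the task starts
            "libraries": [{"whl": "dbfs:/FileStore/wheels/<project>.whl"}],
            "spark_python_task": {"python_file": "dbfs:/scripts/train_register.py"},
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123}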
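For the second bullet, an alternative to uploading the wheel through the cluster Libraries UI is a `%pip` cell at the top of the Notebook. This is a sketch only; the DBFS path is a placeholder for wherever you copy the wheel built into dist/:

```python
# Hypothetical notebook cell: install the project wheel on the attached
# cluster before importing from it. The path below is a placeholder;
# upload the wheel from dist/ to DBFS first.
%pip install /dbfs/FileStore/wheels/<project>-0.1.0-py3-none-any.whl
```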

mlOps/modelOps/data_science/nyc_taxi/train_register.py

Lines changed: 7 additions & 1 deletion
@@ -8,6 +8,9 @@
 # Install pypi packages azureml-sdk[databricks], lightgbm, uszipcode
 # The above will be automated in due course
 
+
+# https://learn.microsoft.com/en-us/azure/databricks/_extras/notebooks/source/machine-learning/automl-feature-store-example.html
+
 # COMMAND ----------
 
 from pyspark.sql import *
@@ -154,7 +157,7 @@ def __init__(self, spark: SparkSession, experiment_name: str, namespace: str, wo
         self.track_in_azure_ml = False
         self.namespace = namespace
         self.ws = workspace
-        self.model_folder = "outputs"
+        self.model_folder = "cached_models"
         self.dbutils = SparkRunner().get_dbutils()
 
 
@@ -337,6 +340,9 @@ def train_model(
         )
 
         #Save The Model
+
+        self.create_model_folder()
+
         model_file_path = self.get_model_file_path("taxi_example_fare_packaged")
         print(f"ModelFilePath: {model_file_path}")
         joblib.dump(
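
The second hunk is the bug fix itself: joblib.dump raises FileNotFoundError when the target folder (now "cached_models") does not exist, so the folder is created before saving. A minimal sketch of what a create_model_folder helper can look like, assuming it only needs to guarantee the directory exists; this is an illustration, not the repo's exact implementation:

```python
import os


class TrainRegisterSketch:
    """Illustrative stand-in for the trainer class touched in this commit."""

    def __init__(self) -> None:
        self.model_folder = "cached_models"

    def create_model_folder(self) -> None:
        # exist_ok=True makes this safe to call on every run, including
        # repeated runs on a cluster where the folder already exists
        os.makedirs(self.model_folder, exist_ok=True)
```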
