
Commit 159e7ce

Merge commit with 2 parents: 0f5b3be + 878e106

File tree: 3 files changed (+66 −40 lines)


articles/machine-learning/how-to-track-experiments.md

Lines changed: 26 additions & 0 deletions
@@ -48,6 +48,7 @@ The following metrics can be added to a run while training an experiment. To vie
 If you want to track or monitor your experiment, you must add code to start logging when you submit the run. The following are ways to trigger the run submission:
 * __Run.start_logging__ - Add logging functions to your training script and start an interactive logging session in the specified experiment. **start_logging** creates an interactive run for use in scenarios such as notebooks. Any metrics that are logged during the session are added to the run record in the experiment.
 * __ScriptRunConfig__ - Add logging functions to your training script and load the entire script folder with the run. **ScriptRunConfig** is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to get a visual widget to monitor.
+* __Designer logging__ - Add logging functions to a drag-&-drop designer pipeline by using the __Execute Python Script__ module. Add Python code to log designer experiments.

 ## Set up the workspace

 Before adding logging and submitting an experiment, you must set up the workspace.
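
To make the two existing submission options in this hunk concrete, here's a minimal sketch of both paths (not part of this commit). It assumes azureml-core (SDK v1), a workspace config.json in the working folder, and a local train.py script; the experiment and script names are illustrative.

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                       # reads config.json
exp = Experiment(workspace=ws, name='track-demo')  # illustrative name

# Option 1: interactive logging (Run.start_logging), for example from a notebook.
run = exp.start_logging()
run.log('alpha', 0.03)   # log any metric name/value you want to track
run.complete()

# Option 2: submit a script folder with ScriptRunConfig and monitor the run.
config = ScriptRunConfig(source_directory='.', script='train.py')
script_run = exp.submit(config)
script_run.wait_for_completion(show_output=True)
```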
@@ -100,8 +101,33 @@ This example expands on the basic sklearn Ridge model from above. It does a simp
 [!notebook-python[] (~/MachineLearningNotebooks/how-to-use-azureml/training/train-on-local/train-on-local.ipynb?name=src)]
 [!notebook-python[] (~/MachineLearningNotebooks/how-to-use-azureml/training/train-on-local/train-on-local.ipynb?name=run)]

+## Option 3: Log designer experiments

+Use the __Execute Python Script__ module to add logging logic to your designer experiments. You can log any value using this workflow, but it's especially useful to log metrics from the __Evaluate Model__ module to track model performance across different runs.

+1. Connect an __Execute Python Script__ module to the output of your __Evaluate Model__ module.
+
+    ![Connect Execute Python Script module to Evaluate Model module](./media/how-to-track-experiments/designer-logging-pipeline.png)
+
+1. Paste the following code into the __Execute Python Script__ code editor to log the mean absolute error for your trained model:
+
+    ```python
+    # dataframe1 contains the values from Evaluate Model
+    def azureml_main(dataframe1 = None, dataframe2 = None):
+        print(f'Input pandas.DataFrame #1: {dataframe1}')
+
+        from azureml.core import Run
+
+        run = Run.get_context()
+
+        # Log the mean absolute error to the current run to see the metric in the module detail pane.
+        run.log(name='Mean_Absolute_Error', value=dataframe1['Mean_Absolute_Error'])
+
+        # Log the mean absolute error to the parent run to see the metric in the run details page.
+        run.parent.log(name='Mean_Absolute_Error', value=dataframe1['Mean_Absolute_Error'])
+
+        return dataframe1,
+    ```

 ## Manage a run

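A brief aside on the snippet added above: because the metric is logged to both the module run and its parent pipeline run, it can be read back from the parent run afterward. A rough sketch (not part of the commit), assuming a placeholder experiment name and that the latest run in the experiment is the designer pipeline run:

```python
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name='designer-experiment')  # placeholder name

# Assuming the most recent run is the designer pipeline (parent) run,
# the value logged with run.parent.log() shows up in its metrics.
latest_run = next(exp.get_runs())
print(latest_run.get_metrics().get('Mean_Absolute_Error'))
```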

media/how-to-track-experiments/designer-logging-pipeline.png (image, 16.7 KB)
Lines changed: 40 additions & 40 deletions
@@ -1,6 +1,6 @@
 ---
-title: Best practices for SQL on-demand (preview) in Azure Synapse Analytics
-description: Recommendations and best practices you should know as you work with SQL on-demand (preview).
+title: Best practices for SQL on-demand (preview)
+description: Recommendations and best practices you should know when you work with SQL on-demand (preview).
 services: synapse-analytics
 author: filippopovic
 manager: craigg
@@ -14,60 +14,60 @@ ms.reviewer: jrasnick

 # Best practices for SQL on-demand (preview) in Azure Synapse Analytics

-In this article, you'll find a collection of best practices for using SQL on-demand (preview). SQL on-demand is an additional resource within Azure Synapse Analytics.
+In this article, you'll find a collection of best practices for using SQL on-demand (preview). SQL on-demand is a resource in Azure Synapse Analytics.

 ## General considerations

-SQL on-demand allows you to query files in your Azure storage accounts. It doesn't have local storage or ingestion capabilities. As such, all files that the query targets are external to SQL on-demand. Everything related to reading files from storage might have an impact on query performance.
+SQL on-demand allows you to query files in your Azure storage accounts. It doesn't have local storage or ingestion capabilities. So all files that the query targets are external to SQL on-demand. Everything related to reading files from storage might have an impact on query performance.

-## Colocate Azure Storage account and SQL on-demand
+## Colocate your Azure storage account and SQL on-demand

 To minimize latency, colocate your Azure storage account and your SQL on-demand endpoint. Storage accounts and endpoints provisioned during workspace creation are located in the same region.

-For optimal performance, if you access other storage accounts with SQL on-demand, make sure they are in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote and endpoint's regions.
+For optimal performance, if you access other storage accounts with SQL on-demand, make sure they're in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote region and the endpoint's region.

 ## Azure Storage throttling

-Multiple applications and services may access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and SQL on-demand workload exceed the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.
+Multiple applications and services might access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and SQL on-demand workload exceed the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.

-Once throttling is detected, SQL on-demand has built-in handling of this scenario. SQL on-demand will make requests to storage at a slower pace until throttling is resolved.
+When throttling is detected, SQL on-demand has built-in handling to resolve it. SQL on-demand will make requests to storage at a slower pace until throttling is resolved.

 > [!TIP]
-> For optimal query execution, you shouldn't stress the storage account with other workloads during query execution.
+> For optimal query execution, don't stress the storage account with other workloads during query execution.

 ## Prepare files for querying

 If possible, you can prepare files for better performance:

-- Convert CSV and JSON to Parquet - Parquet is columnar format. Since it's compressed, its file sizes are smaller than CSV or JSON files with the same data. SQL on-demand will need less time and storage requests to read it.
+- Convert CSV and JSON to Parquet. Parquet is a columnar format. Because it's compressed, its file sizes are smaller than CSV or JSON files that contain the same data. SQL on-demand will need less time and fewer storage requests to read it.
 - If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.
-- Try keeping your CSV file size below 10 GB.
+- Try to keep your CSV file size below 10 GB.
 - It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
-- Partition your data by storing partitions to different folders or file names - check [use filename and filepath functions to target specific partitions](#use-fileinfo-and-filepath-functions-to-target-specific-partitions).
+- Partition your data by storing partitions to different folders or file names. See [Use filename and filepath functions to target specific partitions](#use-filename-and-filepath-functions-to-target-specific-partitions).

-## Push wildcards to lower levels in path
+## Push wildcards to lower levels in the path

-You can use wildcards in your path to [query multiple files and folders](develop-storage-files-overview.md#query-multiple-files-or-folders). SQL on-demand lists files in your storage account starting from first * using storage API and eliminates files that don't match specified path. Reducing initial list of files can improve performance if there are many files that match specified path up to first wildcard.
+You can use wildcards in your path to [query multiple files and folders](develop-storage-files-overview.md#query-multiple-files-or-folders). SQL on-demand lists files in your storage account, starting from the first * using storage API. It eliminates files that don't match the specified path. Reducing the initial list of files can improve performance if there are many files that match the specified path up to the first wildcard.

 ## Use appropriate data types

-The data types you use in your query impact performance. You can get better performance if you:
+The data types you use in your query affect performance. You can get better performance if you follow these guidelines:

 - Use the smallest data size that will accommodate the largest possible value.
-- If maximum character value length is 30 characters, use character data type of length 30.
-- If all character column values are of fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
-- If maximum integer column value is 500, use smallint as it is smallest data type that can accommodate this value. You can find integer data type ranges [here](https://docs.microsoft.com/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15).
-- If possible, use varchar and char instead of nvarchar and nchar.
-- Use integer-based data types if possible. Sort, join, and group by operations are performed faster on integers than on characters data.
-- If you're using schema inference, [check inferred data type](#check-inferred-data-types).
+- If the maximum character value length is 30 characters, use a character data type of length 30.
+- If all character column values are of fixed size, use **char** or **nchar**. Otherwise, use **varchar** or **nvarchar**.
+- If the maximum integer column value is 500, use **smallint** because it's the smallest data type that can accommodate this value. You can find integer data type ranges in [this article](https://docs.microsoft.com/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15).
+- If possible, use **varchar** and **char** instead of **nvarchar** and **nchar**.
+- Use integer-based data types if possible. SORT, JOIN, and GROUP BY operations complete faster on integers than on character data.
+- If you're using schema inference, [check inferred data types](#check-inferred-data-types).

 ## Check inferred data types

-[Schema inference](query-parquet-files.md#automatic-schema-inference) helps you quickly write queries and explore data without knowing file schema. This comfort comes at the expense of inferred data types being larger than they actually are. It happens when there isn't enough information in source files to make sure appropriate data type is used. For example, Parquet files don't contain metadata about maximum character column length and SQL on-demand infers it as varchar(8000).
+[Schema inference](query-parquet-files.md#automatic-schema-inference) helps you quickly write queries and explore data without knowing file schemas. The cost of this convenience is that inferred data types are larger than the actual data types. This happens when there isn't enough information in the source files to make sure the appropriate data type is used. For example, Parquet files don't contain metadata about maximum character column length. So SQL on-demand infers it as varchar(8000).

-You can check resulting data types of your query using [sp_describe_first_results_set](https://docs.microsoft.com/sql/relational-databases/system-stored-procedures/sp-describe-first-result-set-transact-sql?view=sql-server-ver15).
+You can use [sp_describe_first_results_set](https://docs.microsoft.com/sql/relational-databases/system-stored-procedures/sp-describe-first-result-set-transact-sql?view=sql-server-ver15) to check the resulting data types of your query.

-The following example shows how you can optimize inferred data types. Procedure is used to show inferred data types.
+The following example shows how you can optimize inferred data types. This procedure is used to show the inferred data types:
 ```sql
 EXEC sp_describe_first_result_set N'
 SELECT
@@ -79,15 +79,15 @@ EXEC sp_describe_first_result_set N'
 ) AS nyc';
 ```

-Here is the result set.
+Here's the result set:

 |is_hidden|column_ordinal|name|system_type_name|max_length|
 |----------------|---------------------|----------|--------------------|-------------------||
 |0|1|vendor_id|varchar(8000)|8000|
 |0|2|pickup_datetime|datetime2(7)|8|
 |0|3|passenger_count|int|4|

-Once we know inferred data types for query, we can specify appropriate data types:
+After you know the inferred data types for the query, you can specify appropriate data types:

 ```sql
 SELECT
@@ -98,44 +98,44 @@ FROM
 FORMAT='PARQUET'
 )
 WITH (
-vendor_id varchar(4), -- we used length of 4 instead of inferred 8000
+vendor_id varchar(4), -- we used length of 4 instead of the inferred 8000
 pickup_datetime datetime2,
 passenger_count int
 ) AS nyc;
 ```

-## Use fileinfo and filepath functions to target specific partitions
+## Use filename and filepath functions to target specific partitions

-Data is often organized in partitions. You can instruct SQL on-demand to query particular folders and files. This function will reduce the number of files and amount of data the query needs to read and process. An added bonus is that you'll achieve better performance.
+Data is often organized in partitions. You can instruct SQL on-demand to query particular folders and files. Doing so will reduce the number of files and the amount of data the query needs to read and process. An added bonus is that you'll achieve better performance.

-For more information, check [filename](develop-storage-files-overview.md#filename-function) and [filepath](develop-storage-files-overview.md#filepath-function) functions and examples on how to [query specific files](query-specific-files.md).
+For more information, read about the [filename](develop-storage-files-overview.md#filename-function) and [filepath](develop-storage-files-overview.md#filepath-function) functions and see the examples for [querying specific files](query-specific-files.md).

 > [!TIP]
-> Always cast result of filepath and fileinfo functions to appropriate data types. If you use character data types, make sure appropriate length is used.
+> Always cast the results of the filepath and filename functions to appropriate data types. If you use character data types, be sure to use the appropriate length.

 > [!NOTE]
-> Functions used for partition elimination, filepath and fileinfo, are not currently supported for external tables other than those created automatically for each table created in Apache Spark for Azure Synapse Analytics.
+> Functions used for partition elimination, filepath and filename, aren't currently supported for external tables, other than those created automatically for each table created in Apache Spark for Azure Synapse Analytics.

-If your stored data isn't partitioned, consider partitioning it so you can use these functions to optimize queries targeting those files. When [querying partitioned Apache Spark for Azure Synapse tables](develop-storage-files-spark-tables.md) from SQL on-demand, the query will automatically target only the files needed.
+If your stored data isn't partitioned, consider partitioning it. That way you can use these functions to optimize queries that target those files. When you [query partitioned Apache Spark for Azure Synapse tables](develop-storage-files-spark-tables.md) from SQL on-demand, the query will automatically target only the necessary files.

-## Use PARSER_VERSION 2.0 for querying CSV files
+## Use PARSER_VERSION 2.0 to query CSV files

-You can use performance optimized parser when querying CSV files. Check [PARSER_VERSION](develop-openrowset.md) for details.
+You can use a performance-optimized parser when you query CSV files. For details, see [PARSER_VERSION](develop-openrowset.md).

 ## Use CETAS to enhance query performance and joins

 [CETAS](develop-tables-cetas.md) is one of the most important features available in SQL on-demand. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.

-You can use CETAS to store frequently used parts of queries, like joined reference tables, to a new set of files. Next, you can join to this single external table instead of repeating common joins in multiple queries.
+You can use CETAS to store frequently used parts of queries, like joined reference tables, to a new set of files. You can then join to this single external table instead of repeating common joins in multiple queries.

 As CETAS generates Parquet files, statistics will be automatically created when the first query targets this external table, resulting in improved performance.

-## AAD pass-through performance
+## Azure AD Pass-through performance

-SQL on-demand allows you to access files in storage using AAD pass-through or SAS credential. You might experience slower performance with AAD pass-through comparing to SAS.
+SQL on-demand allows you to access files in storage by using Azure Active Directory (Azure AD) Pass-through or SAS credentials. You might experience slower performance with Azure AD Pass-through than you would with SAS.

-If you need better performance, try SAS credentials to access storage until AAD pass-through performance is improved.
+If you need better performance, try using SAS credentials to access storage until Azure AD Pass-through performance is improved.

 ## Next steps

-Review the [Troubleshooting](../sql-data-warehouse/sql-data-warehouse-troubleshoot.md?toc=/azure/synapse-analytics/toc.json&bc=/azure/synapse-analytics/breadcrumb/toc.json) article for common issues and solutions. If you're working with SQL pool rather than SQL on-demand, see the [Best Practices for SQL pool](best-practices-sql-pool.md) article for specific guidance.
+Review the [troubleshooting](../sql-data-warehouse/sql-data-warehouse-troubleshoot.md?toc=/azure/synapse-analytics/toc.json&bc=/azure/synapse-analytics/breadcrumb/toc.json) article for solutions to common problems. If you're working with SQL pools rather than SQL on-demand, see [Best practices for SQL pools](best-practices-sql-pool.md) for specific guidance.
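
To tie several of the recommendations in this diff together, here's a rough sketch of querying SQL on-demand from Python with pyodbc (not part of the commit). It combines a wildcard pushed to a lower level of the path, an explicit WITH clause with small data types, PARSER_VERSION 2.0 for CSV, and filepath() to target a single partition. The endpoint, credentials, storage path, and column names are placeholders.

```python
import pyodbc

# Placeholder connection details; replace with your own SQL on-demand endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<user>;PWD=<password>"
)

# filepath(1) returns the value matched by the first wildcard in the path,
# so the WHERE clause prunes every folder except year=2019.
query = """
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/data/year=*/month=*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        vendor_id varchar(4),        -- smallest length that fits the data
        passenger_count smallint     -- smallest integer type that fits the data
    ) AS src
WHERE src.filepath(1) = '2019';
"""

cursor = conn.cursor()
cursor.execute(query)
print(cursor.fetchone()[0])
```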
