articles/ai-foundry/openai/how-to/evaluations.md
10 additions & 58 deletions
@@ -9,6 +9,7 @@ ms.custom: references_regions
 ms.date: 07/31/2025
 author: mrbullwinkle
 ms.author: mbullwin
+zone_pivot_groups: openai-fine-tuning
 recommendations: false
 ---
@@ -135,72 +136,23 @@ When you click into each testing criteria, you will see different types of grade
 ## Getting started

-1. Select **Azure OpenAI Evaluation (PREVIEW)** within the [Azure AI Foundry portal](https://ai.azure.com/?cid=learnDocs). To see this view as an option, you may need to first select an existing Azure OpenAI resource in a supported region.
-2. Select **+ New evaluation**.
+::: zone pivot="programming-language-studio"
-:::image type="content" source="../media/how-to/evaluations/new-evaluation.png" alt-text="Screenshot of the Azure OpenAI evaluation UX with new evaluation selected." lightbox="../media/how-to/evaluations/new-evaluation.png":::
+[!INCLUDE [Azure AI Foundry portal evaluation](../includes/evaluation-foundry.md)]
-3. Choose how you would like to provide test data for evaluation. You can import stored Chat Completions, create data using the provided default templates, or upload your own data. Let's walk through uploading your own data.
+::: zone-end
-:::image type="content" source="../media/how-to/evaluations/create-new-eval.png" alt-text="Screenshot of the Azure OpenAI create new evaluation dialog." lightbox="../media/how-to/evaluations/create-new-eval.png":::
+::: zone pivot="programming-language-python"
-4. Select your evaluation data, which should be in `.jsonl` format. If you already have existing data, you can select it, or you can upload new data.
-:::image type="content" source="../media/how-to/evaluations/upload-data-1.png" alt-text="Screenshot of data upload options." lightbox="../media/how-to/evaluations/upload-data-1.png":::
+::: zone-end
-When you upload new data, you'll see the first three lines of the file as a preview on the right side:
+::: zone pivot="rest-api"
-:::image type="content" source="../media/how-to/evaluations/upload-data-2.png" alt-text="Screenshot of data upload with example selection." lightbox="../media/how-to/evaluations/upload-data-2.png":::
+[!INCLUDE [REST API evaluation](../includes/evaluation-rest.md)]
-If you need a sample test file, you can use this sample `.jsonl` text. The sample contains sentences about various technical topics, and we'll assess semantic similarity across these sentences. (A quick validation sketch for this file format appears after this walkthrough.)
-
-```jsonl
-{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}
-{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}
-{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}
-{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}
-{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}
-{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}
-{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}
-{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}
-{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}
-{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}
-```
-
-5. If you would like to create new responses using inputs from your test data, select **Generate new responses**. This injects the input fields from the evaluation file into individual prompts for a model of your choice to generate output.
-
-:::image type="content" source="../media/how-to/evaluations/eval-generate-1.png" alt-text="Screenshot of the UX showing selected import test data." lightbox="../media/how-to/evaluations/eval-generate-1.png":::
-
-Select the model of your choice. If you don't have a model deployed, you can create a new model deployment. The selected model takes the input data and generates its own unique outputs, which in this case are stored in a variable called `{{sample.output_text}}`. We'll then use that output later as part of our testing criteria. Alternatively, you can provide your own custom system message and individual message examples manually. (A generation sketch appears after this walkthrough.)
-
-:::image type="content" source="../media/how-to/evaluations/eval-generate-2.png" alt-text="Screenshot of the UX for generating model responses." lightbox="../media/how-to/evaluations/eval-generate-2.png":::
-
-6. To create a testing criterion, select **Add**. For the example file we provided, we're going to assess semantic similarity. Select **Model Scorer**, which contains test criteria presets for Semantic Similarity.
-
-:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-1.png" alt-text="Screenshot of the semantic similarity UX config highlighting Model scorer." lightbox="../media/how-to/evaluations/eval-semantic-similarity-1.png":::
-
-Select **Semantic Similarity** at the top. Scroll to the bottom, and in the `User` section, specify `{{item.output}}` as `Ground truth` and `{{sample.output_text}}` as `Output`. This takes the original reference output from your evaluation `.jsonl` file (the sample file provided) and compares it against the output generated by the model you chose in the previous step. (An illustrative grader configuration appears after this walkthrough.)
-
-:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-2.png" alt-text="Screenshot of the semantic similarity UX config with generated output." lightbox="../media/how-to/evaluations/eval-semantic-similarity-2.png":::
-
-:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-3.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-3.png":::
-
-7. Select **Add** to add this testing criterion. If you would like to add additional testing criteria, you can add them at this step.
-
-8. You're ready to create your evaluation. Provide a name for your evaluation, review that everything looks correct, and select **Submit** to create the evaluation job. You'll be taken to a status page for your evaluation job, which shows the status as "Waiting". (A status-polling sketch appears after this walkthrough.)
-
-:::image type="content" source="../media/how-to/evaluations/eval-submit-job.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job.png":::
-:::image type="content" source="../media/how-to/evaluations/eval-submit-job-2.png" alt-text="Screenshot of the evaluation job submit UX, with a status of waiting." lightbox="../media/how-to/evaluations/eval-submit-job-2.png":::
-
-9. Once your evaluation job has been created, you can select the job to view its full details:
-
-:::image type="content" source="../media/how-to/evaluations/test-complete.png" alt-text="Screenshot of a completed semantic similarity test with a mix of passes and failures." lightbox="../media/how-to/evaluations/test-complete.png":::
-
-10. For semantic similarity, **View output details** contains a JSON representation of your passing tests that you can copy and paste. (A results-retrieval sketch appears after this walkthrough.)
-
-:::image type="content" source="../media/how-to/evaluations/output-details.png" alt-text="Screenshot of the evaluation status UX with output details." lightbox="../media/how-to/evaluations/output-details.png":::
-
-11. You can also add more eval runs by selecting the **+ Add Run** button at the top-left corner of your evaluation job page.
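
The removed walkthrough above references a line-delimited test file. As a quick sanity check before uploading, here is a minimal validation sketch (plain Python, standard library only; the filename `eval-data.jsonl` is illustrative, not from the article) that confirms each line is a standalone JSON object with the `input` and `output` fields this walkthrough's criteria reference:

```python
import json

def validate_eval_file(path: str) -> None:
    """Check that every line is valid JSON with the fields the graders use."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            assert "input" in record and "output" in record, (
                f"line {line_number}: missing 'input' or 'output' field"
            )
    print("File looks like valid JSONL for this walkthrough's schema.")

validate_eval_file("eval-data.jsonl")
```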
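Step 5's **Generate new responses** replays each item's `input` messages against a model you choose and stores the result in `{{sample.output_text}}`. A rough outside-the-portal equivalent, sketched with the OpenAI Python SDK's Azure client (the endpoint, API key, API version, and deployment name `gpt-4o-mini` are placeholders, not values from this article):

```python
import json
from openai import AzureOpenAI  # assumes a recent openai-python release

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-10-21",                                 # placeholder
)

# Replay each test item's "input" messages against a deployed model and keep
# the generated text, conceptually what the portal stores as
# {{sample.output_text}} for the testing criteria to consume.
with open("eval-data.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # your deployment name; placeholder
            messages=item["input"],
        )
        print(response.choices[0].message.content[:80], "...")
```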
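In steps 6 and 7, the portal builds a model-based grader from the **Semantic Similarity** preset. The dictionary below is only an illustrative sketch of what such a score-model criterion might look like; the field names, range, and threshold are assumptions modeled on the Evals API's grader conventions, so verify them against the current API reference before relying on them:

```python
# Illustrative model-scorer criterion: a grading model compares the reference
# answer ({{item.output}}) with the generated answer ({{sample.output_text}}).
semantic_similarity_grader = {
    "type": "score_model",
    "name": "semantic_similarity",
    "model": "gpt-4o-mini",  # grading model deployment; placeholder
    "input": [
        {
            "role": "system",
            "content": "Rate from 0 to 1 how semantically similar Output is to Ground truth.",
        },
        {
            "role": "user",
            "content": "Ground truth: {{item.output}}\nOutput: {{sample.output_text}}",
        },
    ],
    "range": [0, 1],         # assumed scoring range
    "pass_threshold": 0.8,   # placeholder threshold
}
```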
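Step 8 leaves the job in a "Waiting" state. If you drive evaluations from code rather than the portal, a polling loop along these lines is a reasonable sketch, assuming your SDK version exposes the evals surface (`client.evals.runs.retrieve`); the `queued`/`in_progress` status strings are assumptions, not values quoted from this article:

```python
import time

def wait_for_run(client, eval_id: str, run_id: str, interval_s: float = 15.0):
    """Poll an evaluation run until it reaches a terminal status."""
    while True:
        run = client.evals.runs.retrieve(run_id, eval_id=eval_id)
        print("status:", run.status)
        if run.status not in ("queued", "in_progress"):  # assumed status values
            return run
        time.sleep(interval_s)
```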
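The JSON shown by step 10's **View output details** can also be fetched programmatically. A sketch under the same assumption about the `client.evals` surface; each output item should carry the item's inputs, the sampled output, and per-criterion results:

```python
# List per-item results for a finished run; eval_id and run come from the
# earlier create/retrieve calls in the polling sketch above.
for item in client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_id):
    print(item.status, item.results)
```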