diff --git a/.github/.markdownlint.json b/.github/.markdownlint.json new file mode 100644 index 0000000..f3d69ad --- /dev/null +++ b/.github/.markdownlint.json @@ -0,0 +1,11 @@ +{ + "default": true, + "MD005": false, + "MD013": false, + "MD028": false, + "MD029": false, + "MD033": false, + "MD048": false, + "MD040": false, + "MD041": false +} diff --git a/.github/workflows/validate_and_fix_markdown.yml b/.github/workflows/validate_and_fix_markdown.yml new file mode 100644 index 0000000..9ed1182 --- /dev/null +++ b/.github/workflows/validate_and_fix_markdown.yml @@ -0,0 +1,44 @@ +name: Validate and Fix Markdown + +on: + pull_request: + branches: + - main + push: + branches: + - main + +permissions: + contents: write + +jobs: + validate-and-fix-markdown: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Set up Node.js + uses: actions/setup-node@v3 + with: + node-version: '16' + + - name: Install Markdown Linter + run: npm install -g markdownlint-cli + + - name: Lint and Fix Markdown files + run: markdownlint '**/*.md' --fix --config .github/.markdownlint.json + + - name: Configure Git + run: | + git config --global user.email "github-actions[bot]@users.noreply.github.com" + git config --global user.name "github-actions[bot]" + + - name: Commit changes + run: | + git add -A + git commit -m "Fix Markdown syntax issues" || echo "No changes to commit" + git push origin HEAD:${{ github.event.pull_request.head.ref }} diff --git a/.github/workflows/validate_and_fix_notebook.yml b/.github/workflows/validate_and_fix_notebook.yml new file mode 100644 index 0000000..c0e782d --- /dev/null +++ b/.github/workflows/validate_and_fix_notebook.yml @@ -0,0 +1,57 @@ +name: Validate and Fix Notebook + +on: + pull_request: + branches: + - main + push: + branches: + - main + +permissions: + contents: write + +jobs: + validate-and-fix-notebook: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.x' + + - name: Install Jupyter and nbformat + run: | + pip install jupyter nbformat + + - name: Validate and Fix Notebook + run: | + python -c " + import nbformat + import glob + for file in glob.glob('**/*.ypyb', recursive=True): + with open(file, 'r') as f: + nb = nbformat.read(f, as_version=4) + nbformat.validate(nb) + if 'application/vnd.beylor-adapt+notebook' not in nb.metadata: + nb.metadata['application/vnd.beylor-adapt+notebook'] = {'version': '1.0'} + with open(file, 'w') as f: + nbformat.write(nb, f) + " + + - name: Configure Git + run: | + git config --global user.email "github-actions[bot]@users.noreply.github.com" + git config --global user.name "github-actions[bot]" + + - name: Commit changes + run: | + git add -A + git commit -m "Fix notebook format issues" || echo "No changes to commit" + git push origin HEAD:${{ github.event.pull_request.head.ref }} diff --git a/Deployment-Pipelines/README.md b/Deployment-Pipelines/README.md index 99207aa..c718826 100644 --- a/Deployment-Pipelines/README.md +++ b/Deployment-Pipelines/README.md @@ -5,12 +5,11 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ------------------------------------------ -> Lakehouse Schema and Deployment Pipelines - +> Lakehouse Schema and Deployment 
Pipelines
List of References (Click to expand) @@ -33,15 +32,14 @@ Last updated: 2025-04-15 - [Overview](#overview) - [Demo](#demo) - - [Create a Workspace](#create-a-workspace) - - [Create a Lakehouse](#create-a-lakehouse) - - [Create a New Semantic Model](#create-a-new-semantic-model) - - [Auto-Generate Report with Copilot](#auto-generate-report-with-copilot) - - [Create a Deployment Pipeline](#create-a-deployment-pipeline) - - [Deploy to Production](#deploy-to-production) + - [Create a Workspace](#create-a-workspace) + - [Create a Lakehouse](#create-a-lakehouse) + - [Create a New Semantic Model](#create-a-new-semantic-model) + - [Auto-Generate Report with Copilot](#auto-generate-report-with-copilot) + - [Create a Deployment Pipeline](#create-a-deployment-pipeline) + - [Deploy to Production](#deploy-to-production) - [How to refresh the data](#how-to-refresh-the-data) -
## Overview @@ -61,12 +59,12 @@ Process Overview: > `Specifics for Lakehouse:` For lakehouses, the deployment process typically `includes the structure and metadata but not the actual data tables`. This is why you might see the structure and semantic models deployed, but the tables themselves need to be manually refreshed or reloaded in the target environment.

> `Deployment Rules:` You can set deployment rules to manage different stages and change content settings during deployment. For example, you can specify default lakehouses for notebooks to avoid manual changes post-deployment. -## Demo +## Demo
Centered Image
- + ### Create a Workspace 1. Navigate to the Microsoft Fabric portal. @@ -122,11 +120,10 @@ Process Overview: image -4. At this point, you should see something similar like following:  +4. At this point, you should see something similar like following: image - ### Auto-Generate Report with Copilot > [!NOTE] @@ -193,6 +190,7 @@ Process Overview: | **Incremental Refresh** | Refreshes only the data that has changed since the last refresh, improving efficiency. Click [here to understand more about incremental refresh](../Workloads-Specific/PowerBi/IncrementalRefresh.md)| - **Evaluate Changes**: Checks for changes in the data source based on a DateTime column.
- **Retrieve Data**: Only changed data is retrieved and loaded.
- **Replace Data**: Updated data is processed and replaced. | Steps to Set Up Incremental Refresh: + 1. **Create or Open a Dataflow**: Start by creating a new Dataflow Gen2 or opening an existing one. 2. **Configure the Query**: Ensure your query includes a DateTime column that can be used to filter the data. 3. **Enable Incremental Refresh**: Right-click the query and select Incremental Refresh. Configure the settings, such as the DateTime column and the time range for data extraction. diff --git a/Deployment-Pipelines/samples/data/readme.md b/Deployment-Pipelines/samples/data/readme.md index e97058d..041ab01 100644 --- a/Deployment-Pipelines/samples/data/readme.md +++ b/Deployment-Pipelines/samples/data/readme.md @@ -1,3 +1,2 @@ - -Last updated: 2025-04-15 +Last updated: 2025-04-21 diff --git a/GitHub-Integration.md b/GitHub-Integration.md index 795704b..20d9303 100644 --- a/GitHub-Integration.md +++ b/GitHub-Integration.md @@ -1,12 +1,12 @@ -# Integrating GitHub with Microsoft Fabric - Overview +# Integrating GitHub with Microsoft Fabric - Overview Costa Rica -[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com) +[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com) [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ---------- @@ -25,8 +25,6 @@ Last updated: 2025-04-15
Table of Content (Click to expand) -- [Wiki](#wiki) -- [Content](#content) - [Connect a workspace to a Git repo](#connect-a-workspace-to-a-git-repo) - [Connecting to a workspace Already linked to GitHub](#connecting-to-a-workspace-already-linked-to-github) - [Commit changes to git](#commit-changes-to-git) @@ -36,7 +34,7 @@ Last updated: 2025-04-15
-https://github.com/user-attachments/assets/64f099a1-b749-47a6-b723-fa1cb5c575a3 + ## Connect a workspace to a Git repo diff --git a/Monitoring-Observability/FabricActivatorRulePipeline/README.md b/Monitoring-Observability/FabricActivatorRulePipeline/README.md index 01a5c2a..99d14b7 100644 --- a/Monitoring-Observability/FabricActivatorRulePipeline/README.md +++ b/Monitoring-Observability/FabricActivatorRulePipeline/README.md @@ -5,11 +5,12 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ---------- > This process shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
+> > 1. **First Pipeline**: The process starts with a pipeline that ends with a `Copy Data` activity. This activity uploads data into the `Lakehouse`.
> 2. **Event Stream Setup**: An `Event Stream` is configured in Activator to monitor the Lakehouse for file creation or data upload events.
> 3. **Triggering the Second Pipeline**: Once the event is detected (e.g., a file is uploaded), the Event Stream triggers the second pipeline to continue the workflow. @@ -25,19 +26,19 @@ Last updated: 2025-04-15
List of Content (Click to expand) - - [Set Up the First Pipeline](#set-up-the-first-pipeline) - - [Configure Activator to Detect the Event](#configure-activator-to-detect-the-event) - - [Set Up the Second Pipeline](#set-up-the-second-pipeline) - - [Define the Rule in Activator](#define-the-rule-in-activator) - - [Test the Entire Workflow](#test-the-entire-workflow) - - [Troubleshooting If Needed](#troubleshooting-if-needed) +- [Set Up the First Pipeline](#set-up-the-first-pipeline) +- [Configure Activator to Detect the Event](#configure-activator-to-detect-the-event) +- [Set Up the Second Pipeline](#set-up-the-second-pipeline) +- [Define the Rule in Activator](#define-the-rule-in-activator) +- [Test the Entire Workflow](#test-the-entire-workflow) +- [Troubleshooting If Needed](#troubleshooting-if-needed)
> [!NOTE] > This code generates random data with fields such as id, name, age, email, and created_at, organizes it into a PySpark DataFrame, and saves it to a specified Lakehouse path using the Delta format. Click here to see the [example script](./GeneratesRandomData.ipynb) -https://github.com/user-attachments/assets/95206bf3-83a7-42c1-b501-4879df22ef7d + ## Set Up the First Pipeline @@ -50,14 +51,14 @@ https://github.com/user-attachments/assets/95206bf3-83a7-42c1-b501-4879df22ef7d - Ensure the file name and path are consistent and predictable (e.g., `trigger_file.json` in a specific folder). 3. **Publish and Test**: Publish the pipeline and test it to ensure the trigger file is created successfully. - https://github.com/user-attachments/assets/798a3b12-c944-459d-9e77-0112b5d82831 + ## Configure Activator to Detect the Event > [!TIP] > Event options: -https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d + 1. **Set Up an Event**: - Create a new event to monitor the location where the trigger file is created (e.g., ADLS or OneLake). Click on `Real-Time`: @@ -71,18 +72,18 @@ https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d image - Add a source: - + image image - https://github.com/user-attachments/assets/43a9654b-e8d0-44da-80b9-9f528483fa3b + 2. **Test Event Detection**: - Save the event and test it by manually running the first pipeline to ensure Activator detects the file creation. - Check the **Event Details** screen in Activator to confirm the event is logged. - https://github.com/user-attachments/assets/6b21194c-54b4-49de-9294-1bf78b1e5acd + ## Set Up the Second Pipeline @@ -91,13 +92,13 @@ https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d - Ensure it is configured to accept external triggers. 2. **Publish the Pipeline**: Publish the second pipeline and ensure it is ready to be triggered. - https://github.com/user-attachments/assets/5b630579-a0ec-4d5b-b973-d9b4fdd8254c + ## Define the Rule in Activator 1. **Setup the Activator**: - https://github.com/user-attachments/assets/7c88e080-d5aa-4920-acd6-94c2e4ae0568 + 2. **Create a New Rule**: - In `Activator`, create a rule that responds to the event you just configured. @@ -109,7 +110,7 @@ https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d - Save the rule and activate it. - Ensure the rule is enabled and ready to respond to the event. - https://github.com/user-attachments/assets/5f139eeb-bab0-4d43-9f22-bbe44503ed75 + ## Test the Entire Workflow @@ -117,9 +118,10 @@ https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d 2. **Monitor Activator**: Check the `Event Details` and `Rule Activation Details` in Activator to ensure the event is detected and the rule is activated. 3. **Verify the Second Pipeline**: Confirm that the second pipeline is triggered and runs successfully. - https://github.com/user-attachments/assets/0a1dab70-2317-4636-b0be-aa0cb301b496 + ## Troubleshooting (If Needed) + - If the second pipeline does not trigger: 1. Double-check the rule configuration in Activator. 2. Review the logs in Activator for any errors or warnings. @@ -128,4 +130,3 @@ https://github.com/user-attachments/assets/282fae9b-e1c6-490d-bd23-9ed9bdf6105d

Total Visitors

Visitor Count - diff --git a/Monitoring-Observability/MonitorUsage.md b/Monitoring-Observability/MonitorUsage.md index 03d0683..dcf11a6 100644 --- a/Monitoring-Observability/MonitorUsage.md +++ b/Monitoring-Observability/MonitorUsage.md @@ -6,7 +6,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ---------- @@ -27,22 +27,22 @@ Last updated: 2025-04-15 -## Content +## Content - [Microsoft Fabric Capacity Metrics app](#microsoft-fabric-capacity-metrics-app) - - [Installation Steps](#installation-steps) - - [Configuration Steps](#configuration-steps) - - [Troubleshooting](#troubleshooting) + - [Installation Steps](#installation-steps) + - [Configuration Steps](#configuration-steps) + - [Troubleshooting](#troubleshooting) - [Admin monitoring](#admin-monitoring) - - [Configure the Admin Monitoring Workspace](#configure-the-admin-monitoring-workspace) - - [How to Use Data from Admin Monitoring Workspace Custom Reports](#how-to-use-data-from-admin-monitoring-workspace-custom-reports) + - [Configure the Admin Monitoring Workspace](#configure-the-admin-monitoring-workspace) + - [How to Use Data from Admin Monitoring Workspace Custom Reports](#how-to-use-data-from-admin-monitoring-workspace-custom-reports) - [Monitor Hub](#monitor-hub) - - [How to Access and Use the Monitor Hub](#how-to-access-and-use-the-monitor-hub) - - [Extending Activity History](#extending-activity-history) + - [How to Access and Use the Monitor Hub](#how-to-access-and-use-the-monitor-hub) + - [Extending Activity History](#extending-activity-history) -## Microsoft Fabric Capacity Metrics app +## Microsoft Fabric Capacity Metrics app -> The `Microsoft Fabric Capacity Metrics app` is designed to provide comprehensive monitoring capabilities for Microsoft Fabric capacities. It helps administrators track capacity consumption, identify performance bottlenecks, and make informed decisions about scaling and resource allocation. The app provides detailed insights into capacity utilization, throttling, and system events, enabling proactive management of resources to ensure optimal performance.

+> The `Microsoft Fabric Capacity Metrics app` is designed to provide comprehensive monitoring capabilities for Microsoft Fabric capacities. It helps administrators track capacity consumption, identify performance bottlenecks, and make informed decisions about scaling and resource allocation. The app provides detailed insights into capacity utilization, throttling, and system events, enabling proactive management of resources to ensure optimal performance.

> This app is essential for maintaining the health and efficiency of your Microsoft Fabric capacities | **Feature** | **Description** | @@ -58,7 +58,7 @@ Last updated: 2025-04-15 - Navigate to [Microsoft Fabric](https://app.fabric.microsoft.com/). In the left panel, locate the `Apps` icon and click on `Get apps`. image - + - Search for `Microsoft Fabric Capacity Metrics`: image @@ -72,6 +72,7 @@ Last updated: 2025-04-15 image ### Configuration Steps + 1. **Run the App for the First Time**: - In Microsoft Fabric, go to **Apps** and select the Microsoft Fabric Capacity Metrics app. - When prompted with `You have to connect to your own data to view this report`, select **Connect**. @@ -85,7 +86,7 @@ Last updated: 2025-04-15 - Go to the Power BI service and sign in with your admin account. - Click on the `Settings` gear icon in the top right corner. - Select `Admin Portal` from the dropdown menu. - + image 2. Access Capacity Settings: @@ -101,7 +102,7 @@ Last updated: 2025-04-15 image - - **UTC_offset**: Enter your organization's standard time in UTC (e.g., for Central Standard Time, enter `-6`). + - **UTC_offset**: Enter your organization's standard time in UTC (e.g., for Central Standard Time, enter `-6`). image @@ -128,7 +129,7 @@ Last updated: 2025-04-15 - If the app doesn't show data or can't refresh, try deleting the old app and reinstalling the latest version. - Update the semantic model credentials if needed. -## Admin monitoring +## Admin monitoring > `Admin monitoring workspace` in Microsoft Fabric is a powerful tool for administrators to track and analyze usage metrics across their organization. This workspace provides detailed insights into how different features and services are being utilized, helping admins make informed decisions to optimize performance and resource allocation. @@ -148,11 +149,11 @@ Benefits of Using Admin Monitoring Workspace: 3. **Optimize Resources**: Make data-driven decisions about scaling and resource allocation to ensure optimal performance. 4. **Ensure Compliance**: Use the Purview Hub to monitor data governance and compliance, ensuring that your organization adheres to relevant regulations and standards. - ### Configure the Admin Monitoring Workspace > [!IMPORTANT] -> - **Permissions**: `Only users with direct admin roles can set up the Admin Monitoring workspace`. If the admin role `is assigned through a group, data refreshes may fail`.
+> +> - **Permissions**: `Only users with direct admin roles can set up the Admin Monitoring workspace`. If the admin role `is assigned through a group, data refreshes may fail`.
> - **Read-Only Workspace**: The `Admin Monitoring workspace is read-only`. Users, including admins, cannot edit or view properties of items such as semantic models and reports within the workspace. `Admins can share reports and semantic models within the workspace with other users by assigning them a workspace viewer role or providing direct access links.` > - **Reinitializing the Workspace**: If needed, `you can reinitialize the workspace by executing an API call to delete the semantic model and then reinstalling the workspace`. @@ -202,7 +203,7 @@ Benefits of Using Admin Monitoring Workspace: image 2. **Create Custom Reports**: You can utilize copilot capabilities to automatically create your report and edit it. Request additional pages with your content or even ask questions about your data. - + image image @@ -213,7 +214,7 @@ Benefits of Using Admin Monitoring Workspace: | Semantic model access | Workspace access | | --- | --- | -| image | image | +| image | image | ## Monitor Hub @@ -239,16 +240,15 @@ Benefits of Using Admin Monitoring Workspace: > For example: -https://github.com/user-attachments/assets/0f7fecfb-0b04-422b-abca-fcbe8827e2a2 + 3. **Search and Filter**: - Use the keyword search box to find specific activities. - Apply filters to narrow down the results based on status, time period, item type, owner, and workspace location. - - | Column Options | Filter Options | - | --- | --- | - | image | image | + | Column Options | Filter Options | + | --- | --- | + | image | image | 5. **Take Actions**: If you have the necessary permissions, you can perform actions on activities by selecting the More options (...) next to the activity name. @@ -262,11 +262,12 @@ https://github.com/user-attachments/assets/0f7fecfb-0b04-422b-abca-fcbe8827e2a2 ### Extending Activity History -> To extend your activity tracking beyond 30 day, you can use `Microsoft Purview`:
+> To extend your activity tracking beyond 30 days, you can use `Microsoft Purview`:
+> > - Provides extended audit log retention up to 1 year with appropriate licensing.
> - Use the Purview portal to view and export detailed activity logs.
> - Utilize the Purview REST API to access scan history beyond 30 days. - + Steps to Access Microsoft Purview via Audit Logs: 1. **Navigate to the Admin Portal**: diff --git a/Monitoring-Observability/StepsCapacityAlert.md b/Monitoring-Observability/StepsCapacityAlert.md index 146d1a1..7225135 100644 --- a/Monitoring-Observability/StepsCapacityAlert.md +++ b/Monitoring-Observability/StepsCapacityAlert.md @@ -1,4 +1,4 @@ -# Steps to Configure Capacity Alerts - Overview +# Steps to Configure Capacity Alerts - Overview Costa Rica @@ -6,7 +6,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ---------- @@ -18,7 +18,6 @@ Last updated: 2025-04-15 | **Threshold** | 80% | | **Recipients** | Capacity admins, Specific contacts | - 1. Go to the [Microsoft Fabric service](https://app.fabric.microsoft.com/) and sign in with your admin credentials. 2. **Access the Admin Portal**: - Click on the `Settings` gear icon in the top right corner. diff --git a/README.md b/README.md index cc0dec3..e601940 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ------------------------------------------ @@ -13,7 +13,6 @@ Last updated: 2025-04-15 > This repository contains demos and guides for building a well-architected framework for a Microsoft Fabric enterprise-level data platform. These demos are intended as a guide. > `For official guidance, support, or more detailed information, please refer to Microsoft's official documentation or contact Microsoft directly`: [Microsoft Sales and Support](https://support.microsoft.com/contactus?ContactUsExperienceEntryPointAssetId=S.HP.SMC-HOME). For more detailed and official training, please visit the [Microsoft official training site](https://learn.microsoft.com/en-us/training/). -
List of References (Click to expand) @@ -23,7 +22,6 @@ Last updated: 2025-04-15
-
Table of Content (Click to expand) @@ -41,8 +39,8 @@ Last updated: 2025-04-15 - An `Azure subscription is required`. All other resources, including instructions for creating a Resource Group, are provided in this workshop. - `Contributor role assigned or any custom role that allows`: access to manage all resources, and the ability to deploy resources within subscription. - If you choose to use a Terraform approach, please ensure that: - - [Terraform is installed on your local machine](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli#install-terraform). - - [Install the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) to work with both Terraform and Azure commands. + - [Terraform is installed on your local machine](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli#install-terraform). + - [Install the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) to work with both Terraform and Azure commands. ## Infrastructure as Code (IaC) @@ -51,29 +49,28 @@ Last updated: 2025-04-15
1. Consistency and Reproducibility - - **Consistent Environments**: IaC ensures that your development, testing, and production environments are consistent. `This reduces the it works on my machine problem` and ensures that applications run reliably across different environments. - - **Reproducibility**: With IaC, you can `recreate your infrastructure from scratch in a consistent manner.` This is particularly useful for `disaster recovery and scaling`. - +- **Consistent Environments**: IaC ensures that your development, testing, and production environments are consistent. `This reduces the it works on my machine problem` and ensures that applications run reliably across different environments. +- **Reproducibility**: With IaC, you can `recreate your infrastructure from scratch in a consistent manner.` This is particularly useful for `disaster recovery and scaling`. +
2. Version Control - - **Source Control**: By storing IaC configurations in version control systems like GitHub, you `can track changes, collaborate with team members, and roll back to previous versions if needed.` - - **Change Management**: Version control `provides a history of changes, making it easier to understand what changes were made, who made them, and why.` - +- **Source Control**: By storing IaC configurations in version control systems like GitHub, you `can track changes, collaborate with team members, and roll back to previous versions if needed.` +- **Change Management**: Version control `provides a history of changes, making it easier to understand what changes were made, who made them, and why.` +
3. Flexibility and IaC tools Options - > Microsoft provides several IaC tools, including Terraform, Bicep, and ARM templates. Each tool offers different features and benefits, allowing you to choose the one that best fits your needs. - - **Terraform**: A popular IaC tool that uses a high-level configuration language to define and provision infrastructure. It `supports multiple cloud providers, making it a versatile choice.` - - **Bicep**: A domain-specific language that uses declarative syntax to deploy Azure resources. It offers a `concise and easy-to-read alternative to JSON-based ARM templates.` - - **ARM Templates**: JSON files that` define the infrastructure and configuration for your Azure solution.` They provide a detailed and flexible way to manage Azure resources. - +- **Terraform**: A popular IaC tool that uses a high-level configuration language to define and provision infrastructure. It `supports multiple cloud providers, making it a versatile choice.` +- **Bicep**: A domain-specific language that uses declarative syntax to deploy Azure resources. It offers a `concise and easy-to-read alternative to JSON-based ARM templates.` +- **ARM Templates**: JSON files that`define the infrastructure and configuration for your Azure solution.` They provide a detailed and flexible way to manage Azure resources. +
@@ -89,7 +86,7 @@ Last updated: 2025-04-15 - **Dynamic Scaling**: IaC enables `dynamic scaling of resources based on demand.` This ensures that your infrastructure can handle varying workloads efficiently. - **Resource Optimization**: By automating the `provisioning and de-provisioning of resources,` IaC helps optimize resource usage and reduce costs. - +
@@ -100,9 +97,8 @@ Last updated: 2025-04-15
- > [!TIP] -> Just in case, find here some [additional Terraform templates for different Azure resources across different areas](https://github.com/MicrosoftCloudEssentials-LearningHub/AzureTerraformTemplates-v0.0.0). +> Just in case, find here some [additional Terraform templates for different Azure resources across different areas](https://github.com/MicrosoftCloudEssentials-LearningHub/AzureTerraformTemplates-v0.0.0). > E.g [Demonstration: Deploying Azure Resources for a Data Platform](./Terraform) @@ -115,7 +111,7 @@ Last updated: 2025-04-15 - **Fabric Workspace Integration**: Integrate your Fabric workspace with [GitHub](./GitHub-Integration.md) or Azure DevOps to manage code related to data objects and workflows. - **Continuous Integration/Continuous Deployment (CI/CD)**: Implement CI/CD pipelines to [automate the deployment](./Deployment-Pipelines/) of changes to your data platform. -## Security +## Security > Implementing robust security measures ensures that sensitive data is protected, access is controlled, and compliance requirements are met. @@ -125,10 +121,11 @@ Last updated: 2025-04-15 | **Data Protection & Encryption** | - **Data Masking:** Hide sensitive information from unauthorized users.
- **Audit Logs:** Keep detailed records to monitor user activities and detect anomalies.
- **Encryption at Rest:** Use Azure Storage Service Encryption and Transparent Data Encryption (TDE) to protect stored data.
- **Encryption in Transit:** Secure communications with TLS/SSL protocols and VPNs. | | **Networking & Granular Controls** | - **Granular Security Controls:** Implement layered security measures to comprehensively protect sensitive data.
- **Networking:** Leverage Fabric’s unified platform to simplify secure network configurations. For more details, see [Networking](#networking) | -## Networking +## Networking > Networking is a critical component of any enterprise-level data platform. In Microsoft Fabric, networking configurations are simplified and secured through its `unified platform.`: -> - **Simplified Configuration**: Microsoft Fabric provides a unified platform that integrates different networking components, making it easier to configure and manage network settings. This unified approach reduces complexity and ensures that all networking elements work seamlessly together.
+> +> - **Simplified Configuration**: Microsoft Fabric provides a unified platform that integrates different networking components, making it easier to configure and manage network settings. This unified approach reduces complexity and ensures that all networking elements work seamlessly together.
> - **Centralized Management**: With a unified platform, you can manage all networking configurations from a single interface. This centralization streamlines operations and enhances visibility into network performance and security. | **Category** | **Description**| diff --git a/Terraform/README.md b/Terraform/README.md index 843cc90..1dfdbed 100644 --- a/Terraform/README.md +++ b/Terraform/README.md @@ -5,7 +5,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ------------------------------------------ @@ -72,7 +72,6 @@ az ad user list --query "[].{Name:displayName, ObjectId:id, Email:userPrincipalN image - Here is an example value for `admin_principal_id` which is Object ID you retrieved. ```hcl @@ -89,9 +88,9 @@ admin_principal_id = "12345678-1234-1234-1234-1234567890ab" > 2. Create a Storage Container: Within the storage account, create a new container to store the Terraform state file. > 3. Configure Terraform Backend: In your Terraform configuration file (e.g., [remote-storage.tf](./src/remote-storage.tf), add the backend configuration for Azure Blob Storage. -## How to execute it +## How to execute it -```mermaid +```mermaid graph TD; A[az login] --> B(terraform init) B --> C{Terraform provisioning stage} @@ -99,17 +98,18 @@ graph TD; C -->|Order Now| E[terraform apply] C -->|Delete Resource if needed| F[terraform destroy] ``` + > [!IMPORTANT] > Please modify `terraform.tfvars` with your information, then run the following flow. If you need more visual guidance, please check the video that illustrates the provisioning steps. Be aware that the template uses an F64 Fabric capacity as SKU. Once deployed and activated, you can pause your capacity after you finish or delete the whole resource group after the workshop is completed. -https://github.com/user-attachments/assets/668be278-fae7-466e-8452-860f27771073 + 1. **Login to Azure**: This command logs you into your Azure account. It opens a browser window where you can enter your Azure credentials. Once logged in, you can manage your Azure resources from the command line. ```sh cd ./Terraform/src/ ``` - + ```sh az login ``` @@ -126,7 +126,7 @@ https://github.com/user-attachments/assets/668be278-fae7-466e-8452-860f27771073 img -3. **Terraform Provisioning Stage**: +3. **Terraform Provisioning Stage**: - **Review**: Creates an execution plan, showing what actions Terraform will take to achieve the desired state defined in your configuration files. It uses the variable values specified in `terraform.tfvars`. @@ -145,9 +145,9 @@ https://github.com/user-attachments/assets/668be278-fae7-466e-8452-860f27771073 image image - + - **Remove**: Destroys the infrastructure managed by Terraform. It prompts for confirmation before deleting any resources. It also uses the variable values specified in `terraform.tfvars`. 
- + ```sh terraform destroy -var-file terraform.tfvars ``` diff --git a/Terraform/troubleshooting.md b/Terraform/troubleshooting.md index a0182ee..57c8a3f 100644 --- a/Terraform/troubleshooting.md +++ b/Terraform/troubleshooting.md @@ -1,24 +1,24 @@ -# Troubleshooting: Known Errors +# Troubleshooting: Known Errors Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ------------------------------------------ -## Content +## Content - [Terraform is not recognized](#terraform-is-not-recognized) - - [Step 1: Download Terraform](#step-1-download-terraform) - - [Step 2: Install Terraform](#step-2-install-terraform) - - [For Windows:](#for-windows) - - [For macOS:](#for-macos) - - [For Linux:](#for-linux) - - [Step 3: Verify the Installation](#step-3-verify-the-installation) - - [Step 4: Initialize Terraform](#step-4-initialize-terraform) + - [Step 1: Download Terraform](#step-1-download-terraform) + - [Step 2: Install Terraform](#step-2-install-terraform) + - [For Windows:](#for-windows) + - [For macOS:](#for-macos) + - [For Linux:](#for-linux) + - [Step 3: Verify the Installation](#step-3-verify-the-installation) + - [Step 4: Initialize Terraform](#step-4-initialize-terraform) - [Resource Group Not Found](#resource-group-not-found) - [Resource Not Found](#resource-not-found) @@ -40,17 +40,17 @@ At line:1 char:1 image

- ### Step 1: Download Terraform > By command line: + 1. Open your command prompt. 2. Use curl to download Terraform. Replace VERSION with the desired version number (e.g., 1.1.4): - + ``` curl -o terraform.zip https://releases.hashicorp.com/terraform/VERSION/terraform_VERSION_windows_amd64.zip ``` - + image 3. Use tar to extract the ZIP file: @@ -62,6 +62,7 @@ At line:1 char:1 image > By GUI: + 1. Go to the [Terraform download page](https://developer.hashicorp.com/terraform/install). 2. Download the appropriate package for your operating system (e.g., Windows, macOS, Linux). @@ -69,7 +70,7 @@ At line:1 char:1 ### Step 2: Install Terraform -#### For Windows: +#### For Windows 1. Extract the downloaded ZIP file to a directory of your choice (e.g., `C:\terraform`). @@ -78,62 +79,71 @@ At line:1 char:1 2. Add the directory to your system's PATH: > By command line:
`Assuming you have moved terraform.exe to C:\terraform, you can add this directory to the PATH using the following command` - + ``` setx PATH "%PATH%;C:\terraform" ``` - - + image - - + > By GUI: - Open the Start menu and search for `Environment Variables`. - Click on `Edit the system environment variables` - + image - + - In the System Properties window, click on `Environment Variables`. - + image - + - Under `System variables`, find the `Path` variable and click `Edit`. - Click `New` and add the path to the directory where you extracted Terraform (e.g., `C:\terraform`). - Click `OK` to close all windows. -#### For macOS: +#### For macOS 1. Open a terminal. 2. Move the Terraform binary to a directory included in your PATH (e.g., `/usr/local/bin`): + ```sh sudo mv ~/Downloads/terraform /usr/local/bin/ ``` + 3. Ensure the directory is in your PATH by adding the following line to your `~/.bash_profile` or `~/.zshrc` file: + ```sh export PATH=$PATH:/usr/local/bin ``` + 4. Reload your profile: + ```sh source ~/.bash_profile # or source ~/.zshrc ``` -#### For Linux: +#### For Linux 1. Open a terminal. 2. Move the Terraform binary to a directory included in your PATH (e.g., `/usr/local/bin`): + ```sh sudo mv ~/Downloads/terraform /usr/local/bin/ ``` + 3. Ensure the directory is in your PATH by adding the following line to your `~/.bashrc` or `~/.profile` file: + ```sh export PATH=$PATH:/usr/local/bin ``` + 4. Reload your profile: + ```sh source ~/.bashrc # or source ~/.profile ``` ### Step 3: Verify the Installation + 1. Open a new terminal or command prompt. 2. Run the following command to verify the installation. You should see the installed version of Terraform. @@ -173,7 +183,6 @@ Error: Failed to get existing workspaces: Error retrieving keys for Storage Acco image
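If this backend error appears, one quick way to confirm that your signed-in account can actually reach the state storage account is to query it from the Azure CLI; the subscription, resource group, and storage account names below are placeholders for the values in your backend configuration (e.g., `remote-storage.tf`).

```sh
# Placeholders: replace <subscription-id>, <rg>, and <storage-account> with the values
# referenced by your Terraform backend block.
az account set --subscription "<subscription-id>"

# Verifies that you can list the account keys the azurerm backend needs.
az storage account keys list \
  --resource-group "<rg>" \
  --account-name "<storage-account>" \
  --output table
```

If the key listing fails with an authorization error, ask a subscription admin for a role that allows key access on that storage account (for example, `Storage Account Contributor`).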

-

Total Visitors

Visitor Count diff --git a/Workloads-Specific/DataFactory/BestPractices.md b/Workloads-Specific/DataFactory/BestPractices.md index 2b39df7..f288921 100644 --- a/Workloads-Specific/DataFactory/BestPractices.md +++ b/Workloads-Specific/DataFactory/BestPractices.md @@ -1,12 +1,12 @@ -# Azure Data Factory (ADF) Best Practices - Overview +# Azure Data Factory (ADF) Best Practices - Overview Costa Rica -[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com) +[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com) [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ---------- @@ -28,36 +28,36 @@ Last updated: 2025-04-16 - [Architecture examples](#architecture-examples) - [Best Practices for ADF Pipelines](#best-practices-for-adf-pipelines) - - [Clear Pipeline Structure](#clear-pipeline-structure) - - [Example Pipeline Structure](#example-pipeline-structure) - - [Parameterization](#parameterization) - - [Incremental Loading](#incremental-loading) - - [Use Timestamps](#use-timestamps) - - [Change Data Capture CDC](#change-data-capture-cdc) - - [Delta Loads](#delta-loads) - - [Partitioning](#partitioning) - - [Error Handling and Monitoring](#error-handling-and-monitoring) - - [a. Use If Condition Activity](#a-use-if-condition-activity) - - [b. Configure Activity Fault Tolerance](#b-configure-activity-fault-tolerance) - - [c. Custom Error Handling: Use Web Activity for error handling](#c-custom-error-handling-use-web-activity-for-error-handling) - - [d. Pipeline Monitoring: Monitor activity runs.](#d-pipeline-monitoring-monitor-activity-runs) - - [Security Measures](#security-measures) - - [Use Azure Key Vault](#use-azure-key-vault) - - [Store Secrets](#store-secrets) - - [Access Policies](#access-policies) - - [Secure Access](#secure-access) - - [Rotate Secrets](#rotate-secrets) - - [Source Control](#source-control) - - [Resource Management](#resource-management) - - [Testing and Validation](#testing-and-validation) - - [Documentation](#documentation) - - [Regular Updates](#regular-updates) - - [Performance Tuning](#performance-tuning) + - [Clear Pipeline Structure](#clear-pipeline-structure) + - [Example Pipeline Structure](#example-pipeline-structure) + - [Parameterization](#parameterization) + - [Incremental Loading](#incremental-loading) + - [Use Timestamps](#use-timestamps) + - [Change Data Capture CDC](#change-data-capture-cdc) + - [Delta Loads](#delta-loads) + - [Partitioning](#partitioning) + - [Error Handling and Monitoring](#error-handling-and-monitoring) + - [a. Use If Condition Activity](#a-use-if-condition-activity) + - [b. Configure Activity Fault Tolerance](#b-configure-activity-fault-tolerance) + - [c. Custom Error Handling: Use Web Activity for error handling](#c-custom-error-handling-use-web-activity-for-error-handling) + - [d. 
Pipeline Monitoring: Monitor activity runs.](#d-pipeline-monitoring-monitor-activity-runs) + - [Security Measures](#security-measures) + - [Use Azure Key Vault](#use-azure-key-vault) + - [Store Secrets](#store-secrets) + - [Access Policies](#access-policies) + - [Secure Access](#secure-access) + - [Rotate Secrets](#rotate-secrets) + - [Source Control](#source-control) + - [Resource Management](#resource-management) + - [Testing and Validation](#testing-and-validation) + - [Documentation](#documentation) + - [Regular Updates](#regular-updates) + - [Performance Tuning](#performance-tuning) - [Recommended Training Modules on Microsoft Learn](#recommended-training-modules-on-microsoft-learn)
-## Architecture examples +## Architecture examples image @@ -78,7 +78,6 @@ Last updated: 2025-04-16 | **Organized Layout** | Arrange activities in a logical sequence and avoid overlapping lines. | - Place activities in a left-to-right or top-to-bottom flow to visually represent the data flow.
- Group related activities together and use containers for better organization. | | **Error Handling and Logging**| Include error handling and logging activities to capture and manage errors. | - Add a Web Activity to log errors to a monitoring system.
- Use Try-Catch blocks to handle errors gracefully and ensure the pipeline continues running. | - #### Example Pipeline Structure > Pipeline: CopySalesDataPipeline @@ -107,9 +106,9 @@ graph TD - If needed, add parameters: image - + image - + image - **Activities Inside ForEach**: @@ -124,7 +123,6 @@ graph TD image - - **Set Variable Activity**: Log the status of the copy operation. - **Name**: `LogStatus` - **Annotation**: `Log the status of the copy operation` @@ -138,6 +136,7 @@ graph TD image ### Parameterization +> > Use parameters to make your pipelines more flexible and easier to manage. | **Best Practice** | **Description** | **Example** | @@ -148,6 +147,7 @@ graph TD | **Parameterize Datasets** | Parameterize datasets to handle different data sources or destinations. | - Create a dataset with a parameterized file path to handle different file names dynamically.
- Use parameters in datasets to switch between different databases or tables.
- Define parameters for connection strings to dynamically connect to different data sources. | ### Incremental Loading +> > Implement incremental data loading to improve efficiency. | **Best Practice** | **Description** | **Example** | @@ -176,6 +176,7 @@ graph TD - Use a Stored Procedure activity to update the `LastLoadedTimestamp` in the watermark table. #### Change Data Capture (CDC) +> > Utilize CDC to capture and load only the changes made to the source data. 1. **Enable CDC on Source Table**: @@ -189,6 +190,7 @@ graph TD - Inside the ForEach activity, use Copy Data activities to apply the changes to the destination. #### Delta Loads +> > Perform delta loads to update only the changed data instead of full loads. 1. **Track Changes**: @@ -202,6 +204,7 @@ graph TD - After loading, reset the `ChangeFlag` to 0. #### Partitioning +> > Partition large datasets to improve performance and manageability. 1. **Partition Your Data**: @@ -215,6 +218,7 @@ graph TD - Inside the ForEach activity, use a Copy Data activity to load data for each partition. ### Error Handling and Monitoring +> > Set up robust error handling and monitoring to quickly identify and resolve issues. | **Best Practice** | **Description** | **Example** | @@ -225,6 +229,7 @@ graph TD | **Custom Logging** | Implement custom logging to capture detailed error information. | - Use a Web Activity to log errors to an external logging service or database.
- Implement an Azure Function to log detailed error information and call it from the pipeline.
- Use a Set Variable activity to capture error details and write them to a log file in Azure Blob Storage. | #### a. **Use If Condition Activity** + 1. **Create a Pipeline**: - Open Microsoft Fabric and navigate to Azure Data Factory. @@ -251,6 +256,7 @@ graph TD image #### b. **Configure Activity Fault Tolerance** + 1. **Set Retry Policy**: - Select an activity within your pipeline. - In the activity settings, configure the retry policy by specifying the number of retries and the interval between retries. @@ -271,17 +277,18 @@ graph TD image -#### d. **Pipeline Monitoring**: Monitor activity runs. +#### d. **Pipeline Monitoring**: Monitor activity runs - In the ADF monitoring interface, navigate to the `Monitor` section, if you don't see it click on `...`. - Check the status of individual activities within your pipelines for success, failure, and skipped activities. Or search for any specific pipeline. - Click on the activity to see the `Details`, and click on the `Pipeline Run ID`: image - + image ### Security Measures +> > Apply security best practices to protect your data. | **Best Practice** | **Description** | **Example** | @@ -292,6 +299,7 @@ graph TD | **Audit Logs** | Enable auditing to track access and changes to ADF resources. | - Use Azure Monitor to collect and analyze audit logs for ADF activities.
- Enable diagnostic settings to send logs to Azure Log Analytics, Event Hubs, or a storage account.
- Regularly review audit logs to detect and respond to unauthorized access or changes. | ### Use Azure Key Vault +> > Store sensitive information such as connection strings, passwords, and API keys in Azure Key Vault to enhance security and manage secrets efficiently. | **Best Practice** | **Description** | **Example** | @@ -325,8 +333,8 @@ graph TD image - #### Access Policies +> > Configure access policies to control who can access secrets. 1. **Set Up Access Policies in Key Vault**: @@ -343,10 +351,12 @@ graph TD > Use managed identities to securely access Key Vault secrets. **Grant Key Vault Access to Managed Identity**: - - In the Key Vault, add an access policy to grant the Data Factory managed identity access to the required secrets. - - Example: Grant `Get` and `List` permissions to the managed identity. + +- In the Key Vault, add an access policy to grant the Data Factory managed identity access to the required secrets. +- Example: Grant `Get` and `List` permissions to the managed identity. #### Rotate Secrets +> > Regularly rotate secrets to enhance security. 1. **Update Secrets in Key Vault**: @@ -359,10 +369,10 @@ graph TD - Ensure that relevant teams are notified when secrets are rotated. - Example: Use Logic Apps to send email notifications when secrets are updated. - -### Source Control +### Source Control > Benefits of Git Integration:
+> > - **Version Control**: Track and audit changes, and revert to previous versions if needed.
> - **Collaboration**: Multiple team members can work on the same project simultaneously.
> - **Incremental Saves**: Save partial changes without publishing them live.
@@ -397,6 +407,7 @@ graph TD - Collaborate with team members through code reviews and comments. ### Resource Management +> > Optimize resource usage to improve performance and reduce costs. | **Best Practice** | **Description** | **Example** | @@ -407,6 +418,7 @@ graph TD | **Resource Tagging** | Tag resources for better organization and cost tracking. | - Apply tags to ADF resources to categorize and track costs by project or department.
- Use tags to identify and manage resources associated with specific business units.
- Implement tagging policies to ensure consistent resource tagging across the organization. | ### Testing and Validation +> > Regularly test and validate your pipelines to ensure they work as expected. | **Best Practice** | **Description** | **Example** | @@ -417,6 +429,7 @@ graph TD | **Automated Testing** | Automate testing processes to ensure consistency and reliability. | - Use Azure DevOps pipelines to automate the testing of ADF pipelines.
- Schedule automated tests to run after each deployment or code change.
- Integrate automated testing with CI/CD pipelines to ensure continuous validation. | ### Documentation +> > Maintain comprehensive documentation for your pipelines. | **Best Practice** | **Description** | **Example** | @@ -427,6 +440,7 @@ graph TD | **Knowledge Sharing** | Share documentation with the team to ensure everyone is informed. | - Use a shared platform like SharePoint or Confluence to store and share documentation.
- Conduct regular training sessions to keep the team updated on best practices.
- Encourage team members to contribute to and update the documentation. | ### Regular Updates +> > Keep your pipelines and ADF environment up to date. | **Best Practice** | **Description** | **Example** | @@ -437,6 +451,7 @@ graph TD | **Security Patches** | Apply security patches promptly to protect against vulnerabilities. | - Monitor security advisories and apply patches to ADF and related services.
- Implement a patch management process to ensure timely updates.
- Conduct regular security assessments to identify and address vulnerabilities. | ### Performance Tuning +> > Continuously monitor and tune performance. | **Best Practice** | **Description** | **Example** | @@ -447,6 +462,7 @@ graph TD | **Resource Allocation** | Allocate resources efficiently to balance performance and cost. | - Adjust the number of Data Integration Units (DIUs) based on workload requirements.
- Use resource groups to manage and allocate resources effectively.
- Monitor resource usage and adjust allocations to optimize performance. | ## Recommended Training Modules on Microsoft Learn + - [Introductory training modules for Azure Data Factory](https://learn.microsoft.com/en-us/azure/data-factory/quickstart-learn-modules) - [Quickstart: Get started with Azure Data Factory](https://learn.microsoft.com/en-us/azure/data-factory/quickstart-get-started) - [Introduction to Azure Data Factory](https://learn.microsoft.com/en-us/training/modules/intro-to-azure-data-factory/): This module covers the basics of ADF and how it can help integrate your data sources diff --git a/Workloads-Specific/DataFactory/HowMonitorChanges.md b/Workloads-Specific/DataFactory/HowMonitorChanges.md index 100ec65..197adb1 100644 --- a/Workloads-Specific/DataFactory/HowMonitorChanges.md +++ b/Workloads-Specific/DataFactory/HowMonitorChanges.md @@ -5,7 +5,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ---------- @@ -60,7 +60,7 @@ Last updated: 2025-04-16 image -## Create a pipeline +## Create a pipeline 1. **Log in to Azure Portal**: Open your web browser and go to the Azure Portal. Enter your credentials to log in. 2. **Go to Data Factory**: Use the search bar at the top to search for `Data Factory` and select your Data Factory instance from the list. @@ -108,7 +108,7 @@ Last updated: 2025-04-16 image - ## How to see who modified a pipeline +## How to see who modified a pipeline 1. **Log in to Azure Portal**: Open your web browser and go to the Azure Portal. Enter your credentials to log in. 2. **Go to Azure Data Factory**: Once logged in, use the search bar at the top to search for `Data Factory` and select your Data Factory instance from the list. diff --git a/Workloads-Specific/DataScience/AI_integration/README.md b/Workloads-Specific/DataScience/AI_integration/README.md new file mode 100644 index 0000000..59099bc --- /dev/null +++ b/Workloads-Specific/DataScience/AI_integration/README.md @@ -0,0 +1,403 @@ +# Demostration: How to integrate AI in Microsoft Fabric + +Costa Rica + +[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) +[brown9804](https://github.com/brown9804) + +Last updated: 2025-04-21 + +------------------------------------------ + +> Fabric's OneLake datastore provides a unified data storage solution that supports differents data formats and sources. This feature simplifies data access and management, enabling efficient data preparation and model training. + +
+List of References (Click to expand) + +- [Unleashing the Power of Microsoft Fabric and SynapseML](https://blog.fabric.microsoft.com/en-us/blog/unleashing-the-power-of-synapseml-and-microsoft-fabric-a-guide-to-qa-on-pdf-documents-2) +- [Building a RAG application with Microsoft Fabric](https://techcommunity.microsoft.com/t5/startups-at-microsoft/building-high-scale-rag-applications-with-microsoft-fabric/ba-p/4217816) +- [Building Custom AI Applications with Microsoft Fabric: Implementing Retrieval-Augmented Generation](https://support.fabric.microsoft.com/en-us/blog/building-custom-ai-applications-with-microsoft-fabric-implementing-retrieval-augmented-generation-for-enhanced-language-models?ft=Alicia%20Li%20%28ASA%29:author) +- [Avail the Power of Microsoft Fabric from within Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/avail-the-power-of-microsoft-fabric-from-within-azure-machine/ba-p/3980702) +- [AI and Machine Learning on Databricks - Azure Databricks | Microsoft Learn]( https://learn.microsoft.com/en-us/azure/databricks/machine-learning) +- [Training and Inference of LLMs with PyTorch Fully Sharded Data Parallel](https://techcommunity.microsoft.com/t5/microsoft-developer-community/training-and-inference-of-llms-with-pytorch-fully-sharded-data/ba-p/3845995) +- [Harness the Power of LangChain in Microsoft Fabric for Advanced Document Summarization](https://blog.fabric.microsoft.com/en-us/blog/harness-the-power-of-langchain-in-microsoft-fabric-for-advanced-document-summarization) +- [Integrating Azure AI and Microsoft Fabric for Next-Gen AI Solutions](https://build.microsoft.com/en-US/sessions/91971ab3-93e4-429d-b2d7-5b60b2729b72) +- [Generative AI with Microsoft Fabric](https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/generative-ai-with-microsoft-fabric/ba-p/4219444) +- [Harness Microsoft Fabric AI Skill to Unlock Context-Rich Insights from Your Data](https://blog.fabric.microsoft.com/en-us/blog/harness-microsoft-fabric-ai-skill-to-unlock-context-rich-insights-from-your-data) +- [LangChain-AzureOpenAI Parameter API Reference](https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.azure.AzureChatOpenAI.html#) + +
+ +
+Table of Content (Click to expand) + +- [Overview](#overview) +- [Demo](#demo) + - [Set Up Your Environment](#set-up-your-environment) + - [Install Required Libraries](#install-required-libraries) + - [Configure Azure OpenAI Service](#configure-azure-openai-service) + - [Basic Usage of LangChain Transformer](#basic-usage-of-langchain-transformer) + - [Using LangChain for Large Scale Literature Review](#using-langchain-for-large-scale-literature-review) + - [Machine Learning Integration with Microsoft Fabric](#machine-learning-integration-with-microsoft-fabric) + +
+ +## Overview + +> Microsoft Fabric is a comprehensive data analytics platform that brings together various data services to provide an end-to-end solution for data engineering, data science, data warehousing, real-time analytics, and business intelligence. It's designed to simplify the process of working with data and to enable organizations to gain insights more efficiently.

+> Capabilities Enabled by LLMs:
+> +> - `Document Summarization`: LLMs can process and summarize large documents, making it easier to extract key information.
+> - `Question Answering`: Users can perform Q&A tasks on PDF documents, allowing for interactive data exploration.
+> - `Embedding Generation`: LLMs can generate embeddings for document chunks, which can be stored in a vector store for efficient search and retrieval. + +## Demo + +Tools in practice: + +| **Tool** | **Description**| +|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **LangChain**| LangChain is a framework for developing applications powered by language models. It can be used with Azure OpenAI to build applications that require natural language understanding and generation.
**Use Case**: Creating complex applications that involve multiple steps or stages of processing, such as preprocessing text data, applying a language model, and postprocessing the results. | +| **SynapseML**| SynapseML is an open-source library that simplifies the creation of massively scalable machine learning pipelines. It integrates with Azure OpenAI to provide distributed computing capabilities, allowing you to apply large language models at scale.
**Use Case**: Applying powerful language models to massive amounts of data, enabling scenarios like batch processing of text data or large-scale text analytics. | + +### Set Up Your Environment + +1. **Register the Resource Provider**: Ensure that the `microsoft.fabric` resource provider is registered in your subscription. + + image + +2. **Create a Microsoft Fabric Resource**: + - Navigate to the Azure Portal. + - Create a new resource of type **Microsoft Fabric**. + - Choose the appropriate subscription, resource group, capacity name, region, size, and administrator. + + image + +3. **Enable Fabric Capacity in Power BI**: + - Go to the Power BI workspace. + - Select the Fabric capacity license and the Fabric resource created in Azure. + + image + +4. **Pause Fabric Compute When Not in Use**: To save costs, remember to pause the Fabric compute in Azure when you're not using it. + + image + +### Install Required Libraries + +1. **Access Microsoft Fabric**: + - Open your web browser and navigate to the Microsoft Fabric portal. + - Sign in with your Azure credentials. +2. **Select Your Workspace**: From the Microsoft Fabric home page, select the workspace where you want to configure SynapseML. +3. **Create a New Cluster**: + - Within the **Data Science** component, you should find options to create a new cluster. + + image + + - Follow the prompts to configure and create your cluster, specifying the details such as cluster name, region, node size, and node count. + + image + + image + +4. **Install SynapseML on Your Cluster**: Configure your cluster to include the SynapseML package. + + image + + image + + ~~~ + %pip show synapseml + ~~~ + +5. **Install LangChain and Other Dependencies**: + > You can use `%pip install` to install the necessary packages + + ```python + %pip install openai langchain_community + ``` + + Or you can use the environment configuration: + + image + + You can also try with the `.yml file` approach. Just upload your list of dependencies. E.g: + + ```yml + dependencies: + - pip: + - synapseml==1.0.8 + - langchain==0.3.4 + - langchain_community==0.3.4 + - openai==1.53.0 + - langchain.openai==0.2.4 + ``` + +### Configure Azure OpenAI Service + +> [!NOTE] +> Click [here](./src/fabric-llms-overview_sample.ipynb) to see all notebook + +1. **Set Up API Keys**: Ensure you have the API key and endpoint URL for your deployed model. Set these as environment variables + + image + + ```python + import os + + # Set the API version for the Azure OpenAI service + os.environ["OPENAI_API_VERSION"] = "2023-08-01-preview" + + # Set the base URL for the Azure OpenAI service + os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource-name.openai.azure.com" + + # Set the API key for Azure OpenAI + os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key" + ``` + +2. **Initialize Azure OpenAI Class**: Create an instance of the Azure OpenAI class using the environment variables set above. + + image + + ```python + from langchain_openai import AzureChatOpenAI + + # Set the API base URL + api_base = os.environ["AZURE_OPENAI_ENDPOINT"] + + # Create an instance of the Azure OpenAI Class + llm = AzureChatOpenAI( + openai_api_key=os.environ["AZURE_OPENAI_API_KEY"], + temperature=0.7, + verbose=True, + top_p=0.9 + ) + ``` + +3. **Call the Deployed Model**: Use the Azure OpenAI service to generate text or perform other language model tasks. 
Here's an example of generating a response based on a prompt + + image + + ```python + # Define a prompt + messages = [ + ( + "system", + "You are a helpful assistant that translates English to French. Translate the user sentence.", + ), + ("human", "Hi, how are you?"), + ] + + # Generate a response from the Azure OpenAI service using the invoke method + ai_msg = llm.invoke(messages) + + # Print the response + print(ai_msg) + ``` + +Make sure to replace `"your_openai_api_key"`, `"https://your_openai_api_base/"`, `"your_deployment_name"`, and `"your_model_name"` with your actual API key, base URL, deployment name, and model name from your Azure OpenAI instance. This example demonstrates how to configure and use an existing Azure OpenAI instance in Microsoft Fabric. + +### Basic Usage of LangChain Transformer + +> [!NOTE] +> E.g: Automate the process of generating definitions for technology terms using a language model. +> `The LangChain Transformer` is a tool that makes it easy to use advanced language models for `generating and transforming text`. It works by `setting up a template for what you want to create, linking this template to a language model, and then processing your data to produce the desired output`. This setup `helps automate tasks like defining technology terms or generating other text-based content`, making your workflow smoother and more efficient. + +> `LangChain Transformer helps you automate the process of generating and transforming text data using advanced language models`, making it easier to integrate AI capabilities into your data workflows.
+> +> 1. `Prompt Creation`: Start by `defining a template for the kind of text you want to generate or analyze`. For example, you might create a prompt that asks the model to define a specific technology term.
+> 2. `Chain Setup`: Then `set up a chain that links this prompt to a language model`. This chain is responsible for sending the prompt to the model and receiving the generated response.
+> 3. `Transformer Configuration`: The LangChain Transformer is `configured to use this chain`. It specifies how the `input data (like a list of technology names) should be processed and what kind of output (like definitions) should be produced`.
+> 4. `Data Processing`: Finally, `apply this setup to a dataset.` E.g., list of technology names in a DataFrame, and the transformer will use the language model to generate definitions for each technology. + +1. **Create a Prompt Template**: Define a prompt template for generating definitions. + + image + + ```python + from langchain.prompts import PromptTemplate + + copy_prompt = PromptTemplate( + input_variables=["technology"], + template="Define the following word: {technology}", + ) + ``` + +2. **Set Up an LLMChain**: Create an LLMChain with the defined prompt template. + + image + + ```python + from langchain.chains import LLMChain + + chain = LLMChain(llm=llm, prompt=copy_prompt) + ``` + +3. **Configure LangChain Transformer**: Set up the LangChain transformer to execute the processing chain. + + image + + ```python + # Set up the LangChain transformer to execute the processing chain. + from synapse.ml.cognitive.langchain import LangchainTransformer + + openai_api_key= os.environ["AZURE_OPENAI_API_KEY"] + + transformer = ( + LangchainTransformer() + .setInputCol("technology") + .setOutputCol("definition") + .setChain(chain) + .setSubscriptionKey(openai_api_key) + .setUrl(api_base) + ) + ``` + +4. **Create a Test DataFrame**: Construct a DataFrame with technology names. + + image + + ```python + from pyspark.sql import SparkSession + from pyspark.sql.functions import udf + from pyspark.sql.types import StringType + + # Initialize Spark session + spark = SparkSession.builder.appName("example").getOrCreate() + + # Construct a DataFrame with technology names + df = spark.createDataFrame( + [ + (0, "docker"), (1, "spark"), (2, "python") + ], + ["label", "technology"] + ) + + # Define a simple UDF to transform the technology column + def transform_technology(tech): + return tech.upper() + + # Register the UDF + transform_udf = udf(transform_technology, StringType()) + + # Apply the UDF to the DataFrame + transformed_df = df.withColumn("transformed_technology", transform_udf(df["technology"])) + + # Show the transformed DataFrame + transformed_df.show() + ``` + +### Using LangChain for Large Scale Literature Review + +> [!NOTE] +> E.g: Automating the extraction and summarization of academic papers: script for an agent using LangChain to extract content from an online PDF and generate a prompt based on that content. +> An `agent` in the context of programming and artificial intelligence is a `software entity that performs tasks autonomously`. It can interact with its`environment, make decisions, and execute actions based on predefined rules or learned behavior.` + +1. **Define Functions for Content Extraction and Prompt Generation**: Extract content from PDFs linked in arXiv papers and generate prompts for extracting specific information. + + image + + ```python + from langchain.document_loaders import OnlinePDFLoader + + def paper_content_extraction(inputs: dict) -> dict: + arxiv_link = inputs["arxiv_link"] + loader = OnlinePDFLoader(arxiv_link) + pages = loader.load_and_split() + return {"paper_content": pages[0].page_content + pages[1].page_content} + + def prompt_generation(inputs: dict) -> dict: + output = inputs["Output"] + prompt = ( + "find the paper title, author, summary in the paper description below, output them. " + "After that, Use websearch to find out 3 recent papers of the first author in the author section below " + "(first author is the first name separated by comma) and list the paper titles in bullet points: " + "\n" + output + "." 
+ ) + return {"prompt": prompt} + ``` + +2. **Create a Sequential Chain for Information Extraction**: Set up a chain to extract structured information from an arXiv link + + image + + ```python + from langchain.chains import TransformChain, SimpleSequentialChain + + paper_content_extraction_chain = TransformChain( + input_variables=["arxiv_link"], + output_variables=["paper_content"], + transform=paper_content_extraction, + verbose=False, + ) + + paper_summarizer_template = """ + You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, + and extract authors and paper title from the paper content. + """ + ``` + +### Machine Learning Integration with Microsoft Fabric + +1. **Train and Register Machine Learning Models**: Use Microsoft Fabric's native integration with the MLflow framework to log the trained machine learning models, the used hyperparameters, and evaluation metrics. + + image + + ```python + import mlflow + from mlflow.models import infer_signature + from sklearn.datasets import make_regression + from sklearn.ensemble import RandomForestRegressor + + # Generate synthetic regression data + X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False) + + # Model parameters + params = {"n_estimators": 3, "random_state": 42} + + # Model tags for MLflow + model_tags = { + "project_name": "grocery-forecasting", + "store_dept": "produce", + "team": "stores-ml", + "project_quarter": "Q3-2023" + } + + # Log MLflow entities + with mlflow.start_run() as run: + # Train the model + model = RandomForestRegressor(**params).fit(X, y) + + # Infer the model signature + signature = infer_signature(X, model.predict(X)) + + # Log parameters and the model + mlflow.log_params(params) + mlflow.sklearn.log_model(model, artifact_path="sklearn-model", signature=signature) + + # Register the model with tags + model_uri = f"runs:/{run.info.run_id}/sklearn-model" + model_version = mlflow.register_model(model_uri, "RandomForestRegressionModel", tags=model_tags) + + # Output model registration details + print(f"Model Name: {model_version.name}") + print(f"Model Version: {model_version.version}") + ``` + +2. **Compare and Filter Machine Learning Models**: Use MLflow to search among multiple models saved within the workspace. + + image + + ```python + from pprint import pprint + from mlflow.tracking import MlflowClient + + client = MlflowClient() + for rm in client.search_registered_models(): + pprint(dict(rm), indent=4) + ``` + +
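As a follow-up to the two MLflow steps above, the sketch below is an illustrative addition (not part of the original walkthrough): it narrows the registry search to the `RandomForestRegressionModel` registered earlier and loads one version back for scoring. The version number `2` is a placeholder; use whichever version the search returns in your workspace.

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_regression

client = MlflowClient()

# Search only for the model registered in the previous step
for rm in client.search_registered_models(filter_string="name = 'RandomForestRegressionModel'"):
    for mv in rm.latest_versions:
        print(f"name={mv.name}, version={mv.version}, stage={mv.current_stage}")

# Load a specific registered version back and score a few rows
# (the version number is an example; pick one listed above)
loaded_model = mlflow.sklearn.load_model("models:/RandomForestRegressionModel/2")

# Same synthetic features as the training step, just to have something to score
X, _ = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)
print(loaded_model.predict(X[:5]))
```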
+

Total Visitors

+ Visitor Count +
diff --git a/Workloads-Specific/DataScience/AI_integration/src/fabric-llms-overview_sample.ipynb b/Workloads-Specific/DataScience/AI_integration/src/fabric-llms-overview_sample.ipynb new file mode 100644 index 0000000..7b0c18d --- /dev/null +++ b/Workloads-Specific/DataScience/AI_integration/src/fabric-llms-overview_sample.ipynb @@ -0,0 +1,1194 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "519955e9-2dad-456d-93db-a332d38e9433", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "# Fabric: Highlights into AI/LLMs" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "d312e8d9-03fe-4b3d-aa6d-c52e3022ae39", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T03:58:26.7170509Z", + "execution_start_time": "2024-10-31T03:58:19.270951Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "e267b6ab-5133-4598-8251-d64374cd11e5", + "queued_time": "2024-10-31T03:58:18.9132075Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 5, + "statement_ids": [ + 5 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 5, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: synapseml\r\n", + "Version: 1.0.8\r\n", + "Summary: Synapse Machine Learning\r\n", + "Home-page: https://github.com/Microsoft/SynapseML\r\n", + "Author: Microsoft\r\n", + "Author-email: synapseml-support@microsoft.com\r\n", + "License: MIT\r\n", + "Location: /home/trusted-service-user/cluster-env/clonedenv/lib/python3.11/site-packages\r\n", + "Requires: \r\n", + "Required-by: \r\n" + ] + } + ], + "source": [ + "!pip show synapseml" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "427610d0-3fae-45e3-8150-92ee7674f44c", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T03:58:28.6254349Z", + "execution_start_time": "2024-10-31T03:58:27.1124616Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "0e9f6c0f-062b-4e5d-9061-afcd89c8fd75", + "queued_time": "2024-10-31T03:58:19.3223486Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 6, + "statement_ids": [ + 6 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 6, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: langchain-openai\r\n", + "Version: 0.2.4\r\n", + "Summary: An integration package connecting OpenAI and LangChain\r\n", + "Home-page: https://github.com/langchain-ai/langchain\r\n", + "Author: \r\n", + "Author-email: \r\n", + "License: MIT\r\n", + "Location: 
/home/trusted-service-user/cluster-env/clonedenv/lib/python3.11/site-packages\r\n", + "Requires: langchain-core, openai, tiktoken\r\n", + "Required-by: \r\n" + ] + } + ], + "source": [ + "!pip show langchain-openai" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "baeeb853-2104-4edf-abf4-4d4be50cb977", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T03:58:30.5465258Z", + "execution_start_time": "2024-10-31T03:58:29.0000586Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "716d9975-263b-4d92-b25c-b342106f5f43", + "queued_time": "2024-10-31T03:58:19.511824Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 7, + "statement_ids": [ + 7 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 7, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: langchain\r\n", + "Version: 0.3.6\r\n", + "Summary: Building applications with LLMs through composability\r\n", + "Home-page: https://github.com/langchain-ai/langchain\r\n", + "Author: \r\n", + "Author-email: \r\n", + "License: MIT\r\n", + "Location: /home/trusted-service-user/cluster-env/clonedenv/lib/python3.11/site-packages\r\n", + "Requires: aiohttp, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity\r\n", + "Required-by: langchain-community\r\n" + ] + } + ], + "source": [ + "!pip show langchain" + ] + }, + { + "cell_type": "markdown", + "id": "c58cc406-c4f5-4607-a740-0802e8e4b550", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Ensure you have the API key and endpoint URL for your deployed model. 
Set these as environment variables" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "3c8ada7c-2632-4c69-86d2-f5260ee8f1b7", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:14.3495341Z", + "execution_start_time": "2024-10-31T04:20:14.1128215Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "2573bf75-fe6d-40dc-b9f6-e06ebb9f7f73", + "queued_time": "2024-10-31T04:20:13.6194485Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 22, + "statement_ids": [ + 22 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 22, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import os\n", + "\n", + "os.environ[\"OPENAI_API_VERSION\"] = \"2023-08-01-preview\"\n", + "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"https://your-resource.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-08-01-preview\"\n", + "os.environ[\"AZURE_OPENAI_API_KEY\"] = \"your-value\"" + ] + }, + { + "cell_type": "markdown", + "id": "3fac48a9-45fb-4e86-9792-8ee340b0ac60", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Create an instance of the Azure OpenAI class using the environment variables set above" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "5db10350-8000-4cbd-9bdf-d7da62d7fe61", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:14.9382032Z", + "execution_start_time": "2024-10-31T04:20:14.7083469Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "7dfaca5a-f738-4010-bba1-f764ea70f450", + "queued_time": "2024-10-31T04:20:14.027325Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 23, + "statement_ids": [ + 23 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 23, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain_openai import AzureChatOpenAI\n", + "\n", + "# Set the API base URL\n", + "api_base = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\n", + "\n", + "# Create an instance of the Azure OpenAI Class\n", + "llm = AzureChatOpenAI(\n", + " openai_api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n", + " temperature=0.7,\n", + " verbose=True,\n", + " top_p=0.9\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b17d7450-34b5-4ece-8e20-a77ddcdd93c4", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + 
"source": [ + "## Use the Azure OpenAI service to generate text or perform other language model tasks" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "cfc5fd62-085a-4eff-9192-696d9f249a8e", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:16.0500538Z", + "execution_start_time": "2024-10-31T04:20:15.2936074Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "e14e4d0b-1fd0-4dac-a07d-6479d6536ce3", + "queued_time": "2024-10-31T04:20:14.4969185Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 24, + "statement_ids": [ + 24 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 24, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "content='Salut, comment ça va ?' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 6, 'prompt_tokens': 33, 'total_tokens': 39, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-4o-mini', 'system_fingerprint': 'fp_d54531d9eb', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'protected_material_code': {'filtered': False, 'detected': False}, 'protected_material_text': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}} id='run-8cb7f29a-44c1-4f65-a648-15afb2d793dc-0' usage_metadata={'input_tokens': 33, 'output_tokens': 6, 'total_tokens': 39, 'input_token_details': {}, 'output_token_details': {}}\n" + ] + } + ], + "source": [ + "# Define a prompt\n", + "messages = [\n", + " (\n", + " \"system\",\n", + " \"You are a helpful assistant that translates English to French. 
Translate the user sentence.\",\n", + " ),\n", + " (\"human\", \"Hi, how are you?\"),\n", + "]\n", + "\n", + "# Generate a response from the Azure OpenAI service using the invoke method\n", + "ai_msg = llm.invoke(messages)\n", + "\n", + "# Print the response\n", + "print(ai_msg)" + ] + }, + { + "cell_type": "markdown", + "id": "79729106-c7f1-4879-bc2b-871b50c2ac9a", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Define a prompt template for generating definitions" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "ca633361-c27b-4294-b8a7-9fc4a316afa4", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:16.587491Z", + "execution_start_time": "2024-10-31T04:20:16.3655978Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "cc3215f4-71a5-4231-af47-9bd9a8f5698a", + "queued_time": "2024-10-31T04:20:14.7799392Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 25, + "statement_ids": [ + 25 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 25, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "\n", + "copy_prompt = PromptTemplate(\n", + " input_variables=[\"technology\"],\n", + " template=\"Define the following word: {technology}\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "899839d9-adca-4042-b662-73edcad7e432", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Create an LLMChain with the defined prompt template" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "bd4f65ca-049b-481d-bbbd-a017c6c0119b", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:17.1233668Z", + "execution_start_time": "2024-10-31T04:20:16.9052959Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "27790d83-509f-4716-bb69-9c288ad069ba", + "queued_time": "2024-10-31T04:20:15.1325692Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 26, + "statement_ids": [ + 26 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 26, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain.chains import LLMChain\n", + "\n", + "chain = LLMChain(llm=llm, prompt=copy_prompt)\n" + ] + }, + { + "cell_type": "markdown", + "id": "936b3ddf-cc65-436c-ba4e-ae0abe21fc2c", + "metadata": 
{ + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Set up the LangChain transformer to execute the processing chain\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "63a00038-37b4-49ee-9c53-128c8acf9d01", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:20:18.181457Z", + "execution_start_time": "2024-10-31T04:20:17.4351576Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "3fb30420-f0c9-477b-ad1a-001dc0d8d37a", + "queued_time": "2024-10-31T04:20:15.6799013Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 27, + "statement_ids": [ + 27 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 27, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from synapse.ml.cognitive.langchain import LangchainTransformer\n", + "\n", + "openai_api_key= os.environ[\"AZURE_OPENAI_API_KEY\"]\n", + "\n", + "transformer = (\n", + " LangchainTransformer()\n", + " .setInputCol(\"technology\")\n", + " .setOutputCol(\"definition\")\n", + " .setChain(chain)\n", + " .setSubscriptionKey(openai_api_key)\n", + " .setUrl(api_base)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c74293f0-925e-4987-a6a1-b3b9b8e14b9d", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Construct a DataFrame with technology names." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "8e03963e-2fcf-4934-b96f-ac27b4e0353c", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:24:08.3891172Z", + "execution_start_time": "2024-10-31T04:24:02.0675933Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "856f5b73-26e8-4d20-a901-356cd92b9c2a", + "queued_time": "2024-10-31T04:24:01.6603792Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 29, + "statement_ids": [ + 29 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 29, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+----------+----------------------+\n", + "|label|technology|transformed_technology|\n", + "+-----+----------+----------------------+\n", + "| 0| docker| DOCKER|\n", + "| 1| spark| SPARK|\n", + "| 2| python| PYTHON|\n", + "+-----+----------+----------------------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.sql import SparkSession\n", + "from pyspark.sql.functions import udf\n", + "from pyspark.sql.types import StringType\n", + "\n", + "# Initialize Spark session\n", + "spark = SparkSession.builder.appName(\"example\").getOrCreate()\n", + "\n", + "# Construct a DataFrame with technology names\n", + "df = spark.createDataFrame(\n", + " [\n", + " (0, \"docker\"), (1, \"spark\"), (2, \"python\")\n", + " ],\n", + " [\"label\", \"technology\"]\n", + ")\n", + "\n", + "# Define a simple UDF to transform the technology column\n", + "def transform_technology(tech):\n", + " return tech.upper()\n", + "\n", + "# Register the UDF\n", + "transform_udf = udf(transform_technology, StringType())\n", + "\n", + "# Apply the UDF to the DataFrame\n", + "transformed_df = df.withColumn(\"transformed_technology\", transform_udf(df[\"technology\"]))\n", + "\n", + "# Show the transformed DataFrame\n", + "transformed_df.show()" + ] + }, + { + "cell_type": "markdown", + "id": "47ab1ba6-deaf-488d-9e95-8202669d948c", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Extract content from PDFs linked in arXiv papers and generate prompts for extracting specific information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "8b52c87e-5971-4d28-bc4b-4160d29a1c24", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:27:08.3224773Z", + "execution_start_time": "2024-10-31T04:27:08.0430507Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "4eeab690-4159-41dc-be69-3cceed484314", + "queued_time": "2024-10-31T04:27:07.6309068Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + 
"session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 30, + "statement_ids": [ + 30 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 30, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain.document_loaders import OnlinePDFLoader\n", + "\n", + "def paper_content_extraction(inputs: dict) -> dict:\n", + " arxiv_link = inputs[\"arxiv_link\"]\n", + " loader = OnlinePDFLoader(arxiv_link)\n", + " pages = loader.load_and_split()\n", + " return {\"paper_content\": pages[0].page_content + pages[1].page_content}\n", + "\n", + "def prompt_generation(inputs: dict) -> dict:\n", + " output = inputs[\"Output\"]\n", + " prompt = (\n", + " \"find the paper title, author, summary in the paper description below, output them. \"\n", + " \"After that, Use websearch to find out 3 recent papers of the first author in the author section below \"\n", + " \"(first author is the first name separated by comma) and list the paper titles in bullet points: \"\n", + " \"\\n\" + output + \".\"\n", + " )\n", + " return {\"prompt\": prompt}" + ] + }, + { + "cell_type": "markdown", + "id": "89d79c38-ba0c-4062-911c-7ede02536298", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Set up a chain to extract structured information from an arXiv link\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e85241a0-11c2-49c1-9b2e-63187cb24d9a", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:28:11.2331925Z", + "execution_start_time": "2024-10-31T04:28:11.0134852Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "232b4aa0-1b84-47f8-bb5d-347a575d9640", + "queued_time": "2024-10-31T04:28:10.663514Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 31, + "statement_ids": [ + 31 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 31, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain.chains import TransformChain, SimpleSequentialChain\n", + "\n", + "paper_content_extraction_chain = TransformChain(\n", + " input_variables=[\"arxiv_link\"],\n", + " output_variables=[\"paper_content\"],\n", + " transform=paper_content_extraction,\n", + " verbose=False,\n", + ")\n", + "\n", + "paper_summarizer_template = \"\"\"\n", + "You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, \n", + "and extract authors and paper title from the paper content.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "64937339-791c-4aad-953b-ca990bfd324a", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Use Microsoft Fabric's native integration with the MLflow framework to log the trained machine learning 
models, the used hyperparameters, and evaluation metrics." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "5bac7684-a123-4733-baa3-a748ff0fd070", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [ + { + "data": { + "application/vnd.livy.statement-meta+json": { + "execution_finish_time": "2024-10-31T04:36:54.8917645Z", + "execution_start_time": "2024-10-31T04:36:44.7561664Z", + "livy_statement_state": "available", + "normalized_state": "finished", + "parent_msg_id": "d2abef17-25d7-41c4-a62f-051d9b5fe8d7", + "queued_time": "2024-10-31T04:36:44.2999954Z", + "session_id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "session_start_time": null, + "spark_pool": null, + "state": "finished", + "statement_id": 33, + "statement_ids": [ + 33 + ] + }, + "text/plain": [ + "StatementMeta(, 7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325, 33, Finished, Available, Finished)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Registered model 'RandomForestRegressionModel' already exists. Creating a new version of this model...\n", + "2024/10/31 04:36:52 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: RandomForestRegressionModel, version 2\n", + "Created version '2' of model 'RandomForestRegressionModel'.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model Name: RandomForestRegressionModel\n", + "Model Version: 2\n" + ] + }, + { + "data": { + "application/vnd.mlflow.run-widget+json": { + "data": { + "metrics": {}, + "params": { + "n_estimators": "3", + "random_state": "42" + }, + "tags": { + "mlflow.rootRunId": "20c75f63-d266-40b1-83f7-d9c76fd1f4f4", + "mlflow.runName": "icy_hamster_xr34qfzf", + "mlflow.user": "4b3a56ea-6f42-450e-b7c3-fb2932c7ac32", + "synapseml.experiment.artifactId": "17b41ab7-b0e0-4adc-9fc9-403dd72b6e5b", + "synapseml.experimentName": "Notebook-1", + "synapseml.livy.id": "7383b5d4-1dea-4b9b-85d6-fe5ef5b7d325", + "synapseml.notebook.artifactId": "789d5fef-b2a1-409b-996f-0cdb4e748a90", + "synapseml.user.id": "ea5a1fdc-a08c-493a-bce9-8422f28ecd05", + "synapseml.user.name": "System Administrator" + } + }, + "info": { + "artifact_uri": "sds://onelakewestus3.pbidedicated.windows.net/6361aeaa-b63a-44ea-b28f-26db10b31a6c/17b41ab7-b0e0-4adc-9fc9-403dd72b6e5b/20c75f63-d266-40b1-83f7-d9c76fd1f4f4/artifacts", + "end_time": 1730349412, + "experiment_id": "d52403ad-a9c2-41ba-b582-9b8e9a57917e", + "lifecycle_stage": "active", + "run_id": "20c75f63-d266-40b1-83f7-d9c76fd1f4f4", + "run_name": "", + "run_uuid": "20c75f63-d266-40b1-83f7-d9c76fd1f4f4", + "start_time": 1730349405, + "status": "FINISHED", + "user_id": "7ebfac85-3ebb-440f-a743-e52052051f6a" + }, + "inputs": { + "dataset_inputs": [] + } + } + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import mlflow\n", + "from mlflow.models import infer_signature\n", + "from sklearn.datasets import make_regression\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "\n", + "# Generate synthetic regression data\n", + "X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)\n", + "\n", + "# Model parameters\n", + "params = {\"n_estimators\": 3, \"random_state\": 42}\n", + "\n", + "# Model tags for 
MLflow\n", + "model_tags = {\n", + " \"project_name\": \"grocery-forecasting\",\n", + " \"store_dept\": \"produce\",\n", + " \"team\": \"stores-ml\",\n", + " \"project_quarter\": \"Q3-2023\"\n", + "}\n", + "\n", + "# Log MLflow entities\n", + "with mlflow.start_run() as run:\n", + " # Train the model\n", + " model = RandomForestRegressor(**params).fit(X, y)\n", + "\n", + " # Infer the model signature\n", + " signature = infer_signature(X, model.predict(X))\n", + "\n", + " # Log parameters and the model\n", + " mlflow.log_params(params)\n", + " mlflow.sklearn.log_model(model, artifact_path=\"sklearn-model\", signature=signature)\n", + "\n", + " # Register the model with tags\n", + " model_uri = f\"runs:/{run.info.run_id}/sklearn-model\"\n", + " model_version = mlflow.register_model(model_uri, \"RandomForestRegressionModel\", tags=model_tags)\n", + "\n", + " # Output model registration details\n", + " print(f\"Model Name: {model_version.name}\")\n", + " print(f\"Model Version: {model_version.version}\")" + ] + }, + { + "cell_type": "markdown", + "id": "315ebdcd-e78c-4bc5-93d6-f202d02bddc5", + "metadata": { + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "source": [ + "## Use MLflow to search among multiple models saved within the workspace" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60e6f7d3-d1ec-4ccc-9745-6c7938d2f4bc", + "metadata": { + "jupyter": { + "outputs_hidden": false, + "source_hidden": false + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark" + }, + "nteract": { + "transient": { + "deleting": false + } + } + }, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "from mlflow.tracking import MlflowClient\n", + "\n", + "client = MlflowClient()\n", + "for rm in client.search_registered_models():\n", + " pprint(dict(rm), indent=4)" + ] + } + ], + "metadata": { + "dependencies": { + "environment": { + "environmentId": "766562be-9e21-456c-b270-cac7e4bf8d18", + "workspaceId": "6361aeaa-b63a-44ea-b28f-26db10b31a6c" + } + }, + "kernel_info": { + "name": "synapse_pyspark" + }, + "kernelspec": { + "display_name": "Synapse PySpark", + "language": "Python", + "name": "synapse_pyspark" + }, + "language_info": { + "name": "python" + }, + "microsoft": { + "language": "python", + "language_group": "synapse_pyspark", + "ms_spell_check": { + "ms_spell_check_language": "en" + } + }, + "nteract": { + "version": "nteract-front-end@1.0.0" + }, + "spark_compute": { + "compute_id": "/trident/default", + "session_options": { + "conf": { + "spark.synapse.nbs.session.timeout": "1200000" + } + } + }, + "widgets": {} + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Workloads-Specific/PowerBi/ConfigureCloudConnectionsGateways.md b/Workloads-Specific/PowerBi/ConfigureCloudConnectionsGateways.md index 7833668..7bfb5bc 100644 --- a/Workloads-Specific/PowerBi/ConfigureCloudConnectionsGateways.md +++ b/Workloads-Specific/PowerBi/ConfigureCloudConnectionsGateways.md @@ -5,7 +5,7 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ------------------------------------------ @@ -30,21 +30,17 @@ Last updated: 2025-04-16
Table of Contents (Click to expand) -- [Power Bi: Cloud Connections & Gateways](#power-bi-cloud-connections--gateways) - - [Wiki](#wiki) - - [Content](#content) - - [How to Manage Cloud connections](#how-to-manage-cloud-connections) - - [Creating Shareable Connections](#creating-shareable-connections) - - [Managing Connections](#managing-connections) - - [Admin Monitoring Workspace](#admin-monitoring-workspace) - - [Identify Access per report](#identify-access-per-report) - - [Restrict Access from new gateway connections](#restrict-access-from-new-gateway-connections) - - [On-premises Data Gateways](#on-premises-data-gateways) - - [Virtual Network VNet Data Gateways](#virtual-network-vnet-data-gateways) +- [How to Manage Cloud connections](#how-to-manage-cloud-connections) + - [Creating Shareable Connections](#creating-shareable-connections) + - [Managing Connections](#managing-connections) +- [Admin Monitoring Workspace](#admin-monitoring-workspace) +- [Identify Access per report](#identify-access-per-report) +- [Restrict Access from new gateway connections](#restrict-access-from-new-gateway-connections) + - [On-premises Data Gateways](#on-premises-data-gateways) + - [Virtual Network VNet Data Gateways](#virtual-network-vnet-data-gateways)
- ## How to Manage Cloud connections Managing cloud connections in Power BI, below you can find differences between personal and shareable cloud connections: @@ -77,9 +73,10 @@ Managing cloud connections in Power BI, below you can find differences between p | --- | --- | | Private | Contains sensitive or confidential information, and the visibility of the data source may be restricted to authorized users. It is completely isolated from other data sources. Examples include Facebook data, a text file containing stock awards, or a workbook containing an employee review. | | Organizational | Limits the visibility of a data source to a trusted group of people. It is isolated from all Public data sources, but is visible to other Organizational data sources. A common example is a Microsoft Word document on an intranet SharePoint site with permissions enabled for a trusted group. | -| Public | Gives everyone visibility to the data. Only files, internet data sources, or workbook data can be marked Public. Examples include data from a Wikipedia page, or a local file containing data copied from a public web page.| +| Public | Gives everyone visibility to the data. Only files, internet data sources, or workbook data can be marked Public. Examples include data from a Wikipedia page, or a local file containing data copied from a public web page.| + +Steps: -Steps: - Go to [Power Bi](https://app.powerbi.com/) - Click on ⚙️, and go to `Manage connections and gateways` @@ -91,7 +88,7 @@ Steps: ### Managing Connections -> - `Switching to Shareable Connections`: If you want to switch from a personal cloud connection to a shareable one, you can do so in the Semantic model settings. This allows you to leverage the benefits of shareable connections, such as easier management and sharing capabilities.
+> - `Switching to Shareable Connections`: If you want to switch from a personal cloud connection to a shareable one, you can do so in the Semantic model settings. This allows you to leverage the benefits of shareable connections, such as easier management and sharing capabilities.
> - `Granular Access Control`: Power BI allows for granular access control at the tenant, workspace, and semantic model levels. This means you can enforce access policies to ensure that only authorized users can create or use specific connections. - To assign the connection a semantic model, click on `...` over your semantic model, and go to `Settings` @@ -120,12 +117,12 @@ Steps to setup admin monitoring workspace: image -> The report can be accessed from the Admin monitoring workspace and is designed for admins to analyze various usage scenarios. +> The report can be accessed from the Admin monitoring workspace and is designed for admins to analyze various usage scenarios. | Report Name | Details | -| --- | --- | +| --- | --- | | Feature Usage and Adoption Report | This report provides an in-depth analysis of how different features are utilized and adopted across your Microsoft Fabric tenant. It includes pages for activity overview, analysis, and detailed activity scenarios, helping identify which users are making use of cloud connections. | -| Purview Hub | Offers insights into data governance and compliance. It helps administrators manage and monitor data policies, ensuring that data usage aligns with organizational standards and regulatory requirements. | +| Purview Hub | Offers insights into data governance and compliance. It helps administrators manage and monitor data policies, ensuring that data usage aligns with organizational standards and regulatory requirements. | image @@ -146,6 +143,7 @@ Benefits of sharing the semantic model: > [!IMPORTANT] > Other ways to get insights:
+> > - `Monitoring Usage`: You can monitor and manage cloud connections through the Power BI service. By navigating to the Manage connections and gateways section, you can see which users have access to and are using specific cloud connections.
> image
> - `Premium Capacity Metrics`: For a more detailed analysis, you can use the Premium Capacity Metrics app, which provides insights into the usage and performance of your Power BI Premium capacities. @@ -163,7 +161,7 @@ Benefits of sharing the semantic model: ## Restrict Access from new gateway connections -> Facilitate secure data transfer between Power BI or Power Apps and non-cloud data sources like on-premises SQL Server databases or SharePoint sites. +> Facilitate secure data transfer between Power BI or Power Apps and non-cloud data sources like on-premises SQL Server databases or SharePoint sites. Gateway Roles: @@ -181,13 +179,12 @@ Connection Roles: | `User` | - Can use the connection in Power BI reports and dataflows.
- Cannot see or update credentials. | | `User with Sharing` | - Can use the connection in Power BI reports and dataflows.
- Can share the data source with others with User permission. | - Steps to Manage Gateway and Connection Roles: - Go to [Power Bi/Fabric admin center](https://app.powerbi.com/) - Click on ⚙️, and go to `Manage Connections and Gateways` - Choose `Connections`, `On premises data gateway` or `Virtual Network data gateways`: - + image - Click on `...`, and select `Manage users`: @@ -212,7 +209,6 @@ Steps to Restrict Access for On-Premises Data Gateways: > - **Tenant-Level Control**: You can `restrict who can install on-premises data gateways at the tenant level through the Power Platform admin center`. This prevents unauthorized users from creating new gateway connections.
> - **Role Management**: Assign specific roles to users, such as Admin, Connection Creator, and Connection Creator with Sharing, `to control who can create and manage connections on the gateway`. - 1. **Access the Power Platform Admin Center**: Go to the [Power Platform Admin Center](https://admin.powerplatform.microsoft.com/ext/DataGateways). 2. **Navigate to Data Gateways**: - Click on **Data** (preview) in the left-hand menu. @@ -226,7 +222,7 @@ Steps to Restrict Access for On-Premises Data Gateways: image -### Virtual Network (VNet) Data Gateways +### Virtual Network (VNet) Data Gateways > Allow Power BI to connect to data services within an Azure virtual network without needing an on-premises data gateway. This setup is particularly useful for maintaining security and compliance by keeping data traffic within the Azure backbone. diff --git a/Workloads-Specific/PowerBi/ConfigureReadAccess.md b/Workloads-Specific/PowerBi/ConfigureReadAccess.md index a97ecb9..598c3c1 100644 --- a/Workloads-Specific/PowerBi/ConfigureReadAccess.md +++ b/Workloads-Specific/PowerBi/ConfigureReadAccess.md @@ -1,11 +1,11 @@ -# Demostration: How to Configure Read Access +# Demostration: How to Configure Read Access Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ----------------------------------------- @@ -43,18 +43,16 @@ Last updated: 2025-04-16 -## Overview +## Overview **Create a Fabric Capacity**: Follow the prompts to configure and create the capacity. image - ## Viewer Role in Fabric Workspaces > `Fabric Workspaces` in Microsoft Fabric are `collaborative environments where users can manage, analyze, and visualize data`. These workspaces integrate various data services and tools, providing a `unified platform for data professional`s to work together - | **Capability** | **Description** | |------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **View All Content** | - Users can view dashboards, reports, workbooks, and other content within the workspace.
- This includes content created by other users, enabling collaboration and shared insights. | @@ -82,10 +80,10 @@ Last updated: 2025-04-16 > Semantic Model: Provides a logical description of an analytical domain using business-friendly terminology and metrics. Capabilities: - - Data Representation: Organizes data into a star schema with facts and dimensions. - - Business Logic: Inherits business logic from parent lakehouses or warehouses. - - Visualization: Supports creating Power BI reports and dashboards for visual analysis. +- Data Representation: Organizes data into a star schema with facts and dimensions. +- Business Logic: Inherits business logic from parent lakehouses or warehouses. +- Visualization: Supports creating Power BI reports and dashboards for visual analysis. image @@ -98,27 +96,28 @@ Capabilities: image - ## SQL Analytics Endpoint in Fabric > Lakehouse: A data architecture platform for storing, managing, and analyzing both structured and unstructured data. Capabilities: - - Data Storage: Combines the capabilities of data lakes and data warehouses. - - SQL Analytics Endpoint: Provides a SQL-based experience for querying data. - - Automatic Table Discovery: Automatically registers and validates tables. + +- Data Storage: Combines the capabilities of data lakes and data warehouses. +- SQL Analytics Endpoint: Provides a SQL-based experience for querying data. +- Automatic Table Discovery: Automatically registers and validates tables. > SQL Analytics Endpoint: Allows users to query data in the lakehouse using SQL. Capabilities: - - T-SQL Queries: Supports T-SQL language for querying Delta tables. - - Read-Only Mode: Operates in read-only mode, allowing data analysis without modifying the data. - - Security: Implements SQL security for access control. + +- T-SQL Queries: Supports T-SQL language for querying Delta tables. +- Read-Only Mode: Operates in read-only mode, allowing data analysis without modifying the data. +- Security: Implements SQL security for access control. > Apache Endpoint: Used for real-time data streaming and processing. Capabilities: - - Event Streaming: Streams events to and from Real-Time Intelligence using Apache Kafka. - - Integration: Integrates with event streams to process and route real-time events. - - Scalability: Supports building scalable, real-time data systems. +- Event Streaming: Streams events to and from Real-Time Intelligence using Apache Kafka. +- Integration: Integrates with event streams to process and route real-time events. +- Scalability: Supports building scalable, real-time data systems. image diff --git a/Workloads-Specific/PowerBi/ConfigureWorkspaceApp.md b/Workloads-Specific/PowerBi/ConfigureWorkspaceApp.md index 53a1345..2b02dec 100644 --- a/Workloads-Specific/PowerBi/ConfigureWorkspaceApp.md +++ b/Workloads-Specific/PowerBi/ConfigureWorkspaceApp.md @@ -1,11 +1,11 @@ -# Demostration: How to Configure Workspace App +# Demostration: How to Configure Workspace App Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ------------------------------------------ @@ -18,7 +18,7 @@ Last updated: 2025-04-16 2. Go to [Fabric](https://app.fabric.microsoft.com/), and assign the capacity created to the workspace desired. image
- + image
> Select the `large semantic model only if your model exceeds 10 GB`. If not, use the small model. The large setup is for models up to 10 GB. @@ -48,22 +48,22 @@ Last updated: 2025-04-16 image - - You will see something like this: - +- You will see something like this: + image image - - You can leverage copilot to modify your report: +- You can leverage copilot to modify your report: image - - Once you are ready, save your report: +- Once you are ready, save your report: image - - At this point you will have your `lakehouse`, with your `SQL analytics endpoint`, the `semantic model` and `the report`. - +- At this point you will have your `lakehouse`, with your `SQL analytics endpoint`, the `semantic model` and `the report`. + image 8. A paginated report, can also be created: @@ -89,10 +89,10 @@ Last updated: 2025-04-16 image - - Let's say you want only `viewer` permissions: +- Let's say you want only `viewer` permissions: 1. Need to give access to the lakehouse/sql analytics endpoint: - + image > `Read All SQL Endpoint Data` permission allows users to access and read data from SQL endpoints within the Fabric environment. This permission is typically required for users who need to:
@@ -100,8 +100,6 @@ Last updated: 2025-04-16 > - Access Reports: `View and interact with reports and dashboards that rely on SQL data sources`.
> - Data Analysis: `Perform data analysis and generate insights` using SQL-based data. - - image 2. Make sure the person already have access to the semantic model: diff --git a/Workloads-Specific/PowerBi/CopilotReports.md b/Workloads-Specific/PowerBi/CopilotReports.md index d4c3b5f..e21f51c 100644 --- a/Workloads-Specific/PowerBi/CopilotReports.md +++ b/Workloads-Specific/PowerBi/CopilotReports.md @@ -5,14 +5,14 @@ Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ---------- -> Prerequisites: -> - **Admin Account**: Ensure you have admin privileges in Microsoft Fabric. -> - **Licenses**: You need a paid Fabric capacity (F64 or higher) or Power BI Premium capacity (P1 or higher). - +> Prerequisites: +> +> - **Admin Account**: Ensure you have admin privileges in Microsoft Fabric. +> - **Licenses**: You need a paid Fabric capacity (F64 or higher) or Power BI Premium capacity (P1 or higher).
List of References (Click to expand) @@ -32,14 +32,13 @@ Last updated: 2025-04-16
- -## How to Tenant configuration +## How to Tenant configuration 1. **Sign In**: Log in to Microsoft Fabric using your admin account credentials. 2. **Access Admin Portal**: Go to the Fabric settings and select the Admin portal from the menu. image - + 3. **Tenant Settings**: Navigate to the Tenant settings in the Admin portal. 4. **Enable Copilot**: Use the search feature to locate the Copilot settings. Toggle the switch to enable Copilot in Fabric. @@ -50,8 +49,9 @@ Last updated: 2025-04-16 image ## How to Configure Workspaces + 1. **Workspace Settings**: Ensure that your reports are located in a workspace with either Premium Power BI (P1 and above) or paid Fabric (F64 and above) capacity. - + image 2. **Apply Capacity**: Check your license type in the Workspace settings and apply either Premium capacity or Fabric capacity to the workspace. @@ -59,6 +59,7 @@ Last updated: 2025-04-16 image ## How to Using Copilot in Power BI + 1. **Access Copilot**: Once enabled, users can access Copilot across different workloads in Fabric, including Power BI. 2. **Generate Insights**: Use Copilot to transform and analyze data, generate insights, and create visualizations and reports. diff --git a/Workloads-Specific/PowerBi/HowUseRestAPI.md b/Workloads-Specific/PowerBi/HowUseRestAPI.md index 49117bf..a992aa3 100644 --- a/Workloads-Specific/PowerBi/HowUseRestAPI.md +++ b/Workloads-Specific/PowerBi/HowUseRestAPI.md @@ -1,17 +1,16 @@ -# Demostration: How to Use Power BI REST API +# Demostration: How to Use Power BI REST API Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-16 +Last updated: 2025-04-21 ---------- > The Power BI REST API provides programmatic access to several Power BI resources, enabling automation and embedding of analytics. -
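Before the reference links below, here is a minimal sketch (an illustrative addition, not from the original page) of that programmatic access: it calls the `GET https://api.powerbi.com/v1.0/myorg/groups` endpoint to list the workspaces the caller can reach. The `access_token` placeholder is assumed to be an Azure AD token that already carries Power BI API permissions, the same kind of token the batch example further down uses.

```python
import requests

# Assumed placeholder: an Azure AD access token with Power BI API permissions
access_token = "your-access-token"

headers = {"Authorization": f"Bearer {access_token}"}

# List the workspaces (groups) visible to the caller
response = requests.get("https://api.powerbi.com/v1.0/myorg/groups", headers=headers)
response.raise_for_status()

for workspace in response.json().get("value", []):
    print(workspace["id"], workspace["name"])
```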
List of References (Click to expand) @@ -27,19 +26,17 @@ Last updated: 2025-04-16
-
Table of Contents (Click to expand) - [Overview](#overview) - [How to work around the rate limits](#how-to-work-around-the-rate-limits) - - [Batch Request](#batch-request) - - [Example Implementation in Python](#example-implementation-in-python) + - [Batch Request](#batch-request) + - [Example Implementation in Python](#example-implementation-in-python)
-## Overview - +## Overview > [!IMPORTANT] > There are rate limits for Power BI REST API endpoints. @@ -71,7 +68,7 @@ Last updated: 2025-04-16 > Example of how this works: -```mermaid +```mermaid graph TD A[Client Application] -->|Batch Request| B[Power BI REST API] B -->|Response| A @@ -137,8 +134,6 @@ response = batch_request(access_token, requests) print(response) ``` - -
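Batching keeps the number of calls down, but a busy client can still hit the documented limits. As a complement to the batch example above, here is a minimal, hedged sketch of a retry helper that honors the `Retry-After` header when the service sends one and otherwise backs off exponentially on HTTP 429; `get_with_retry` is an illustrative name, not part of any Power BI SDK.

```python
import time
import requests

def get_with_retry(url: str, access_token: str, max_retries: int = 5):
    """GET a Power BI REST endpoint, backing off whenever the API throttles with HTTP 429."""
    headers = {"Authorization": f"Bearer {access_token}"}
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Prefer the server's Retry-After value (in seconds); otherwise back off exponentially.
        wait_seconds = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait_seconds)
    raise RuntimeError(f"Still throttled after {max_retries} retries: {url}")

# Example usage (token acquisition not shown here):
# datasets = get_with_retry("https://api.powerbi.com/v1.0/myorg/datasets", access_token)
```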

Visitor Count diff --git a/Workloads-Specific/PowerBi/IncrementalRefresh.md b/Workloads-Specific/PowerBi/IncrementalRefresh.md index 1c81a0c..a6ce326 100644 --- a/Workloads-Specific/PowerBi/IncrementalRefresh.md +++ b/Workloads-Specific/PowerBi/IncrementalRefresh.md @@ -1,11 +1,11 @@ -# Power Bi: Incremental Refresh for Reporting - Overview +# Power Bi: Incremental Refresh for Reporting - Overview Costa Rica [![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) [brown9804](https://github.com/brown9804) -Last updated: 2025-04-15 +Last updated: 2025-04-21 ---------- @@ -28,16 +28,15 @@ Last updated: 2025-04-15 - [Overview](#overview) - [How the VertiPaq Engine Works](#how-the-vertipaq-engine-works) - [How to create a unique key](#how-to-create-a-unique-key) - - [Best Practices for Creating Unique Keys in Power BI](#best-practices-for-creating-unique-keys-in-power-bi) - - [Strategies to Avoid High Cardinality in Power BI](#strategies-to-avoid-high-cardinality-in-power-bi) + - [Best Practices for Creating Unique Keys in Power BI](#best-practices-for-creating-unique-keys-in-power-bi) + - [Strategies to Avoid High Cardinality in Power BI](#strategies-to-avoid-high-cardinality-in-power-bi) - [Steps to Change a Column Type to Date in Power BI](#steps-to-change-a-column-type-to-date-in-power-bi) +## Overview -## Overview - -> Allows Power BI to refresh only the data that has changed or is new since the last refresh, rather than refreshing the entire dataset. Particularly useful for large datasets, reducing processing and transfer times. +> Allows Power BI to refresh only the data that has changed or is new since the last refresh, rather than refreshing the entire dataset. Particularly useful for large datasets, reducing processing and transfer times. | **Aspect** | **Details** | |---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -113,7 +112,7 @@ Last updated: 2025-04-15 - **Optimize Data Model**: Ensure your data model is well-structured, preferably using a star schema, to improve performance and make troubleshooting easier. - **Monitor Performance**: Keep an eye on performance metrics to identify any bottlenecks or issues related to data transformations and loading. Regular monitoring can help you catch and address issues before they impact your reports and dashboards. -## How to create a unique key +## How to create a unique key > By concatenating multiple columns using DAX (Data Analysis Expressions) in Power BI @@ -125,11 +124,12 @@ Last updated: 2025-04-15 UniqueKey = [column1] & "_" & [column2] & "_" & [column3] ``` - For example: + For example: ```DAX UniqueKey = [DateTimeColumn] & "_" & [CallerID] & "_" & [CallID] ``` + - **Apply the Changes**: After entering the formula, press Enter to create the new column. In this DAX formula example, it concatenates the `DateTimeColumn`, `CallerID`, and `CallID` columns with underscores to create a unique key for each record. ### Best Practices for Creating Unique Keys in Power BI