diff --git a/collate-ai/index.mdx b/collate-ai/index.mdx index 2d13939f..0ad2337e 100644 --- a/collate-ai/index.mdx +++ b/collate-ai/index.mdx @@ -30,6 +30,9 @@ Collate AI brings generative AI to metadata management, making it effortless to Automatically create Data Quality tests to ensure health and stability. + + Understand and optimize the queries running in your sources. + Streamline metadata ingestion capabilities for every service in your Data Platform. diff --git a/collate-ai/sql-agent.mdx b/collate-ai/sql-agent.mdx index 05d35ed3..27afb201 100644 --- a/collate-ai/sql-agent.mdx +++ b/collate-ai/sql-agent.mdx @@ -14,6 +14,44 @@ Quickly understand and improve the performance of the queries that are executed - Query explanation and breakdown - Query performance improvements +Ask Questions related to SQL Queries + +## Setup Instructions + +1. Open a Database + +- Navigate to the database section in the platform. + +- Open any database. + +Open a Database + +Example: In our case, we opened Customers. + +2. Go to the Queries Tab + +Inside the database view, locate and click the Queries tab. + +Go to the Queries Tab + +3. Click the Ask Collate Icon + +In the Queries tab, click on the Ask Collate icon. + +Click the Get Collate Icon + +4. Redirect to Ask Collate + +After clicking the icon, you will be redirected to the Ask Collate interface. + +Ask Questions related to SQL Queries + +5. Ask Anything Related to SQL Queries + +In Ask Collate, type your question or request. + +You can ask anything related to SQL query creation, optimization, explanations, or filters. + update or suggest tiers ## Setup Instructions diff --git a/docs.json b/docs.json index 0008986b..3aa59a1b 100644 --- a/docs.json +++ b/docs.json @@ -71,7 +71,7 @@ "group": "Hybrid SaaS", "pages": [ "getting-started/day-1/hybrid-saas/index", - "getting-started/day-1/hybrid-saas/hybrid-ingestion-runner" + "getting-started/day-1/hybrid-saas/local-ingestion-agent" ] } ] @@ -1131,6 +1131,8 @@ "collate-ai/how-to-use-collate-ai", "collate-ai/tier-agent", "collate-ai/quality-agent", + "collate-ai/documentation-agent", + "/collate-ai/quality-agent" "collate-ai/documentation-agent" ] }, diff --git a/getting-started/day-1/hybrid-saas/airflow.mdx b/getting-started/day-1/hybrid-saas/airflow.mdx deleted file mode 100644 index 78e2aef7..00000000 --- a/getting-started/day-1/hybrid-saas/airflow.mdx +++ /dev/null @@ -1,270 +0,0 @@ ---- -title: Run the ingestion from your Airflow -description: Learn how to run Collate Ingestion workflows using Python, Docker, or Virtualenv operators in Airflow for secure and flexible metadata ingestion. -slug: /getting-started/day-1/hybrid-saas/airflow -sidebarTitle: Airflow -collate: true ---- - -import ExternalIngestion from '/snippets/deployment/external-ingestion.mdx' -import RunConnectorsClass from '/snippets/deployment/run-connectors-class.mdx' - - - -# Run the ingestion from your Airflow - -We can use Airflow in different ways: - -1. We can [extract metadata](/connectors/pipeline/airflow) from it, -2. And we can [connect it to the OpenMetadata UI](/deployment/ingestion/openmetadata) to deploy Workflows automatically. - -In this guide, we will show how to host the ingestion DAGs in your Airflow directly. - -1. [Python Operator](#python-operator) -2. [Docker Operator](#docker-operator) -3. 
[Python Virtualenv Operator](#python-virtualenv-operator) - -## Python Operator - -### Prerequisites - -Building a DAG using the `PythonOperator` requires devs to install the `openmetadata-ingestion` package in your Airflow's -environment. This is a comfortable approach if you have access to the Airflow host and can freely handle -dependencies. - -Installing the dependencies' is as easy as: - -``` -pip3 install openmetadata-ingestion[]==x.y.z -``` - -Where `x.y.z` is the version of the OpenMetadata ingestion package. Note that the version needs to match the server version. If we are using the server at 1.1.0, then the ingestion package needs to also be 1.1.0. - -The plugin parameter is a list of the sources that we want to ingest. An example would look like this `openmetadata-ingestion[mysql,snowflake,s3]==1.1.0`. - -### Example - -A DAG deployed using a Python Operator would then look like follows - -For example, preparing a metadata ingestion DAG with this operator will look as follows: - -```python -import yaml -from datetime import timedelta -from airflow import DAG - -try: - from airflow.operators.python import PythonOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonOperator - -from metadata.config.common import load_config_file -from metadata.workflow.metadata import MetadataWorkflow - - -from airflow.utils.dates import days_ago - -default_args = { - "owner": "user_name", - "email": ["username@org.com"], - "email_on_failure": False, - "retries": 3, - "retry_delay": timedelta(minutes=5), - "execution_timeout": timedelta(minutes=60) -} - -config = """ - -""" - -def metadata_ingestion_workflow(): - workflow_config = yaml.safe_load(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - -with DAG( - "sample_data", - default_args=default_args, - description="An example DAG which runs a OpenMetadata ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=False, - schedule_interval='*/5 * * * *', - catchup=False, -) as dag: - ingest_task = PythonOperator( - task_id="ingest_using_recipe", - python_callable=metadata_ingestion_workflow, - ) -``` - -Note how we are preparing the `PythonOperator` by passing the `python_callable=metadata_ingestion_workflow` as -an argument, where `metadata_ingestion_workflow` is a function that instantiates the `Workflow` class and runs -the whole process. - -The drawback here? You need to install some requirements, which is not always possible. Here you have two alternatives, -either you use the `PythonVirtualenvOperator`, or read below on how to run the ingestion with the `DockerOperator`. - - - -## Docker Operator - -For this operator, we can use the `openmetadata/ingestion-base` image. -This is useful to prepare DAGs without any installation required on the environment, although it needs for the host -to have access to the Docker commands. - -### Prerequisites - -The airflow host should be able to run Docker commands. - -For example, if you are running Airflow in Docker Compose, that can be achieved preparing a volume mapping the -`docker.sock` file with 600 permissions. 
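If the user running Airflow cannot read and write the mounted socket, a common (if permissive) workaround is to relax the socket permissions on the host. This is only a sketch; weigh the security implications of opening the Docker socket to every local user before applying it.

```bash
# Give all local users read/write access to the Docker socket
# (matches the 666 permissions referenced in the comments below)
sudo chmod 666 /var/run/docker.sock
```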
- -### Example - -```yaml -volumes: - - /var/run/docker.sock:/var/run/docker.sock:z # Need 666 permissions to run DockerOperator -``` - -Then, preparing a DAG looks like this: - -```python -from datetime import datetime - -from airflow import models -from airflow.providers.docker.operators.docker import DockerOperator - - -config = """ - -""" - -with models.DAG( - "ingestion-docker-operator", - schedule_interval='*/5 * * * *', - start_date=datetime(2021, 1, 1), - catchup=False, - tags=["OpenMetadata"], -) as dag: - DockerOperator( - command="python main.py", - image="openmetadata/ingestion-base:0.13.2", - environment={"config": config, "pipelineType": "metadata"}, - docker_url="unix://var/run/docker.sock", # To allow to start Docker. Needs chmod 666 permissions - tty=True, - auto_remove="True", - network_mode="host", # To reach the OM server - task_id="ingest", - dag=dag, - ) -``` - - - -Make sure to tune out the DAG configurations (`schedule_interval`, `start_date`, etc.) as your use case requires. - - - -Note that the example uses the image `openmetadata/ingestion-base:0.13.2`. Update that accordingly for higher version -once they are released. Also, the image version should be aligned with your OpenMetadata server version to avoid -incompatibilities. - -Another important point here is making sure that the Airflow will be able to run Docker commands to create the task. -As our example was done with Airflow in Docker Compose, that meant setting `docker_url="unix://var/run/docker.sock"`. - -The final important elements here are: -- `command="python main.py"`: This does not need to be modified, as we are shipping the `main.py` script in the - image, used to trigger the workflow. -- `environment={"config": config, "pipelineType": "metadata"}`: Again, in most cases you will just need to update - the `config` string to point to the right connector. - -Other supported values of `pipelineType` are `usage`, `lineage`, `profiler`, `dataInsight`, `elasticSearchReindex`, `dbt`, `application` or `TestSuite`. Pass the required flag -depending on the type of workflow you want to execute. Make sure that the YAML config reflects what ingredients -are required for your Workflow. - -## Python Virtualenv Operator - -You can use the [PythonVirtualenvOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) -when working with an Airflow installation where: -1. You don't want to install dependencies directly on your Airflow host, -2. You don't have any Docker runtime, -3. Your Airflow's Python version is not supported by `openmetadata-ingestion`. - -### Prerequisites - -As stated in Airflow's [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator), -your Airflow host should have the `virtualenv` package installed. - -Moreover, if you're planning to use a different Python Version in the `virtualenv` than the one your Airflow uses, -you will need that version to be installed in the Airflow host. - -For example, if we use Airflow running with Python 3.7 but want the `virtualenv` to use Python 3.9, we need to install -in the host the following packages: `gcc python3.9-dev python3.9-distutils`. 
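As a reference only, and assuming a Debian or Ubuntu based Airflow host, installing these prerequisites could look like the following. Adapt the package manager to your distribution; non-default Python versions may also require an extra repository (e.g., deadsnakes on Ubuntu).

```bash
# Build tooling and the alternate Python version required by the virtualenv
sudo apt-get update
sudo apt-get install -y gcc python3.9-dev python3.9-distutils

# The virtualenv package itself must be available to Airflow
pip3 install virtualenv
```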
- -### Example - -In this example, we will be using a different Python version that the one Airflow is running: - -```python -from datetime import timedelta -from airflow import DAG - -try: - from airflow.operators.python import PythonVirtualenvOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonVirtualenvOperator - -from airflow.utils.dates import days_ago - -default_args = { - "owner": "user_name", - "email": ["username@org.com"], - "email_on_failure": False, - "retries": 3, - "retry_delay": timedelta(seconds=10), - "execution_timeout": timedelta(minutes=60), -} - - -def metadata_ingestion_workflow(): - from metadata.workflow.metadata import MetadataWorkflow - - import yaml - config = """ - ... - """ - - workflow_config = yaml.safe_load(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - - -with DAG( - "ingestion_dag", - default_args=default_args, - description="An example DAG which runs a OpenMetadata ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=True, - catchup=False, -) as dag: - ingest_task = PythonVirtualenvOperator( - task_id="ingest_using_recipe", - requirements=[ - 'openmetadata-ingestion[mysql]~=1.3.0', # Specify any additional Python package dependencies - ], - system_site_packages=False, # Set to True if you want to include system site-packages in the virtual environment - python_version="3.9", # Remove if necessary - python_callable=metadata_ingestion_workflow - ) -``` - -Note that the function needs to follow this rules: -- The function must be defined using def, and not be part of a class. -- All imports must happen inside the function -- No variables outside of the scope may be referenced. diff --git a/getting-started/day-1/hybrid-saas/credentials.mdx b/getting-started/day-1/hybrid-saas/credentials.mdx deleted file mode 100644 index 5841d0a7..00000000 --- a/getting-started/day-1/hybrid-saas/credentials.mdx +++ /dev/null @@ -1,397 +0,0 @@ ---- -title: Managing Credentials Securely | Collate Ingestion Best Practices -description: Learn secure ways to manage credentials in Collate ingestion workflows using environment variables, Airflow connections, GitHub secrets, and existing services. -slug: /getting-started/day-1/hybrid-saas/credentials -sidebarTitle: Credentials -collate: true ---- - -# Managing Credentials - -## Existing Services - -What this means is that once a service is created, the only way to update its connection credentials is via -the **UI** or directly running an API call. This prevents the scenario where a new YAML config is created, using a name -of a service that already exists, but pointing to a completely different source system. - -One of the main benefits of this approach is that if an admin in our organisation creates the service from the UI, -then we can prepare any Ingestion Workflow without having to pass the connection details. 
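If you want to confirm that a service already exists before relying on its stored connection, you can look it up through the API. This is only a sketch: it assumes a database service (like the Athena example below) and a bot JWT token exported as an environment variable.

```bash
# Returns the service definition (HTTP 404 if it has not been created yet)
curl -s -H "Authorization: Bearer $JWT_TOKEN" \
  "https://<your-host>/api/v1/services/databaseServices/name/my_athena_service"
```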
- -For example, for an Athena YAML, instead of requiring the full set of credentials as below: - -```yaml -source: - type: athena - serviceName: my_athena_service - serviceConnection: - config: - type: Athena - awsConfig: - awsAccessKeyId: KEY - awsSecretAccessKey: SECRET - awsRegion: us-east-2 - s3StagingDir: s3 directory for datasource - workgroup: workgroup name - sourceConfig: - type: DatabaseMetadata - config: - markDeletedTables: true - includeTables: true - includeViews: true -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: - authProvider: -``` - -We can use a simplified version: - -```yaml -source: - type: athena - serviceName: my_athena_service - sourceConfig: - config: - type: DatabaseMetadata - markDeletedTables: true - includeTables: true - includeViews: true -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: - authProvider: -``` - -The workflow will then dynamically pick up the service connection details for `my_athena_service` and ingest -the metadata accordingly. - -If instead, you want to have the full source of truth in your DAGs or processes, you can keep reading on different -ways to secure the credentials in your environment and not have them at plain sight. - -## Securing Credentials - - - -Note that these are just a few examples. Any secure and automated approach to retrieve a string would work here, -as our only requirement is to pass the string inside the YAML configuration. - - - -When running Workflow with the CLI or your favourite scheduler, it's safer to not have the services' credentials -visible. For the CLI, the ingestion package can load sensitive information from environment variables. - -For example, if you are using the [Glue](/connectors/database/glue) connector you could specify the -AWS configurations as follows in the case of a JSON config file - -```json -[...] -"awsConfig": { - "awsAccessKeyId": "${AWS_ACCESS_KEY_ID}", - "awsSecretAccessKey": "${AWS_SECRET_ACCESS_KEY}", - "awsRegion": "${AWS_REGION}", - "awsSessionToken": "${AWS_SESSION_TOKEN}" -}, -[...] -``` - -Or - -```yaml -[...] -awsConfig: - awsAccessKeyId: '${AWS_ACCESS_KEY_ID}' - awsSecretAccessKey: '${AWS_SECRET_ACCESS_KEY}' - awsRegion: '${AWS_REGION}' - awsSessionToken: '${AWS_SESSION_TOKEN}' -[...] -``` - -for a YAML configuration. - -### AWS Credentials - -The AWS Credentials are based on the following [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/security/credentials/awsCredentials.json). -Note that the only required field is the `awsRegion`. This configuration is rather flexible to allow installations under AWS -that directly use instance roles for permissions to authenticate to whatever service we are pointing to without having to -write the credentials down. - -#### AWS Vault - -If using [aws-vault](https://github.com/99designs/aws-vault), it gets a bit more involved to run the CLI ingestion as the credentials are not globally available in the terminal. -In that case, you could use the following command after setting up the ingestion configuration file: - -```bash -aws-vault exec -- $SHELL -c 'metadata ingest -c ' -``` - -### GCP Credentials - -The GCP Credentials are based on the following [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/security/credentials/gcpCredentials.json). 
-These are the fields that you can export when preparing a Service Account. - -Once the account is created, you can see the fields in the exported JSON file from: - -``` -IAM & Admin > Service Accounts > Keys -``` - -You can validate the whole Google service account setup [here](/deployment/security/google). - -### Using GitHub Actions Secrets - -If running the ingestion in a GitHub Action, you can create [encrypted secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets) -to store sensitive information such as users and passwords. - -In the end, we'll map these secrets to environment variables in the process, that we can pick up with `os.getenv`, for example: - -```python -import os -import yaml - -from metadata.workflow.metadata import MetadataWorkflow - - - -CONFIG = f""" -source: - type: snowflake - serviceName: snowflake_from_github_actions - serviceConnection: - config: - type: Snowflake - username: {os.getenv('SNOWFLAKE_USERNAME')} -... -""" - - -def run(): - workflow_config = yaml.safe_load(CONFIG) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - - -if __name__ == "__main__": - run() -``` - -Make sure to update your step environment to pass the secrets as environment variables: - -```yaml -- name: Run Ingestion - run: | - source env/bin/activate - python ingestion-github-actions/snowflake_ingestion.py - # Add the env vars we need to load the snowflake credentials - env: - SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }} - SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }} - SNOWFLAKE_WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }} - SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }} -``` - -You can see a full demo setup [here](https://github.com/open-metadata/openmetadata-demo/tree/main/ingestion-github-actions). - -### Using Airflow Connections - -In any connector page, you might have seen an example on how to build a DAG to run the ingestion with Airflow -(e.g., [Athena](/connectors/database/athena/yaml)). - -A possible approach to retrieving sensitive information from Airflow would be using Airflow's -[Connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html). Note that these -connections can be stored as environment variables, to Airflow's underlying DB or to multiple external services such as -Hashicorp Vault. Note that for external systems, you'll need to provide the necessary package and configure the -[Secrets Backend](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html). -The best way to choose how to store these credentials is to go through Airflow's [docs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/connections.html). - -#### Example - -Let's go over an example on how to create a connection to extract data from MySQL and how a DAG would look like -afterwards. 
- -#### Step 1 - Create the Connection - -From our Airflow host, (e.g., `docker exec -it openmetadata_ingestion bash` if testing in Docker), you can run: - -```bash -airflow connections add 'my_mysql_db' \ - --conn-uri 'mysql+pymysql://openmetadata_user:openmetadata_password@mysql:3306/openmetadata_db' -``` - -You will see an output like - -``` -Successfully added `conn_id`=my_mysql_db : mysql+pymysql://openmetadata_user:openmetadata_password@mysql:3306/openmetadata_db -``` - -Checking the credentials from the Airflow UI, we will see: - - -Airflow Connection - -#### Step 2 - Understanding the shape of a Connection - -In the same host, we can open a Python shell to explore the Connection object with some more details. To do so, we first -need to pick up the connection from Airflow. We will use the `BaseHook` for that as the connection is not stored -in any external system. - -```python -from airflow.hooks.base import BaseHook - -# Retrieve the connection -connection = BaseHook.get_connection("my_mysql_db") - -# Access the connection details -connection.host # 'mysql' -connection.port # 3306 -connection.login # 'openmetadata_user' -connection.password # 'openmetadata_password' -``` - -Based on this information, we now know how to prepare the DAG! - -#### Step 3 - Write the DAG - -A full example on how to write a DAG to ingest data from our Connection can look like this: - -```python -import pathlib -import yaml -from datetime import timedelta -from airflow import DAG -from airflow.utils.dates import days_ago - -try: - from airflow.operators.python import PythonOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonOperator - -from metadata.config.common import load_config_file -from metadata.workflow.metadata import MetadataWorkflow - - - -# Import the hook -from airflow.hooks.base import BaseHook - -# Retrieve the connection -connection = BaseHook.get_connection("my_mysql_db") - -# Use the connection details when setting the YAML -# Note how we escaped the braces as {{}} to not be parsed by the f-string -config = f""" -source: - type: mysql - serviceName: mysql_from_connection - serviceConnection: - config: - type: Mysql - username: {connection.login} - password: {connection.password} - hostPort: {connection.host}:{connection.port} - # databaseSchema: schema - sourceConfig: - config: - markDeletedTables: true - includeTables: true - includeViews: true -sink: - type: metadata-rest - config: {{}} -workflowConfig: - openMetadataServerConfig: - hostPort: "" - authProvider: "" -""" - -def metadata_ingestion_workflow(): - workflow_config = yaml.safe_load(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - -with DAG( - "mysql_connection_ingestion", - description="An example DAG which runs a OpenMetadata ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=False, - schedule_interval='*/5 * * * *', - catchup=False, -) as dag: - ingest_task = PythonOperator( - task_id="ingest_using_recipe", - python_callable=metadata_ingestion_workflow, - ) -``` - -#### Option B - Reuse an existing Service - -Following the explanation at the beginning of this doc, we can reuse the credentials from an existing service -in a DAG as well, and just omit the `serviceConnection` YAML entries: - -```python -import pathlib -import yaml -from datetime import timedelta -from airflow import DAG -from airflow.utils.dates import days_ago - -try: - from 
airflow.operators.python import PythonOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonOperator - -from metadata.config.common import load_config_file -from metadata.workflow.metadata import MetadataWorkflow - - - -config = """ -source: - type: mysql - serviceName: existing_mysql_service - sourceConfig: - config: - markDeletedTables: true - includeTables: true - includeViews: true -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: "" - authProvider: "" -""" - -def metadata_ingestion_workflow(): - workflow_config = yaml.safe_load(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - -with DAG( - "mysql_connection_ingestion", - description="An example DAG which runs a OpenMetadata ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=False, - schedule_interval='*/5 * * * *', - catchup=False, -) as dag: - ingest_task = PythonOperator( - task_id="ingest_using_recipe", - python_callable=metadata_ingestion_workflow, - ) -``` diff --git a/getting-started/day-1/hybrid-saas/gcp-composer.mdx b/getting-started/day-1/hybrid-saas/gcp-composer.mdx deleted file mode 100644 index 00c886a1..00000000 --- a/getting-started/day-1/hybrid-saas/gcp-composer.mdx +++ /dev/null @@ -1,181 +0,0 @@ ---- -title: Run the ingestion from GCP Composer -description: Learn to run Collate ingestion in GCP Composer using Python or KubernetesPod operators. Install packages or run containers for secure, scalable ingestion. -slug: /getting-started/day-1/hybrid-saas/gcp-composer -sidebarTitle: GCP Composer -collate: true ---- - -import ExternalIngestion from '/snippets/deployment/external-ingestion.mdx' -import RunConnectorsClass from '/snippets/deployment/run-connectors-class.mdx' - - - -# Run the ingestion from GCP Composer - -## Requirements - -This approach has been last tested against: -- Composer version 2.5.4 -- Airflow version 2.6.3 - -It also requires the ingestion package to be at least `openmetadata-ingestion==1.3.1.0`. - -## Using the Python Operator - -The most comfortable way to run the metadata workflows from GCP Composer is directly via a `PythonOperator`. Note that -it will require you to install the packages and plugins directly on the host. - -### Install the Requirements - -In your environment you will need to install the following packages: - -- `openmetadata-ingestion[]==x.y.z`. -- `sqlalchemy==1.4.27`: This is needed to align OpenMetadata version with the Composer internal requirements. - -Where `x.y.z` is the version of the OpenMetadata ingestion package. Note that the version needs to match the server version. If we are using the server at 1.1.0, then the ingestion package needs to also be 1.1.0. - -The plugin parameter is a list of the sources that we want to ingest. An example would look like this `openmetadata-ingestion[mysql,snowflake,s3]==1.1.0`. - -### Prepare the DAG! - -Note that this DAG is a usual connector DAG, just using the Airflow service with the `Backend` connection. 
- -As an example of a DAG pushing data to OpenMetadata under Google SSO, we could have: - -```python -from datetime import timedelta - -import yaml -from airflow import DAG - -try: - from airflow.operators.python import PythonOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonOperator - -from airflow.utils.dates import days_ago - -from metadata.workflow.metadata import MetadataWorkflow - - - -default_args = { - "owner": "user_name", - "email": ["username@org.com"], - "email_on_failure": False, - "retries": 3, - "retry_delay": timedelta(minutes=5), - "execution_timeout": timedelta(minutes=60), -} - -CONFIG = """ -... -""" - - -def metadata_ingestion_workflow(): - workflow_config = yaml.safe_load(CONFIG) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - - -with DAG( - "airflow_metadata_extraction", - default_args=default_args, - description="An example DAG which pushes Airflow data to OM", - start_date=days_ago(1), - is_paused_upon_creation=True, - schedule_interval="*/5 * * * *", - catchup=False, -) as dag: - ingest_task = PythonOperator( - task_id="ingest_using_recipe", - python_callable=metadata_ingestion_workflow, - ) -``` - - - -## Using the Kubernetes Pod Operator - -In this second approach we won't need to install absolutely anything to the GCP Composer environment. Instead, -we will rely on the `KubernetesPodOperator` to use the underlying k8s cluster of Composer. - -Then, the code won't directly run using the hosts' environment, but rather inside a container that we created -with only the `openmetadata-ingestion` package. - -**Note:** This approach only has the `openmetadata/ingestion-base` ready from version 0.12.1 or higher! - -### Prepare the DAG! - -```python -from datetime import datetime - -from airflow import models -from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator - - -CONFIG = """ -... -""" - - -with models.DAG( - "ingestion-k8s-operator", - schedule_interval="@once", - start_date=datetime(2021, 1, 1), - catchup=False, - tags=["OpenMetadata"], -) as dag: - KubernetesPodOperator( - task_id="ingest", - name="ingest", - cmds=["python", "main.py"], - image="openmetadata/ingestion-base:0.13.2", - namespace='default', - env_vars={"config": CONFIG, "pipelineType": "metadata"}, - dag=dag, - ) -``` - -Some remarks on this example code: - -#### Kubernetes Pod Operator - -You can name the task as you want (`task_id` and `name`). The important points here are the `cmds`, this should not -be changed, and the `env_vars`. The `main.py` script that gets shipped within the image will load the env vars -as they are shown, so only modify the content of the config YAML, but not this dictionary. - -Note that the example uses the image `openmetadata/ingestion-base:0.13.2`. Update that accordingly for higher version -once they are released. Also, the image version should be aligned with your OpenMetadata server version to avoid -incompatibilities. - -```python -KubernetesPodOperator( - task_id="ingest", - name="ingest", - cmds=["python", "main.py"], - image="openmetadata/ingestion-base:0.13.2", - namespace='default', - env_vars={"config": config, "pipelineType": "metadata"}, - dag=dag, -) -``` - -You can find more information about the `KubernetesPodOperator` and how to tune its configurations -[here](https://cloud.google.com/composer/docs/how-to/using/using-kubernetes-pod-operator). 
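As an illustration of that tuning, and only as a sketch with values you would adapt to your cluster, the same task could run a different workflow (here a profiler, with `PROFILER_CONFIG` being a hypothetical YAML string analogous to `CONFIG` above) while giving the pod more time to start and streaming its logs:

```python
KubernetesPodOperator(
    task_id="profile",
    name="profile",
    cmds=["python", "main.py"],          # Same entrypoint shipped in the image
    image="openmetadata/ingestion-base:0.13.2",
    namespace="default",
    env_vars={"config": PROFILER_CONFIG, "pipelineType": "profiler"},
    startup_timeout_seconds=300,         # Allow extra time for image pull and pod scheduling
    get_logs=True,                       # Surface pod logs in the Airflow task logs
    dag=dag,
)
```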
- -Note that depending on the kind of workflow you will be deploying, the YAML configuration will need to updated following -the official OpenMetadata docs, and the value of the `pipelineType` configuration will need to hold one of the following values: - -- `metadata` -- `usage` -- `lineage` -- `profiler` -- `TestSuite` - -Which are based on the `PipelineType` [JSON Schema definitions](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/ingestionPipelines/ingestionPipeline.json#L14) diff --git a/getting-started/day-1/hybrid-saas/github-actions.mdx b/getting-started/day-1/hybrid-saas/github-actions.mdx deleted file mode 100644 index 7d8590b0..00000000 --- a/getting-started/day-1/hybrid-saas/github-actions.mdx +++ /dev/null @@ -1,184 +0,0 @@ ---- -title: Run the ingestion from GitHub Actions -description: Automate Collate ingestion with GitHub Actions. Set up YAML configs, secure credentials, schedule workflows, and send Slack alerts on failure. -slug: /getting-started/day-1/hybrid-saas/github-actions -sidebarTitle: GitHub Actions -collate: true ---- - -import ExternalIngestion from '/snippets/deployment/external-ingestion.mdx' - - - -# Run the ingestion from GitHub Actions - - - -You can find a fully working demo of this setup [here](https://github.com/open-metadata/openmetadata-demo/tree/main/ingestion-github-actions). - - - -The process to run the ingestion from GitHub Actions is the same as running it from anywhere else. -1. Get the YAML configuration, -2. Prepare the Python Script -3. Schedule the Ingestion - -## 1. YAML Configuration - -For any connector and workflow, you can pick it up from its doc [page](/connectors). - -## 2. Prepare the Python Script - -In the GitHub Action we will just be triggering a custom Python script. This script will: - -- Load the secrets from environment variables (we don't want any security risks!), -- Prepare the Workflow class from the Ingestion Framework that contains all the logic on how to run the metadata ingestion, -- Execute the workflow and log the results. - -- A simplified version of such script looks like follows: - -```python -import os -import yaml - -from metadata.workflow.metadata import MetadataWorkflow - - - -CONFIG = f""" -source: - type: snowflake - serviceName: snowflake_from_github_actions - serviceConnection: - config: - type: Snowflake - username: {os.getenv('SNOWFLAKE_USERNAME')} -... -""" - - -def run(): - workflow_config = yaml.safe_load(CONFIG) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - - -if __name__ == "__main__": - run() -``` - -Note how we are securing the credentials using environment variables. You will need to create these env vars in your -GitHub repository. Follow the GitHub [docs](https://docs.github.com/en/actions/security-guides/encrypted-secrets) for -more information on how to create and use Secrets. - -In the end, we'll map these secrets to environment variables in the process, that we can pick up with `os.getenv`. - -## 3. Schedule the Ingestion - -Now that we have all the ingredients, we just need to build a simple GitHub Actions with the following steps: - -- Install Python -- Prepare virtual environment with the openmetadata-ingestion package -- Run the script! - -- It is as simple as this. Internally the function run we created will be sending the results to the OpenMetadata server, so there's nothing else we need to do here. 
- -A first version of the action could be: - -```yaml -name: ingest-snowflake -on: - # Any expression you'd like here - schedule: - - cron: '0 */2 * * *' - # If you also want to execute it manually - workflow_dispatch: - -permissions: - id-token: write - contents: read - -jobs: - ingest: - runs-on: ubuntu-latest - - steps: - # Pick up the repository code, where the script lives - - name: Checkout - uses: actions/checkout@v3 - - # Prepare Python in the GitHub Agent - - name: Set up Python 3.9 - uses: actions/setup-python@v4 - with: - python-version: 3.9 - - # Install the dependencies. Make sure that the client version matches the server! - - name: Install Deps - run: | - python -m venv env - source env/bin/activate - pip install "openmetadata-ingestion[snowflake]==1.0.2.0" - - - name: Run Ingestion - run: | - source env/bin/activate - python ingestion-github-actions/snowflake_ingestion.py - # Add the env vars we need to load the snowflake credentials - env: - SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }} - SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }} - SNOWFLAKE_WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }} - SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }} - SBX_JWT: ${{ secrets.SBX_JWT }} -``` - -## [Optional] - Getting Alerts in Slack - -A very interesting option that GitHub Actions provide is the ability to get alerts in Slack after our action fails. - -This can become specially useful if we want to be notified when our metadata ingestion is not working as expected. -We can use the same setup as above with a couple of slight changes: - -```yaml - - name: Run Ingestion - id: ingestion - continue-on-error: true - run: | - source env/bin/activate - python ingestion-github-actions/snowflake_ingestion.py - # Add the env vars we need to load the snowflake credentials - env: - SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }} - SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }} - SNOWFLAKE_WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }} - SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }} - SBX_JWT: ${{ secrets.SBX_JWT }} - - - name: Slack on Failure - if: steps.ingestion.outcome != 'success' - uses: slackapi/slack-github-action@v1.23.0 - with: - payload: | - { - "text": "🔥 Metadata ingestion failed! 🔥" - } - env: - SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }} - SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK - - - name: Force failure - if: steps.ingestion.outcome != 'success' - run: | - exit 1 -``` - -We have: - -- Marked the `Run Ingestion` step with a specific `id` and with `continue-on-error: true`. If anything happens, we don't want the action to stop. -- We added a step with `slackapi/slack-github-action@v1.23.0`. By passing a Slack Webhook link via a secret, we can send any payload to a -- specific Slack channel. You can find more info on how to set up a Slack Webhook [here](https://api.slack.com/messaging/webhooks). -- If our `ingestion` step fails, we still want to mark the action as failed, so we are forcing the failure we skipped before. 
diff --git a/getting-started/day-1/hybrid-saas/hybrid-ingestion-runner.mdx b/getting-started/day-1/hybrid-saas/hybrid-ingestion-runner.mdx deleted file mode 100644 index 09d88a90..00000000 --- a/getting-started/day-1/hybrid-saas/hybrid-ingestion-runner.mdx +++ /dev/null @@ -1,79 +0,0 @@ ---- -title: Hybrid Ingestion Runner | Secure Metadata Workflows in Your Cloud -description: Learn to configure and manage Hybrid Ingestion Runner to securely execute workflows in your cloud using AWS, Azure, or GCP secrets—without exposing credentials. -slug: /getting-started/day-1/hybrid-saas/hybrid-ingestion-runner -sidebarTitle: Hybrid Ingestion Runner -collate: true ---- - -# Hybrid Ingestion Runner - -The **Hybrid Ingestion Runner** is a component designed to enable Collate customers operating in hybrid environments to securely execute ingestion workflows within their own cloud infrastructure. In this setup, your SaaS instance is hosted on Collate’s cloud, while the workflows are going to deployed and executed within your private cloud. The Hybrid Runner acts as a bridge between these two environments, allowing ingestion workflows to be triggered and managed remotely—without requiring the customer to share secrets or sensitive credentials with Collate. It securely receives workflow execution requests and orchestrates them locally, maintaining full control and data privacy within the customer’s environment. - -## Prerequisites - -Before setting up the Hybrid Ingestion Runner, ensure the following: - -- Hybrid Runner has been setup. Contact the Collate team for assistance with setting up the Hybrid Runner in your infrastructure. -- Secrets manager configured on your cloud. - -## Configuration Steps for Admins - -Once your DevOps team has installed and configured the Hybrid Runner, follow these steps as a Collate Admin to configure services and manage ingestion workflows. - -### 1. Validate Hybrid Runner Setup - -- Go to **Settings > Preferences > Ingestion Runners** in the Collate UI. -- Look for your runner in the list. -- The status should display as **Connected**. - -> If the runner is not connected, reach out to Collate support. - - - - - -### 2. Create a New Service - -- Navigate to **Settings > Services**. -- Click **+ Add New Service**. -- Fill in the service details. -- In the “Ingestion Runner” dropdown, choose the hybrid runner. - - - -> Even if you're operating in hybrid mode, you can still choose "Collate SaaS Runner" to run the ingestion workflow within Collate's SaaS environment. - -### 3. Manage Secrets Securely - -When executing workflows on your Hybrid environment, you have to use your existing cloud provider's Secrets Manager to store sensitive credentials (like passwords or token), and reference them securely in Collate via the Hybrid Runner. - -Collate never stores or accesses these secrets directly—only the Hybrid Runner retrieves them at runtime from your own infrastructure. - -**Steps:** - -- Create your secret in your Secrets Manager of choice: - - **AWS Secrets Manager** - - **Azure Key Vault** - - **GCP Secret Manager** - -When creating a secret, store the value as-is (e.g., `password123`) without any additional formatting or encoding. The Hybrid Runner will handle the retrieval and decryption of the secret value at runtime. -For example, in AWS Secrets Manager, you can click on `Store a new secret` > `Other type of secret` > `Plaintext`. You need to paste the secret as-is, without any other formatting (such as quotes, JSON, etc.). 
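The same can be done from the command line. As a sketch, assuming the AWS CLI is configured for the account your Hybrid Runner uses, and reusing the example path referenced below:

```bash
# Store the raw value as a plaintext secret, with no quoting or JSON wrapping around it
aws secretsmanager create-secret \
  --name "/my/database/password" \
  --secret-string "password123"
```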
- - - -Finally, in the service connection form in Collate, reference the secret using the `secret:` prefix followed by the full path to your secret. - -📌 **For example, in AWS Secrets Manager**, if your secret is stored at: `/my/database/password`, you would reference it in the service connection form as: - -```yaml -password: secret:/my/database/password -``` - - - -Note that this approach to handling secrets only works for values that are considered secrets in the connection form. - -You can identify these values since they mask the typing and have an icon on the right that toggles showing or hiding the input values. - - diff --git a/getting-started/day-1/hybrid-saas/index.mdx b/getting-started/day-1/hybrid-saas/index.mdx index 2f940bfa..5b912313 100644 --- a/getting-started/day-1/hybrid-saas/index.mdx +++ b/getting-started/day-1/hybrid-saas/index.mdx @@ -1,705 +1,79 @@ --- -title: Hybrid SaaS | Secure Metadata Ingestion for Collate -description: Learn how to securely ingest metadata using the Collate Ingestion Agent in your own infrastructure. Ideal for private networks and hybrid SaaS setups. +title: Hybrid Ingestion Runner | Secure Metadata Workflows in Your Cloud +description: Learn to configure and manage Hybrid Ingestion Runner to securely execute workflows in your cloud using AWS, Azure, or GCP secrets—without exposing credentials. slug: /getting-started/day-1/hybrid-saas sidebarTitle: Overview collate: true --- -# Hybrid SaaS +# Hybrid Ingestion Runner - +The **Hybrid Ingestion Runner** is a component designed to enable Collate customers operating in hybrid environments to securely execute ingestion workflows within their own cloud infrastructure. In this setup, your SaaS instance is hosted on Collate’s cloud, while the workflows are going to deployed and executed within your private cloud. The Hybrid Runner acts as a bridge between these two environments, allowing ingestion workflows to be triggered and managed remotely—without requiring the customer to share secrets or sensitive credentials with Collate. It securely receives workflow execution requests and orchestrates them locally, maintaining full control and data privacy within the customer’s environment. -There's two options on how to set up a data connector: -1. **Run the connector in Collate SaaS**: In this scenario, you'll get an IP when you add the service. You need to give - access to this IP in your data sources. -2. **Run the connector in your infrastructure or laptop**: In this case, Collate won't be accessing the data, but rather - you'd control where and how the process is executed and Collate will only receive the output of the metadata extraction. - This is an interesting option for sources lying behind private networks or when external SaaS services are not allowed to - connect to your data sources. +## Prerequisites -Any tool capable of running Python code can be used to configure the metadata extraction from your sources. +Before setting up the Hybrid Ingestion Runner, ensure the following: - +- Hybrid Runner has been setup. Contact the Collate team for assistance with setting up the Hybrid Runner in your infrastructure. +- Secrets manager configured on your cloud. -In this section we'll show you how the ingestion process works and how to test it from your laptop. +## Configuration Steps for Admins -## Collate Ingestion Agent +Once your DevOps team has installed and configured the Hybrid Runner, follow these steps as a Collate Admin to configure services and manage ingestion workflows. 
-The Collate Ingestion Agent is designed to facilitate metadata ingestion for hybrid deployments, allowing organizations to securely push metadata from their infrastructure into the Collate platform without exposing their internal systems. It provides a secure and efficient channel for running ingestion workflows while maintaining full control over data processing within your network. This document outlines the setup and usage of the Collate Ingestion Agent, emphasizing its role in hybrid environments and key functionalities. +### 1. Validate Hybrid Runner Setup -### Overview +- Go to **Settings > Preferences > Ingestion Runners** in the Collate UI. +- Look for your runner in the list. +- The status should display as **Connected**. -The Collate Ingestion Agent is ideal for scenarios where running connectors on-premises is necessary, providing a secure and efficient way to process metadata within your infrastructure. This eliminates concerns about data privacy and streamlines the ingestion process. +> If the runner is not connected, reach out to Collate support. -With the Collate Ingestion Agent, you can: -- Set up ingestion workflows easily without configuring YAML files manually. -- Leverage the Collate UI for a seamless and user-friendly experience. -- Manage various ingestion types, including metadata, profiling, lineage, usage, dbt, and data quality. + -### Setting Up the Collate Ingestion Agent + -#### 1. Prepare Your Environment -To begin, download the Collate-provided Docker image for the Ingestion Agent. The Collate team will provide the necessary credentials to authenticate and pull the image from the repository. +### 2. Create a New Service -**Run the following commands:** -- **Log in to Docker**: Use the credentials provided by Collate to authenticate. -- **Pull the Docker Image**: Run the command to pull the image into your local environment. +- Navigate to **Settings > Services**. +- Click **+ Add New Service**. +- Fill in the service details. +- In the “Ingestion Runner” dropdown, choose the hybrid runner. -Once the image is downloaded, you can start the Docker container to initialize the Ingestion Agent. + -#### 2. Configure the Agent +> Even if you're operating in hybrid mode, you can still choose "Collate SaaS Runner" to run the ingestion workflow within Collate's SaaS environment. -#### Access the Local Agent UI: -- Open your browser and navigate to the local instance of the Collate Ingestion Agent. +### 3. Manage Secrets Securely -#### Set Up the Connection: -- Enter your Collate platform URL (e.g., `https://.collate.com/api`). -- Add the ingestion bot token from the Collate settings under **Settings > Bots > Ingestion Bot**. +When executing workflows on your Hybrid environment, you have to use your existing cloud provider's Secrets Manager to store sensitive credentials (like passwords or token), and reference them securely in Collate via the Hybrid Runner. -#### Verify Services: -- Open the Collate UI and confirm that all available services (e.g., databases) are visible in the Ingestion Agent interface. +Collate never stores or accesses these secrets directly—only the Hybrid Runner retrieves them at runtime from your own infrastructure. -#### 3. Add a New Service +**Steps:** -1. Navigate to the **Database Services** section in the Ingestion Agent UI. -2. Click **Add New Service** and select the database type (e.g., Redshift). -3. Enter the necessary service configuration: - - **Service Name**: A unique name for the database service. 
- - **Host and Port**: Connection details for the database. - - **Username and Password**: Credentials to access the database. - - **Database Name**: The target database for ingestion. -4. Test the connection to ensure the service is properly configured. +- Create your secret in your Secrets Manager of choice: + - **AWS Secrets Manager** + - **Azure Key Vault** + - **GCP Secret Manager** -#### 4. Run Metadata Ingestion +When creating a secret, store the value as-is (e.g., `password123`) without any additional formatting or encoding. The Hybrid Runner will handle the retrieval and decryption of the secret value at runtime. +For example, in AWS Secrets Manager, you can click on `Store a new secret` > `Other type of secret` > `Plaintext`. You need to paste the secret as-is, without any other formatting (such as quotes, JSON, etc.). -1. After creating the service, navigate to the **Ingestion** tab and click **Add Ingestion**. -2. Select the ingestion type (e.g., metadata) and specify any additional configurations: - - Include specific schemas or tables. - - Enable options like DDL inclusion if required. -3. Choose whether to: - - Run the ingestion immediately via the agent. - - Download the YAML configuration file for running ingestion on an external scheduler. -4. Monitor the logs in real-time to track the ingestion process. + -#### 5. Verify Ingested Data +Finally, in the service connection form in Collate, reference the secret using the `secret:` prefix followed by the full path to your secret. -1. Return to the Collate platform and refresh the database service. -2. Verify that the ingested metadata, including schemas, tables, and column details, is available. -3. Explore additional ingestion options like profiling, lineage, or data quality for the service. - -### Additional Features - -The Collate Ingestion Agent supports various ingestion workflows, allowing you to: -- **Generate YAML Configurations**: Download YAML files for external scheduling. -- **Manage Ingestion Types**: Run metadata, profiling, lineage, usage, and other workflows as needed. -- **Monitor Progress**: View logs and monitor real-time ingestion activity. - -## 1. How does the Ingestion Framework work? - -The Ingestion Framework contains all the logic about how to connect to the sources, extract their metadata -and send it to the OpenMetadata server. We have built it from scratch with the main idea of making it an independent -component that can be run from - **literally** - anywhere. - -In order to install it, you just need to get it from [PyPI](https://pypi.org/project/openmetadata-ingestion/). - -```shell -pip install openmetadata-ingestion -``` - -We will show further examples later, but a piece of code is the best showcase for its simplicity. In order to run -a full ingestion process, you just need to execute a single function. For example, if we wanted to run the metadata -ingestion from within a simple Python script: - -```python -from metadata.workflow.metadata import MetadataWorkflow - - -# Specify your YAML configuration -CONFIG = """ -source: - ... -workflowConfig: - openMetadataServerConfig: - hostPort: 'http://localhost:8585/api' - authProvider: openmetadata - securityConfig: - jwtToken: ... 
-""" - -def run(): - workflow_config = yaml.safe_load(CONFIG) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - - -if __name__ == "__main__": - run() -``` - -Where this function runs is completely up to you, and you can adapt it to what makes the most sense within your -organization and engineering context. Below you'll see some examples of different orchestrators you can leverage -to execute the ingestion process. - -## 2. Ingestion Configuration - -In the example above, the `Workflow` class got created from a YAML configuration. Any Workflow that you execute (ingestion, -profiler, lineage,...) will have its own YAML representation. - -You can think about this configuration as the recipe you want to execute: where is your source, which pieces do you -extract, how are they processed and where are they sent. - -An example YAML config for extracting MySQL metadata looks like this: - -```yaml -source: - type: mysql - serviceName: mysql - serviceConnection: - config: - type: Mysql - username: openmetadata_user - authType: - password: openmetadata_password - hostPort: localhost:3306 - databaseSchema: openmetadata_db - sourceConfig: - config: - type: DatabaseMetadata -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: 'http://localhost:8585/api' - authProvider: openmetadata - securityConfig: - jwtToken: ... -``` - - -You will find examples of all the workflow's YAML files at each Connector [page](/connectors). - - -We will now show you examples on how to configure and run every workflow externally by using Snowflake as an example. But -first, let's digest some information that will be common everywhere, the `workflowConfig`. - -### Workflow Config - -Here you will define information such as where are you hosting the OpenMetadata server, and the JWT token to authenticate. - - - -Review this section carefully to ensure you are properly managing service credentials and other security configurations. - - - -**Logger Level** - -You can specify the `loggerLevel` depending on your needs. If you are trying to troubleshoot an ingestion, running -with `DEBUG` will give you far more traces for identifying issues. - -**JWT Token** - -JWT tokens will allow your clients to authenticate against the OpenMetadata server. -To enable JWT Tokens, you will get more details [here](/deployment/security/enable-jwt-tokens). - -You can refer to the JWT Troubleshooting section [link](/deployment/security/jwt-troubleshooting) for any issues in -your JWT configuration. - -**Store Service Connection** - -If set to `true` (default), we will store the sensitive information either encrypted via the Fernet Key in the database -or externally, if you have configured any [Secrets Manager](/deployment/secrets-manager). - -If set to `false`, the service will be created, but the service connection information will only be used by the Ingestion -Framework at runtime, and won't be sent to the OpenMetadata server. - -**Secrets Manager Configuration** - -If you have configured any [Secrets Manager](/deployment/secrets-manager), you need to let the Ingestion Framework know -how to retrieve the credentials securely. - -Follow the [docs](/deployment/secrets-manager) to configure the secret retrieval based on your environment. 
- -**SSL Configuration** - -If you have added SSL to the [OpenMetadata server](/deployment/security/enable-ssl), then you will need to handle -the certificates when running the ingestion too. You can either set `verifySSL` to `ignore`, or have it as `validate`, -which will require you to set the `sslConfig.caCertificate` with a local path where your ingestion runs that points -to the server certificate file. - -Find more information on how to troubleshoot SSL issues [here](/deployment/security/enable-ssl/ssl-troubleshooting). +📌 **For example, in AWS Secrets Manager**, if your secret is stored at: `/my/database/password`, you would reference it in the service connection form as: ```yaml -workflowConfig: - loggerLevel: INFO # DEBUG, INFO, WARNING or ERROR - openMetadataServerConfig: - hostPort: "https://customer.getcollate.io/api" - authProvider: openmetadata - securityConfig: - jwtToken: "{bot_jwt_token}" - ## Store the service Connection information - storeServiceConnection: false -``` - -## 3. (Optional) Ingestion Pipeline - -Additionally, if you want to see your runs logged in the `Ingestions` tab of the connectors page in the UI as you would -when running the connectors natively with OpenMetadata, you can add the following configuration on your YAMLs: - -```yaml -source: - type: mysql - serviceName: mysql -[...] -workflowConfig: - openMetadataServerConfig: - hostPort: "https://customer.getcollate.io/api" - authProvider: openmetadata - securityConfig: - jwtToken: ... -ingestionPipelineFQN: . # E.g., mysql.marketing_metadata` -``` - -Adding the `ingestionPipelineFQN` - the Ingestion Pipeline Fully Qualified Name - will tell the Ingestion Framework -to log the executions and update the ingestion status, which will appear on the UI. Note that the action buttons -will be disabled, since OpenMetadata won't be able to interact with external systems. - -## 4. (Optional) Disable the Pipeline Service Client - -If you want to run your workflows **ONLY externally** without relying on OpenMetadata for any workflow management -or scheduling, you can update the following server configuration: - -```yaml -pipelineServiceClientConfiguration: - enabled: ${PIPELINE_SERVICE_CLIENT_ENABLED:-true} -``` - -by setting `enabled: false` or setting the `PIPELINE_SERVICE_CLIENT_ENABLED=false` as an environment variable. - -This will stop certain APIs and monitors related to the Pipeline Service Client (e.g., Airflow) from being operative. - -## Examples - - - -This is not an exhaustive list, and it will keep growing over time. Not because the orchestrators X or Y are not supported, -but just because we did not have the time yet to add it here. If you'd like to chip in and help us expand these guides and examples, -don't hesitate to reach to us in [Slack](https://slack.open-metadata.org/) or directly open a PR in -[GitHub](https://github.com/open-metadata/OpenMetadata/tree/main/openmetadata-docs/content). - - - - - Run the ingestion process externally from Airflow - - - Run the ingestion process externally using AWS MWAA - - - Run the ingestion process externally from GCP Composer - - - Run the ingestion process externally from GitHub Actions - - - -Let's jump now into some examples on how you could create the function the run the different workflows. Note that this code -can then be executed inside a DAG, a GitHub action, or a vanilla Python script. It will work for any environment. - -### Testing - -You can easily test every YAML configuration using the `metadata` CLI from the Ingestion Framework. 
-In order to install it, you just need to get it from [PyPI](https://pypi.org/project/openmetadata-ingestion/). - -In each of the examples below, we'll showcase how to run the CLI, assuming you have a YAML file that contains -the workflow configuration. - -### Metadata Workflow - -This is the first workflow you have to configure and run. It will take care of fetching the metadata from your sources, -be it Database Services, Dashboard Services, Pipelines, etc. - -The rest of the workflows (Lineage, Profiler,...) will be executed on top of the metadata already available in the platform. - - - - -The first step is to import the `MetadataWorkflow` class, which will take care of the full ingestion logic. We'll -add the import for printing the results at the end. - - - -Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can -read from a file, parse secrets from your environment, or any other approach you'd need. In the end, it's just -Python code. - - -You can find complete YAMLs in each connector [docs](/connectors) and find more information about the available -configurations. - - - - -Finally, we'll prepare a function that we can execute anywhere. - -It will take care of instantiating the workflow, executing it and giving us the results. - - - -```python -import yaml -CONFIG = """ -source: - type: snowflake - serviceName: - serviceConnection: - config: - type: Snowflake - ... - sourceConfig: - config: - type: DatabaseMetadata - markDeletedTables: true - includeTables: true - ... -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: "http://localhost:8585/api" - authProvider: openmetadata - securityConfig: - jwtToken: "{bot_jwt_token}" -""" -def run(): - workflow = MetadataWorkflow.create(CONFIG) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() -``` - - - -You can test the workflow via `metadata ingest -c `. - - - - -### Lineage Workflow - -This workflow will take care of scanning your query history and defining lineage relationships between your tables. - -You can find more information about this workflow [here](/connectors/ingestion/lineage). - - - - -The first step is to import the `MetadataWorkflow` class, which will take care of the full ingestion logic. We'll -add the import for printing the results at the end. - -Note that we are using the same class as in the Metadata Ingestion. - - - -Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can -read from a file, parse secrets from your environment, or any other approach you'd need. - -Note how we have not added here the `serviceConnection`. Since the service would have been created during the -metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. - -If, however, you are configuring the workflow with `storeServiceConnection: false`, you'll need to explicitly -define the `serviceConnection`. - - -You can find complete YAMLs in each connector [docs](/connectors) and find more information about the available -configurations. - - - - -Finally, we'll prepare a function that we can execute anywhere. - -It will take care of instantiating the workflow, executing it and giving us the results. - - - -```python -import yaml -CONFIG = """ -source: - type: snowflake-lineage - serviceName: - sourceConfig: - config: - type: DatabaseLineage - queryLogDuration: 1 - parsingTimeoutLimit: 300 - ... 
-sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: "http://localhost:8585/api" - authProvider: openmetadata - securityConfig: - jwtToken: "{bot_jwt_token}" -""" -def run(): - workflow = MetadataWorkflow.create(CONFIG) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() -``` - - - -You can test the workflow via `metadata ingest -c `. - - - - -### Usage Workflow - -As with the lineage workflow, we'll scan the query history for any DML statements. The goal is to ingest queries -into the platform, figure out the relevancy of your assets and frequently joined tables. - - - - -The first step is to import the `UsageWorkflow` class, which will take care of the full ingestion logic. We'll -add the import for printing the results at the end. - - - -Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can -read from a file, parse secrets from your environment, or any other approach you'd need. - -Note how we have not added here the `serviceConnection`. Since the service would have been created during the -metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. - -If, however, you are configuring the workflow with `storeServiceConnection: false`, you'll need to explicitly -define the `serviceConnection`. - - - -You can find complete YAMLs in each connector [docs](/connectors) and find more information about the available -configurations. - - - - -Finally, we'll prepare a function that we can execute anywhere. - -It will take care of instantiating the workflow, executing it and giving us the results. - - - -```python -import yaml -CONFIG = """ -source: - type: snowflake-usage - serviceName: - sourceConfig: - config: - type: DatabaseUsage - queryLogDuration: 1 - parsingTimeoutLimit: 300 - ... -processor: - type: query-parser - config: {} -stage: - type: table-usage - config: - filename: "/tmp/snowflake_usage" -bulkSink: - type: metadata-usage - config: - filename: "/tmp/snowflake_usage" -workflowConfig: - openMetadataServerConfig: - hostPort: "http://localhost:8585/api" - authProvider: openmetadata - securityConfig: - jwtToken: "{bot_jwt_token}" -""" -def run(): - workflow = UsageWorkflow.create(CONFIG) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() -``` - - - -You can test the workflow via `metadata usage -c `. - - - -### Profiler Workflow - -This workflow will execute queries against your database and send the results into OpenMetadata. The goal is to compute -metrics about your data and give you a high-level view of its shape, together with the sample data. - -This is an interesting previous step before creating Data Quality Workflows. - -You can find more information about this workflow [here](/how-to-guides/data-quality-observability/profiler/profiler-workflow). - - - - -The first step is to import the `ProfilerWorkflow` class, which will take care of the full ingestion logic. We'll -add the import for printing the results at the end. - - - -Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can -read from a file, parse secrets from your environment, or any other approach you'd need. - -Note how we have not added here the `serviceConnection`. Since the service would have been created during the -metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. 
- -If, however, you are configuring the workflow with `storeServiceConnection: false`, you'll need to explicitly -define the `serviceConnection`. - - -You can find complete YAMLs in each connector [docs](/connectors) and find more information about the available -configurations. - - - - -Finally, we'll prepare a function that we can execute anywhere. - -It will take care of instantiating the workflow, executing it and giving us the results. - - - -```python -import yaml -CONFIG = """ -source: - type: snowflake - serviceName: - sourceConfig: - config: - type: Profiler - generateSampleData: true - ... -processor: - type: orm-profiler - config: {} -sink: - type: metadata-rest - config: {} -workflowConfig: - openMetadataServerConfig: - hostPort: "http://localhost:8585/api" - authProvider: openmetadata - securityConfig: - jwtToken: "{bot_jwt_token}" -""" -def run(): - workflow = ProfilerWorkflow.create(CONFIG) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() +password: secret:/my/database/password ``` -You can test the workflow via `metadata profile -c `. - - - - -### Data Quality Workflow - -This workflow will execute queries against your database and send the results into OpenMetadata. The goal is to compute -metrics about your data and give you a high-level view of its shape, together with the sample data. - -This is an interesting previous step before creating Data Quality Workflows. - -You can find more information about this workflow [here](/how-to-guides/data-quality-observability/quality/configure). - - - - -The first step is to import the `TestSuiteWorkflow` class, which will take care of the full ingestion logic. We'll -add the import for printing the results at the end. - - - -Then, we need to pass the YAML configuration. For this simple example we are defining a variable, but you can -read from a file, parse secrets from your environment, or any other approach you'd need. - -Note how we have not added here the `serviceConnection`. Since the service would have been created during the -metadata ingestion, we can let the Ingestion Framework dynamically fetch the Service Connection information. - -If, however, you are configuring the workflow with `storeServiceConnection: false`, you'll need to explicitly -define the `serviceConnection`. - -Moreover, see how we are not configuring any tests in the `processor`. You can do [that](/how-to-guides/data-quality-observability/quality/configure#full-yaml-config-example), -but even if nothing gets defined in the YAML, we will execute all the tests configured against the table. - - -You can find complete YAMLs in each connector [docs](/connectors) and find more information about the available -configurations. - - - - -Finally, we'll prepare a function that we can execute anywhere. - -It will take care of instantiating the workflow, executing it and giving us the results. 
-
-
-```python
-import yaml
-CONFIG = """
-source:
-  type: TestSuite
-  serviceName: 
-  sourceConfig:
-    config:
-      type: TestSuite
-      entityFullyQualifiedName: 
-processor:
-  type: orm-test-runner
-  config: {}
-sink:
-  type: metadata-rest
-  config: {}
-workflowConfig:
-  openMetadataServerConfig:
-    hostPort: "http://localhost:8585/api"
-    authProvider: openmetadata
-    securityConfig:
-      jwtToken: "{bot_jwt_token}"
-"""
-def run():
-    workflow = TestSuiteWorkflow.create(CONFIG)
-    workflow.execute()
-    workflow.raise_from_status()
-    workflow.print_status()
-    workflow.stop()
-```
-
-
+Note that this approach to handling secrets only works for values that are considered secrets in the connection form.
-You can test the workflow via `metadata test -c `.
+You can identify these values because the input is masked as you type and there is an icon on the right that toggles showing or hiding the value.
diff --git a/getting-started/day-1/hybrid-saas/local-ingestion-agent.mdx b/getting-started/day-1/hybrid-saas/local-ingestion-agent.mdx
new file mode 100644
index 00000000..a2f24e4b
--- /dev/null
+++ b/getting-started/day-1/hybrid-saas/local-ingestion-agent.mdx
@@ -0,0 +1,76 @@
+---
+title: Local Ingestion Agent | Secure Metadata Ingestion for Collate
+description: Learn how to securely ingest metadata using the Collate Ingestion Agent in your own infrastructure. Ideal for private networks and hybrid SaaS setups.
+slug: /getting-started/day-1/hybrid-saas/local-ingestion-agent
+sidebarTitle: Local Ingestion Agent
+collate: true
+---
+
+# Local Ingestion Agent
+
+The Local Ingestion Agent facilitates metadata ingestion for hybrid deployments, allowing organizations to securely push metadata from their infrastructure into the Collate platform without exposing their internal systems. It provides a secure and efficient way to run ingestion workflows while keeping full control over data processing within your network.
+
+The main difference between the Local Ingestion Agent and the Hybrid Runner [link to docs Hybrid SaaS page] is that the Local Ingestion Agent can be set up quickly on your own laptop, without relying on any other infrastructure.
+
+### Overview
+
+The Collate Ingestion Agent is ideal for scenarios where running connectors on-premises is necessary, providing a secure and efficient way to process metadata within your infrastructure. This removes concerns about data privacy and streamlines the ingestion process.
+
+With the Collate Ingestion Agent, you can:
+- Set up ingestion workflows easily from a UI.
+- Handle the end-to-end metadata extraction workflows: metadata, profiling, lineage, usage, dbt, auto classification, and data quality.
+
+### Setting Up the Collate Ingestion Agent
+
+#### 1. Prepare Your Environment
+
+You need to be able to run Docker images on your laptop.
+Run the following commands:
+
+1. Log in to Docker with the credentials provided by Collate. You can reach out to support@getcollate.io to get your credentials.
+
+```shell
+docker login --username AWS -p eyJwY... 118146679784.dkr.ecr.eu-west-1.amazonaws.com
+```
+
+2. Run the Docker image to start the Local Agent:
+
+```shell
+docker run -it --rm -p 8001:8001 -e CL_BASE_DIR='/ingestion/collate/collate-local-webserver/' -v ./.collate:/ingestion/collate/collate-local-webserver/.collate 118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-local-ingestion:
+```
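Optionally, if you prefer to manage the container declaratively rather than with a one-off `docker run`, a minimal `docker-compose.yml` sketch could look like the following. This is only an assumption-based convenience that mirrors the image, port, environment variable, and volume from the command above; the image tag is left as a placeholder for the tag provided by Collate.

```yaml
# Illustrative sketch only: mirrors the `docker run` command above.
services:
  collate-local-agent:
    image: 118146679784.dkr.ecr.eu-west-1.amazonaws.com/collate-customers-local-ingestion:<tag-provided-by-collate>
    ports:
      - "8001:8001"
    environment:
      CL_BASE_DIR: /ingestion/collate/collate-local-webserver/
    volumes:
      - ./.collate:/ingestion/collate/collate-local-webserver/.collate
    stdin_open: true  # equivalent to -i
    tty: true         # equivalent to -t
```

You could then start the agent with `docker compose up` and stop it with `docker compose down`.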
+#### 2. Configure the Agent
+
+#### Access the Local Agent UI:
+- Open your browser and navigate to http://localhost:8001
+
+#### Set Up the Connection:
+- Enter your Collate platform URL (e.g., https://(your-company).getcollate.io/api).
+- Add the ingestion bot token from the Collate settings under Settings > Bots > Ingestion Bot.
+
+#### 3. Add a New Service
+
+1. Navigate to the Database Services section in the Ingestion Agent UI.
+2. Click Add New Service and select the database type.
+3. Enter the necessary service configuration.
+4. Test the connection to ensure the service is properly configured.
+
+#### 4. Run Metadata Ingestion
+
+1. After creating the service, navigate to the Ingestion tab and click Add Ingestion.
+2. Select the ingestion type (e.g., metadata) and specify any additional configurations for the workflow.
+3. Run the ingestion and monitor the logs in real time to track the ingestion process.
+
+#### 5. Verify Ingested Data
+
+1. Return to the Collate platform at https://(your-company).getcollate.io and refresh the database services page. Your new service should now appear there.
+2. Verify that the ingested metadata, including schemas, tables, and column details, is available.
+
+### Additional Features
+
+The Collate Ingestion Agent supports various ingestion workflows, allowing you to:
+- **Generate YAML Configurations**: Download YAML files for external scheduling.
+- **Manage Ingestion Types**: Run metadata, profiling, lineage, usage, and other workflows as needed.
+- **Monitor Progress**: View logs and monitor real-time ingestion activity.
diff --git a/getting-started/day-1/hybrid-saas/mwaa.mdx b/getting-started/day-1/hybrid-saas/mwaa.mdx
deleted file mode 100644
index e0433c4b..00000000
--- a/getting-started/day-1/hybrid-saas/mwaa.mdx
+++ /dev/null
@@ -1,443 +0,0 @@
----
-title: Run the ingestion from AWS MWAA
-description: Set up Collate ingestion workflows on AWS MWAA using Python, ECS, or Virtualenv operators. Compare approaches and configure DAGs for secure metadata ingestion.
-slug: /getting-started/day-1/hybrid-saas/mwaa
-sidebarTitle: MWAA
-collate: true
----
-
-import ExternalIngestion from '/snippets/deployment/external-ingestion.mdx'
-import RunConnectorsClass from '/snippets/deployment/run-connectors-class.mdx'
-
-
-
-# Run the ingestion from AWS MWAA
-
-When running ingestion workflows from MWAA we have three approaches:
-
-1. Install the openmetadata-ingestion package as a requirement in the Airflow environment. We will then run the process using a `PythonOperator`
-2. Configure an ECS cluster and run the ingestion as an `ECSOperator`.
-3. Install a plugin and run the ingestion with the `PythonVirtualenvOperator`.
-
-We will now discuss pros and cons of each aspect and how to configure them.
-
-## Ingestion Workflows as a Python Operator
-
-### PROs
-
-- It is the simplest approach
-- We don’t need to spin up any further infrastructure
-
-### CONs
-
-- We need to install the [openmetadata-ingestion](https://pypi.org/project/openmetadata-ingestion/) package in the MWAA environment
-- The installation can clash with existing libraries
-- Upgrading the OM version will require to repeat the installation process
-
-To install the package, we need to update the `requirements.txt` file from the MWAA environment to add the following line:
-
-```
-openmetadata-ingestion[]==x.y.z
-```
-
-Where `x.y.z` is the version of the Collate ingestion package. Note that the version needs to match the server version.
If we are using the server at 1.3.1, then the ingestion package needs to also be 1.3.1. - -The plugin parameter is a list of the sources that we want to ingest. An example would look like this `openmetadata-ingestion[mysql,snowflake,s3]==1.3.1`. - -A DAG deployed using a Python Operator would then look like follows - -```python -import json -from datetime import timedelta - -from airflow import DAG - -try: - from airflow.operators.python import PythonOperator -except ModuleNotFoundError: - from airflow.operators.python_operator import PythonOperator - -from airflow.utils.dates import days_ago - -from metadata.workflow.metadata import MetadataWorkflow - - -default_args = { - "retries": 3, - "retry_delay": timedelta(seconds=10), - "execution_timeout": timedelta(minutes=60), -} - -config = """ -YAML config -""" - -def metadata_ingestion_workflow(): - workflow_config = json.loads(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - -with DAG( - "redshift_ingestion", - default_args=default_args, - description="An example DAG which runs a Collate ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=False, - catchup=False, -) as dag: - ingest_task = PythonOperator( - task_id="ingest_redshift", - python_callable=metadata_ingestion_workflow, - ) -``` - -Where you can update the YAML configuration and workflow classes accordingly. accordingly. Further examples on how to -run the ingestion can be found on the documentation (e.g., [Snowflake](/connectors/database/snowflake)). - - - -## Ingestion Workflows as an ECS Operator - -### PROs -- Completely isolated environment -- Easy to update each version - -### CONs -- We need to set up an ECS cluster and the required policies in MWAA to connect to ECS and handle Log Groups. - -We will now describe the steps, following the official AWS documentation. - -### 1. Create an ECS Cluster & Task Definition - -- The cluster needs a task to run in `FARGATE` mode. -- The required image is `docker.getcollate.io/openmetadata/ingestion-base:x.y.z` - - The same logic as above applies. The `x.y.z` version needs to match the server version. For example, `docker.getcollate.io/openmetadata/ingestion-base:1.3.1` - -We have tested this process with a Task Memory of 512MB and Task CPU (unit) of 256. This can be tuned depending on the amount of metadata that needs to be ingested. - -When creating the Task Definition, take notes on the **log groups** assigned, as we will need them to prepare the MWAA Executor Role policies. - -For example, if in the JSON from the Task Definition we see: - -```json -"logConfiguration": { - "logDriver": "awslogs", - "options": { - "awslogs-create-group": "true", - "awslogs-group": "/ecs/openmetadata", - "awslogs-region": "us-east-2", - "awslogs-stream-prefix": "ecs" - }, - "secretOptions": [] -} -``` - -We'll need to use the `/ecs/openmetadata` below when configuring the policies. - -### 2. Task Definition ARN & Networking - -1. From the AWS Console, copy your task definition ARN. It will look something like this `arn:aws:ecs:::task-definition/:`. -2. Get the network details on where the task should execute. 
We will be using a JSON like: - -```json -{ - "awsvpcConfiguration": { - "subnets": [ - "subnet-xxxyyyzzz", - "subnet-xxxyyyzzz" - ], - "securityGroups": [ - "sg-xxxyyyzzz" - ], - "assignPublicIp": "ENABLED" - } -} -``` - - - -If you want to extract MWAA metadata, add the **VPC**, **subnets** and **security groups** used when setting up MWAA. We need to -be in the same network environment as MWAA to reach the underlying database. - - - -### 3. Update MWAA Executor Role policies - -- Identify your MWAA executor role. This can be obtained from the details view of your MWAA environment. -- Add the following two policies to the role, the first with ECS permissions: - -```json -{ - "Version": "2012-10-17", - "Statement": [ - { - "Sid": "VisualEditor0", - "Effect": "Allow", - "Action": [ - "ecs:RunTask", - "ecs:DescribeTasks" - ], - "Resource": "*" - }, - { - "Action": "iam:PassRole", - "Effect": "Allow", - "Resource": [ - "*" - ], - "Condition": { - "StringLike": { - "iam:PassedToService": "ecs-tasks.amazonaws.com" - } - } - } - ] -} -``` - - -And for the Log Group permissions - -```json -{ - "Effect": "Allow", - "Action": [ - "logs:CreateLogStream", - "logs:CreateLogGroup", - "logs:PutLogEvents", - "logs:GetLogEvents", - "logs:GetLogRecord", - "logs:GetLogGroupFields", - "logs:GetQueryResults" - ], - "Resource": [ - "arn:aws:logs:::log-group:*", - "arn:aws:logs:*:*:log-group::*" - ] -} - -``` - -Note how you need to replace the `region`, `account-id` and the `log group` names for your Airflow Environment and ECS. - -### 4. Prepare the DAG - -A DAG created using the ECS Operator will then look like this: - -```python -from airflow import DAG -# If using Airflow < 2.5 -# from airflow.providers.amazon.aws.operators.ecs import ECSOperator -# If using Airflow > 2.5 -from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator -from airflow.utils.dates import days_ago - - -CLUSTER_NAME="openmetadata-ingestion" # Replace value for CLUSTER_NAME with your information. -CONTAINER_NAME="openmetadata-ingestion" # Replace value for CONTAINER_NAME with your information. 
-LAUNCH_TYPE="FARGATE" - -TASK_DEFINITION = "arn:aws:ecs:::task-definition/:" -NETWORK_CONFIG = { - "awsvpcConfiguration": { - "subnets": [ - "subnet-xxxyyyzzz", - "subnet-xxxyyyzzz" - ], - "securityGroups": [ - "sg-xxxyyyzzz" - ], - "assignPublicIp": "ENABLED" - } -} - -config = """ -YAML config -""" - - -with DAG( - dag_id="ecs_fargate_dag", - schedule_interval=None, - catchup=False, - start_date=days_ago(1), - is_paused_upon_creation=True, -) as dag: - ecs_operator_task = EcsRunTaskOperator( - task_id = "ecs_ingestion_task", - dag=dag, - cluster=CLUSTER_NAME, - task_definition=TASK_DEFINITION, - launch_type=LAUNCH_TYPE, - overrides={ - "containerOverrides":[ - { - "name":CONTAINER_NAME, - "command":["python", "main.py"], - "environment": [ - { - "name": "config", - "value": config - }, - { - "name": "pipelineType", - "value": "metadata" - }, - ], - }, - ], - }, - - network_configuration=NETWORK_CONFIG, - awslogs_group="/ecs/ingest", - awslogs_stream_prefix=f"ecs/{CONTAINER_NAME}", - ) -``` - -Note that depending on the kind of workflow you will be deploying, the YAML configuration will need to updated following -the official Collate docs, and the value of the `pipelineType` configuration will need to hold one of the following values: - -- `metadata` -- `usage` -- `lineage` -- `profiler` -- `TestSuite` - -Which are based on the `PipelineType` [JSON Schema definitions](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/ingestionPipelines/ingestionPipeline.json#L14) - -Moreover, one of the imports will depend on the MWAA Airflow version you are using: -- If using Airflow < 2.5: `from airflow.providers.amazon.aws.operators.ecs import ECSOperator` -- If using Airflow > 2.5: `from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator` - -Make sure to update the `ecs_operator_task` task call accordingly. - -## Ingestion Workflows as a Python Virtualenv Operator - -### PROs - -- Installation does not clash with existing libraries -- Simpler than ECS - -### CONs - -- We need to install an additional plugin in MWAA -- DAGs take longer to run due to needing to set up the virtualenv from scratch for each run. - -We need to update the `requirements.txt` file from the MWAA environment to add the following line: - -``` -virtualenv -``` - -Then, we need to set up a custom plugin in MWAA. Create a file named virtual_python_plugin.py. Note that you may need to update the python version (eg, python3.7 -> python3.10) depending on what your MWAA environment is running. -```python -""" -Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. - -Permission is hereby granted, free of charge, to any person obtaining a copy of -this software and associated documentation files (the "Software"), to deal in -the Software without restriction, including without limitation the rights to -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of -the Software, and to permit persons to whom the Software is furnished to do so. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS -FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR -COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER -IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
-""" -from airflow.plugins_manager import AirflowPlugin -import airflow.utils.python_virtualenv -from typing import List -import os - - -def _generate_virtualenv_cmd(tmp_dir: str, python_bin: str, system_site_packages: bool) -> List[str]: - cmd = ['python3', '/usr/local/airflow/.local/lib/python3.7/site-packages/virtualenv', tmp_dir] - if system_site_packages: - cmd.append('--system-site-packages') - if python_bin is not None: - cmd.append(f'--python={python_bin}') - return cmd - - -airflow.utils.python_virtualenv._generate_virtualenv_cmd = _generate_virtualenv_cmd - -os.environ["PATH"] = f"/usr/local/airflow/.local/bin:{os.environ['PATH']}" - - -class VirtualPythonPlugin(AirflowPlugin): - name = 'virtual_python_plugin' -``` - -This is modified from the [AWS sample](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-virtualenv.html). - -Next, create the plugins.zip file and upload it according to [AWS docs](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-dag-import-plugins.html). You will also need to [disable lazy plugin loading in MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-virtualenv.html#samples-virtualenv-airflow-config). - -A DAG deployed using the PythonVirtualenvOperator would then look like: - -```python -from datetime import timedelta - -from airflow import DAG - -from airflow.operators.python import PythonVirtualenvOperator - -from airflow.utils.dates import days_ago - - -default_args = { - "retries": 3, - "retry_delay": timedelta(seconds=10), - "execution_timeout": timedelta(minutes=60), -} - -def metadata_ingestion_workflow(): - from metadata.workflow.metadata import MetadataWorkflow - - - import yaml - - config = """ -YAML config - """ - workflow_config = yaml.loads(config) - workflow = MetadataWorkflow.create(workflow_config) - workflow.execute() - workflow.raise_from_status() - workflow.print_status() - workflow.stop() - -with DAG( - "redshift_ingestion", - default_args=default_args, - description="An example DAG which runs a Collate ingestion workflow", - start_date=days_ago(1), - is_paused_upon_creation=False, - catchup=False, -) as dag: - ingest_task = PythonVirtualenvOperator( - task_id="ingest_redshift", - python_callable=metadata_ingestion_workflow, - requirements=['openmetadata-ingestion==1.0.5.0', - 'apache-airflow==2.4.3', # note, v2.4.3 is the first version that does not conflict with Collate's 'tabulate' requirements - 'apache-airflow-providers-amazon==6.0.0', # Amazon Airflow provider is necessary for MWAA - 'watchtower',], - system_site_packages=False, - dag=dag, - ) -``` - -Where you can update the YAML configuration and workflow classes accordingly. accordingly. Further examples on how to -run the ingestion can be found on the documentation (e.g., [Snowflake](/connectors/database/snowflake)). - -You will also need to determine the Collate ingestion extras and Airflow providers you need. Note that the Openmetadata version needs to match the server version. If we are using the server at 0.12.2, then the ingestion package needs to also be 0.12.2. An example of the extras would look like this `openmetadata-ingestion[mysql,snowflake,s3]==0.12.2.2`. -For Airflow providers, you will want to pull the provider versions from [the matching constraints file](https://raw.githubusercontent.com/apache/airflow/constraints-2.4.3/constraints-3.7.txt). Since this example installs Airflow Providers v2.4.3 on Python 3.7, we use that constraints file. 
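As a small illustrative helper (assuming `curl` and `grep` are available on your machine; this is not part of the original instructions), you can look up the pinned versions for the packages used in the example directly from that constraints file:

```shell
# Illustrative only: query the Airflow 2.4.3 / Python 3.7 constraints file for the pinned versions
curl -s https://raw.githubusercontent.com/apache/airflow/constraints-2.4.3/constraints-3.7.txt \
  | grep -E "apache-airflow-providers-amazon|watchtower"
```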
- -Also note that the ingestion workflow function must be entirely self-contained as it will run by itself in the virtualenv. Any imports it needs, including the configuration, must exist within the function itself. - - diff --git a/getting-started/index.mdx b/getting-started/index.mdx index e9bce28e..16e7e459 100644 --- a/getting-started/index.mdx +++ b/getting-started/index.mdx @@ -27,7 +27,7 @@ need to set up your Collate environment in 30 minutes. Configure the Hybrid Runner to securely connect your environment and enable metadata ingestion. diff --git a/public/images/collate-ai/collate-ai-sql-agent.png b/public/images/collate-ai/collate-ai-sql-agent.png new file mode 100644 index 00000000..a3191342 Binary files /dev/null and b/public/images/collate-ai/collate-ai-sql-agent.png differ diff --git a/public/images/collate-ai/collate-ai-sql-agent1.png b/public/images/collate-ai/collate-ai-sql-agent1.png new file mode 100644 index 00000000..8173874c Binary files /dev/null and b/public/images/collate-ai/collate-ai-sql-agent1.png differ diff --git a/public/images/collate-ai/collate-ai-sql-agent2.png b/public/images/collate-ai/collate-ai-sql-agent2.png new file mode 100644 index 00000000..8f585e12 Binary files /dev/null and b/public/images/collate-ai/collate-ai-sql-agent2.png differ