diff --git a/docs/studio/index.md b/docs/studio/index.md
new file mode 100644
index 000000000..1d89e4362
--- /dev/null
+++ b/docs/studio/index.md
@@ -0,0 +1,62 @@
+# DataChain Studio
+
+DataChain Studio is a web application that enables Machine Learning and Data teams to seamlessly:
+
+- [Run and track jobs](user-guide/jobs/index.md)
+- [Track experiments and manage models](user-guide/experiments/index.md) (via DVC integration)
+- [Collaborate on data projects](user-guide/team-collaboration.md)
+
+DataChain Studio supports multiple workflows:
+
+- **DataChain workflows**: For unstructured data processing and transformation
+- **DVC + Git workflows**: For ML experiment tracking and model registry, maintaining Git as the single source of truth
+
+Sign in to DataChain Studio using your GitHub.com, GitLab.com, or Bitbucket.org account, or with your email address. Explore the demo projects and datasets, and [let us know](user-guide/troubleshooting.md#support) if you need any help getting started.
+
+## Why DataChain Studio?
+
+- Simplify data processing job tracking, visualization, and collaboration.
+- Support both modern DataChain workflows and traditional DVC experiment tracking.
+- Keep your code, data, and processing connected at all times.
+- Apply your existing software engineering stack to your data and ML teams.
+- Build a comprehensive data processing and ML platform for transparency and discovery across all your projects.
+- For DVC projects, maintain Git as the single source of truth and use [GitOps](https://www.gitops.tech/) for deployment and automation.
+
+## Getting Started
+
+New to DataChain Studio? Start with these guides:
+
+- **[User Guide](user-guide/index.md)** - Learn how to use DataChain Studio features
+- **[API Reference](api/index.md)** - Integrate with Studio programmatically
+- **[Webhooks](webhooks.md)** - Set up event notifications
+- **[Self-hosting](self-hosting/index.md)** - Deploy your own Studio instance
+
+## Key Features
+
+### Dataset Management
+- Track and version your datasets
+- Visualize data processing pipelines
+- Share datasets across teams
+
+### Job Processing
+- Run data processing jobs in the cloud
+- Monitor job progress and logs
+- Schedule recurring data processing tasks
+
+### ML Experiment Tracking (DVC Integration)
+- Track and compare ML experiments
+- Manage model lifecycle and registry
+- Visualize metrics and plots
+- Git-based experiment versioning
+
+### Team Collaboration
+- Share projects with team members
+- Control access with role-based permissions
+- Integrate with development workflows
+
+### API Integration
+- RESTful API for programmatic access
+- Webhook notifications for automation
+- Command-line tools for developers
+
+Visit [studio.datachain.ai](https://studio.datachain.ai) to get started, or learn about [self-hosting](self-hosting/index.md) for enterprise deployments.
diff --git a/docs/studio/self-hosting/configuration/ca-certificates.md b/docs/studio/self-hosting/configuration/ca-certificates.md
new file mode 100644
index 000000000..c7a444252
--- /dev/null
+++ b/docs/studio/self-hosting/configuration/ca-certificates.md
@@ -0,0 +1,410 @@
+# CA Certificates
+
+This guide covers how to configure custom Certificate Authority (CA) certificates for your self-hosted DataChain Studio instance. This is necessary when your organization uses internal CAs or when connecting to services with custom certificates.
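+
+A quick way to confirm that a missing CA is the problem is to connect to the service from a host that does not yet trust it. The hostname below is a placeholder for your internal service, and the exact error text varies by tool and OpenSSL version:
+
+```bash
+# Without the internal CA installed, verification fails with an error like:
+# "curl: (60) SSL certificate problem: unable to get local issuer certificate"
+curl -sS https://gitlab.internal.company.com
+
+# openssl prints the verification result; "0 (ok)" means the CA is already trusted
+openssl s_client -connect gitlab.internal.company.com:443 </dev/null 2>/dev/null | grep "Verify return code"
+```
+
+If these checks fail with verification errors, configure the CA using one of the methods below.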
+ +## Overview + +DataChain Studio may need to trust custom CA certificates in several scenarios: + +- **Internal Git Servers**: Self-hosted GitLab, GitHub Enterprise with custom certificates +- **Storage Services**: S3-compatible storage with custom certificates +- **Corporate Proxies**: HTTPS proxies with internal certificates +- **Database Connections**: PostgreSQL/Redis with SSL using custom CAs + +## Configuration Methods + +### Kubernetes/Helm Deployment + +#### Method 1: ConfigMap with Certificate Files + +1. Create a ConfigMap with your CA certificates: + +```bash +kubectl create configmap custom-ca-certs \ + --namespace datachain-studio \ + --from-file=ca1.crt=/path/to/your/ca1.crt \ + --from-file=ca2.crt=/path/to/your/ca2.crt +``` + +2. Configure Helm values to mount the certificates: + +```yaml +# values.yaml +global: + customCaCerts: + enabled: true + configMapName: custom-ca-certs + +# Alternative: Inline certificates +global: + customCaCerts: + certificates: + - name: "internal-ca" + certificate: | + -----BEGIN CERTIFICATE----- + MIIDXTCCAkWgAwIBAgIJAKoK/heBjcOuMA0GCSqGSIb3DQEBBQUAMEUxCzAJBgNV + BAYTAkFVMRMwEQYDVQQIDApTb21lLVN0YXRlMSEwHwYDVQQKDBhJbnRlcm5ldCBX + ... (certificate content) ... + -----END CERTIFICATE----- + - name: "corporate-ca" + certificate: | + -----BEGIN CERTIFICATE----- + ... (another certificate) ... + -----END CERTIFICATE----- +``` + +3. Apply the configuration: + +```bash +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml +``` + +#### Method 2: Direct Certificate Configuration + +Add certificates directly to your Helm values: + +```yaml +# values.yaml +global: + customCaCerts: + - |- + -----BEGIN CERTIFICATE----- + MIIDXTCCAkWgAwIBAgIJAKoK/heBjcOuMA0GCSqGSIb3DQEBBQUAMEUxCzAJBgNV + BAYTAkFVMRMwEQYDVQQIDApTb21lLVN0YXRlMSEwHwYDVQQKDBhJbnRlcm5ldCBX + ... (rest of certificate) ... + -----END CERTIFICATE----- + - |- + -----BEGIN CERTIFICATE----- + ... (another certificate) ... + -----END CERTIFICATE----- +``` + +### AWS AMI Deployment + +For AMI deployments, configure CA certificates directly on the instance: + +#### 1. Upload CA Certificates + +```bash +# Copy certificate files to the instance +scp -i your-key.pem ca-certificates.crt ubuntu@your-instance:/tmp/ + +# SSH to the instance +ssh -i your-key.pem ubuntu@your-instance + +# Install CA certificates +sudo cp /tmp/ca-certificates.crt /usr/local/share/ca-certificates/ +sudo update-ca-certificates +``` + +#### 2. Configure DataChain Studio + +Update the configuration file: + +```yaml +# /opt/datachain-studio/config.yml +global: + ssl: + caCertificates: + - /usr/local/share/ca-certificates/ca-certificates.crt + + # Alternatively, inline certificates + customCaCerts: + - | + -----BEGIN CERTIFICATE----- + ... certificate content ... + -----END CERTIFICATE----- +``` + +#### 3. Restart Services + +```bash +sudo systemctl restart datachain-studio +``` + +## Use Cases and Examples + +### Internal GitLab Server + +Configure CA certificates for connecting to an internal GitLab server: + +```yaml +global: + customCaCerts: + - |- + -----BEGIN CERTIFICATE----- + MIIDXTCCAkWgAwIBAgIJAKoK/heBjcOuMA0GCSqGSIb3DQEBBQUAMEUxCzAJBgNV + ... (your GitLab CA certificate) ... 
+      -----END CERTIFICATE-----
+
+  git:
+    gitlab:
+      enabled: true
+      url: "https://gitlab.internal.company.com"
+      clientId: "your-gitlab-client-id"
+      clientSecret: "your-gitlab-client-secret"
+
+      # SSL verification settings
+      ssl:
+        verify: true
+        caCertificate: true # Use custom CA
+```
+
+### S3-Compatible Storage with Custom CA
+
+Configure certificates for custom S3-compatible storage:
+
+```yaml
+global:
+  customCaCerts:
+    - |-
+      -----BEGIN CERTIFICATE-----
+      ... (your storage provider's CA certificate) ...
+      -----END CERTIFICATE-----
+
+storage:
+  type: s3
+  s3:
+    endpoint: "https://s3.internal.company.com"
+    bucket: "datachain-studio-storage"
+    region: "us-east-1"
+
+    # SSL settings
+    ssl:
+      enabled: true
+      verify: true
+      caCertificate: true
+```
+
+### Corporate Proxy with Custom CA
+
+Configure certificates for accessing external services through a corporate proxy:
+
+```yaml
+global:
+  # Proxy configuration
+  proxy:
+    enabled: true
+    http: "http://proxy.company.com:8080"
+    https: "https://proxy.company.com:8080"
+
+  # CA certificates for proxy SSL
+  customCaCerts:
+    - |-
+      -----BEGIN CERTIFICATE-----
+      ... (your proxy's CA certificate) ...
+      -----END CERTIFICATE-----
+```
+
+### Multiple CA Certificates
+
+Configure multiple CA certificates for different services:
+
+```yaml
+global:
+  customCaCerts:
+    # Internal root CA
+    - |-
+      -----BEGIN CERTIFICATE-----
+      ... (internal root CA) ...
+      -----END CERTIFICATE-----
+
+    # GitLab intermediate CA
+    - |-
+      -----BEGIN CERTIFICATE-----
+      ... (GitLab intermediate CA) ...
+      -----END CERTIFICATE-----
+
+    # Storage service CA
+    - |-
+      -----BEGIN CERTIFICATE-----
+      ... (storage service CA) ...
+      -----END CERTIFICATE-----
+```
+
+## Certificate Chain Validation
+
+### Obtaining CA Certificates
+
+Get CA certificates from various sources:
+
+#### From a Website
+```bash
+# Extract certificate chain from a website
+openssl s_client -showcerts -connect gitlab.company.com:443 </dev/null | openssl x509 -outform PEM > gitlab-ca.crt
+```
+
+#### From a Certificate File
+```bash
+# Extract the first certificate from a bundle (openssl x509 reads only the first PEM block)
+openssl x509 -in certificate-bundle.crt -out ca-certificate.crt
+```
+
+#### From System Trust Store
+```bash
+# Export system CA certificates
+cat /etc/ssl/certs/ca-certificates.crt > system-cas.crt
+```
+
+### Validating Certificate Chains
+
+Test certificate validation:
+
+```bash
+# Verify certificate against CA
+openssl verify -CAfile ca-certificate.crt server-certificate.crt
+
+# Test SSL connection with custom CA
+openssl s_client -connect gitlab.company.com:443 -CAfile ca-certificate.crt
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**SSL verification errors:**
+```
+ERROR: SSL certificate verification failed
+```
+
+**Solution:**
+1. Verify the CA certificate is correct and complete
+2. Check the certificate format (PEM format required)
+3. Ensure the certificate chain is complete
+
+**Certificate format issues:**
+```
+ERROR: Invalid certificate format
+```
+
+**Solution:**
+1. Ensure certificates are in PEM format
+2. Check for proper BEGIN/END markers
+3. Validate the certificate using OpenSSL
+
+### Debugging CA Certificate Issues
+
+#### 1. Check Certificate Details
+
+```bash
+# View certificate information
+openssl x509 -in ca-certificate.crt -text -noout
+
+# Check certificate validity
+openssl x509 -in ca-certificate.crt -noout -dates
+```
+
+#### 2. Test Certificate Trust
+
+```bash
+# Test connection with custom CA
+curl --cacert ca-certificate.crt https://gitlab.company.com
+
+# Test with verbose output
+curl -v --cacert ca-certificate.crt https://gitlab.company.com
+```
+
+#### 3. Validate Certificate Chain
+
+```bash
+# Check certificate chain
+openssl verify -verbose -CAfile ca-certificate.crt intermediate.crt
+
+# Show certificate chain
+openssl s_client -connect gitlab.company.com:443 -showcerts
+```
+
+### Container-Level Debugging
+
+For Kubernetes deployments, debug CA certificate issues:
+
+```bash
+# Check if certificates are mounted correctly
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- ls -la /etc/ssl/certs/
+
+# Test certificate from within container
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- openssl s_client -connect gitlab.company.com:443 -CApath /etc/ssl/certs
+
+# Check certificate trust store
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- cat /etc/ssl/certs/ca-certificates.crt | grep -A 20 "Your CA Name"
+```
+
+## Security Best Practices
+
+### Certificate Management
+
+1. **Regular Updates**: Keep CA certificates updated
+2. **Secure Storage**: Store CA certificates securely
+3. **Access Control**: Limit access to CA certificate files
+4. **Validation**: Regularly validate certificate chains
+
+### Certificate Rotation
+
+```yaml
+# Automated certificate rotation
+global:
+  certificates:
+    rotation:
+      enabled: true
+      schedule: "0 2 * * 0" # Weekly on Sunday at 2 AM
+      backup: true
+
+    validation:
+      enabled: true
+      checkExpiry: true
+      daysBeforeExpiry: 30
+```
+
+### Monitoring and Alerting
+
+```yaml
+# Certificate monitoring
+monitoring:
+  certificates:
+    enabled: true
+    alerts:
+      - name: "Certificate Expiry Warning"
+        condition: "certificate_days_until_expiry < 30"
+        severity: "warning"
+
+      - name: "Certificate Expiry Critical"
+        condition: "certificate_days_until_expiry < 7"
+        severity: "critical"
+```
+
+## Validation and Testing
+
+### Post-Configuration Validation
+
+```bash
+# Test DataChain Studio connectivity
+curl -I https://studio.company.com
+
+# Verify certificate trust
+openssl s_client -connect studio.company.com:443 -verify_return_error
+
+# Check service logs for SSL errors
+kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i ssl
+```
+
+### Integration Testing
+
+```bash
+# Test Git integration
+curl -k https://studio.company.com/api/git/test-connection
+
+# Test storage connectivity
+curl -k https://studio.company.com/api/storage/health
+
+# Test webhook delivery
+curl -k https://studio.company.com/api/webhooks/test
+```
+
+## Next Steps
+
+- Configure [SSL/TLS certificates](ssl-tls.md) for secure communications
+- Set up [Git forge integrations](git-forges/index.md) with custom certificates
+- Review [troubleshooting guides](../troubleshooting/index.md) for SSL/TLS issues
+- Learn about [upgrading procedures](../upgrading/index.md) with certificate considerations
diff --git a/docs/studio/self-hosting/configuration/git-forges/bitbucket.md b/docs/studio/self-hosting/configuration/git-forges/bitbucket.md
new file mode 100644
index 000000000..e92eb4723
--- /dev/null
+++ b/docs/studio/self-hosting/configuration/git-forges/bitbucket.md
@@ -0,0 +1,575 @@
+# Bitbucket Configuration
+
+This guide covers how to configure DataChain Studio to integrate with Bitbucket Cloud or Bitbucket Server (Data Center).
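+
+Before changing any configuration, it can help to confirm that your Studio host can reach the Bitbucket API at all. The commands below are a minimal sketch: `bitbucket.yourcompany.com` is a placeholder, and some Bitbucket Server instances require authentication even for these endpoints:
+
+```bash
+# Bitbucket Cloud: the public API root should answer with HTTP 200
+curl -sI https://api.bitbucket.org/2.0/repositories | head -n 1
+
+# Bitbucket Server (placeholder URL): the REST API should answer
+curl -sI https://bitbucket.yourcompany.com/rest/api/1.0/application-properties | head -n 1
+```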
+ +## Overview + +DataChain Studio integrates with Bitbucket using OAuth consumers, providing: + +- **Secure Authentication**: OAuth 1.0a/2.0 user authentication +- **Repository Access**: Access to Bitbucket repositories +- **Webhook Integration**: Automatic job triggering on Git events +- **Team Synchronization**: Bitbucket workspace and team mapping + +## Prerequisites + +- Bitbucket workspace with admin access (for Bitbucket Cloud) +- Bitbucket Server with admin access (for self-hosted) +- DataChain Studio deployment ready for configuration +- Valid domain name for DataChain Studio instance + +## Bitbucket Cloud Setup + +### 1. Create OAuth Consumer + +1. Go to [Bitbucket Cloud](https://bitbucket.org) +2. Navigate to your workspace settings +3. Go to **Settings** → **OAuth consumers** +4. Click **Add consumer** + +### 2. Configure OAuth Consumer + +Fill in the consumer details: + +- **Name**: `DataChain Studio` +- **Description**: `DataChain Studio integration for data processing workflows` +- **Callback URL**: `https://studio.yourcompany.com/auth/bitbucket/callback` +- **URL**: `https://studio.yourcompany.com` +- **Permissions**: Select the following: + - **Account**: Read + - **Team membership**: Read + - **Repositories**: Read, Write (if needed) + - **Pull requests**: Read, Write (if needed) + - **Issues**: Read, Write (optional) + - **Webhooks**: Read, Write + +### 3. Save Consumer Credentials + +After creating the consumer, save: +- **Key** (Client ID) +- **Secret** (Client Secret) + +## Bitbucket Server Setup + +### 1. Create Application Link + +1. Log in to Bitbucket Server as an administrator +2. Go to **Administration** → **Application links** +3. Click **Create link** + +### 2. Configure Application Link + +- **Application URL**: `https://studio.yourcompany.com` +- **Application Name**: `DataChain Studio` +- **Application Type**: Generic Application +- **Service Provider Name**: `DataChain Studio` +- **Consumer Key**: Generate a unique key +- **Shared Secret**: Generate a secure secret +- **Request Token URL**: `https://studio.yourcompany.com/auth/bitbucket/request_token` +- **Access Token URL**: `https://studio.yourcompany.com/auth/bitbucket/access_token` +- **Authorize URL**: `https://studio.yourcompany.com/auth/bitbucket/authorize` + +## DataChain Studio Configuration + +### Bitbucket Cloud Configuration + +Add the following to your `values.yaml` file: + +```yaml +global: + git: + bitbucket: + enabled: true + type: "cloud" # or "server" for Bitbucket Server + clientId: "your-bitbucket-client-id" + clientSecret: "your-bitbucket-client-secret" + webhookSecret: "your-webhook-secret" +``` + +### Bitbucket Server Configuration + +For Bitbucket Server deployments: + +```yaml +global: + git: + bitbucket: + enabled: true + type: "server" + url: "https://bitbucket.yourcompany.com" + consumerKey: "your-consumer-key" + consumerSecret: "your-consumer-secret" + webhookSecret: "your-webhook-secret" + + # SSL configuration for Bitbucket Server + ssl: + verify: true + caCertificate: | + -----BEGIN CERTIFICATE----- + ... (your Bitbucket Server CA certificate) ... 
+ -----END CERTIFICATE----- +``` + +### Advanced Configuration + +For more complex setups: + +```yaml +global: + git: + bitbucket: + enabled: true + type: "cloud" # or "server" + url: "https://bitbucket.yourcompany.com" # Only for server + clientId: "your-client-id" + clientSecret: "your-client-secret" + webhookSecret: "your-webhook-secret" + + # OAuth configuration + oauth: + version: "2.0" # "1.0a" for older integrations + scopes: + - account + - team + - repository + - pullrequest + + # Additional OAuth parameters + redirectUri: "https://studio.yourcompany.com/auth/bitbucket/callback" + + # Webhook configuration + webhooks: + events: + - repo:push + - pullrequest:created + - pullrequest:updated + - pullrequest:approved + - pullrequest:merged + + # Webhook delivery settings + active: true + + # Rate limiting + rateLimit: + requestsPerHour: 1000 + burstSize: 50 + + # Connection settings + timeout: + connect: 30s + read: 60s + write: 30s + + # Repository access control + repositories: + # Allow specific repositories + allowList: + - "workspace/important-repo" + - "workspace/data-*" + + # Block specific repositories + blockList: + - "workspace/sensitive-repo" + + # Workspace filtering + workspaces: + allowList: + - "your-workspace" + - "partner-workspace" + blockList: + - "external-workspace" +``` + +### Secret Management + +For Kubernetes deployments, store sensitive data in secrets: + +```bash +# Create secret for Bitbucket OAuth credentials +kubectl create secret generic bitbucket-oauth \ + --namespace datachain-studio \ + --from-literal=client-id=your-client-id \ + --from-literal=client-secret=your-client-secret + +# Create secret for webhook secret +kubectl create secret generic bitbucket-webhook \ + --namespace datachain-studio \ + --from-literal=secret=your-webhook-secret +``` + +Reference secrets in configuration: + +```yaml +global: + git: + bitbucket: + enabled: true + type: "cloud" + clientIdSecret: + name: bitbucket-oauth + key: client-id + clientSecretSecret: + name: bitbucket-oauth + key: client-secret + webhookSecretSecret: + name: bitbucket-webhook + key: secret +``` + +## Webhook Configuration + +### Automatic Webhook Setup + +DataChain Studio can automatically configure webhooks: + +```yaml +global: + git: + bitbucket: + webhooks: + autoSetup: true + events: + - repo:push + - pullrequest:created + - pullrequest:updated + - pullrequest:merged + + # Additional webhook settings + active: true + skipCertVerification: false # Only for testing +``` + +### Manual Webhook Setup + +If automatic setup doesn't work, configure webhooks manually: + +#### Bitbucket Cloud: +1. Go to repository **Settings** → **Webhooks** +2. Click **Add webhook** +3. Configure: + - **Title**: `DataChain Studio` + - **URL**: `https://studio.yourcompany.com/api/webhooks/bitbucket` + - **Status**: Active + - **Triggers**: Select relevant events: + - Repository push + - Pull request created + - Pull request updated + - Pull request merged + - **Skip certificate verification**: Unchecked (unless testing) + +#### Bitbucket Server: +1. Go to repository **Settings** → **Hooks** +2. Enable **Web Post Hooks** +3. 
Configure: + - **URL**: `https://studio.yourcompany.com/api/webhooks/bitbucket` + - **Secret**: Your webhook secret + +## User Authentication + +Configure Bitbucket OAuth for user authentication: + +```yaml +global: + auth: + bitbucket: + enabled: true + type: "cloud" # or "server" + clientId: "your-oauth-client-id" + clientSecret: "your-oauth-client-secret" + + # OAuth scopes (for Cloud) + scopes: + - account + - team + - repository + + # Team synchronization + teamSync: + enabled: true + workspaceWhitelist: + - "your-workspace" +``` + +## Permissions and Access Control + +### Repository-Level Permissions + +Configure fine-grained repository access: + +```yaml +global: + git: + bitbucket: + permissions: + # Default repository permissions + default: + repository: read + pullrequest: read + + # Custom permissions for specific repositories + repositories: + "workspace/critical-repo": + repository: read + pullrequest: write + issues: read +``` + +### Team Mapping + +Map Bitbucket teams to DataChain Studio roles: + +```yaml +global: + teams: + bitbucket: + mapping: + # Bitbucket team → Studio role + "developers": "member" + "data-engineers": "member" + "administrators": "admin" + "viewers": "viewer" + + # Workspace-wide settings + defaultRole: "viewer" + syncInterval: "1h" +``` + +## Bitbucket Pipelines Integration + +### Pipeline Triggers + +Configure pipeline triggers from DataChain Studio: + +```yaml +global: + git: + bitbucket: + pipelines: + enabled: true + + # Pipeline trigger settings + triggers: + # Trigger on data changes + dataChange: + enabled: true + branch: "main" + variables: + DATACHAIN_TRIGGER: "data_change" + + # Custom pipeline variables + customVariables: + DATACHAIN_STUDIO_URL: "https://studio.yourcompany.com" + DATACHAIN_WEBHOOK_SECRET: "webhook-secret" +``` + +### Build Status Updates + +Update Bitbucket commit status from DataChain Studio jobs: + +```yaml +global: + git: + bitbucket: + buildStatus: + enabled: true + + # Status contexts + contexts: + dataProcessing: "datachain/processing" + dataValidation: "datachain/validation" + dataQuality: "datachain/quality" + + # Status details + url: "https://studio.yourcompany.com/jobs/{job_id}" + description: "DataChain data processing job" +``` + +## Monitoring and Debugging + +### Health Checks + +Monitor Bitbucket integration health: + +```yaml +monitoring: + bitbucket: + enabled: true + + healthChecks: + api: true + webhooks: true + oauth: true + + metrics: + - apiCalls + - responseTime + - errorRate + - webhookDelivery + + alerts: + - name: "Bitbucket API Errors" + condition: "bitbucket_api_error_rate > 5%" + duration: "5m" + severity: "warning" + + - name: "Bitbucket Webhook Failures" + condition: "bitbucket_webhook_failure_rate > 10%" + duration: "5m" + severity: "critical" +``` + +### Debug Configuration + +Enable debug logging for Bitbucket integration: + +```yaml +global: + logging: + level: DEBUG + components: + bitbucket: DEBUG + webhooks: DEBUG + oauth: DEBUG +``` + +## Testing the Integration + +### Test Bitbucket API Access + +#### Bitbucket Cloud: +```bash +# Test API connectivity +curl -H "Authorization: Bearer $BITBUCKET_TOKEN" \ + https://api.bitbucket.org/2.0/user + +# Test repository access +curl -H "Authorization: Bearer $BITBUCKET_TOKEN" \ + https://api.bitbucket.org/2.0/repositories/workspace +``` + +#### Bitbucket Server: +```bash +# Test API connectivity +curl -H "Authorization: Bearer $BITBUCKET_TOKEN" \ + https://bitbucket.yourcompany.com/rest/api/1.0/projects + +# Test user information +curl -H 
"Authorization: Bearer $BITBUCKET_TOKEN" \ + https://bitbucket.yourcompany.com/rest/api/1.0/users/username +``` + +### Test OAuth Flow + +```bash +# Test OAuth authorization URL (Cloud) +curl "https://bitbucket.org/site/oauth2/authorize?client_id=YOUR_CLIENT_ID&redirect_uri=https://studio.yourcompany.com/auth/bitbucket/callback&response_type=code" +``` + +### Test Webhook Delivery + +```bash +# Test webhook endpoint +curl -X POST https://studio.yourcompany.com/api/webhooks/bitbucket \ + -H "Content-Type: application/json" \ + -H "X-Event-Key: repo:push" \ + -H "X-Hook-UUID: webhook-uuid" \ + -d '{ + "push": { + "changes": [{ + "new": { + "name": "main", + "target": { + "hash": "abcdef123456" + } + } + }] + }, + "repository": { + "name": "test-repo", + "full_name": "workspace/test-repo" + } + }' +``` + +## Troubleshooting + +### Common Issues + +**OAuth authentication failures:** +- Verify client ID and secret are correct +- Check callback URL matches exactly +- Ensure required permissions are granted +- Verify OAuth version (1.0a vs 2.0) + +**API connectivity issues:** +- Test Bitbucket API endpoint accessibility +- Check SSL certificate validity +- Verify network connectivity +- Review API rate limits + +**Webhook delivery failures:** +- Confirm webhook URL is accessible +- Verify webhook secret matches +- Check SSL certificate validity +- Review webhook event configuration + +### Debug Commands + +```bash +# Check Bitbucket configuration +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml | grep -A 20 bitbucket + +# View Bitbucket-related logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i bitbucket + +# Test Bitbucket API from container +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v https://api.bitbucket.org/2.0/user + +# Test OAuth endpoint +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v https://bitbucket.org/site/oauth2/access_token +``` + +## Security Considerations + +### OAuth Security + +- Use confidential OAuth consumers +- Regularly rotate client secrets +- Limit OAuth scopes to minimum required +- Monitor OAuth token usage + +### Webhook Security + +- Always use webhook secrets +- Validate webhook signatures +- Use HTTPS for webhook URLs +- Monitor webhook delivery patterns + +### Access Control + +- Use principle of least privilege +- Regularly audit repository access +- Monitor API usage patterns +- Review team/workspace permissions + +## Migration from Other Git Forges + +When migrating from other Git forges to Bitbucket: + +1. **Export existing configuration** +2. **Set up Bitbucket OAuth consumer** +3. **Configure DataChain Studio for Bitbucket** +4. **Migrate repository connections** +5. **Update webhook configurations** +6. **Test integration thoroughly** +7. 
**Update user authentication** + +## Next Steps + +- Configure [GitHub integration](github.md) for additional Git forges +- Set up [GitLab integration](gitlab.md) if needed +- Review [SSL/TLS configuration](../ssl-tls.md) for secure communications +- Check [troubleshooting guide](../../troubleshooting/index.md) for common issues +- Configure [monitoring and alerting](../index.md#monitoring) for the integration diff --git a/docs/studio/self-hosting/configuration/git-forges/github.md b/docs/studio/self-hosting/configuration/git-forges/github.md new file mode 100644 index 000000000..020306b7e --- /dev/null +++ b/docs/studio/self-hosting/configuration/git-forges/github.md @@ -0,0 +1,472 @@ +# GitHub Configuration + +This guide covers how to configure DataChain Studio to integrate with GitHub.com or GitHub Enterprise Server. + +## Overview + +DataChain Studio integrates with GitHub using GitHub Apps, providing: + +- **Secure Authentication**: OAuth-based user authentication +- **Repository Access**: Fine-grained access to repositories +- **Webhook Integration**: Automatic job triggering on Git events +- **Team Synchronization**: GitHub organization and team mapping + +## Prerequisites + +- GitHub organization with admin access +- DataChain Studio deployment ready for configuration +- Valid domain name for DataChain Studio instance + +## GitHub App Setup + +### 1. Create GitHub App + +1. Navigate to your GitHub organization settings +2. Go to **Settings** → **Developer settings** → **GitHub Apps** +3. Click **New GitHub App** + +### 2. Configure Basic Information + +Fill in the application details: + +- **GitHub App name**: `DataChain Studio` +- **Description**: `DataChain Studio integration for data processing workflows` +- **Homepage URL**: `https://studio.yourcompany.com` +- **User authorization callback URL**: `https://studio.yourcompany.com/auth/github/callback` +- **Setup URL**: `https://studio.yourcompany.com/setup/github` +- **Webhook URL**: `https://studio.yourcompany.com/api/webhooks/github` +- **Webhook secret**: Generate a secure random string (save this for later) + +### 3. Configure Permissions + +Set the following repository permissions: + +#### Repository Permissions +- **Contents**: Read (for accessing repository files) +- **Metadata**: Read (for repository information) +- **Pull requests**: Read (for PR information) +- **Commit statuses**: Write (for updating commit status) +- **Issues**: Read (optional, for issue tracking) + +#### Organization Permissions +- **Members**: Read (for team synchronization) +- **Plan**: Read (for organization information) + +### 4. Subscribe to Events + +Enable the following webhook events: + +- **Push** - For triggering jobs on code changes +- **Pull request** - For PR-based workflows +- **Release** - For release-based deployments +- **Repository** - For repository changes +- **Installation** - For app installation changes + +### 5. Generate Private Key + +1. After creating the app, scroll down to **Private keys** +2. Click **Generate a private key** +3. Download the `.pem` file (you'll need this for configuration) + +### 6. Install the App + +1. Go to **Install App** tab +2. Install the app on your organization +3. Choose repositories (all or selected) +4. 
Complete the installation + +## DataChain Studio Configuration + +### Basic Configuration + +Add the following to your `values.yaml` file: + +```yaml +global: + git: + github: + enabled: true + appId: "123456" # Your GitHub App ID + privateKey: | + -----BEGIN RSA PRIVATE KEY----- + MIIEpAIBAAKCAQEA1234567890abcdef... + ... (your GitHub App private key) ... + -----END RSA PRIVATE KEY----- + webhookSecret: "your-webhook-secret" + + # Optional: GitHub Enterprise Server URL + # url: "https://github.enterprise.com" +``` + +### Advanced Configuration + +For more advanced setups: + +```yaml +global: + git: + github: + enabled: true + appId: "123456" + privateKey: | + -----BEGIN RSA PRIVATE KEY----- + ... (private key content) ... + -----END RSA PRIVATE KEY----- + webhookSecret: "your-webhook-secret" + + # GitHub Enterprise Server configuration + url: "https://github.enterprise.com" + apiUrl: "https://github.enterprise.com/api/v3" + + # SSL configuration for GitHub Enterprise + ssl: + verify: true + caCertificate: | + -----BEGIN CERTIFICATE----- + ... (GitHub Enterprise CA certificate) ... + -----END CERTIFICATE----- + + # Webhook configuration + webhooks: + events: + - push + - pull_request + - release + - repository + + # Custom webhook settings + deliveryTimeout: 30s + retryAttempts: 3 + + # Rate limiting + rateLimit: + requestsPerHour: 5000 + burstSize: 100 + + # Repository access control + repositories: + # Allow specific repositories + allowList: + - "org/important-repo" + - "org/data-*" + + # Block specific repositories + blockList: + - "org/sensitive-repo" + + # Organization filtering + organizations: + allowList: + - "your-org" + - "partner-org" + blockList: + - "external-org" +``` + +### Secret Management + +For Kubernetes deployments, store sensitive data in secrets: + +```bash +# Create secret for GitHub App private key +kubectl create secret generic github-app-key \ + --namespace datachain-studio \ + --from-file=private-key=/path/to/github-app.pem + +# Create secret for webhook secret +kubectl create secret generic github-webhook \ + --namespace datachain-studio \ + --from-literal=secret=your-webhook-secret +``` + +Then reference in your configuration: + +```yaml +global: + git: + github: + enabled: true + appId: "123456" + privateKeySecret: + name: github-app-key + key: private-key + webhookSecretSecret: + name: github-webhook + key: secret +``` + +## GitHub Enterprise Server + +For GitHub Enterprise Server deployments: + +```yaml +global: + git: + github: + enabled: true + appId: "your-app-id" + url: "https://github.enterprise.com" + apiUrl: "https://github.enterprise.com/api/v3" + + # Upload URL for Enterprise Server + uploadUrl: "https://github.enterprise.com/api/uploads" + + privateKey: | + -----BEGIN RSA PRIVATE KEY----- + ... private key ... + -----END RSA PRIVATE KEY----- + + # Custom CA certificate for Enterprise Server + ssl: + verify: true + caCertificate: | + -----BEGIN CERTIFICATE----- + ... Enterprise Server CA certificate ... + -----END CERTIFICATE----- +``` + +## Webhook Configuration + +### Automatic Webhook Setup + +DataChain Studio can automatically configure webhooks: + +```yaml +global: + git: + github: + webhooks: + autoSetup: true + events: + - push + - pull_request + - release + + # Webhook delivery settings + contentType: "application/json" + insecureSSL: false # Set to true only for testing + active: true +``` + +### Manual Webhook Setup + +If automatic setup doesn't work, configure webhooks manually: + +1. Go to repository **Settings** → **Webhooks** +2. 
Click **Add webhook** +3. Configure: + - **Payload URL**: `https://studio.yourcompany.com/api/webhooks/github` + - **Content type**: `application/json` + - **Secret**: Your webhook secret + - **Events**: Select individual events or "Send me everything" + - **Active**: ✓ Checked + +## User Authentication + +Configure GitHub OAuth for user authentication: + +```yaml +global: + auth: + github: + enabled: true + clientId: "your-oauth-app-client-id" + clientSecret: "your-oauth-app-client-secret" + + # OAuth scopes + scopes: + - user:email + - read:org + - repo + + # Team synchronization + teamSync: + enabled: true + organizationWhitelist: + - "your-org" +``` + +## Permissions and Access Control + +### Repository-Level Permissions + +Configure fine-grained repository access: + +```yaml +global: + git: + github: + permissions: + # Default repository permissions + default: + contents: read + metadata: read + pull_requests: read + + # Custom permissions for specific repositories + repositories: + "org/critical-repo": + contents: read + metadata: read + pull_requests: write + commit_statuses: write +``` + +### Team Mapping + +Map GitHub teams to DataChain Studio roles: + +```yaml +global: + teams: + github: + mapping: + # GitHub team slug → Studio role + "developers": "member" + "data-engineers": "member" + "admin-team": "admin" + "read-only": "viewer" + + # Organization-wide settings + defaultRole: "viewer" + syncInterval: "1h" +``` + +## Monitoring and Debugging + +### Health Checks + +Monitor GitHub integration health: + +```yaml +monitoring: + github: + enabled: true + + healthChecks: + api: true + webhooks: true + rateLimit: true + + alerts: + - name: "GitHub API Rate Limit" + condition: "github_rate_limit_remaining < 100" + severity: "warning" + + - name: "GitHub Webhook Failures" + condition: "github_webhook_failure_rate > 5%" + severity: "critical" +``` + +### Debug Configuration + +Enable debug logging for GitHub integration: + +```yaml +global: + logging: + level: DEBUG + components: + github: DEBUG + webhooks: DEBUG +``` + +## Testing the Integration + +### Test GitHub App Installation + +```bash +# Check app installation status +curl -H "Authorization: Bearer $GITHUB_TOKEN" \ + https://api.github.com/app/installations + +# Test repository access +curl -H "Authorization: Bearer $GITHUB_TOKEN" \ + https://api.github.com/installation/repositories +``` + +### Test Webhook Delivery + +```bash +# Test webhook endpoint +curl -X POST https://studio.yourcompany.com/api/webhooks/github \ + -H "Content-Type: application/json" \ + -H "X-GitHub-Event: ping" \ + -H "X-GitHub-Delivery: 12345-678-90" \ + -H "X-Hub-Signature-256: sha256=..." 
\ + -d '{"zen": "Testing webhook delivery"}' +``` + +### Validate Configuration + +```bash +# Test GitHub API connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user + +# Check webhook configuration +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i github +``` + +## Troubleshooting + +### Common Issues + +**App installation failures:** +- Verify app permissions are correct +- Check organization access settings +- Ensure webhook URL is accessible + +**Authentication errors:** +- Validate GitHub App ID and private key +- Check private key format (PEM) +- Verify webhook secret matches + +**Webhook delivery failures:** +- Check webhook URL accessibility +- Verify SSL certificate validity +- Review webhook event configuration + +### Debug Commands + +```bash +# Check GitHub App configuration +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml + +# View GitHub-related logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i github + +# Test GitHub API connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v https://api.github.com +``` + +## Security Considerations + +### Private Key Security + +- Store private keys in Kubernetes secrets +- Rotate private keys regularly +- Limit access to private key files +- Use RBAC to control secret access + +### Webhook Security + +- Always use webhook secrets +- Validate webhook signatures +- Use HTTPS for webhook URLs +- Monitor webhook delivery logs + +### Access Control + +- Use principle of least privilege +- Regularly audit app permissions +- Monitor app installation changes +- Review repository access patterns + +## Next Steps + +- Configure [GitLab integration](gitlab.md) for additional Git forges +- Set up [SSL/TLS certificates](../ssl-tls.md) for secure communications +- Review [troubleshooting guide](../../troubleshooting/index.md) for common issues +- Configure [monitoring and alerting](../index.md#monitoring) for the integration diff --git a/docs/studio/self-hosting/configuration/git-forges/gitlab.md b/docs/studio/self-hosting/configuration/git-forges/gitlab.md new file mode 100644 index 000000000..492b70ea9 --- /dev/null +++ b/docs/studio/self-hosting/configuration/git-forges/gitlab.md @@ -0,0 +1,555 @@ +# GitLab Configuration + +This guide covers how to configure DataChain Studio to integrate with GitLab.com or self-hosted GitLab instances. + +## Overview + +DataChain Studio integrates with GitLab using OAuth applications, providing: + +- **Secure Authentication**: OAuth 2.0 user authentication +- **Repository Access**: Access to GitLab repositories and projects +- **Webhook Integration**: Automatic job triggering on Git events +- **Group Synchronization**: GitLab group and project mapping + +## Prerequisites + +- GitLab instance with admin access (GitLab.com or self-hosted) +- DataChain Studio deployment ready for configuration +- Valid domain name for DataChain Studio instance + +## GitLab OAuth Application Setup + +### 1. Create OAuth Application + +#### For GitLab.com: +1. Go to [GitLab.com](https://gitlab.com) +2. Navigate to **Settings** → **Applications** +3. Click **New application** + +#### For Self-hosted GitLab: +1. Log in to your GitLab instance as an administrator +2. Go to **Admin Area** → **Applications** +3. Click **New application** + +### 2. 
Configure Application Settings + +Fill in the application details: + +- **Name**: `DataChain Studio` +- **Redirect URI**: `https://studio.yourcompany.com/auth/gitlab/callback` +- **Confidential**: ✓ Checked +- **Scopes**: Select the following: + - `read_user` - Read user information + - `read_repository` - Read repository contents + - `read_api` - Read access to API + - `write_repository` - Write access to repositories (optional) + +### 3. Save Application Credentials + +After creating the application, save: +- **Application ID** (Client ID) +- **Secret** (Client Secret) + +### 4. Configure Webhooks (Optional) + +For automatic webhook setup, ensure your GitLab user/admin has: +- Admin access to repositories/groups where webhooks will be created +- API access permissions + +## DataChain Studio Configuration + +### Basic Configuration + +Add the following to your `values.yaml` file: + +```yaml +global: + git: + gitlab: + enabled: true + url: "https://gitlab.com" # Or your GitLab instance URL + clientId: "your-gitlab-client-id" + clientSecret: "your-gitlab-client-secret" + webhookSecret: "your-webhook-secret" +``` + +### Self-hosted GitLab Configuration + +For self-hosted GitLab instances: + +```yaml +global: + git: + gitlab: + enabled: true + url: "https://gitlab.yourcompany.com" + apiUrl: "https://gitlab.yourcompany.com/api/v4" + clientId: "your-gitlab-client-id" + clientSecret: "your-gitlab-client-secret" + webhookSecret: "your-webhook-secret" + + # SSL configuration for self-hosted GitLab + ssl: + verify: true + caCertificate: | + -----BEGIN CERTIFICATE----- + ... (your GitLab instance CA certificate) ... + -----END CERTIFICATE----- +``` + +### Advanced Configuration + +For more complex setups: + +```yaml +global: + git: + gitlab: + enabled: true + url: "https://gitlab.yourcompany.com" + apiUrl: "https://gitlab.yourcompany.com/api/v4" + clientId: "your-gitlab-client-id" + clientSecret: "your-gitlab-client-secret" + webhookSecret: "your-webhook-secret" + + # OAuth configuration + oauth: + scopes: + - read_user + - read_repository + - read_api + + # Additional OAuth parameters + redirectUri: "https://studio.yourcompany.com/auth/gitlab/callback" + + # Webhook configuration + webhooks: + events: + - push + - merge_requests + - tag_push + - releases + + # Webhook delivery settings + enableSSLVerification: true + pushEventsBranchFilter: "" # All branches + + # Rate limiting + rateLimit: + requestsPerMinute: 600 + burstSize: 100 + + # Connection settings + timeout: + connect: 30s + read: 60s + write: 30s + + # Repository/project access control + projects: + # Allow specific projects + allowList: + - "group/important-project" + - "group/data-*" + + # Block specific projects + blockList: + - "group/sensitive-project" + + # Group filtering + groups: + allowList: + - "data-team" + - "engineering" + blockList: + - "external-group" +``` + +### Secret Management + +For Kubernetes deployments, store sensitive data in secrets: + +```bash +# Create secret for GitLab OAuth credentials +kubectl create secret generic gitlab-oauth \ + --namespace datachain-studio \ + --from-literal=client-id=your-client-id \ + --from-literal=client-secret=your-client-secret + +# Create secret for webhook secret +kubectl create secret generic gitlab-webhook \ + --namespace datachain-studio \ + --from-literal=secret=your-webhook-secret +``` + +Reference secrets in configuration: + +```yaml +global: + git: + gitlab: + enabled: true + url: "https://gitlab.yourcompany.com" + clientIdSecret: + name: gitlab-oauth + key: client-id + 
clientSecretSecret: + name: gitlab-oauth + key: client-secret + webhookSecretSecret: + name: gitlab-webhook + key: secret +``` + +## Webhook Configuration + +### Automatic Webhook Setup + +DataChain Studio can automatically configure webhooks: + +```yaml +global: + git: + gitlab: + webhooks: + autoSetup: true + events: + - push_events + - merge_requests_events + - tag_push_events + - releases_events + + # Additional webhook settings + issues_events: false + wiki_page_events: false + deployment_events: false + job_events: false + pipeline_events: false + + # Security settings + enable_ssl_verification: true + push_events_branch_filter: "" +``` + +### Manual Webhook Setup + +If automatic setup doesn't work, configure webhooks manually: + +#### Project-level Webhooks: +1. Go to project **Settings** → **Webhooks** +2. Add webhook with: + - **URL**: `https://studio.yourcompany.com/api/webhooks/gitlab` + - **Secret Token**: Your webhook secret + - **Trigger Events**: + - ✓ Push events + - ✓ Merge request events + - ✓ Tag push events + - ✓ Releases events + - **SSL verification**: ✓ Enable SSL verification + +#### Group-level Webhooks: +1. Go to group **Settings** → **Webhooks** +2. Configure the same settings as project-level webhooks + +## User Authentication + +Configure GitLab OAuth for user authentication: + +```yaml +global: + auth: + gitlab: + enabled: true + url: "https://gitlab.yourcompany.com" + clientId: "your-oauth-client-id" + clientSecret: "your-oauth-client-secret" + + # OAuth scopes + scopes: + - read_user + - read_repository + - read_api + + # Group synchronization + groupSync: + enabled: true + groupWhitelist: + - "data-team" + - "engineering" +``` + +## Permissions and Access Control + +### Project-Level Permissions + +Configure fine-grained project access: + +```yaml +global: + git: + gitlab: + permissions: + # Default project permissions + default: + repository: read + issues: read + merge_requests: read + + # Custom permissions for specific projects + projects: + "group/critical-project": + repository: read + issues: write + merge_requests: write + deployments: read +``` + +### Group Mapping + +Map GitLab groups to DataChain Studio roles: + +```yaml +global: + teams: + gitlab: + mapping: + # GitLab group path → Studio role + "data-engineers": "member" + "senior-engineers": "admin" + "analysts": "viewer" + "contractors": "viewer" + + # Group-wide settings + defaultRole: "viewer" + syncInterval: "1h" + + # Nested group handling + includeSubgroups: true +``` + +## GitLab CI/CD Integration + +### Pipeline Triggers + +Configure pipeline triggers from DataChain Studio: + +```yaml +global: + git: + gitlab: + ci: + enabled: true + + # Pipeline trigger settings + triggers: + # Trigger on data changes + dataChange: + enabled: true + branch: "main" + variables: + DATACHAIN_TRIGGER: "data_change" + + # Trigger on schedule + scheduled: + enabled: true + cron: "0 2 * * *" + variables: + DATACHAIN_TRIGGER: "scheduled" + + # Job monitoring + monitoring: + enabled: true + pollInterval: 30s +``` + +### Job Status Updates + +Update GitLab commit status from DataChain Studio jobs: + +```yaml +global: + git: + gitlab: + commitStatus: + enabled: true + + # Status contexts + contexts: + dataProcessing: "datachain/processing" + dataValidation: "datachain/validation" + dataQuality: "datachain/quality" + + # Status details + targetUrl: "https://studio.yourcompany.com/jobs/{job_id}" + description: "DataChain data processing job" +``` + +## Monitoring and Debugging + +### Health Checks + +Monitor 
GitLab integration health: + +```yaml +monitoring: + gitlab: + enabled: true + + healthChecks: + api: true + webhooks: true + oauth: true + + metrics: + - apiCalls + - responseTime + - errorRate + - webhookDelivery + + alerts: + - name: "GitLab API Errors" + condition: "gitlab_api_error_rate > 5%" + duration: "5m" + severity: "warning" + + - name: "GitLab Webhook Failures" + condition: "gitlab_webhook_failure_rate > 10%" + duration: "5m" + severity: "critical" +``` + +### Debug Configuration + +Enable debug logging for GitLab integration: + +```yaml +global: + logging: + level: DEBUG + components: + gitlab: DEBUG + webhooks: DEBUG + oauth: DEBUG +``` + +## Testing the Integration + +### Test GitLab API Access + +```bash +# Test API connectivity +curl -H "Authorization: Bearer $GITLAB_TOKEN" \ + https://gitlab.yourcompany.com/api/v4/user + +# Test project access +curl -H "Authorization: Bearer $GITLAB_TOKEN" \ + https://gitlab.yourcompany.com/api/v4/projects +``` + +### Test OAuth Flow + +```bash +# Test OAuth authorization URL +curl "https://gitlab.yourcompany.com/oauth/authorize?client_id=YOUR_CLIENT_ID&redirect_uri=https://studio.yourcompany.com/auth/gitlab/callback&response_type=code&scope=read_user+read_repository" +``` + +### Test Webhook Delivery + +```bash +# Test webhook endpoint +curl -X POST https://studio.yourcompany.com/api/webhooks/gitlab \ + -H "Content-Type: application/json" \ + -H "X-Gitlab-Event: Push Hook" \ + -H "X-Gitlab-Token: your-webhook-secret" \ + -d '{ + "object_kind": "push", + "ref": "refs/heads/main", + "project": { + "name": "test-project", + "web_url": "https://gitlab.yourcompany.com/group/test-project" + } + }' +``` + +## Troubleshooting + +### Common Issues + +**OAuth authentication failures:** +- Verify client ID and secret are correct +- Check redirect URI matches exactly +- Ensure required scopes are granted +- Verify GitLab instance URL is correct + +**API connectivity issues:** +- Test GitLab API endpoint accessibility +- Check SSL certificate validity +- Verify network connectivity +- Review API rate limits + +**Webhook delivery failures:** +- Confirm webhook URL is accessible +- Verify webhook secret matches +- Check SSL certificate validity +- Review webhook event configuration + +### Debug Commands + +```bash +# Check GitLab configuration +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml | grep -A 20 gitlab + +# View GitLab-related logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i gitlab + +# Test GitLab API from container +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v https://gitlab.yourcompany.com/api/v4/version + +# Test OAuth endpoint +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v https://gitlab.yourcompany.com/oauth/token +``` + +## Security Considerations + +### OAuth Security + +- Use confidential OAuth applications +- Regularly rotate client secrets +- Limit OAuth scopes to minimum required +- Monitor OAuth token usage + +### Webhook Security + +- Always use webhook secrets +- Validate webhook signatures +- Use HTTPS for webhook URLs +- Monitor webhook delivery patterns + +### Network Security + +- Use TLS for all GitLab communications +- Validate SSL certificates +- Consider IP whitelisting +- Monitor API access patterns + +## Migration from Other Git Forges + +When migrating from other Git forges to GitLab: + +1. **Export existing configuration** +2. **Set up GitLab OAuth application** +3. 
**Configure DataChain Studio for GitLab**
+4. **Migrate repository connections**
+5. **Update webhook configurations**
+6. **Test integration thoroughly**
+7. **Update user authentication**
+
+## Next Steps
+
+- Configure [GitHub integration](github.md) for additional Git forges
+- Set up [Bitbucket integration](bitbucket.md) if needed
+- Review [SSL/TLS configuration](../ssl-tls.md) for secure communications
+- Check [troubleshooting guide](../../troubleshooting/index.md) for common issues
+- Configure [monitoring and alerting](../index.md#monitoring) for the integration
diff --git a/docs/studio/self-hosting/configuration/git-forges/index.md b/docs/studio/self-hosting/configuration/git-forges/index.md
new file mode 100644
index 000000000..55b59f6f3
--- /dev/null
+++ b/docs/studio/self-hosting/configuration/git-forges/index.md
@@ -0,0 +1,353 @@
+# Git Forges Configuration
+
+This section covers how to configure DataChain Studio to integrate with various Git hosting providers (forges), including GitHub, GitLab, and Bitbucket.
+
+## Overview
+
+DataChain Studio supports integration with multiple Git forges to enable:
+
+- **Repository Access**: Connect to Git repositories for code and data
+- **Authentication**: OAuth-based user authentication
+- **Webhook Integration**: Automatic job triggering on Git events
+- **Team Management**: Synchronize teams and permissions
+
+## Supported Git Forges
+
+- **[GitHub](github.md)** - GitHub.com and GitHub Enterprise Server
+- **[GitLab](gitlab.md)** - GitLab.com and self-hosted GitLab instances
+- **[Bitbucket](bitbucket.md)** - Bitbucket Cloud and Bitbucket Server
+
+## General Configuration
+
+All Git forge integrations share common configuration patterns:
+
+### Basic Configuration Structure
+
+```yaml
+global:
+  git:
+    # GitHub configuration
+    github:
+      enabled: true
+      appId: "your-app-id"
+      privateKey: "your-private-key"
+      webhookSecret: "your-webhook-secret"
+
+    # GitLab configuration
+    gitlab:
+      enabled: true
+      url: "https://gitlab.com"
+      clientId: "your-client-id"
+      clientSecret: "your-client-secret"
+      webhookSecret: "your-webhook-secret"
+
+    # Bitbucket configuration
+    bitbucket:
+      enabled: true
+      clientId: "your-client-id"
+      clientSecret: "your-client-secret"
+      webhookSecret: "your-webhook-secret"
+```
+
+### Common Configuration Options
+
+All Git forges support these common options; `<forge>` below stands for `github`, `gitlab`, or `bitbucket`:
+
+```yaml
+git:
+  <forge>:
+    enabled: true|false
+
+    # Authentication settings
+    clientId: "oauth-client-id"
+    clientSecret: "oauth-client-secret"
+
+    # Webhook configuration
+    webhookSecret: "webhook-secret-key"
+    webhookEvents:
+      - push
+      - pull_request
+      - release
+
+    # SSL/TLS settings
+    ssl:
+      verify: true
+      caCertificate: |
+        -----BEGIN CERTIFICATE-----
+        ... custom CA certificate ...
+        -----END CERTIFICATE-----
+
+    # Rate limiting
+    rateLimit:
+      requestsPerHour: 5000
+      burstSize: 100
+
+    # Timeout settings
+    timeout:
+      connect: 30s
+      read: 60s
+      write: 30s
+```
+
+## Multi-Forge Configuration
+
+DataChain Studio can be configured to work with multiple Git forges simultaneously:
+
+```yaml
+global:
+  git:
+    # Primary forge
+    github:
+      enabled: true
+      appId: "123456"
+      privateKey: |
+        -----BEGIN RSA PRIVATE KEY-----
+        ... GitHub App private key ...
+        -----END RSA PRIVATE KEY-----
+
+    # Secondary forge for internal repositories
+    gitlab:
+      enabled: true
+      url: "https://gitlab.internal.company.com"
+      clientId: "internal-gitlab-client-id"
+      clientSecret: "internal-gitlab-secret"
+
+    # Additional forge for specific teams
+    bitbucket:
+      enabled: true
+      clientId: "bitbucket-client-id"
+      clientSecret: "bitbucket-secret"
+```
+
+## Authentication Flow
+
+### OAuth 2.0 Flow
+
+All Git forges use OAuth 2.0 for authentication:
+
+1. **User Authorization**: The user authorizes DataChain Studio to access their Git forge account
+2. **Code Exchange**: Studio exchanges the authorization code for an access token
+3. **Token Storage**: Access tokens are securely stored and used for API calls
+4. **Token Refresh**: Tokens are automatically refreshed when needed
+
+### Configuration Requirements
+
+Each forge requires specific OAuth application setup:
+
+- **Redirect URIs**: Must include Studio's callback URLs
+- **Scopes**: Appropriate permissions for repository and user access
+- **Webhook URLs**: For receiving Git events
+
+## Webhook Configuration
+
+### Automatic Webhook Setup
+
+DataChain Studio can automatically configure webhooks:
+
+```yaml
+git:
+  <forge>:
+    webhooks:
+      autoSetup: true
+      events:
+        - push
+        - pull_request
+        - release
+
+      # Custom webhook settings
+      ssl:
+        verify: true
+
+      contentType: "application/json"
+      secret: "webhook-secret-key"
+```
+
+### Manual Webhook Configuration
+
+For manual webhook setup, configure each repository with:
+
+- **Payload URL**: `https://studio.yourcompany.com/api/webhooks/<forge>`
+- **Content Type**: `application/json`
+- **Secret**: Your configured webhook secret
+- **Events**: `push`, `pull_request`, `release`
+
+## Security Configuration
+
+### SSL/TLS Configuration
+
+For self-hosted Git forges with custom certificates:
+
+```yaml
+git:
+  gitlab:
+    url: "https://gitlab.internal.company.com"
+    ssl:
+      verify: true
+      caCertificate: |
+        -----BEGIN CERTIFICATE-----
+        ... your internal CA certificate ...
+        -----END CERTIFICATE-----
+```
+
+### Access Control
+
+Configure repository access patterns:
+
+```yaml
+git:
+  <forge>:
+    access:
+      # Repository filtering
+      repositories:
+        allowed:
+          - "org/allowed-repo"
+          - "org/*-public"
+        blocked:
+          - "org/sensitive-repo"
+
+      # User/organization filtering
+      organizations:
+        allowed:
+          - "your-org"
+          - "partner-org"
+        blocked:
+          - "external-org"
+```
+
+## Error Handling and Retry Logic
+
+Configure resilient Git forge connections:
+
+```yaml
+git:
+  <forge>:
+    retry:
+      enabled: true
+      maxAttempts: 3
+      initialDelay: 1s
+      maxDelay: 30s
+      exponentialBackoff: true
+
+    circuitBreaker:
+      enabled: true
+      failureThreshold: 10
+      recoveryTimeout: 60s
+```
+
+## Monitoring and Alerting
+
+Monitor Git forge integrations:
+
+```yaml
+monitoring:
+  gitForges:
+    enabled: true
+
+    healthChecks:
+      enabled: true
+      interval: 30s
+      timeout: 10s
+
+    metrics:
+      - apiCalls
+      - responseTime
+      - errorRate
+      - webhookDelivery
+
+    alerts:
+      - name: "Git Forge API Error Rate High"
+        condition: "error_rate > 5%"
+        duration: "5m"
+        severity: "warning"
+```
+
+## Testing Configuration
+
+### Connectivity Testing
+
+Test Git forge connections:
+
+```bash
+# Test GitHub connection
+curl -k https://studio.yourcompany.com/api/git/github/test
+
+# Test GitLab connection
+curl -k https://studio.yourcompany.com/api/git/gitlab/test
+
+# Test webhook delivery
+curl -X POST https://studio.yourcompany.com/api/webhooks/github \
+  -H "Content-Type: application/json" \
+  -H "X-GitHub-Event: ping" \
+  -d '{"zen": "Test webhook"}'
+```
+
+### Configuration Validation
+
+Validate configuration before deployment:
+
+```bash
+# Render and validate Helm configuration
+helm template datachain-studio ./chart \
+  --values values.yaml \
+  --debug
+
+# Test OAuth flow
+curl "https://github.com/login/oauth/authorize?client_id=YOUR_CLIENT_ID&redirect_uri=https://studio.yourcompany.com/auth/github/callback"
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**OAuth authentication failures:**
+- Verify client ID and secret
+- Check redirect URI configuration
+- Ensure proper scopes are granted
+
+**Webhook delivery failures:**
+- Verify webhook secret matches
+- Check webhook URL accessibility
+- Review webhook event configuration
+
+**API rate limiting:**
+- Monitor API usage
+- Implement proper caching
+- Configure rate limit settings
+
+### Debug Commands
+
+```bash
+# Check Git forge connectivity
+kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i git
+
+# Test OAuth flow
+kubectl port-forward service/datachain-studio-frontend 8080:80 -n datachain-studio
+
+# Verify webhook configuration
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- curl -I https://api.github.com
+```
+
+## Migration Between Forges
+
+When migrating between Git forges:
+
+1. **Export Configuration**: Back up existing Git forge settings
+2. **Configure New Forge**: Set up authentication with the new provider
+3. **Update Repositories**: Migrate repository connections
+4. **Test Integration**: Verify all functionality works
+5. **Update Webhooks**: Reconfigure webhook endpoints
+6. 
**Cleanup**: Remove old forge configuration + +## Next Steps + +Choose your Git forge for detailed configuration: + +- **[GitHub Configuration](github.md)** - Set up GitHub.com or GitHub Enterprise +- **[GitLab Configuration](gitlab.md)** - Configure GitLab.com or self-hosted GitLab +- **[Bitbucket Configuration](bitbucket.md)** - Integrate with Bitbucket Cloud or Server + +For additional configuration options: + +- [SSL/TLS Configuration](../ssl-tls.md) for secure connections +- [CA Certificates](../ca-certificates.md) for custom certificate authorities +- [Troubleshooting Guide](../../troubleshooting/index.md) for common issues diff --git a/docs/studio/self-hosting/configuration/index.md b/docs/studio/self-hosting/configuration/index.md new file mode 100644 index 000000000..b283fa938 --- /dev/null +++ b/docs/studio/self-hosting/configuration/index.md @@ -0,0 +1,415 @@ +# Configuration + +This section covers how to configure your self-hosted DataChain Studio instance for optimal performance, security, and integration with your infrastructure. + +## Overview + +DataChain Studio configuration involves several key areas: + +- **[SSL/TLS Configuration](ssl-tls.md)** - Set up secure HTTPS connections +- **[CA Certificates](ca-certificates.md)** - Configure custom certificate authorities +- **[Git Forges](git-forges/index.md)** - Integrate with GitHub, GitLab, and Bitbucket + +## Basic Configuration + +### Environment Variables + +DataChain Studio can be configured using environment variables: + +```yaml +global: + envVars: + # Basic settings + DATACHAIN_STUDIO_URL: "https://studio.yourcompany.com" + DATACHAIN_STUDIO_SECRET_KEY: "your-secret-key" + + # Database settings + DATABASE_URL: "postgresql://user:pass@host:5432/datachain_studio" + REDIS_URL: "redis://host:6379" + + # Storage settings + STORAGE_TYPE: "s3" + S3_BUCKET: "your-studio-bucket" + S3_REGION: "us-east-1" + + # Git integration + GITHUB_APP_ID: "your-github-app-id" + GITLAB_CLIENT_ID: "your-gitlab-client-id" +``` + +### Configuration File + +For more complex configurations, use a YAML configuration file: + +```yaml +# values.yaml +global: + domain: studio.yourcompany.com + + # Security settings + security: + secretKey: "your-long-random-secret-key" + sessionTimeout: 3600 + csrfProtection: true + + # Feature flags + features: + webhooks: true + apiAccess: true + teamCollaboration: true + ssoIntegration: true + +# Database configuration +database: + type: postgresql + host: postgres.yourcompany.com + port: 5432 + name: datachain_studio + user: studio_user + password: secure-password + sslMode: require + + # Connection pooling + pool: + minConnections: 5 + maxConnections: 20 + +# Cache configuration +cache: + type: redis + host: redis.yourcompany.com + port: 6379 + password: redis-password + database: 0 + + # TTL settings + ttl: + sessions: 3600 + apiCache: 300 + dataCache: 1800 + +# Storage configuration +storage: + type: s3 + config: + bucket: datachain-studio-storage + region: us-east-1 + accessKey: your-access-key + secretKey: your-secret-key + endpoint: s3.amazonaws.com + + # Alternative: Google Cloud Storage + # type: gcs + # config: + # bucket: datachain-studio-storage + # projectId: your-project-id + # keyFile: /path/to/service-account.json + +# Logging configuration +logging: + level: INFO + format: json + outputs: + - console + - file + + # Log rotation + rotation: + maxSize: 100MB + maxAge: 30 + maxBackups: 10 +``` + +## Advanced Configuration + +### Performance Tuning + +```yaml +# Performance settings +performance: + # 
Worker processes + workers: + frontend: 4 + backend: 8 + jobProcessor: 2 + + # Memory limits + memory: + frontend: "1Gi" + backend: "2Gi" + jobProcessor: "4Gi" + + # CPU limits + cpu: + frontend: "500m" + backend: "1000m" + jobProcessor: "2000m" + + # Caching + cache: + enabled: true + size: "512Mi" + evictionPolicy: "lru" +``` + +### Security Configuration {#security} + +```yaml +# Security settings +security: + # Authentication + auth: + methods: + - local + - oauth + - saml + + # Password policy + passwordPolicy: + minLength: 8 + requireUppercase: true + requireLowercase: true + requireNumbers: true + requireSpecialChars: true + + # Session management + sessions: + timeout: 3600 + renewalThreshold: 300 + maxConcurrent: 5 + + # Network security + network: + allowedIPs: + - "10.0.0.0/8" + - "192.168.0.0/16" + + rateLimiting: + enabled: true + requestsPerMinute: 100 + burstSize: 20 + + # Data encryption + encryption: + atRest: + enabled: true + algorithm: "AES-256-GCM" + + inTransit: + enabled: true + minTlsVersion: "1.2" +``` + +### Integration Configuration + +```yaml +# External integrations +integrations: + # Git forges + git: + github: + enabled: true + appId: "123456" + privateKeyPath: "/etc/ssl/private/github.pem" + webhookSecret: "github-webhook-secret" + + gitlab: + enabled: true + url: "https://gitlab.yourcompany.com" + clientId: "gitlab-client-id" + clientSecret: "gitlab-client-secret" + webhookSecret: "gitlab-webhook-secret" + + bitbucket: + enabled: true + clientId: "bitbucket-client-id" + clientSecret: "bitbucket-client-secret" + + # Monitoring + monitoring: + prometheus: + enabled: true + endpoint: "/metrics" + port: 9090 + + grafana: + enabled: true + url: "https://grafana.yourcompany.com" + + alerts: + slack: + enabled: true + webhookUrl: "https://hooks.slack.com/..." 
+ channel: "#datachain-alerts" + + email: + enabled: true + smtpHost: "smtp.yourcompany.com" + smtpPort: 587 + from: "datachain-studio@yourcompany.com" +``` + +### Backup Configuration + +```yaml +# Backup settings +backup: + enabled: true + + # Database backups + database: + enabled: true + schedule: "0 2 * * *" # Daily at 2 AM + retention: 30 # days + compression: true + + destination: + type: s3 + bucket: datachain-studio-backups + path: database/ + + # Storage backups + storage: + enabled: true + schedule: "0 3 * * 0" # Weekly on Sunday at 3 AM + retention: 12 # weeks + + destination: + type: s3 + bucket: datachain-studio-backups + path: storage/ +``` + +## Monitoring Configuration {#monitoring} + +### Metrics and Alerting + +```yaml +# Monitoring configuration +monitoring: + # Metrics collection + metrics: + enabled: true + interval: 30s + + collectors: + - system + - application + - database + - cache + - storage + + # Health checks + healthChecks: + enabled: true + interval: 10s + timeout: 5s + + endpoints: + - /health/live + - /health/ready + - /health/database + - /health/cache + + # Alerting rules + alerts: + rules: + - name: "High CPU Usage" + condition: "cpu_usage > 80" + duration: "5m" + severity: "warning" + + - name: "Database Connection Failed" + condition: "database_health == 0" + duration: "1m" + severity: "critical" + + - name: "Storage Full" + condition: "storage_usage > 90" + duration: "5m" + severity: "critical" +``` + +## Validation + +### Configuration Validation + +Validate your configuration before deployment: + +```bash +# For Helm deployments +helm template datachain-studio ./chart \ + --values values.yaml \ + --dry-run + +# For direct deployments +datachain-studio validate-config config.yaml +``` + +### Health Checks + +Monitor your configuration post-deployment: + +```bash +# Check service health +curl https://studio.yourcompany.com/health + +# Check database connectivity +curl https://studio.yourcompany.com/health/database + +# Check storage connectivity +curl https://studio.yourcompany.com/health/storage +``` + +## Troubleshooting + +### Common Configuration Issues + +**Database connection failures:** +- Verify connection string format +- Check network connectivity +- Confirm credentials and permissions + +**SSL/TLS certificate issues:** +- Validate certificate chain +- Check certificate expiration +- Verify domain name matches + +**Storage access problems:** +- Confirm bucket permissions +- Check access key validity +- Verify network connectivity + +### Configuration Testing + +```yaml +# Test configuration +test: + enabled: true + + # Unit tests + unit: + database: true + cache: true + storage: true + auth: true + + # Integration tests + integration: + gitForges: true + webhooks: true + api: true + + # Load tests + load: + enabled: false + users: 100 + duration: "10m" +``` + +## Next Steps + +- Configure [SSL/TLS certificates](ssl-tls.md) +- Set up [Git forge integrations](git-forges/index.md) +- Review [upgrading procedures](../upgrading/index.md) +- Check [troubleshooting guides](../troubleshooting/index.md) diff --git a/docs/studio/self-hosting/configuration/ssl-tls.md b/docs/studio/self-hosting/configuration/ssl-tls.md new file mode 100644 index 000000000..217d4bc42 --- /dev/null +++ b/docs/studio/self-hosting/configuration/ssl-tls.md @@ -0,0 +1,384 @@ +# SSL/TLS Configuration + +This guide covers how to configure SSL/TLS certificates for secure HTTPS access to your self-hosted DataChain Studio instance. 
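+
+Before changing anything, it helps to confirm what the certificate you plan to install actually contains. A minimal sketch (assuming the certificate sits in a local `tls.crt`; the filename is illustrative, and the `-ext` flag requires OpenSSL 1.1.1 or newer):
+
+```bash
+# Print subject, validity window, and SANs; the SAN list must cover the Studio domain
+openssl x509 -in tls.crt -noout -subject -dates -ext subjectAltName
+```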
+ +## Overview + +DataChain Studio supports both SSL and TLS certificate configurations for secure communication. This includes: + +- **SSL Certificates**: Traditional SSL certificate configuration +- **TLS Certificates**: Modern TLS certificate setup (recommended) +- **Certificate Management**: Automated certificate renewal and validation +- **Security Hardening**: Advanced SSL/TLS security configurations + +## TLS Certificate Configuration (Recommended) + +### Prerequisites + +- Valid domain name pointing to your DataChain Studio instance +- TLS certificate and private key files +- Proper DNS resolution + +### Kubernetes/Helm Deployment + +For Kubernetes deployments, create a TLS secret and configure the ingress: + +#### 1. Create TLS Secret + +```bash +kubectl create secret tls datachain-studio-tls \ + --namespace datachain-studio \ + --cert=path/to/tls.crt \ + --key=path/to/tls.key +``` + +#### 2. Configure Helm Values + +Add the following to your `values.yaml`: + +```yaml +ingress: + enabled: true + className: nginx + + # TLS configuration + tls: + enabled: true + secretName: datachain-studio-tls + + # Security annotations + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "true" + nginx.ingress.kubernetes.io/force-ssl-redirect: "true" + nginx.ingress.kubernetes.io/ssl-ciphers: "ECDHE-RSA-AES128-GCM-SHA256,ECDHE-RSA-AES256-GCM-SHA384" + nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3" + +global: + # Enforce HTTPS + tls: + enabled: true + minVersion: "1.2" + cipherSuites: + - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" + - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" + - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305" +``` + +#### 3. Apply Configuration + +```bash +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml \ + --wait +``` + +### AWS AMI Deployment + +For AWS AMI deployments, configure SSL/TLS directly on the instance: + +#### 1. Upload Certificates + +```bash +# Copy certificate files to the instance +scp -i your-key.pem tls.crt ubuntu@your-instance:/tmp/ +scp -i your-key.pem tls.key ubuntu@your-instance:/tmp/ + +# SSH to the instance +ssh -i your-key.pem ubuntu@your-instance + +# Move certificates to proper location +sudo mkdir -p /etc/ssl/datachain-studio/ +sudo mv /tmp/tls.crt /etc/ssl/datachain-studio/ +sudo mv /tmp/tls.key /etc/ssl/datachain-studio/ +sudo chown root:root /etc/ssl/datachain-studio/* +sudo chmod 644 /etc/ssl/datachain-studio/tls.crt +sudo chmod 600 /etc/ssl/datachain-studio/tls.key +``` + +#### 2. Configure DataChain Studio + +Update the configuration file: + +```yaml +# /opt/datachain-studio/config.yml +global: + domain: studio.yourcompany.com + + tls: + enabled: true + certFile: /etc/ssl/datachain-studio/tls.crt + keyFile: /etc/ssl/datachain-studio/tls.key + minVersion: "1.2" + + # Nginx SSL configuration + nginx: + ssl: + protocols: "TLSv1.2 TLSv1.3" + ciphers: "ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384" + prefer_server_ciphers: "on" + session_cache: "shared:SSL:10m" + session_timeout: "10m" +``` + +#### 3. 
Restart Services + +```bash +sudo systemctl restart datachain-studio +``` + +## SSL Certificate Configuration (Legacy) + +### Self-Signed Certificates + +For development or internal use, you can create self-signed certificates: + +```bash +# Generate private key +openssl genrsa -out studio.key 2048 + +# Generate certificate signing request +openssl req -new -key studio.key -out studio.csr \ + -subj "/C=US/ST=State/L=City/O=Organization/CN=studio.yourcompany.com" + +# Generate self-signed certificate +openssl x509 -req -days 365 -in studio.csr -signkey studio.key -out studio.crt + +# Clean up CSR +rm studio.csr +``` + +### Certificate Authority (CA) Signed Certificates + +For production use, obtain certificates from a trusted CA: + +#### 1. Generate Certificate Signing Request + +```bash +openssl genrsa -out studio.key 2048 +openssl req -new -key studio.key -out studio.csr +``` + +#### 2. Submit CSR to Certificate Authority + +Submit the CSR to your chosen CA (Let's Encrypt, DigiCert, etc.) and obtain the signed certificate. + +#### 3. Install Certificates + +Follow the same installation process as described in the TLS section above. + +## Let's Encrypt Integration + +### Automatic Certificate Management + +For Kubernetes deployments, use cert-manager for automatic Let's Encrypt certificates: + +#### 1. Install cert-manager + +```bash +kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml +``` + +#### 2. Create ClusterIssuer + +```yaml +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-prod +spec: + acme: + server: https://acme-v02.api.letsencrypt.org/directory + email: admin@yourcompany.com + privateKeySecretRef: + name: letsencrypt-prod + solvers: + - http01: + ingress: + class: nginx +``` + +#### 3. 
Configure Ingress for Auto-SSL + +```yaml +ingress: + enabled: true + className: nginx + + annotations: + cert-manager.io/cluster-issuer: "letsencrypt-prod" + nginx.ingress.kubernetes.io/ssl-redirect: "true" + + tls: + enabled: true + secretName: datachain-studio-tls-auto +``` + +### Manual Let's Encrypt (Certbot) + +For AMI deployments, use certbot for Let's Encrypt certificates: + +```bash +# Install certbot +sudo apt update +sudo apt install certbot + +# Obtain certificate (requires port 80 to be accessible) +sudo certbot certonly --standalone \ + -d studio.yourcompany.com \ + --email admin@yourcompany.com \ + --agree-tos \ + --no-eff-email + +# Certificates will be available at: +# /etc/letsencrypt/live/studio.yourcompany.com/fullchain.pem +# /etc/letsencrypt/live/studio.yourcompany.com/privkey.pem +``` + +## Certificate Validation + +### Testing SSL/TLS Configuration + +```bash +# Test SSL certificate +openssl s_client -connect studio.yourcompany.com:443 -servername studio.yourcompany.com + +# Check certificate expiration +echo | openssl s_client -connect studio.yourcompany.com:443 2>/dev/null | openssl x509 -dates -noout + +# Test SSL configuration +curl -I https://studio.yourcompany.com + +# Detailed SSL test (requires ssllabs-scan tool) +ssllabs-scan studio.yourcompany.com +``` + +### Certificate Chain Validation + +```bash +# Verify certificate chain +openssl verify -CAfile ca-bundle.crt studio.crt + +# Check certificate details +openssl x509 -in studio.crt -text -noout +``` + +## Security Hardening + +### Advanced TLS Configuration + +```yaml +# Advanced security configuration +global: + tls: + enabled: true + minVersion: "1.2" + maxVersion: "1.3" + + # Strong cipher suites only + cipherSuites: + - "TLS_AES_128_GCM_SHA256" + - "TLS_AES_256_GCM_SHA384" + - "TLS_CHACHA20_POLY1305_SHA256" + - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" + - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" + + # HSTS configuration + hsts: + enabled: true + maxAge: 31536000 # 1 year + includeSubdomains: true + preload: true + + # OCSP stapling + ocsp: + enabled: true + cache: true +``` + +### Security Headers + +```yaml +# Additional security headers +ingress: + annotations: + nginx.ingress.kubernetes.io/configuration-snippet: | + add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always; + add_header X-Content-Type-Options "nosniff" always; + add_header X-Frame-Options "DENY" always; + add_header X-XSS-Protection "1; mode=block" always; + add_header Referrer-Policy "strict-origin-when-cross-origin" always; +``` + +## Certificate Renewal + +### Automated Renewal + +For Let's Encrypt certificates, set up automatic renewal: + +```bash +# Add cron job for certificate renewal +echo "0 12 * * * /usr/bin/certbot renew --quiet" | sudo crontab - +``` + +### Renewal Monitoring + +```bash +# Check certificate expiration +openssl x509 -in /path/to/certificate.crt -noout -dates + +# Set up expiration monitoring +#!/bin/bash +CERT_FILE="/etc/ssl/datachain-studio/tls.crt" +EXPIRY_DATE=$(openssl x509 -in $CERT_FILE -noout -enddate | cut -d= -f2) +EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s) +CURRENT_EPOCH=$(date +%s) +DAYS_LEFT=$(( ($EXPIRY_EPOCH - $CURRENT_EPOCH) / 86400 )) + +if [ $DAYS_LEFT -lt 30 ]; then + echo "Certificate expires in $DAYS_LEFT days!" 
+ # Send alert +fi +``` + +## Troubleshooting + +### Common SSL/TLS Issues + +**Certificate not trusted:** +- Verify certificate chain is complete +- Check CA bundle includes intermediate certificates +- Ensure root CA is trusted by client systems + +**TLS handshake failures:** +- Check cipher suite compatibility +- Verify TLS version support +- Review server and client configurations + +**Mixed content warnings:** +- Ensure all resources load over HTTPS +- Update HTTP references to use HTTPS +- Configure proper redirects + +### Debug Commands + +```bash +# Check certificate chain +openssl s_client -connect studio.yourcompany.com:443 -showcerts + +# Test specific TLS version +openssl s_client -connect studio.yourcompany.com:443 -tls1_2 + +# Check cipher suites +nmap --script ssl-enum-ciphers -p 443 studio.yourcompany.com + +# Monitor SSL logs +kubectl logs -f deployment/datachain-studio-frontend -n datachain-studio | grep -i ssl +``` + +## Next Steps + +- Configure [CA certificates](ca-certificates.md) for custom certificate authorities +- Set up [Git forge integrations](git-forges/index.md) +- Review [troubleshooting guides](../troubleshooting/index.md) +- Learn about [upgrading procedures](../upgrading/index.md) diff --git a/docs/studio/self-hosting/index.md b/docs/studio/self-hosting/index.md new file mode 100644 index 000000000..2a3f87a0c --- /dev/null +++ b/docs/studio/self-hosting/index.md @@ -0,0 +1,89 @@ +# Self-hosting DataChain Studio + +DataChain Studio Enterprise users can host DataChain Studio on their own infrastructure (on-premises) or in their cloud accounts. + +Please note that our support is needed to make DataChain Studio's cloud/Docker images available to you to enable installation. + +Below are the supported installation methods: + +- [AMI (AWS)](installation/aws-ami.md) +- [Kubernetes (Helm)](installation/k8s-helm.md) + +## System requirements + +### VM (AMI) + +Recommended requirements: + +- 32 GB RAM +- 4 vCPUs +- 100 GB disk space + +### Helm + +We recommend deploying DataChain Studio in an auto-scaling node group with a minimum of 2 nodes. + +Each node should have at least 16 GB of RAM and 4 vCPUs. + +Additionally, you'll need 100 GB of block storage for DataChain Studio's `PersistentVolume` + +## DataChain Studio's architecture + +DataChain Studio is composed of four pieces: + +- **Frontend Server**: Renders the web interface +- **Backend Server**: Stores all user information and handles API requests +- **Celery Beat**: Coordinates background tasks and job scheduling +- **Celery Worker**: Processes background tasks and data processing jobs + +## Key Features of Self-hosted Studio + +### Security and Privacy +- **Data Sovereignty**: Keep all data within your infrastructure +- **Network Isolation**: Deploy in private networks and VPCs +- **Access Control**: Integrate with your existing authentication systems +- **Compliance**: Meet regulatory requirements for data handling + +### Customization +- **Custom Domains**: Use your own domain names and SSL certificates +- **Branding**: Customize the interface with your organization's branding +- **Resource Management**: Control computational resources and scaling +- **Integration**: Connect with internal systems and tools + +### Administration +- **User Management**: Centralized user and team administration +- **Monitoring**: Built-in monitoring and alerting capabilities +- **Backup**: Automated backup and disaster recovery options +- **Updates**: Controlled update process for new features + +## Getting Started + +1. 
**[Installation](installation/index.md)** - Choose your deployment method +2. **[Configuration](configuration/index.md)** - Configure Studio for your environment +3. **[Upgrading](upgrading/index.md)** - Keep your installation up to date +4. **[Troubleshooting](troubleshooting/index.md)** - Resolve common issues + +## Support + +For self-hosting support: + +- Contact our enterprise support team +- Review the [troubleshooting guide](troubleshooting/index.md) +- Check the [configuration documentation](configuration/index.md) + +## Prerequisites + +Before installing DataChain Studio: + +- **Enterprise License**: Self-hosting requires an Enterprise license +- **Infrastructure Access**: Administrative access to your deployment environment +- **SSL Certificates**: Valid SSL certificates for secure communication +- **Database**: PostgreSQL database for storing application data +- **Storage**: Object storage for data and artifacts (S3, GCS, Azure Blob) + +## Next Steps + +- Review [installation options](installation/index.md) +- Plan your [configuration settings](configuration/index.md) +- Set up [monitoring and alerting](troubleshooting/index.md) +- Configure [authentication and access control](configuration/index.md) diff --git a/docs/studio/self-hosting/installation/aws-ami.md b/docs/studio/self-hosting/installation/aws-ami.md new file mode 100644 index 000000000..4eb5a4bdc --- /dev/null +++ b/docs/studio/self-hosting/installation/aws-ami.md @@ -0,0 +1,198 @@ +# AWS AMI Installation + +## Prerequisites + +### DataChain Studio Images + +The DataChain Studio machine image (AMI) and access to the DataChain Studio Docker images need to be provided by the DataChain team to enable the installation. + +### DNS + +Create a DNS record pointing to the IP address of the EC2 instance. This hostname will be used for DataChain Studio. + +## Installation + +1. Open the AWS Console + +2. Navigate to EC2 -> Instances + +3. Click **Launch instances** + +4. Provide a name for your EC2 instance + +5. Select **datachain-studio-selfhosted** from the AMI catalog + +6. Select an appropriate instance type. + - Minimum requirements: 16 GB RAM, 4 vCPUs + - Recommended requirements: 32 GB RAM, 8 vCPUs + +7. To enable SSH connections to the instance, select an existing key pair to use or create a new one. We recommend ED25519 keys. + +8. In the network settings, use either the default VPC or change it to a desired one. Under the Firewall setting, create a new security group with SSH, HTTP, and HTTPS access or use an existing one with the same level of access. + +!!! warning + It's important to ensure that your VPC has connectivity to your Git forge provider (GitHub.com, GitLab.com, Bitbucket.org) and your storage provider (S3, GCS, etc.), to ensure DataChain Studio can access these resources. + +9. Configure storage: + - Use at least 100 GB of EBS storage + - Consider using GP3 for better performance + - Enable encryption for security + +10. Launch the instance + +## Configuration + +Once the instance is running, you need to configure DataChain Studio: + +### Initial Setup + +1. SSH into the instance: + ```bash + ssh -i your-key.pem ubuntu@your-instance-ip + ``` + +2. Navigate to the configuration directory: + ```bash + cd /opt/datachain-studio + ``` + +3. Copy the example configuration: + ```bash + sudo cp config.example.yml config.yml + ``` + +4. 
Edit the configuration file: + ```bash + sudo nano config.yml + ``` + +### Configuration Parameters + +Edit the following parameters in `config.yml`: + +```yaml +# Basic configuration +domain: your-studio-domain.com +ssl: + enabled: true + cert_path: /etc/ssl/certs/studio.crt + key_path: /etc/ssl/private/studio.key + +# Database configuration +database: + host: localhost + port: 5432 + name: datachain_studio + user: studio + password: your-secure-password + +# Storage configuration +storage: + type: s3 + bucket: your-studio-bucket + region: us-east-1 + access_key: your-access-key + secret_key: your-secret-key + +# Git forge configuration +git: + github: + enabled: true + app_id: your-github-app-id + private_key_path: /etc/studio/github-private-key.pem + gitlab: + enabled: true + url: https://gitlab.com + app_id: your-gitlab-app-id + secret: your-gitlab-secret +``` + +### SSL Configuration + +1. Upload your SSL certificate and private key to the instance +2. Update the paths in the configuration file +3. Ensure proper file permissions: + ```bash + sudo chmod 600 /etc/ssl/private/studio.key + sudo chmod 644 /etc/ssl/certs/studio.crt + ``` + +### Start Services + +1. Start DataChain Studio services: + ```bash + sudo systemctl enable datachain-studio + sudo systemctl start datachain-studio + ``` + +2. Check service status: + ```bash + sudo systemctl status datachain-studio + ``` + +3. View logs: + ```bash + sudo journalctl -u datachain-studio -f + ``` + +## Verification + +1. Access DataChain Studio at `https://your-domain.com` +2. Check that all services are running: + ```bash + sudo docker ps + ``` +3. Verify database connectivity: + ```bash + sudo docker exec -it studio-db psql -U studio -d datachain_studio -c "SELECT version();" + ``` + +## Security Considerations + +### Network Security +- Use security groups to restrict access +- Enable VPC flow logs for monitoring +- Consider using AWS WAF for web application protection + +### Data Security +- Enable EBS encryption +- Use IAM roles instead of access keys where possible +- Regularly rotate secrets and keys +- Enable CloudTrail for audit logging + +### Backup Strategy +- Set up automated EBS snapshots +- Configure database backups +- Test restore procedures regularly + +## Troubleshooting + +### Common Issues + +**Services won't start:** +- Check configuration file syntax +- Verify SSL certificate paths and permissions +- Check Docker service status + +**Cannot access Studio:** +- Verify DNS resolution +- Check security group rules +- Confirm SSL certificate validity + +**Database connection issues:** +- Check database service status +- Verify connection parameters +- Check database logs + +### Getting Help + +- Check service logs: `sudo journalctl -u datachain-studio` +- Review configuration: `sudo cat /opt/datachain-studio/config.yml` +- Contact support with instance details and error messages + +## Next Steps + +- [Configure additional settings](../configuration/index.md) +- [Set up Git forge connections](../configuration/git-forges/index.md) +- [Configure SSL/TLS](../configuration/ssl-tls.md) +- [Learn about upgrading](../upgrading/index.md) diff --git a/docs/studio/self-hosting/installation/index.md b/docs/studio/self-hosting/installation/index.md new file mode 100644 index 000000000..f9c9da655 --- /dev/null +++ b/docs/studio/self-hosting/installation/index.md @@ -0,0 +1,106 @@ +# Installation + +DataChain Studio supports multiple installation methods to accommodate different infrastructure requirements and preferences. 
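+
+Whichever method you choose, a quick sanity check of local tooling and DNS up front avoids surprises later. A minimal sketch (assuming the Kubernetes path and the example domain used throughout these docs):
+
+```bash
+# Confirm client tooling (Helm 3.x is required for the Kubernetes path)
+kubectl version --client
+helm version --short
+
+# Confirm the Studio domain already resolves before wiring up ingress and TLS
+nslookup studio.yourcompany.com
+```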
+ +## Installation Methods + +### AWS AMI +Deploy DataChain Studio using a pre-configured Amazon Machine Image (AMI) for quick setup on AWS. + +**Best for:** +- Quick proof-of-concept deployments +- Single-instance installations +- Teams familiar with AWS EC2 + +[Get started with AWS AMI installation →](aws-ami.md) + +### Kubernetes (Helm) +Deploy DataChain Studio on Kubernetes using Helm charts for scalable, production-ready installations. + +**Best for:** +- Production deployments +- Scalable installations +- Teams with Kubernetes expertise +- Multi-environment deployments + +[Get started with Kubernetes installation →](k8s-helm.md) + +## Choosing Your Installation Method + +| Feature | AWS AMI | Kubernetes (Helm) | +|---------|---------|-------------------| +| **Setup Complexity** | Low | Medium | +| **Scalability** | Limited | High | +| **High Availability** | Manual setup | Built-in | +| **Resource Management** | Manual | Automatic | +| **Monitoring** | Basic | Advanced | +| **Backup/Recovery** | Manual | Automated | +| **Multi-environment** | Limited | Excellent | + +## Prerequisites + +Before installing DataChain Studio, ensure you have: + +### Infrastructure Requirements +- **Compute Resources**: See [system requirements](../index.md#system-requirements) +- **Network Access**: Internet connectivity for downloading dependencies +- **SSL Certificates**: Valid certificates for secure HTTPS communication +- **Domain Name**: Custom domain for accessing Studio + +### Dependencies +- **Database**: PostgreSQL 12+ for application data +- **Object Storage**: S3, GCS, or Azure Blob Storage for data and artifacts +- **Redis**: For caching and session management (optional but recommended) + +### Access Requirements +- **Administrative Access**: Full access to your deployment environment +- **DNS Control**: Ability to configure DNS records for your domain +- **Certificate Management**: Access to SSL certificate management + +## Planning Your Installation + +### 1. Choose Installation Method +Based on your requirements and expertise, select either: +- [AWS AMI](aws-ami.md) for simple, single-instance deployments +- [Kubernetes Helm](k8s-helm.md) for scalable, production deployments + +### 2. Plan Your Infrastructure +- **Networking**: VPC, subnets, security groups, load balancers +- **Storage**: Database sizing, object storage configuration +- **Security**: IAM roles, security groups, access policies +- **Monitoring**: Logging, metrics, alerting setup + +### 3. Prepare Configuration +- **Environment Variables**: Database URLs, storage credentials +- **SSL Certificates**: Certificate files and private keys +- **Authentication**: SSO configuration, user management +- **Feature Flags**: Enable/disable specific functionality + +## Installation Process Overview + +1. **Environment Setup**: Prepare your infrastructure and dependencies +2. **Installation**: Deploy DataChain Studio using your chosen method +3. **Initial Configuration**: Configure basic settings and authentication +4. **Verification**: Test the installation and verify functionality +5. 
**Post-installation**: Set up monitoring, backups, and maintenance
+
+## Getting Support
+
+For installation support:
+
+- **Documentation**: Follow the detailed guides for your chosen method
+- **Enterprise Support**: Contact our support team for assistance
+- **Community**: Join our community for peer support and discussions
+
+## Next Steps
+
+Choose your installation method:
+
+- **[AWS AMI Installation](aws-ami.md)** - Quick single-instance deployment
+- **[Kubernetes Installation](k8s-helm.md)** - Scalable production deployment
+
+After installation, proceed to:
+
+- **[Configuration](../configuration/index.md)** - Configure Studio for your environment
+- **[Upgrading](../upgrading/index.md)** - Learn about the upgrade process
+- **[Troubleshooting](../troubleshooting/index.md)** - Resolve common issues
diff --git a/docs/studio/self-hosting/installation/k8s-helm.md b/docs/studio/self-hosting/installation/k8s-helm.md
new file mode 100644
index 000000000..7cf602de1
--- /dev/null
+++ b/docs/studio/self-hosting/installation/k8s-helm.md
@@ -0,0 +1,422 @@
+# Kubernetes (Helm) Installation
+
+This guide covers installing DataChain Studio on Kubernetes using Helm charts.
+
+## Prerequisites
+
+### Kubernetes Cluster
+
+- **Kubernetes version**: 1.19+
+- **Node requirements**:
+  - Minimum: 2 nodes with 8GB RAM, 4 vCPUs each
+  - Recommended: 3+ nodes with 16GB RAM, 8 vCPUs each
+- **Storage**: 100GB persistent storage
+- **Networking**: Cluster networking with ingress controller
+
+### Required Tools
+
+- `kubectl` configured to access your cluster
+- `helm` 3.0+
+- Access to DataChain Studio container images
+
+### Access Requirements
+
+- Container registry access (provided by the DataChain team)
+- Valid DNS domain for DataChain Studio
+- SSL certificates for HTTPS
+
+## Installation Steps
+
+### 1. Add DataChain Helm Repository
+
+```bash
+helm repo add datachain https://charts.datachain.ai
+helm repo update
+```
+
+### 2. Create Namespace
+
+```bash
+kubectl create namespace datachain-studio
+```
+
+### 3. Configure Container Registry Access
+
+Create a secret for accessing DataChain Studio container images:
+
+```bash
+kubectl create secret docker-registry datachain-registry \
+  --namespace datachain-studio \
+  --docker-server=registry.datachain.ai \
+  --docker-username=<your-username> \
+  --docker-password=<your-password>
+```
+
+### 4. Configure SSL Certificates
+
+Create a TLS secret for your domain:
+
+```bash
+kubectl create secret tls studio-tls \
+  --namespace datachain-studio \
+  --cert=path/to/tls.crt \
+  --key=path/to/tls.key
+```
+
+### 5. 
Create Configuration File + +Create a `values.yaml` file with your configuration: + +```yaml +# Basic configuration +global: + domain: studio.yourcompany.com + storageClass: gp2 # or your preferred storage class + +# Image pull secrets +imagePullSecrets: + - name: datachain-registry + +# SSL/TLS configuration +ingress: + enabled: true + className: nginx # or your ingress class + tls: + enabled: true + secretName: studio-tls + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "true" + nginx.ingress.kubernetes.io/force-ssl-redirect: "true" + +# Database configuration +postgresql: + enabled: true + auth: + postgresPassword: "secure-postgres-password" + database: "datachain_studio" + primary: + persistence: + enabled: true + size: 50Gi + storageClass: gp2 + +# Redis configuration +redis: + enabled: true + auth: + enabled: true + password: "secure-redis-password" + +# Storage configuration +storage: + type: s3 + s3: + bucket: your-studio-bucket + region: us-east-1 + accessKey: your-access-key + secretKey: your-secret-key + +# Git integrations +git: + github: + enabled: true + appId: "your-github-app-id" + privateKey: | + -----BEGIN RSA PRIVATE KEY----- + your-github-private-key-content + -----END RSA PRIVATE KEY----- + + gitlab: + enabled: true + url: "https://gitlab.com" + clientId: "your-gitlab-client-id" + clientSecret: "your-gitlab-client-secret" + +# Resource limits +resources: + frontend: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + + backend: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" + + worker: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" + +# Autoscaling +autoscaling: + enabled: true + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + +### 6. Install DataChain Studio + +```bash +helm install datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml \ + --wait --timeout=10m +``` + +### 7. Verify Installation + +Check pod status: +```bash +kubectl get pods -n datachain-studio +``` + +Check services: +```bash +kubectl get services -n datachain-studio +``` + +Check ingress: +```bash +kubectl get ingress -n datachain-studio +``` + +## Configuration Options + +### Database Options + +#### External PostgreSQL +```yaml +postgresql: + enabled: false + +externalDatabase: + type: postgresql + host: your-postgres-host + port: 5432 + database: datachain_studio + username: studio_user + password: your-password +``` + +#### External Redis +```yaml +redis: + enabled: false + +externalRedis: + host: your-redis-host + port: 6379 + password: your-redis-password +``` + +### Storage Options + +#### AWS S3 +```yaml +storage: + type: s3 + s3: + bucket: your-bucket + region: us-east-1 + accessKey: your-access-key + secretKey: your-secret-key +``` + +#### Google Cloud Storage +```yaml +storage: + type: gcs + gcs: + bucket: your-bucket + projectId: your-project-id + keyFile: | + { + "type": "service_account", + "project_id": "your-project-id", + ... 
+ }
+```
+
+#### Azure Blob Storage
+```yaml
+storage:
+  type: azure
+  azure:
+    accountName: your-account-name
+    accountKey: your-account-key
+    containerName: your-container
+```
+
+### High Availability Configuration
+
+```yaml
+# Multiple replicas
+replicaCount:
+  frontend: 3
+  backend: 3
+  worker: 2
+
+# Pod disruption budgets
+podDisruptionBudget:
+  enabled: true
+  minAvailable: 1
+
+# Node affinity
+affinity:
+  podAntiAffinity:
+    preferredDuringSchedulingIgnoredDuringExecution:
+    - weight: 100
+      podAffinityTerm:
+        labelSelector:
+          matchExpressions:
+          - key: app.kubernetes.io/name
+            operator: In
+            values:
+            - datachain-studio
+        topologyKey: kubernetes.io/hostname
+```
+
+## Upgrading
+
+### Check Current Version
+```bash
+helm list -n datachain-studio
+```
+
+### Upgrade to Latest Version
+```bash
+helm repo update
+helm upgrade datachain-studio datachain/studio \
+  --namespace datachain-studio \
+  --values values.yaml \
+  --wait
+```
+
+### Rollback if Needed
+```bash
+# Without a revision number, helm rolls back to the previous release
+helm rollback datachain-studio -n datachain-studio
+```
+
+## Monitoring and Logging
+
+### Enable Monitoring
+```yaml
+monitoring:
+  enabled: true
+  serviceMonitor:
+    enabled: true
+
+  prometheus:
+    enabled: true
+
+  grafana:
+    enabled: true
+    adminPassword: your-grafana-password
+```
+
+### Log Configuration
+```yaml
+logging:
+  level: INFO
+  format: json
+
+  # External log aggregation
+  fluentd:
+    enabled: true
+    host: your-log-aggregator
+    port: 24224
+```
+
+## Security Considerations
+
+### Network Policies
+```yaml
+networkPolicy:
+  enabled: true
+  ingress:
+    - from:
+      - namespaceSelector:
+          matchLabels:
+            name: ingress-nginx
+  egress:
+    - to:
+      - namespaceSelector: {}
+```
+
+### Security Context
+```yaml
+securityContext:
+  runAsNonRoot: true
+  runAsUser: 1000
+  fsGroup: 1000
+  capabilities:
+    drop:
+      - ALL
+```
+
+### Pod Security Standards
+```yaml
+podSecurityContext:
+  seccompProfile:
+    type: RuntimeDefault
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**Pods stuck in Pending:**
+```bash
+kubectl describe pod <pod-name> -n datachain-studio
+```
+
+**Database connection issues:**
+```bash
+kubectl logs <backend-pod-name> -n datachain-studio
+```
+
+**SSL certificate problems:**
+```bash
+kubectl describe ingress <ingress-name> -n datachain-studio
+```
+
+### Debug Commands
+
+```bash
+# Check all resources
+kubectl get all -n datachain-studio
+
+# Check events
+kubectl get events -n datachain-studio --sort-by='.lastTimestamp'
+
+# Check logs
+kubectl logs -f deployment/datachain-studio-backend -n datachain-studio
+
+# Port forward for local access
+kubectl port-forward service/datachain-studio-frontend 8080:80 -n datachain-studio
+```
+
+## Next Steps
+
+- [Configure additional settings](../configuration/index.md)
+- [Set up monitoring and alerting](../configuration/index.md#monitoring)
+- [Learn about backup procedures](../upgrading/index.md#backup)
+- [Review security hardening](../configuration/index.md#security)
diff --git a/docs/studio/self-hosting/troubleshooting/502-errors.md b/docs/studio/self-hosting/troubleshooting/502-errors.md
new file mode 100644
index 000000000..376f6e2ae
--- /dev/null
+++ b/docs/studio/self-hosting/troubleshooting/502-errors.md
@@ -0,0 +1,499 @@
+# 502 Bad Gateway Errors
+
+HTTP 502 Bad Gateway errors when accessing DataChain Studio indicate that the web server cannot reach the backend application services. This guide covers diagnosing and resolving these issues.
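+
+A quick way to tell whether the 502 originates at the proxy layer or somewhere behind it is to look at the raw response. A minimal sketch (assuming the example domain used throughout these docs):
+
+```bash
+# -v prints the status line and response headers; a 502 with a Server: nginx
+# header usually means the ingress/proxy is up but the backend is not answering
+curl -sv -o /dev/null https://studio.yourcompany.com/
+```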
+ +## Overview + +502 Bad Gateway errors occur when: +- Backend services are not running or accessible +- Network connectivity issues between components +- Resource constraints preventing service startup +- Configuration problems with load balancers or ingress + +## Initial Diagnosis + +### Check Service Status + +#### Kubernetes Deployments + +```bash +# Check pod status +kubectl get pods -n datachain-studio + +# Check service status +kubectl get services -n datachain-studio + +# Check ingress status +kubectl get ingress -n datachain-studio + +# Look for events +kubectl get events -n datachain-studio --sort-by='.lastTimestamp' +``` + +#### AMI Deployments + +```bash +# SSH to the instance first +ssh -i your-key.pem ubuntu@your-instance-ip + +# Check system service status +sudo systemctl status datachain-studio + +# Check container status +sudo docker ps -a + +# Check logs +sudo journalctl -u datachain-studio -f +``` + +### Identify the Problem + +Common pod statuses indicating issues: + +- `ImagePullBackOff` / `ErrImagePull` - Container image issues +- `CrashLoopBackOff` - Application startup failures +- `Pending` - Resource or scheduling issues +- `CreateContainerConfigError` - Configuration problems + +## Container Image Issues + +### Image Pull Problems + +If pods show `ImagePullBackOff` or `ErrImagePull`: + +#### For Cloud Deployments + +```bash +# Check image pull secrets +kubectl get secrets -n datachain-studio | grep registry + +# Recreate registry secret if needed +kubectl delete secret datachain-registry -n datachain-studio + +kubectl create secret docker-registry datachain-registry \ + --namespace datachain-studio \ + --docker-server=registry.datachain.ai \ + --docker-username=your-username \ + --docker-password=your-password + +# Restart deployments +kubectl rollout restart deployment/datachain-studio-backend -n datachain-studio +kubectl rollout restart deployment/datachain-studio-frontend -n datachain-studio +kubectl rollout restart deployment/datachain-studio-worker -n datachain-studio +``` + +#### For Air-gapped Deployments + +```bash +# Check if images exist in internal registry +kubectl describe pod POD_NAME -n datachain-studio | grep -i image + +# Verify internal registry connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -I https://registry.internal.company.com + +# Re-tag and push images if needed +docker tag datachain/studio-backend:VERSION registry.internal.company.com/datachain/studio-backend:VERSION +docker push registry.internal.company.com/datachain/studio-backend:VERSION +``` + +### Image Version Mismatches + +```bash +# Check configured image versions +kubectl get deployment -n datachain-studio -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' + +# Update to correct versions if needed +kubectl set image deployment/datachain-studio-backend \ + datachain-studio-backend=registry.datachain.ai/studio-backend:CORRECT_VERSION \ + -n datachain-studio +``` + +## Application Startup Issues + +### Configuration Problems + +#### Check Configuration + +```bash +# Review configuration +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml + +# Check for missing environment variables +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- env | grep -i studio + +# Validate secrets +kubectl get secrets -n datachain-studio +kubectl describe secret datachain-studio-secrets -n datachain-studio +``` + +#### Database Connection Issues + 
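+If the backend cannot reach PostgreSQL it typically crash-loops at startup. Before the application-level check below, you can rule out basic TCP reachability from inside the pod; a minimal sketch (the host and port are placeholders for your database endpoint):
+
+```bash
+# Plain socket test: verifies the network path and port, but not credentials
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
+  python -c "import socket; socket.create_connection(('postgres-host', 5432), timeout=5); print('TCP OK')"
+```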
+```bash +# Test database connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python -c " +import os +import psycopg2 +try: + conn = psycopg2.connect(os.environ['DATABASE_URL']) + print('Database connection: OK') +except Exception as e: + print(f'Database connection failed: {e}') +" + +# Check database pod status +kubectl get pods -l app=postgres -n datachain-studio + +# Check database logs +kubectl logs -f deployment/datachain-studio-postgres -n datachain-studio +``` + +#### Redis Connection Issues + +```bash +# Test Redis connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python -c " +import redis +import os +try: + r = redis.from_url(os.environ.get('REDIS_URL', 'redis://localhost:6379')) + r.ping() + print('Redis connection: OK') +except Exception as e: + print(f'Redis connection failed: {e}') +" + +# Check Redis pod status +kubectl get pods -l app=redis -n datachain-studio + +# Check Redis logs +kubectl logs -f deployment/datachain-studio-redis -n datachain-studio +``` + +### Resource Constraints + +#### Check Resource Usage + +```bash +# Check node resources +kubectl describe nodes | grep -A 5 "Allocated resources" + +# Check pod resource requests and limits +kubectl describe pod POD_NAME -n datachain-studio | grep -A 10 -i resources + +# Check actual resource usage +kubectl top nodes +kubectl top pods -n datachain-studio +``` + +#### Resolve Resource Issues + +```bash +# Scale down other workloads temporarily +kubectl scale deployment other-deployment --replicas=0 -n other-namespace + +# Increase resource limits in Helm values +# values.yaml +resources: + backend: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" + +# Apply changes +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml +``` + +## Network Connectivity Issues + +### Service Discovery Problems + +```bash +# Check service endpoints +kubectl get endpoints -n datachain-studio + +# Test internal service connectivity +kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \ + curl -I http://datachain-studio-backend:8000/health + +# Check DNS resolution +kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \ + nslookup datachain-studio-backend.datachain-studio.svc.cluster.local +``` + +### Ingress Configuration Issues + +```bash +# Check ingress configuration +kubectl describe ingress datachain-studio-ingress -n datachain-studio + +# Check ingress controller logs +kubectl logs -f deployment/nginx-ingress-controller -n ingress-nginx + +# Test ingress rules +curl -H "Host: studio.yourcompany.com" http://INGRESS_IP/health +``` + +### Load Balancer Issues + +```bash +# Check load balancer status +kubectl get service datachain-studio-lb -n datachain-studio + +# Check load balancer endpoints +kubectl describe service datachain-studio-lb -n datachain-studio + +# Test load balancer connectivity +curl -I http://LOAD_BALANCER_IP:80/health +``` + +## SSL/TLS Related Issues + +### Certificate Problems + +```bash +# Check TLS secret +kubectl describe secret datachain-studio-tls -n datachain-studio + +# Verify certificate validity +kubectl get secret datachain-studio-tls -n datachain-studio -o jsonpath='{.data.tls\.crt}' | \ + base64 -d | openssl x509 -dates -noout + +# Test SSL connectivity +openssl s_client -connect studio.yourcompany.com:443 -servername studio.yourcompany.com +``` + +### SSL Termination Issues + +```bash +# 
Check if SSL is terminated at ingress +kubectl describe ingress datachain-studio-ingress -n datachain-studio | grep -i tls + +# Test without SSL (if applicable) +curl -I http://studio.yourcompany.com/health + +# Check SSL redirect configuration +curl -I -L http://studio.yourcompany.com/health +``` + +## Advanced Troubleshooting + +### Deep Dive Debugging + +#### Application Logs Analysis + +```bash +# Get detailed application logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio --previous + +# Search for specific error patterns +kubectl logs deployment/datachain-studio-backend -n datachain-studio | grep -i error +kubectl logs deployment/datachain-studio-backend -n datachain-studio | grep -i "502\|bad gateway" + +# Check application startup sequence +kubectl logs deployment/datachain-studio-backend -n datachain-studio | head -50 +``` + +#### Network Packet Analysis + +```bash +# Capture network traffic (requires privileged access) +kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \ + tcpdump -i any -n port 8000 + +# Test specific network paths +kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \ + traceroute datachain-studio-backend.datachain-studio.svc.cluster.local +``` + +#### Health Check Validation + +```bash +# Test health endpoints directly +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -f http://localhost:8000/health + +# Test with verbose output +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + curl -v http://localhost:8000/health + +# Check health endpoint response time +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + time curl -f http://localhost:8000/health +``` + +## AMI-Specific Troubleshooting + +### Docker Container Issues + +```bash +# Check container status +sudo docker ps -a + +# Check container logs +sudo docker logs datachain-studio-backend +sudo docker logs datachain-studio-frontend + +# Restart containers +sudo docker restart datachain-studio-backend +sudo docker restart datachain-studio-frontend + +# Check container health +sudo docker exec datachain-studio-backend curl -f http://localhost:8000/health +``` + +### System Service Issues + +```bash +# Check systemd service status +sudo systemctl status datachain-studio +sudo systemctl status docker + +# Restart services +sudo systemctl restart datachain-studio +sudo systemctl restart docker + +# Check service logs +sudo journalctl -u datachain-studio -f +sudo journalctl -u docker -f + +# Check service configuration +sudo systemctl cat datachain-studio +``` + +### Nginx Configuration + +```bash +# Check nginx configuration +sudo nginx -t + +# Check nginx logs +sudo tail -f /var/log/nginx/error.log +sudo tail -f /var/log/nginx/access.log + +# Restart nginx +sudo systemctl restart nginx + +# Test nginx upstream +curl -I http://localhost:8000/health # Direct backend test +``` + +## Recovery Procedures + +### Quick Recovery Steps + +1. **Restart all services**: + ```bash + # Kubernetes + kubectl rollout restart deployment -n datachain-studio + + # AMI + sudo systemctl restart datachain-studio + ``` + +2. **Check and fix resource constraints**: + ```bash + kubectl top nodes + kubectl describe nodes | grep -A 5 "Allocated resources" + ``` + +3. **Verify configuration**: + ```bash + kubectl get configmap datachain-studio-config -n datachain-studio -o yaml + ``` + +4. 
**Test connectivity**: + ```bash + curl -f https://studio.yourcompany.com/health + ``` + +### Full Recovery Process + +1. **Stop all services** +2. **Check system resources and fix constraints** +3. **Verify configuration files** +4. **Check network connectivity** +5. **Start services in order**: Database → Redis → Backend → Frontend → Worker +6. **Validate each component before starting the next** +7. **Test full application functionality** + +## Prevention + +### Monitoring and Alerting + +Set up monitoring to catch 502 errors early: + +```yaml +# Prometheus alert example +- alert: High502ErrorRate + expr: rate(nginx_ingress_controller_requests{status="502"}[5m]) > 0.1 + for: 2m + labels: + severity: critical + annotations: + summary: "High rate of 502 errors" + description: "502 error rate is {{ $value }} per second" + +- alert: BackendServiceDown + expr: up{job="datachain-studio-backend"} == 0 + for: 1m + labels: + severity: critical + annotations: + summary: "Backend service is down" +``` + +### Health Checks + +Implement comprehensive health checks: + +```yaml +# Kubernetes readiness probe +readinessProbe: + httpGet: + path: /health/ready + port: 8000 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + +# Kubernetes liveness probe +livenessProbe: + httpGet: + path: /health/live + port: 8000 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 +``` + +### Regular Maintenance + +1. **Monitor resource usage trends** +2. **Review logs regularly for warnings** +3. **Keep services updated** +4. **Test failover procedures** +5. **Document configuration changes** + +## Next Steps + +If 502 errors persist after trying these solutions: + +1. **Generate a [support bundle](support-bundle.md)** with diagnostic information +2. **Review recent changes** to configuration or infrastructure +3. **Check the [main troubleshooting guide](index.md)** for other common issues +4. **Contact support** with detailed error information and logs + +For other issues: +- [Configuration problems](../configuration/index.md) +- [Installation issues](../installation/index.md) +- [Upgrade problems](../upgrading/index.md) diff --git a/docs/studio/self-hosting/troubleshooting/index.md b/docs/studio/self-hosting/troubleshooting/index.md new file mode 100644 index 000000000..c3e38a3e7 --- /dev/null +++ b/docs/studio/self-hosting/troubleshooting/index.md @@ -0,0 +1,410 @@ +# Troubleshooting + +This section provides solutions to common issues encountered when running self-hosted DataChain Studio. 
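+
+When it is not yet clear which component is failing, tailing logs from every Studio pod at once can speed up triage. A minimal sketch (assuming a Kubernetes deployment and that your release labels pods with `app.kubernetes.io/instance=datachain-studio`; check `kubectl get pods --show-labels` if unsure):
+
+```bash
+# Stream recent logs from all pods in the release, prefixed with the pod name
+kubectl logs -n datachain-studio \
+  -l app.kubernetes.io/instance=datachain-studio \
+  --all-containers --prefix --tail=50 -f
+```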
+ +## Common Issues + +- **[502 Errors](502-errors.md)** - Troubleshoot HTTP 502 Bad Gateway errors +- **[Support Bundle](support-bundle.md)** - Generate diagnostic information for support + +## General Troubleshooting + +### System Health Checks + +#### Check Service Status + +```bash +# Kubernetes deployments +kubectl get pods -n datachain-studio +kubectl get services -n datachain-studio +kubectl get ingress -n datachain-studio + +# AMI deployments +sudo systemctl status datachain-studio +sudo docker ps +``` + +#### Check Resource Usage + +```bash +# Kubernetes +kubectl top nodes +kubectl top pods -n datachain-studio + +# AMI +htop +df -h +free -h +``` + +#### Check Logs + +```bash +# Kubernetes - Application logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio +kubectl logs -f deployment/datachain-studio-frontend -n datachain-studio +kubectl logs -f deployment/datachain-studio-worker -n datachain-studio + +# AMI - System logs +sudo journalctl -u datachain-studio -f +sudo docker logs datachain-studio-backend +``` + +### Network Connectivity Issues + +#### Test External Connectivity + +```bash +# Test internet connectivity (if not air-gapped) +curl -I https://api.github.com +curl -I https://gitlab.com + +# Test internal DNS resolution +nslookup studio.yourcompany.com +dig studio.yourcompany.com + +# Test port connectivity +telnet studio.yourcompany.com 443 +nc -zv studio.yourcompany.com 443 +``` + +#### Test Service Connectivity + +```bash +# Test database connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python -c "import psycopg2; conn=psycopg2.connect('postgresql://user:pass@host:port/db'); print('DB OK')" + +# Test Redis connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + redis-cli -h redis-host -p 6379 ping + +# Test Git forge connectivity +curl -f https://studio.yourcompany.com/api/git/github/test +``` + +### Authentication Issues + +#### OAuth Problems + +```bash +# Check OAuth configuration +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml | grep -A 10 oauth + +# Test OAuth endpoints +curl -f https://studio.yourcompany.com/auth/github/login +curl -f https://studio.yourcompany.com/auth/gitlab/login + +# Check OAuth callback URLs +curl -I https://studio.yourcompany.com/auth/github/callback +``` + +#### SSL/TLS Certificate Issues + +```bash +# Check certificate validity +openssl s_client -connect studio.yourcompany.com:443 -servername studio.yourcompany.com + +# Check certificate expiration +echo | openssl s_client -connect studio.yourcompany.com:443 2>/dev/null | openssl x509 -dates -noout + +# Check certificate chain +openssl s_client -connect studio.yourcompany.com:443 -showcerts +``` + +### Database Issues + +#### Connection Problems + +```bash +# Test database connection +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT version();" + +# Check database logs +kubectl logs -f deployment/datachain-studio-postgres -n datachain-studio + +# Check connection pooling +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT * FROM pg_stat_activity;" +``` + +#### Performance Issues + +```bash +# Check database performance +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT * FROM pg_stat_database WHERE datname='datachain_studio';" + +# Check slow queries +kubectl exec -it 
deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;" + +# Check database size +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT pg_size_pretty(pg_database_size('datachain_studio'));" +``` + +### Storage Issues + +#### Cloud Storage Connectivity + +```bash +# Test S3 connectivity +aws s3 ls s3://your-studio-bucket/ --region your-region + +# Test GCS connectivity +gsutil ls gs://your-studio-bucket/ + +# Test Azure Blob connectivity +az storage blob list --container-name your-container --account-name your-account +``` + +#### Persistent Volume Issues + +```bash +# Check PV status +kubectl get pv,pvc -n datachain-studio + +# Check storage class +kubectl get storageclass + +# Check volume mount issues +kubectl describe pod POD_NAME -n datachain-studio | grep -A 10 -i volume +``` + +### Performance Troubleshooting + +#### High CPU Usage + +```bash +# Check CPU usage by pod +kubectl top pods -n datachain-studio --sort-by=cpu + +# Check CPU usage inside container +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- top + +# Profile application performance +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python -m cProfile -o profile.stats your_script.py +``` + +#### High Memory Usage + +```bash +# Check memory usage by pod +kubectl top pods -n datachain-studio --sort-by=memory + +# Check memory usage inside container +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- free -h + +# Check for memory leaks +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + ps aux --sort=-%mem | head -10 +``` + +#### Slow Response Times + +```bash +# Test response times +time curl -f https://studio.yourcompany.com/health +time curl -f https://studio.yourcompany.com/api/datasets + +# Check application metrics +curl https://studio.yourcompany.com/metrics + +# Check database query performance +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "EXPLAIN ANALYZE SELECT * FROM datasets LIMIT 10;" +``` + +### Configuration Issues + +#### Invalid Configuration + +```bash +# Validate Helm configuration +helm template datachain-studio ./chart --values values.yaml --dry-run + +# Check for configuration errors +kubectl describe pod POD_NAME -n datachain-studio | grep -i error + +# Validate YAML syntax +yamllint values.yaml +``` + +#### Missing Environment Variables + +```bash +# Check environment variables +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- env | grep -i studio + +# Check ConfigMap +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml + +# Check Secrets +kubectl get secrets -n datachain-studio +kubectl describe secret SECRET_NAME -n datachain-studio +``` + +## Diagnostic Commands + +### Comprehensive Health Check + +```bash +#!/bin/bash +# health-check.sh + +echo "=== DataChain Studio Health Check ===" + +echo "1. Pod Status:" +kubectl get pods -n datachain-studio + +echo "2. Service Status:" +kubectl get services -n datachain-studio + +echo "3. Ingress Status:" +kubectl get ingress -n datachain-studio + +echo "4. Resource Usage:" +kubectl top pods -n datachain-studio + +echo "5. Recent Events:" +kubectl get events -n datachain-studio --sort-by='.lastTimestamp' | tail -10 + +echo "6. 
Application Health:" +curl -s -o /dev/null -w "%{http_code}" https://studio.yourcompany.com/health + +echo "7. Database Connectivity:" +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python -c "from django.db import connection; connection.ensure_connection(); print('DB OK')" 2>/dev/null || echo "DB ERROR" +``` + +### Log Collection + +```bash +#!/bin/bash +# collect-logs.sh + +TIMESTAMP=$(date +%Y%m%d-%H%M%S) +LOG_DIR="datachain-studio-logs-${TIMESTAMP}" + +mkdir -p ${LOG_DIR} + +echo "Collecting DataChain Studio logs..." + +# Pod logs +kubectl logs deployment/datachain-studio-frontend -n datachain-studio > ${LOG_DIR}/frontend.log +kubectl logs deployment/datachain-studio-backend -n datachain-studio > ${LOG_DIR}/backend.log +kubectl logs deployment/datachain-studio-worker -n datachain-studio > ${LOG_DIR}/worker.log + +# System information +kubectl get pods -n datachain-studio -o wide > ${LOG_DIR}/pods.txt +kubectl describe pods -n datachain-studio > ${LOG_DIR}/pod-descriptions.txt +kubectl get events -n datachain-studio --sort-by='.lastTimestamp' > ${LOG_DIR}/events.txt + +# Configuration +helm get values datachain-studio -n datachain-studio > ${LOG_DIR}/helm-values.yaml +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml > ${LOG_DIR}/configmap.yaml + +tar -czf ${LOG_DIR}.tar.gz ${LOG_DIR} +echo "Logs collected in ${LOG_DIR}.tar.gz" +``` + +## Getting Help + +### Self-Help Resources + +1. **Check Release Notes**: Review release notes for known issues +2. **Search Documentation**: Look for similar issues in documentation +3. **Community Forums**: Search community forums and discussions +4. **GitHub Issues**: Check the DataChain GitHub repository for similar issues + +### Contacting Support + +When contacting support, include: + +1. **System Information**: + - DataChain Studio version + - Kubernetes version (if applicable) + - Operating system and version + - Hardware/resource specifications + +2. **Problem Description**: + - Detailed description of the issue + - Steps to reproduce the problem + - Expected vs actual behavior + - Screenshots if applicable + +3. **Diagnostic Information**: + - Relevant log excerpts + - Configuration files (sanitized) + - Error messages + - System resource usage + +4. **Environment Details**: + - Network configuration + - Security settings + - External integrations + - Recent changes + +### Support Bundle + +Generate a comprehensive support bundle using our [support bundle tool](support-bundle.md). + +## Prevention + +### Monitoring and Alerting + +Set up monitoring to catch issues early: + +```yaml +# Example monitoring configuration +monitoring: + enabled: true + + alerts: + - name: "High Error Rate" + condition: "error_rate > 5%" + duration: "5m" + severity: "warning" + + - name: "Service Down" + condition: "up == 0" + duration: "1m" + severity: "critical" + + - name: "High Memory Usage" + condition: "memory_usage > 80%" + duration: "10m" + severity: "warning" +``` + +### Regular Maintenance + +1. **Update regularly**: Keep DataChain Studio and dependencies updated +2. **Monitor resources**: Watch CPU, memory, and storage usage trends +3. **Review logs**: Regularly review logs for warnings and errors +4. **Test backups**: Regularly test backup and restore procedures +5. **Security updates**: Apply security updates promptly + +### Best Practices + +1. **Use staging environment**: Test changes in staging before production +2. 
**Document configuration**: Keep configuration documented and version controlled +3. **Monitor performance**: Set up comprehensive monitoring and alerting +4. **Plan for scale**: Monitor usage trends and plan for capacity needs +5. **Security hygiene**: Regularly review and update security configurations + +## Next Steps + +For specific issues: + +- **[502 Errors](502-errors.md)** - Detailed troubleshooting for HTTP 502 errors +- **[Support Bundle](support-bundle.md)** - Generate diagnostic information + +For other topics: + +- [Configuration Guide](../configuration/index.md) for configuration issues +- [Upgrading Guide](../upgrading/index.md) for upgrade-related problems +- [Installation Guide](../installation/index.md) for installation issues diff --git a/docs/studio/self-hosting/troubleshooting/support-bundle.md b/docs/studio/self-hosting/troubleshooting/support-bundle.md new file mode 100644 index 000000000..b95792523 --- /dev/null +++ b/docs/studio/self-hosting/troubleshooting/support-bundle.md @@ -0,0 +1,459 @@ +# Support Bundle Generation + +When experiencing issues with your self-hosted DataChain Studio instance, a support bundle provides comprehensive diagnostic information to help identify and resolve problems quickly. + +## Overview + +A support bundle collects: +- System configuration and status +- Application logs and metrics +- Resource usage information +- Network connectivity details +- Database and storage status +- Error messages and stack traces + +## Automated Support Bundle + +### Kubernetes Deployment + +For Kubernetes deployments, use the automated support bundle script: + +```bash +#!/bin/bash +# generate-support-bundle.sh + +TIMESTAMP=$(date +%Y%m%d-%H%M%S) +BUNDLE_DIR="datachain-studio-support-${TIMESTAMP}" +NAMESPACE="datachain-studio" + +echo "Generating DataChain Studio support bundle..." +mkdir -p ${BUNDLE_DIR} + +# System Information +echo "Collecting system information..." +kubectl version --client > ${BUNDLE_DIR}/kubectl-version.txt +helm version > ${BUNDLE_DIR}/helm-version.txt +kubectl get nodes -o wide > ${BUNDLE_DIR}/nodes.txt +kubectl describe nodes > ${BUNDLE_DIR}/nodes-detailed.txt + +# Cluster Resources +echo "Collecting cluster resources..." +kubectl get all -n ${NAMESPACE} -o wide > ${BUNDLE_DIR}/all-resources.txt +kubectl describe pods -n ${NAMESPACE} > ${BUNDLE_DIR}/pods-detailed.txt +kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' > ${BUNDLE_DIR}/events.txt + +# Configuration +echo "Collecting configuration..." +helm get values datachain-studio -n ${NAMESPACE} > ${BUNDLE_DIR}/helm-values.yaml +kubectl get configmap datachain-studio-config -n ${NAMESPACE} -o yaml > ${BUNDLE_DIR}/configmap.yaml +kubectl get secrets -n ${NAMESPACE} -o name > ${BUNDLE_DIR}/secrets-list.txt + +# Logs +echo "Collecting logs..." +for pod in $(kubectl get pods -n ${NAMESPACE} -o name); do + pod_name=$(basename ${pod}) + kubectl logs ${pod} -n ${NAMESPACE} --previous > ${BUNDLE_DIR}/logs-${pod_name}-previous.log 2>/dev/null || true + kubectl logs ${pod} -n ${NAMESPACE} > ${BUNDLE_DIR}/logs-${pod_name}.log 2>/dev/null || true +done + +# Resource Usage +echo "Collecting resource usage..." 
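+# NOTE: kubectl top depends on the metrics-server add-on; when it is not
+# installed, the fallback after each || records a placeholder so the bundle still completes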
+kubectl top nodes > ${BUNDLE_DIR}/resource-usage-nodes.txt 2>/dev/null || echo "Metrics server not available" > ${BUNDLE_DIR}/resource-usage-nodes.txt +kubectl top pods -n ${NAMESPACE} > ${BUNDLE_DIR}/resource-usage-pods.txt 2>/dev/null || echo "Metrics server not available" > ${BUNDLE_DIR}/resource-usage-pods.txt + +# Ingress and Networking +echo "Collecting networking information..." +kubectl get ingress -n ${NAMESPACE} -o yaml > ${BUNDLE_DIR}/ingress.yaml +kubectl get services -n ${NAMESPACE} -o yaml > ${BUNDLE_DIR}/services.yaml +kubectl get endpoints -n ${NAMESPACE} > ${BUNDLE_DIR}/endpoints.txt + +# Storage +echo "Collecting storage information..." +kubectl get pv,pvc -n ${NAMESPACE} -o yaml > ${BUNDLE_DIR}/storage.yaml +kubectl get storageclass -o yaml > ${BUNDLE_DIR}/storage-classes.yaml + +# Health Checks +echo "Running health checks..." +kubectl exec -it deployment/datachain-studio-backend -n ${NAMESPACE} -- curl -s http://localhost:8000/health > ${BUNDLE_DIR}/health-check.json 2>/dev/null || echo "Health check failed" > ${BUNDLE_DIR}/health-check.json + +# Database Status +echo "Collecting database information..." +kubectl exec -it deployment/datachain-studio-postgres -n ${NAMESPACE} -- psql -U studio -c "SELECT version();" > ${BUNDLE_DIR}/database-version.txt 2>/dev/null || echo "Database not accessible" > ${BUNDLE_DIR}/database-version.txt +kubectl exec -it deployment/datachain-studio-postgres -n ${NAMESPACE} -- psql -U studio -c "SELECT * FROM pg_stat_database WHERE datname='datachain_studio';" > ${BUNDLE_DIR}/database-stats.txt 2>/dev/null || true + +# Package the bundle +echo "Creating support bundle archive..." +tar -czf ${BUNDLE_DIR}.tar.gz ${BUNDLE_DIR} +rm -rf ${BUNDLE_DIR} + +echo "Support bundle created: ${BUNDLE_DIR}.tar.gz" +echo "Please provide this file when contacting support." +``` + +### AMI Deployment + +For AMI deployments, use this script: + +```bash +#!/bin/bash +# generate-ami-support-bundle.sh + +TIMESTAMP=$(date +%Y%m%d-%H%M%S) +BUNDLE_DIR="datachain-studio-ami-support-${TIMESTAMP}" + +echo "Generating DataChain Studio AMI support bundle..." +mkdir -p ${BUNDLE_DIR} + +# System Information +echo "Collecting system information..." +uname -a > ${BUNDLE_DIR}/system-info.txt +lsb_release -a > ${BUNDLE_DIR}/os-version.txt 2>/dev/null || cat /etc/os-release > ${BUNDLE_DIR}/os-version.txt +docker version > ${BUNDLE_DIR}/docker-version.txt +free -h > ${BUNDLE_DIR}/memory-info.txt +df -h > ${BUNDLE_DIR}/disk-info.txt +lscpu > ${BUNDLE_DIR}/cpu-info.txt + +# Service Status +echo "Collecting service status..." +sudo systemctl status datachain-studio > ${BUNDLE_DIR}/service-status.txt +sudo systemctl status docker > ${BUNDLE_DIR}/docker-status.txt +sudo docker ps -a > ${BUNDLE_DIR}/containers.txt +sudo docker images > ${BUNDLE_DIR}/images.txt + +# Configuration +echo "Collecting configuration..." +sudo cp /opt/datachain-studio/config.yml ${BUNDLE_DIR}/config.yml 2>/dev/null || echo "Config file not found" > ${BUNDLE_DIR}/config.yml +sudo systemctl cat datachain-studio > ${BUNDLE_DIR}/systemd-service.txt + +# Logs +echo "Collecting logs..." 
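+# Collect systemd journals and per-container Docker logs; the || true guards
+# on the docker logs commands keep the script going if a container is missing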
+sudo journalctl -u datachain-studio --no-pager > ${BUNDLE_DIR}/service-logs.txt +sudo journalctl -u docker --no-pager > ${BUNDLE_DIR}/docker-logs.txt +sudo docker logs datachain-studio-backend > ${BUNDLE_DIR}/backend-logs.txt 2>/dev/null || true +sudo docker logs datachain-studio-frontend > ${BUNDLE_DIR}/frontend-logs.txt 2>/dev/null || true +sudo docker logs datachain-studio-worker > ${BUNDLE_DIR}/worker-logs.txt 2>/dev/null || true + +# System Logs +tail -1000 /var/log/syslog > ${BUNDLE_DIR}/syslog.txt 2>/dev/null || true +tail -1000 /var/log/messages > ${BUNDLE_DIR}/messages.txt 2>/dev/null || true + +# Network Information +echo "Collecting network information..." +ip addr show > ${BUNDLE_DIR}/network-interfaces.txt +netstat -tulpn > ${BUNDLE_DIR}/network-ports.txt +ss -tulpn > ${BUNDLE_DIR}/socket-stats.txt + +# Health Checks +echo "Running health checks..." +curl -s http://localhost:8000/health > ${BUNDLE_DIR}/health-check.json 2>/dev/null || echo "Health check failed" > ${BUNDLE_DIR}/health-check.json +curl -s -I https://studio.yourcompany.com/health > ${BUNDLE_DIR}/external-health-check.txt 2>/dev/null || echo "External health check failed" > ${BUNDLE_DIR}/external-health-check.txt + +# Database Information (if accessible) +echo "Collecting database information..." +sudo docker exec datachain-studio-postgres psql -U studio -c "SELECT version();" > ${BUNDLE_DIR}/database-version.txt 2>/dev/null || echo "Database not accessible" > ${BUNDLE_DIR}/database-version.txt + +# Package the bundle +echo "Creating support bundle archive..." +tar -czf ${BUNDLE_DIR}.tar.gz ${BUNDLE_DIR} +rm -rf ${BUNDLE_DIR} + +echo "Support bundle created: ${BUNDLE_DIR}.tar.gz" +echo "Please provide this file when contacting support." +``` + +## Manual Support Bundle + +If the automated scripts don't work, collect information manually: + +### System Information + +```bash +# Basic system info +uname -a +cat /etc/os-release +free -h +df -h +lscpu + +# Kubernetes info (if applicable) +kubectl version +kubectl get nodes +kubectl cluster-info +``` + +### Application Status + +```bash +# Kubernetes +kubectl get pods -n datachain-studio +kubectl get services -n datachain-studio +kubectl describe pods -n datachain-studio + +# AMI +sudo systemctl status datachain-studio +sudo docker ps -a +sudo docker images +``` + +### Configuration Files + +```bash +# Kubernetes +helm get values datachain-studio -n datachain-studio +kubectl get configmap datachain-studio-config -n datachain-studio -o yaml + +# AMI +sudo cat /opt/datachain-studio/config.yml +sudo systemctl cat datachain-studio +``` + +### Logs Collection + +```bash +# Kubernetes - Recent logs (last 1000 lines) +kubectl logs --tail=1000 deployment/datachain-studio-backend -n datachain-studio +kubectl logs --tail=1000 deployment/datachain-studio-frontend -n datachain-studio +kubectl logs --tail=1000 deployment/datachain-studio-worker -n datachain-studio + +# AMI - Service logs +sudo journalctl -u datachain-studio --lines=1000 +sudo docker logs --tail=1000 datachain-studio-backend +sudo docker logs --tail=1000 datachain-studio-frontend +``` + +### Network and Connectivity + +```bash +# Network configuration +ip addr show +netstat -tulpn + +# DNS resolution +nslookup studio.yourcompany.com +dig studio.yourcompany.com + +# Connectivity tests +curl -I https://studio.yourcompany.com/health +telnet studio.yourcompany.com 443 +``` + +### Database and Storage + +```bash +# Database status (Kubernetes) +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- 
\
+  psql -U studio -c "SELECT version();"
+
+# Database status (AMI)
+sudo docker exec datachain-studio-postgres \
+  psql -U studio -c "SELECT version();"
+
+# Storage information
+kubectl get pv,pvc -n datachain-studio  # Kubernetes
+df -h  # AMI
+```
+
+## Sensitive Information Handling
+
+### Data Sanitization
+
+Before sharing support bundles, sanitize sensitive information:
+
+```bash
+# Remove sensitive data from configuration files
+sed -i 's/password: .*/password: [REDACTED]/g' config.yml
+sed -i 's/secret: .*/secret: [REDACTED]/g' config.yml
+sed -i 's/token: .*/token: [REDACTED]/g' config.yml
+
+# Remove sensitive environment variables from logs
+sed -i 's/PASSWORD=.*/PASSWORD=[REDACTED]/g' logs.txt
+sed -i 's/SECRET=.*/SECRET=[REDACTED]/g' logs.txt
+sed -i 's/TOKEN=.*/TOKEN=[REDACTED]/g' logs.txt
+```
+
+### What to Redact
+
+Always redact these items:
+- Passwords and passphrases
+- API keys and tokens
+- Database connection strings with passwords
+- OAuth client secrets
+- Private keys
+- Personal identifiable information (PII)
+- Internal IP addresses (if security sensitive)
+- Domain names (if security sensitive)
+
+### What to Keep
+
+Keep these items for troubleshooting:
+- Error messages and stack traces
+- Configuration structure (without sensitive values)
+- Resource usage statistics
+- Network connectivity information
+- Service status and health checks
+- Log entries showing application behavior
+
+## Support Bundle Analysis
+
+### Common Issues Identified
+
+Support bundles help identify:
+
+1. **Resource Constraints**
+   - Out of memory conditions
+   - CPU throttling
+   - Disk space issues
+   - Network bandwidth problems
+
+2. **Configuration Errors**
+   - Invalid YAML syntax
+   - Missing required settings
+   - Incorrect service endpoints
+   - Certificate issues
+
+3. **Connectivity Problems**
+   - DNS resolution failures
+   - Network routing issues
+   - Firewall blocks
+   - SSL/TLS handshake failures
+
+4. **Application Errors**
+   - Database connection failures
+   - Authentication issues
+   - Missing dependencies
+   - Version incompatibilities
+
+### Automated Analysis
+
+Create a script to perform basic analysis:
+
+```bash
+#!/bin/bash
+# analyze-support-bundle.sh
+
+BUNDLE_DIR=$1
+
+if [ -z "$BUNDLE_DIR" ]; then
+    echo "Usage: $0 <bundle-directory>"
+    exit 1
+fi
+
+echo "Analyzing support bundle: $BUNDLE_DIR"
+echo "========================================"
+
+# Check for common error patterns
+echo "Common Errors Found:"
+grep -r -i "error\|failed\|exception" ${BUNDLE_DIR}/logs-* 2>/dev/null | head -10
+
+echo ""
+echo "Resource Usage Issues:"
+if [ -f "${BUNDLE_DIR}/resource-usage-pods.txt" ]; then
+    # kubectl top pods prints NAME, CPU(cores), MEMORY(bytes); values keep their
+    # m/Mi suffixes, so these numeric thresholds are only a rough heuristic
+    awk 'NR>1 && ($2+0 > 80 || $3+0 > 80) {print $1 " - High resource usage: CPU " $2 " Memory " $3}' ${BUNDLE_DIR}/resource-usage-pods.txt
+fi
+
+echo ""
+echo "Pod Status Issues:"
+if [ -f "${BUNDLE_DIR}/all-resources.txt" ]; then
+    grep -E "CrashLoopBackOff|ImagePullBackOff|Error|Failed" ${BUNDLE_DIR}/all-resources.txt
+fi
+
+echo ""
+echo "Recent Events:"
+if [ -f "${BUNDLE_DIR}/events.txt" ]; then
+    tail -10 ${BUNDLE_DIR}/events.txt
+fi
+```
+
+## Sharing Support Bundles
+
+### Secure Transfer
+
+When sharing support bundles:
+
+1. **Encrypt the bundle**:
+   ```bash
+   gpg --symmetric --cipher-algo AES256 datachain-studio-support-bundle.tar.gz
+   ```
+
+2. **Use secure channels**:
+   - Support portal file upload
+   - Encrypted email
+   - Secure file sharing services
+   - Corporate file transfer tools
+
+3. 
**Share decryption key separately**: + - Different communication channel + - Phone call or secure messaging + - Time-limited access + +### Support Portal Upload + +If using a support portal: + +1. Create support ticket with issue description +2. Upload sanitized support bundle +3. Include reproduction steps +4. Specify urgency level +5. Provide contact information + +## Custom Support Bundle + +### Organization-Specific Information + +Add organization-specific diagnostic information: + +```bash +#!/bin/bash +# custom-diagnostics.sh + +# Custom health checks +echo "Running custom health checks..." + +# Check integration with internal services +curl -s -f https://internal-ldap.company.com/health > custom-ldap-check.txt || echo "LDAP check failed" > custom-ldap-check.txt + +# Check custom storage mounts +df -h /mnt/company-storage > custom-storage-check.txt 2>/dev/null || echo "Custom storage not mounted" > custom-storage-check.txt + +# Check VPN connectivity +ping -c 3 internal-gateway.company.com > custom-network-check.txt 2>&1 + +# Check custom certificates +openssl s_client -connect internal-ca.company.com:443 -servername internal-ca.company.com < /dev/null 2>&1 | openssl x509 -dates -noout > custom-cert-check.txt 2>/dev/null || echo "Custom CA check failed" > custom-cert-check.txt +``` + +### Environment-Specific Checks + +```bash +# Check air-gapped environment specifics +if [ -f "/etc/airgap-marker" ]; then + echo "Air-gapped environment detected" + + # Check internal registry connectivity + curl -I https://registry.internal.company.com > internal-registry-check.txt 2>&1 + + # Check internal DNS + nslookup studio.internal.company.com > internal-dns-check.txt 2>&1 + + # Check offline documentation + ls -la /opt/datachain-docs/ > offline-docs-check.txt 2>&1 +fi +``` + +## Next Steps + +After generating a support bundle: + +1. **Review the bundle** for sensitive information +2. **Sanitize** any confidential data +3. **Compress and encrypt** if required +4. **Upload to support portal** or send via secure channel +5. **Include detailed problem description** with the bundle +6. **Provide steps to reproduce** the issue +7. **Specify urgency level** and business impact + +For immediate assistance while waiting for support: +- Check the [main troubleshooting guide](index.md) +- Review [502 error troubleshooting](502-errors.md) +- Consult [configuration documentation](../configuration/index.md) +- Search community forums for similar issues diff --git a/docs/studio/self-hosting/upgrading/airgap-procedure.md b/docs/studio/self-hosting/upgrading/airgap-procedure.md new file mode 100644 index 000000000..af347f171 --- /dev/null +++ b/docs/studio/self-hosting/upgrading/airgap-procedure.md @@ -0,0 +1,501 @@ +# Airgap Upgrade Procedure + +This guide covers the upgrade procedure for DataChain Studio deployments in air-gapped environments without direct internet access. + +## Overview + +Air-gapped upgrades require manually transferring container images and Helm charts to your isolated environment before performing the upgrade. + +## Prerequisites + +- Administrative access to your air-gapped Kubernetes cluster +- Access to a connected system for downloading images and charts +- Container registry in your air-gapped environment +- Backup of current configuration and data +- Scheduled maintenance window + +## Pre-upgrade Preparation + +### 1. 
Review Release Notes
+
+On a connected system, review the release notes for:
+- Breaking changes that may affect your deployment
+- New configuration options
+- Deprecated features
+- Security updates
+
+### 2. Create Backups
+
+Follow the same backup procedures as the [regular upgrade](regular-procedure.md#pre-upgrade-preparation):
+
+```bash
+# Database backup
+kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \
+  pg_dump -U studio datachain_studio > backup-$(date +%Y%m%d-%H%M%S).sql
+
+# Configuration backup
+helm get values datachain-studio -n datachain-studio > values-backup-$(date +%Y%m%d).yaml
+
+# Storage backup
+kubectl get pv,pvc -n datachain-studio -o yaml > pv-backup-$(date +%Y%m%d).yaml
+```
+
+## Download Assets (Connected System)
+
+### 1. Download Container Images
+
+On a system with internet access, download the required container images:
+
+```bash
+# Set target version
+TARGET_VERSION="1.2.3"
+
+# Download DataChain Studio images
+docker pull datachain/studio-frontend:${TARGET_VERSION}
+docker pull datachain/studio-backend:${TARGET_VERSION}
+docker pull datachain/studio-worker:${TARGET_VERSION}
+docker pull datachain/studio-scheduler:${TARGET_VERSION}
+
+# Download dependency images (if updated)
+docker pull postgres:14
+docker pull redis:7
+docker pull nginx:1.24
+
+# Export images to tar files
+docker save datachain/studio-frontend:${TARGET_VERSION} > studio-frontend-${TARGET_VERSION}.tar
+docker save datachain/studio-backend:${TARGET_VERSION} > studio-backend-${TARGET_VERSION}.tar
+docker save datachain/studio-worker:${TARGET_VERSION} > studio-worker-${TARGET_VERSION}.tar
+docker save datachain/studio-scheduler:${TARGET_VERSION} > studio-scheduler-${TARGET_VERSION}.tar
+
+# Export dependency images if needed
+docker save postgres:14 > postgres-14.tar
+docker save redis:7 > redis-7.tar
+docker save nginx:1.24 > nginx-1.24.tar
+```
+
+### 2. Download Helm Chart
+
+```bash
+# Update Helm repository
+helm repo add datachain https://charts.datachain.ai
+helm repo update
+
+# Download specific chart version
+helm pull datachain/studio --version ${TARGET_VERSION}
+
+# This creates a file like: studio-${TARGET_VERSION}.tgz
+```
+
+### 3. Download Dependencies (if needed)
+
+```bash
+# Download any additional charts or dependencies
+# (helm dependency update expects an unpacked chart directory, not a .tgz archive)
+tar -xzf studio-${TARGET_VERSION}.tgz
+helm dependency update ./studio
+```
+
+## Transfer Assets to Air-gapped Environment
+
+### 1. Transfer Files
+
+Transfer the downloaded files to your air-gapped environment:
+
+```bash
+# Files to transfer:
+# - studio-frontend-${TARGET_VERSION}.tar
+# - studio-backend-${TARGET_VERSION}.tar
+# - studio-worker-${TARGET_VERSION}.tar
+# - studio-scheduler-${TARGET_VERSION}.tar
+# - postgres-14.tar (if updated)
+# - redis-7.tar (if updated)
+# - nginx-1.24.tar (if updated)
+# - studio-${TARGET_VERSION}.tgz
+
+# Use secure transfer method appropriate for your environment:
+# - Physical media (USB drive, CD/DVD)
+# - Secure file transfer over isolated network
+# - Air-gapped file transfer tools
+```
+
+### 2. Verify Transfer Integrity
+
+```bash
+# Verify checksums to ensure files weren't corrupted
+sha256sum studio-frontend-${TARGET_VERSION}.tar
+sha256sum studio-backend-${TARGET_VERSION}.tar
+sha256sum studio-worker-${TARGET_VERSION}.tar
+sha256sum studio-scheduler-${TARGET_VERSION}.tar
+sha256sum studio-${TARGET_VERSION}.tgz
+
+# Compare with checksums from connected system
+```
+
+## Load Assets in Air-gapped Environment
+
+### 1. 
Load Container Images + +```bash +# Load images into local Docker daemon +docker load < studio-frontend-${TARGET_VERSION}.tar +docker load < studio-backend-${TARGET_VERSION}.tar +docker load < studio-worker-${TARGET_VERSION}.tar +docker load < studio-scheduler-${TARGET_VERSION}.tar + +# Load dependency images if needed +docker load < postgres-14.tar +docker load < redis-7.tar +docker load < nginx-1.24.tar + +# Verify images are loaded +docker images | grep datachain +``` + +### 2. Tag and Push to Internal Registry + +```bash +# Set your internal registry URL +INTERNAL_REGISTRY="registry.internal.company.com" + +# Tag images for internal registry +docker tag datachain/studio-frontend:${TARGET_VERSION} ${INTERNAL_REGISTRY}/datachain/studio-frontend:${TARGET_VERSION} +docker tag datachain/studio-backend:${TARGET_VERSION} ${INTERNAL_REGISTRY}/datachain/studio-backend:${TARGET_VERSION} +docker tag datachain/studio-worker:${TARGET_VERSION} ${INTERNAL_REGISTRY}/datachain/studio-worker:${TARGET_VERSION} +docker tag datachain/studio-scheduler:${TARGET_VERSION} ${INTERNAL_REGISTRY}/datachain/studio-scheduler:${TARGET_VERSION} + +# Push to internal registry +docker push ${INTERNAL_REGISTRY}/datachain/studio-frontend:${TARGET_VERSION} +docker push ${INTERNAL_REGISTRY}/datachain/studio-backend:${TARGET_VERSION} +docker push ${INTERNAL_REGISTRY}/datachain/studio-worker:${TARGET_VERSION} +docker push ${INTERNAL_REGISTRY}/datachain/studio-scheduler:${TARGET_VERSION} + +# Push dependency images if needed +docker tag postgres:14 ${INTERNAL_REGISTRY}/postgres:14 +docker tag redis:7 ${INTERNAL_REGISTRY}/redis:7 +docker push ${INTERNAL_REGISTRY}/postgres:14 +docker push ${INTERNAL_REGISTRY}/redis:7 +``` + +### 3. Extract and Install Helm Chart + +```bash +# Extract Helm chart +tar -xzf studio-${TARGET_VERSION}.tgz + +# Add to local Helm repository (if using chartmuseum or similar) +# Or use directly from extracted directory +``` + +## Update Configuration for Air-gapped Environment + +### 1. Update Image References + +Update your `values.yaml` to reference internal registry: + +```yaml +# values.yaml +global: + imageRegistry: "registry.internal.company.com" + +images: + frontend: + repository: datachain/studio-frontend + tag: "1.2.3" + pullPolicy: IfNotPresent + + backend: + repository: datachain/studio-backend + tag: "1.2.3" + pullPolicy: IfNotPresent + + worker: + repository: datachain/studio-worker + tag: "1.2.3" + pullPolicy: IfNotPresent + + scheduler: + repository: datachain/studio-scheduler + tag: "1.2.3" + pullPolicy: IfNotPresent + +# Update dependency images if needed +postgresql: + image: + registry: registry.internal.company.com + repository: postgres + tag: "14" + +redis: + image: + registry: registry.internal.company.com + repository: redis + tag: "7" +``` + +### 2. Configure Image Pull Secrets (if needed) + +```bash +# Create image pull secret for internal registry +kubectl create secret docker-registry internal-registry-secret \ + --namespace datachain-studio \ + --docker-server=registry.internal.company.com \ + --docker-username=your-username \ + --docker-password=your-password + +# Reference in values.yaml +``` + +```yaml +# values.yaml +imagePullSecrets: + - name: internal-registry-secret +``` + +## Perform the Upgrade + +### 1. Plan the Upgrade + +```bash +# Dry run to see what will change +helm upgrade datachain-studio ./studio \ + --namespace datachain-studio \ + --values values.yaml \ + --dry-run --debug +``` + +### 2. 
Execute the Upgrade + +```bash +# Perform the upgrade using local chart +helm upgrade datachain-studio ./studio \ + --namespace datachain-studio \ + --values values.yaml \ + --wait \ + --timeout 10m + +# Monitor upgrade progress +kubectl get pods -n datachain-studio -w +``` + +### 3. Verify Upgrade + +```bash +# Check upgrade status +helm status datachain-studio -n datachain-studio + +# Verify all pods are running with new images +kubectl get pods -n datachain-studio -o wide + +# Check pod images +kubectl describe pod POD_NAME -n datachain-studio | grep -i image + +# Test application health +curl -f https://studio.yourcompany.com/health +``` + +## Post-Upgrade Validation + +### 1. Image Verification + +Verify that pods are using the correct images from your internal registry: + +```bash +# Check pod images +kubectl get pods -n datachain-studio -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' + +# Verify images are from internal registry +kubectl describe pods -n datachain-studio | grep -i "image:" +``` + +### 2. Functional Testing + +Follow the same functional testing procedures as the [regular upgrade](regular-procedure.md#post-upgrade-validation): + +- Test authentication and authorization +- Verify API endpoints functionality +- Test database connectivity +- Validate Git integration +- Check webhook delivery + +### 3. Performance Validation + +Monitor system performance after upgrade: + +```bash +# Check resource usage +kubectl top nodes +kubectl top pods -n datachain-studio + +# Monitor application logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio + +# Test response times +time curl -f https://studio.yourcompany.com/health +``` + +## Rollback Procedure for Air-gapped Environment + +### 1. Rollback Using Helm + +```bash +# List release history +helm history datachain-studio -n datachain-studio + +# Rollback to previous version +helm rollback datachain-studio -n datachain-studio + +# Verify rollback +helm status datachain-studio -n datachain-studio +kubectl get pods -n datachain-studio +``` + +### 2. 
Image Rollback + +If you need to rollback to previous container images: + +```bash +# Load previous version images (if available) +docker load < studio-frontend-PREVIOUS_VERSION.tar +docker load < studio-backend-PREVIOUS_VERSION.tar + +# Tag and push to internal registry +docker tag datachain/studio-frontend:PREVIOUS_VERSION ${INTERNAL_REGISTRY}/datachain/studio-frontend:PREVIOUS_VERSION +docker push ${INTERNAL_REGISTRY}/datachain/studio-frontend:PREVIOUS_VERSION + +# Update values.yaml with previous version tags +# Then run helm upgrade with previous configuration +``` + +## Troubleshooting Air-gapped Upgrades + +### Image Pull Failures + +**Cannot pull images from internal registry:** + +```bash +# Check registry connectivity +nslookup registry.internal.company.com +telnet registry.internal.company.com 443 + +# Check authentication +kubectl get secret internal-registry-secret -n datachain-studio -o yaml + +# Test image pull manually +docker pull registry.internal.company.com/datachain/studio-frontend:VERSION +``` + +**Images not found in registry:** + +```bash +# Check if images were pushed correctly +curl -u username:password https://registry.internal.company.com/v2/_catalog +curl -u username:password https://registry.internal.company.com/v2/datachain/studio-frontend/tags/list + +# Re-push images if necessary +docker push ${INTERNAL_REGISTRY}/datachain/studio-frontend:${TARGET_VERSION} +``` + +### Chart Installation Issues + +**Chart not found:** + +```bash +# Verify chart directory structure +ls -la ./studio/ + +# Check chart.yaml +cat ./studio/Chart.yaml + +# Validate chart +helm lint ./studio +``` + +### Network Connectivity Issues + +**Services cannot communicate:** + +```bash +# Check internal DNS resolution +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- nslookup postgres-service + +# Check service endpoints +kubectl get endpoints -n datachain-studio + +# Test internal service connectivity +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- curl http://postgres-service:5432 +``` + +## Best Practices for Air-gapped Upgrades + +### Planning +1. **Test thoroughly** in air-gapped staging environment +2. **Prepare all assets** in advance on connected system +3. **Verify checksums** of all transferred files +4. **Document procedures** specific to your environment + +### Execution +1. **Minimize downtime** by preparing everything in advance +2. **Monitor carefully** during the upgrade process +3. **Have rollback assets ready** (previous version images) +4. **Test connectivity** to internal services + +### Maintenance +1. **Keep internal registry updated** with required images +2. **Maintain image versioning** strategy +3. **Document air-gap specific configurations** +4. 
**Plan for dependency updates** + +## Automation for Air-gapped Environments + +### Scripted Asset Preparation + +Create scripts to automate asset preparation: + +```bash +#!/bin/bash +# prepare-airgap-upgrade.sh + +TARGET_VERSION=$1 +INTERNAL_REGISTRY=$2 + +# Download images +docker pull datachain/studio-frontend:${TARGET_VERSION} +docker pull datachain/studio-backend:${TARGET_VERSION} + +# Save to tar files +docker save datachain/studio-frontend:${TARGET_VERSION} > studio-frontend-${TARGET_VERSION}.tar +docker save datachain/studio-backend:${TARGET_VERSION} > studio-backend-${TARGET_VERSION}.tar + +# Download Helm chart +helm pull datachain/studio --version ${TARGET_VERSION} + +echo "Assets prepared for air-gapped upgrade to version ${TARGET_VERSION}" +``` + +### Deployment Scripts + +```bash +#!/bin/bash +# deploy-airgap-upgrade.sh + +TARGET_VERSION=$1 +INTERNAL_REGISTRY=$2 + +# Load and push images +docker load < studio-frontend-${TARGET_VERSION}.tar +docker tag datachain/studio-frontend:${TARGET_VERSION} ${INTERNAL_REGISTRY}/datachain/studio-frontend:${TARGET_VERSION} +docker push ${INTERNAL_REGISTRY}/datachain/studio-frontend:${TARGET_VERSION} + +# Extract and upgrade +tar -xzf studio-${TARGET_VERSION}.tgz +helm upgrade datachain-studio ./studio --namespace datachain-studio --values values.yaml --wait + +echo "Air-gapped upgrade to version ${TARGET_VERSION} completed" +``` + +## Next Steps + +- Review [configuration changes](../configuration/index.md) for new version +- Update [monitoring setup](../configuration/index.md#monitoring) if needed +- Plan for [next upgrade cycle](index.md) with lessons learned +- Document air-gap specific procedures for your organization + +For issues during air-gapped upgrades, consult the [troubleshooting guide](../troubleshooting/index.md) and adapt solutions for your air-gapped environment. diff --git a/docs/studio/self-hosting/upgrading/index.md b/docs/studio/self-hosting/upgrading/index.md new file mode 100644 index 000000000..f1c938747 --- /dev/null +++ b/docs/studio/self-hosting/upgrading/index.md @@ -0,0 +1,396 @@ +# Upgrading DataChain Studio + +This section covers how to upgrade your self-hosted DataChain Studio instance to newer versions safely and efficiently. 
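+
+Before planning an upgrade, it helps to confirm which version is currently deployed and which chart versions are available. Below is a minimal sketch, assuming the `datachain-studio` release name and `datachain` Helm repository used throughout this guide:
+
+```bash
+# Show the deployed release along with its chart and app versions
+helm list -n datachain-studio
+
+# Refresh the repository index and list the available chart versions
+helm repo update datachain
+helm search repo datachain/studio --versions | head -5
+```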
+ +## Overview + +DataChain Studio upgrades involve updating: + +- **Application Code**: Core DataChain Studio services +- **Database Schema**: Database migrations and updates +- **Configuration**: New configuration options and changes +- **Dependencies**: Updated system dependencies and containers + +## Upgrade Methods + +- **[Regular Procedure](regular-procedure.md)** - Standard upgrade process for most deployments +- **[Airgap Procedure](airgap-procedure.md)** - Upgrade process for air-gapped environments + +## Before You Begin + +### Prerequisites + +- **Backup**: Complete backup of data and configuration +- **Maintenance Window**: Scheduled downtime for the upgrade +- **Access**: Administrative access to your deployment +- **Resources**: Sufficient system resources for the upgrade + +### Pre-upgrade Checklist + +- [ ] Review release notes for breaking changes +- [ ] Backup database and configuration files +- [ ] Test upgrade in staging environment +- [ ] Verify system requirements are met +- [ ] Plan rollback strategy if needed +- [ ] Notify users of scheduled maintenance + +## Upgrade Planning + +### Version Compatibility + +Check version compatibility before upgrading: + +- **Supported Upgrades**: Direct upgrades from previous major version +- **Skip Versions**: Intermediate versions may be required for large jumps +- **Breaking Changes**: Review changelog for breaking changes + +### System Requirements + +Verify system requirements for the target version: + +```yaml +# Minimum requirements may change between versions +systemRequirements: + kubernetes: "1.21+" + helm: "3.7+" + nodes: + minimum: 2 + recommended: 3 + resources: + cpu: "4 cores" + memory: "16GB RAM" + storage: "100GB" +``` + +### Backup Strategy {#backup} + +Always backup before upgrading: + +#### Database Backup +```bash +# PostgreSQL backup +kubectl exec -it postgres-pod -n datachain-studio -- \ + pg_dump -U studio datachain_studio > backup-$(date +%Y%m%d).sql + +# Or using helm backup job +helm install backup-job datachain/backup \ + --namespace datachain-studio \ + --set backup.type=database +``` + +#### Configuration Backup +```bash +# Backup Helm values +helm get values datachain-studio -n datachain-studio > values-backup.yaml + +# Backup Kubernetes resources +kubectl get all -n datachain-studio -o yaml > k8s-resources-backup.yaml + +# Backup secrets +kubectl get secrets -n datachain-studio -o yaml > secrets-backup.yaml +``` + +#### Storage Backup +```bash +# Backup persistent volumes (depends on storage provider) +kubectl get pv,pvc -n datachain-studio + +# For cloud storage, use provider tools: +# AWS: aws s3 sync s3://studio-bucket s3://studio-backup-bucket +# GCS: gsutil -m cp -r gs://studio-bucket gs://studio-backup-bucket +``` + +## Upgrade Process Overview + +### Standard Upgrade Flow + +1. **Preparation** + - Review release notes + - Plan maintenance window + - Create backups + +2. **Pre-upgrade Tasks** + - Update Helm repositories + - Validate configuration + - Check resource availability + +3. **Upgrade Execution** + - Apply configuration changes + - Perform database migrations + - Update application containers + +4. **Post-upgrade Tasks** + - Verify system health + - Test functionality + - Monitor performance + +5. 
**Validation** + - Run integration tests + - Verify data integrity + - Confirm user access + +## Version Management + +### Semantic Versioning + +DataChain Studio follows semantic versioning (MAJOR.MINOR.PATCH): + +- **MAJOR**: Breaking changes requiring manual intervention +- **MINOR**: New features, backward compatible +- **PATCH**: Bug fixes, security updates + +### Release Channels + +Choose appropriate release channel: + +- **Stable**: Production-ready releases +- **Beta**: Pre-release versions for testing +- **Alpha**: Early development versions + +```yaml +# Configure release channel in values.yaml +global: + image: + tag: "1.2.3" # Specific version + # tag: "stable" # Latest stable + # tag: "beta" # Latest beta +``` + +## Rollback Strategy + +### Automated Rollback + +Prepare for potential rollback: + +```yaml +# Enable automated rollback on failure +upgrade: + rollback: + enabled: true + onFailure: true + timeout: 600s + + # Health checks for validation + healthChecks: + enabled: true + initialDelay: 30s + timeout: 10s + failureThreshold: 3 +``` + +### Manual Rollback + +Steps for manual rollback: + +```bash +# Rollback using Helm +helm rollback datachain-studio -n datachain-studio + +# Restore database from backup +kubectl exec -it postgres-pod -n datachain-studio -- \ + psql -U studio datachain_studio < backup-20240115.sql + +# Restore configuration +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values-backup.yaml +``` + +## Monitoring During Upgrades + +### Health Monitoring + +Monitor system health during upgrades: + +```yaml +monitoring: + upgrade: + enabled: true + + # Metrics to monitor + metrics: + - cpu_usage + - memory_usage + - database_connections + - response_time + - error_rate + + # Alert thresholds + alerts: + - name: "High Error Rate During Upgrade" + condition: "error_rate > 5%" + duration: "2m" + action: "pause_upgrade" +``` + +### Log Monitoring + +Key logs to monitor: + +```bash +# Application logs +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio + +# Database migration logs +kubectl logs -f job/migration-job -n datachain-studio + +# Ingress logs +kubectl logs -f deployment/nginx-ingress-controller -n ingress-nginx +``` + +## Testing After Upgrade + +### Automated Testing + +Run automated tests after upgrade: + +```bash +# Health check tests +curl -f https://studio.yourcompany.com/health + +# API functionality tests +curl -H "Authorization: Bearer $TOKEN" \ + https://studio.yourcompany.com/api/datasets + +# Database connectivity tests +kubectl exec -it postgres-pod -n datachain-studio -- \ + psql -U studio -c "SELECT version();" +``` + +### Manual Testing + +Perform manual testing: + +- [ ] User login functionality +- [ ] Dataset creation and access +- [ ] Job submission and execution +- [ ] Git integration functionality +- [ ] Webhook delivery +- [ ] API endpoints +- [ ] User interface responsiveness + +## Upgrade Troubleshooting + +### Common Issues + +**Database migration failures:** +- Check database connectivity +- Verify migration scripts +- Review database logs +- Ensure sufficient disk space + +**Container startup failures:** +- Check resource availability +- Verify image availability +- Review configuration changes +- Check dependency services + +**Configuration conflicts:** +- Compare old vs new configuration +- Review breaking changes in release notes +- Validate YAML syntax +- Check required vs optional fields + +### Recovery Procedures + +**Service degradation:** +1. 
Check resource utilization +2. Review application logs +3. Verify configuration +4. Consider scaling resources +5. Rollback if necessary + +**Data corruption:** +1. Stop write operations +2. Assess corruption extent +3. Restore from backup +4. Verify data integrity +5. Resume operations + +## Best Practices + +### Upgrade Preparation + +1. **Test in Staging**: Always test upgrades in staging first +2. **Read Release Notes**: Review all changes and breaking changes +3. **Plan Downtime**: Schedule appropriate maintenance windows +4. **Prepare Rollback**: Have rollback plan ready + +### During Upgrade + +1. **Monitor Closely**: Watch logs and metrics during upgrade +2. **Validate Each Step**: Confirm each step completes successfully +3. **Document Issues**: Record any problems encountered +4. **Stay Calm**: Follow procedures methodically + +### Post-Upgrade + +1. **Thorough Testing**: Test all critical functionality +2. **Performance Monitoring**: Watch for performance regressions +3. **User Communication**: Notify users when service is restored +4. **Document Lessons**: Record lessons learned for next time + +## Automation + +### CI/CD Integration + +Automate upgrades using CI/CD: + +```yaml +# GitLab CI example +upgrade-staging: + stage: upgrade + script: + - helm repo update + - helm upgrade datachain-studio datachain/studio + --namespace datachain-studio-staging + --values values-staging.yaml + only: + - main + +upgrade-production: + stage: upgrade + script: + - helm upgrade datachain-studio datachain/studio + --namespace datachain-studio + --values values-production.yaml + when: manual + only: + - main +``` + +### Automated Validation + +```yaml +# Automated post-upgrade validation +validation: + enabled: true + + tests: + - name: "Health Check" + url: "https://studio.yourcompany.com/health" + expected_status: 200 + + - name: "API Test" + url: "https://studio.yourcompany.com/api/version" + expected_status: 200 + timeout: 30s + + - name: "Database Test" + type: "sql" + query: "SELECT COUNT(*) FROM datasets" + expected_result: "> 0" +``` + +## Next Steps + +Choose your upgrade method: + +- **[Regular Procedure](regular-procedure.md)** - For connected environments +- **[Airgap Procedure](airgap-procedure.md)** - For air-gapped environments + +For additional information: + +- [Configuration Guide](../configuration/index.md) for post-upgrade configuration +- [Troubleshooting Guide](../troubleshooting/index.md) for resolving issues +- [Installation Guide](../installation/index.md) for fresh installations diff --git a/docs/studio/self-hosting/upgrading/regular-procedure.md b/docs/studio/self-hosting/upgrading/regular-procedure.md new file mode 100644 index 000000000..c8b610f58 --- /dev/null +++ b/docs/studio/self-hosting/upgrading/regular-procedure.md @@ -0,0 +1,464 @@ +# Regular Upgrade Procedure + +This guide covers the standard upgrade procedure for DataChain Studio deployments with internet access. + +## Prerequisites + +- Administrative access to your Kubernetes cluster or AMI instance +- Internet connectivity for downloading new container images +- Backup of current configuration and data +- Scheduled maintenance window + +## Pre-upgrade Preparation + +### 1. Review Release Notes + +Before upgrading, review the release notes for: +- Breaking changes that may affect your deployment +- New configuration options +- Deprecated features +- Security updates + +### 2. 
Create Backups + +#### Database Backup +```bash +# For Kubernetes deployments +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + pg_dump -U studio datachain_studio > backup-$(date +%Y%m%d-%H%M%S).sql + +# For AMI deployments +sudo -u postgres pg_dump datachain_studio > backup-$(date +%Y%m%d-%H%M%S).sql +``` + +#### Configuration Backup +```bash +# Kubernetes: Backup Helm values +helm get values datachain-studio -n datachain-studio > values-backup-$(date +%Y%m%d).yaml + +# AMI: Backup configuration file +sudo cp /opt/datachain-studio/config.yml /opt/datachain-studio/config-backup-$(date +%Y%m%d).yml +``` + +#### Storage Backup (if applicable) +```bash +# Backup persistent volumes +kubectl get pv,pvc -n datachain-studio -o yaml > pv-backup-$(date +%Y%m%d).yaml + +# For cloud storage, create snapshots using provider tools +``` + +### 3. Verify System Health + +Before starting the upgrade, ensure the system is healthy: + +```bash +# Check pod status +kubectl get pods -n datachain-studio + +# Check resource usage +kubectl top nodes +kubectl top pods -n datachain-studio + +# Check service availability +curl -f https://studio.yourcompany.com/health +``` + +## Kubernetes/Helm Upgrade + +### 1. Update Helm Repository + +```bash +# Update the DataChain Helm repository +helm repo update datachain + +# Check available versions +helm search repo datachain/studio --versions +``` + +### 2. Review Configuration Changes + +Compare your current configuration with the new version: + +```bash +# Get current values +helm get values datachain-studio -n datachain-studio > current-values.yaml + +# Show default values for new version +helm show values datachain/studio --version NEW_VERSION > new-default-values.yaml + +# Compare configurations +diff current-values.yaml new-default-values.yaml +``` + +### 3. Plan the Upgrade + +```bash +# Dry run to see what will change +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml \ + --version NEW_VERSION \ + --dry-run --debug +``` + +### 4. Perform the Upgrade + +```bash +# Upgrade DataChain Studio +helm upgrade datachain-studio datachain/studio \ + --namespace datachain-studio \ + --values values.yaml \ + --version NEW_VERSION \ + --wait \ + --timeout 10m + +# Monitor the upgrade progress +kubectl get pods -n datachain-studio -w +``` + +### 5. Verify Upgrade + +```bash +# Check upgrade status +helm status datachain-studio -n datachain-studio + +# Verify all pods are running +kubectl get pods -n datachain-studio + +# Check services +kubectl get services -n datachain-studio + +# Test application health +curl -f https://studio.yourcompany.com/health +``` + +## AMI Upgrade + +### 1. Connect to the Instance + +```bash +# SSH to your AMI instance +ssh -i your-key.pem ubuntu@your-instance-ip +``` + +### 2. Update System Packages + +```bash +# Update system packages +sudo apt update && sudo apt upgrade -y + +# Update Docker if needed +sudo apt install docker.io +``` + +### 3. Pull New Images + +```bash +# Pull new DataChain Studio images +sudo docker pull datachain/studio-frontend:NEW_VERSION +sudo docker pull datachain/studio-backend:NEW_VERSION +sudo docker pull datachain/studio-worker:NEW_VERSION + +# List current images +sudo docker images | grep datachain +``` + +### 4. 
Update Configuration
+
+```bash
+# Navigate to configuration directory
+cd /opt/datachain-studio
+
+# Backup current configuration
+sudo cp config.yml config-backup-$(date +%Y%m%d).yml
+
+# Update configuration if needed (based on release notes)
+sudo nano config.yml
+```
+
+### 5. Stop Current Services
+
+```bash
+# Stop DataChain Studio services
+sudo systemctl stop datachain-studio
+
+# Verify services are stopped
+sudo systemctl status datachain-studio
+```
+
+### 6. Update and Start Services
+
+```bash
+# Update service configuration if needed
+sudo systemctl daemon-reload
+
+# Start services with new version
+sudo systemctl start datachain-studio
+
+# Enable auto-start
+sudo systemctl enable datachain-studio
+
+# Check service status
+sudo systemctl status datachain-studio
+```
+
+### 7. Verify Upgrade
+
+```bash
+# Check container status
+sudo docker ps
+
+# Check logs
+sudo journalctl -u datachain-studio -f
+
+# Test application
+curl -f https://studio.yourcompany.com/health
+```
+
+## Database Migrations
+
+### Automatic Migrations
+
+DataChain Studio typically handles database migrations automatically during startup:
+
+```bash
+# Monitor migration logs
+kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i migration
+
+# For AMI deployments
+sudo journalctl -u datachain-studio -f | grep -i migration
+```
+
+### Manual Migration (if required)
+
+If manual intervention is needed:
+
+```bash
+# Run migrations manually (Kubernetes); kubectl create job --from only
+# accepts cronjobs, so run the migration inside the backend pod instead
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
+  python manage.py migrate
+
+# Monitor migration output in the backend logs
+kubectl logs -f deployment/datachain-studio-backend -n datachain-studio | grep -i migration
+
+# For AMI, run migration command
+cd /opt/datachain-studio
+sudo -u datachain python manage.py migrate
+```
+
+## Post-Upgrade Validation
+
+### 1. System Health Checks
+
+```bash
+# Check all services are running
+kubectl get pods -n datachain-studio
+sudo systemctl status datachain-studio  # For AMI
+
+# Verify resource usage is normal
+kubectl top pods -n datachain-studio
+top  # For AMI
+
+# Test external connectivity
+curl -f https://studio.yourcompany.com/health
+```
+
+### 2. Functional Testing
+
+Test critical functionality:
+
+#### Authentication
+```bash
+# Test login page
+curl -f https://studio.yourcompany.com/login
+
+# Test OAuth endpoints
+curl -f https://studio.yourcompany.com/auth/github/login
+```
+
+#### API Endpoints
+```bash
+# Test API availability
+curl -H "Authorization: Bearer $TOKEN" \
+  https://studio.yourcompany.com/api/datasets
+
+# Test version endpoint
+curl https://studio.yourcompany.com/api/version
+```
+
+#### Database Connectivity
+```bash
+# Test database connection
+kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
+  python -c "import django; django.setup(); from django.db import connection; print('DB OK' if connection.ensure_connection() is None else 'DB Error')"
+```
+
+#### Git Integration
+```bash
+# Test Git forge connectivity
+curl -f https://studio.yourcompany.com/api/git/github/test
+curl -f https://studio.yourcompany.com/api/git/gitlab/test
+```
+
+### 3. 
Performance Validation + +Monitor performance after upgrade: + +```bash +# Check response times +time curl -f https://studio.yourcompany.com/health + +# Monitor resource usage +kubectl top pods -n datachain-studio +htop # For AMI + +# Check application metrics (if monitoring is enabled) +curl https://studio.yourcompany.com/metrics +``` + +## Rollback Procedure + +If issues are encountered during or after upgrade: + +### Kubernetes Rollback + +```bash +# List release history +helm history datachain-studio -n datachain-studio + +# Rollback to previous version +helm rollback datachain-studio -n datachain-studio + +# Or rollback to specific revision +helm rollback datachain-studio REVISION_NUMBER -n datachain-studio + +# Verify rollback +helm status datachain-studio -n datachain-studio +``` + +### AMI Rollback + +```bash +# Stop current services +sudo systemctl stop datachain-studio + +# Restore configuration backup +sudo cp /opt/datachain-studio/config-backup-DATE.yml /opt/datachain-studio/config.yml + +# Pull previous version images +sudo docker pull datachain/studio-frontend:PREVIOUS_VERSION +sudo docker pull datachain/studio-backend:PREVIOUS_VERSION + +# Update service to use previous version +# (Edit systemd service files or docker-compose as needed) + +# Restore database if needed +sudo -u postgres psql datachain_studio < backup-DATE.sql + +# Start services +sudo systemctl start datachain-studio +``` + +## Troubleshooting Common Issues + +### Upgrade Hangs or Fails + +**Container image pull failures:** +```bash +# Check image availability +docker pull datachain/studio-backend:NEW_VERSION + +# Check registry connectivity +kubectl describe pod POD_NAME -n datachain-studio +``` + +**Database migration failures:** +```bash +# Check database connectivity +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT version();" + +# Check migration logs +kubectl logs deployment/datachain-studio-backend -n datachain-studio | grep -i migration + +# Manually run problematic migrations +kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \ + python manage.py migrate --verbosity=2 +``` + +**Resource constraints:** +```bash +# Check node resources +kubectl describe nodes + +# Check pod resource requests/limits +kubectl describe pod POD_NAME -n datachain-studio + +# Scale down other services temporarily if needed +kubectl scale deployment OTHER_DEPLOYMENT --replicas=0 -n datachain-studio +``` + +### Post-Upgrade Issues + +**Service unavailable:** +```bash +# Check pod status +kubectl get pods -n datachain-studio + +# Check service endpoints +kubectl get endpoints -n datachain-studio + +# Check ingress configuration +kubectl describe ingress -n datachain-studio +``` + +**Performance degradation:** +```bash +# Check resource usage +kubectl top pods -n datachain-studio + +# Review application logs for errors +kubectl logs -f deployment/datachain-studio-backend -n datachain-studio + +# Check database performance +kubectl exec -it deployment/datachain-studio-postgres -n datachain-studio -- \ + psql -U studio -c "SELECT * FROM pg_stat_activity;" +``` + +## Best Practices + +### Before Upgrade +1. **Test in staging** environment first +2. **Schedule maintenance windows** during low usage +3. **Communicate** with users about planned downtime +4. **Document** current configuration and customizations + +### During Upgrade +1. **Monitor closely** throughout the process +2. **Have rollback plan ready** and tested +3. 
**Keep logs** of all commands and outputs
+4. **Stay calm** and follow procedures methodically
+
+### After Upgrade
+
+1. **Validate thoroughly** before declaring success
+2. **Monitor performance** for several hours/days
+3. **Update documentation** with any changes
+4. **Clean up** old images and backups after verification
+
+## Next Steps
+
+- Review [configuration changes](../configuration/index.md) that may be needed
+- Update [monitoring and alerting](../configuration/index.md#monitoring) if applicable
+- Check [troubleshooting guide](../troubleshooting/index.md) if issues occur
+- Plan for [next upgrade cycle](index.md) based on release schedule
+
+## Support
+
+If you encounter issues during the upgrade:
+
+1. Check the [troubleshooting guide](../troubleshooting/index.md)
+2. Review application logs for error messages
+3. Consult the release notes for known issues
+4. Contact DataChain support with specific error details
diff --git a/docs/studio/user-guide/account-management.md b/docs/studio/user-guide/account-management.md
new file mode 100644
index 000000000..43f4f8f97
--- /dev/null
+++ b/docs/studio/user-guide/account-management.md
@@ -0,0 +1,98 @@
+# Account Management
+
+To open your account settings, click on your user icon in the top-right corner of DataChain Studio and go to your `Settings`. You can view and update the following settings:
+
+- [General settings](#general-settings)
+    - [Profile details](#profile-details) - update your name and profile picture
+    - [Account details](#account-details) - manage your username, password, email addresses, and delete your account
+- [Git connections](#git-connections) with GitHub, GitLab and Bitbucket
+- [Cloud credentials](#cloud-credentials) for data remotes
+- [Teams](#teams) that you own
+- [Tokens](#tokens)
+    - [Client access tokens](#client-access-tokens) for dataset and job operations
+
+!!! note
+    This does not include managing your team plan (Free or Enterprise). Team plans are defined for each team separately. [Get Enterprise](team-collaboration.md#get-enterprise).
+
+## General settings
+
+On your settings page, the general tab includes your profile and account settings.
+
+### Profile details
+
+Here, you can update your first name, last name and profile picture.
+
+### Account details
+
+In the account section, your username is displayed. Here, you can also update your username, password and email addresses.
+
+!!! note
+    If you signed up with a GitHub, GitLab or Bitbucket account, these details are fetched from your connected Git hosting account.
+
+#### Managing email addresses
+
+You can add multiple email addresses to a single DataChain Studio account. You can log in to the account with any of your verified email addresses as long as you have set up a password for your account. This is true even if you signed up using your GitHub, GitLab, or Bitbucket account.
+
+One of your email addresses must be designated as primary. This is the address to which DataChain Studio will send all your account notification emails.
+
+You can change your primary email address by clicking on the `Primary` button next to the email address that you want to designate as primary.
+
+You can delete your non-primary email addresses.
+
+#### Delete account
+
+If you delete your account, all the projects you own and the links that you have shared will be permanently deleted. So, click on `Delete my account` only if you are absolutely sure that you do not need those projects or links anymore.
+
+!!! 
note + Deleting your account in DataChain Studio does not delete your Git repositories. + +## Git Connections + +In this section, you can: + +- Connect to GitHub.com, GitLab.com or Bitbucket.org. + + When you connect to a Git hosting provider, you will be prompted to grant DataChain Studio access to your account. + + To connect to your GitHub repositories, you must install the DataChain Studio GitHub app. Refer to the section on [GitHub app installation](git-connections/github-app.md) for more details. + + Note that if you signed up to use DataChain Studio using your GitHub, GitLab or Bitbucket account, integration with that Git account will have been created during sign up. + + Also, note that **connections to self-hosted GitLab servers** are not managed in this section. If you want to connect to a self-hosted GitLab server, you should create a team and [set up the GitLab server connection](git-connections/custom-gitlab-server.md) in the team settings. + +- Disconnect from your GitHub, GitLab, or Bitbucket accounts. +- Configure your GitHub account connection. That is, install the DataChain Studio GitHub app on additional organizations or repositories, or even remove the app from organizations or repositories where you no longer need it. + +## Cloud credentials + +In this section, you can view, add and update credentials for cloud resources. These credentials are used to fetch project data from data remotes and cloud storage. + +To add new credentials, click `Add credentials` and select the cloud provider. Depending on the provider, you will be asked for more details. + +The credentials must have the required permissions for accessing your data storage. + +Finally, click `Save credentials`. + +!!! tip + DataChain Studio also supports [OpenID Connect authentication](authentication/openid-connect.md) for some cloud providers. + +## Teams + +In this section, you can view all the teams you are member of. + +Click on `select` to switch to the team's dashboard. Or, click on `manage` to go to the team settings page and manage the team. + +To create a new team, click on `Create a team` and enter the team name. You can invite members to the team by entering their email addresses. Find more details in the [team collaboration guide](team-collaboration.md#create-a-team). + +## Tokens + +### Client access tokens + +In this tokens section of your settings page, you can generate new client access tokens with specific scopes as well as delete existing access tokens. These tokens can be used to give limited permissions to a client without granting full access to your Studio account. You can restrict the access token to a certain team or allow it to access all teams as well. + +The available scopes are: + +- `Dataset operations` - Used for managing datasets and related operations. +- `Job related operations` - Used for managing data processing jobs and related operations. +- `Admin operations` - Used for team management and project creation. +- `Storage related operations` - Used for managing storage and related operations. diff --git a/docs/studio/user-guide/authentication/openid-connect.md b/docs/studio/user-guide/authentication/openid-connect.md new file mode 100644 index 000000000..2aa282319 --- /dev/null +++ b/docs/studio/user-guide/authentication/openid-connect.md @@ -0,0 +1,180 @@ +# OpenID Connect (OIDC) + +DataChain Studio can use OpenID Connect to access cloud resources securely, without requiring manual configuration of static credentials. 
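+
+As a quick sanity check, you can fetch DataChain Studio's OIDC discovery
+document (the URL listed under the generic configuration details below) and
+confirm that it is reachable from your environment:
+
+```bash
+# Fetch and pretty-print the OIDC discovery document
+curl -s https://studio.datachain.ai/api/.well-known/openid-configuration | python -m json.tool
+```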
+ +To use OIDC, first follow the [cloud configuration](#cloud-configuration) instructions and then the [Studio configuration](#studio-configuration) instructions. + +## Cloud configuration + +### Generic configuration details + +- OpenID Connect Discovery URL: https://studio.datachain.ai/api/.well-known/openid-configuration + +- Subject claim format: `credentials:{owner}/{name}` where `{owner}` is the name of the DataChain Studio **user** or **team** owning the credentials, and `{name}` is the name of the DataChain Studio [credentials](../account-management.md#cloud-credentials). + +### Terraform examples + +The following Terraform examples illustrate how to configure the supported cloud providers, granting DataChain Studio access to object storage resources through OpenID Connect. Update the fields as described below and then apply the Terraform configuration. Make note of the outputs of `terraform apply`, since you will need to enter those for [Studio configuration](#studio-configuration). + +!!! tip + Replace the sample `credentials:example-team/example-credentials` subject claim condition. Replace `example-team` with the Studio **user** or **team** owning the credentials, and replace `example-credentials` with any name you want to use for those credentials. This name must match what you enter during [Studio configuration](#studio-configuration). + +#### Amazon Web Services + +```hcl +terraform { + required_providers { + aws = { + source = "hashicorp/aws" + version = "~> 4.16" + } + } + + required_version = ">= 1.2.0" +} + +provider "aws" { + region = "us-east-1" +} + +locals { + provider = "studio.datachain.ai/api" + condition = "credentials:example-team/example-credentials" +} + +data "tls_certificate" "studio" { + url = "https://${local.provider}" +} + +data "aws_iam_policy_document" "studio_assume_role" { + statement { + effect = "Allow" + actions = ["sts:AssumeRoleWithWebIdentity"] + + principals { + type = "Federated" + identifiers = [aws_iam_openid_connect_provider.studio.arn] + } + + condition { + test = "StringLike" + variable = "${aws_iam_openid_connect_provider.studio.url}:sub" + values = [local.condition] + } + } +} + +data "aws_iam_policy_document" "studio" { + statement { + actions = ["s3:*"] + resources = ["*"] + } +} + +resource "aws_iam_openid_connect_provider" "studio" { + url = data.tls_certificate.studio.url + client_id_list = ["sts.amazonaws.com"] + thumbprint_list = [data.tls_certificate.studio.certificates.0.sha1_fingerprint] +} + +resource "aws_iam_role" "studio" { + max_session_duration = 12 * 60 * 60 # 12 hours + assume_role_policy = data.aws_iam_policy_document.studio_assume_role.json + + inline_policy { + name = "studio" + policy = data.aws_iam_policy_document.studio.json + } +} + +output "role_arn" { + value = aws_iam_role.studio.arn +} +``` + +#### Google Cloud + +```hcl +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "5.13.0" + } + } +} + +provider "google" { + project = "your-project-id" + region = "us-central1" +} + +locals { + provider = "studio.datachain.ai/api" + condition = "credentials:example-team/example-credentials" +} + +data "google_project" "current" {} + +resource "google_iam_workload_identity_pool" "studio" { + workload_identity_pool_id = "datachain-studio" + display_name = "DataChain Studio" +} + +resource "google_iam_workload_identity_pool_provider" "studio" { + workload_identity_pool_id = google_iam_workload_identity_pool.studio.workload_identity_pool_id + workload_identity_pool_provider_id = 
"datachain-studio" + display_name = "DataChain Studio" + + attribute_mapping = { + "google.subject" = "assertion.sub" + } + + attribute_condition = "assertion.sub == '${local.condition}'" + + oidc { + issuer_uri = "https://${local.provider}" + } +} + +resource "google_service_account" "studio" { + account_id = "datachain-studio" + display_name = "DataChain Studio" +} + +resource "google_service_account_iam_binding" "studio" { + service_account_id = google_service_account.studio.name + role = "roles/iam.workloadIdentityUser" + + members = [ + "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.studio.name}/*" + ] +} + +output "google_service_account_email" { + value = google_service_account.studio.email +} + +output "google_workload_identity_provider" { + value = google_iam_workload_identity_pool_provider.studio.name +} +``` + +## Studio configuration + +[Create new credentials](../account-management.md#cloud-credentials) and configure them as follows: + +1. Choose an adequate OIDC variant on the provider field; e.g. _Amazon Web Services (OIDC)_. +2. Enter the name for the credentials. This must match the name used during [cloud configuration](#cloud-configuration). +3. Fill the provider-specific fields with the outputs from `terraform apply`. + +## Troubleshooting + +If you encounter issues with OIDC setup: + +1. Verify that the subject claim format matches exactly +2. Check that the Terraform configuration has been applied successfully +3. Ensure that the credential names match between cloud configuration and Studio configuration +4. Verify that the workload identity pool and provider are configured correctly + +For more help, see our [troubleshooting guide](../troubleshooting.md). diff --git a/docs/studio/user-guide/authentication/single-sign-on.md b/docs/studio/user-guide/authentication/single-sign-on.md new file mode 100644 index 000000000..ed247f254 --- /dev/null +++ b/docs/studio/user-guide/authentication/single-sign-on.md @@ -0,0 +1,55 @@ +# Single Sign-on (SSO) + +Single Sign-on (SSO) allows your team members to authenticate to DataChain Studio using your organization's identity Provider (IdP) such as Okta, LDAP, Microsoft AD, etc. + +We support integration with Okta, and instructions are provided below; but other IdPs should also work in a similar manner. If you need any support setting up your IdP integration, [let us know](../troubleshooting.md#support). + +SSO for teams can be configured by team admins, and requires configuration on both DataChain Studio and the IdP. The exact steps for this depend on the IdP. + +Once the SSO configuration is complete users can login to DataChain Studio by opening their team's login page `https://studio.datachain.ai/api/teams//sso` in their browser. They can also login directly from their Okta end-user dashboards by clicking on the DataChain Studio integration icon. + +If a user does not have a pre-assigned role when they sign in to a team, they will be auto-assigned the Viewer role. + +## Okta integration + +1. **Create Enterprise account**: SSO is available for DataChain Studio teams with enterprise account subscriptions. If you are on the Free or Basic plan of DataChain Studio, contact us to upgrade your account. + +2. **Add integration with DataChain Studio in Okta**: Follow the instructions from the [Okta developer guide](https://developer.okta.com/docs/guides/build-sso-integration/saml2/main/#create-your-integration-in-okta). In short, login to Okta with an admin account, and follow these steps: + 1. 
In the admin console, go to `Applications` -> `Create App Integration` to create a private SSO integration. + 2. Use `SAML 2.0` as the `Sign in method` (and not `OIDC` or some other option). + 3. Enter any name (eg, `DataChain Studio`) as the `App name`. + 4. `Single sign-on URL`: [`https://studio.datachain.ai/api/teams//saml/consume`](https://studio.datachain.ai/api/teams//saml/consume) (Replace with the name of your team in Studio. + 5. `Audience URI (SP Entity ID)`: https://studio.datachain.ai/api/saml + 6. `Name ID Format`: Persistent + 7. `Application username (NameID)`: Okta username + 8. `Attribute Statements (optional)`: + 1. `Name`: email + 2. `Name format`: URI Reference + 3. `Value`: user.email + + Click on `Next` and `Finish`. Once the integration is created, open the `Sign On` tab and expand the `Hide Details` section. From here, copy the Identity Provider metadata URL. + +3. **Configure DataChain Studio**: In your team settings, go to the `SSO` section and enable SSO. Enter the Identity Provider metadata URL that you copied from Okta. + +4. **Assign users**: In Okta, assign users to the DataChain Studio application. + +5. **Test the integration**: Users can now login to DataChain Studio using their Okta credentials by visiting the team's SSO login page. + +## User roles + +DataChain Studio supports the following user roles: + +- **Admin**: Full access to team settings and all projects +- **Member**: Can create and manage projects, view all team projects +- **Viewer**: Read-only access to team projects + +## Troubleshooting + +If you encounter issues with SSO setup: + +1. Verify that the URLs and configuration values are entered correctly +2. Check that users are properly assigned to the application in your IdP +3. Ensure that the Identity Provider metadata URL is accessible +4. Contact our support team if you need assistance with configuration + +For more help, see our [troubleshooting guide](../troubleshooting.md). diff --git a/docs/studio/user-guide/experiments/configure-a-project.md b/docs/studio/user-guide/experiments/configure-a-project.md new file mode 100644 index 000000000..57b658e59 --- /dev/null +++ b/docs/studio/user-guide/experiments/configure-a-project.md @@ -0,0 +1,101 @@ +# Configure a Project + +You can configure additional settings for your projects, including the project +name, directory, etc. Some of these settings are optional while others may be +mandatory depending on how your Git repository has been set up. + +To configure a project's settings , open the 3-dot menu for the project and +click on `Settings`. + +## Project name + +To change the project name, enter the new name for your project as shown below. + +## Project directory + +If the DVC repo for which you are creating the project is not in the root of +your Git repository but is in a sub-directory +of a [monorepo](https://en.wikipedia.org/wiki/Monorepo), then +[specify the full path](./configure-a-project.md#project-directory) +to the sub-directory that contains the DVC repo to which you are trying to +connect. + + + +Create multiple projects at once by providing up to 10 comma-separated values +during the initial [create project] flow. + + + +[create project]: + ./create-a-project.md#create-multiple-projects-from-a-single-git-repository + +## Data remotes / cloud storage credentials + +Here, the data remotes (cloud +storage or another location outside the Git repo) that are used in your DVC repo +will be listed. 
If you want your project to include data stored in these data
+remotes, you will have to add credentials to grant DataChain Studio access to the data
+remotes. Credentials that you have already added to your account are listed in
+this section, and you can select them to add them to the project.
+
+To add new credentials, click on `Add new credentials` and select the provider
+(Amazon S3, GCP, etc.). For details on what types of remote storage (protocols)
+are supported, refer to the DVC documentation on supported storage types.
+
+Depending on the provider, you will be asked for more details such as the
+credentials name, username, password, etc. Note that for each supported storage
+type, the required details may be different.
+
+You will also have to ensure that the credentials you enter have the required
+permissions on the cloud / remote storage. Refer to the DVC Remote config
+parameters for more details about this.
+
+Any credentials that you
+[add in your profile page](../account-management.md#cloud-credentials)
+are also available in your project settings page.
+
+Note that DataChain Studio uses the credentials only to read plots/metrics files if
+they are not saved in Git. It does not access any other data in your remote
+storage. You do not need to provide credentials if your Git repository does not
+use any DVC data remotes.
+
+## Commits and columns
+
+You can specify which Git commits and columns should be imported from your Git
+repository to your project in DataChain Studio, and which ones should be excluded.
+
+### Start date/time
+
+If your Git history has old commits that are not relevant to your project
+anymore, you can set a cut-off date so that these outdated commits are not
+imported in your project. Your old commits will remain in your Git repository,
+but will no longer overcrowd your projects. This will let you focus on
+recent experiments, metrics and plots.
+
+
+### Columns
+
+You can specify which columns should be imported from your Git repository to
+your project. Unselected columns cannot be displayed in your project table.
+
+
+If you would like to hide imported columns from your project, you can do so in
+the project's [Display preferences].
+
+If your project is missing some required columns, then it is likely that
+they have not been imported or are hidden. Refer to the
+[troubleshooting guide](../troubleshooting.md) for more information.
+
+
+
+The **Columns** setting was earlier called **Tracking scope** or **Mandatory
+columns** and behaved slightly differently. DataChain Studio would always import up to
+200 columns. This meant that if you selected only 5 columns, DataChain Studio would
+still import another 195 columns, unless your repository had fewer columns than
+that. This behavior is now obsolete, and only selected columns are imported.
+ + + +[display preferences]: + ./explore-ml-experiments.md#columns diff --git a/docs/studio/user-guide/experiments/create-a-project.md b/docs/studio/user-guide/experiments/create-a-project.md new file mode 100644 index 000000000..a2e8d74c8 --- /dev/null +++ b/docs/studio/user-guide/experiments/create-a-project.md @@ -0,0 +1,95 @@ +# Create a Project + +In this section, you will learn how to: + +- [Connect to a Git repository and add a project](#connect-to-a-git-repository-and-add-a-project) +- [Create multiple projects from a single Git repository](#create-multiple-projects-from-a-single-git-repository) +- [Create projects shared across a team](#create-projects-shared-across-a-team) + +## Connect to a Git repository and add a project + +To add a new project, follow these steps. + +1. Sign in to DataChain Studio using your GitHub.com, GitLab.com, or Bitbucket.org + account, or with your email address. + +2. Click on `Add a Project`. All the organizations that you have access to will + be listed. + + + + If you do not see your desired organizations or Git repositories, make sure + that + [the connection to your Git server has been set up](../account-management.md#git-connections). + + To connect to your GitHub repositories, you must install the DataChain Studio + GitHub app. Refer to the section on + [GitHub app installation](../git-connections/github-app.md) + for more details. + + To connect to repositories on your self-hosted GitLab server, you must first + add a connection to this server and create a team. Refer to the section on + [self-hosted GitLab server support](../git-connections/custom-gitlab-server.md) + for more details. + + + +3. Open the organization whose repository you want to connect to. You can also + use the search bar to directly look for a repository. + + ![](https://static.iterative.ai/img/studio/select_repo_v3.png) + +4. Click on the Git repository that you want to connect to. + +5. In the `Project settings` page that opens up, you can edit the project name, + directory and visibility (public accessibility). These settings can also be + [edited after the project has been created](configure-a-project.md). + + + + If your DVC repo is in a sub-directory of a + [monorepo](https://en.wikipedia.org/wiki/Monorepo), then you should specify + the full path to the sub-directory in the `Project directory` setting. + + + + + + You can create multiple projects at once by providing up to 10 + comma-separated values. DataChain Studio will create one project for each + sub-directory in the list. + + + +6. Click on `Create Project`. + +You should now see that the project has been added in your dashboard. + +## Create multiple projects from a single Git repository + +You can create multiple projects in DataChain Studio from a single Git repository and +apply different settings to them. + +One use case for this is if you have a +**[monorepo](https://en.wikipedia.org/wiki/Monorepo)** with multiple ML +projects, each one in a different sub-directory. + +For each ML project in the monorepo, follow the +[above process](#connect-to-a-git-repository-and-add-a-project) to connect to +the Git repository. On the additional settings page +[specify the sub-directory](configure-a-project.md#project-directory) +(or up to 10 comma-separated values) in which the desired ML project resides. + +This way, you will have multiple DataChain Studio projects for your single Git +repository, with each project presenting values from a different sub-directory. 
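+
+For example, for a hypothetical monorepo with three ML projects, the
+`Project directory` value might look like this (the sub-directory names are
+illustrative):
+
+```text
+nlp/sentiment, nlp/ner, vision/classifier
+```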
+ +## Create projects shared across a team + +You can [create teams](../team-collaboration.md) with one or +more team members, also called collaborators. + +Each team will have its own projects dashboard, and the projects that you create +in the team's dashboard will be accessible to all members of the team. + +To add more than 2 collaborators in your team, +[upgrade to the **Enterprise** plan](../team-collaboration.md#get-enterprise). diff --git a/docs/studio/user-guide/experiments/explore-ml-experiments.md b/docs/studio/user-guide/experiments/explore-ml-experiments.md new file mode 100644 index 000000000..7dfa0d93a --- /dev/null +++ b/docs/studio/user-guide/experiments/explore-ml-experiments.md @@ -0,0 +1,251 @@ +# Explore ML Experiments + +The projects dashboard in DataChain Studio contains all your projects. Click on a +project name to open the project table, which contains: + +- [Git history and live experiments](#git-history-and-live-experiments) of the + project +- [Display preferences](#display-preferences) +- Buttons to + [visualize and compare experiments](#visualize-and-compare-experiments). +- Button to [export project data](#export-project-data). + +## Git history and live experiments + +Branches and commits in your Git repository are displayed along with the +corresponding models, metrics, hyperparameters, and DVC-tracked files. + +Experiments that you push using the `dvc exp push` command as well as any live +experiments that you send using [DVCLive] are displayed in a special experiment +row nested under the parent Git commit. More details of how live experiments are +displayed can be found in the +[live metrics and plots guide](live-metrics-and-plots.md). + +To manually check for updates in your repository, use the `Reload` button 🔄 +located above the project table. + +![](https://static.iterative.ai/img/studio/view_components_1.gif) + + + +One simple way to briefly describe your experiments is to use meaningful commit +messages. + + + +### Nested branches + +When a Git branch (e.g., `feature-branch-1`) is created from another branch +(e.g., `main`), two possibilities exist: + +- `feature-branch-1` is still active (contains commits that are not present in + `main`). This can happen if the user has pushed new commits to this branch and + - either hasn't merged it into `main` yet + - or has merged it into `main` but has continued to push more new commits to + it after the merger. + + Since the branch now contains new unique commits, the project table will + display both `main` and `feature-branch-1` separately. `feature-branch-1` will + show the new commits that are not part of `main` while all the merged commits + will be shown inside `main`. + +- `feature-branch-1` is inactive (does not contain any commits that are not + present in `main`). This can happen in two cases: + - if the user has not pushed any new commits to `feature-branch-1` + - if the user has merged `feature-branch-1` into `main` and has not pushed any + new commits to it after the merger. + + Since the branch does not contain any new unique commits, DataChain Studio considers + `feature-branch-1` as **"nested"** within `main` and does not display it as a + separate branch. This helps to keep the project table concise and reduce + clutter that can accumulate over time when inactive branches are not cleaned + from the Git repository. After all, those inactive branches usually carry no + new information for the purpose of managing experiments. 
If you would like to + display all commits of such an inactive branch, use the + [`Commits on branch = feature-branch-1` display filter](#filters). + +## Display preferences + +The table contains buttons to specify filters and other preferences regarding +which commits and columns to display. + +### Filters: + +Click on the `Filters` button to specify which rows you want to show in the +project table. + +![Project filters](https://static.iterative.ai/img/studio/project_filters.png) + +There are two types of filters: + +- **Quick filters** (highlighted in orange above): Use the quick filter buttons + to + - Show only DVC experiments + - Show only selected experiments + - Toggle hidden commits (include or exclude hidden commits in the project + table) + +- **Custom filters** (highlighted in purple above): Filter commits by one or + more of the following fields: + - Column values (values of metrics, hyperparameters, etc.) and their deltas + - Git related fields such as Git branch, commit message, tag and author + + + + The `Branch` filter displays only the specified branch and its commits. + + On the other hand, the `Commits on branch` filter will also display branches + [inside which the specified branch is nested](#nested-branches). + + + +
+
+  ### More details on nested branches
+
+  When a Git branch is nested inside another branch, the project table
+  [does not display the nested branch](#nested-branches). If
+  `feature-branch-1` is nested within `main`, `feature-branch-1` is NOT
+  displayed in the project table even if you apply the
+  `Branch = feature-branch-1` filter.
+
+  In this case, if you would like to filter for commits in `feature-branch-1`,
+  you should use the `Commits on branch = feature-branch-1` filter. This will
+  display the `main` branch with commits that were merged from
+  `feature-branch-1` into `main`. A hint is present to indicate that even
+  though the commits appear inside `main`, they are part of the nested branch
+  `feature-branch-1`.
+
+  ![Result of commits on branch filter](https://static.iterative.ai/img/studio/commits_on_branch_filter.png)
+
+
+ + - The `Custom filters` can be un-applied without deletion, allowing you to + create the filters once and toggle them on and off as needed. + + + +### Columns: + +Select the columns you want to display and hide the rest. +![Showing and hiding columns](https://static.iterative.ai/img/studio/show_hide_columns.gif) + +If your project is missing some required columns or includes columns that you do +not want, refer to the [troubleshooting guide](../troubleshooting.md) for more +information on managing project columns and settings. + +To reorder the columns, click and drag them in the table or from the Columns +dropdown. +![Showing and hiding columns](https://static.iterative.ai/img/studio/reorder_columns.gif) + +**Columns menu and goals:** Click on the column header to open a context menu +with actions such as sorting and filtering the project table by the column's +values. + +For metrics, you can also specify goals, which indicate whether an increase or a +decrease in the metric's value is desirable. Once a goal is set, the metric's +values for all rows are compared against the value in the baseline row. Values +that are better (higher or lower, depending on the goal) than that in the +baseline row are highlighted in green, with the best one shown with a green +border. Values that are worse than that in the baseline row are marked in pink. + +![Columns menu and goals](https://static.iterative.ai/img/studio/columns_menu_and_goals.gif) + + + +To change the baseline row in your project, use the 3-dot menu of the row which +you want to set as the new baseline. + +![Set baseline row](https://static.iterative.ai/img/studio/set-baseline-row.gif) + + + +### Hide commits: + +Commits can be hidden from the project table in the following ways: + +- **DataChain Studio auto-hides irrelevant commits:** DataChain Studio identifies commits + where metrics, files and hyperparameters did not change and hides them + automatically. +- **DataChain Studio auto-hides commits that contain `[skip studio]` in the commit + message:** This is particularly useful if your workflow creates multiple + commits per experiment and you would like to hide all those commits except the + final one. + + For example, suppose you create a Git commit with hyper-parameter changes for + running a new experiment, and your training CI job creates a new Git commit + with the experiment results (metrics and plots). You may want to hide the + first commit and only display the second commit, which has the new values for + the hyper-parameters as well as experiment results. For this, you can use the + string `[skip studio]` in the commit message of the first commit. + +- **Hide commits and branches manually:** This can be useful if there are + commits that do not add much value in your project. To hide a commit or + branch, click on the 3-dot menu next to the commit or branch name and click on + `Hide commit` or `Hide branch`. + + ![Hide commit](https://static.iterative.ai/img/studio/hide_commit.png) + +- **Unhide commits:** You can unhide commits as needed, so that you don't lose + any experimentation history. To display all hidden commits, click on the + `Show hidden commits` toggle (refer [filters](#filters)). This will display + all hidden commits, with a `hidden` (closed eye) indicator. + + ![Hidden commit indicator](https://static.iterative.ai/img/studio/hidden_commit_indicator.png) + + To unhide any commit, click on the 3-dot menu for that commit and click on + `Show commit`. 
+ + ![Show hidden commit](https://static.iterative.ai/img/studio/show_hidden_commit.png) + +### Delta mode + +For metrics, models and files columns with numeric values, you can display +either the absolute values or their delta (difference) from the baseline row. To +toggle between these two options, use the `Delta mode` button. + +![Delta mode](https://static.iterative.ai/img/studio/delta_mode.png) + +### Save changes: + +Whenever you make any changes to your project's columns, commits or filters, a +notification to save or discard your changes is displayed at the top of the +project table. Saved changes remain intact even after you log out of DataChain Studio +and log back in later. + +![Save or discard changes](https://static.iterative.ai/img/studio/save_discard_changes.png) + +## Visualize and compare experiments + +Use the following buttons to visualize and compare experiments: + +- **Plots:** Open the `Plots` pane and + [display plots](visualize-and-compare.md#display-plots-and-images) for the + selected commits. +- **Trends:** [Generate trend charts](visualize-and-compare.md#generate-trend-charts) + to see how the metrics have changed over time. +- **Compare:** [Compare experiments](visualize-and-compare.md#compare-experiments) + side by side. + +These buttons appear above your project table as shown below. +![example export to csv](https://static.iterative.ai/img/studio/project_action_buttons_big_screen.png) + +On smaller screens, the buttons might appear without text labels, as shown +below. + +![example export to csv](https://static.iterative.ai/img/studio/project_action_buttons_small_screen.png) + +## Export project data + +The button to export data from the project table to CSV is present next to the +[`Delta mode`](#delta-mode) button. + +![export to csv](https://static.iterative.ai/img/studio/project_export_to_csv.png) + +Below is an example of the downloaded CSV file. + +![example export to csv](https://static.iterative.ai/img/studio/project_export_to_csv_example.png) + +[DVCLive]: https://dvc.org/doc/dvclive diff --git a/docs/studio/user-guide/experiments/index.md b/docs/studio/user-guide/experiments/index.md new file mode 100644 index 000000000..3f992d540 --- /dev/null +++ b/docs/studio/user-guide/experiments/index.md @@ -0,0 +1,44 @@ +# Experiments (DVC Integration) + +DataChain Studio provides comprehensive ML experiment tracking through DVC integration, allowing you to track, compare, and manage your machine learning experiments with Git-based versioning. + +## Overview + +DataChain Studio integrates with DVC (Data Version Control) to provide a powerful web-based interface for managing your ML experiments. By connecting your Git repositories, you can visualize experiment results, compare different runs, and collaborate with your team—all without leaving your browser. 
+ +### Key Features + +**Project Management** + +- [Create projects](./create-a-project.md) by connecting to GitHub, GitLab, or Bitbucket repositories +- [Configure project settings](./configure-a-project.md) including project directory, data remotes, and column tracking +- Support for monorepos with multiple ML projects in different sub-directories +- [Share projects](./share-a-project.md) with your team or make them publicly accessible + +**Experiment Tracking** + +- [Explore ML experiments](./explore-ml-experiments.md) with a comprehensive project table showing Git history, metrics, hyperparameters, and DVC-tracked files +- [Generate live metrics and plots](./live-metrics-and-plots.md) for running experiments using DVCLive +- [Monitor running experiments](./run-experiments.md) in real-time with status updates and output logs +- Automatic tracking of experiments pushed with `dvc exp push` + +**Visualization and Comparison** +- [Visualize and compare experiments](./visualize-and-compare.md) using plots, images, and trend charts +- Display plots generated by DVCLive including AUC curves, loss functions, and confusion matrices +- Compare up to seven experiments side by side +- Generate trend charts to see how metrics changed over time +- Export project data to CSV for external analysis + +**Collaboration** +- Create teams with multiple collaborators +- Share projects within a team or publicly on the web +- Track experiments from different branches and commits +- Review and manage experiments through pull/merge requests + +### Getting Started + +1. **[Create a project](./create-a-project.md)** - Connect your Git repository to DataChain Studio +2. **[Configure your project](./configure-a-project.md)** - Set up data remotes, credentials, and tracking preferences +3. **[Run experiments](./run-experiments.md)** - Start tracking your ML experiments with DVC and DVCLive +4. **[Explore and visualize](./explore-ml-experiments.md)** - Analyze your results in the project table and plots +5. **[Share your work](./share-a-project.md)** - Collaborate with your team or share publicly diff --git a/docs/studio/user-guide/experiments/live-metrics-and-plots.md b/docs/studio/user-guide/experiments/live-metrics-and-plots.md new file mode 100644 index 000000000..2d7d72e03 --- /dev/null +++ b/docs/studio/user-guide/experiments/live-metrics-and-plots.md @@ -0,0 +1,157 @@ +# Generate live (real-time) metrics and plots for running experiments + +In your model training script, you can use [DVCLive] to send live updates for +metrics and plots without writing them to your Git repository, so that you can +track your experiments in real-time from DataChain Studio. + +This requires a 2-step process: + +1. [Set up an access token](#set-up-an-access-token) +2. [Send and view the updates](#send-and-view-live-metrics-and-plots) + +## Set up an access token + +DataChain Studio uses access tokens to authorize DVC and [DVCLive] to send live +experiment updates. The access token must be present in any request that sends +data to the DataChain Studio ingestion endpoint. Requests with missing or incorrect +access tokens are rejected with an appropriate HTTP error code and error +message. The access token is also used by DVC to notify DataChain Studio when you push +experiments using `dvc exp push`. + +Once you create your +[DataChain Studio client access token](../account-management.md#client-access-tokens) +with Experiment operations scope, pass it to your experiment. 
If you are running
+the experiment locally, you can use `dvc studio login` to interactively set the
+token:
+
+```cli
+$ dvc studio login
+```
+
+If you are running the experiment as part of a CI job, a secure way to provide
+the access token is to create a
+[GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets)
+containing the value of the token, and use the secret in your CI job using the
+`DVC_STUDIO_TOKEN` environment variable (see example below).
+
+```yaml
+steps:
+  - name: Train model
+    env:
+      DVC_STUDIO_TOKEN: ${{ secrets.DVC_STUDIO_TOKEN }}
+```
+
+
+
+If the code is running outside of your Git repository (for example, in
+Databricks or SageMaker), you lose the benefit of automatically
+tracking metrics and plots with Git, but you can send live updates to Studio if
+you set the `DVC_STUDIO_TOKEN` and `DVC_EXP_GIT_REMOTE` environment variables:
+
+```cli
+$ export DVC_STUDIO_TOKEN="<token>"
+$ export DVC_EXP_GIT_REMOTE="https://github.com/<user>/<repository>"
+```
+
+
+
+## Send and view live metrics and plots
+
+### Send live updates using DVCLive
+
+In the training job (which has been configured as detailed above), whenever you
+log your metrics or plots using [DVCLive], they will be automatically sent to
+DataChain Studio. Here is an example of how you can use [DVCLive] in your training
+code:
+
+```py
+from dvclive import Live
+
+with Live() as live:
+    for i in range(params["epochs"]):
+        ...
+        live.log_metric("accuracy", accuracy)
+        live.next_step()
+    ...
+```
+
+
+
+DVCLive signals the end of the experiment using `live.end()`. Using
+`with Live() as live:` or one of the integrations for ML Frameworks ensures that
+`live.end()` is automatically called when the experiment concludes successfully.
+
+
+
+### Live experiments in DataChain Studio
+
+DataChain Studio stores the live experiment data in its database. In the project
+table, the live experiments are displayed in experiment rows, which are nested
+under the parent Git commit. Updates to the live experiments are highlighted (in
+orange) in the project table and
+[compare pane](visualize-and-compare.md#compare-experiments) in real time.
+
+![](https://static.iterative.ai/img/studio/live_metrics.gif)
+
+The number of live experiments with recent updates is displayed in the `Live`
+icon, which can also be used to filter and show only live (running) experiments
+in the table.
+
+Live plots are displayed in the [plots pane](visualize-and-compare.md).
+You can see them getting populated as Studio receives new updates.
+
+![](https://static.iterative.ai/img/studio/live_plots.gif)
+
+
+
+If there are multiple projects connected to a single Git repository, then live
+experiments for this repository are displayed in all its connected projects.
+
+
+
+### Detached experiments
+
+A live experiment for which the parent Git commit is missing in the Git
+repository is displayed in a separate section called `Detached experiments` at
+the top of the project table.
+
+Some of the reasons for missing parent commits are:
+
+- the parent commit exists in your local clone of the repository and is not
+  pushed to the Git remote
+- the parent commit got removed by some mutative Git action such as rebase, hard
+  reset with a push, squash commit, etc.
+
+Once you push the missing parent commit to the Git remote, the live experiment
+will get nested under the parent commit as expected.
+
+You can also delete the detached experiments if they are no longer important.
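+
+To re-attach a detached experiment, push the branch that contains the missing
+parent commit. A minimal sketch, assuming the parent commit is on your local
+`main` branch:
+
+```bash
+# Push the missing parent commit to the Git remote; the detached
+# experiment row will then nest under it in the project table.
+git push origin main
+```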
+ +### Experiment status + +An experiment can have one of the following statuses: + +- **Running** - DataChain Studio expects to receive live metrics and plots for these + experiments. + + + + If the experiment stops due to any error, DataChain Studio will not be aware of this + and it will continue to wait for live updates. In this case, you can delete + the row from the project table. + + + +- **Completed** - DataChain Studio does not expect to receive any more updates for + these experiments. Once the experiment concludes, you can delete the row from + the project table. + + + + DataChain Studio does not automatically commit and push the final results of your + experiment to Git. You can push the experiment using appropriate DVC and Git + commands. + + + +[dvclive]: https://dvc.org/doc/dvclive diff --git a/docs/studio/user-guide/experiments/run-experiments.md b/docs/studio/user-guide/experiments/run-experiments.md new file mode 100644 index 000000000..8e328acc2 --- /dev/null +++ b/docs/studio/user-guide/experiments/run-experiments.md @@ -0,0 +1,28 @@ +# Run Experiments + +The functionality to train your model and run experiments with different +hyperparameters or datasets is currently disabled in DataChain Studio. However, DVC +experiments that are created from the terminal or the VS Code extension can be +monitored and managed from DataChain Studio. + +## Monitor a running experiment + +Once you submit an experiment, a new row is created in the experiments table +under the original Git commit. +[Live updates to metrics and plots](live-metrics-and-plots.md) +generated by [DVCLive] will show up in this row, and you can click on the +experiment name to view the status and output log of the running experiment +task. + +## Manage a completed experiment + +When the experiment completes, the files (including code, data, models, +parameters, metrics, and plots) are pushed back to your Git and DVC remotes. + +In DataChain Studio, you can create a branch and pull/merge request from the completed +experiment, so that you can share, review, merge, and reproduce the experiment. +In the pull/merge request, DataChain Studio automatically inserts a link to the +training report. So your teammates who are reviewing your PR can quickly and +easily compare your experiment with its baseline. + +[dvclive]: https://dvc.org/doc/dvclive diff --git a/docs/studio/user-guide/experiments/share-a-project.md b/docs/studio/user-guide/experiments/share-a-project.md new file mode 100644 index 000000000..987e56efb --- /dev/null +++ b/docs/studio/user-guide/experiments/share-a-project.md @@ -0,0 +1,44 @@ +# Share a Project + +You can [share a project within a team](#share-a-project-within-a-team). You can +also [make a project public](#make-a-project-public) to share it on the web. + +## Share a project within a team + +Each team that you [create in DataChain Studio](../team-collaboration.md) will have +its own projects dashboard. All the projects that you create in the team's +dashboard will be accessible to all members (collaborators) of the team. + +To add more than 2 collaborators in your team, +[upgrade to the **Enterprise** plan](../team-collaboration.md#get-enterprise). + +## Make a project public + +To share a project on the web (i.e., to make the project public), click on the +button labeled `Private` next to the name of the project. In the menu that pops +up, turn on `Share to Web`. 
+
+
+
+This will not change the settings of your connected Git repository; if the Git
+repository is private, it will remain private.
+
+
+
+![](https://static.iterative.ai/img/studio/project_share.png)
+
+A shared (public) project can be made private by turning off `Share to Web`.
+
+
+
+This will not change the settings of your connected Git repository; if the Git
+repository is public, it will remain public.
+
+
+
+Projects that are shared on the web can be opened by anyone, including people
+who are not logged in to DataChain Studio. These anonymous users have the `Visitor`
+role. Their access is limited to opening the project's experiment table,
+applying filters, and showing/hiding columns for themselves without saving any
+changes permanently. Refer to the [Roles](../team-collaboration.md#roles)
+section for details on the features available for different roles.
diff --git a/docs/studio/user-guide/experiments/visualize-and-compare.md b/docs/studio/user-guide/experiments/visualize-and-compare.md
new file mode 100644
index 000000000..c3d94e07d
--- /dev/null
+++ b/docs/studio/user-guide/experiments/visualize-and-compare.md
@@ -0,0 +1,53 @@
+# Visualize and Compare Experiments
+
+You can visualize and compare experiments using plots, images, metrics, etc. You
+can also
+[export the project table as CSV](explore-ml-experiments.md#export-project-data),
+to use the data with any external reporting or visualization tool.
+
+## Display plots and images
+
+You can visualize certain metrics of machine learning experiments as plots. Some
+plot examples are AUC curves, loss functions, and confusion matrices. The
+easiest way to start is with [DVCLive], which will automatically generate plot
+data and configure it to be visualized.
+
+DataChain Studio can plot two types of files in your repository:
+
+1. Data series files, which can be JSON, YAML, CSV or TSV. Data from these files
+   will populate your AUC curves, loss functions, confusion matrices and other
+   metric plots.
+2. Image files in JPEG, GIF, or PNG format. These images will be displayed as-is
+   in DataChain Studio.
+
+To open the `Plots` pane and display plots, select the plots toggle for one or
+more experiments and click on the `Plots` button.
+
+### Live plots
+
+You can [send live updates to your plots](live-metrics-and-plots.md) with
+[DVCLive]. The number of recent updates to the live metrics is displayed in the
+`Live` icon. Live plots are also shown and updated in real time in the plots
+pane along with all other plots.
+
+![Live plots](https://static.iterative.ai/img/studio/live-plots.gif)
+
+## Generate trend charts
+
+Click on the `Trends` button to generate a plot of how the metrics changed over
+the course of the different experiments. For each metric, the trend charts show
+how the metric changed from one commit to another. You can include one or more
+branches in the trend chart, and branches that are currently hidden in the
+project table are excluded.
+
+![](https://static.iterative.ai/img/studio/trends.png)
+
+## Compare experiments
+
+Select up to seven experiments and click on the `Compare` button. The metrics,
+parameters and files in the selected experiments will be displayed side by side
+for easy comparison.
+ +![](https://static.iterative.ai/img/studio/compare.png) + +[dvclive]: https://dvc.org/doc/dvclive diff --git a/docs/studio/user-guide/git-connections/custom-gitlab-server.md b/docs/studio/user-guide/git-connections/custom-gitlab-server.md new file mode 100644 index 000000000..cc092e520 --- /dev/null +++ b/docs/studio/user-guide/git-connections/custom-gitlab-server.md @@ -0,0 +1,207 @@ +# Custom GitLab Server + +Learn how to connect DataChain Studio to your self-hosted GitLab server for enterprise deployments. + +## Overview + +DataChain Studio supports integration with self-hosted GitLab servers, enabling: + +- **Enterprise Integration**: Connect to corporate GitLab instances +- **Custom Domains**: Work with internal GitLab servers +- **Advanced Security**: Leverage enterprise security features +- **On-premises Data**: Keep code and data within your network + +## Prerequisites + +Before connecting to a custom GitLab server: + +- **GitLab Server**: Running GitLab CE or EE (version 12.0+) +- **Network Access**: DataChain Studio must be able to reach your GitLab server +- **Admin Access**: GitLab administrator privileges for OAuth app creation +- **SSL Certificate**: Valid SSL certificate for HTTPS (recommended) + +## Configuration Steps + +### 1. Create OAuth Application in GitLab + +1. Log in to your GitLab server as an administrator +2. Navigate to Admin Area → Applications +3. Click "New Application" +4. Configure the application: + - **Name**: DataChain Studio + - **Redirect URI**: `https://studio.datachain.ai/api/auth/gitlab/callback` + - **Scopes**: Select required scopes: + - `read_user`: Read user information + - `read_repository`: Access repositories + - `read_api`: API access + +5. Click "Save application" +6. Copy the **Application ID** and **Secret** + +### 2. Configure DataChain Studio + +1. Log in to DataChain Studio +2. Go to Account Settings → Git Connections +3. Click "Add GitLab Server" +4. Enter server details: + - **Server URL**: Your GitLab server URL (e.g., `https://gitlab.company.com`) + - **Application ID**: From step 1 + - **Application Secret**: From step 1 + - **Server Name**: Friendly name for identification + +5. Click "Save Configuration" +6. Test the connection + +### 3. Team Configuration + +For team-based access: + +1. Create or select a team in DataChain Studio +2. Go to Team Settings → Git Connections +3. Add the custom GitLab server configuration +4. Configure team-specific access permissions + +## OAuth Scopes + +Required OAuth scopes for different features: + +### Basic Integration +- `read_user`: Read user profile information +- `read_repository`: Access repository contents + +### Advanced Features +- `read_api`: API access for webhooks and automation +- `write_repository`: Update commit statuses (optional) + +### Webhook Integration +- `read_api`: Required for webhook configuration +- `admin`: May be required for some webhook operations + +## Network Configuration + +### Firewall Rules + +Ensure proper network access: + +#### Outbound (DataChain Studio → GitLab) +- **HTTPS (443)**: For API and OAuth communication +- **SSH (22)**: For Git operations (if using SSH) + +#### Inbound (GitLab → DataChain Studio) +- **HTTPS (443)**: For webhook callbacks +- **Custom Port**: If using custom webhook endpoints + +### SSL/TLS Configuration + +For secure communication: + +1. **Valid Certificate**: Use a valid SSL certificate for your GitLab server +2. **Certificate Chain**: Ensure complete certificate chain is configured +3. 
**TLS Version**: Use TLS 1.2 or higher +4. **Cipher Suites**: Configure secure cipher suites + +## Webhook Configuration + +### Automatic Configuration + +DataChain Studio can automatically configure webhooks: + +1. Ensure OAuth app has sufficient permissions +2. Grant `read_api` scope +3. DataChain Studio will create webhooks automatically + +### Manual Configuration + +If automatic configuration fails: + +1. Go to your repository settings in GitLab +2. Navigate to Settings → Webhooks +3. Add webhook with: + - **URL**: `https://studio.datachain.ai/api/webhooks/gitlab` + - **Secret Token**: (optional but recommended) + - **Trigger Events**: Push events, Merge requests + - **SSL Verification**: Enable (recommended) + +## Troubleshooting + +### Connection Issues + +#### SSL Certificate Errors +- Verify certificate validity and chain +- Check certificate matches server hostname +- Ensure DataChain Studio trusts the certificate + +#### Network Connectivity +- Test connectivity from DataChain Studio to GitLab +- Check firewall rules and network policies +- Verify DNS resolution + +#### OAuth Errors +- Verify Application ID and Secret +- Check redirect URI configuration +- Ensure OAuth app is enabled + +### Repository Access Issues + +#### Permission Denied +- Verify user has access to repositories +- Check OAuth scopes are sufficient +- Ensure repositories are not archived or disabled + +#### Webhook Failures +- Check webhook URL is accessible +- Verify webhook secret configuration +- Test webhook manually from GitLab + +### Performance Issues + +#### Slow Repository Loading +- Check network latency between systems +- Verify GitLab server performance +- Consider repository size and complexity + +#### Timeout Errors +- Increase timeout settings if possible +- Check for network bottlenecks +- Monitor GitLab server resource usage + +## Security Considerations + +### OAuth Security +- **Secret Protection**: Secure storage of OAuth credentials +- **Scope Limitation**: Grant minimum required scopes +- **Regular Rotation**: Rotate OAuth secrets periodically + +### Network Security +- **VPN Access**: Consider VPN for additional security +- **IP Restrictions**: Limit access to specific IP ranges +- **Audit Logging**: Enable comprehensive audit logging + +### Data Protection +- **Data Classification**: Classify repository data appropriately +- **Access Controls**: Implement proper access controls +- **Compliance**: Ensure compliance with data protection regulations + +## Enterprise Features + +### Single Sign-On (SSO) +- Integrate with corporate identity providers +- Leverage existing authentication systems +- Centralized user management + +### Advanced Permissions +- Role-based access control +- Group-based permissions +- Project-level access controls + +### Audit and Compliance +- Comprehensive audit logging +- Compliance reporting +- Security monitoring + +## Next Steps + +- Configure [team collaboration](../team-collaboration.md) +- Set up [automated workflows](../../../guide/processing.md) +- Explore [webhook integration](../../webhooks.md) +- Learn about [GitHub integration](github-app.md) as an alternative diff --git a/docs/studio/user-guide/git-connections/github-app.md b/docs/studio/user-guide/git-connections/github-app.md new file mode 100644 index 000000000..137c34122 --- /dev/null +++ b/docs/studio/user-guide/git-connections/github-app.md @@ -0,0 +1,129 @@ +# GitHub App + +Learn how to install and configure the DataChain Studio GitHub App for seamless integration with your GitHub 
repositories. + +## Overview + +The DataChain Studio GitHub App provides secure, fine-grained access to your GitHub repositories, enabling: + +- **Repository Access**: Connect public and private repositories +- **Webhook Integration**: Automatic job triggering on code changes +- **Security**: OAuth-based authentication with granular permissions +- **Team Collaboration**: Shared access across team members + +## Installation + +### Install for Personal Account + +1. Navigate to [DataChain Studio GitHub App](https://github.com/apps/datachain-studio) +2. Click "Install" or "Configure" +3. Choose "Only select repositories" or "All repositories" +4. Select the repositories you want to connect +5. Review and approve permissions +6. Complete installation + +### Install for Organization + +1. Go to your organization's settings on GitHub +2. Navigate to "Third-party access" → "GitHub Apps" +3. Search for "DataChain Studio" or use the installation link +4. Configure repository access and permissions +5. Complete installation for the organization + +## Configuration + +### Repository Selection + +Choose which repositories to connect: + +- **All repositories**: Grants access to all current and future repositories +- **Selected repositories**: Choose specific repositories to connect +- **Recommended**: Start with selected repositories for better security + +### Permissions + +The DataChain Studio GitHub App requests these permissions: + +#### Repository Permissions +- **Contents**: Read repository files and commit history +- **Metadata**: Read repository information and settings +- **Pull requests**: Read PR information for job triggering +- **Commit statuses**: Update commit status based on job results + +#### Organization Permissions +- **Members**: Read organization membership (for team features) +- **Plan**: Read organization plan information + +## Usage + +### Creating Datasets + +Once installed, you can create datasets from GitHub repositories: + +1. Go to DataChain Studio +2. Click "Create Dataset" +3. Select your GitHub organization +4. Choose the repository +5. Configure dataset settings +6. Create the dataset + +### Webhook Integration + +The GitHub App automatically configures webhooks for: + +- **Push events**: Trigger jobs on new commits +- **Pull requests**: Run validation jobs on PRs +- **Releases**: Deploy or process data on releases + +## Troubleshooting + +### App Not Visible + +If you don't see the GitHub App or repositories: + +1. **Check Installation**: Verify the app is installed on the correct account/organization +2. **Repository Access**: Ensure the app has access to the desired repositories +3. **Permissions**: Verify you have admin access to the organization +4. **Cache**: Try logging out and back into DataChain Studio + +### Permission Issues + +If you encounter permission errors: + +1. **Review Permissions**: Check that all required permissions are granted +2. **Reinstall**: Try uninstalling and reinstalling the app +3. **Organization Approval**: Some organizations require admin approval for new apps + +### Webhook Issues + +If webhooks aren't triggering jobs: + +1. **Check Webhook Settings**: Verify webhooks are configured in repository settings +2. **Event Types**: Ensure the correct event types are enabled +3. **Repository Access**: Confirm the app has access to the repository +4. **Network**: Check that GitHub can reach DataChain Studio servers + +## Security + +### Best Practices + +1. 
**Least Privilege**: Only grant access to repositories that need DataChain integration +2. **Regular Reviews**: Periodically review and audit app permissions +3. **Organization Policies**: Follow your organization's security policies +4. **Access Monitoring**: Monitor app access logs and usage + +### Permissions Audit + +Regularly audit GitHub App permissions: + +1. Go to your GitHub settings +2. Navigate to "Applications" → "Authorized GitHub Apps" +3. Review DataChain Studio permissions +4. Update or revoke access as needed + +## Next Steps + +- Learn about [custom GitLab server](custom-gitlab-server.md) integration +- Explore [team collaboration](../team-collaboration.md) features +- Set up [automated workflows](../../../guide/processing.md) +- Configure [webhooks](../../webhooks.md) for notifications diff --git a/docs/studio/user-guide/git-connections/index.md b/docs/studio/user-guide/git-connections/index.md new file mode 100644 index 000000000..13237ac37 --- /dev/null +++ b/docs/studio/user-guide/git-connections/index.md @@ -0,0 +1,34 @@ +# Git Connections + +DataChain Studio integrates seamlessly with Git repositories to manage your data processing code, track changes, and enable collaboration. + +## Overview + +Git connections in DataChain Studio enable: + +- **Version Control**: Track changes to your data processing pipelines +- **Collaboration**: Share code and collaborate with team members +- **Automated Workflows**: Trigger jobs based on Git events + +## Supported Git Providers + +DataChain Studio supports integration with major Git hosting providers: + +### GitHub +- **GitHub.com**: Public and private repositories +- **GitHub Enterprise**: Self-hosted GitHub instances +- **GitHub App**: Dedicated app integration for enhanced security + +[Learn more about GitHub integration →](github-app.md) + +### GitLab +- **GitLab.com**: SaaS GitLab service +- **Self-hosted GitLab**: Custom GitLab installations +- **OAuth Integration**: Secure authentication flow + +[Learn more about GitLab integration →](custom-gitlab-server.md) + +### Bitbucket +- **Bitbucket Cloud**: Atlassian's cloud service +- **OAuth Integration**: Secure authentication and access +- **Repository Access**: Public and private repository support diff --git a/docs/studio/user-guide/index.md b/docs/studio/user-guide/index.md new file mode 100644 index 000000000..ef3d31541 --- /dev/null +++ b/docs/studio/user-guide/index.md @@ -0,0 +1,29 @@ +# User Guide + +This section covers how to use DataChain Studio for managing your data processing workflows, collaborating with team members, and integrating with your development workflow. 
+ +## Getting Started + +- **[Account Management](account-management.md)** - Manage your Studio account and settings +- **[Authentication](authentication/single-sign-on.md)** - Set up SSO and authentication methods + +## Core Features + +- **[Jobs](jobs/index.md)** - Run and monitor data processing jobs +- **[Git Connections](git-connections/index.md)** - Connect your Git repositories + +## Collaboration + +- **[Team Collaboration](team-collaboration.md)** - Work with your team in Studio + +## Support + +- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions + +## Next Steps + +Once you're familiar with the basics, explore: + +- [API Reference](../api/index.md) for programmatic access +- [Webhooks](../webhooks.md) for event notifications +- [Self-hosting](../self-hosting/index.md) for enterprise deployments diff --git a/docs/studio/user-guide/jobs/create-and-run.md b/docs/studio/user-guide/jobs/create-and-run.md new file mode 100644 index 000000000..9025f3c3d --- /dev/null +++ b/docs/studio/user-guide/jobs/create-and-run.md @@ -0,0 +1,233 @@ +# Create and Run Jobs + +Write and execute DataChain scripts directly in Studio to process data from your connected storage. + +## Prerequisites + +- Connected storage (S3, GCS, Azure Blob Storage, or other supported storage) +- Storage credentials configured in account settings +- Access to DataChain Studio workspace + +## Writing Your Script + +### 1. Access the Editor + +In DataChain Studio, open the code editor through `Data` tab in the topbar to write your DataChain script. You'll see connected storages listed in the left sidebar. + +### 2. Write DataChain Code + +Write your data processing script using DataChain operations: + +```python +import datachain as dc + +# Process data from connected storage +dc.read_storage("gs://datachain-demo").save("datachain-demo") +``` + +### Basic Operations Example + +```python +from datachain import DataChain + +# Read from storage and process +dc = ( + DataChain.from_storage("s3://my-bucket/images/") + .filter(lambda file: file.size > 1000) + .map(lambda file: {"path": file.path, "size": file.size}) + .save("processed_images") +) + +print(f"Processed {len(dc)} files") +``` + +### Working with Multiple Storages + +```python +from datachain import DataChain + +# Access different connected storages +source_data = DataChain.from_storage("s3://source-bucket/data/") +reference_data = DataChain.from_storage("gs://reference-bucket/metadata/") + +# Process and combine +result = source_data.join(reference_data, on="id").save("combined_data") +``` + +## Configuring Run Settings + +Click the run settings button to configure your job execution parameters. 
+ +### Python Version + +Select the Python version for your job environment: +- Python 3.12 (recommended) +- Python 3.11 +- Python 3.10 + +### Workers + +Set the number of parallel workers for data processing: +- **1 worker**: Sequential processing (default) +- **2-10 workers**: Parallel processing for larger datasets +- More workers increase throughput but consume more resources + +### Priority + +Set job queue priority: +- **1-10**: Higher numbers = higher priority in the job queue +- **5**: Default priority +- Use higher priority for time-sensitive jobs + +### Requirements.txt + +Specify additional Python packages needed for your job: + +```text +pandas==2.0.0 +pillow>=9.0.0 +requests +torch==2.0.1 +``` + +### Environment Variables + +Set environment variables for your script: + +```text +AWS_REGION=us-east-1 +BATCH_SIZE=1000 +LOG_LEVEL=INFO +MODEL_VERSION=v2.1 +``` + +### Override Credentials + +By default, jobs use team credentials for storage access. You can override with: +- **Using team defaults**: Use configured team credentials +- **Custom credentials**: Select specific credentials for this job + +### Attached Files + +Upload additional files needed by your job (currently disabled in standard plan). + +## Running Your Job + +### Submit for Execution + +1. Write your DataChain script in the editor +2. Click the run settings button (gear icon) +3. Configure Python version, workers, and priority +4. Add any required packages or environment variables +5. Click `Apply settings` +6. Click the run button to execute + +Your job will be queued and executed with the specified configuration. + +### Execution Process + +1. **QUEUED**: Job enters the execution queue based on priority +2. **INIT**: Python environment is set up with specified version and requirements +3. **RUNNING**: Your DataChain script executes with configured workers +4. 
**COMPLETE**: Results are saved and available in the data table + +## Viewing Results + +After job completion: + +### Data Table + +Results appear in the data table below your script: +- View processed files and their properties +- Sort and filter results +- Examine file paths, sizes, and metadata +- Download data if needed + +### Saved Datasets + +Access saved datasets by name: +```python +# Later access to saved results +saved_dc = DataChain.from_dataset("processed_images") +``` + +## Common Patterns + +### Processing Images + +```python +from datachain import DataChain + +dc = ( + DataChain.from_storage("s3://images/") + .filter(lambda file: file.path.endswith(('.jpg', '.png'))) + .map(lambda file: { + "path": file.path, + "size": file.size, + "extension": file.path.split('.')[-1] + }) + .save("image_catalog") +) +``` + +### Data Quality Checks + +```python +from datachain import DataChain + +dc = ( + DataChain.from_storage("gs://data-lake/") + .filter(lambda file: file.size > 0) # Non-empty files + .filter(lambda file: file.modified_at > "2024-01-01") # Recent files + .save("validated_data") +) +``` + +### Batch Processing + +```python +from datachain import DataChain + +# Process data in batches +for batch in DataChain.from_storage("s3://large-dataset/").batch(1000): + processed = batch.map(transform_function) + print(f"Processed batch of {len(processed)} files") +``` + +## Troubleshooting + +### Common Issues + +#### Package Import Errors +- Add missing packages to `requirements.txt` +- Verify package names and versions are correct +- Check for compatible package versions + +#### Storage Access Errors +- Verify storage credentials are configured +- Check storage paths are correct and accessible +- Ensure team has necessary permissions + +#### Memory Errors +- Reduce batch size in your processing +- Increase number of workers to distribute load +- Process data in smaller chunks + +#### Timeout Errors +- Optimize your processing code +- Reduce amount of data being processed +- Consider splitting into multiple jobs + +### Debugging Tips + +1. **Start Simple**: Test with small data samples first +2. **Check Logs**: Review job logs in the monitor tab +3. **Verify Storage**: Ensure connected storage is accessible +4. **Test Locally**: Test scripts locally when possible +5. **Use Print Statements**: Add logging to track progress + +## Next Steps + +- Learn how to [monitor running jobs](monitor-jobs.md) +- Set up [team collaboration](../team-collaboration.md) +- Explore [DataChain operations](../../../references/datachain.md) diff --git a/docs/studio/user-guide/jobs/index.md b/docs/studio/user-guide/jobs/index.md new file mode 100644 index 000000000..d16a428e7 --- /dev/null +++ b/docs/studio/user-guide/jobs/index.md @@ -0,0 +1,64 @@ +# Jobs + +DataChain Studio allows you to run DataChain scripts directly in the cloud, processing data from connected storage. Write your code in the Studio editor and execute it with configurable compute resources. 
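+
+For example, a complete job can be a single short script. The sketch below is a
+minimal example using the public demo bucket that appears elsewhere in these
+docs; any connected storage works the same way:
+
+```python
+import datachain as dc
+
+# Index every object in the connected bucket and save the result
+# as a dataset named "datachain-demo".
+chain = dc.read_storage("gs://datachain-demo").save("datachain-demo")
+print(f"Saved {len(chain)} rows")
+```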
+ +## Key Features + +- **[Create and Run](create-and-run.md)** - Write and execute DataChain scripts in Studio +- **[Monitor Jobs](monitor-jobs.md)** - Track job progress, view logs, and analyze results + +## How Jobs Work + +Jobs in DataChain Studio let you execute data processing workflows: + +### Direct Script Execution +- Write DataChain code directly in the Studio editor +- Execute scripts against connected storage (S3, GCS, Azure) +- Results saved automatically + +### Configurable Compute +- Select Python version for your environment +- Configure number of workers for parallel processing +- Set job priority for queue management +- Specify custom requirements and environment variables + +## Job Lifecycle + +### 1. Write Script +Write your DataChain code in the Studio editor using connected storage sources. + +### 2. Configure Settings +Set Python version, workers, priority, and any required dependencies or environment variables. + +### 3. Execute +Submit the job to run on Studio's compute infrastructure with your specified configuration. + +### 4. Monitor +View real-time logs, progress, and results as your job executes. + +### 5. Review Results +Access processed data through the Studio interface, with datasets saved automatically. + +## Job States + +- **QUEUED**: Job is waiting in the execution queue +- **INIT**: Job environment is being initialized +- **RUNNING**: Job is actively processing data +- **COMPLETE**: Job finished successfully +- **FAILED**: Job encountered an error +- **CANCELED**: Job was stopped by user + +## Getting Started + +1. Connect your storage sources (S3, GCS, Azure) +2. Write DataChain code in the Studio editor +3. Configure job settings (Python version, workers, priority) +4. Run your job and monitor execution +5. View results in the data table + +## Next Steps + +- Learn how to [create and run jobs](create-and-run.md) +- Explore [job monitoring capabilities](monitor-jobs.md) +- Set up [webhooks](../../webhooks.md) for job notifications +- Configure [team collaboration](../team-collaboration.md) for shared access diff --git a/docs/studio/user-guide/jobs/monitor-jobs.md b/docs/studio/user-guide/jobs/monitor-jobs.md new file mode 100644 index 000000000..c89e2b3aa --- /dev/null +++ b/docs/studio/user-guide/jobs/monitor-jobs.md @@ -0,0 +1,252 @@ +# Monitor Jobs + +Track your DataChain job execution in real-time with Studio's monitoring interface. 
+ +## Job Status Bar + +At the top of the Studio interface, you'll see the current job status: + +### Status Display +- **Workers**: Shows active/total workers (e.g., "2 / 10 workers busy") +- **Tasks**: Displays running tasks count (e.g., "2 tasks") +- **Execution Time**: Shows how long the job has been running + +### Job States +- 🟡 **QUEUED**: Waiting in the execution queue +- 🔵 **INIT**: Setting up environment and dependencies +- 🟢 **RUNNING**: Actively processing data +- ✅ **COMPLETE**: Successfully finished +- ❌ **FAILED**: Encountered an error +- ⚫ **CANCELED**: Stopped by user + +## Real-time Logs + +### Logs Tab + +Click the "Logs" tab to view real-time execution output: + +``` +Running job 7897833d-080c-464f-978b-59316886099a in cluster 'default' +Using cached virtualenv + +Listing gs://datachain-demo: 269981 objects [00:16, 16568.50 objects/s] +``` + +### Log Information +- **Job ID and Cluster**: Shows which cluster is running your job +- **Environment Status**: Indicates if using cached virtualenv or installing fresh +- **Timestamped Entries**: Real-time progress updates +- **Error Messages**: Stack traces for debugging failures +- **Data Statistics**: Files processed and rows handled +- **Performance Metrics**: Execution timing information + +## Dependencies Tab + +View data lineage and dataset dependencies: + +### Dataset Lineage +The Dependencies tab shows a visual graph of data flow: + +- **Output Dataset**: Your saved dataset (e.g., `@amritghimire.default.datachain-demo@v1.0.0`) + - Shows version number + - Displays creator and timestamp + - Indicates verification status + +- **Source Storage**: Connected storage sources (e.g., `gs://datachain-demo/`) + - Shows storage path + - Displays who added the storage + - Links to original data source + +- **Data Flow**: Visual arrows showing how data flows from source to output + +This helps you: +- Understand data lineage and provenance +- Track which storages were used +- Verify dataset versions +- Debug data pipeline issues + +## Diagnostics Tab + +View detailed job execution timeline and diagnostics: + +### Job Summary + +At the top, see the overall job status: + +``` +✓ Job complete: 00:07:30 +``` + +- **Execution Time**: Total duration (hours:minutes:seconds) +- **Status Icon**: Checkmark for success, X for failure + +### Execution Details + +Key job information: + +- **Started**: Start timestamp with timezone (e.g., `2025-10-18 07:48:27 GMT+5:45`) +- **Finished**: Completion timestamp +- **Compute Cluster**: Which cluster ran the job (e.g., `default`) +- **Job ID**: Unique identifier for the job (e.g., `7897833d-080c-464f-978b-59316886099a`) + +### Execution Timeline + +Detailed breakdown of each execution phase: + +``` +✓ Waiting in queue 2s +✓ Starting a worker 15s +✓ Initializing job 3s +✓ Installing dependencies 0s +✓ Waking up data warehouse 29s +✓ Running query 2m 35s +``` + +Each phase shows: +- **Checkmark**: Indicates successful completion +- **Phase Name**: What the system was doing +- **Duration**: Time spent in that phase + +### Understanding Phase Durations + +- **Waiting in queue**: Time before resources became available +- **Starting a worker**: Worker initialization and allocation +- **Initializing job**: Setting up job environment +- **Installing dependencies**: Installing Python packages from requirements.txt +- **Waking up data warehouse**: Activating data processing infrastructure +- **Running query**: Actual data processing time + +This breakdown helps identify bottlenecks and optimize job performance. 
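+
+If you want to see at a glance which phase dominates a run, you can tabulate
+the durations yourself. Below is a small sketch using the example timeline
+above (the numbers are illustrative):
+
+```python
+# Phase durations from the example timeline, in seconds.
+phases = {
+    "Waiting in queue": 2,
+    "Starting a worker": 15,
+    "Initializing job": 3,
+    "Installing dependencies": 0,
+    "Waking up data warehouse": 29,
+    "Running query": 2 * 60 + 35,
+}
+
+total = sum(phases.values())
+for name, seconds in sorted(phases.items(), key=lambda kv: -kv[1]):
+    print(f"{name:<26} {seconds:>4}s  {seconds / total:6.1%}")
+```
+
+Here `Running query` accounts for roughly three quarters of the summed phase
+time, so optimizing the DataChain code would pay off far more than trimming
+dependency installation.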
+ +## Data Results + +### Data Tab + +View processed results in the data table: + +- **Row Count**: Shows processed rows (e.g., "20 of 270,345 rows") +- **Columns**: File paths, sizes, and metadata +- **Sorting**: Click column headers to sort +- **Filtering**: Use filters to find specific data +- **Pagination**: Navigate through large result sets + +### Files Tab + +Browse processed files: + +- File paths and names +- File sizes and types +- Metadata and attributes +- Quick preview capabilities + +## Job Controls + +### Stop Job + +Click the stop button to cancel a running job: +- Job will transition to CANCELING state +- Current operations complete gracefully +- Resources are cleaned up + +## Monitoring Job Progress + +### Progress Indicators + +Track your job execution: + +- **Rows Processed**: Current progress through dataset +- **Processing Rate**: Files or records per second +- **Time Elapsed**: How long the job has been running +- **Estimated Completion**: Projected finish time (when available) + +### Resource Usage + +Monitor resource consumption: + +- **Workers Active**: Number of parallel workers processing data +- **Memory Usage**: RAM consumption during processing +- **Storage I/O**: Data read/write operations + +## Troubleshooting + +### Common Issues + +#### Job Stuck in QUEUED +- Check worker availability in status bar +- Verify team hasn't exceeded resource quotas +- Review job priority settings + +#### INIT Failures +- Check requirements.txt for invalid packages +- Verify package versions are compatible +- Review error messages in Logs tab + +#### RUNNING Failures +- Examine stack trace in Logs tab +- Verify storage credentials are valid +- Check storage paths are accessible +- Review error messages for specific issues + +#### Storage Access Errors +- Verify credentials in account settings +- Check storage bucket permissions +- Ensure storage path exists +- Test storage connection separately + +### Debugging Workflow + +1. **Check Diagnostics Tab**: Review job completion status and execution timeline +2. **Identify Bottleneck**: Look for phases with unusually long durations: + - Long "Starting a worker" time → Check cluster availability + - Long "Installing dependencies" → Review requirements.txt + - Long "Waking up data warehouse" → Contact support + - Long "Running query" → Optimize DataChain code +3. **Open Logs Tab**: Look for error messages and stack traces +4. **Check Dependencies Tab**: Verify data sources are connected correctly +5. **Test with Subset**: Try with smaller data sample +6. **Contact Support**: Provide Job ID from Diagnostics tab + +## Performance Optimization + +### Analyzing Execution Timeline + +Use the Diagnostics tab to identify optimization opportunities: + +#### Quick Queue Times (< 2m) +✓ Good - Your jobs are getting resources quickly + +#### Long Worker Start (> 5m) +Possible causes: +- High cluster demand +- Cold start of compute resources + +#### Slow Dependency Installation (> 3m) +Optimization tips: +- Pin package versions in requirements.txt +- Minimize number of dependencies + +#### Extended Data Warehouse Wake (> 2m) +This is infrastructure initialization. 
If consistently slow:
+- Keep warehouse warm with regular jobs
+- Contact support for dedicated warehouse
+
+#### Long Running Query Time
+Optimize your DataChain code:
+- Filter data early to reduce volume
+- Use efficient DataChain operations
+- Increase worker count for large datasets
+- Batch operations appropriately
+
+### Monitoring Best Practices
+
+- **Compare Job Runs**: Check Diagnostics across multiple runs to spot trends
+- **Track Phase Durations**: Note which phases take longest
+- **Use Job ID**: Reference the Job ID when reporting issues
+- **Review Logs**: Check for warnings about performance
+
+## Next Steps
+
+- Set up [webhook notifications](../../webhooks.md) for job status updates
+- Configure [team collaboration](../team-collaboration.md) for shared job access
+- Explore [DataChain operations](../../../references/datachain.md) for optimization
+- Review [account settings](../account-management.md) for credentials
diff --git a/docs/studio/user-guide/model-registry/add-a-model.md b/docs/studio/user-guide/model-registry/add-a-model.md
new file mode 100644
index 000000000..dce63959f
--- /dev/null
+++ b/docs/studio/user-guide/model-registry/add-a-model.md
@@ -0,0 +1,67 @@
+# Add a model
+
+You can add models from any ML project to the model registry. To add a model,
+DataChain Studio creates an annotation for it in a `dvc.yaml` file in your Git
+repository. You can add a model in any of the following ways:
+
+1. Log your model during the training process with DVCLive, by calling its
+   `live.log_artifact(path, type="model")` method.
+2. Edit `dvc.yaml` directly and add your model to the `artifacts` section.
+3. Use the DataChain Studio interface (watch this tutorial video or read on below).
+
+https://www.youtube.com/watch?v=szzv4ZXmYAs
+
+1. Click on `Add a model`.
+
+2. Select a [connected project] to which you want to add the model.
+
+   If your model file or the `.dvc` file for your model already exists in a Git
+   repo, select that repo. If your model file resides in remote storage (S3,
+   GCS, etc.), select the Git repo where you want to add the model.
+
+3. Enter the path to the `dvc.yaml` file the model will be added to. Adding your
+   model to a non-root `dvc.yaml` can be helpful if you develop this ML model in
+   a specific subfolder or if this repo is a monorepo.
+
+4. Enter the path of the model file as follows:
+   - If the model file is in the Git repository or is in the cloud but is
+     tracked by DVC, enter the relative path of the model (from the repository
+     root).
+   - Otherwise, enter the URL to the model file in the cloud. DataChain Studio will
+     ask you for the repository path where the DVC reference to the model should
+     be saved.
+
+5. Provide labels for your model. For example, if your model analyzes review
+   sentiment using natural language processing, its labels may include `nlp` or
+   `sentiment_analysis`.
+
+6. Optionally, add a brief description for your model.
+
+7. Enter a Git commit message. Then, select the branch to commit to. You can
+   commit to either the base branch or a new branch. DataChain Studio will commit the
+   changes to the selected branch. If you commit to a new branch, DataChain Studio
+   will also create a Git pull request from the new branch to the base branch.
+
+8. Now, click on `Commit changes`.
+
+At this point, the new model appears in the models dashboard.
+
+In your Git repository, you will find that an entry for the new model has been
+created in the `dvc.yaml` that was specified.
+If you had committed to a new branch, a new pull request (or merge request in
+the case of GitLab) will also have been created to merge the new branch into
+the base branch.
+
+If you had added a model from cloud storage, the following will also happen
+before the commit is created:
+
+- If the repository does not contain DVC, DataChain Studio will run `dvc init`. This
+  is needed to version the model in the Git repository.
+- If the specified directory does not exist yet, it will be created.
+- DataChain Studio will import the model to the repository by executing
+  `dvc import-url / --no-exec`.
+
+[connected project]: ../experiments/create-a-project.md
diff --git a/docs/studio/user-guide/model-registry/assign-stage.md b/docs/studio/user-guide/model-registry/assign-stage.md
new file mode 100644
index 000000000..33e1a3a6f
--- /dev/null
+++ b/docs/studio/user-guide/model-registry/assign-stage.md
@@ -0,0 +1,66 @@
+# Assign stage to model version
+
+To manage the model lifecycle, you can assign stages (such as `dev`, `staging`,
+`prod`, etc.) to specific model versions.
+
+To assign a stage to a model version, DataChain Studio uses GTO to create an
+annotated [Git tag][git tag] with the specified stage and version number.
+
+You can [write CI/CD actions][CI/CD] that actually deploy the models to the
+different deployment environments upon the creation of a new Git tag for stage
+assignment. For that, you can leverage any ML model deployment tool, such as
+[MLEM].
+
+You can assign a stage in any of the following ways:
+
+1. Use the GTO CLI or API. An example would be
+   `gto assign pool-segmentation --version v0.0.1 --stage dev`,
+   assuming the `dvc.yaml` with the model annotation is located in the root of
+   the repo. If not, you should prepend the parent directory to the model's name
+   like this:
+   `gto assign cv:pool-segmentation --version v0.0.1 --stage dev`
+   (here, `cv` is the parent directory).
+2. To assign stages using DataChain Studio, watch this tutorial video or read on
+   below.
+
+https://www.youtube.com/watch?v=Vrp1O5lkWBo
+
+1. On the models dashboard, open the 3-dot menu for the model whose version you
+   want to assign the stage to. Then, click on `Assign stage`. This action can
+   also be initiated from the model details page or from the related project's
+   experiment table - look for the `Assign stage` button or icon.
+
+2. Select the version to which you want to assign the stage.
+3. Enter the stage name (e.g., `dev`, `shadow`, `prod`).
+
+   You can define the list of stages in the `.gto` config file, which is a
+   YAML-formatted file that allows you to specify artifact types and stages.
+   If you have defined the stages in this file, then you can assign only these
+   stages. But if you have not defined the list of stages, you can enter
+   any string as the stage name. Note the following:
+   - GTO config files with stage names are specific to a Git repository. So,
+     they apply only to models within one repository.
+   - Currently, you cannot make entries to the GTO config file from DataChain Studio.
+   - If you define stages in the config file at any point, any stage assignments
+     after that point can use only the names defined in the config file.
+
+4. Optionally, provide a Git tag message.
+5. Click on `Assign stage`.
+
+Once the action is successful, the stage assignment will show up in the `Stages`
+column of the models dashboard.
+
+If you open the model details page, the stage assignment will be visible in the
+model `History` section as well as in the `Stages` section.
+
+If you go to your Git repository, you will see that a new Git tag referencing
+the selected version and stage has been created, indicating the stage
+assignment.
+
+[git tag]: https://git-scm.com/docs/git-tag
+[CI/CD]: use-models.md#deploying-and-publishing-models-in-cicd
+[MLEM]: https://mlem.ai/
diff --git a/docs/studio/user-guide/model-registry/register-version.md b/docs/studio/user-guide/model-registry/register-version.md
new file mode 100644
index 000000000..2f3e42c67
--- /dev/null
+++ b/docs/studio/user-guide/model-registry/register-version.md
@@ -0,0 +1,63 @@
+# Register a model version
+
+New model versions can signify an important, published, or released iteration.
+To register a version, you first need to
+[add a model to the model registry](add-a-model.md).
+
+To register a new version of a model, DataChain Studio uses GTO to create an
+annotated [Git tag][git tag] with the specified version number.
+
+You can [write CI/CD actions][CI/CD] that actually build and publish models
+(for example, build a Docker image with the model and publish it to a Docker
+registry) upon the creation of a new Git tag for version registration. For that,
+you can leverage any ML model deployment tool, such as [MLEM].
+
+You can register a version in any of the following ways:
+
+1. Use the GTO CLI or API. An example would be
+   `gto register pool-segmentation --version v0.0.1`, assuming
+   the `dvc.yaml` with the model annotation is located in the root of the repo.
+   If not, you should prepend the parent directory to the model's name like this:
+   `gto register cv:pool-segmentation --version v0.0.1` (here, `cv`
+   is the parent directory).
+2. To register versions using DataChain Studio, watch this tutorial video or read on
+   below.
+
+https://www.youtube.com/watch?v=eA70puzOp1o
+
+1. On the models dashboard, open the 3-dot menu for the model whose version you
+   want to register. Then, click on `Register new version`. The registration
+   action can also be initiated from the model details page or from the related
+   project's experiment table - look for the `Register version` button or icon.
+
+2. Select the Git commit which corresponds to the new version of your model. If
+   the desired commit does not appear in the commit picker, type in the
+   40-character SHA-1 hash of the commit.
+3. Enter a version name. Version names must start with the letter `v` and should
+   follow the [SemVer] format after the letter `v`. Below are some examples of
+   valid and invalid version names:
+   - Valid: v0.0.1, v1.0.0, v12.5.7
+   - Invalid: 0.0.1 (missing the `v` at the beginning), v1.0 (missing the patch
+     segment of the [SemVer] format), and v1.0.new (using the invalid value
+     `new` as the patch number).
+
+4. Optionally, provide a Git tag message.
+5. Click on `Register version`.
+
+Once the action is successful, the newly registered version will show up in the
+`Latest version` column of the models dashboard. Note that this will happen only
+if the newly registered version is the greatest semantic version for your model.
+For example, if your model already has v3.0.0 registered and you register a
+smaller version (e.g., v2.0.0), the new version will not appear in the
+`Latest version` column.
+
+If you open the model details page, the newly registered version will be
+available in the model `History` section as well as in the versions dropdown.
+
+If you go to your Git repository, you will see that a new Git tag referencing
+the selected commit has been created, representing the new version.
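+
+If you script version registration, you can pre-validate names against the rule
+above. The snippet below is a minimal check for the plain `vMAJOR.MINOR.PATCH`
+form only; the full SemVer grammar also allows pre-release and build metadata:
+
+```python
+import re
+
+# Matches the basic v<major>.<minor>.<patch> form described above.
+VERSION_RE = re.compile(r"^v\d+\.\d+\.\d+$")
+
+for name in ["v0.0.1", "v1.0.0", "v12.5.7", "0.0.1", "v1.0", "v1.0.new"]:
+    verdict = "valid" if VERSION_RE.match(name) else "invalid"
+    print(f"{name}: {verdict}")
+```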
+ +[git tag]: https://git-scm.com/docs/git-tag +[semver]: https://semver.org/ +[CI/CD]: use-models.md#deploying-and-publishing-models-in-cicd +[MLEM]: https://mlem.ai/ diff --git a/docs/studio/user-guide/model-registry/remove-a-model-or-its-details.md b/docs/studio/user-guide/model-registry/remove-a-model-or-its-details.md new file mode 100644 index 000000000..09679ad10 --- /dev/null +++ b/docs/studio/user-guide/model-registry/remove-a-model-or-its-details.md @@ -0,0 +1,39 @@ +# Remove a model, version, or stage assignment + +When you remove (deprecate) a model, deregister a version or unassign a stage, +DataChain Studio creates Git tags that indicate the action and saves the tags in +your Git repository. + +These actions can be found in the 3-dot menu next to the model name in the +models dashboard (see the section highlighted in purple below). + +![](https://static.iterative.ai/img/studio/model-registry-undo-actions.png) + +These actions are also available in the model details page: + +- `Deprecate model` action is present in the 3-dot menu next to the model name. + +

+- `Deregister version` button is present next to the version dropdown.
+
+- Click on the relevant stage assignment pill in the `Stages` section to reveal
+  the `Unassign stage` menu item.
+
+To remove all of a project's models from DataChain Studio without deprecating them, you can simply delete the project.
+
+You can also remove a model version or stage assignment by removing the corresponding Git tag directly from your Git repository. However, this destroys the audit trail of the original version registration or stage assignment action.
+
diff --git a/docs/studio/user-guide/model-registry/use-models.md b/docs/studio/user-guide/model-registry/use-models.md
new file mode 100644
index 000000000..dc38dd6f6
--- /dev/null
+++ b/docs/studio/user-guide/model-registry/use-models.md
@@ -0,0 +1,63 @@
+# Use models
+
+Whether you need to download your models to use them, or you're looking to set
+up some automation in CI/CD to deploy them, DataChain Studio provides these
+capabilities.
+
+## Download models
+
+If your model file is DVC-tracked, you can download any of its registered
+versions using the DataChain Studio REST API, `dvc artifacts get`, or the DVC
+Python API.
+
+Prerequisites:
+
+- Model stored with DVC, using an S3, Azure, HTTP, or HTTPS remote.
+- The DataChain Studio project you'd like to download your model from needs
+  access to your remote storage credentials.
+- Access to your [DataChain Studio client access token] with the Model registry
+  operations scope.
+
+Without these prerequisites, you can still download a model artifact with DVC.
+However, it can be easier to use the DataChain Studio API since you only need to have
+the Studio access token. You do not need direct access to your remote storage or
+Git repository, and you do not need to install DVC.
+
+[DataChain Studio client access token]: ../account-management.md#client-access-tokens
+
+You can download the files that make up your model directly from DataChain Studio.
+Head to the model details page of the model you would like to download and click
+`Access Model`. Here, you will find different ways to download your model.
+
+=== "CLI (DVC)"
+
+    Use the `dvc artifacts get` command to download an artifact by name. Learn
+    more on the command reference page for `dvc artifacts get`.
+
+=== "cURL / Python"
+
+    Directly call the Studio REST API from your terminal
+    using `cURL` or in your `Python` code.
+
+=== "Direct Download"
+
+    Here you can generate download links for your model files. After generation,
+    these download links are valid for 1 hour. You can click the link to directly
+    download the file.
+
+## Deploying and publishing models in CI/CD
+
+A popular deployment option is to **use CI/CD pipelines triggered by new Git
+tags to publish or deploy a new model version**. Since GTO registers versions
+and assigns stages by creating Git tags, you can set up a CI/CD pipeline to be
+triggered when the tags are pushed to the repository.
+
+You can use [the GTO GitHub Action](https://github.com/iterative/gto-action)
+that interprets a Git tag to find out the model's version and stage assignment
+(if any), reads annotation details such as `path`, `type` and `description`, and
+downloads the model binaries if needed.
+
+For help building an end-to-end flow from model training to deployment using the
+DVC model registry, refer to the
+[tutorial on automating model deployment to SageMaker](https://iterative.ai/blog/sagemaker-model-deployment).
+[Here](https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml)
+is the complete workflow script.
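+
+If you prefer to script the download, one possible route is the DVC Python API,
+which can fetch a registered version by its Git tag. The snippet below is a
+minimal sketch: the repo URL and artifact path are hypothetical placeholders,
+and it assumes your environment has credentials for the DVC remote (see the
+prerequisites above). GTO records each registered version as an annotated Git
+tag (e.g. `pool-segmentation@v0.0.1`), which can be passed as a revision:
+
+```python
+import dvc.api
+
+REPO = "https://github.com/<org>/<repo>"  # hypothetical repo URL
+PATH = "models/model.pkl"                 # hypothetical artifact path
+
+# Open the DVC-tracked model file at the Git tag created by version
+# registration, and read its bytes from the configured DVC remote.
+with dvc.api.open(PATH, repo=REPO, rev="pool-segmentation@v0.0.1", mode="rb") as f:
+    model_bytes = f.read()
+
+print(f"Fetched {len(model_bytes)} bytes")
+```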
diff --git a/docs/studio/user-guide/model-registry/view-and-compare-models.md b/docs/studio/user-guide/model-registry/view-and-compare-models.md new file mode 100644 index 000000000..13ffee0f5 --- /dev/null +++ b/docs/studio/user-guide/model-registry/view-and-compare-models.md @@ -0,0 +1,79 @@ +# View and compare models + +You can find all your models in the [models dashboard](#models-dashboard). Each +model has separate [model details pages](#model-details-page) for all its model +versions. Also, all models from a given Git repository are included as +[`model` columns in the experiment tables](#model-columns-in-the-projects-experiment-table) +of those projects that connect to this Git repository. + +## Models dashboard: + +The models in your model registry are organized in a central dashboard that +facilitates search and discovery. + + +You can sort the models in the dashboard by several criteria, including model +framework, repository, etc. + +DataChain Studio consolidates the stages of all the models in the registry, and +provides a way to filter models by stages. + +You can take a look at the [models dashboard] in Iterative's public (read only) +model registry. + +## Model details page: + +You can open the details of any model in the registry by clicking on the name of +the model in the models dashboard. + + +A model details page is divided into the following sections: + +- Top section: This contains + - the model name, + - a link to the model's Git repository, + - latest registered version of the model, + - a button to + [register a new version](register-version.md), + and + - information about how many projects in DataChain Studio have been created from the + model's Git repository. +- Left section: The left section contains information that is specific to a + particular registered version of the model. It has a version picker, which you + can use to switch between different registered versions of the model. For the + selected version, the left section shows + - buttons for common actions such as opening the corresponding experiment, + deregistering the model version, and + [assigning a stage to the version](assign-stage.md), + - all assigned stages, + - version description and labels, + - path to the model, + - metrics, params and plots. +- Right section: The right section contains information that is applicable + across all the versions of the model. In particular, it displays + - the assigned stages for the different versions, and + - the history of all version registration and stage assignment actions. + +You can find an example of a [model detail page] in Iterative's public (read +only) model registry. + +## Model columns in the project's experiment table: + +The models will also appear as `model` columns in the experiment tables of those +projects that have been created from the Git repository to which the model +belongs. + + +## Comparing model versions: + +To compare model versions, select relevant commits in the project's experiment +table and click `Compare` and/or `Plots` buttons: + + +This way you can compare both registered model versions and unregistered +experimental iterations and make a decision to register a new version out of the +latter. 
+
+[models dashboard]: https://studio.datachain.ai/team/Iterative/models
+[model detail page]:
+  https://studio.datachain.ai/team/Iterative/models/PTzV-9EJgmZ6TGspXtwKqw==/lightgbm-model/v2.0.1
diff --git a/docs/studio/user-guide/team-collaboration.md b/docs/studio/user-guide/team-collaboration.md
new file mode 100644
index 000000000..889eccb15
--- /dev/null
+++ b/docs/studio/user-guide/team-collaboration.md
@@ -0,0 +1,250 @@
+# Teams
+
+DataChain Studio enables collaborative work through teams, allowing you to share
+projects, datasets, and jobs with team members. You can create teams with one or
+more team members, also called collaborators, and assign different roles to
+control access permissions. The projects that you create in your team's page
+will be accessible to all members of the team.
+
+On this page, you will learn about:
+
+- [How to create a team](#create-a-team)
+- [How to invite collaborators (team members)](#invite-collaborators)
+- [The privileges (access permissions) of different roles](#roles)
+- [How to manage connections to self-hosted GitLab servers](#manage-connections-to-self-hosted-gitlab-servers)
+- [How to configure Single Sign-on (SSO)](#configure-single-sign-on-sso)
+- [How to upgrade to an Enterprise plan](#get-enterprise)
+
+## Create a team
+
+Click on the dropdown next to `Personal`. All the teams that you have created
+so far will be listed within `Teams` in the dropdown menu. If you have not
+created any team so far, this list will be empty.
+
+To create a new team, click on `Create a team`.
+![](https://static.iterative.ai/img/studio/team_create_v3.png)
+
+You will be asked to enter the URL namespace for your team. Enter a unique name.
+The URL for your team will be formed using this name.
+![](https://static.iterative.ai/img/studio/team_enter_name_v3.png)
+
+Then, click the `Create team` button in the top right corner.
+
+## Invite collaborators
+
+To add collaborators, enter their email addresses. Each collaborator can be
+assigned the [Admin, Edit, or View role](#roles). An email invite will be sent
+to each invitee. Then, click on `Send invites and close`.
+
+![](https://static.iterative.ai/img/studio/team_roles_v3.png)
+
+You can also click on `Skip and close` to skip adding collaborators while
+creating the team, and
+[add them later by accessing team settings](#edit-collaborators).
+
+## Roles
+
+Team members or collaborators can have the following roles:
+
+- **`Viewers`** (Read permission) - Have read-only access to datasets, jobs,
+  queries, and projects. They can view and explore data but cannot make any
+  changes or create new resources.
+- **`Editors`** (Write permission) - Can create and edit datasets, jobs,
+  queries, and projects. They can upload files, run jobs, and manage team
+  resources but cannot modify team settings or manage collaborators.
+- **`Admins`** (Admin permission) - Have full access to all team resources and
+  settings. They can add (invite) and remove collaborators, manage team
+  settings, configure cloud credentials, and perform all operations available to
+  Editors and Viewers.
+
+DataChain Studio does not have the concept of an `Owner` role. The user who
+creates the team has the `Admin` role. The privileges of such an admin are the
+same as those of any other collaborator who has been assigned the `Admin` role.
+!!!
note + + If your Git account does not have write access on the Git repository connected + to a project, you cannot push changes (e.g., new experiments) to the repository + even if the project belongs to a team where you are an `Editor` or `Admin`. + + +### Privileges for datasets + +| Feature | Viewer | Editor | Admin | +| --------------------------- | ------ | ------ | ----- | +| List datasets | Yes | Yes | Yes | +| View dataset information | Yes | Yes | Yes | +| View dataset rows | Yes | Yes | Yes | +| View dataset versions | Yes | Yes | Yes | +| Export datasets | Yes | Yes | Yes | +| Preview files | Yes | Yes | Yes | +| Create datasets | No | Yes | Yes | +| Edit dataset metadata | No | Yes | Yes | +| Delete datasets | No | Yes | Yes | +| Upload files | No | Yes | Yes | +| Move files in storage | No | Yes | Yes | +| Delete files | No | Yes | Yes | +| Reindex storage | No | Yes | Yes | +| Create dataset from storage | No | Yes | Yes | + +### Privileges for jobs + +| Feature | Viewer | Editor | Admin | +| -------------------- | ------ | ------ | ----- | +| List jobs | Yes | Yes | Yes | +| View job details | Yes | Yes | Yes | +| View job logs | Yes | Yes | Yes | +| List clusters | Yes | Yes | Yes | +| Create jobs | No | Yes | Yes | +| Cancel running jobs | No | Yes | Yes | +| Update job status | No | Yes | Yes | + +### Privileges for queries + +| Feature | Viewer | Editor | Admin | +| ----------------------- | ------ | ------ | ----- | +| List queries | Yes | Yes | Yes | +| View query details | Yes | Yes | Yes | +| Create queries | No | Yes | Yes | +| Update queries | No | Yes | Yes | +| Duplicate queries | No | Yes | Yes | +| Delete queries | No | Yes | Yes | + +### Privileges for DVC experiments + +| Feature | Viewer | Editor | Admin | +| --------------------------------------------- | ------ | ------ | ----- | +| Open a team's project | Yes | Yes | Yes | +| View experiments and metrics | Yes | Yes | Yes | +| Apply filters | Yes | Yes | Yes | +| Show / hide columns | Yes | Yes | Yes | +| Save filters and column settings | No | Yes | Yes | +| Add a new project | No | Yes | Yes | +| Edit project settings | No | Yes | Yes | +| Delete a project | No | Yes | Yes | +| Share a project | No | Yes | Yes | + +### Privileges for storage and activity logs + +| Feature | Viewer | Editor | Admin | +| ------------------------ | ------ | ------ | ----- | +| List storage files | Yes | Yes | Yes | +| View activity logs | Yes | Yes | Yes | +| Create activity logs | No | Yes | Yes | +| Get presigned URLs | No | Yes | Yes | + +### Privileges to manage the team + +| Feature | Viewer | Editor | Admin | +| ---------------------------------- | ------ | ------ | ----- | +| Manage team settings | No | No | Yes | +| Manage team collaborators | No | No | Yes | +| Configure cloud credentials | No | No | Yes | +| Manage GitLab server connections | No | No | Yes | +| Configure Single Sign-on (SSO) | No | No | Yes | +| Manage team plan and billing | No | No | Yes | +| Delete a team | No | No | Yes | + +## Manage your team and its resources + +Once you have created the team, the team's workspace opens up. + +![](https://static.iterative.ai/img/studio/team_page_v6.png) + +In this workspace, you can manage the team's: +- [Datasets](#datasets) +- [Jobs](#jobs) +- [Projects (DVC Experiments)](#projects-dvc-experiments) +- [Settings](#settings) + +## Datasets + +The datasets dashboard displays all datasets created by team members. 
Access
+permissions are controlled by team roles:
+- **Viewers** can explore and export datasets
+- **Editors** can create, edit, and delete datasets
+- **Admins** have full control over all datasets
+
+To create a new dataset, you can upload files, connect to cloud storage, or
+create datasets from DataChain queries.
+
+## Jobs
+
+The jobs dashboard shows all DataChain jobs running on the team's compute
+clusters. Access permissions are controlled by team roles:
+- **Viewers** can view job status and logs
+- **Editors** can create, run, and cancel jobs
+- **Admins** have full control over all jobs
+
+## Projects (DVC Experiments)
+
+This is the projects dashboard for DVC experiment tracking. All projects on this
+dashboard are accessible to all team members based on their roles.
+
+To add a project to this dashboard, click on `Add a project`. The process for
+adding a project is the same as that for adding personal projects
+([instructions](./experiments/create-a-project.md)).
+
+## Settings
+
+On the team settings page, you can change the team name, add
+credentials for the data remotes, and delete the team. Note that these settings
+are applicable to the team and are thus different from
+[project settings](./experiments/configure-a-project.md).
+
+Additionally, you can also
+[manage connections to self-hosted GitLab servers](#manage-connections-to-self-hosted-gitlab-servers),
+[configure SSO](#configure-single-sign-on-sso), and
+[edit collaborators](#edit-collaborators).
+
+### Manage connections to self-hosted GitLab servers
+
+If your team’s Git repositories are on a self-hosted GitLab server, you can go
+to the `GitLab connections` section of the team settings page to set up a
+connection to this server. Once you set up the connection, all your team members
+can connect to the Git repositories on this server. For more details, refer to
+[Custom GitLab Server Connection](./git-connections/custom-gitlab-server.md).
+
+### Configure Single Sign-on (SSO)
+
+Single Sign-on (SSO) allows your team members to authenticate to DataChain
+Studio using your organization's Identity Provider (IdP), such as Okta, LDAP,
+Microsoft AD, etc.
+
+Details on how to configure SSO for your team can be found
+[here](./authentication/single-sign-on.md).
+
+Once the SSO configuration is complete, users can log in to DataChain Studio
+using their team's login page at
+`http://studio.datachain.ai/api/teams//sso`. They can also log in
+directly from their Okta dashboards by clicking on the DataChain Studio
+integration icon.
+
+If a user does not have a pre-assigned role when they sign in to a team, they
+will be auto-assigned the [`Viewer` role](#roles).
+
+### Edit collaborators
+
+To manage the collaborators (team members) of your team, go to the
+`Collaborators` section of the team settings page. Here you can invite new team
+members as well as remove or change the [roles](#roles) of existing team
+members.
+
+The number of collaborators in your team depends on your team plan. By default,
+all teams are on the Free plan, and can have 2 collaborators. To add more
+collaborators, [upgrade to the Enterprise plan](#get-enterprise).
+
+All collaborators and pending invites get counted in the subscription. Suppose
+you have subscribed for a 10-member team. If you have 5 members who have
+accepted your team invite and 3 pending invites, then you will have 2 remaining
+seats. This means that you can invite 2 more collaborators.
At this point, if +you remove any one team member or pending invite, that seat becomes available +and so you will have 3 remaining seats. + +## Get Enterprise + +**To upgrade to the Enterprise plan**, [schedule a call] with our in-house +experts. They will try to understand your needs and suggest a suitable plan and +pricing. + +[schedule a call]: https://calendly.com/gtm-2/studio-introduction diff --git a/docs/studio/user-guide/troubleshooting.md b/docs/studio/user-guide/troubleshooting.md new file mode 100644 index 000000000..5bce30586 --- /dev/null +++ b/docs/studio/user-guide/troubleshooting.md @@ -0,0 +1,476 @@ +# Troubleshooting + +Here we provide help for some of the problems that you may encounter when using +DataChain Studio. + +## Support + +If you need further help, you can send us a message using the `Help` option on the DataChain Studio +website. You can also [email us](mailto:support@datachain.studio), create a +support ticket on [GitHub](https://github.com/datachain-studio/support), or join +the discussion in our [community Discord](https://discord.gg/datachainstudio). + +## Projects and experiments + +- [Errors accessing your Git repository](#errors-accessing-your-git-repository) +- [Errors related to parsing the repository](#errors-related-to-parsing-the-repository) +- [Errors related to DVC remotes and credentials](#errors-related-to-dvc-remotes-and-credentials) +- [Error: No DVC repo was found at the root](#error-no-dvc-repo-was-found-at-the-root) +- [Error: Non-DVC sub-directory of a monorepo](#error-non-dvc-sub-directory-of-a-monorepo) +- [Error: No commits were found for the sub-directory](#error-no-commits-were-found-for-the-sub-directory) +- [Project got created, but does not contain any data](#project-got-created-but-does-not-contain-any-data) +- [Project does not contain the columns that I want](#project-does-not-contain-the-columns-that-i-want) +- [Project does not contain some of my commits or branches](#project-does-not-contain-some-of-my-commits-or-branches) +- [Error: Missing metric or plot file(s)](#error-missing-metric-or-plot-files) +- [Project does not display live metrics and plots](#project-does-not-display-live-metrics-and-plots) +- [Project does not display DVC experiments](#project-does-not-display-dvc-experiments) +- [Error: `dvc.lock` validation failed](#error-dvclock-validation-failed) +- [Project does not reflect updates in the Git repository ](#project-does-not-reflect-updates-in-the-git-repository) + +## Jobs + +- [Job stuck in QUEUED state](#job-stuck-in-queued-state) +- [Job fails during INIT](#job-fails-during-init) +- [Job fails during execution](#job-fails-during-execution) +- [Storage access errors](#storage-access-errors) +- [Job performance issues](#job-performance-issues) + +## Model registry + +- [I cannot find my desired Git repository in the form to add a model](#i-cannot-find-my-desired-git-repository-in-the-form-to-add-a-model) +- [Model registry does not display the models in my Git repositories](#model-registry-does-not-display-the-models-in-my-git-repositories) +- [My models have disappeared even though I did not remove (deprecate) them](#my-models-have-disappeared-even-though-i-did-not-remove-deprecate-them) + +## Billing and payment + +- [Questions or problems with billing and payment](#questions-or-problems-with-billing-and-payment) + +## Errors accessing your Git repository + +When DataChain Studio cannot access your Git repository, it can present one of the +following errors: + +- Repository not found or you don't have 
access to it
+- Unable to access repository due to stale authorization
+- Unable to access repository
+- Could not access the git repository, because the connection was deleted or the
+  token was expired
+- No tokens to access the repo
+- Insufficient permission to push to this repository
+- No access to this repo
+
+To fix this, make sure that the repository exists and you have access to it.
+Log in again to the correct Git account and try to import the repository again.
+If you are connecting to a GitHub account, also make sure that the DataChain Studio
+GitHub app is installed.
+
+Additionally, network or third-party issues (such as GitHub, GitLab or Bitbucket
+outages) can also cause connection issues. In this case, DataChain Studio can display
+an appropriate indication in the error message.
+
+## Errors related to parsing the repository
+
+If you see one of the following errors, it means that for some reason, parsing
+of the Git repository could not start or it stopped unexpectedly. You can try to
+import the repo again.
+
+- Failed to start parsing
+- Parsing stopped unexpectedly
+
+## Errors related to DVC remotes and credentials
+
+DataChain Studio can include data from
+[data remotes](experiments/configure-a-project.md#data-remotes-cloud-storage-credentials)
+in your project. However, it can access data from network-accessible remotes
+such as Amazon S3, Microsoft Azure, etc., but not from local DVC
+remotes. If your project uses an unsupported remote, you
+will see one of the following errors:
+
+- Local remote was ignored
+- Remote not supported
+
+Please use one of the following types of data remotes: Amazon S3, Microsoft
+Azure, Google Drive, Google Cloud Storage and SSH.
+
+If the data remotes have access control, then you should [add the required
+credentials to your project](experiments/configure-a-project.md#data-remotes-cloud-storage-credentials). If credentials are missing or
+incorrect, you will see one of the following errors:
+
+- No credentials were provided
+- Credentials are either broken or not recognized
+- No permission to fetch remote data
+
+### Errors related to DVC remotes behind a firewall
+
+For self-hosted S3 storage (like MinIO) or an SSH server, ensure that it is
+accessible from the internet. If your server is behind a firewall,
+you can limit the traffic on the firewall to allow access to the server from our
+IP addresses only, which are:
+
+```
+3.21.85.173/32
+3.142.203.124/32
+```
+
+Additionally, if you provide the hostname, the DNS records associated with the
+storage server should be publicly available to resolve the server name. Use a
+[DNS Propagation Checker](https://www.whatsmydns.net/) to confirm that the
+server domain is resolvable. If you still have any trouble setting up the
+connection to your server, please
+[contact us](#support).
+
+## Error: No DVC repo was found at the root
+
+If you get this message when you try to add a project:
+`No DVC repo was found at the root`, then it means that you have connected to a
+Git repository which contains a DVC repository in some sub-directory but not at
+the root.
+
+This could be a typical situation when your DVC repository is part of a
+[monorepo](https://en.wikipedia.org/wiki/Monorepo).
+
+To solve this, you should [specify the full path to the
+sub-directory](experiments/configure-a-project.md#project-directory) that contains the DVC repo.
+ +Note that if you're connecting to a repository just to fetch models for the +model registry, and you are not working with DVC repositories, you can ignore +this error. + +## Error: Non-DVC sub-directory of a monorepo + +If you get this message when you try to add a project: +`Non-DVC sub-directory of a monorepo`, then it means that you have connected to +a Git repository which contains a DVC repository in some sub-directory, but you +have selected the incorrect sub-directory. + +This could be a typical situation when your DVC repository is part of a +[monorepo](https://en.wikipedia.org/wiki/Monorepo). Suppose your Git repository +contains sub-directories A and B. If A contains the DVC repository which you +want to connect from DataChain Studio, but you specify B when creating the project, +then you will get the above error. + +To solve this, you should [specify the full path to the correct +sub-directory](experiments/configure-a-project.md#project-directory) that contains the DVC repo. + +## Error: No commits were found for the sub-directory + +If you get this message when you try to add a project, then it means that you +have specified an empty or non-existent sub-directory. + +To solve this, you need to change the sub-directory and [specify the full path +to the correct sub-directory](experiments/configure-a-project.md#project-directory) that contains the DVC repo. + +## Project got created, but does not contain any data + +If you initialized a DVC repository, but did not push any commit with data, +metrics or hyperparameters, then even though you will be able to connect to this +repository, the project will appear empty in DataChain Studio. To solve this, make +relevant commits to your DVC repository. + +Refer to the [DVC documentation](https://dvc.org/doc) for help on making commits +to a DVC repository. + +Note that if you're connecting to a repository just to fetch models for the +model registry, and your repository is not expected to contain experiment data, +metrics or hyperparameters, your project will appear empty. This is ok - you +will still be able to work with your models in the model registry. + +## Project does not contain the columns that I want + +There are two possible reasons for this: + +1. **The required columns were not imported:** DataChain Studio will only import + columns that you select in the + [**Columns** setting](experiments/configure-a-project.md#columns). + + **What if the repository has more than 500 columns?** Currently DataChain Studio + does not import over 500 columns. If you have a large repository (with more + than 500 columns), one solution is to split the + metrics/hyperparameters/files that you want to display over + multiple subdirectories in your Git repository. For each subdirectory, you + can create a new project in DataChain Studio and limit it to that subdirectory. + + To create projects for subdirectories, [specify the project directory in + project settings](experiments/configure-a-project.md#project-directory). + + If this solution does not work for your use case, please create a support + ticket in the [DataChain Studio support GitHub repository](https://github.com/iterative/studio-support). + +2. **The required columns are hidden:** In the project's experiment table, you + can hide the columns that you do not want to display. If any column that you + want is not visible, make sure you have not hidden it. The following video + shows how you can show/hide columns. Once you show/hide columns, remember to + save the changes. 
+
+   #### Show/hide columns
+
+   ![Showing and hiding columns](https://static.iterative.ai/img/studio/show_hide_columns.gif)
+
+## Project does not contain some of my commits or branches
+
+This is likely not an error. DataChain Studio identifies commits that do not change
+metrics, files or hyperparameters and will auto-hide such commits. It also
+auto-hides commits that contain the string `[skip studio]` in the commit
+message. You can also manually hide commits and branches, which means it is
+possible that the commits or branches you do not see in your project were
+manually hidden by you or someone else in your team.

+You can unhide commits and branches to display them. For details, refer to
+[Display preferences -> Hide commits](experiments/explore-ml-experiments.md#hide-commits). However, if the missing commit/branch is
+not in the hidden commits list, please [raise a support request](#support).
+
+## Error: Missing metric or plot file(s)
+
+This error message means that the metric or plot files referenced from
+`dvc.yaml` could not be found in your Git repository or cache. Make sure that
+you have pushed the required files using `dvc push`. Then try to import the
+repository again.
+
+## Error: Skipped big remote file(s)
+
+Files that are larger than 10 MB are currently skipped by DataChain Studio.
+
+## Project does not display live metrics and plots
+
+Confirm that you are correctly following the
+[procedure to send live metrics and plots](experiments/live-metrics-and-plots.md)
+to DataChain Studio.
+
+Note that a live experiment is nested under the parent Git commit in the project
+table. If the parent Git commit is not pushed to the Git repository, the live
+experiment row will appear within a `Detached experiments` dummy branch in the
+project table. Once you push the missing parent commit to the Git remote, the
+live experiment will get nested under the parent commit as expected.
+
+## Project does not display DVC experiments
+
+DataChain Studio automatically checks for updates to your repository using webhooks,
+but it cannot rely on this mechanism for custom Git objects, like DVC
+experiment references. So the experiments you push using `dvc exp push`
+may not automatically display in your project table.
+
+To manually check for updates in your repository, use the `Reload` button 🔄
+located above the project table.
+
+## Error: `dvc.lock` validation failed
+
+This error indicates that the `dvc.lock` file in the given commit has invalid
+YAML. If the given commit is unimportant to you, you can ignore this error.
+
+One potential cause for this error is that at the time of the given commit, your
+repository used DVC 1.0. The format of lock files used in DVC 1.0 was deprecated
+in the DVC 2.0 release. Upgrading to the latest DVC version will resolve this
+issue for any future commits in your repository.
+
+## Project does not reflect updates in the Git repository
+
+When there are updates (new commits, branches, etc.) in your Git repository,
+your project in DataChain Studio is updated to include them. If the
+project has stopped receiving updates from the Git repository and you have to
+re-import the project each time to get any new commit, then it is possible
+that the DataChain Studio webhook in your repository got deleted or corrupted.
+
+DataChain Studio periodically checks for any missing or broken webhooks, and
+attempts to re-create them. Currently, this happens every 2 hours.
## Project does not display DVC experiments

DataChain Studio automatically checks for updates to your repository using webhooks,
but it cannot rely on this mechanism for custom Git objects such as DVC
experiment references. So the experiments you push using `dvc exp push`
may not automatically show up in your project table.

To manually check for updates in your repository, use the `Reload` button 🔄
located above the project table.

## Error: `dvc.lock` validation failed

This error indicates that the `dvc.lock` file in the given commit contains
invalid YAML. If the given commit is unimportant to you, you can ignore this
error.

One potential cause for this error is that at the time of the given commit, your
repository used DVC 1.0. The format of lock files used in DVC 1.0 was deprecated
in the DVC 2.0 release. Upgrading to the latest DVC version will resolve this
issue for any future commits in your repository.

## Project does not reflect updates in the Git repository

When there are updates (new commits, branches, etc.) in your Git repository,
your project in DataChain Studio is updated to reflect them. If the project has
stopped receiving updates from the Git repository and you have to re-import the
project each time to get any new commit, it is possible that the DataChain
Studio webhook in your repository was deleted or is misconfigured.

DataChain Studio periodically checks for missing or misconfigured webhooks and
attempts to re-create them. Currently, this happens every 2 hours. The webhook
is also re-created every time you create a new project or re-import a
repository.

## Job stuck in QUEUED state

If your job remains in the QUEUED state for an extended period:

### Possible Causes
- **No available workers**: All workers in the cluster are busy processing other jobs
- **Resource quotas exceeded**: Your team has reached the maximum number of concurrent jobs
- **High priority jobs ahead**: Other jobs with higher priority are being processed first

### Solutions
1. Check the worker availability in the status bar at the top of Studio
2. Review your team's resource quotas and usage
3. Consider adjusting job priority settings if appropriate
4. Wait for currently running jobs to complete
5. Contact support if jobs remain queued for unusually long periods

## Job fails during INIT

If your job fails during the initialization phase:

### Common Causes
- **Invalid package requirements**: Errors in the requirements.txt file
- **Incompatible package versions**: Package version conflicts
- **Missing dependencies**: Required packages not specified

### Solutions
1. Check the Logs tab for specific error messages about package installation
2. Review your requirements.txt file:
   - Verify package names are spelled correctly
   - Check for version compatibility between packages
   - Pin package versions to avoid conflicts (e.g., `pandas==2.0.0`)
3. Test package installation locally before submitting the job
4. Minimize the number of dependencies to reduce initialization time
5. Check the Dependencies tab in job monitoring to see what was installed

### Example of Common Issues

**Bad requirements.txt:**
```
pandas
numpy=1.24.0 # Single equals sign - not a valid version specifier (use ==)
pillow>=9.0.0,<10.0.0
invalipakage # Typo in package name
```

**Good requirements.txt:**
```
pandas==2.0.0
numpy==1.24.0
pillow>=9.0.0,<10.0.0
```

## Job fails during execution

If your job starts running but fails during data processing:

### Script Errors
- **Syntax errors**: Check your Python code for syntax issues
- **Logic errors**: Review your DataChain operations for logical mistakes
- **Unhandled exceptions**: Add proper error handling to your script

### Data Access Issues
- **Invalid storage paths**: Verify that storage paths are correct and accessible
- **Missing credentials**: Ensure storage credentials are configured in account settings
- **Permission denied**: Check that your credentials have the necessary permissions
- **Storage path not found**: Verify the bucket/container and path exist

### Resource Limits
- **Out of memory**: Job exceeded allocated memory
  - Solution: Reduce batch size, increase workers, or process data in chunks
- **Timeout**: Job took longer than maximum allowed time
  - Solution: Optimize code or split into smaller jobs
- **Storage full**: Temporary storage filled up
  - Solution: Clean up intermediate files or reduce data volume

### Debugging Steps
1. **Check the Logs tab**: Look for error messages and stack traces
2. **Review the Diagnostics tab**: Check which phase failed and execution timeline
3. **Check the Dependencies tab**: Verify data sources are connected correctly
4. **Test with a subset**: Try running with a smaller sample of data
5. **Run locally**: Test your script locally with sample data before submitting (see the sketch after this list)

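As a sketch of such a local dry run (the method names follow the DataChain
Python API and may differ slightly between versions; the storage path is a
placeholder):

```python
from datachain import DataChain

# Run the script logic on a small sample locally before submitting
# it as a Studio job. The storage path is a placeholder.
sample = DataChain.from_storage("s3://my-bucket/data/").limit(10)
sample.show()  # inspect a few rows to confirm the data is readable
```
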
## Storage access errors

If you encounter errors accessing cloud storage:

### Credential Issues
- **No credentials configured**: Add storage credentials in account settings
- **Expired credentials**: Refresh or update your credentials
- **Wrong credentials**: Verify you're using the correct credentials for the storage

### Permission Issues
- **Insufficient permissions**: Your credentials don't have read access to the storage
- **Bucket not found**: Storage bucket/container name is incorrect
- **Path not accessible**: The specific path within storage doesn't exist

### Network Issues
- **Connection timeout**: Network connectivity problems between Studio and storage
- **Firewall blocking**: Storage is behind a firewall that blocks Studio's IP addresses

### Solutions
1. Verify credentials are configured correctly in [account settings](account-management.md)
2. Check storage bucket permissions and access policies
3. Test the storage connection separately before running the job (see the sketch after this list)
4. Ensure the storage path exists and is accessible
5. For self-hosted storage, verify that your firewall allows access from Studio's IP addresses:
   ```
   3.21.85.173/32
   3.142.203.124/32
   ```

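For S3-compatible storage, one way to run that standalone check is with
`boto3`, using the same credentials you configured in Studio. This is a sketch;
the bucket name and prefix are placeholders:

```python
import boto3

# Confirm the credentials can reach the bucket and list a few objects.
s3 = boto3.client("s3")
s3.head_bucket(Bucket="my-bucket")  # raises ClientError on missing bucket or no access
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```

If this check fails with an access error, fix the credentials or bucket policy
before re-running the job.
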
## Job performance issues

If your jobs are running slower than expected:

### Analyzing Performance

Check the [Diagnostics tab](jobs/monitor-jobs.md#diagnostics-tab) to identify bottlenecks:

#### Long Queue Times (> 2 minutes)
- **Cause**: High cluster demand or insufficient available workers
- **Solution**:
  - Run jobs during off-peak hours
  - Consider upgrading to a plan with more workers
  - Adjust job priority for urgent tasks

#### Long Worker Start (> 5 minutes)
- **Cause**: Cold start of compute resources
- **Solution**:
  - This is typically infrastructure-related
  - Contact support if consistently slow

#### Slow Dependency Installation (> 3 minutes)
- **Causes**:
  - Many packages to install
  - Large package downloads
  - Package version resolution conflicts
- **Solutions**:
  - Pin package versions in requirements.txt to avoid slow dependency resolution
  - Minimize the number of dependencies
  - Use a cached virtualenv when possible (cache usage is shown in the Logs)

#### Extended Data Warehouse Wake (> 2 minutes)
- **Cause**: Infrastructure initialization
- **Solutions**:
  - Keep the warehouse warm by running jobs regularly
  - Contact support for dedicated warehouse options

#### Long Running Query Time
- **Causes**:
  - Processing large volumes of data
  - Inefficient DataChain operations
  - Insufficient workers for the dataset size
- **Solutions**:
  - Filter data early to reduce processing volume
  - Use efficient DataChain operations (avoid unnecessary transformations)
  - Increase the worker count for large datasets
  - Batch operations appropriately
  - Profile your code to identify slow operations

### General Performance Tips

1. **Start small**: Test with a small data sample first
2. **Monitor metrics**: Track job execution times across runs
3. **Use appropriate workers**: Balance between cost and performance
4. **Optimize code**: Profile and optimize DataChain operations
5. **Review logs**: Check for warnings about performance issues
6. **Compare runs**: Use the Diagnostics tab to compare execution times

For detailed monitoring guidance, see [Monitor Jobs](jobs/monitor-jobs.md).

## I cannot find my desired Git repository in the form to add a model

Only repositories that you have connected to DataChain Studio are available in the
`Add a model` form. To connect your desired repository to DataChain Studio, go to the
`Projects` tab and [create a project that connects to this Git
repository](experiments/create-a-project.md). Then you can return to the model registry and
add the model.

## Model registry does not display the models in my Git repositories

For a model to be displayed in the model registry, it has to be [added](model-registry/add-a-model.md) using
DVC.

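For reference, models are typically declared in the `artifacts` section of
`dvc.yaml`. The sketch below uses placeholder names and paths; the
[add a model](model-registry/add-a-model.md) guide describes the exact procedure:

```yaml
# dvc.yaml (placeholder names and paths)
artifacts:
  text-classifier:
    path: models/model.pkl
    type: model
    labels:
      - example
```
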
## My models have disappeared even though I did not remove (deprecate) them

When a project is deleted, all its models are automatically removed from the
model registry. So first check whether the project has been deleted. If it
has, you can [add the project](experiments/create-a-project.md) again. Deleting a project from DataChain Studio does
not delete any commits or tags from the Git repository, so adding the project
back will restore all the models from the repository along with their details,
including versions and stage assignments.

## Questions or problems with billing and payment

Check out the [Frequently Asked Questions](https://studio.datachain.ai/faq) to
see if your questions have already been answered. If you still have problems,
please [contact us](#support).

diff --git a/mkdocs.yml b/mkdocs.yml
index 7fad70459..b78a04689 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -119,10 +119,65 @@ nav:
- Namespaces: guide/namespaces.md
- Local DB Migrations: guide/db_migrations.md
- 🔗 Studio:
+ - Overview: studio/index.md
+ - User Guide:
+ - Overview: studio/user-guide/index.md
+ - Account Management: studio/user-guide/account-management.md
+ - Jobs:
+ - Overview: studio/user-guide/jobs/index.md
+ - Create and Run: studio/user-guide/jobs/create-and-run.md
+ - Monitor Jobs: studio/user-guide/jobs/monitor-jobs.md
+ - Experiments (DVC):
+ - Overview: studio/user-guide/experiments/index.md
+ - Create a Project: studio/user-guide/experiments/create-a-project.md
+ - Configure a Project: studio/user-guide/experiments/configure-a-project.md
+ - Run Experiments: studio/user-guide/experiments/run-experiments.md
+ - Explore ML Experiments: studio/user-guide/experiments/explore-ml-experiments.md
+ - Live Metrics and Plots: studio/user-guide/experiments/live-metrics-and-plots.md
+ - Visualize and Compare: studio/user-guide/experiments/visualize-and-compare.md
+ - Share a Project: studio/user-guide/experiments/share-a-project.md
+ - Model Registry:
+ - View and Compare Models: studio/user-guide/model-registry/view-and-compare-models.md
+ - Add a Model: studio/user-guide/model-registry/add-a-model.md
+ - Register a Model Version: studio/user-guide/model-registry/register-version.md
+ - Assign Stage to Model Version: studio/user-guide/model-registry/assign-stage.md
+ - Use Models: studio/user-guide/model-registry/use-models.md
+ - Remove a Model, Version, or Stage: studio/user-guide/model-registry/remove-a-model-or-its-details.md
+ - Git Connections:
+ - Overview: studio/user-guide/git-connections/index.md
+ - GitHub App: studio/user-guide/git-connections/github-app.md
+ - Custom GitLab Server: studio/user-guide/git-connections/custom-gitlab-server.md
+ - Team Collaboration: studio/user-guide/team-collaboration.md
+ - Authentication:
+ - Single Sign-on: studio/user-guide/authentication/single-sign-on.md
+ - OpenID Connect: studio/user-guide/authentication/openid-connect.md
+ - Troubleshooting: studio/user-guide/troubleshooting.md
- API: studio/api/index.md
- Webhooks: studio/webhooks.md
+ - Self-hosting:
+ - Overview: studio/self-hosting/index.md
+ - Installation:
+ - Overview: studio/self-hosting/installation/index.md
+ - AWS AMI: studio/self-hosting/installation/aws-ami.md
+ - Kubernetes (Helm): studio/self-hosting/installation/k8s-helm.md
+ - Configuration:
+ - Overview: studio/self-hosting/configuration/index.md
+ - SSL/TLS: studio/self-hosting/configuration/ssl-tls.md
+ - CA Certificates: studio/self-hosting/configuration/ca-certificates.md
+ - Git Forges:
+ - Overview: studio/self-hosting/configuration/git-forges/index.md
+ - GitHub: studio/self-hosting/configuration/git-forges/github.md
+ - GitLab: studio/self-hosting/configuration/git-forges/gitlab.md
+ - Bitbucket: studio/self-hosting/configuration/git-forges/bitbucket.md
+ - Upgrading:
+ - Overview: studio/self-hosting/upgrading/index.md
+ - Regular Procedure: studio/self-hosting/upgrading/regular-procedure.md
+ - Airgap Procedure: studio/self-hosting/upgrading/airgap-procedure.md
+ - Troubleshooting:
+ - Overview: studio/self-hosting/troubleshooting/index.md
+ - 502 Errors: studio/self-hosting/troubleshooting/502-errors.md
+ - Support Bundle: studio/self-hosting/troubleshooting/support-bundle.md
- 🤝 Contributing: contributing.md
- - DataChain Website ↗: https://datachain.ai" target="_blank"
- Studio ↗: https://studio.datachain.ai" target="_blank"
@@ -173,9 +228,8 @@ plugins:
- mkdocstrings:
handlers:
python:
- rendering:
- show_submodules: no
options:
+ show_submodules: no
docstring_options:
ignore_init_summary: true
docstring_section_style: list
show_symbol_type_heading: true
show_symbol_type_toc: true
signature_crossrefs: true
- import:
+ inventories:
- https://docs.python.org/3/objects.inv
- https://numpy.org/doc/stable/objects.inv
- https://pandas.pydata.org/docs/objects.inv
diff --git a/pyproject.toml b/pyproject.toml
index 1b099af84..c747505a6 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -62,9 +62,9 @@ docs = [
"mkdocs>=1.5.2",
"mkdocs-gen-files>=0.5.0",
"mkdocs-material==9.5.22",
- "mkdocs-section-index>=0.3.6",
"mkdocstrings-python>=1.6.3",
"mkdocs-literate-nav>=0.6.1",
+ "mkdocs-section-index>=0.3.10",
"eval-type-backport"
]
torch = [