Commit 5a7b93c

Update Airbyte-Couchbase integration tutorial with enhanced tags, improved descriptions, and clarification on sync modes. Adjusted examples for better accuracy and added notes on best practices for data ingestion and synchronization.
1 parent 51dda13 commit 5a7b93c

1 file changed: +30 -83 lines changed

tutorial/markdown/connectors/airbyte/airbyte-couchbase-integration.md

Lines changed: 30 additions & 83 deletions
@@ -13,9 +13,9 @@ technology:
 - query
 tags:
 - Airbyte
-- Data Integration
-- ETL
 - Connector
+- Data Ingestion
+- Best Practices
 sdk_language:
 - python
 length: 35 Mins
@@ -28,7 +28,9 @@ Airbyte is an open-source data integration platform that enables you to move dat
 - **Cross-bucket replication**: Sync data between buckets within the same or different Couchbase clusters
 - **Analytics pipelines**: Extract data from Couchbase to data warehouses or analytics platforms
 - **Data ingestion**: Load data from SaaS applications, databases, or APIs into Couchbase
-- **Change data capture**: Track and replicate document changes in near real-time
+- **Change data capture**: Track and replicate document changes with periodic syncs
+
+> **Note**: Airbyte is designed for batch/periodic data synchronization (typically 5-60 minute intervals), not sub-second real-time change tracking. For true real-time CDC, consider Couchbase's built-in XDCR or Eventing services.

 This tutorial will guide you through setting up Airbyte with Couchbase Capella (cloud-hosted) as both source and destination, covering configuration, sync modes, common patterns, and best practices.

@@ -94,6 +96,8 @@ This tutorial assumes you have:

 The Couchbase source connector allows Airbyte to extract data from your Couchbase buckets. It automatically discovers all collections within a bucket and creates individual streams for each.

+> **What is a stream?** In Airbyte, a stream represents a single data source (in this case, a Couchbase collection) that can be synced to a destination. Each stream has its own schema, sync mode, and cursor configuration. Learn more in [Airbyte's documentation](https://docs.airbyte.com/understanding-airbyte/connections/).
+
 ### Step 1: Prepare Your Couchbase Source

 #### Create a Database User
@@ -176,7 +180,7 @@ Example streams from a `travel-sample` bucket:
 {
 "_id": "string", // Document key
 "_ab_cdc_updated_at": "integer", // Modification timestamp (for incremental sync)
-"travel-sample": { // Collection name
+"bucket": { // Bucket name
 // Original document fields
 }
 }
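To illustrate the record shape in the hunk above, here is a minimal Python sketch that splits one emitted record into its document key, cursor value, and original fields. It assumes the wrapper key equals the bucket name, as the added line's comment suggests; `unwrap_record` and the sample values are hypothetical and are not part of the connector.

```python
from typing import Any, Dict, Tuple


def unwrap_record(record: Dict[str, Any], bucket: str) -> Tuple[str, int, Dict[str, Any]]:
    """Split a source record into (document key, cursor value, original fields).

    Assumption: the original document is nested under a key named after the bucket.
    """
    doc_key = record["_id"]
    updated_at = int(record["_ab_cdc_updated_at"])
    fields = record.get(bucket, {})
    return doc_key, updated_at, fields


# Hypothetical sample record following the schema shown in the diff.
sample = {
    "_id": "airline_10",
    "_ab_cdc_updated_at": 1700000000,
    "travel-sample": {"name": "40-Mile Air", "country": "United States"},
}

doc_key, updated_at, fields = unwrap_record(sample, "travel-sample")
print(doc_key, updated_at, fields["name"])
```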
@@ -194,8 +198,8 @@ Syncs all documents from the collection every time.
 ```sql
 SELECT META().id as _id,
 TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
-*
-FROM `bucket`.`scope`.`collection`
+c AS `bucket`
+FROM `bucket`.`scope`.`collection` AS c
 ```

 **When to use**:
@@ -217,8 +221,8 @@ Syncs only new or modified documents since the last sync.
 ```sql
 SELECT META().id as _id,
 TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
-*
-FROM `bucket`.`scope`.`collection`
+c AS `bucket`
+FROM `bucket`.`scope`.`collection` AS c
 WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
 ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
 ```
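The two query hunks above differ only in the cursor filter. A minimal Python sketch of how such a query string could be assembled is shown here. The helper names `quote_identifier` and `build_sync_query` are hypothetical, and this is not the connector's actual code. The scope and collection in the usage lines are illustrative values.

```python
from typing import Optional

# Cursor expression copied from the queries in this diff.
CURSOR_EXPR = "TO_NUMBER(meta().xattrs.$document.last_modified)"


def quote_identifier(name: str) -> str:
    """Backtick-quote an identifier, refusing names that contain a backtick."""
    if "`" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def build_sync_query(
    bucket: str,
    scope: str,
    collection: str,
    last_cursor_value: Optional[int] = None,
) -> str:
    """Return the full refresh query, or the incremental one when a cursor is given."""
    keyspace = ".".join(quote_identifier(part) for part in (bucket, scope, collection))
    lines = [
        "SELECT META().id as _id,",
        f"{CURSOR_EXPR} as _ab_cdc_updated_at,",
        f"c AS {quote_identifier(bucket)}",
        f"FROM {keyspace} AS c",
    ]
    if last_cursor_value is not None:
        # Incremental sync: only documents modified after the saved cursor, oldest first.
        lines.append(f"WHERE {CURSOR_EXPR} > {int(last_cursor_value)}")
        lines.append(f"ORDER BY {CURSOR_EXPR} ASC")
    return "\n".join(lines)


full_query = build_sync_query("travel-sample", "inventory", "airline")
incremental_query = build_sync_query("travel-sample", "inventory", "airline", 1700000000)
print(incremental_query)
```

Casting the cursor with `int()` keeps arbitrary text out of the statement. In real code, a parameterized query would be the safer choice.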
@@ -257,7 +261,7 @@ The Couchbase destination connector allows Airbyte to load data into your Couchb
 - **Permissions**: Assign "Data Reader", "Data Writer", and "Query Manager" roles
 4. Save the credentials

-**Note**: Query Manager role is required for automatic collection and index creation.
+**Note**: Query Manager role is required for automatic collection and index creation. These Database Access credentials are used for cluster connections via the SDK, distinct from Capella API credentials which would be used for Capella management operations.

 #### Ensure Network Access

@@ -425,12 +429,7 @@ For each enabled stream, select the appropriate sync mode combination:
 | Full Refresh | Overwrite | Complete replacement each sync | Mirror source exactly |
 | Full Refresh | Append | Multiple complete snapshots | Historical snapshots |
 | Incremental | Append | All changes tracked | Complete audit trail |
-| Incremental | Append Dedup | Current state maintained | Live replica (recommended) |
-
-**Recommended for Couchbase → Couchbase**:
-- **Incremental | Append Dedup** for most use cases
-- Provides efficient syncing with minimal data transfer
-- Maintains current state of documents
+| Incremental | Append Dedup | Current state maintained | Live replica |

 #### Configure Cursor Field

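The difference between the Append and Append Dedup rows in the table above can be sketched in a few lines of Python. This models the semantics only, under the assumption that `_id` is the primary key and `_ab_cdc_updated_at` is the cursor. It does not reflect how Airbyte or the Couchbase destination implements them, and the function names and sample records are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def apply_append(destination: List[Record], batch: List[Record]) -> List[Record]:
    """Append: every emitted version is kept, so history accumulates."""
    return destination + batch


def apply_append_dedup(destination: Dict[str, Record], batch: List[Record]) -> Dict[str, Record]:
    """Append Dedup: keep one record per primary key, the one with the newest cursor."""
    latest = dict(destination)
    for record in batch:
        current = latest.get(record["_id"])
        if current is None or record["_ab_cdc_updated_at"] >= current["_ab_cdc_updated_at"]:
            latest[record["_id"]] = record
    return latest


# Hypothetical batch in which user::1 was modified twice between syncs.
batch = [
    {"_id": "user::1", "_ab_cdc_updated_at": 100, "name": "Ada"},
    {"_id": "user::2", "_ab_cdc_updated_at": 110, "name": "Bob"},
    {"_id": "user::1", "_ab_cdc_updated_at": 120, "name": "Ada L."},
]

history = apply_append([], batch)
replica = apply_append_dedup({}, batch)
print(len(history), len(replica))
```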
@@ -486,10 +485,6 @@ Predefined intervals:
 - Every 12 hours
 - Every 24 hours

-**Recommendation for Couchbase**:
-- Use incremental syncs with 15-60 minute intervals for near real-time replication
-- Use full refresh sparingly due to resource usage
-
 ### Step 4: Advanced Configuration

 **Connection Name**: Give your connection a descriptive name
@@ -570,7 +565,7 @@ WHERE type = 'airbyte_record'

 ### Pattern 1: Couchbase to Couchbase (Cross-Bucket Replication)

-**Use Case**: Replicate production data to an analytics bucket for reporting without impacting production workload.
+**Use Case**: Replicate production data to a staging environment for testing and development without impacting production workload.

 **Configuration**:

@@ -581,7 +576,7 @@ WHERE type = 'airbyte_record'

 **Destination**:
 - Connection: Same cluster or different cluster
-- Bucket: `analytics`
+- Bucket: `staging`
 - Scope: `replicated`
 - User: Read-write user

@@ -591,23 +586,23 @@ WHERE type = 'airbyte_record'
 **Cursor**: `_ab_cdc_updated_at`

 **Benefits**:
-- Near real-time analytics without production impact
+- Safe testing and development environment
 - Automatic propagation of changes
 - Isolated workloads

 **Example Stream Configuration**:
 ```
 Stream: production.app.users
-Destination Collection: analytics.replicated.production_app_users
+Destination Collection: staging.replicated.production_app_users
 Sync Mode: Incremental | Append Dedup
 Primary Key: [["_id"]]
 ```

-**Query Pattern in Analytics Bucket**:
+**Query Pattern in Staging Bucket**:
 ```sql
 -- Access the original data
 SELECT data.*
-FROM `analytics`.`replicated`.`production_app_users`
+FROM `staging`.`replicated`.`production_app_users`
 WHERE type = 'airbyte_record'
 ```

@@ -631,8 +626,8 @@ WHERE type = 'airbyte_record'
 **Primary Key**: Business key (e.g., `[["data", "order_id"]]`)

 **Benefits**:
-- Join Couchbase data with other data sources
-- Use BI tools (Tableau, Looker) on Couchbase data
+- Enable business intelligence and cross-source analytics
+- Centralized data warehousing
 - Historical trend analysis

 **Transformation Example** (dbt):
@@ -684,48 +679,7 @@ WHERE data.customer_id = $customer_id
 AND type = 'airbyte_record'
 ```

-### Pattern 4: Real-Time Change Tracking
-
-**Use Case**: Maintain an audit log of all changes to critical collections.
-
-**Configuration**:
-
-**Source**:
-- Bucket: `production`
-- Collections: `financial_transactions`, `user_accounts`
-
-**Destination**:
-- Bucket: `audit_log`
-- Scope: `change_tracking`
-
-**Sync Mode**: Incremental | Append (not dedup!)
-**Schedule**: Every 5 minutes
-**Primary Key**: None (to track all versions)
-
-**Benefits**:
-- Complete history of all document changes
-- Compliance and audit requirements
-- Point-in-time recovery capability
-
-**Query Pattern**:
-```sql
--- View all changes to a specific document
-SELECT
-data.*,
-emitted_at,
-TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
-FROM `audit_log`.`change_tracking`.`production_app_user_accounts`
-WHERE data._id = 'user::12345'
-AND type = 'airbyte_record'
-ORDER BY emitted_at DESC
-```
-
-**Storage Consideration**: This pattern will grow continuously. Plan for data lifecycle management:
-- Archive old audit records to object storage
-- Set up data retention policies
-- Monitor bucket size
-
-### Pattern 5: Multi-Environment Sync
+### Pattern 4: Multi-Environment Sync

 **Use Case**: Keep development/staging environments synchronized with production data.

@@ -749,10 +703,8 @@ ORDER BY emitted_at DESC
 - Safe environment for development

 **Security Note**: Implement data masking for sensitive fields:
-```
-Consider using Airbyte's transformation capabilities or custom dbt models
-to mask PII before syncing to non-production environments
-```
+> Consider using Airbyte's transformation capabilities or custom dbt models
+> to mask PII (Personally Identifiable Information, such as names, emails, SSNs) before syncing to non-production environments.

 ## Performance and Best Practices

@@ -783,13 +735,6 @@ ON `bucket`.`scope`.`collection`(TO_NUMBER(meta().xattrs.$document.last_modified
 - Most documents change frequently
 - You need guaranteed consistency

-**Performance Comparison**:
-```
-Full Refresh (100k docs): ~5-10 minutes
-Incremental (1k changes): ~30-60 seconds
-Savings: 83-90% reduction in time
-```
-
 #### 3. Start Date Configuration

 For the initial incremental sync, use start_date to limit the data window:
@@ -876,9 +821,11 @@ Always use `couchbases://` for production:

 The connector uses these timeout settings:
 ```python
+from datetime import timedelta
+
 ClusterTimeoutOptions(
-kv_timeout=5 seconds,
-query_timeout=10 seconds
+kv_timeout=timedelta(seconds=5),
+query_timeout=timedelta(seconds=10)
 )
 ```

@@ -1174,7 +1121,7 @@ Monitor sync health in the Airbyte UI:

 Check existing indexes:
 SELECT * FROM system:indexes
-WHERE keyspace_id = 'collection_name'
+WHERE keyspace_id = 'collection'
 ```

 **Issue**: "No streams discovered"
