Commit 09cef1d

new Replicator metrics; reviewer comments
1 parent 9667d57 commit 09cef1d

11 files changed: +156 -165 lines changed

src/current/_includes/molt/molt-troubleshooting-failback.md

Lines changed: 9 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -12,20 +12,23 @@ This indicates that Replicator is down, the webhook URL is incorrect, or the por
1212

1313
**Resolution:** Verify that MOLT Replicator is running on the port specified in the changefeed `INTO` configuration. Confirm the host and port are correct.
1414

15-
##### Unknown schema error
15+
##### Incorrect schema path errors
16+
17+
This error occurs when the [CockroachDB changefeed]({% link {{ site.current_cloud_version }}/create-and-configure-changefeeds.md %}) webhook URL path does not match the target database schema naming convention:
1618

1719
~~~
1820
transient error: 400 Bad Request: unknown schema:
1921
~~~
2022

21-
This indicates the webhook URL path is incorrectly formatted. Common causes include using the wrong path format for your target database type or incorrect database names.
23+
The webhook URL path is specified in the `INTO` clause when you [create the changefeed](#create-the-cockroachdb-changefeed). For example: `webhook-https://replicator-host:30004/database/schema`.
2224

23-
**Resolution:** Check the webhook URL path mapping:
25+
**Resolution:** Verify the webhook path format matches your target database type:
2426

25-
- **PostgreSQL targets:** Use `/database/schema` format (for example, `/molt_db/public`).
26-
- **MySQL/Oracle targets:** Use `/SCHEMA` format (for example, `/MOLT_DB`). Use only the schema name (for example, `molt` instead of `molt/public`).
27+
- PostgreSQL or CockroachDB targets: Use `/database/schema` format. For example, `webhook-https://replicator-host:30004/migration_schema/public`.
28+
- MySQL targets: Use `/database` format (schema is implicit). For example, `webhook-https://replicator-host:30004/migration_schema`.
29+
- Oracle targets: Use `/DATABASE` format in uppercase. For example, `webhook-https://replicator-host:30004/MIGRATION_SCHEMA`.
2730

28-
Verify that the target database and schema names match the webhook URL.
31+
For details on configuring the webhook sink URI, refer to [Webhook sink]({% link {{ site.current_cloud_version }}/changefeed-sinks.md %}#webhook-sink).
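The path formats above can be sketched as a small shell helper. This is only an illustration: `webhook_uri` and its argument order are hypothetical and not part of MOLT or CockroachDB, and the host, port, and database names are the placeholder values used in the examples above.

```shell
#!/bin/sh
# Sketch: build a changefeed webhook sink URI for a given target type.
# webhook_uri is a hypothetical helper, not a MOLT or CockroachDB command.
# Usage: webhook_uri <target-type> <host> <port> <database> [schema]
webhook_uri() {
  target="$1"; host="$2"; port="$3"; database="$4"; schema="${5:-public}"
  case "$target" in
    postgres|cockroachdb)
      # PostgreSQL or CockroachDB targets: /database/schema
      printf 'webhook-https://%s:%s/%s/%s\n' "$host" "$port" "$database" "$schema" ;;
    mysql)
      # MySQL targets: /database (schema is implicit)
      printf 'webhook-https://%s:%s/%s\n' "$host" "$port" "$database" ;;
    oracle)
      # Oracle targets: /DATABASE in uppercase
      upper="$(printf '%s' "$database" | tr '[:lower:]' '[:upper:]')"
      printf 'webhook-https://%s:%s/%s\n' "$host" "$port" "$upper" ;;
    *)
      echo "unknown target type: $target" >&2
      return 1 ;;
  esac
}
```

For example, `webhook_uri oracle replicator-host 30004 migration_schema` prints a URI ending in `/MIGRATION_SCHEMA`, which can then be used in the changefeed `INTO` clause.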
2932

3033
##### GC threshold error
3134

src/current/_includes/molt/molt-troubleshooting-fetch.md

Lines changed: 5 additions & 1 deletion
@@ -2,10 +2,14 @@
22

33
##### Fetch exits early due to mismatches
44

5-
`molt fetch` exits early in the following cases, and will output a log with a corresponding `mismatch_tag` and `failable_mismatch` set to `true`:
5+
When run in `none` or `truncate-if-exists` mode, `molt fetch` exits early in the following cases and outputs a log entry with the corresponding `mismatch_tag` and `failable_mismatch` set to `true`:
66

77
- A source table is missing a primary key.
88
- A source primary key and target primary key have mismatching types.
9+
{{site.data.alerts.callout_success}}
10+
These restrictions (missing or mismatching primary keys) can be bypassed with [`--skip-pk-check`]({% link molt/molt-fetch.md %}#skip-primary-key-matching).
11+
{{site.data.alerts.end}}
12+
913
- A [`STRING`]({% link {{site.current_cloud_version}}/string.md %}) primary key has a different [collation]({% link {{site.current_cloud_version}}/collate.md %}) on the source and target.
1014
- A source and target column have mismatching types that are not [allowable mappings]({% link molt/molt-fetch.md %}#type-mapping).
1115
- A target table is missing a column that is in the corresponding source table.

src/current/_includes/molt/molt-troubleshooting-replication.md

Lines changed: 16 additions & 20 deletions
@@ -6,14 +6,14 @@ If MOLT Replicator appears hung or performs poorly:
66

77
1. Enable trace logging with `-vv` to get more visibility into the replicator's state and behavior.
88

9-
1. If MOLT Replicator is in an unknown, hung, or erroneous state, collect performance profiles to include with support tickets:
9+
1. If MOLT Replicator is in an unknown, hung, or erroneous state, collect performance profiles to include with support tickets. Replace `{host}` and `{metrics-port}` with your Replicator host and the port specified by `--metricsAddr`:
1010

1111
{% include_cached copy-clipboard.html %}
1212
~~~shell
13-
curl 'localhost:30005/debug/pprof/trace?seconds=15' > trace.out
14-
curl 'localhost:30005/debug/pprof/profile?seconds=15' > profile.out
15-
curl 'localhost:30005/debug/pprof/goroutine?seconds=15' > gr.out
16-
curl 'localhost:30005/debug/pprof/heap?seconds=15' > heap.out
13+
curl '{host}:{metrics-port}/debug/pprof/trace?seconds=15' > trace.out
14+
curl '{host}:{metrics-port}/debug/pprof/profile?seconds=15' > profile.out
15+
curl '{host}:{metrics-port}/debug/pprof/goroutine?seconds=15' > gr.out
16+
curl '{host}:{metrics-port}/debug/pprof/heap?seconds=15' > heap.out
1717
~~~
1818

1919
1. Monitor lag metrics and adjust performance parameters as needed.
@@ -62,7 +62,7 @@ Dropping a replication slot can be destructive and delete data that is not yet r
6262
run CREATE PUBLICATION molt_fetch FOR ALL TABLES;
6363
~~~
6464

65-
**Resolution:** {% if page.name != "migrate-load-replicate.md" %}[Create the publication]({% link molt/migrate-load-replicate.md %}#configure-source-database-for-replication){% else %}[Create the publication](#configure-source-database-for-replication){% endif %} on the source database. Ensure you also create the replication slot:
65+
**Resolution:** Create the publication on the source database. Ensure you also create the replication slot:
6666

6767
{% include_cached copy-clipboard.html %}
6868
~~~ sql
@@ -139,14 +139,18 @@ Interpret the results as follows:
139139

140140
If the GTID is purged or invalid, follow these steps:
141141

142-
1. Increase binlog retention by configuring `binlog_expire_logs_seconds` in MySQL or through your cloud provider:
142+
1. Increase binlog retention by configuring `binlog_expire_logs_seconds` in MySQL:
143143

144144
{% include_cached copy-clipboard.html %}
145145
~~~ sql
146146
-- Increase binlog retention (example: 7 days = 604800 seconds)
147147
SET GLOBAL binlog_expire_logs_seconds = 604800;
148148
~~~
149149

150+
{{site.data.alerts.callout_info}}
151+
For managed MySQL services (such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL), binlog retention is typically configured through the provider's console or CLI. Consult your provider's documentation for how to adjust binlog retention settings.
152+
{{site.data.alerts.end}}
153+
150154
1. Get a current GTID set to restart replication:
151155

152156
{% include_cached copy-clipboard.html %}
@@ -195,10 +199,12 @@ Oracle LogMiner excludes tables and columns with names longer than 30 characters
195199
##### Unsupported data types
196200

197201
LogMiner and replication do not support:
198-
- Long BLOB/CLOBs (4000+ characters)
202+
203+
- Long `BLOB`/`CLOB`s (4000+ characters)
199204
- User-defined types (UDTs)
200205
- Nested tables
201206
- Varrays
207+
- `GEOGRAPHY` and `GEOMETRY`
202208

203209
**Resolution:** Convert unsupported data types or exclude affected tables from replication.
204210

@@ -218,7 +224,7 @@ SQL NULL and JSON null values are not distinguishable in JSON payloads during re
218224

219225
If the Oracle redo log files are too small or do not retain enough history, you may get errors indicating that required log files are missing for a given SCN range, or that a specific SCN is unavailable.
220226

221-
Increase the number and size of online redo log files, and verify that archived log files are being generated and retained correctly in your Oracle environment.
227+
**Resolution:** Increase the number and size of online redo log files, and verify that archived log files are being generated and retained correctly in your Oracle environment.
222228

223229
##### Replicator lag
224230

@@ -244,14 +250,4 @@ WARNING: warning during tryCommit: ERROR: duplicate key value violates unique co
244250
ERROR: maximum number of retries (10) exceeded
245251
~~~
246252

247-
**Resolution:** Check target database constraints and connection stability. MOLT Replicator will log warnings for each retry attempt. If you see warnings but no final error, the apply succeeded after retrying. If all retry attempts are exhausted, Replicator will surface a final error and restart the apply loop to continue processing.
248-
249-
##### Incorrect schema path errors
250-
251-
Schema path mismatches in changefeed URLs:
252-
253-
~~~
254-
transient error: 400 Bad Request: unknown schema:
255-
~~~
256-
257-
**Resolution:** Verify the webhook path matches your target database schema. Use `/database/schema` for CockroachDB/PostgreSQL targets and `/DATABASE` for MySQL/Oracle targets.
253+
**Resolution:** Check target database constraints and connection stability. MOLT Replicator will log warnings for each retry attempt. If you see warnings but no final error, the apply succeeded after retrying. If all retry attempts are exhausted, Replicator will surface a final error and restart the apply loop to continue processing.

src/current/_includes/molt/replicator-flags.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@
4444
| `--taskGracePeriod` | `DURATION` | How long to allow for task cleanup when recovering from errors.<br><br>**Default:** `1m0s` |
4545
| `--timestampLimit` | `INT` | The maximum number of source timestamps to coalesce into a target transaction.<br><br>**Default:** `1000` |
4646
| `--userscript` | `STRING` | The path to a TypeScript configuration script. For example, `--userscript 'script.ts'`. |
47-
| `-v`, `--verbose` | `COUNT` | Increase logging verbosity to `debug`; repeat for `trace`. |
47+
| `-v`, `--verbose` | `COUNT` | Increase logging verbosity. Use `-v` for `debug` logging or `-vv` for `trace` logging. |
4848

4949
### `pglogical` replication flags
5050

src/current/_includes/molt/replicator-metrics.md

Lines changed: 21 additions & 11 deletions
@@ -7,21 +7,31 @@ Cockroach Labs recommends monitoring the following metrics during replication:
77
{% if page.name == "migrate-failback.md" %}
88
| Metric Name | Description |
99
|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
10-
| `source_lag_seconds` | Time between when an incoming resolved MVCC timestamp originated on the source CockroachDB cluster and when it was received by Replicator. |
11-
| `target_lag_seconds` | End-to-end lag from when an incoming resolved MVCC timestamp originated on the source CockroachDB to when all data changes up to that timestamp were written to the target database. |
12-
| `source_lag_seconds_histogram` | Same as `source_lag_seconds` but stored as a histogram for analyzing distributions over time. |
13-
| `target_lag_seconds_histogram` | Same as `target_lag_seconds` but stored as a histogram for analyzing distributions over time. |
14-
| `replicator_applier_mutations_staged` | Number of mutations that have been staged for application to the target database. |
15-
| `replicator_applier_mutations_applied` | Number of mutations that have been successfully applied to the target database. |
10+
| `commit_to_stage_lag_seconds` | Time between when a mutation is written to the source CockroachDB cluster and when it is written to the staging database. |
11+
| `source_commit_to_apply_lag_seconds` | End-to-end lag from when a mutation is written to the source CockroachDB cluster to when it is applied to the target database. |
12+
| `stage_mutations_total` | Number of mutations staged for application to the target database. |
13+
| `apply_conflicts_total` | Number of rows that experienced a compare-and-set (CAS) conflict. |
14+
| `apply_deletes_total` | Number of rows deleted. |
15+
| `apply_duration_seconds` | Length of time it took to successfully apply mutations. |
16+
| `apply_errors_total` | Number of times an error was encountered while applying mutations. |
17+
| `apply_resolves_total` | Number of rows that experienced a compare-and-set (CAS) conflict and were resolved. |
18+
| `apply_upserts_total` | Number of rows upserted. |
19+
| `target_apply_queue_depth` | Number of batches in the target apply queue. Indicates how backed up the applier flow is between receiving changefeed data and applying it to the target database. |
20+
| `target_apply_queue_utilization_percent` | Utilization percentage (0.0-100.0) of the target apply queue capacity. Use this to understand how close the queue is to capacity and to set alerting thresholds for backpressure conditions. |
21+
| `core_parallelism_utilization_percent` | Current utilization percentage of the applier flow parallelism capacity. Shows what percentage of the configured parallelism is actively being used. |
1622
{% else %}
1723
| Metric Name | Description |
1824
|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
19-
| `source_lag_seconds_histogram` | Time between when a source transaction is committed and when its COMMIT transaction log arrives at Replicator. |
20-
| `target_lag_seconds_histogram` | End-to-end lag from when a source transaction is committed to when its changes are fully written to the target CockroachDB. |
21-
| `replicator_applier_mutations_staged` | Number of mutations that have been staged for application to the target database. |
22-
| `replicator_applier_mutations_applied` | Number of mutations that have been successfully applied to the target database. |
25+
| `commit_to_stage_lag_seconds` | Time between when a mutation is written to the source database and when it is written to the staging database. |
26+
| `source_commit_to_apply_lag_seconds` | End-to-end lag from when a mutation is written to the source database to when it is applied to the target CockroachDB. |
27+
| `apply_conflicts_total` | Number of rows that experienced a compare-and-set (CAS) conflict. |
28+
| `apply_deletes_total` | Number of rows deleted. |
29+
| `apply_duration_seconds` | Length of time it took to successfully apply mutations. |
30+
| `apply_errors_total` | Number of times an error was encountered while applying mutations. |
31+
| `apply_resolves_total` | Number of rows that experienced a compare-and-set (CAS) conflict and were resolved. |
32+
| `apply_upserts_total` | Number of rows upserted. |
2333
{% endif %}
2434

25-
You can use the [Replicator Grafana dashboard](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json) to visualize these metrics. <section class="filter-content" markdown="1" data-scope="oracle">For Oracle-specific metrics, import [this Oracle Grafana dashboard](https://replicator.cockroachdb.com/replicator_oracle_grafana_dashboard.json).</section>
35+
You can use the [Replicator Grafana dashboard](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json) to visualize the metrics. <section class="filter-content" markdown="1" data-scope="oracle">For Oracle-specific metrics, import the [Oracle Grafana dashboard](https://replicator.cockroachdb.com/replicator_oracle_grafana_dashboard.json).</section>
2636

2737
To check MOLT Replicator health when metrics are enabled, run `curl http://localhost:30005/_/healthz` (replacing the port with your `--metricsAddr` value). This returns a status code of `200` if Replicator is running.
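The health check can be wrapped in a retry loop, for example when waiting for Replicator to come up before creating a changefeed. The following is a minimal sketch: the function names, the retry count, and the 2-second delay are illustrative assumptions, and the default URL is the example endpoint from the sentence above.

```shell
#!/bin/sh
# Sketch: poll the Replicator health endpoint until it returns HTTP 200.
# Both functions are hypothetical helpers, not part of MOLT Replicator.

# Succeeds only when the given HTTP status code is 200.
replicator_status_ok() {
  [ "$1" = "200" ]
}

# Usage: wait_for_replicator [url] [attempts]
wait_for_replicator() {
  url="${1:-http://localhost:30005/_/healthz}"
  attempts="${2:-10}"   # assumed retry budget
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; curl reports 000
    # when the connection fails.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if replicator_status_ok "$code"; then
      echo "Replicator is healthy at $url"
      return 0
    fi
    echo "attempt $i/$attempts: status ${code:-none}, retrying" >&2
    sleep 2
    i=$((i + 1))
  done
  return 1
}
```

Calling `wait_for_replicator 'http://localhost:30005/_/healthz' 5` returns a nonzero exit status if the endpoint never responds with `200`, so it can gate later steps in a script.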
