Commit b6e52fc

[Bug][Master/Worker] Fix reconnect to registry might cause task duplicate running
1 parent 8340f3f commit b6e52fc

39 files changed: 297 additions & 588 deletions

docs/docs/en/architecture/configuration.md

Lines changed: 0 additions & 1 deletion
@@ -295,7 +295,6 @@ Location: `master-server/conf/application.yaml`
 | master.server-load-protection.max-disk-usage-percentage-thresholds | 0.7 | Master max disk usage , when the master's disk usage is smaller then this value, master server can execute workflow. |
 | master.failover-interval | 10 | failover interval, the unit is minute |
 | master.kill-application-when-task-failover | true | whether to kill yarn/k8s application when failover taskInstance |
-| master.registry-disconnect-strategy.strategy | stop | Used when the master disconnect from registry, default value: stop. Optional values include stop, waiting |
 | master.registry-disconnect-strategy.max-waiting-time | 100s | Used when the master disconnect from registry, and the disconnect strategy is waiting, this config means the master will waiting to reconnect to registry in given times, and after the waiting times, if the master still cannot connect to registry, will stop itself, if the value is 0s, the Master will wait infinitely |
 | master.worker-group-refresh-interval | 10s | The interval to refresh worker group from db to memory |
 | master.command-fetch-strategy.type | ID_SLOT_BASED | The command fetch strategy, only support `ID_SLOT_BASED` |

docs/docs/en/guide/upgrade/incompatible.md

Lines changed: 1 addition & 0 deletions
@@ -32,4 +32,5 @@ This document records the incompatible updates between each version. You need to
 * Remove the `Pigeon` from the `Task Plugin` ([#16218])(https://github.com/apache/dolphinscheduler/pull/16218)
 * Uniformly name `process` in code as `workflow` ([#16515])(https://github.com/apache/dolphinscheduler/pull/16515)
 * Deprecated upgrade code of 1.x and 2.x in 3.3.0-release ([#16543])(https://github.com/apache/dolphinscheduler/pull/16543)
+* Remove the `registry-disconnect-strategy` in `application.yaml` ([#16821])(https://github.com/apache/dolphinscheduler/pull/16821)
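The incompatibility note above means an upgraded deployment may still have `registry-disconnect-strategy` entries in its `application.yaml`. The sketch below shows one way to flag them before upgrading. It is not part of DolphinScheduler: the class `RemovedConfigCheck` and its method are hypothetical, and it assumes the YAML has already been flattened into dotted keys (for example `master.registry-disconnect-strategy.strategy`), as in the configuration tables touched by this commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical upgrade helper, NOT part of DolphinScheduler.
 * Scans flattened configuration keys for the removed
 * registry-disconnect-strategy block.
 */
final class RemovedConfigCheck {

    // Prefixes taken from the master/worker configuration tables in the docs diff.
    private static final List<String> REMOVED_PREFIXES = Arrays.asList(
            "master.registry-disconnect-strategy",
            "worker.registry-disconnect-strategy");

    private RemovedConfigCheck() {
    }

    /** Returns, in sorted order, the keys that fall under a removed prefix. */
    static List<String> findRemovedKeys(Map<String, String> flatConfig) {
        List<String> hits = new ArrayList<>();
        for (String key : flatConfig.keySet()) {
            for (String prefix : REMOVED_PREFIXES) {
                // Match the prefix itself or any key nested below it.
                if (key.equals(prefix) || key.startsWith(prefix + ".")) {
                    hits.add(key);
                    break;
                }
            }
        }
        Collections.sort(hits); // deterministic report order
        return hits;
    }
}
```

Calling `RemovedConfigCheck.findRemovedKeys(flatConfig)` on a flattened configuration returns the keys that should be deleted; an empty list means nothing from the removed block is left.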

docs/docs/zh/architecture/configuration.md

Lines changed: 22 additions & 25 deletions
@@ -277,30 +277,28 @@ common.properties配置文件目前主要是配置hadoop/s3/yarn/applicationId

 位置:`master-server/conf/application.yaml`

-| 参数 | 默认值 | 描述 |
-|-----------------------------------------------------------------------------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
-| master.listen-port | 5678 | master监听端口 |
-| master.pre-exec-threads | 10 | master准备执行任务的数量,用于限制并行的command |
-| master.exec-threads | 100 | master工作线程数量,用于限制并行的流程实例数量 |
-| master.dispatch-task-number | 3 | master每个批次的派发任务数量 |
-| master.worker-load-balancer-configuration-properties.type | DYNAMIC_WEIGHTED_ROUND_ROBIN | Master 将会使用Worker的动态CPU/Memory/线程池使用率来计算Worker的负载,负载越低的worker将会有更高的机会被分发任务 |
-| master.max-heartbeat-interval | 10s | master最大心跳间隔 |
-| master.task-commit-retry-times | 5 | 任务重试次数 |
-| master.task-commit-interval | 1000 | 任务提交间隔,单位为毫秒 |
-| master.state-wheel-interval | 5 | 轮询检查状态时间 |
-| master.server-load-protection.enabled | true | 是否开启系统保护策略 |
-| master.server-load-protection.max-system-cpu-usage-percentage-thresholds | 0.7 | master最大系统cpu使用值,只有当前系统cpu使用值低于最大系统cpu使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统CPU |
-| master.server-load-protection.max-jvm-cpu-usage-percentage-thresholds | 0.7 | master最大JVM cpu使用值,只有当前JVM cpu使用值低于最大JVM cpu使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的JVM CPU |
-| master.server-load-protection.max-system-memory-usage-percentage-thresholds | 0.7 | master最大系统 内存使用值,只有当前系统内存使用值低于最大系统内存使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统内存 |
-| master.server-load-protection.max-disk-usage-percentage-thresholds | 0.7 | master最大系统磁盘使用值,只有当前系统磁盘使用值低于最大系统磁盘使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统磁盘空间 |
-| master.failover-interval | 10 | failover间隔,单位为分钟 |
-| master.kill-application-when-task-failover | true | 当任务实例failover时,是否kill掉yarn或k8s application |
-| master.registry-disconnect-strategy.strategy | stop | 当Master与注册中心失联之后采取的策略, 默认值是: stop. 可选值包括: stop, waiting |
-| master.registry-disconnect-strategy.max-waiting-time | 100s | 当Master与注册中心失联之后重连时间, 之后当strategy为waiting时,该值生效。 该值表示当Master与注册中心失联时会在给定时间之内进行重连, 在给定时间之内重连失败将会停止自己,在重连时,Master会丢弃目前正在执行的工作流,值为0表示会无限期等待 |
-| master.master.worker-group-refresh-interval | 10s | 定期将workerGroup从数据库中同步到内存的时间间隔 |
-| master.command-fetch-strategy.type | ID_SLOT_BASED | Command拉取策略, 目前仅支持 `ID_SLOT_BASED` |
-| master.command-fetch-strategy.config.id-step | 1 | 数据库中t_ds_command的id自增步长 |
-| master.command-fetch-strategy.config.fetch-size | 10 | master拉取command数量 |
+| 参数 | 默认值 | 描述 |
+|-----------------------------------------------------------------------------|------------------------------|-----------------------------------------------------------------------------------------|
+| master.listen-port | 5678 | master监听端口 |
+| master.pre-exec-threads | 10 | master准备执行任务的数量,用于限制并行的command |
+| master.exec-threads | 100 | master工作线程数量,用于限制并行的流程实例数量 |
+| master.dispatch-task-number | 3 | master每个批次的派发任务数量 |
+| master.worker-load-balancer-configuration-properties.type | DYNAMIC_WEIGHTED_ROUND_ROBIN | Master 将会使用Worker的动态CPU/Memory/线程池使用率来计算Worker的负载,负载越低的worker将会有更高的机会被分发任务 |
+| master.max-heartbeat-interval | 10s | master最大心跳间隔 |
+| master.task-commit-retry-times | 5 | 任务重试次数 |
+| master.task-commit-interval | 1000 | 任务提交间隔,单位为毫秒 |
+| master.state-wheel-interval | 5 | 轮询检查状态时间 |
+| master.server-load-protection.enabled | true | 是否开启系统保护策略 |
+| master.server-load-protection.max-system-cpu-usage-percentage-thresholds | 0.7 | master最大系统cpu使用值,只有当前系统cpu使用值低于最大系统cpu使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统CPU |
+| master.server-load-protection.max-jvm-cpu-usage-percentage-thresholds | 0.7 | master最大JVM cpu使用值,只有当前JVM cpu使用值低于最大JVM cpu使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的JVM CPU |
+| master.server-load-protection.max-system-memory-usage-percentage-thresholds | 0.7 | master最大系统 内存使用值,只有当前系统内存使用值低于最大系统内存使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统内存 |
+| master.server-load-protection.max-disk-usage-percentage-thresholds | 0.7 | master最大系统磁盘使用值,只有当前系统磁盘使用值低于最大系统磁盘使用值,master服务才能调度任务. 默认值为0.7: 会使用70%的操作系统磁盘空间 |
+| master.failover-interval | 10 | failover间隔,单位为分钟 |
+| master.kill-application-when-task-failover | true | 当任务实例failover时,是否kill掉yarn或k8s application |
+| master.master.worker-group-refresh-interval | 10s | 定期将workerGroup从数据库中同步到内存的时间间隔 |
+| master.command-fetch-strategy.type | ID_SLOT_BASED | Command拉取策略, 目前仅支持 `ID_SLOT_BASED` |
+| master.command-fetch-strategy.config.id-step | 1 | 数据库中t_ds_command的id自增步长 |
+| master.command-fetch-strategy.config.fetch-size | 10 | master拉取command数量 |

 ## Worker Server相关配置

@@ -320,7 +318,6 @@ common.properties配置文件目前主要是配置hadoop/s3/yarn/applicationId
 | worker.server-load-protection.max-disk-usage-percentage-thresholds | 0.7 | worker最大系统磁盘使用值,只有当前系统磁盘使用值低于最大系统磁盘使用值,worker服务才能接收任务. 默认值为0.7: 会使用70%的操作系统磁盘空间 |
 | worker.alert-listen-host | localhost | alert监听host |
 | worker.alert-listen-port | 50052 | alert监听端口 |
-| worker.registry-disconnect-strategy.strategy | stop | 当Worker与注册中心失联之后采取的策略, 默认值是: stop. 可选值包括: stop, waiting |
 | worker.registry-disconnect-strategy.max-waiting-time | 100s | 当Worker与注册中心失联之后重连时间, 之后当strategy为waiting时,该值生效。 该值表示当Worker与注册中心失联时会在给定时间之内进行重连, 在给定时间之内重连失败将会停止自己,在重连时,Worker会丢弃kill正在执行的任务。值为0表示会无限期等待 |
 | worker.task-execute-threads-full-policy | REJECT | 如果是 REJECT, 当Worker中等待队列中的任务数达到exec-threads时, Worker将会拒绝接下来新接收的任务,Master将会重新分发该任务; 如果是 CONTINUE, Worker将会接收任务,放入等待队列中等待空闲线程去执行该任务 |
 | worker.tenant-config.auto-create-tenant-enabled | true | 租户对应于系统的用户,由worker提交作业.如果系统没有该用户,则在参数worker.tenant.auto.create为true后自动创建。 |

docs/docs/zh/guide/upgrade/incompatible.md

Lines changed: 1 addition & 0 deletions
@@ -30,4 +30,5 @@
 * 在 `任务插件` 中移除了 `Pigeon` 类型 ([#16218])(https://github.com/apache/dolphinscheduler/pull/16218)
 * 统一代码中的 `process` 为 `workflow` ([#16515])(https://github.com/apache/dolphinscheduler/pull/16515)
 * 在 3.3.0-release 中废弃了从 1.x 至 2.x 的升级代码 ([#16543])(https://github.com/apache/dolphinscheduler/pull/16543)
+* 在`application.yaml`中移除`registry-disconnect-strategy`配置 ([#16821])(https://github.com/apache/dolphinscheduler/pull/16821)

dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/cluster/BaseServerMetadata.java

Lines changed: 5 additions & 0 deletions
@@ -20,12 +20,17 @@
 import org.apache.dolphinscheduler.common.enums.ServerStatus;

 import lombok.Data;
+import lombok.ToString;
 import lombok.experimental.SuperBuilder;

 @Data
+@ToString
 @SuperBuilder
 public abstract class BaseServerMetadata implements IClusters.IServerMetadata {

+    // The server startup time in milliseconds.
+    private final long serverStartupTime;
+
     private final String address;

     private final double cpuUsage;
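This hunk adds `serverStartupTime` to the server metadata but does not show how the field is consumed. One plausible use, offered here purely as an assumption and not as the commit's actual logic, is to tell a reconnect of the same process apart from a restart at the same address. The class `ServerIdentity` below is a hypothetical, Lombok-free stand-in; only the `address` and `serverStartupTime` fields mirror the diff.

```java
import java.util.Objects;

/**
 * Hypothetical stand-in for server metadata; NOT the DolphinScheduler class.
 * Only `address` and `serverStartupTime` mirror fields shown in the diff.
 */
final class ServerIdentity {

    private final String address;
    // Startup time in milliseconds, matching the comment in the diff.
    private final long serverStartupTime;

    ServerIdentity(String address, long serverStartupTime) {
        this.address = Objects.requireNonNull(address, "address");
        this.serverStartupTime = serverStartupTime;
    }

    String getAddress() {
        return address;
    }

    long getServerStartupTime() {
        return serverStartupTime;
    }

    /** Assumed rule: same address and same startup time means the same process. */
    boolean isSameProcessAs(ServerIdentity other) {
        return address.equals(other.address)
                && serverStartupTime == other.serverStartupTime;
    }

    /**
     * Assumed rule: same address with a later startup time means the process
     * at that address restarted after `previous` was observed.
     */
    boolean isRestartOf(ServerIdentity previous) {
        return address.equals(previous.address)
                && serverStartupTime > previous.serverStartupTime;
    }
}
```

Under these assumed rules, a server that reappears in the registry with an unchanged startup time can be treated as the same process, while a later startup time signals a restart.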
