Low import speed using TiSpark. How to optimize cluster performance and increase speed? #66452

sykp241095 · 2026-02-26T03:21:14Z

Original author: aaarnell
Original time: 2023-06-22 07:59:25.889351
Original ID/Slug: 605 / low-import-speed-using-tispark-how-to-optimize-cluster-performance-and-increase-speed

TiDB version:

v7.1.0

Problem:

Data import using TiSpark is slow, possibly because of a suboptimal cluster configuration.

Resource allocation:

4 hosts. Characteristics of each:

2 Intel Xeon 2.2 GHz CPUs, 40 cores, 80 threads
768 GB RAM
10 HDDs of 5.5 TB each
2 network interface cards of 10 Gbit/s each

Cluster Configuration:
Host 1: 3 PD
Host 2: 1 TiDB, 5 TiKV, 1 TiFlash
Host 3: 1 TiDB, 4 TiKV, 1 TiFlash
Host 4: 1 TiDB, 4 TiKV, 1 TiFlash

The topology of the deployed TiDB cluster is in the attached cluster_template.yaml file.
cluster_template.yaml (8.8 KB)

Detailed description of the problem:

We are importing data from another system with TiSpark, following the instructions, but the import is slow.

For example, the customer table from the TPC-DS test suite occupies about 30 GB in text format and has 65,000,000 rows. With the current TiDB cluster configuration and 3 Spark executors, importing the table takes about 30 minutes, which is roughly 17 MB/s or 36,000 rows/s.
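The throughput figures above can be checked with a quick calculation. This is only a sanity check of the quoted numbers; it assumes "30 GB" means 30 × 1000 MB and that the import took exactly 30 minutes.

```python
# Sanity check of the throughput figures quoted in the post.
# Assumption: "30 GB" is read as 30 * 1000 MB (decimal units),
# and the import time is taken as exactly 30 minutes.
table_size_mb = 30 * 1000
row_count = 65_000_000
import_seconds = 30 * 60

mb_per_sec = table_size_mb / import_seconds
rows_per_sec = row_count / import_seconds

print(f"{mb_per_sec:.1f} MB/s, {rows_per_sec:,.0f} rows/s")
```

Both results round to the figures given in the post (about 17 MB/s and about 36,000 rows/s).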

How can I increase this speed? Perhaps it is worth optimizing the cluster configuration?

sykp241095 · 2026-02-26T03:21:16Z

Original author: TiDB-AI-Assistant
Original time: 2023-06-22 07:59:38.265274

Based on the information provided, it seems that the cluster configuration is not the bottleneck for the slow import speed. The hardware resources allocated to the cluster are more than sufficient for the import task.

One possible reason for the slow import speed could be the default configuration of TiSpark. By default, TiSpark uses a small number of Spark partitions, which may not be enough to fully utilize the available cluster resources.

To increase the import speed, you can try adjusting the following TiSpark configuration parameters:

spark.sql.shuffle.partitions: This parameter controls the number of Spark partitions used for shuffling data during the import process. Increasing this value can help improve the import speed. You can try setting it to a value that is equal to or greater than the number of TiKV regions in your cluster.
spark.default.parallelism: This parameter controls the default number of Spark partitions used for parallel processing. You can try setting it to a value that is equal to or greater than the number of CPU cores in your cluster.

You can set these parameters in the spark-defaults.conf file or pass them as command-line arguments when submitting the TiSpark job.
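As a rough illustration of the suggestion above, the following sketch derives the two values and renders them as `--conf` flags for `spark-submit`. Several inputs are assumptions rather than facts from the post: the core count is read as 40 cores on each of the 4 hosts, `region_count` is a placeholder that should be replaced with the real number of TiKV regions, and `import_job.py` is a hypothetical name for the import script.

```python
import shlex

# Assumptions (not confirmed by the original post):
# - 40 cores on each of the 4 hosts, read from the hardware list
# - region_count is a placeholder; use the real TiKV region count
# - import_job.py is a hypothetical name for the TiSpark import script
HOSTS = 4
CORES_PER_HOST = 40
region_count = 1000


def suggest_spark_conf(total_cores: int, regions: int) -> dict:
    """Apply the rules of thumb from the reply above."""
    return {
        # at least one shuffle partition per TiKV region
        "spark.sql.shuffle.partitions": regions,
        # at least one default partition per CPU core
        "spark.default.parallelism": total_cores,
    }


def spark_submit_command(conf: dict, app: str) -> str:
    """Render a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return shlex.join(args)


conf = suggest_spark_conf(HOSTS * CORES_PER_HOST, region_count)
print(spark_submit_command(conf, "import_job.py"))
```

The same keys can instead be placed in spark-defaults.conf, one per line, with the key and value separated by whitespace. These values are starting points to measure against, not tuned recommendations.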

Additionally, you can try using TiDB Lightning for data import instead of TiSpark. TiDB Lightning is a standalone tool that can import data into TiDB at a much faster speed than TiSpark. You can refer to the documentation for more information on how to use TiDB Lightning.
