
[Feature]: Do we have any plans to support Spark clusters in Testcontainers?  #7657

@Rembrant777

Description

Module

None

Problem

Hello everyone,

I'm a newcomer to Testcontainers. Due to my job requirements, I frequently need to develop Spark applications. For me, the most time-consuming part of working on a Spark application, from design and development through debugging and deployment, is having to recompile the code and generate a JAR file every time I change the logic, then submit it to the cluster and wait for the results.

While Spark provides relatively robust JUnit test support, setting the master to local doesn't truly replicate the issues that arise in a distributed environment. For example, a job may encounter data skew when consuming from a Kafka cluster with more than 3 partitions, and if I want to develop Spark Shuffle components further, the existing JUnit test cases don't cover the potential problems of a distributed deployment.
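For context, the local-master test pattern mentioned above usually looks like the following sketch (assuming the `spark-sql` Java dependency on the classpath). Everything runs in a single JVM, so distributed failure modes such as skew across real executors or shuffle fetch failures are never exercised:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LocalModeExample {
    public static void main(String[] args) {
        // master = local[2]: driver and executors share one JVM, so
        // distributed issues (data skew across nodes, cross-node
        // serialization, shuffle failures) never surface in the test.
        SparkSession spark = SparkSession.builder()
                .appName("local-mode-test")
                .master("local[2]")
                .getOrCreate();

        Dataset<Row> df = spark.range(100).toDF("id");
        System.out.println(df.count()); // prints 100
        spark.stop();
    }
}
```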

So when I first added Testcontainers to my JUnit environment and ran it, I was very excited about the convenience it provides; it really let me focus on the inner logic of the code.

That's why I wonder whether the Testcontainers team plans to support Spark in the future, for example on YARN, Mesos, or even Kubernetes?

While I am aware that AWS and Azure provide robust managed Spark solutions through Databricks and related serverless services, I still believe there is a pressing need to make heavy-duty computing frameworks like Spark and Flink approachable for beginners and application developers. I also think I'm certainly not the only one who has lost development efficiency to environment issues.

If the solution is feasible, I will actively participate in building this feature. I'm looking forward to your reply and your plans in this direction.

Solution

After reading the implementation of KafkaContainerCluster, I believe its container construction approach is quite similar to how a Spark cluster is deployed: in KafkaContainerCluster, a cluster is built from one Zookeeper container and three KafkaContainers.
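Following that analogy, a hypothetical `SparkContainerCluster` could wire one master and N workers onto a shared network with the existing `GenericContainer` and `Network` APIs. This is only a sketch, not an existing Testcontainers module; the `bitnami/spark` image and its `SPARK_MODE` / `SPARK_MASTER_URL` environment variables are assumptions about one possible base image:

```java
import java.util.ArrayList;
import java.util.List;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.utility.DockerImageName;

// Hypothetical sketch of a Spark standalone cluster, mirroring the
// KafkaContainerCluster structure (one coordinator + N members).
// Assumes the bitnami/spark image, configured via SPARK_MODE and
// SPARK_MASTER_URL environment variables.
public class SparkContainerCluster {
    private static final DockerImageName IMAGE =
            DockerImageName.parse("bitnami/spark:3.5");

    private final Network network = Network.newNetwork();
    private final GenericContainer<?> master;
    private final List<GenericContainer<?>> workers = new ArrayList<>();

    public SparkContainerCluster(int workerCount) {
        master = new GenericContainer<>(IMAGE)
                .withNetwork(network)
                .withNetworkAliases("spark-master")
                .withEnv("SPARK_MODE", "master")
                .withExposedPorts(7077, 8080); // cluster port + web UI

        for (int i = 0; i < workerCount; i++) {
            workers.add(new GenericContainer<>(IMAGE)
                    .withNetwork(network)
                    .withNetworkAliases("spark-worker-" + i)
                    .withEnv("SPARK_MODE", "worker")
                    .withEnv("SPARK_MASTER_URL", "spark://spark-master:7077"));
        }
    }

    public void start() {
        master.start(); // master must be reachable before workers register
        workers.forEach(GenericContainer::start);
    }

    public void stop() {
        workers.forEach(GenericContainer::stop);
        master.stop();
    }

    public GenericContainer<?> getMaster() {
        return master;
    }
}
```

A test could then start the cluster in a `@BeforeAll` hook and submit jobs against `spark://spark-master:7077`, just as KafkaContainerCluster tests point clients at the bootstrap servers.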

Following the same approach for Spark's standalone deployment mode, we can take that folder's docker-compose.yml and Dockerfile(s) as a reference and deploy the different components into separate containers to achieve a cluster deployment. (This solution may not be fully mature yet, and the details need more in-depth discussion.)
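As a concrete starting point, a standalone-mode cluster along those lines might look like the following docker-compose.yml sketch. The `bitnami/spark` image and its environment variables are assumptions here; a dedicated Testcontainers module would presumably bake this wiring into container definitions instead of a compose file:

```yaml
# Hypothetical standalone-mode cluster: one master, two workers.
version: "3"
services:
  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # cluster manager port
  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    deploy:
      replicas: 2
```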

Benefit

For Spark Beginners:

  1. Simplify the process of running Spark code for beginners.

For Spark application developers:

  1. Require minimal resources and datasets to validate the data processing logic of their Spark applications, enabling debugging and optimization in a local environment.

  2. Reduce the compilation time of the Spark application JAR file after each code modification and minimize the time spent on submitting the Spark application to a remote or local cluster for execution.

  3. Enable quick regression testing of Spark application functionality improvements through JUnit test cases, reducing the need for additional maintenance costs in DevOps for operations personnel (as the setup is already established at the JUnit test level).

For Spark Infra developers:

  1. Some of the underlying logic, which originally required the development of a MockServer, can be completely achieved through the use of testcontainers. This saves a significant amount of development time for creating mock cases.

Alternatives

None, since this would be a new module. As far as I know, it will not affect other modules.

Would you like to help contributing this feature?

Yes
