Skip to content

Commit 89a2add

Browse files
committed
Merge remote-tracking branch 'upstream/master' into ali_and_weibo_case_study
Signed-off-by: chenqiming <[email protected]>
2 parents 2e2cbb1 + 9e59e58 commit 89a2add

File tree

11 files changed

+68
-24
lines changed

11 files changed

+68
-24
lines changed

docs/case-study/alibaba-case-study.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
sidebar_label: Alibaba's Case Study
3-
sidebar_position: 2
3+
sidebar_position: 1
44
---
55

66
# Alibaba's Case Study
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
sidebar_label: Metabit Trading's Case Study
3+
sidebar_position: 3
4+
---
5+
6+
# Metabit Trading's Case Study
7+
to be updated.

docs/core-concepts/architecture-and-concepts.md

Lines changed: 36 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -11,16 +11,21 @@ sidebar_position: 3
1111
Fluid is built in the Kubernetes native fashion. It lies between existing underlying cloud native storage systems and the upper layer data-intensive applications. The architecture of Fluid in Kubernetes is as following:
1212

1313
![](../../static/img/docs/core-concepts/architecture.png)
14-
Specifically, Fluid is logically split into a data plane and a control plane.
14+
Specifically, Fluid's architecture is logically split into **control plane** and **data plane**. The following diagram shows the different components.
1515

16-
- The control plane is composed of **Dataset/Runtime Controller** and **Application Manager**
17-
- **Dataset/Runtime Controller:** It manages the datasets and also automates the data operations of the dataset, like data load, data migrate, data process and so on.
18-
- **Application Manager**: It is responsible for scheduling the workload pods according to cache location and managing their life cycles.
19-
- The data plane is composed of Runtime Plugin and CSI Plugin:
20-
- **Runtime Plugin**: As a highly extensible plugin, it can help turn various data cache engines into self-managing, self-scaling, self-healing and observable cache services inside of Kubernetes by providing the common framework of Fluid.
21-
- **Data Access Plugin**: It is responsible for managing different kinds of storage clients in container mode in the same manner. It supports both CSI Plugin and sidecar mode to run FUSE containers.
16+
- **Control Plane**:The control plane is composed of **Dataset/Runtime Operator** and **Application Manager**
17+
- **Dataset/Runtime Operator**: Responsible for the scheduling and orchestration of datasets and their supporting runtimes in Kubernetes. This includes scheduling, migration, and elastic scaling of the runtime for datasets, as well as automated operations for dataset support, such as fine-grained data preheating, such as specifying preheating for a specific folder; controlling metadata backup and recovery to improve data access performance for scenarios with massive small files; and setting pinning policies for cached data to avoid performance fluctuations caused by data eviction.
18+
19+
- **Application Manager**: Responsible for the scheduling and operation of application Pods that use datasets, which is divided into two core components: the Scheduler and the Webhook.
20+
- Scheduler: schedule application Pods that use datasets in the Kubernetes cluster. By incorporating cached information obtained from the Runtime, Pods that use datasets are preferentially scheduled to nodes that have data caching, without the need for users to specify caching nodes.
21+
22+
- Sidecar Webhook: For Kubernetes environments where the csi-plugin cannot be run, the Sidecar webhook automatically replaces the PVC with a FUSE sidecar and controls the startup order of containers in the Pod to ensure that the FUSE container starts first.
2223

23-
The following diagram shows the different components.
24+
25+
- **Data Plane**:The data plane is composed of Runtime Plugin and CSI Plugin.
26+
- **Runtime Plugin**: As a highly extensible plugin, it can support various data access engines. Fluid achieves this by abstracting some common features, such as the use of cache media, quotas, directories, etc., making it extensible with different distributed cache engine implementation technologies. For example, the AlluxioRuntime uses a Master-Slave architecture, while the JuiceFSRuntime uses a Worker P2P architecture, both of which can be configured in the CRD of the Runtime. This plugin not only supports specific Runtimes like Alluxio and JuiceFS, but also supports a generic ThinRuntime, enabling users to access generic storage without the need for development.
27+
28+
- **CSI Plugin**: The storage client is started in a containerized manner, completely decoupled from the business container. Upgrading the CSI plugin will not affect the business container, and it also supports deploying multiple versions of the storage client in the same Kubernetes cluster. Running the client independently in a Pod also provides observability within the Kubernetes system. Additionally, resource quotas can be set for the client's computing resources.
2429

2530
![](../../static/img/docs/core-concepts/components.png)
2631

@@ -32,13 +37,30 @@ For achieving its goals, Fluid provides some core concepts.
3237
* Same as native Kubernetes API definitions, including CRDs
3338
* Users describe the data’s source, type, access mode and cache location
3439
* Users can use observability to make scaling decisions of distributed cache
40+
![](../../static/img/docs/core-concepts/dataset.png)
41+
42+
Dataset management has multiple dimensions, including security, version control, and data acceleration. We aim to provide support for dataset management with a focus on data acceleration. For example, we support aggregation of data from different storage sources, portability, and data features.
43+
44+
* **Data Source**: Supports multiple data sources with different protocols, including HDFS, S3, OSS, and the native Kubernetes Persistent Volume Claim protocol. Multiple data sources can also be mounted under different subdirectories in a unified namespace.
45+
* **Placement Policy**: cached dataset on nodes of different types using the strong and weak affinity and toleration of the nodeAffinity in Kubernetes semantics.
46+
![](../../static/img/docs/core-concepts/dataset-yaml.png)
47+
48+
At the same time, Dataset provides observability, such as how much data is in the dataset, how much cache space is currently available, and what the cache hit rate is. Users can use this information to decide whether to scale up or down.
49+
50+
**Runtime**: Dataset is a unified abstract concept, and the actual data operations are implemented by specific runtimes. Due to the differences in storage, there are different runtime interfaces. The introduction of runtime is necessary for accessing the data. The API specification here can be defined relatively flexibly, but the lifecycle of the runtime is defined by Fluid in a unified manner, and the implementer of the runtime needs to complete the specific implementation according to the common interface definition. The Runtime enforces dataset isolation/share, provides version management, and enables data acceleration by defining a set of interfaces to handle DataSets throughout their lifecycle, allowing for the implementation of management and acceleration functionalities behind these interfaces. Fluid has two kind of Runtime: CacheRuntime and ThinRuntime.
3551

36-
**Runtime**: The Runtime enforces dataset isolation/share, provides version management, and enables data acceleration by defining a set of interfaces to handle DataSets throughout their lifecycle, allowing for the implementation of management and acceleration functionalities behind these interfaces. Fluid has two kind of Runtime: CacheRuntime and ThinRuntime.
3752
* CacheRuntime, which implements distributed caching solutions including Alluxio, JuiceFS, Vineyard and others
53+
3854
* ThinRuntime, that provides a unified access interface to systems like CubeFS, GlusterFS, NFS and others.
55+
![](../../static/img/docs/core-concepts/dataset-kubectl.png)
56+
57+
58+
59+
**Data Operations**: Unlike traditional PVC-based storage abstraction, Fluid takes an Application-oriented perspective to abstract the “process of manipulating data on Kubernetes”. It introduces the concept of elastic Dataset and implements it as a first-class citizen in Kubernetes to enable Dataset CRUD operation, permission control, and data access acceleration. Besides the basic operations like creation, Fluid also provides a set of operations for the defined Dataset for users to manipulate the data flow, such as data prefetch, data migration, elastic scaling, cache cleaning, metadata backup, and recovery.
60+
61+
* Data Prefetch: The directory to be prefetched and the preheating strategy can be one-time, scheduled, or event-triggered can be specified.
62+
* Data Migration: Supports both importing data from external storage into a dataset before using it, and using a dataset while importing data into it.
63+
* Data Process: Support transform, split, applying dimensionality reduction to data
64+
distributed cache scale up and down.
3965

40-
**Data Operations**: Unlike traditional PVC-based storage abstraction, Fluid takes an Application-oriented perspective to abstract the “process of manipulating data on Kubernetes”. It introduces the concept of elastic Dataset and implements it as a first-class citizen in Kubernetes to enable Dataset CRUD operation, permission control, and data access acceleration. Besides the basic operations like creation, Fluid also provides a set of operations for the defined Dataset for users to manipulate the data flow.
41-
* Data Load prefetches data from dataset source to cache system.
42-
* Data Migration syncs data between external storages and dataset .
43-
* Data Process can be used to transform, split, applying dimensionality reduction to data
44-
Distributed cache scale up and down.
66+
![](../../static/img/docs/core-concepts/oprations.png)

docs/core-concepts/what-is-fluid.md

Lines changed: 24 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -6,31 +6,46 @@ slug: /
66

77
# What is Fluid
88

9-
Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications. It is hosted by the [Cloud Native Computing Foundation](https://cncf.io) (CNCF) as a sandbox project.
10-
9+
Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications. It is hosted by the [Cloud Native Computing Foundation](https://cncf.io) (CNCF) as a sandbox project. Fluid is can convert distributed caching systems (such as Alluxio and JuiceFS) into observable caching services with self-management, elastic scaling, and self-healing capabilities, and it does so by supporting dataset operations. At the same time, through the data caching location information, Fluid can provide data-affinity scheduling for applications using datasets.
1110

1211
## Target Scenario and Values
1312

1413
In the treand of computation and stroage separation, the goal of Fluid is to enable AI/Big Data Applications to use data from any storage more efficiently with a high-level abstraction manner and without changes to the applications themselves.
1514

16-
Through the data abstraction layer powered by Fluid on Kubernetes, the data will just be like
17-
the fluid, waving across the storage sources(such as HDFS, OSS, Ceph) and the cloud native applications on Kubernetes. It can be moved, copied, evicted, transformed and managed flexibly. Besides, All the data operations are transparent to users. Users do not need to worry about the efficiency of remote data access nor the convenience of data source management. User only need to access the data abstracted from the Kubernetes native data volume, and all the left tasks and details are handled by Fluid.
15+
Unlike traditional PVC-based storage abstraction, Fluid takes an Application-oriented perspective to abstract the “process of using data on Kubernetes”. It introduces the concept of elastic Dataset and implements it as a first-class citizen in Kubernetes to enable Dataset CRUD operation, permission control, and access acceleration.
16+
17+
![](../../static/img/docs/core-concepts/perspective.png)
18+
19+
Through the data abstraction layer powered by Fluid on Kubernetes, the data will just be like the fluid, waving across the storage sources (such as HDFS, OSS, Ceph) and the cloud native applications on Kubernetes. It can be moved, copied, evicted, transformed and managed flexibly. Besides, All the data operations are transparent to users. Users do not need to worry about the efficiency of remote data access nor the convenience of data source management. User only need to access the data abstracted from the Kubernetes native data volume, and all the left tasks and details are handled by Fluid.
1820

19-
Fluid aims to turn different distributed cache systems(Alluxio, JuiceFS, Vineyard, CubeFS and so on) into self-managing, self-scaling, self-healing and observable cache services inside of Kubernetes by providing the common framework of Fluid.
21+
Fluid aims to turn different distributed cache systems (Alluxio, JuiceFS, Vineyard, CubeFS and so on) into self-managing, self-scaling, self-healing and observable cache services inside of Kubernetes by providing the common framework of Fluid.
2022

2123
Fluid enables Kubernetes schedulers to make intelligent, topology-aware scheduling plans regarding where the distributed data cache system is located. It focuses on the dataset orchestration and application orchestration scenarios. The dataset orchestration can arrange the cached dataset to the specific Kubernetes node, while the application orchestration can arrange the the applications to nodes with the pre-loaded datasets. These two can work together to form the co-orchestration scenario, which take both the dataset specifications and application characteristics into consideration during resouce scheduling.
2224

2325
Fluid presents its value in the following two aspects:
2426
1. Use the power of Kubernetes platform to deliver its services via a Kubernetes Operator for each distributed cache provider, and automate the tasks of the administrator: deployment, bootstrapping, configuration, provisioning, scaling, upgrading, monitoring, data prefetch, data migration and resource management.
2527
2. Help the users make the most of distributed caching by combining third-party caching systems with Kubernetes scheduling and elasticity, also aligning them with specific application data usage scenarios and methods.
2628

27-
## Why Cloud Native needs Fluid
29+
## Why Fluid
30+
31+
1. Running AI, big data and other tasks on the cloud through a cloud-native architecture can take advantage of the elasticity of computing resources, but at the same time, it also faces data access latency and large bandwidth overhead due to the separated computing and storage architecture. Especially deep learning training with GPUs, iterative remote access to large amounts of training data will significantly slow down the computing efficiency.
32+
33+
2. Kubernetes provides heterogeneous storage service access and management standard interface (CSI, Container Storage Interface), but it does not define how the application uses and manages data. When running machine learning tasks, data scientists need to be able to define file features of the dataset, manage versions of the dataset, control access permissions, pre-process the dataset, accelerate heterogeneous data reading, etc. However, there is no such standard scheme in Kubernetes, which is one of the important missing capabilities of Kubernetes.
34+
35+
3. Kubernetes supports a variety of forms, such as native Kubernetes, edge Kubernetes and Serverless Kubernetes. However, for different forms of Kubernetes, the support for CSI plug-ins is also different, for example, many Serverless Kubernetes do not support the deployment of third-party CSI plug-ins.
36+
37+
In summary, to resolve the issue that Kubernetes lacks the awareness and optimization for application data, Fluid put forward a series of innovative methods such as co-orchestration, intelligent awareness, joint-optimization, to form an efficient supporting platform for data-intensive applications in cloud native environment.
38+
39+
## System Characteristics
40+
1. **Application-oriented DataSet Unified Abstraction**:DataSet not only consolidates data from multiple storage sources, but also describes the data's portablity and features, also providing observability, such as total data volume of the DataSet, current cache space size, and cache hit rate. Users can evaluate whether a cache system needs to be scaled up or down according to this information.
41+
42+
2. **Lightweight but highly extensible Runtime Plugins**:Dataset is an abstract concept, and the data operation needs to be implemented by the Runtime. According to the different storages, there will be different Runtime interfaces. Fluid's Runtime is divided into two categories: CacheRuntime to accelerate data access, such as AlluxioRuntime for S3, HDFS and JuiceFSRuntime for JuiceFS; the other category is ThinRuntime, which provides a unified access interface to facilitate the access to third-party storage.
2843

29-
There exist a nature divergence between the cloud native environment and the earlier big data processing framework. Deeply affected by Google's GFS, MapReduce, BigTable influential papers, the open souce big data ecosystem keeps the concept of 'moving data but not moving computation' during system design. Therefore, data-intensive computing frameworks, such as Spark, Hive, MapReduce, aim to reduce data transmission, and consider more data locality architecture during the design. However, as time changes, for both consider the flexibility of the resource scalability and usage cost, compution and storage separation architecture has been widely used in the cloud native environment. Thus, the cloud native ecosystem need an component like Fluid to make up the lost data locality when the big data architecture embraces cloud native architecture.
44+
3. **Automated data operation**:Providing data prefetch, migration, backup and other operations via CRDs, and supporting various trigger modes such as one-time, scheduled, and event-driven, to facilitate users to integrate them into the automated operation and maintenance system.
3045

31-
Besides, in the cloud native environment, applications are usually deployed in the stateless micro-service style, but focus on data processing. However, the data-intensive frameworks and applications always focus on data abstraction, and schedules and executes the computing jobs and tasks. When data-intensive frameworks are deployed in cluod native environment, it needs component like Fluid to handle the data scheduling in cloud.
46+
4. **Data elasticity and scheduling**:By combining distributed data caching technology with autoscaling, portability, observability, and affinity scheduling capabilities, data access performance can be improved through the provision of observable, elastic scaling cache capabilities and data affinity scheduling capabilities.
3247

33-
To resolve the issue that Kubernetes lacks the awareness and optimization for application data, Fluid put forward a series of innovative methods such as co-orchestration, intelligent awareness, joint-optimization, to form an efficient supporting platform for data-intensive applications in cloud native environment.
48+
5. **Runtime platform Agnostic**:Support diverse environments such as native, edge, Serverless Kubernetes cluster, Kubernetes multi-cluster, and can run in various environments such as cloud platform, edge, Kubernetes multi-cluster. It can run storage client in different modes by choosing CSI Plugin and sidecar according to the differences in environments.
3449

3550
## Publication
3651
For more information of our key ideas, please refer to our papers:
-406 KB
Loading
-426 KB
Loading
26.8 KB
Loading
86 KB
Loading
103 KB
Loading
263 KB
Loading

0 commit comments

Comments
 (0)