# Architecture

3 | | -**TODO:** File a subtask under [HDDS-9857](https://issues.apache.org/jira/browse/HDDS-9857) and complete this page or section. |
Apache Ozone is a scalable, distributed, and highly available object store designed to handle billions of objects of any size. This document provides an overview of Ozone's architecture, including its core components, data organization, and operational concepts.

## Ozone Namespace

Ozone organizes data in a hierarchical namespace consisting of three levels:

### Volumes

Volumes are the top-level entities in the Ozone namespace, conceptually similar to filesystems in traditional storage systems. They typically represent:

- Multi-tenant boundaries
- Organizational divisions
- Project groupings

Volumes provide isolation and can have their own administrators and quota limits.

### Buckets

Buckets exist within volumes and act as containers for objects (keys). Each bucket can be configured with specific properties:

- Storage type and replication configuration
- Encryption settings
- Access control policies
- Quota limits

Buckets are analogous to directories in a filesystem or buckets in cloud object stores.

### Keys (Objects)

Keys are the actual data objects stored in buckets. They can be:

- Files of any size
- Binary data
- Named using a path-like structure, depending on the bucket layout

For more details about Ozone's namespace, see the namespace documentation.

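As a concrete illustration, the three levels map directly onto the Ozone Java client API. The sketch below is a minimal example, assuming an `ozone-site.xml` on the classpath that points at a reachable Ozone Manager; the names `vol1`, `bucket1`, and `key1` are placeholders:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class NamespaceExample {
  public static void main(String[] args) throws Exception {
    // Cluster settings (e.g. ozone.om.address) are read from ozone-site.xml.
    try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
      ObjectStore store = client.getObjectStore();

      store.createVolume("vol1");                        // level 1: volume
      OzoneVolume volume = store.getVolume("vol1");

      volume.createBucket("bucket1");                    // level 2: bucket
      OzoneBucket bucket = volume.getBucket("bucket1");

      byte[] data = "hello ozone".getBytes(StandardCharsets.UTF_8);
      try (OzoneOutputStream out = bucket.createKey("key1", data.length)) {
        out.write(data);                                 // level 3: key
      }
    }
  }
}
```
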
### Ozone Bucket Types

Ozone supports two distinct bucket layouts, each optimized for different use cases:

#### Object Store Layout

The Object Store layout (OBS) works like traditional object stores with a flat namespace:

- Objects are stored with their full key path
- No concept of directories (though path-like naming is supported)
- Optimized for object storage workloads
- Compatible with S3-style access patterns

#### File System Optimized Layout

The File System Optimized layout (FSO) provides a hierarchical directory structure:

- Directories are first-class entities
- Supports efficient directory operations (listing, renaming)
- Includes filesystem features like trash
- Better performance for filesystem-style workloads
- Default layout type

The bucket layout determines how data is organized and accessed within a bucket.

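The layout is chosen when the bucket is created and cannot be changed afterwards. A rough sketch of selecting a layout through the Java client is shown below; the `BucketArgs` and `BucketLayout` classes exist in recent Ozone releases, but the exact builder methods and package names can vary by version, so treat this as an assumption to verify against your client library:

```java
import org.apache.hadoop.ozone.client.BucketArgs;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.om.helpers.BucketLayout;

public class BucketLayoutExample {
  // Creates one bucket per layout in an existing volume (hypothetical names).
  static void createBuckets(OzoneVolume volume) throws Exception {
    volume.createBucket("obs-bucket",
        BucketArgs.newBuilder().setBucketLayout(BucketLayout.OBJECT_STORE).build());
    volume.createBucket("fso-bucket",
        BucketArgs.newBuilder().setBucketLayout(BucketLayout.FILE_SYSTEM_OPTIMIZED).build());
  }
}
```
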
## Core Components

Ozone has a modular architecture with several key components that work together to provide a scalable and reliable storage system.

### Ozone Manager (OM)

The Ozone Manager is the metadata server that manages the namespace:

- Handles all volume, bucket, and key operations
- Maintains metadata in RocksDB
- Allocates blocks for client writes by requesting them from SCM
- Manages access control
- Supports HA deployment using the Ratis consensus protocol

The OM is the entry point for all namespace operations. It tracks which objects exist and where they are stored.

### Storage Container Manager (SCM)

The Storage Container Manager orchestrates the container lifecycle and coordinates datanodes:

- Manages container creation and allocation
- Tracks datanode status and health
- Handles container replication and erasure coding (EC)
- Services block allocation requests from the Ozone Manager
- Coordinates container balancing
- Supports HA deployment with Ratis

SCM is the control plane for container management.

### Datanode

Datanodes are the workhorses that store the actual data:

- Store data in containers on local disks
- Serve read and write requests
- Report container statistics to SCM
- Participate in replication pipelines
- Handle data integrity checks

Each datanode manages a set of containers and serves read/write requests from clients.

### Recon

Recon is the analytics and monitoring component:

- Collects and aggregates metrics
- Provides a web UI for monitoring
- Offers a consolidated view of the cluster
- Helps identify issues and bottlenecks
- Syncs data from OM, SCM, and Datanodes

Recon is an optional but recommended component for operational visibility.

### S3 Gateway

The S3 Gateway provides S3-compatible API access:

- Implements the S3 REST API
- Translates S3 operations to Ozone operations
- Supports most S3 features and SDKs
- Provides authentication and authorization

The S3 Gateway enables applications built for S3 to work with Ozone.

### HttpFS

HttpFS provides a REST interface compatible with WebHDFS:

- Enables HTTP access to Ozone
- Compatible with HDFS clients
- Supports read/write operations
- Facilitates integration with web applications

HttpFS allows web applications to interact with Ozone using familiar APIs.

### Ozone Client

The Ozone client is the software component that enables applications to interact with the Ozone storage system:

- Provides Java libraries for programmatic access
- Handles communication with OM for namespace operations
- Manages direct data transfer with datanodes
- Implements client-side caching for improved performance
- Offers pluggable interfaces for different protocols (S3, OFS)
- Handles authentication and token management

The client library abstracts away the complexity of the distributed system, providing applications with a simple, consistent interface to Ozone storage.

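As a small illustration of the Java API (a sketch; it assumes the volume and bucket already exist and that cluster settings are on the classpath), listing the keys in a bucket looks like this:

```java
import java.util.Iterator;

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneKey;

public class ListKeysExample {
  public static void main(String[] args) throws Exception {
    try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");   // placeholder names

      // Iterate over all keys under the given prefix ("" lists everything).
      Iterator<? extends OzoneKey> keys = bucket.listKeys("");
      while (keys.hasNext()) {
        OzoneKey key = keys.next();
        System.out.println(key.getName() + " (" + key.getDataSize() + " bytes)");
      }
    }
  }
}
```
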
### Component Interactions

The components of Ozone interact in well-defined patterns for different operations:

#### Write Path Sequence

The typical write sequence follows these steps:

1. **Namespace Operations**: The client contacts the Ozone Manager to create or locate the key in the namespace
2. **Block Allocation**: The Ozone Manager requests blocks from the Storage Container Manager
3. **Data Transfer**: The client directly writes data to the selected Datanodes according to the replication pipeline
4. **Key Commit**: After successful data transfer, the client commits the key to the Ozone Manager

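From the application's point of view, these four steps are hidden behind an ordinary stream lifecycle. A minimal sketch (placeholder names; the volume and bucket are assumed to exist):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class WritePathExample {
  public static void main(String[] args) throws Exception {
    try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");

      byte[] data = "example payload".getBytes(StandardCharsets.UTF_8);

      // Steps 1-2: the OM opens the key and returns blocks allocated via SCM.
      try (OzoneOutputStream out = bucket.createKey("key1", data.length)) {
        out.write(data);  // Step 3: bytes stream directly to the datanode pipeline.
      }                   // Step 4: close() commits the key to the OM.
    }
  }
}
```
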
#### Read Path Sequence

For reads, the process is simpler:

1. The client requests key location information from the Ozone Manager
2. Using the block location information, the client reads data directly from Datanodes
3. In case of failures, the client retries with alternative replicas

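The corresponding read, again as a minimal sketch with placeholder names:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneInputStream;

public class ReadPathExample {
  public static void main(String[] args) throws Exception {
    try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");

      // Step 1: the OM resolves "key1" to its block locations.
      try (OzoneInputStream in = bucket.readKey("key1")) {
        // Step 2: data is read directly from the datanodes holding the blocks;
        // on failure the client transparently retries against other replicas.
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
      }
    }
  }
}
```
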
#### Monitoring and Management

The Recon service continuously:

- Collects metrics from the Ozone Manager, Storage Container Manager, and Datanodes
- Provides consolidated views of system health and performance
- Facilitates troubleshooting and management

## Ozone Internals

Understanding Ozone's internal structure helps to grasp how data is organized and protected.

### Containers

Containers are the fundamental storage units in Ozone:

- Units of storage with a configurable size limit (typically 5 GB)
- Managed by the Storage Container Manager (SCM)
- Replicated or erasure-coded across datanodes
- Contain multiple blocks
- Include metadata and chunk files

Containers are self-contained units that include all necessary metadata and data. They are the unit of replication in Ozone.

### Blocks

Blocks are logical units of data within containers:

- Represent portions of objects/keys
- Created when clients write data
- Referenced by object metadata
- Allocated by the Ozone Manager
- Secured with block tokens

When a client writes data, the OM allocates blocks from SCM, and the client writes data to these blocks through datanodes.

### Chunks

Chunks are the physical data units stored on disk:

- Fixed-size portions of blocks
- Written sequentially in container data files
- Checksummed for data integrity
- Optimized for disk I/O

Chunks are the smallest units of data stored on disk and include checksums for integrity verification.

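To make the container/block/chunk hierarchy concrete, consider how a single 1 GiB key breaks down, assuming the commonly used defaults of 256 MiB blocks and 4 MiB chunks (both sizes are configurable and can differ between versions and deployments):

```latex
\text{blocks per key}   = \frac{1\,\mathrm{GiB}}{256\,\mathrm{MiB}} = 4,
\qquad
\text{chunks per block} = \frac{256\,\mathrm{MiB}}{4\,\mathrm{MiB}} = 64
```
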
### Replicated Containers

Ozone provides durability through container replication:

- Default replication factor is 3
- Uses Ratis (Raft) consensus protocol
- Synchronously replicates data across datanodes
- Provides strong consistency guarantees
- Handles node failures transparently

Replicated containers ensure data durability by storing multiple copies of each container across different datanodes.

### Erasure Encoded Containers

Erasure coding provides space-efficient durability:

- Splits data across multiple datanodes with parity
- Supports various coding schemes (e.g., RS-3-2-1024k)
- Reduces storage overhead compared to replication
- Trades some performance for storage efficiency
- Suitable for cold data storage

Erasure coding allows for data durability with less storage overhead than full replication.

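The storage saving is easy to quantify: for a Reed-Solomon scheme with k data blocks and m parity blocks, the ratio of raw to logical storage is (k + m) / k. For RS-3-2 this compares with 3-way replication as follows:

```latex
\text{EC overhead} = \frac{k + m}{k} = \frac{3 + 2}{3} \approx 1.67\times
\qquad\text{vs.}\qquad
\text{3-way replication} = 3\times
```

Both schemes tolerate the loss of two datanodes, but erasure coding does so with roughly half the raw capacity.
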
### Pipelines

Pipelines are groups of datanodes that work together to store data:

- Managed by SCM
- Consist of multiple datanodes
- Handle write operations as a unit
- Support different replication strategies

For detailed information, see [Write Pipelines](/docs/core-concepts/replication/write-pipelines).

#### Ratis Replicated

Ratis pipelines use the Raft consensus protocol:

- Typically three datanodes per pipeline
- One leader and multiple followers
- Synchronous replication
- Strong consistency guarantees
- Automatic leader election on failure

#### Erasure Coded

Erasure coded pipelines distribute data and parity:

- Datanodes store data or parity chunks
- EC pipeline size depends on the coding scheme
- Handles reconstruction on node failure
- Optimized for storage efficiency

## Multi-Protocol Access

Ozone supports multiple access protocols, making it versatile for different applications:

### Native API

- Command-line interface and Java API
- Full feature access
- Highest performance

### Ozone File System (OFS)

- Hadoop-compatible filesystem interface
- Works with all Hadoop ecosystem applications
- Path format: `ofs://om-host/volume/bucket/key`

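Because OFS is a standard Hadoop-compatible filesystem, existing Hadoop code works unchanged. A minimal sketch (assumes the Ozone filesystem client jar is on the classpath and that `om-host` is a placeholder for your Ozone Manager address):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // In an ofs:// path the first two components are the volume and bucket.
    try (FileSystem fs = FileSystem.get(URI.create("ofs://om-host/"), conf)) {
      fs.mkdirs(new Path("/vol1/bucket1/dir1"));  // creates volume/bucket if missing
      for (FileStatus status : fs.listStatus(new Path("/vol1/bucket1"))) {
        System.out.println(status.getPath());
      }
    }
  }
}
```
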
### S3 Compatible

- REST API compatible with Amazon S3
- Works with S3 clients and SDKs
- Path format: `http://s3g-host/bucket/key`

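Any S3 SDK can be pointed at the gateway by overriding its endpoint. A sketch using the AWS SDK for Java v1 (`s3g-host` is a placeholder; 9878 is the gateway's usual default port, and path-style addressing is typically required since Ozone buckets are not DNS subdomains):

```java
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3GatewayExample {
  public static void main(String[] args) {
    // Credentials come from the default AWS credential provider chain.
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        // Point the SDK at the Ozone S3 Gateway instead of AWS.
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("http://s3g-host:9878", "us-east-1"))
        // Address buckets path-style: http://host:port/bucket/key
        .withPathStyleAccessEnabled(true)
        .build();

    s3.putObject("bucket1", "key1", "hello from the S3 gateway");
    System.out.println(s3.getObjectAsString("bucket1", "key1"));
  }
}
```
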
### HttpFS

- REST API compatible with WebHDFS
- Enables web applications to access Ozone
- Path format: `http://httpfs-host/webhdfs/v1/volume/bucket/key`

The multi-protocol architecture allows for flexible integration with existing applications and workflows.

## Summary

Apache Ozone's architecture provides:

1. **Scalability** through separation of metadata and data paths
2. **Reliability** via replication and erasure coding
3. **Flexibility** with multiple access protocols
4. **Performance** through optimized data paths
5. **Manageability** with comprehensive monitoring

This architecture enables Ozone to serve as both an object store and a filesystem, making it suitable for a wide range of applications from big data analytics to cloud-native workloads.

<!-- For more detailed information about Ozone's components and internals, see the System Internals section when available. -->