- [TiDB](https://github.com/pingcap/tidb) - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.
@@ -89,15 +90,15 @@ The following columnar databases use a [shared-nothing architecture](https://en.
## Data lake
-The data lake approach (or "lakehouse") is a semi-structured schema that sit on top of object storage in the cloud.
+The data lake approach (or "lakehouse") is a semi-structured schema that sits on top of object storage in the cloud.
It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.
### File formats and serialization
-These formats are popular for shared-everything databases, using object storage as persistence layer. The data is organized in row or column, with strict schema definition. These files are immutable and offer partial reads (only headers, metadata, data page, etc). Mutation require a new upload. Most formats support nested schema, codecs, compression and data encryption. Index can be added to file metadata for faster processing.
+These formats are popular for shared-everything databases, using object storage as a persistence layer. The data is organized in row or column, with strict schema definition. These files are immutable and offer partial reads (only headers, metadata, data page, etc). Mutation requires a new upload. Most formats support nested schema, codecs, compression, and data encryption. Index can be added to file metadata for faster processing.
-A single file can weight between tens of MB to a few GB. Lot of small files require more merge operation. Larger files can be costly to update.
+A single file can weight between tens of MB to a few GB. Lots of small files require more merge operation. Larger files can be costly to update.
- [Apache Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html) - Columnar format for in-memory Apache Arrow processing.
- [Apache Avro](https://avro.apache.org/) - Row-oriented serialization for data streaming purpose.
@@ -237,6 +238,7 @@ The popular acronym for Extracting, Transforming and Loading data. ELT performs