---
title: "This Week in Databend #85"
date: 2023-03-17
slug: 2023-03-17-databend-weekly
tags: [databend, weekly]
description: "Get to know the latest updates on Databend this week!"
contributors:
  - name: andylokandy
  - name: ariesdevil
  - name: b41sh
  - name: BohuTANG
  - name: Carlosfengv
  - name: Chasen-Zhang
  - name: dantengsky
  - name: dependabot[bot]
  - name: drmingdrmer
  - name: everpcpc
  - name: jun0315
  - name: leiysky
  - name: lichuang
  - name: mergify[bot]
  - name: PsiACE
  - name: RinChanNOWWW
  - name: soyeric128
  - name: sundy-li
  - name: TCeason
  - name: wubx
  - name: Xuanwo
  - name: xudong963
  - name: youngsofun
  - name: zhang2014
  - name: zhyass
authors:
  - name: PsiACE
    url: https://github.com/psiace
    image_url: https://github.com/psiace.png
---

[Databend](https://github.com/datafuselabs/databend) is a modern cloud data warehouse that serves your massive-scale analytics needs at low cost and low complexity. It is an open-source alternative to Snowflake and is also available in the cloud: <https://app.databend.com>.

> :loudspeaker: Read our blog *[Way to Go: OpenDAL successfully entered Apache Incubator](https://databend.rs/blog/opendal-enters-apache-incubator)* to learn the story of [OpenDAL](https://github.com/apache/incubator-opendal).

## What's On In Databend

Stay connected with the latest news about Databend.

### Data Type: MAP

The MAP data type holds `Key:Value` pairs, stored internally as a nested `Array(Tuple(key, value))`. It is useful when the data types are fixed but the set of keys cannot be fully determined in advance. The Key must be of a specified basic data type and duplicates are not allowed, while the Value can be any data type, including nested arrays or tuples. A bloom filter index is created for the Map, making it easier and faster to search for values in a MAP.

```sql
select * from nginx_log where log['ip'] = '205.91.162.148';
+----+----------------------------------------+
| id | log                                    |
+----+----------------------------------------+
| 1  | {'ip':'205.91.162.148','url':'test-1'} |
+----+----------------------------------------+
1 row in set
```

If you want to learn more about the Map data type, read the following materials:

- [Docs | Data Types - Map](https://databend.rs/doc/sql-reference/data-types/data-type-map)

### Data Transformation During Loading Process

Do you remember the two RFCs mentioned last week? Databend now supports transforming data while loading it into tables. Basic transformations can be performed with the `COPY INTO <table>` command.

```sql
CREATE TABLE my_table(id int, name string, time date);

COPY INTO my_table
FROM (SELECT t.id, t.name, to_date(t.timestamp) FROM @mystage t)
FILE_FORMAT = (type = parquet) PATTERN='.*parquet';
```

This feature avoids storing pre-transformed data in temporary tables and supports column reordering, column omission, and type conversion operations. In addition, partial data can be loaded from staged Parquet files, or their columns can be rearranged. This simplifies and streamlines ETL processes, allowing users to focus on data analysis rather than mechanically moving data around.

If you're interested, check the following resources:

- [Docs | Transforming Data During a Load](https://databend.rs/doc/load-data/data-load-transform)
- [PR | feat(storage): Map data type support bloom filter](https://github.com/datafuselabs/databend/pull/10457)

## Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

### Run Multiple Futures in Parallel

Are you interested in how to run futures in parallel? By applying this technique, Databend has greatly improved its scanning performance in situations with a huge number of files.

The following code, fewer than 30 lines, shows how it all works.

```rust
/// Runs multiple futures in parallel,
/// using a semaphore to limit the degree of parallelism and a specified thread pool to run the futures.
/// It waits for all futures to complete and returns their results.
pub async fn execute_futures_in_parallel<Fut>(
    futures: impl IntoIterator<Item = Fut>,
    thread_nums: usize,
    permit_nums: usize,
    thread_name: String,
) -> Result<Vec<Fut::Output>>
where
    Fut: Future + Send + 'static,
    Fut::Output: Send + 'static,
{
    // 1. Build the runtime.
    let semaphore = Semaphore::new(permit_nums);
    let runtime = Arc::new(Runtime::with_worker_threads(
        thread_nums,
        Some(thread_name),
    )?);

    // 2. Spawn all the tasks onto the runtime, gated by the semaphore.
    let join_handlers = runtime.try_spawn_batch(semaphore, futures).await?;

    // 3. Collect all the results.
    future::try_join_all(join_handlers)
        .await
        .map_err(|e| ErrorCode::Internal(format!("try join all futures failure, {}", e)))
}
```

If you are interested in this Rust trick, you can read this PR: [feat: improve the parquet get splits to parallel](https://github.com/datafuselabs/databend/pull/10514).

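The snippet above depends on Databend's internal `Runtime` and `try_spawn_batch`, so it won't compile on its own. The underlying pattern, a semaphore capping how many tasks run at once while all results are joined in submission order, can be sketched with only the standard library, using threads in place of futures. Everything below is illustrative, not a Databend API:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A minimal counting semaphore built on Mutex + Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cvar: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while *permits == 0 {
            permits = self.cvar.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cvar.notify_one();
    }
}

/// Run all tasks, allowing at most `permit_nums` to execute at once,
/// and collect their results in submission order.
fn execute_in_parallel<T, F>(tasks: Vec<F>, permit_nums: usize) -> Vec<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let semaphore = Arc::new(Semaphore::new(permit_nums));
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|task| {
            let sem = Arc::clone(&semaphore);
            thread::spawn(move || {
                sem.acquire();       // wait for a free permit
                let result = task(); // run the task
                sem.release();       // hand the permit back
                result
            })
        })
        .collect();

    // Wait for every task, like future::try_join_all does for futures.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let tasks: Vec<_> = (0..8).map(|i| move || i * i).collect();
    let results = execute_in_parallel(tasks, 2);
    println!("{:?}", results); // [0, 1, 4, 9, 16, 25, 36, 49]
}
```

Databend's async version follows the same shape: permits bound concurrent I/O, and `try_join_all` propagates the first error instead of panicking on a failed join.
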
### How to Create a System Table

System tables are tables that provide information about Databend's internal state, such as databases, tables, functions, and settings.

If you are interested in creating system tables, check out our recently released documentation, which walks through the implementation, registration, and testing of system tables using the `system.credits` table as an example.

Here is a code snippet:

```rust
let table_info = TableInfo {
    desc: "'system'.'credits'".to_string(),
    name: "credits".to_string(),
    ident: TableIdent::new(table_id, 0),
    meta: TableMeta {
        schema,
        engine: "SystemCredits".to_string(),
        ..Default::default()
    },
    ..Default::default()
};
```

- [Docs | How to Create a System Table](https://databend.rs/doc/contributing/how-to-write-a-system-table)

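The `..Default::default()` in the snippet above is Rust's struct update syntax: every field not listed explicitly is filled in from the type's `Default` implementation. A self-contained illustration, with a cut-down stand-in struct whose fields are hypothetical and not Databend's actual `TableMeta`:

```rust
/// A cut-down stand-in for a table-metadata struct
/// (hypothetical fields, for illustration only).
#[derive(Debug, Default, PartialEq)]
struct TableMeta {
    engine: String,
    comment: String,
    cluster_keys: Vec<String>,
}

fn main() {
    // `..Default::default()` fills every field we did not set explicitly.
    let meta = TableMeta {
        engine: "SystemCredits".to_string(),
        ..Default::default()
    };

    assert_eq!(meta.engine, "SystemCredits");
    assert_eq!(meta.comment, "");          // defaulted
    assert!(meta.cluster_keys.is_empty()); // defaulted
    println!("{:?}", meta);
}
```

This is why registering a new system table only needs the fields that differ from the defaults.
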
## Highlights

Here are some noteworthy items from this week; perhaps you can find something that interests you.

- When upgrading to **1.0.17-nightly** or above, we suggest running `unset max_storage_io_requests` so that `num_cpu` is used as the default value.
- Databend can now integrate with MindsDB to provide users with machine learning workflow support. *[Bringing in-database ML to Databend](https://mindsdb.com/integrations/databend-machine-learning)*
- If you happen to use HDFS and are interested in Databend, why not try our WebHDFS storage backend? This blog post may help: *[How to Configure WebHDFS as a Storage Backend for Databend](https://databend.rs/blog/2023-03-13-webhdfs-storage-for-backend)*

## What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

### Support Quantile with a List

After PR [#10474](https://github.com/datafuselabs/databend/pull/10474) was merged, Databend began to support quantile aggregation functions, but currently only a single floating-point value can be set as the level. Supporting a list as well could simplify SQL writing in some scenarios.

```sql
SELECT QUANTILE([0.25, 0.5, 0.75])(number) FROM numbers(25);
+-------------------------------------+
| quantile([0.25, 0.5, 0.75])(number) |
+-------------------------------------+
| [6, 12, 18]                         |
+-------------------------------------+
```

[Feature: quantile support list and add functions kurtosis() and skewness()](https://github.com/datafuselabs/databend/issues/10589)

Additionally, the `kurtosis(x)` and `skewness(x)` functions mentioned in this issue may also be a good starting point for contributing to Databend.

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at <https://link.databend.rs/i-m-feeling-lucky> to get started.

## Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

**Full Changelog**: <https://github.com/datafuselabs/databend/compare/v1.0.11-nightly...v1.0.21-nightly>