|
17 | 17 | under the License.
|
18 | 18 | -->
|
19 | 19 |
|
20 |
| -# Native Rust implementation of Apache Arrow |
| 20 | +# Native Rust implementation of Apache Arrow and Parquet |
21 | 21 |
|
22 | 22 | [](https://codecov.io/gh/apache/arrow?branch=master)
|
23 | 23 |
|
24 | 24 | Welcome to the implementation of Arrow, the popular in-memory columnar format, in [Rust](https://www.rust-lang.org/).
|
25 | 25 |
|
26 |
| -This part of the Arrow project is divided in 4 main components: |
| 26 | +This repo contains the following main components: |
27 | 27 |
|
28 |
| -| Crate | Description | Documentation | |
29 |
| -| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | |
30 |
| -| Arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | |
31 |
| -| Parquet | Parquet support | [(README)](parquet/README.md) | |
32 |
| -| Arrow-flight | Arrow data between processes | [(README)](arrow-flight/README.md) | |
33 |
| -| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) | |
34 |
| -| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) | |
35 |
| - |
36 |
| -Independently, they support a vast array of functionality for in-memory computations. |
| 28 | +| Crate | Description | Documentation | |
| 29 | +| ------------ | ------------------------------------------------------------------ | ---------------------------------- | |
| 30 | +| arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | |
| 31 | +| parquet | Support for Parquet columnar file format | [(README)](parquet/README.md) | |
| 32 | +| arrow-flight | Support for Arrow-Flight IPC protocol | [(README)](arrow-flight/README.md) | |
37 | 33 |
|
38 |
| -Together, they allow users to write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send to another process (using the `arrow-flight` crate). |
| 34 | +There are two related crates in a different repository: |
| 35 | +| Crate | Description | Documentation | |
| 36 | +| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | |
| 37 | +| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) | |
| 38 | +| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) | |
39 | 39 |
|
40 |
| -Generally speaking, the `arrow` crate offers functionality to develop code that uses Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions. |
| 40 | +Collectively, these crates support a vast array of functionality for analytic computations in Rust. |
41 | 41 |
|
42 |
| -There are too many features to enumerate here, but some notable mentions: |
| 42 | +For example, you can write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send the results to another process (using the `arrow-flight` crate). |
43 | 43 |
|
44 |
| -- `Arrow` implements all formats in the specification except certain dictionaries |
45 |
| -- `Arrow` supports SIMD operations to some of its vertical operations |
46 |
| -- `DataFusion` supports `async` execution |
47 |
| -- `DataFusion` supports user-defined functions, aggregates, and whole execution nodes |
| 44 | +Generally speaking, the `arrow` crate offers functionality for using Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions. |
48 | 45 |
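To make "memory layout, arrays" a little more concrete, here is a minimal sketch of the idea behind a nullable columnar array: a contiguous values buffer paired with a validity bitmap, one bit per slot. This is not the `arrow` crate's API. The `Int32Column` type and its methods are hypothetical names used only for illustration, and the real crate provides far richer array types and compute kernels.

```rust
/// Hypothetical, simplified stand-in for a nullable 32-bit integer column.
/// It mirrors the columnar idea only: values stored contiguously, with
/// nullness tracked separately in a bitmap.
struct Int32Column {
    /// One entry per slot; null slots hold a placeholder value.
    values: Vec<i32>,
    /// Bit `i` (least-significant bit first) is set when slot `i` is non-null.
    validity: Vec<u8>,
}

impl Int32Column {
    /// Build a column from optional values.
    fn from_options(items: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1u8 << (i % 8);
                }
                // Placeholder only; the cleared validity bit marks it as null.
                None => values.push(0),
            }
        }
        Self { values, validity }
    }

    /// Returns true when slot `i` holds a non-null value.
    fn is_valid(&self, i: usize) -> bool {
        self.validity[i / 8] & (1u8 << (i % 8)) != 0
    }

    /// Sum of the non-null values, or `None` if every slot is null.
    fn sum(&self) -> Option<i64> {
        let mut total = 0i64;
        let mut seen = false;
        for (i, v) in self.values.iter().enumerate() {
            if self.is_valid(i) {
                total += i64::from(*v);
                seen = true;
            }
        }
        if seen {
            Some(total)
        } else {
            None
        }
    }
}

fn main() {
    let col = Int32Column::from_options(&[Some(1), None, Some(3)]);
    println!("sum = {:?}", col.sum()); // prints "sum = Some(4)"
}
```

Keeping the values in one contiguous buffer, with nullness stored out of band, is what lets an aggregation like `sum` scan memory sequentially. For real workloads, prefer the array types and kernels documented in the `arrow` README.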
|
49 | 46 | You can find more details about each crate in their respective READMEs.
|
50 | 47 |
|
51 | 48 | ## Arrow Rust Community
|
52 | 49 |
|
53 |
| -We use the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is |
54 |
| -a great place to meet other contributors and get guidance on where to contribute. Join us in the `arrow-rust` channel. |
55 |
| - |
56 |
| -We use [ASF JIRA](https://issues.apache.org/jira/secure/Dashboard.jspa) as the system of record for new features |
57 |
| -and bug fixes and this plays a critical role in the release process. |
58 |
| - |
59 |
| -For design discussions we generally collaborate on Google documents and file a JIRA linking to the document. |
60 |
| - |
61 |
| -There is also a bi-weekly Rust-specific sync call for the Arrow Rust community. This is hosted on Google Meet |
62 |
| -at https://meet.google.com/ctp-yujs-aee on alternate Wednesdays at 09:00 US/Pacific, 12:00 US/Eastern. During |
63 |
| -US daylight savings time this corresponds to 16:00 UTC and at other times this is 17:00 UTC. |
64 |
| - |
65 |
| -## Developer's guide to Arrow Rust |
66 |
| - |
67 |
| -### How to compile |
68 |
| - |
69 |
| -This is a standard cargo project with workspaces. To build it, you need to have `rust` and `cargo`: |
70 |
| - |
71 |
| -```bash |
72 |
| -cargo build |
73 |
| -``` |
74 |
| - |
75 |
| -You can also use rust's official docker image: |
76 |
| - |
77 |
| -```bash |
78 |
| -docker run --rm -v $(pwd):/arrow-rs -it rust /bin/bash -c "cd /arrow-rs && rustup component add rustfmt && cargo build" |
79 |
| -``` |
80 |
| - |
81 |
| -The command above assumes that you are in the root directory of the project, not in the same |
82 |
| -directory as this README.md. |
83 |
| - |
84 |
| -You can also compile specific workspaces: |
85 |
| - |
86 |
| -```bash |
87 |
| -cd arrow && cargo build |
88 |
| -``` |
89 |
| - |
90 |
| -### Git Submodules |
91 |
| - |
92 |
| -Before running tests and examples, it is necessary to set up the local development environment. |
93 |
| - |
94 |
| -The tests rely on test data that is contained in git submodules. |
95 |
| - |
96 |
| -To pull down this data run the following: |
97 |
| - |
98 |
| -```bash |
99 |
| -git submodule update --init |
100 |
| -``` |
101 |
| - |
102 |
| -This populates data in two git submodules: |
103 |
| - |
104 |
| -- `../parquet_testing/data` (sourced from https://github.com/apache/parquet-testing.git) |
105 |
| -- `../testing` (sourced from https://github.com/apache/arrow-testing) |
106 |
| - |
107 |
| -By default, `cargo test` will look for these directories at their |
108 |
| -standard location. The following environment variables can be used to override the location: |
109 |
| - |
110 |
| -```bash |
111 |
| -# Optionally specify a different location for test data |
112 |
| -export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd) |
113 |
| -export ARROW_TEST_DATA=$(cd ../testing/data; pwd) |
114 |
| -``` |
| 50 | +The `dev@arrow.apache.org` mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found at the [Arrow Community](https://arrow.apache.org/community/) page. All major announcements and communications happen there. |
115 | 51 |
|
116 |
| -From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual. |
| 52 | +The Rust Arrow community also uses the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is |
| 53 | +a great place to meet other contributors and get guidance on where to contribute. Join us in the `#arrow-rust` channel. |
117 | 54 |
|
118 |
| -### Running the tests |
119 |
| - |
120 |
| -Run tests using the Rust standard `cargo test` command: |
121 |
| - |
122 |
| -```bash |
123 |
| -# run all tests. |
124 |
| -cargo test |
125 |
| - |
126 |
| - |
127 |
| -# run only tests for the arrow crate |
128 |
| -cargo test -p arrow |
129 |
| -``` |
130 |
| - |
131 |
| -## Code Formatting |
132 |
| - |
133 |
| -Our CI uses `rustfmt` to check code formatting. Before submitting a |
134 |
| -PR be sure to run the following and check for lint issues: |
135 |
| - |
136 |
| -```bash |
137 |
| -cargo +stable fmt --all -- --check |
138 |
| -``` |
139 |
| - |
140 |
| -## Clippy Lints |
141 |
| - |
142 |
| -We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings. |
143 |
| - |
144 |
| -Run the following to check for clippy lints. |
145 |
| - |
146 |
| -```bash |
147 |
| -cargo clippy |
148 |
| -``` |
149 |
| - |
150 |
| -If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881. |
151 |
| - |
152 |
| -One of the concerns with `clippy` is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify it. |
153 |
| - |
154 |
| -Search for `allow(clippy::` in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints on the lowest unit possible. |
155 |
| - |
156 |
| -- If you are introducing a line that returns a lint warning or error, you may disable the lint on that line. |
157 |
| -- If you have several lints on a function or module, you may disable the lint on the function or module. |
158 |
| -- If a lint is pervasive across multiple modules, you may disable it at the crate level. |
159 |
| - |
160 |
| -## Git Pre-Commit Hook |
161 |
| - |
162 |
| -We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting. |
163 |
| - |
164 |
| -Suppose you are in the root directory of the project. |
165 |
| - |
166 |
| -First check if the file already exists: |
167 |
| - |
168 |
| -```bash |
169 |
| -ls -l .git/hooks/pre-commit |
170 |
| -``` |
171 |
| - |
172 |
| -If the file already exists, to avoid mistakenly **overriding**, you MAY have to check |
173 |
| -the link source or file content. Else if not exist, let's safely soft link [pre-commit.sh](pre-commit.sh) as file `.git/hooks/pre-commit`: |
174 |
| - |
175 |
| -```bash |
176 |
| -ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit |
177 |
| -``` |
178 |
| - |
179 |
| -If sometimes you want to commit without checking, just run `git commit` with `--no-verify`: |
| 55 | +Unlike other parts of the Arrow ecosystem, the Rust implementation uses [GitHub issues](https://github.com/apache/arrow-rs/issues) as the system of record for new features |
| 56 | +and bug fixes, and this plays a critical role in the release process. |
180 | 57 |
|
181 |
| -```bash |
182 |
| -git commit --no-verify -m "... commit message ..." |
183 |
| -``` |
| 58 | +For design discussions, we generally collaborate on Google documents and file a GitHub issue linking to the document. |