
Commit 72ac1cc

Add option to set config dir to load from (#284)

1 parent 9287f88 commit 72ac1cc

File tree

7 files changed: +147 -48 lines changed

README.md

Lines changed: 24 additions & 7 deletions

````diff
@@ -9,9 +9,11 @@
 `hdfs-native` is an HDFS client written natively in Rust. It supports nearly all major features of an HDFS client, and several key client configuration options listed below.
 
 ## Supported HDFS features
+
 Here is a list of currently supported and unsupported but possible future features.
 
 ### HDFS Operations
+
 - [x] Listing
 - [x] Reading
 - [x] Writing
@@ -24,14 +26,16 @@ Here is a list of currently supported and unsupported but possible future featur
 - [x] Set timestamps
 
 ### HDFS Features
+
 - [x] Name Services
 - [x] Observer reads
 - [x] ViewFS
 - [x] Router based federation
 - [x] Erasure coded reads and writes
-    - RS schema only, no support for RS-Legacy or XOR
+  - RS schema only, no support for RS-Legacy or XOR
 
 ### Security Features
+
 - [x] Kerberos authentication (GSSAPI SASL support) (requires libgssapi_krb5, see below)
 - [x] Token authentication (DIGEST-MD5 SASL support)
 - [x] NameNode SASL connection
@@ -40,47 +44,54 @@ Here is a list of currently supported and unsupported but possible future featur
 - [ ] Encryption at rest (KMS support)
 
 ### Kerberos Support
+
 Kerberos (SASL GSSAPI) mechanism is supported through a runtime dynamic link to `libgssapi_krb5`. This must be installed separately, but is likely already installed on your system. If not you can install it by:
 
 #### Debian-based systems
+
 ```bash
 apt-get install libgssapi-krb5-2
 ```
 
 #### RHEL-based systems
+
 ```bash
 yum install krb5-libs
 ```
 
 #### MacOS
+
 ```bash
 brew install krb5
 ```
 
 #### Windows
+
 Download and install the Microsoft Kerberos package from https://web.mit.edu/kerberos/dist/
 
 Copy the `<INSTALL FOLDER>\MIT\Kerberos\bin\gssapi64.dll` file to a folder in %PATH% and change the name to `gssapi_krb5.dll`
 
 ## Supported HDFS Settings
-The client will attempt to read Hadoop configs `core-site.xml` and `hdfs-site.xml` in the directories `$HADOOP_CONF_DIR` or if that doesn't exist, `$HADOOP_HOME/etc/hadoop`. Currently the supported configs that are used are:
+
+The client will attempt to read Hadoop configs `core-site.xml` and `hdfs-site.xml` in the directories `$HADOOP_CONF_DIR` or if that doesn't exist, `$HADOOP_HOME/etc/hadoop`. Passing configs at runtime is supported as well via `client::ClientBuilder`. Currently the supported configs that are used are:
+
 - `fs.defaultFS` - Client::default() support
 - `dfs.ha.namenodes` - name service support
 - `dfs.namenode.rpc-address.*` - name service support
 - `dfs.client.failover.resolve-needed.*` - DNS based NameNode discovery
 - `dfs.client.failover.resolver.useFQDN.*` - DNS based NameNode discovery
 - `dfs.client.failover.random.order.*` - Randomize order of NameNodes to try
 - `dfs.client.failover.proxy.provider.*` - Supports the behavior of the following proxy providers. Any other values will default back to the `ConfiguredFailoverProxyProvider` behavior:
-    - `org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`
-    - `org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider`
-    - `org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadConfiguredFailoverProxyProvider`
+  - `org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`
+  - `org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider`
+  - `org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadConfiguredFailoverProxyProvider`
 - `dfs.client.block.write.replace-datanode-on-failure.enable`
 - `dfs.client.block.write.replace-datanode-on-failure.policy`
 - `dfs.client.block.write.replace-datanode-on-failure.best-effort`
 - `fs.viewfs.mounttable.*.link.*` - ViewFS links
 - `fs.viewfs.mounttable.*.linkFallback` - ViewFS link fallback
 
-All other settings are generally assumed to be the defaults currently. For instance, security is assumed to be enabled and SASL negotiation is always done, but on insecure clusters this will just do SIMPLE authentication. Any setups that require other customized Hadoop client configs may not work correctly.
+All other settings are generally assumed to be the defaults currently. For instance, security is assumed to be enabled and SASL negotiation is always done, but on insecure clusters this will just do SIMPLE authentication. Any setups that require other customized Hadoop client configs may not work correctly.
 
 ## Building
 
@@ -89,19 +100,23 @@ cargo build
 ```
 
 ## Object store implementation
+
 An object_store implementation for HDFS is provided in the [hdfs-native-object-store](https://github.com/datafusion-contrib/hdfs-native-object-store) crate.
 
 ## Running tests
+
 The tests are mostly integration tests that utilize a small Java application in `rust/minidfs/` that runs a custom `MiniDFSCluster`. To run the tests, you need to have Java, Maven, Hadoop binaries, and Kerberos tools available and on your path. Any Java version between 8 and 17 should work.
 
 ```bash
-cargo test -p hdfs-native --features intergation-test
+cargo test -p hdfs-native --features integration-test
 ```
 
 ### Python tests
+
 See the [Python README](./python/README.md)
 
 ## Running benchmarks
+
 Some of the benchmarks compare performance to the JVM based client through libhdfs via the fs-hdfs3 crate. Because of that, some extra setup is required to run the benchmarks:
 
 ```bash
@@ -110,13 +125,15 @@ export CLASSPATH=$(hadoop classpath)
 ```
 
 then you can run the benchmarks with
+
 ```bash
 cargo bench -p hdfs-native --features benchmark
 ```
 
 The `benchmark` feature is required to expose `minidfs` and the internal erasure coding functions to benchmark.
 
 ## Running examples
+
 The examples make use of the `minidfs` module to create a simple HDFS cluster to run the example. This requires including the `integration-test` feature to enable the `minidfs` module. Alternatively, if you want to run the example against an existing HDFS cluster you can exclude the `integration-test` feature and make sure your `HADOOP_CONF_DIR` points to a directory with HDFS configs for talking to your cluster.
 
 ```bash
````
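The README changes above describe a three-step lookup order for the config directory: an explicitly set directory wins, then `$HADOOP_CONF_DIR`, then `$HADOOP_HOME/etc/hadoop`, otherwise no config files are loaded. That order can be sketched as a small helper (a hypothetical function for illustration only; the real resolution happens inside the Rust client):

```python
import os

def resolve_config_dir(explicit_dir=None, env=None):
    """Mirror the lookup order from the README (illustrative only):
    explicit dir > HADOOP_CONF_DIR > HADOOP_HOME/etc/hadoop > nothing."""
    env = os.environ if env is None else env
    if explicit_dir is not None:
        return explicit_dir
    if "HADOOP_CONF_DIR" in env:
        return env["HADOOP_CONF_DIR"]
    if "HADOOP_HOME" in env:
        return os.path.join(env["HADOOP_HOME"], "etc", "hadoop")
    return None
```

Passing an `env` dict explicitly keeps the sketch testable without touching the process environment.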

python/hdfs_native/__init__.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -131,8 +131,9 @@ def __init__(
         self,
         url: Optional[str] = None,
         config: Optional[Dict[str, str]] = None,
+        config_dir: Optional[str] = None,
     ):
-        self.inner = RawClient(url, config)
+        self.inner = RawClient(url, config, config_dir)
 
     def get_file_info(self, path: str) -> FileStatus:
         """Gets the file status for the file at `path`"""
@@ -362,8 +363,9 @@ def __init__(
         self,
         url: Optional[str] = None,
         config: Optional[Dict[str, str]] = None,
+        config_dir: Optional[str] = None,
     ):
-        self.inner = AsyncRawClient(url, config)
+        self.inner = AsyncRawClient(url, config, config_dir)
 
     async def get_file_info(self, path: str) -> FileStatus:
         """Gets the file status for the file at `path`"""
```
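The new `config_dir` argument points the client at a directory of standard Hadoop XML config files (`core-site.xml`, `hdfs-site.xml`). As an illustration of what such a directory contains, here is a sketch of flattening those files into a config dict using only the standard library (the `load_hadoop_config` helper is hypothetical; the actual parsing is done inside the Rust client):

```python
import os
import xml.etree.ElementTree as ET

def load_hadoop_config(config_dir):
    """Parse core-site.xml and hdfs-site.xml from config_dir into a flat
    dict of <name> -> <value> pairs (illustrative only)."""
    config = {}
    for filename in ("core-site.xml", "hdfs-site.xml"):
        path = os.path.join(config_dir, filename)
        if not os.path.exists(path):
            continue
        # Hadoop config files are <configuration> roots containing
        # <property><name>...</name><value>...</value></property> entries
        for prop in ET.parse(path).getroot().iter("property"):
            key = prop.findtext("name")
            value = prop.findtext("value")
            if key is not None and value is not None:
                config[key] = value
    return config
```

Missing files are simply skipped, matching the "load whichever site files exist" behavior the README describes.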

python/hdfs_native/_internal.pyi

Lines changed: 2 additions & 0 deletions

```diff
@@ -98,6 +98,7 @@ class RawClient:
         self,
         url: Optional[str],
         config: Optional[Dict[str, str]],
+        config_dir: Optional[str],
     ) -> None: ...
     def get_file_info(self, path: str) -> FileStatus: ...
     def list_status(self, path: str, recursive: bool) -> Iterator[FileStatus]: ...
@@ -159,6 +160,7 @@ class AsyncRawClient:
         self,
         url: Optional[str],
         config: Optional[Dict[str, str]],
+        config_dir: Optional[str],
     ) -> None: ...
     async def get_file_info(self, path: str) -> FileStatus: ...
     def list_status(self, path: str, recursive: bool) -> AsyncIterator[FileStatus]: ...
```

python/src/lib.rs

Lines changed: 20 additions & 4 deletions

```diff
@@ -413,8 +413,12 @@ struct RawClient {
 #[pymethods]
 impl RawClient {
     #[new]
-    #[pyo3(signature = (url, config))]
-    pub fn new(url: Option<&str>, config: Option<HashMap<String, String>>) -> PyResult<Self> {
+    #[pyo3(signature = (url, config, config_dir))]
+    pub fn new(
+        url: Option<&str>,
+        config: Option<HashMap<String, String>>,
+        config_dir: Option<&str>,
+    ) -> PyResult<Self> {
         // Initialize logging, ignore errors if this is called multiple times
         let _ = env_logger::try_init();
 
@@ -426,6 +430,10 @@ impl RawClient {
             builder = builder.with_url(url);
         }
 
+        if let Some(config_dir) = config_dir {
+            builder = builder.with_config_dir(config_dir);
+        }
+
         let inner = builder.build().map_err(PythonHdfsError::from)?;
 
         Ok(RawClient {
@@ -722,8 +730,12 @@ struct AsyncRawClient {
 #[pymethods]
 impl AsyncRawClient {
     #[new]
-    #[pyo3(signature = (url, config))]
-    pub fn new(url: Option<&str>, config: Option<HashMap<String, String>>) -> PyResult<Self> {
+    #[pyo3(signature = (url, config, config_dir))]
+    pub fn new(
+        url: Option<&str>,
+        config: Option<HashMap<String, String>>,
+        config_dir: Option<&str>,
+    ) -> PyResult<Self> {
         // Initialize logging, ignore errors if this is called multiple times
         let _ = env_logger::try_init();
 
@@ -735,6 +747,10 @@ impl AsyncRawClient {
             builder = builder.with_url(url);
         }
 
+        if let Some(config_dir) = config_dir {
+            builder = builder.with_config_dir(config_dir);
+        }
+
         let inner = builder.build().map_err(PythonHdfsError::from)?;
 
         Ok(Self { inner })
```
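Both constructors above follow the same wiring pattern: each optional argument is applied to the builder only when it was actually passed, so `None` never clobbers a default. A minimal Python sketch of that pattern (the `FakeBuilder` class and `build_settings` function are illustrative stand-ins, not part of the bindings):

```python
class FakeBuilder:
    """Toy stand-in for the Rust ClientBuilder (hypothetical)."""
    def __init__(self):
        self.settings = {}

    def with_url(self, url):
        self.settings["url"] = url
        return self

    def with_config_dir(self, config_dir):
        self.settings["config_dir"] = config_dir
        return self

def build_settings(url=None, config_dir=None):
    # Mirror the lib.rs pattern: only apply a setting when it was provided,
    # so an omitted keyword leaves the builder's default untouched
    builder = FakeBuilder()
    if url is not None:
        builder = builder.with_url(url)
    if config_dir is not None:
        builder = builder.with_config_dir(config_dir)
    return builder.settings
```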

rust/build.rs

Lines changed: 3 additions & 1 deletion

```diff
@@ -3,7 +3,9 @@ use std::io::Result;
 fn main() -> Result<()> {
     #[cfg(feature = "generate-protobuf")]
     {
-        std::env::set_var("PROTOC", protobuf_src::protoc());
+        unsafe {
+            std::env::set_var("PROTOC", protobuf_src::protoc());
+        }
 
         prost_build::compile_protos(
             &[
```

rust/src/client.rs

Lines changed: 60 additions & 13 deletions

```diff
@@ -190,16 +190,41 @@ impl IORuntime {
     }
 }
 
-/// Builds a new [Client] instance. By default, configs will be loaded from the default config directories with the following precedence:
+/// Builds a new [Client] instance. Configs will be loaded with the following precedence:
+///
+/// - If method `ClientBuilder::with_config_dir` is invoked, configs will be loaded from `${config_dir}/{core,hdfs}-site.xml`
 /// - If the `HADOOP_CONF_DIR` environment variable is defined, configs will be loaded from `${HADOOP_CONF_DIR}/{core,hdfs}-site.xml`
 /// - If the `HADOOP_HOME` environment variable is defined, configs will be loaded from `${HADOOP_HOME}/etc/hadoop/{core,hdfs}-site.xml`
-/// - Otherwise no default configs are defined
+/// - Otherwise no configs are defined
+///
+/// Finally, configs set by `with_config` will override the configs loaded above.
 ///
 /// If no URL is defined, the `fs.defaultFS` config must be defined and is used as the URL.
 ///
 /// # Examples
 ///
+/// Create a new client with a given config directory:
+///
+/// ```rust,no_run
+/// # use hdfs_native::ClientBuilder;
+/// let client = ClientBuilder::new()
+///     .with_config_dir("/opt/hadoop/etc/hadoop")
+///     .build()
+///     .unwrap();
+/// ```
+///
+/// Create a new client with the environment variable:
+///
+/// ```rust,no_run
+/// # use hdfs_native::ClientBuilder;
+/// unsafe { std::env::set_var("HADOOP_CONF_DIR", "/opt/hadoop/etc/hadoop") };
+/// let client = ClientBuilder::new()
+///     .build()
+///     .unwrap();
+/// ```
+///
 /// Create a new client using the fs.defaultFS config
+///
 /// ```rust
 /// # use hdfs_native::ClientBuilder;
 /// let client = ClientBuilder::new()
@@ -209,6 +234,7 @@ impl IORuntime {
 /// ```
 ///
 /// Create a new client connecting to a specific URL:
+///
 /// ```rust
 /// # use hdfs_native::ClientBuilder;
 /// let client = ClientBuilder::new()
@@ -218,6 +244,7 @@ impl IORuntime {
 /// ```
 ///
 /// Create a new client using a dedicated tokio runtime for spawned tasks and IO operations
+///
 /// ```rust
 /// # use hdfs_native::ClientBuilder;
 /// let client = ClientBuilder::new()
@@ -229,7 +256,8 @@ impl IORuntime {
 #[derive(Default)]
 pub struct ClientBuilder {
     url: Option<String>,
-    config: HashMap<String, String>,
+    config: Option<HashMap<String, String>>,
+    config_dir: Option<String>,
     runtime: Option<IORuntime>,
 }
 
@@ -245,15 +273,23 @@ impl ClientBuilder {
         self
     }
 
-    /// Set configs to use for the client. The provided configs will override any found in the default config files loaded
+    /// Set configs to use for the client. The provided configs will override any found in the config files loaded
     pub fn with_config(
         mut self,
         config: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
     ) -> Self {
-        self.config = config
-            .into_iter()
-            .map(|(k, v)| (k.into(), v.into()))
-            .collect();
+        self.config = Some(
+            config
+                .into_iter()
+                .map(|(k, v)| (k.into(), v.into()))
+                .collect(),
+        );
+        self
+    }
+
+    /// Set the configuration directory path to read from. The provided path will override the one provided by the environment variables.
+    pub fn with_config_dir(mut self, config_dir: impl Into<String>) -> Self {
+        self.config_dir = Some(config_dir.into());
         self
     }
 
@@ -266,7 +302,7 @@ impl ClientBuilder {
 
     /// Create the [Client] instance from the provided settings
    pub fn build(self) -> Result<Client> {
-        let config = Configuration::new_with_config(self.config)?;
+        let config = Configuration::new(self.config_dir, self.config)?;
         let url = if let Some(url) = self.url {
             Url::parse(&url)?
         } else {
@@ -323,18 +359,18 @@ impl Client {
     #[deprecated(since = "0.12.0", note = "Use ClientBuilder instead")]
     pub fn new(url: &str) -> Result<Self> {
         let parsed_url = Url::parse(url)?;
-        Self::build(&parsed_url, Configuration::new()?, None)
+        Self::build(&parsed_url, Configuration::new(None, None)?, None)
     }
 
     #[deprecated(since = "0.12.0", note = "Use ClientBuilder instead")]
     pub fn new_with_config(url: &str, config: HashMap<String, String>) -> Result<Self> {
         let parsed_url = Url::parse(url)?;
-        Self::build(&parsed_url, Configuration::new_with_config(config)?, None)
+        Self::build(&parsed_url, Configuration::new(None, Some(config))?, None)
     }
 
     #[deprecated(since = "0.12.0", note = "Use ClientBuilder instead")]
     pub fn default_with_config(config: HashMap<String, String>) -> Result<Self> {
-        let config = Configuration::new_with_config(config)?;
+        let config = Configuration::new(None, Some(config))?;
         Self::build(&Self::default_fs(&config)?, config, None)
     }
 
@@ -1138,7 +1174,7 @@ mod test {
     fn create_protocol(url: &str) -> Arc<NamenodeProtocol> {
         let proxy = NameServiceProxy::new(
             &Url::parse(url).unwrap(),
-            Arc::new(Configuration::new().unwrap()),
+            Arc::new(Configuration::new(None, None).unwrap()),
             RT.handle().clone(),
         )
         .unwrap();
@@ -1310,4 +1346,15 @@ mod test {
             .is_ok()
         );
     }
+
+    #[test]
+    fn test_set_conf_dir() {
+        assert!(
+            ClientBuilder::new()
+                .with_url("hdfs://127.0.0.1:9000")
+                .with_config_dir("target/test")
+                .build()
+                .is_ok()
+        )
+    }
 }
```
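The updated `ClientBuilder` docs spell out the final precedence: configs loaded from files (explicit dir, then environment-derived dirs) form the base, and anything passed via `with_config` overrides them. Those merge semantics can be sketched as follows (the `effective_config` helper is illustrative, not the library API):

```python
def effective_config(file_config, overrides=None):
    """Merge semantics from the ClientBuilder docs: configs loaded from
    the config files form the base, and with_config entries win on
    conflict (illustrative helper only)."""
    merged = dict(file_config)   # start from file-loaded configs
    merged.update(overrides or {})  # with_config overrides take precedence
    return merged
```

Keys present only in the files survive the merge, so a single runtime override does not discard the rest of the loaded configuration.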
