
Commit da177ac

book in sync with apachecon talk, october 2015

1 parent 428673f commit da177ac

13 files changed: +318 -133 lines

sections/biblography.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@

1. [JAAS Configuration (Java 8)](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/LoginConfigFile.html)
1. For OS/X users, the GUI ticket viewer is `/System/Library/CoreServices/Ticket\ Viewer.app`
1. [Colouris01], Coulouris, Dollimore & Kindberg, 2001, *Distributed Systems: Concepts and Design*,
1. [Java 8 GSS API](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/jgss-features.html)

### Kerberos, Active Directory and Apache Hadoop

sections/checklists.md

Lines changed: 2 additions & 0 deletions
@@ -58,6 +58,8 @@

[ ] Container Credentials are retrieved in AM and containers.

[ ] Delegation tokens revoked during (managed) teardown.

## YARN Web UIs and REST endpoints

[ ] Primary Web server: `AmFilterInitializer` used to redirect requests to the RM Proxy.

sections/errors.md

Lines changed: 20 additions & 2 deletions
@@ -19,6 +19,22 @@

> *[Supernatural Horror in Literature](https://en.wikisource.org/wiki/Supernatural_Horror_in_Literature), HP Lovecraft, 1927.*

Security error messages appear to take pride in providing limited information. In particular,
they are usually some generic `IOException` wrapping a generic security exception. There is some
text in the message, but it is often `Failure unspecified at GSS-API level`, which means
"something went wrong".

Generally, a stack trace with UGI in it indicates a security problem, *though it can be a network
problem surfacing in the security code*.

The underlying causes of problems are usually the standard ones of distributed systems: networking
and configuration.

In [HADOOP-12426](https://issues.apache.org/jira/browse/HADOOP-12426) I've proposed a CLI entry point
for health checking this. Volunteers to implement it are welcome.

# OS/JVM Layer; GSS library

Some of these are covered in Oracle's Troubleshooting Kerberos docs.
@@ -27,7 +43,8 @@ This section just highlights some of the common causes, other causes that Oracle

## Server not found in Kerberos database (7)

* DNS is a mess and your machine does not know its own name.
* Your machine has a hostname, but the service principal is a `/_HOST` wildcard and there is
no entry in the keytab for that hostname.
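To make the `/_HOST` expansion concrete, here is a minimal sketch (the principal pattern, realm and
hostname are invented for illustration) of how Hadoop expands the wildcard before asking the KDC for
a service ticket; if the expanded principal does not match what the KDC and the service keytab know
about, this is the error you see.

```
import org.apache.hadoop.security.SecurityUtil;

public class HostWildcardDemo {
  public static void main(String[] args) throws Exception {
    // _HOST is replaced with the hostname supplied (or the local canonical hostname if
    // none is given); the result must match an entry in the service keytab.
    String principal =
        SecurityUtil.getServerPrincipal("hdfs/_HOST@EXAMPLE.COM", "worker03.example.com");
    System.out.println(principal);   // hdfs/worker03.example.com@EXAMPLE.COM
  }
}
```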

## No valid credentials provided (Mechanism level: Illegal key size)

@@ -59,6 +76,7 @@ This comes from the clocks on the machines being too far out of sync.

This can surface if you are doing Hadoop work on some VMs and have been suspending and resuming them;
they've lost track of when they are. Reboot them.

If it's a physical cluster, make sure that your NTP daemons are pointing at the same NTP server, one
that is actually reachable from the Hadoop cluster, and that the timezone settings of all the hosts
are consistent.

## KDC has no support for encryption type
@@ -79,7 +97,7 @@ an error about checksums.

## Principal not found

The hostname is wrong (or there is more than one hostname listed with different IP addresses), so a
principal of the form `user/_HOST@REALM` is coming back with the wrong host, and the KDC doesn't
find it.

See the comments above about DNS for some more possibilities.

sections/hadoop_and_kerberos.md

Lines changed: 11 additions & 1 deletion
@@ -11,7 +11,7 @@

See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

# Hadoop's support for Kerberos

Hadoop can use Kerberos to authenticate users, and processes running within a

@@ -31,3 +31,13 @@ to interact with a Hadoop cluster and applications running in it *do need to kno

This is what this book attempts to cover.

## Why do they inflict so much pain on us?

Before going any further, here's a recurring question: why? Why Kerberos and not, say, some
SSL-certificate-like system? Or OAuth?

Kerberos was written to support centrally managed accounts in a local area network, one in
which administrators manage individual accounts. This is actually much simpler to manage than
PKI-certificate-based systems: look at the effort it takes to revoke a certificate in a browser.

OAuth?

sections/hadoop_tokens.md

Lines changed: 87 additions & 29 deletions
@@ -62,7 +62,8 @@ public class BlockTokenIdentifier extends TokenIdentifier {

Alongside the fields covering the block and permissions, that `cache` data contains the token
identifier.

## Kerberos Tickets vs Hadoop Tokens

@@ -107,6 +108,9 @@ Alongside the fields covering the block and permissions, that `cache` data conta

1. The tokens must be renewed before they expire: once expired, a token is worthless.
1. Token renewers can be implemented as a Hadoop RPC service, or by other means, *including HTTP*.
1. Token renewal may simply be the updating of an expiry time in the server, without pushing
out new tokens to the clients. This scales well when there are many processes across
the cluster associated with a single application.

For the HDFS Client protocol, the client protocol itself is the token renewer. A client may
talk to the Namenode using its current token, and request a new one, so refreshing it.
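As a sketch of that renewal cycle (`fs` here is assumed to be a `FileSystem` client bound to a
kerberized HDFS instance; the names are illustrative, not taken from the Hadoop source):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class RenewalSketch {
  public static void renew(FileSystem fs, Configuration conf) throws Exception {
    // name ourselves as the renewer, so this process may later extend the token's life
    String renewer = UserGroupInformation.getCurrentUser().getShortUserName();
    Token<?> token = fs.getDelegationToken(renewer);
    // the server pushes its recorded expiry time forward; the token bytes do not change
    long newExpiry = token.renew(conf);
    System.out.println("Token valid until " + newExpiry);
  }
}
```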
@@ -138,7 +142,7 @@ in some form of storage shared across the failover services.

The benefit: there's no need to involve the KDC in authenticating requests, yet short-lived access
can be granted to applications running in the cluster. This explicitly avoids the problem of having
1000+ containers in a YARN application each trying to talk to the KDC. (Issue: surely tickets
offer that feature?).

## Example


@@ -178,20 +182,61 @@ offer that feature?)

is currently considered valid (based on the expiry time and the clock value of the Name Node)

## What does this mean for my application?

If you are writing an application, what does this mean?

You need to worry about tokens in servers if:

1. You want to support secure connections without requiring Kerberos
authentication at the rate of the maximum life of a kerberos ticket.
1. You want to allow applications to delegate authority, such
as to YARN applications, or other services. (For example, filesystem delegation tokens
provided to a Hive thrift server could be used to access the filesystem
as that user.)
1. You want a consistent client/server authentication and identification
mechanism across secure and insecure clusters. This is exactly what YARN does:
a token is issued by the YARN Resource Manager to an application instance's
Application Master at launch time; this is used in all communications from
the AM to the RM. Using tokens *always* means there is no separate codepath
between insecure and secure clusters.

You need to worry about tokens in client applications if you wish
to interact with Hadoop services. If the client is required to run
on a kerberos-authenticated account (e.g. via `kinit` or a keytab), then
your main concern is simply making sure the principal is logged in.

If your application wishes to run code in the cluster using the YARN scheduler, you need to
worry about Hadoop tokens directly. You will need to request delegation tokens
from every service with which your application will interact, include them in the YARN
launch information, and propagate them from your Application Master to all
containers the application launches.
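A sketch of what that client-side collection typically looks like; `fs` is the cluster filesystem
and `amContainer` is the `ContainerLaunchContext` being built for the Application Master, both
assumed rather than taken from the text above.

```
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TokenCollection {
  public static void addTokens(FileSystem fs, Configuration conf,
      ContainerLaunchContext amContainer) throws Exception {
    Credentials credentials = new Credentials();
    // the Resource Manager principal is named as the renewer, so it can keep the tokens alive
    String renewer = conf.get(YarnConfiguration.RM_PRINCIPAL);
    // ask the filesystem (and any filesystems it depends on) for delegation tokens
    fs.addDelegationTokens(renewer, credentials);
    // marshal the credentials into the byte buffer carried in the AM launch context
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    amContainer.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
```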

## Design

(from Owen's design document)

Namenode Token:

    TokenID = {ownerID, renewerID, issueDate, maxDate, sequenceNumber}
    TokenAuthenticator = HMAC-SHA1(masterKey, TokenID)
    Delegation Token = {TokenID, TokenAuthenticator}

The token ID is used in messages from the client to identify the client; the service can
rebuild the `TokenAuthenticator` from it; this is the secret used for DIGEST-MD5 signing
of requests.

Token renewal: the caller asks the service provider for a token to be renewed. The server updates
the expiry date in its local table to `min(maxDate, now()+renew_period)`. A non-HA NN
can use these renewal requests to actually rebuild its token table, provided the master
key has been persisted.
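A minimal sketch of that `TokenAuthenticator` calculation using the JCE directly; the layout of the
`tokenId` bytes is whatever the token identifier serializes to, and the master key never leaves the
service.

```
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class AuthenticatorSketch {
  static byte[] tokenAuthenticator(byte[] masterKey, byte[] tokenId) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
    return mac.doFinal(tokenId);   // this value becomes the token's "password" field
  }
}
```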

## Implementation Details

What is inside a Hadoop Token? Whatever the service provider wishes to supply.

A token is treated as a byte array to be passed
in communications, such as when setting up an IPC
@@ -215,7 +260,7 @@ used to represent a token in Java code; it contains

|-------|------|------|
| identifier | `ByteBuffer` | the service-specific data within a token |
| password | `ByteBuffer` | a password |
| tokenKind | `String` | the token kind, used for looking up tokens |
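For reference, a small sketch of reading those fields back through the
`org.apache.hadoop.security.token.Token` accessors (the Java class exposes them as `byte[]` and
`Text`; the `token` instance is assumed to have been obtained elsewhere, for example from
`FileSystem.getDelegationToken()`):

```
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;

public class TokenFields {
  static void dump(Token<?> token) {
    Text kind = token.getKind();                  // token kind, e.g. "HDFS_DELEGATION_TOKEN"
    Text service = token.getService();            // which service instance the token is for
    byte[] identifier = token.getIdentifier();    // the service-specific identifier bytes
    // token.getPassword() is the shared secret; deliberately not printed here
    System.out.println("Token " + kind + " for " + service
        + " (" + identifier.length + " byte identifier)");
  }
}
```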
### `SecretManager`
@@ -229,10 +274,26 @@ This contains a "secret" (generated by the `javax.crypto` libraries), adding ser

and equality checks. Because of this the keys can be persisted (as HDFS does) or sent
over a secure channel. Uses crop up in YARN's `ZKRMStateStore`, the MapReduce History server
and the YARN Application Timeline Service.

### How tokens are issued

A first step is determining the Kerberos Principal for a service:

1. Service name is derived from the URI (see `SecurityUtil.buildDTServiceName`); different
services on the same host have different service names.
1. Every service has a protocol (usually defined by the RPC protocol API).
1. To find a token for a service, the client enumerates all `SecurityInfo` instances; these
return info about the provider. One class, `AnnotatedSecurityInfo`, examines the annotations
on the class to determine these values, including looking in the Hadoop configuration
to determine the kerberos principal declared for that service (see [IPC](ipc.html) for specifics).

With a Kerberos principal,
With a Kerberos principal,
296+
236297

237298
### How tokens are refreshed
238299

@@ -255,21 +316,18 @@ the client-side launcher code to collect the tokens needed, and pass them

to the launch context used to launch the Application Master.

### Proxy Users

Proxy users are a feature which was included in the Hadoop security model for services
such as Oozie: services which need to be able to execute work on behalf of a user.

Because the time at which Oozie would execute future work cannot be determined, delegation
tokens cannot be used to authenticate requests issued by Oozie on behalf of a user.
Kerberos keytabs are a possible solution here, but they would require every user submitting
work to Oozie to have a keytab and to pass it to Oozie.
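A sketch of the proxy-user mechanism in code: the service (assumed to be logged in from its own
keytab) impersonates the submitting user. The user name "alice" is invented, and the impersonation
must be permitted by the cluster's `hadoop.proxyuser.*` settings.

```
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    UserGroupInformation service = UserGroupInformation.getLoginUser();   // e.g. oozie@REALM
    UserGroupInformation proxy = UserGroupInformation.createProxyUser("alice", service);
    // work inside doAs() is performed as "alice", authenticated with the service's own
    // credentials plus the cluster's proxy-user ACLs
    proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
      FileSystem fs = FileSystem.get(conf);
      fs.listStatus(new Path("/user/alice"));
      return null;
    });
  }
}
```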

## Weaknesses

1. Any compromised DN can create block tokens.

sections/ipc.md

Lines changed: 64 additions & 25 deletions
@@ -39,6 +39,16 @@ This is "fiddly". It's not impossible, it just involves effort.

In its favour: it's a lot easier than SPNEGO.

### Annotating a service interface

```
@KerberosInfo(serverPrincipal = "my.kerberos.principal")
public interface MyRpc extends VersionedProtocol {
  long versionID = 0x01;
  ...
}
```

### `SecurityInfo` subclass

Every exported RPC service will need its own extension of the `SecurityInfo` class, to provide two things:
@@ -48,16 +58,40 @@ Every exported RPC service will need its own extension of the `SecurityInfo` cla

### `PolicyProvider` subclass

```
public class MyRpcPolicyProvider extends PolicyProvider {

  public Service[] getServices() {
    return new Service[] {
        new Service("my.protocol.acl", MyRpc.class)
    };
  }

}
```

This is used to inform the RPC infrastructure of the ACL policy: who may talk to the service.
It must be explicitly passed to the RPC server:

```
rpcService.getServer().refreshServiceAcl(serviceConf, new MyRpcPolicyProvider());
```

In practice, the ACL list is usually configured with a list of groups, rather than individual users.

### `SecurityInfo` class

```
public class MyRpcSecurityInfo extends SecurityInfo { ... }
```
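The stub above can be fleshed out along the following lines. This is a sketch modelled on how
existing `SecurityInfo` implementations in Hadoop and YARN are written, reusing the hypothetical
`MyRpc` interface and `my.kerberos.principal` configuration key from the earlier examples.

```
import java.lang.annotation.Annotation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.KerberosInfo;
import org.apache.hadoop.security.SecurityInfo;
import org.apache.hadoop.security.token.TokenInfo;

public class MyRpcSecurityInfo extends SecurityInfo {

  @Override
  public KerberosInfo getKerberosInfo(Class<?> protocol, Configuration conf) {
    if (!MyRpc.class.equals(protocol)) {
      return null;                        // not our protocol: let another provider answer
    }
    return new KerberosInfo() {
      @Override
      public Class<? extends Annotation> annotationType() {
        return KerberosInfo.class;
      }

      @Override
      public String serverPrincipal() {
        return "my.kerberos.principal";   // configuration key naming the server principal
      }

      @Override
      public String clientPrincipal() {
        return null;                      // no client principal configuration key
      }
    };
  }

  @Override
  public TokenInfo getTokenInfo(Class<?> protocol, Configuration conf) {
    return null;                          // this protocol does not define its own token kind
  }
}
```

The class is then registered in the resource file described in the next section.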

### `SecurityInfo` resource file

The resource file `META-INF/services/org.apache.hadoop.security.SecurityInfo` lists all RPC APIs which have a matching `SecurityInfo` subclass in that JAR:

    org.example.rpc.MyRpcSecurityInfo

The RPC framework will read this file and build up the security information for the APIs (server side? Client side? both?)

@@ -70,32 +104,37 @@ the server can determine the identity of the principal.

This is something it can ask for when handling the RPC Call:

```
UserGroupInformation callerUGI;

// #1: get the current user identity
try {
  callerUGI = UserGroupInformation.getCurrentUser();
} catch (IOException ie) {
  LOG.info("Error getting UGI ", ie);
  AuditLogger.logFailure("UNKNOWN", "Error getting UGI");
  throw RPCUtil.getRemoteException(ie);
}
```

The `callerUGI` variable is now set to the identity of the caller. If the caller
has delegated authority (tickets, tokens) then they still authenticate as
the principal they were acting as (possibly via a `doAs()` call).

```
// #2 verify their permissions
String user = callerUGI.getShortUserName();
if (!checkAccess(callerUGI, MODIFY)) {
  AuditLog.unauthorized(user,
      KILL_CONTAINER_REQUEST,
      "User doesn't have permissions to " + MODIFY);
  throw RPCUtil.getRemoteException(new AccessControlException(
      user + " lacks access " + MODIFY.name()));
}
AuditLog.authorized(user, KILL_CONTAINER_REQUEST);
```

In this example, there's a check to see if the caller can make a request which modifies
something in the service; if not, the call is rejected.
