
Commit 2a4d703

add details on a new cause of the "no TGT" error, with a detailed description of the failure scenario and fix
1 parent 234d83c commit 2a4d703

File tree

3 files changed: +189 −1 lines changed


sections/errors.md

Lines changed: 32 additions & 1 deletion
@@ -90,6 +90,8 @@ Possible causes:
1. Your process was issued with a ticket, which has now expired.
1. You did specify a keytab but it isn't there or is somehow otherwise invalid.
1. You don't have the Java Cryptography Extensions installed.
1. The principal isn't in the same realm as the service, so a matching TGT cannot be found.
   That is: you have a TGT, it's just for the wrong realm.

## `Failure unspecified at GSS-API level (Mechanism level: Checksum failed)`
@@ -392,4 +394,33 @@ Possible causes

## SASL `No common protection layer between client and server`

This is not a Kerberos problem: it comes from SASL itself.

```
16/01/22 09:44:17 WARN Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: DIGEST-MD5: No common protection layer between client and server
	at com.sun.security.sasl.digest.DigestMD5Client.checkQopSupport(DigestMD5Client.java:418)
	at com.sun.security.sasl.digest.DigestMD5Client.evaluateChallenge(DigestMD5Client.java:221)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:558)
	at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:373)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:723)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:722)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
	at org.apache.hadoop.ipc.Client.call(Client.java:1397)
	at org.apache.hadoop.ipc.Client.call(Client.java:1358)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy23.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
```

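The `DIGEST-MD5` negotiation raises this when the quality-of-protection list the client asked for has no overlap with the one the server offered. If the mismatch is in Hadoop's RPC settings, making `hadoop.rpc.protection` agree on at least one level on both client and server should clear it. A sketch; the value chosen here is purely illustrative:

```xml
<!-- Sketch: pick a level the server also offers; "privacy" is only an example. -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```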
sections/ipc.md

Lines changed: 37 additions & 0 deletions
@@ -33,6 +33,43 @@ token.
1. Applications may explicitly request delegation tokens to forward to other processes.
1. Delegation tokens are renewed in a background thread (which?).

## IPC authentication options

Hadoop IPC uses [SASL](sasl.html) to authenticate, sign and potentially encrypt
communications.

## Use Kerberos to authenticate sender and recipient

```xml
<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication</value>
</property>
```

## Kerberos to authenticate sender and recipient, Checksums for tamper-protection

```xml
<property>
  <name>hadoop.rpc.protection</name>
  <value>integrity</value>
</property>
```

## Kerberos to authenticate sender and recipient, Wire Encryption

```xml
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```

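For reference, these three values correspond, in order, to the SASL QOP levels `auth`, `auth-int` and `auth-conf`; client and server must have at least one level in common or connections fail during SASL negotiation. Recent Hadoop versions also accept a comma-separated list of values, letting the negotiation pick the strongest shared level. A sketch, to be verified against the version in use:

```xml
<!-- Sketch: offer both levels and let the SASL negotiation choose.
     Older releases may accept only a single value. -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>integrity,privacy</value>
</property>
```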
## Adding a new IPC interface to a Hadoop Service/Application

This is "fiddly". It's not impossible, it just involves effort.

sections/terrors.md

Lines changed: 120 additions & 0 deletions
@@ -33,3 +33,123 @@ the client connection —without any notification to the client. Rather than a n
When a Kerberos keytab is created, the entries in it have a lifespan. The default value is one
year. This was its first birthday, hence ZK wouldn't trust the client.

**Fix: create new keytabs, valid for another year, and distribute them.**
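
A sketch of that fix with the MIT Kerberos command-line tools; the principal and keytab path here are hypothetical, and note that `ktadd` also rotates the key, so every copy of the keytab must be replaced:

```
# On a host with kadmin access; principal and path are illustrative.
kadmin -p admin/admin -q "ktadd -k /etc/security/keytabs/zk.service.keytab zookeeper/zk1.example.com"

# Check the regenerated entries and their timestamps.
klist -kt /etc/security/keytabs/zk.service.keytab
```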

## The Principal With No Realm

This one showed up during release testing —credit to Andras Bokor for tracking it all down.

A stack trace:

```
16/01/16 01:42:39 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "os-u14-2-2.novalocal/172.22.73.243"; destination host is: "os-u14-2-3.novalocal":8020;
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
	at org.apache.hadoop.ipc.Client.call(Client.java:1431)
	at org.apache.hadoop.ipc.Client.call(Client.java:1358)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
```

This looks like a normal "not logged in" problem, except for some little facts:

1. The user was logged in.
1. The failure was replicable.
1. It only surfaced on OpenJDK, not Oracle JDK.
1. Everything worked on OpenJDK 7u51, but not on OpenJDK 7u91.

Something had changed in the JDK to reject the login on this system (Ubuntu, virtual test cluster).

`KDiag` didn't throw up anything obvious. What did show some warning signs was `klist`:

```
Ticket cache: FILE:/tmp/krb5cc_2529
Default principal: qe@REALM

Valid starting       Expires              Service principal
01/16/2016 11:07:23  01/16/2016 21:07:23  krbtgt/REALM@REALM
	renew until 01/23/2016 11:07:23
01/16/2016 13:13:11  01/16/2016 21:07:23  HTTP/hdfs-3-5@
	renew until 01/23/2016 11:07:23
01/16/2016 13:13:11  01/16/2016 21:07:23  HTTP/hdfs-3-5@REALM
	renew until 01/23/2016 11:07:23
```

See that? There's a principal which doesn't have a stated realm. Does that matter?

In Oracle JDK, and in OpenJDK 7u51, apparently not. In OpenJDK 7u91: yes.

There's some new code in `sun.security.krb5.PrincipalName`:

```java
// Validate a nameStrings argument
private static void validateNameStrings(String[] ns) {
    if (ns == null) {
        throw new IllegalArgumentException("Null nameStrings not allowed");
    }
    if (ns.length == 0) {
        throw new IllegalArgumentException("Empty nameStrings not allowed");
    }
    for (String s: ns) {
        if (s == null) {
            throw new IllegalArgumentException("Null nameString not allowed");
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("Empty nameString not allowed");
        }
    }
}
```

This validates the components of a principal name, rejecting any null or empty entry. Now, how
does something invalid get in? Setting `HADOOP_JAAS_DEBUG=true` and logging at debug level
turned up the difference.

With 7u51:

```
16/01/20 15:13:20 DEBUG security.UserGroupInformation: using kerberos user:qe@REALM
```

With 7u91:

```
16/01/20 15:10:44 DEBUG security.UserGroupInformation: using kerberos user:null
```

Which means that the default principal wasn't being picked up; instead some JVM-specific
introspection had kicked in —and it was finding the principal without a realm, rather than the
one with it.

**Fix: add a `domain_realm` section to `/etc/krb5.conf` mapping hostnames to realms.**

```
[domain_realm]
hdfs-3-5.novalocal = REALM
```

A `klist` then returns a list of credentials without the realm-less entry:

```
Valid starting       Expires              Service principal
01/17/2016 14:49:08  01/18/2016 00:49:08  krbtgt/REALM@REALM
	renew until 01/24/2016 14:49:08
01/17/2016 14:49:16  01/18/2016 00:49:08  HTTP/hdfs-3-5@REALM
	renew until 01/24/2016 14:49:08
```

Because this was a virtual cluster, DNS/RDNS probably wasn't working; presumably Kerberos
didn't know what realm the host was in, and things went downhill. It just didn't show up in
any validation operations, merely in the classic "no TGT" error.
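
A broader form of that mapping, assuming every cluster host lives under the same domain: in `[domain_realm]`, an entry with a leading dot maps all hosts in that domain at once, so individual hostnames don't need listing.

```
[domain_realm]
.novalocal = REALM
novalocal = REALM
```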
