Commit 2ebd2a3

committed
ongoing
1 parent ac9d5c6 commit 2ebd2a3

2 files changed

+37
-10
lines changed

book.json

Lines changed: 12 additions & 6 deletions
@@ -1,20 +1,26 @@
 {
 "gitbook": ">=2.0.0" ,
-"variables": {
+"title": "Hadoop and Kerberos: The Madness beyond the Gate",
+"description": "A terrifying dive into the depths of Hadoop security",
+"variables": {
 "asf": "Apache Software Foundation",
 "author":"Steve Loughran",
-"title": "Kerberos and Hadoop: The Madness Beyond The Gate",
+"title": "Hadoop and Kerberos: The Madness beyond the Gate",
 "hadoop-latest": "2.7.1"
 },
-"plugins": ["autocover"],
+"plugins": [
+"autocover",
+"katex",
+"printlinks",
+"include-codeblock"],
 "pluginsConfig": {
 "fontSettings": {
 "theme": "night",
 "family": "serif",
 "size": 1
 },
 "autocover": {
-"title": "Kerberos and Hadoop",
+"title": "Kerberos and Hadoop: The Madness Beyond the Gate",
 "author": "Steve Loughran",
 "font": {
 "size": null,
@@ -47,9 +53,9 @@
 },
 
 "comment":"//Header HTML template. Available variables: _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_.",
-"headerTemplate": "_TITLE_",
+"headerTemplate-off": "_TITLE_",
 
 "comment":"//Footer HTML template. Available variables: _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_.",
-"footerTemplate": "_PAGENUM_"
+"footerTemplate-off": "_PAGENUM_"
 }
 }

sections/kerberos_the_madness.md

Lines changed: 25 additions & 4 deletions
@@ -45,7 +45,7 @@ What has been learned cannot be unlearned(*)
 
 # Foundational Concepts
 
-What is the problem that Hadoop security is trying to address?
+What is the problem that Hadoop security is trying to address? Securing Hadoop.
 
 Apache Hadoop is "an OS for data".
 A Hadoop cluster can rapidly become the largest stores of data in an organisation.
@@ -71,7 +71,7 @@ In particular, any web UI or IPC service they instantiate needs to have its acce
 
 ## Authentication
 
-The authentication problem: who is a caller identifying themselves as --and can you verify
+The authentication problem: who is a caller identifying themselves as and can you verify
 that they really are this person.
 
 In an unsecure cluster, all callers to HDFS, YARN and other services are trusted to be
@@ -91,7 +91,7 @@ users. When cluster node labels are used to differentiate parts of the cluster (
 more RAM, GPUs or other features), then the queues can be used to restrict access
 to specific sets of nodes.
 
-Similarly, HBase & Accumulo have their users and permissions, while Hive uses the
+Similarly, HBase and Accumulo have their users and permissions, while Hive uses the
 permissions of the source files as its primary access control mechanism.
 
 These various mechanisms are all a bit disjoint, hence the emergence of tools
@@ -118,9 +118,30 @@ that when making queries across encrypted datasets, temporary data files are als
 in the same encryption zone, to stop the intermediate data being stored unencrypted.
 And of course, analytics code running in the servers may also intentionally or unintentionally
 persist the sensitive data in an unencrypted form: the local filesystem, OS swap space
-and even OS hibernate-time memory snapshots need to be
+and even OS hibernate-time memory snapshots need to be managed.
 
 Before rushing to enable persistent data encryption, then, you need to consider: what is the
 goal here?
 
+What at-REST encryption does deliver is better guarantees that data stored in hard disks
+is not recoverable —at least on the HDFS side. However, as OS-level data can persist,
+(strongly) wiping HDDs prior to disposal is still going to be necessary to guarantee
+destruction of the data.
+
 ## Auditing & Governance
+
+Authenticated and Authorized users should not just be able to perform actions
+or read and write data —this should all be logged in *Audit Logs* so that
+if there is ever a need to see which files a user accessed, or what individual
+made specific requests of a service —that information is available. Audit logs
+should be
+
+1. Separate log categories from normal processing logs, so log configurations
+can store them in separate locations, with different persistence policies.
+1. Machine Parseable. This allows the audit logs themselves to be analyzed. This
+does not just have to be for security reasons; Spotify have disclosed that they
+run analysis over their HDFS audit logs to identify which files are most popular (and
+hence should have their replication factor increased), and which do not get
+used more then 7 days after their creation —and hence can be automatically deleted
+as part of a workflow.
+
