@@ -45,7 +45,7 @@ What has been learned cannot be unlearned(*)
# Foundational Concepts
- What is the problem that Hadoop security is trying to address?
+ What is the problem that Hadoop security is trying to address? Securing Hadoop: making sure
+ that only the people and programs which are meant to have access to the cluster's data and
+ services actually get it.
Apache Hadoop is "an OS for data".
A Hadoop cluster can rapidly become one of the largest stores of data in an organisation.
@@ -71,7 +71,7 @@ In particular, any web UI or IPC service they instantiate needs to have its acce
## Authentication
- The authentication problem: who is a caller identifying themselves as -- and can you verify
+ The authentication problem: who is a caller identifying themselves as — and can you verify
that they really are this person.
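As a concrete illustration, here is a minimal sketch, assuming a Kerberized cluster, of how a
Java client might authenticate through Hadoop's `UserGroupInformation` API before talking to
HDFS. The principal, keytab path and directory are illustrative placeholders, and
`fs.defaultFS` is assumed to point at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client libraries that the cluster requires Kerberos.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate against the KDC with a keytab rather than an interactive password.
    // Principal and keytab path are hypothetical examples.
    UserGroupInformation.loginUserFromKeytab("alice@EXAMPLE.COM",
        "/etc/security/keytabs/alice.keytab");

    // Subsequent RPC calls now carry Kerberos-backed credentials for "alice".
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/user/alice")));
  }
}
```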
In an unsecure cluster, all callers to HDFS, YARN and other services are trusted to be
@@ -91,7 +91,7 @@ users. When cluster node labels are used to differentiate parts of the cluster (
more RAM, GPUs or other features), then the queues can be used to restrict access
to specific sets of nodes.
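As one sketch of how such a restriction might look with the capacity scheduler (and assuming
`yarn.acl.enable` is set to `true` in `yarn-site.xml`), queue ACLs and node-label access can be
declared in `capacity-scheduler.xml`; the `gpu` queue, the user and group names and the `gpu`
node label below are hypothetical.

```xml
<!-- Hypothetical queue "gpu": who may submit to it, and which node label it may use. -->
<property>
  <!-- ACL format: comma-separated users, a space, then comma-separated groups. -->
  <name>yarn.scheduler.capacity.root.gpu.acl_submit_applications</name>
  <value>alice,bob datascience</value>
</property>
<property>
  <!-- Applications in this queue may request containers on nodes labelled "gpu". -->
  <name>yarn.scheduler.capacity.root.gpu.accessible-node-labels</name>
  <value>gpu</value>
</property>
```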
- Similarly, HBase & Accumulo have their users and permissions, while Hive uses the
+ Similarly, HBase and Accumulo have their users and permissions, while Hive uses the
permissions of the source files as its primary access control mechanism.
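For the file-permission route, that usually comes down to POSIX-style ownership and, where a
single owner/group/other split is not enough, HDFS ACLs. A hedged sketch with an illustrative
warehouse path, owner and groups (the ACL commands need `dfs.namenode.acls.enabled=true`):

```bash
# Restrict the (hypothetical) warehouse directory to its owning service and group.
hdfs dfs -chown -R hive:analysts /warehouse/sales.db
hdfs dfs -chmod -R 750 /warehouse/sales.db

# Grant read access to one extra group without opening up "other" permissions.
hdfs dfs -setfacl -R -m group:audit:r-x /warehouse/sales.db
hdfs dfs -getfacl /warehouse/sales.db
```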
These various mechanisms are all a bit disjoint, hence the emergence of tools
@@ -118,9 +118,30 @@ that when making queries across encrypted datasets, temporary data files are als
in the same encryption zone, to stop the intermediate data being stored unencrypted.
And of course, analytics code running in the servers may also intentionally or unintentionally
persist the sensitive data in an unencrypted form: the local filesystem, OS swap space
- and even OS hibernate-time memory snapshots need to be
+ and even OS hibernate-time memory snapshots need to be managed.
Before rushing to enable persistent data encryption, then, you need to consider: what is the
goal here?
+ What at-rest encryption does deliver is a better guarantee that data stored on hard disks
+ is not recoverable, at least on the HDFS side. However, as OS-level copies of the data
+ can persist, (strongly) wiping HDDs prior to disposal is still going to be necessary to
+ guarantee destruction of the data.
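On the HDFS side this is done with per-directory encryption zones backed by the Hadoop KMS. A
minimal sketch, assuming a KMS is already configured; the key name and path are illustrative:

```bash
# Create a key in the KMS, then declare an (empty) directory an encryption zone using it.
hadoop key create finance-key
hdfs dfs -mkdir -p /data/finance
hdfs crypto -createZone -keyName finance-key -path /data/finance

# List the zones; remember that staging/temporary directories used by queries over this
# data also need to live inside an encryption zone.
hdfs crypto -listZones
```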
+
## Auditing & Governance
+
+ Authenticated and authorized users should not just be able to perform actions
+ or read and write data: all of this should be logged in *audit logs*, so that
+ if there is ever a need to see which files a user accessed, or which individual
+ made specific requests of a service, that information is available. Audit logs
+ should be:
+
+ 1. Kept in separate log categories from normal processing logs, so that log
+ configurations can store them in separate locations, with different persistence
+ policies (see the log4j sketch below).
+ 1. Machine parseable, so that the audit logs themselves can be analyzed. This
+ does not just have to be for security reasons; Spotify have disclosed that they
+ run analysis over their HDFS audit logs to identify which files are most popular (and
+ hence should have their replication factor increased), and which do not get
+ used more than 7 days after their creation, and hence can be automatically deleted
+ as part of a workflow.
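For the first of these points, Hadoop's stock `log4j.properties` already defines a dedicated
logger for the NameNode audit trail; a sketch along those lines, with the file location and
rotation settings shown as adjustable examples:

```properties
# Route HDFS audit events to their own rolling file, separate from the NameNode log.
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
# additivity=false keeps audit events out of the ordinary process logs.
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false

log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.MaxFileSize=256MB
log4j.appender.RFAAUDIT.MaxBackupIndex=20
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```

Each audit line is then a single record of key=value fields (allowed, ugi, ip, cmd, src, dst,
perm), which is what makes the kind of analysis described above practical.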
+