The Java analysis can be customized by adding library models (summaries, sinks and sources) in data extensions files.
8
+
The Java analysis can be customized by adding library models (summaries, sinks and sources) in data extension files.
9
9
10
10
A data extension file for Java is a YAML file in the form:
11
11
@@ -31,15 +31,15 @@ TODO: Link or inline documentation on how to add dataextensions.
31
31
Are we going for extension packs as the recommended default?
32
32
If yes, then we probably need to elaborate with a concrete example.
33
33
34
-
In the sections below, we will go through the different extension points using concrete examples.
35
-
The extension points are used to customize and improve the existing dataflow queries, by providing sources, sinks and flow through for library methods.
34
+
In the sections below, we will show by example how to add tuples to the different extension points.
35
+
The extension points are used to customize and improve the existing dataflow queries, by providing sources, sinks, and flow through for library elements.
36
36
The **Reference material** section describes in more detail the *mini DSLs* that are used to compose a model definition for each extension point.
37
37
38
38
Example: Taint sink in the **java.sql** package.
39
39
------------------------------------------------
40
40
41
41
In this example, we will see how to define the argument of the **execute** method as a SQL injection sink.
42
-
This is the **execute** method in the **Statement** class, which is located in the 'java.sql' package.
42
+
This is the **execute** method in the **Statement** class, which is located in the **java.sql** package.
43
43
Please note that this sink is already added to the CodeQL Java analysis.
44
44
45
45
.. code-block:: java
@@ -75,7 +75,7 @@ For most practical purposes the sixth value is not relevant.
75
75
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the sink.
76
76
77
77
- The seventh value **Argument[0]** is the **access path** to the first argument passed to the method, which means that this is the location of the sink.
78
-
- The eighth value **sql** is the kind of the sink. The sink kind is used to define for which queries the sink is in scope. In this case SQL injection queries.
78
+
- The eighth value **sql** is the kind of the sink. The sink kind is used to define the queries where the sink is in scope. In this case, these are the SQL injection queries.
79
79
- The ninth value **manual** is the provenance of the sink, which is used to identify the origin of the sink.
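Put together, the nine values described above might form a row like the one sketched below. This is an illustrative sketch, not necessarily the exact extension shipped with CodeQL: the extension point name **sinkModel** (chosen by analogy with **summaryModel**), the pack name **codeql/java-all**, the subtypes flag **True**, and the signature **(String)** are assumptions that are not stated in this section.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all   # assumed target pack
         extensible: sinkModel   # assumed extension point name
       data:
         # Columns: package, type, subtypes, name, signature, ext,
         #          access path, kind, provenance.
         # The subtypes flag and the signature are assumed values.
         - ["java.sql", "Statement", True, "execute", "(String)", "", "Argument[0]", "sql", "manual"]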
80
80
81
81
Example: Taint source from the **java.net** package.
@@ -116,12 +116,12 @@ The first five values are used to identify the method (callable) which we are de
116
116
For most practical purposes the sixth value is not relevant.
117
117
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
118
118
119
-
- The seventh value **ReturnValue** is the access path to the return of the method, which means that it is the return value that should be considered a tainted source.
120
-
- The eighth value **remote** is the kind of the source. The source kind is used to define for which queries the source is in scope. **remote** applies to many of security related queries as it means a remote source of untrusted data. As an example the SQL injection query uses **remote** sources.
119
+
- The seventh value **ReturnValue** is the access path to the return of the method, which means that it is the return value that should be considered a source of tainted input.
120
+
- The eighth value **remote** is the kind of the source. The source kind is used to define the queries where the source is in scope. **remote** applies to many of the security-related queries, as it means a remote source of untrusted data. As an example, the SQL injection query uses **remote** sources.
121
121
- The ninth value **manual** is the provenance of the source, which is used to identify the origin of the source.
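As a rough sketch, a source row built from the values described above could look as follows. Only the package **java.net**, the access path **ReturnValue**, the kind **remote**, and the provenance **manual** are taken from this section. The class **Socket**, the method **getInputStream**, the subtypes flag, the signature **()**, the extension point name **sourceModel**, and the pack name **codeql/java-all** are assumptions picked for illustration.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all     # assumed target pack
         extensible: sourceModel   # assumed extension point name
       data:
         # Columns: package, type, subtypes, name, signature, ext,
         #          access path, kind, provenance.
         # Socket, getInputStream, False, and "()" are illustrative assumptions.
         - ["java.net", "Socket", False, "getInputStream", "()", "", "ReturnValue", "remote", "manual"]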
122
122
123
-
Example: Adding flow through the **concat** method.
In this example, we will see how to define flow through a method for a simple case.
126
126
This pattern covers many of the cases where we need to define flow through a method.
127
127
Please note that the flow through the **concat** method is already added to the CodeQL Java analysis.
@@ -150,26 +150,28 @@ Reasoning:
150
150
151
151
Since we are adding flow through a method, we need to add tuples to the **summaryModel** extension point.
152
152
Each tuple defines flow from one argument to the return value.
153
-
The first five values are used to identify the method (callable) which we are defining a source on.
154
-
These are the same for both of the rows above.
153
+
The first row defines flow from the qualifier (**s1** in the example) to the return value (**t** in the example), and the second row defines flow from the first argument (**s2** in the example) to the same return value.
154
+
155
+
The first five values are used to identify the method (callable) which we are defining a summary for.
156
+
These are the same for both of the rows above as we are adding two summaries for the same method.
155
157
156
158
- The first value **java.lang** is the package name.
157
159
- The second value **String** is the class (type) name.
158
-
- The third value **False** is flag indicating, whether the source also applies to all overrides of the method.
160
+
- The third value **False** is a flag indicating whether the summary also applies to all overrides of the method.
159
161
- The fourth value **concat** is the method name.
160
162
- The fifth value **(String)** is the method input type signature.
161
163
162
164
For most practical purposes the sixth value is not relevant.
163
-
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
165
+
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the summary.
164
166
165
-
- The seventh value is the access path to the input where data flows from. **Argument[-1]** is the access path to the qualifier (**s1** in the example) and **Argument[0]** is the access path to the first argument (**s2** in the example).
166
-
- The eighth value **ReturnValue** is the access path to the output where data flows too, in this case **ReturnValue**, which means that the input flows to the return value.
167
+
- The seventh value is the access path to the input (where data flows from). **Argument[-1]** is the access path to the qualifier (**s1** in the example) and **Argument[0]** is the access path to the first argument (**s2** in the example).
168
+
- The eighth value **ReturnValue** is the access path to the output (where data flows to), which means that the input flows to the return value.
167
169
- The ninth value **taint** is the kind of the flow. **taint** means that taint is propagated through the flow.
168
-
- The tenth value **manual** is the provenance of the source, which is used to identify the origin of the summary.
170
+
- The tenth value **manual** is the provenance of the summary, which is used to identify the origin of the summary.
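Assembled from the ten values listed above, the two rows for **concat** could be written as in the sketch below. The tuple values come from the description in this section. The surrounding **extensions** / **addsTo** wrapper and the pack name **codeql/java-all** are assumptions about the file layout, and the empty string stands in for the sixth value, which is not relevant here.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all      # assumed target pack
         extensible: summaryModel
       data:
         # Columns: package, type, subtypes, name, signature, ext,
         #          input, output, kind, provenance.
         # Row 1: taint flows from the qualifier (s1) to the return value (t).
         - ["java.lang", "String", False, "concat", "(String)", "", "Argument[-1]", "ReturnValue", "taint", "manual"]
         # Row 2: taint flows from the first argument (s2) to the return value (t).
         - ["java.lang", "String", False, "concat", "(String)", "", "Argument[0]", "ReturnValue", "taint", "manual"]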
169
171
170
172
Example: Add flow through the **map** method.
171
173
---------------------------------------------
172
-
In this example will will see a more complex example of modelling flow through a method.
174
+
In this example, we will see a more complex example of modelling flow through a method.
173
175
This pattern shows how to model flow through higher-order methods and collection types.
174
176
Please note that the flow through the **map** method is already added to the CodeQL Java analysis.
175
177
@@ -195,21 +197,21 @@ This can be achieved by adding the following data extension.
195
197
Reasoning:
196
198
197
199
Since we are adding flow through a method, we need to add tuples to the **summaryModel** extension point.
198
-
Each tuple defines part of the flow that comprises the total flow through the method.
199
-
The first five values are used to identify the method (callable) which we are defining a source on.
200
-
These are the same for both of the rows above.
200
+
Each tuple defines part of the flow that comprises the total flow through the **map** method.
201
+
The first five values are used to identify the method (callable) which we are defining a summary for.
202
+
These are the same for both of the rows above as we are adding two summaries for the same method.
201
203
202
204
- The first value **java.util.stream** is the package name.
203
205
- The second value **Stream** is the class (type) name.
204
-
- The third value **True** is flag indicating, whether the source also applies to all overrides of the method.
206
+
- The third value **True** is a flag indicating whether the summary also applies to all overrides of the method.
205
207
- The fourth value **map** is the method name.
206
208
- The fifth value **(Function)** is the method input type signature.
207
209
208
210
For most practical purposes the sixth value is not relevant.
209
-
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
211
+
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the summary definition.
210
212
211
-
- The seventh value is the access path to the **input** where data flows from.
212
-
- The eighth value **ReturnValue** is the access path to the **output** where data flows too.
213
+
- The seventh value is the access path to the **input** (where data flows from).
214
+
- The eighth value is the access path to the **output** (where data flows to).
213
215
214
216
For the first row the
215
217
@@ -223,13 +225,13 @@ For the second row the
223
225
224
226
The remaining values for both rows
225
227
226
-
- The ninth value **value** is the kind of the flow. **value** means that the value is propagated.
227
-
- The tenth value **manual** is the provenance of the source, which is used to identify the origin of the summary.
228
+
- The ninth value **value** is the kind of the flow. **value** means that the value is preserved.
229
+
- The tenth value **manual** is the provenance of the summary, which is used to identify the origin of the summary.
228
230
229
-
That is, the first row models that there is value flow from the elements of qualifier stream into the first argument of the Function provided to **map** and the second row models that there is value flow from the return value of the Function to the elements of the stream returned from **map**.
231
+
That is, the first row models that there is value flow from the elements of the qualifier stream into the first argument of the Function provided to **map**, and the second row models that there is value flow from the return value of the Function to the elements of the stream returned from **map**.
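The two rows described above might be sketched as follows. The access path strings are not spelled out in this section, so **Argument[-1].Element**, **Argument[0].Parameter[0]**, **Argument[0].ReturnValue**, and **ReturnValue.Element** are assumptions about how the prose translates into the access path mini DSL. The parenthesized signature **(Function)**, the **addsTo** wrapper, and the pack name **codeql/java-all** are assumptions as well.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all      # assumed target pack
         extensible: summaryModel
       data:
         # Columns: package, type, subtypes, name, signature, ext,
         #          input, output, kind, provenance.
         # Row 1: elements of the qualifier stream flow into the first
         #        parameter of the Function passed to map (assumed access paths).
         - ["java.util.stream", "Stream", True, "map", "(Function)", "", "Argument[-1].Element", "Argument[0].Parameter[0]", "value", "manual"]
         # Row 2: the return value of the Function flows into the elements
         #        of the stream returned from map (assumed access paths).
         - ["java.util.stream", "Stream", True, "map", "(Function)", "", "Argument[0].ReturnValue", "ReturnValue.Element", "value", "manual"]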
230
232
231
-
Example: Adding **neutral** methods.
232
-
------------------------------------
233
+
Example: Add a **neutral** method.
234
+
----------------------------------
233
235
In this example, we will see how to define the **now** method as being neutral.
234
236
This is purely for consistency and has no impact on the analysis.
235
237
A neutral model is used to define that there is no flow through a method.
@@ -266,7 +268,7 @@ Reference material
266
268
------------------
267
269
268
270
The following sections provide reference material for extension points.
269
-
This includes descriptions of each of the arguments (eg. access paths, types, and kinds).
271
+
This includes descriptions of each of the arguments (e.g., access paths, kinds, and provenance).
270
272
271
273
Extension points
272
274
----------------
@@ -275,26 +277,25 @@ Below is a description of the columns for each extension point.
275
277
Sources, Sinks, Summaries and Neutrals are commonly known as Models.
276
278
The semantics of many of the columns of the extension points are shared.
277
279
278
-
279
280
The shared columns are:
280
281
281
282
- **package**: Name of the package.
282
283
- **type**: Name of the type.
283
-
- **subtypes**: A flag indicating whether the model should also apply to all overrides of the selected method(s).
284
-
- **name**: Name of the method (optional). If left blank, it means all methods matching the previous selction criteria.
285
-
- **signature**: Type signature of the method where the source resides (optional). If this is left blank it means all methods matching the previous selction criteria.
- **subtypes**: A flag indicating whether the model should also apply to all overrides of the selected element(s).
285
+
- **name**: Name of the element (optional). If this is left blank, it means all elements matching the previous selection criteria.
286
+
- **signature**: Type signature of the selected element (optional). If this is left blank, it means all elements matching the previous selection criteria.
287
+
- **ext**: Specifies additional API-graph-like edges. It is mostly left empty and is out of scope for this document.
287
288
- **provenance**: Provenance (origin) of the model definition.
288
289
289
-
The columns **package**, **type**, **subtypes**, **name**, and **signature** are used to select the method(s) that the model applies to.
290
+
The columns **package**, **type**, **subtypes**, **name**, and **signature** are used to select the element(s) that the model applies to.
290
291
291
292
The **Access paths** section describes in more detail how access paths are composed.
292
293
This is the most complicated part of the extension points, and the **mini DSL** for access paths is shared across the extension points.
Taint source. Most taint tracking queries will use the sources added to this extensions point.
298
+
Taint source. Most taint tracking queries will use all sources added to this extension point, regardless of their kind.
298
299
299
300
- **output**: Access path to the source, where the possibly tainted data flows from.
300
301
- **kind**: Kind of the source.
@@ -303,7 +304,7 @@ Taint source. Most taint tracking queries will use the sources added to this ext
303
304
As most sources are used by all taint tracking queries, there are only a few different source kinds.
304
305
The following source kinds are supported:
305
306
306
-
- **remote**: A remote source is tainted data. This is the most common kind of source and sources of this kind is used for almost all taint tracking queries.
307
+
- **remote**: A remote source of possibly tainted data. This is the most common kind for a source. Sources of this kind are used for almost all taint tracking queries.