Commit 68a7fc9

Java: Minor improvements on wording.
1 parent 396e24c commit 68a7fc9

1 file changed

+47
-46
lines changed

docs/codeql/codeql-language-guides/customizing-library-models-for-java.rst

Lines changed: 47 additions & 46 deletions
@@ -5,7 +5,7 @@ Customizing Library Models for Java
55

66
.. include:: ../reusables/beta-note-customizing-library-models.rst
77

8-
The Java analysis can be customized by adding library models (summaries, sinks and sources) in data extensions files.
8+
The Java analysis can be customized by adding library models (summaries, sinks and sources) in data extension files.
99

1010
A data extension file for Java is a YAML file in the form:
1111

@@ -31,15 +31,15 @@ TODO: Link or inline documentation on how to add dataextensions.
3131
Are we going for extensions packs as the recommended default?
3232
If yes, then we probably need to elaborate with a concrete example.
3333

34-
In the sections below, we will go through the different extension points using concrete examples.
35-
The extension points are used to customize and improve the existing dataflow queries, by providing sources, sinks and flow through for library methods.
34+
In the sections below, we will show by example how to add tuples to the different extension points.
35+
The extension points are used to customize and improve the existing dataflow queries, by providing sources, sinks, and flow through for library elements.
3636
The **Reference material** section describes in more detail the *mini DSLs* that are used to compose a model definition for each extension point.
3737

3838
Example: Taint sink in the **java.sql** package.
3939
------------------------------------------------
4040

4141
In this example we will see, how to define the argument of the **execute** method as a SQL injection sink.
42-
This is the **execute** method in the **Statement** class, which is located in the 'java.sql' package.
42+
This is the **execute** method in the **Statement** class, which is located in the **java.sql** package.
4343
Please note that this sink is already added to the CodeQL Java analysis.
4444

4545
.. code-block:: java
@@ -75,7 +75,7 @@ For most practical purposes the sixth value is not relevant.
7575
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the sink.
7676

7777
- The seventh value **Argument[0]** is the **access path** to the first argument passed to the method, which means that this is the location of the sink.
78-
- The eighth value **sql** is the kind of the sink. The sink kind is used to define for which queries the sink is in scope. In this case SQL injection queries.
78+
- The eighth value **sql** is the kind of the sink. The sink kind is used to define the queries where the sink is in scope, in this case the SQL injection queries.
7979
- The ninth value **manual** is the provenance of the sink, which is used to identify the origin of the sink.
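
As a rough sketch, the nine values discussed above could be assembled into a single **sinkModel** tuple as shown below. The package, class, method, access path, kind, and provenance come from the text. The subtypes flag (**True**) and the signature (**(String)**) are assumptions, because the lines that define them fall outside this hunk, and the target pack name **codeql/java-all** is assumed as well.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all    # assumed target pack
         extensible: sinkModel
       data:
         # package, type, subtypes, name, signature, ext, input, kind, provenance
         # subtypes (True) and signature "(String)" are assumptions, see above
         - ["java.sql", "Statement", True, "execute", "(String)", "", "Argument[0]", "sql", "manual"]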
8080

8181
Example: Taint source from the **java.net** package.
@@ -116,12 +116,12 @@ The first five values are used to identify the method (callable) which we are de
116116
For most practical purposes the sixth value is not relevant.
117117
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
118118

119-
- The seventh value **ReturnValue** is the access path to the return of the method, which means that it is the return value that should be considered a tainted source.
120-
- The eighth value **remote** is the kind of the source. The source kind is used to define for which queries the source is in scope. **remote** applies to many of security related queries as it means a remote source of untrusted data. As an example the SQL injection query uses **remote** sources.
119+
- The seventh value **ReturnValue** is the access path to the return of the method, which means that it is the return value that should be considered a source of tainted input.
120+
- The eighth value **remote** is the kind of the source. The source kind is used to define the queries where the source is in scope. **remote** applies to many security-related queries, as it means a remote source of untrusted data. For example, the SQL injection query uses **remote** sources.
121121
- The ninth value **manual** is the provenance of the source, which is used to identify the origin of the source.
122122

123-
Example: Adding flow through the **concat** method.
124-
---------------------------------------------------
123+
Example: Add flow through the **concat** method.
124+
------------------------------------------------
125125
In this example we will see, how to define flow through a method for a simple case.
126126
This pattern covers many of the cases where we need to define flow through a method.
127127
Please note that the flow through the **concat** method is already added to the CodeQL Java analysis.
@@ -150,26 +150,28 @@ Reasoning:
150150

151151
Since we are adding flow through a method, we need to add tuples to the **summaryModel** extension point.
152152
Each tuple defines flow from one argument to the return value.
153-
The first five values are used to identify the method (callable) which we are defining a source on.
154-
These are the same for both of the rows above.
153+
The first row defines flow from the qualifier (**s1** in the example) to the return value (**t** in the example) and the second row defines flow from the first argument (**s2** in the example) to the return value (**t** in the example).
154+
155+
The first five values are used to identify the method (callable) which we are defining a summary for.
156+
These are the same for both of the rows above as we are adding two summaries for the same method.
155157

156158
- The first value **java.lang** is the package name.
157159
- The second value **String** is the class (type) name.
158-
- The third value **False** is flag indicating, whether the source also applies to all overrides of the method.
160+
- The third value **False** is a flag indicating whether the summary also applies to all overrides of the method.
159161
- The fourth value **concat** is the method name.
160162
- The fifth value **(String)** is the method input type signature.
161163

162164
For most practical purposes the sixth value is not relevant.
163-
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
165+
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the summary.
164166

165-
- The seventh value is the access path to the input where data flows from. **Argument[-1]** is the access path to the qualifier (**s1** in the example) and **Argument[0]** is the access path to the first argument (**s2** in the example).
166-
- The eighth value **ReturnValue** is the access path to the output where data flows too, in this case **ReturnValue**, which means that the input flows to the return value.
167+
- The seventh value is the access path to the input (where data flows from). **Argument[-1]** is the access path to the qualifier (**s1** in the example) and **Argument[0]** is the access path to the first argument (**s2** in the example).
168+
- The eighth value **ReturnValue** is the access path to the output (where data flows to), in this case **ReturnValue**, which means that the input flows to the return value.
167169
- The ninth value **taint** is the kind of the flow. **taint** means that taint is propagated through the flow.
168-
- The tenth value **manual** is the provenance of the source, which is used to identify the origin of the summary.
170+
- The tenth value **manual** is the provenance of the summary, which is used to identify the origin of the summary.
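
Read together, the ten values described above suggest two **summaryModel** rows along the following lines. This is a sketch rather than the exact text of the file: the column values are taken from the bullets above, the sixth value is left empty, and the target pack name **codeql/java-all** is an assumption.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all    # assumed target pack
         extensible: summaryModel
       data:
         # package, type, subtypes, name, signature, ext, input, output, kind, provenance
         # Row 1: taint flows from the qualifier (s1) to the return value (t).
         - ["java.lang", "String", False, "concat", "(String)", "", "Argument[-1]", "ReturnValue", "taint", "manual"]
         # Row 2: taint flows from the first argument (s2) to the return value (t).
         - ["java.lang", "String", False, "concat", "(String)", "", "Argument[0]", "ReturnValue", "taint", "manual"]

The two rows differ only in the seventh value, which is why the first five values are identical.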
169171

170172
Example: Add flow through the **map** method.
171173
---------------------------------------------
172-
In this example will will see a more complex example of modelling flow through a method.
174+
In this example, we will see a more complex example of modelling flow through a method.
173175
This pattern shows how to model flow through higher order methods and collection types.
174176
Please note that the flow through the **map** method is already added to the CodeQL Java analysis.
175177

@@ -195,21 +197,21 @@ This can be achieved by adding the following data extension.
195197
Reasoning:
196198

197199
Since we are adding flow through a method, we need to add tuples to the **summaryModel** extension point.
198-
Each tuple defines part of the flow that comprises the total flow through the method.
199-
The first five values are used to identify the method (callable) which we are defining a source on.
200-
These are the same for both of the rows above.
200+
Each tuple defines part of the flow that comprises the total flow through the **map** method.
201+
The first five values are used to identify the method (callable) which we are defining a summary for.
202+
These are the same for both of the rows above as we are adding two summaries for the same method.
201203

202204
- The first value **java.util.stream** is the package name.
203205
- The second value **Stream** is the class (type) name.
204-
- The third value **True** is flag indicating, whether the source also applies to all overrides of the method.
206+
- The third value **True** is a flag indicating whether the summary also applies to all overrides of the method.
205207
- The fourth value **map** is the method name.
206208
- The fifth value **Function** is the method input type signature.
207209

208210
For most practical purposes the sixth value is not relevant.
209-
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the source.
211+
The remaining values are used to define the **access path**, the **kind**, and the **provenance** (origin) of the summary definition.
210212

211-
- The seventh value is the access path to the **input** where data flows from.
212-
- The eighth value **ReturnValue** is the access path to the **output** where data flows too.
213+
- The seventh value is the access path to the **input** (where data flows from).
214+
- The eighth value **ReturnValue** is the access path to the **output** (where data flows to).
213215

214216
For the first row the
215217

@@ -223,13 +225,13 @@ For the second row the
223225

224226
The remaining values for both rows
225227

226-
- The ninth value **value** is the kind of the flow. **value** means that the value is propagated.
227-
- The tenth value **manual** is the provenance of the source, which is used to identify the origin of the summary.
228+
- The ninth value **value** is the kind of the flow. **value** means that the value is preserved.
229+
- The tenth value **manual** is the provenance of the summary, which is used to identify the origin of the summary.
228230

229-
That is, the first row models that there is value flow from the elements of qualifier stream into the first argument of the Function provided to **map** and the second row models that there is value flow from the return value of the Function to the elements of the stream returned from **map**.
231+
That is, the first row models that there is value flow from the elements of the qualifier stream into the first argument of the Function provided to **map** and the second row models that there is value flow from the return value of the Function to the elements of the stream returned from **map**.
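
The two rows described in this paragraph might be written roughly as follows. The access paths are reconstructed from the prose and are assumptions, since the lines that spell them out fall outside this hunk. In particular, the **Element** component is not shown in this excerpt, the signature is written as **(Function)** by analogy with **(String)** in the **concat** example, and the pack name **codeql/java-all** is assumed.

.. code-block:: yaml

   extensions:
     - addsTo:
         pack: codeql/java-all    # assumed target pack
         extensible: summaryModel
       data:
         # Row 1: elements of the qualifier stream flow into the
         # first parameter of the Function passed to map.
         - ["java.util.stream", "Stream", True, "map", "(Function)", "", "Argument[-1].Element", "Argument[0].Parameter[0]", "value", "manual"]
         # Row 2: the return value of the Function flows into the
         # elements of the stream returned from map.
         - ["java.util.stream", "Stream", True, "map", "(Function)", "", "Argument[0].ReturnValue", "ReturnValue.Element", "value", "manual"]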
230232

231-
Example: Adding **neutral** methods.
232-
------------------------------------
233+
Example: Add a **neutral** method.
234+
----------------------------------
233235
In this example we will see, how to define the **now** method as being neutral.
234236
This is purely for consistency and has no impact on the analysis.
235237
A neutral model is used to define that there is no flow through a method.
@@ -266,7 +268,7 @@ Reference material
266268
------------------
267269

268270
The following sections provide reference material for extension points.
269-
This includes descriptions of each of the arguments (eg. access paths, types, and kinds).
271+
This includes descriptions of each of the arguments (e.g., access paths, kinds, and provenance).
270272

271273
Extension points
272274
----------------
@@ -275,26 +277,25 @@ Below is a description of the columns for each extension point.
275277
Sources, Sinks, Summaries and Neutrals are commonly known as Models.
276278
The semantics of many of the columns of the extension points are shared.
277279

278-
279280
The shared columns are:
280281

281282
- **package**: Name of the package.
282283
- **type**: Name of the type.
283-
- **subtypes**: A flag indicating whether the model should also apply to all overrides of the selected method(s).
284-
- **name**: Name of the method (optional). If left blank, it means all methods matching the previous selction criteria.
285-
- **signature**: Type signature of the method where the source resides (optional). If this is left blank it means all methods matching the previous selction criteria.
286-
- **ext**: Specifies additional API-graph-like edges (mostly empty).
284+
- **subtypes**: A flag indicating whether the model should also apply to all overrides of the selected element(s).
285+
- **name**: Name of the element (optional). If this is left blank, it means all elements matching the previous selection criteria.
286+
- **signature**: Type signature of the selected element (optional). If this is left blank it means all elements matching the previous selection criteria.
287+
- **ext**: Specifies additional API-graph-like edges (mostly empty). This column is out of scope for this document.
287288
- **provenance**: Provenance (origin) of the model definition.
288289

289-
The columns **package**, **type**, **subtypes**, **name**, and **signature** are used to select the method(s) that the model applies to.
290+
The columns **package**, **type**, **subtypes**, **name**, and **signature** are used to select the element(s) that the model applies to.
290291

291292
The section Access paths describes in more detail, how access paths are composed.
292293
This is the most complicated part of the extension points and the **mini DSL** for access paths is shared across the extension points.
293294

294295
sourceModel(package, type, subtypes, name, signature, ext, output, kind, provenance)
295296
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
296297

297-
Taint source. Most taint tracking queries will use the sources added to this extensions point.
298+
Taint source. Most taint tracking queries will use all sources added to this extension point regardless of their kind.
298299

299300
- **output**: Access path to the source, where the possibly tainted data flows from.
300301
- **kind**: Kind of the source.
@@ -303,7 +304,7 @@ Taint source. Most taint tracking queries will use the sources added to this ext
303304
As most sources are used by all taint tracking queries there are only a few different source kinds.
304305
The following source kinds are supported:
305306

306-
- **remote**: A remote source is tainted data. This is the most common kind of source and sources of this kind is used for almost all taint tracking queries.
307+
- **remote**: A remote source of possibly tainted data. This is the most common kind for a source. Sources of this kind are used for almost all taint tracking queries.
307308
- **contentprovider**: ?
308309
- **android-widget**: ?
309310
- **android-external-storage-dir**: ?
@@ -313,7 +314,7 @@ sinkModel(package, type, subtypes, name, signature, ext, input, kind, provenance
313314

314315
Taint sink. As opposed to source kinds, there are many different kinds of sinks as these tend to be more query specific.
315316

316-
- **input**: Access path to the sink, where we want to check if possibly tainted data flows too.
317+
- **input**: Access path to the sink, that is, the location where we want to check whether tainted data can flow to it.
317318
- **kind**: Kind of the sink.
318319

319320
The following sink kinds are supported:
@@ -348,10 +349,10 @@ The following sink kinds are supported:
348349
summaryModel(package, type, subtypes, name, signature, ext, input, output, kind, provenance)
349350
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
350351

351-
Flow through. This extension point is used to model flow through methods.
352+
Flow through. This extension point is used to model flow through elements.
352353

353-
- **input**: Access path to the input of the method where data will flow to the output.
354-
- **output**: Access path to the output of the method where data will flow from the input.
354+
- **input**: Access path to the input of the element (where data will flow to the output).
355+
- **output**: Access path to the output of the element (where data will flow from the input).
355356
- **kind**: Kind of the flow through.
356357
- **provenance**: Provenance (origin) of the flow through.
357358

@@ -370,7 +371,7 @@ The **input**, and **output** columns consist of a **.**-separated list of compo
370371
The following components are supported:
371372

372373
- **Argument[**\ `n`\ **]** selects the argument at index `n` (zero-indexed).
373-
- **Argument[**\ `-1`\ **]** selects the qualifier of the call.
374+
- **Argument[**\ `-1`\ **]** selects the qualifier.
374375
- **Argument[**\ `n1..n2`\ **]** selects the arguments in the given range (both ends included).
375376
- **Parameter[**\ `n`\ **]** selects the parameter at index `n` (zero-indexed).
376377
- **Parameter[**\ `n1..n2`\ **]** selects the parameters in the given range (both ends included).
@@ -396,7 +397,7 @@ The following values are supported:
396397
The provenance is used to distinguish between models that are manually added to the extension point and models that are automatically generated.
397398
Furthermore, it impacts the dataflow analysis in the following way
398399

399-
- A **manual** model takes precedence over **generated** models. If a **manual** model exist for a method then all generated models are ignored.
400-
- A **generated** or **ai-generated** model is ignored during analysis, if the source code of the method they are modelling is available.
400+
- A **manual** model takes precedence over **generated** models. If a **manual** model exists for an element then all generated models are ignored.
401+
- A **generated** or **ai-generated** model is ignored during analysis, if the source code of the element it is modelling is available.
401402

402-
That is, generated models are less trusted than manual models.
403+
That is, generated models are less trusted than manual models and are only used if neither source code nor a manual model is available.
