Skip to content

Commit cee527e

Browse files
committed
Document flow through arrays in dataflow.md
1 parent 415a1c2 commit cee527e

File tree

1 file changed

+127
-15
lines changed

1 file changed

+127
-15
lines changed

docs/ql-libraries/dataflow/dataflow.md

Lines changed: 127 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -125,7 +125,7 @@ For further details about `PostUpdateNode` see [Field flow](#field-flow) below.
125125

126126
Nodes corresponding to expressions and parameters are the most common for users
127127
to interact with so a couple of convenience predicates are generally included:
128-
```
128+
```ql
129129
DataFlowExpr Node::asExpr()
130130
Parameter Node::asParameter()
131131
ExprNode exprNode(DataFlowExpr n)
@@ -266,10 +266,23 @@ as described above.
266266
## Field flow
267267

268268
The library supports tracking flow through field stores and reads. In order to
269-
support this, a class `Content` and two predicates
270-
`storeStep(Node node1, Content f, Node node2)` and
271-
`readStep(Node node1, Content f, Node node2)` must be defined. It generally
272-
makes sense for stores to target `PostUpdateNode`s, but this is not a strict
269+
support this, two classes `ContentSet` and `Content`, and two predicates
270+
`storeStep(Node node1, ContentSet f, Node node2)` and
271+
`readStep(Node node1, ContentSet f, Node node2)`, must be defined. The interaction
272+
between `ContentSet` and `Content` is defined through
273+
274+
```ql
275+
Content ContentSet::getAStoreContent();
276+
Content ContentSet::getAReadContent();
277+
```
278+
279+
which means that a `storeStep(n1, cs, n2)` will be interpreted as storing into _any_
280+
of `cs.getAStoreContent()`, and dually that a `readStep(n1, cs, n2)` will be
281+
interpreted as reading from _any_ of `cs.getAReadContent()`. In most cases, there
282+
will be a simple bijection between `ContentSet` and `Content`, but when modelling
283+
for example flow through arrays it can be more involved (see [Example 4](#example-4)).
284+
285+
It generally makes sense for stores to target `PostUpdateNode`s, but this is not a strict
273286
requirement. Besides this, certain nodes must have associated
274287
`PostUpdateNode`s. The node associated with a `PostUpdateNode` should be
275288
defined by `PostUpdateNode::getPreUpdateNode()`.
@@ -294,7 +307,7 @@ through a couple of examples.
294307
### Example 1
295308

296309
Consider the following setter and its call:
297-
```
310+
```java
298311
setFoo(obj, x) {
299312
sink1(obj.foo);
300313
obj.foo = x;
@@ -335,12 +348,12 @@ call sites.
335348
In the following two lines we would like flow from `x` to reach the
336349
`PostUpdateNode` of `a` through a sequence of two store steps, and this is
337350
indeed handled automatically by the shared library.
338-
```
351+
```java
339352
a.b.c = x;
340353
a.getB().c = x;
341354
```
342355
The only requirement for this to work is the existence of `PostUpdateNode`s.
343-
For a specified read step (in `readStep(Node n1, Content f, Node n2)`) the
356+
For a specified read step (in `readStep(Node n1, ContentSet c, Node n2)`) the
344357
shared library will generate a store step in the reverse direction between the
345358
corresponding `PostUpdateNode`s. A similar store-through-reverse-read will be
346359
generated for calls that can be summarized by the shared library as getters.
@@ -370,33 +383,132 @@ itself as this represents the value of the object after construction, that is
370383
after the constructor has run. With this setup of `ArgumentNode`s and
371384
`PostUpdateNode`s we will achieve the desired flow from `source` to `sink`
372385

386+
### Example 4
387+
388+
Assume we want to track flow through arrays precisely:
389+
390+
```rb
391+
a[0] = tainted
392+
a[1] = not_tainted
393+
sink(a[0]) # bad
394+
sink(a[1]) # good
395+
sink(a[unknown]) # bad; unknown may be 0
396+
397+
b[unknown] = tainted
398+
sink(b[0]) # bad; unknown may be 0
399+
400+
c[unknown][0] = tainted
401+
c[unknown][1] = not_tainted
402+
sink(c[0][0]) # bad; unknown may be 0
403+
sink(c[0][1]) # good
404+
```
405+
406+
This can be done by defining
407+
408+
```ql
409+
newtype TContent =
410+
TKnownArrayElementContent(int i) { i in [0 .. 10] } or
411+
TUnknownArrayElementContent()
412+
413+
class Content extends TContent {
414+
...
415+
}
416+
417+
newtype TContentSet =
418+
TSingletonContent(Content c) or
419+
TAnyArrayElementContent()
420+
421+
class ContentSet extends TContentSet {
422+
Content getAStoreContent() {
423+
this = TSingletonContent(result)
424+
or
425+
// for reverse stores
426+
this = TAnyArrayElementContent() and
427+
result = TUnknownArrayElementContent()
428+
}
429+
430+
Content getAReadContent() {
431+
this = TSingletonContent(result)
432+
or
433+
this = TAnyArrayElementContent() and
434+
(result = TUnknownArrayElementContent() or result = TKnownArrayElementContent(_))
435+
}
436+
}
437+
```
438+
439+
and we will have the following store/read steps
440+
```rb
441+
# storeStep(tainted, TSingletonContent(TKnownArrayElementContent(0)), [post update] a)
442+
a[0] = tainted
443+
444+
# storeStep(not_tainted, TSingletonContent(TKnownArrayElementContent(1)), [post update] a)
445+
a[1] = not_tainted
446+
447+
# readStep(a, TSingletonContent(TKnownArrayElementContent(0)), a[0])
448+
# readStep(a, TSingletonContent(TUnknownArrayElementContent()), a[0])
449+
sink(a[0]) # bad
450+
451+
# readStep(a, TSingletonContent(TKnownArrayElementContent(1)), a[1])
452+
# readStep(a, TSingletonContent(TUnknownArrayElementContent()), a[1])
453+
sink(a[1]) # good
454+
455+
# readStep(a, TAnyArrayElementContent(), a[unknown])
456+
sink(a[unknown]) # bad; unknown may be 0
457+
458+
# storeStep(tainted, TSingletonContent(TUnknownArrayElementContent()), [post update] b)
459+
b[unknown] = tainted
460+
461+
# readStep(b, TSingletonContent(TKnownArrayElementContent(0)), b[0])
462+
# readStep(b, TSingletonContent(TUnknownArrayElementContent()), b[0])
463+
sink(b[0]) # bad; unknown may be 0
464+
465+
# storeStep(tainted, TSingletonContent(TKnownArrayElementContent(0)), [post update] c[unknown])
466+
# storeStep(not_tainted, TSingletonContent(TKnownArrayElementContent(1)), [post update] c[unknown])
467+
# readStep(c, TAnyArrayElementContent(), c[unknown])
468+
# storeStep([post update] c[unknown], TAnyArrayElementContent(), [post update] c) # auto-generated reverse store (see Example 2)
469+
c[unknown][0] = tainted
470+
c[unknown][1] = not_tainted
471+
472+
473+
# readStep(c[0], TSingletonContent(TKnownArrayElementContent(0)), c[0][0])
474+
# readStep(c[0], TSingletonContent(TUnknownArrayElementContent()), c[0][0])
475+
# readStep(c[0], TSingletonContent(TKnownArrayElementContent(1)), c[0][1])
476+
# readStep(c[0], TSingletonContent(TUnknownArrayElementContent()), c[0][1])
477+
# readStep(c, TSingletonContent(TKnownArrayElementContent(0)), c[0])
478+
# readStep(c, TSingletonContent(TUnknownArrayElementContent()), c[0])
479+
sink(c[0][0]) # bad; unknown may be 0
480+
sink(c[0][1]) # good
481+
```
482+
373483
### Field flow barriers
374484

375485
Consider this field flow example:
376-
```
486+
```java
377487
obj.f = source;
378488
obj.f = safeValue;
379489
sink(obj.f);
380490
```
381491
or the similar case when field flow is used to model collection content:
382-
```
492+
```java
383493
obj.add(source);
384494
obj.clear();
385495
sink(obj.get(key));
386496
```
387497
Clearing a field or content like this should act as a barrier, and this can be
388-
achieved by marking the relevant `Node, Content` pair as a clear operation in
498+
achieved by marking the relevant `Node, ContentSet` pair as a clear operation in
389499
the `clearsContent` predicate. A reasonable default implementation for fields
390500
looks like this:
391501
```ql
392-
predicate clearsContent(Node n, Content c) {
502+
predicate clearsContent(Node n, ContentSet c) {
393503
n = any(PostUpdateNode pun | storeStep(_, c, pun)).getPreUpdateNode()
394504
}
395505
```
396506
However, this relies on the local step relation using the smallest possible
397507
use-use steps. If local flow is implemented using def-use steps, then
398508
`clearsContent` might not be easy to use.
399509

510+
Note that `clearsContent(n, cs)` is interpreted using `cs.getAReadContent()`.
511+
400512
## Type pruning
401513

402514
The library supports pruning paths when a sequence of value-preserving steps
@@ -415,13 +527,13 @@ as a single entity (this improves performance). As an example, Java uses erased
415527
types for this purpose and a single equivalence class for all numeric types.
416528

417529
The type of a `Node` is given by the following predicate
418-
```
530+
```ql
419531
DataFlowType getNodeType(Node n)
420532
```
421533
and every `Node` should have a type.
422534

423535
One also needs to define the string representation of a `DataFlowType`:
424-
```
536+
```ql
425537
string ppReprType(DataFlowType t)
426538
```
427539
The `ppReprType` predicate is used for printing a type in the labels of
@@ -499,7 +611,7 @@ predicate nodeIsHidden(Node n)
499611
### Unreachable nodes
500612

501613
Consider:
502-
```
614+
```java
503615
foo(source1, false);
504616
foo(source2, true);
505617

0 commit comments

Comments
 (0)