Commit b70ca77

Merge pull request github#10899 from hmac/flow-summary-docs
Ruby: Document flow summary syntax
2 parents 9c255b6 + 0340549 commit b70ca77

8 files changed

+4004
-712
lines changed

ruby/ql/docs/flow_summaries.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
# Flow summaries

Flow summaries describe how data flows through methods whose definition is not
included in the database, such as methods in the standard library or in a gem.

Say we have the following code:

```rb
x = gets
y = x.chomp
system(y)
```

This code reads a line from STDIN, strips the trailing newline, and executes it
as a shell command. Assuming `x` is considered tainted, we want the argument `y`
to be tainted in the call to `system`.

`chomp` is a standard library method in the `String` class for which we
have no source code, so we include a flow summary for it:

```ql
private class ChompSummary extends SimpleSummarizedCallable {
  ChompSummary() { this = "chomp" }

  override predicate propagatesFlowExt(string input, string output, boolean preservesValue) {
    input = "Argument[self]" and
    output = "ReturnValue" and
    preservesValue = false
  }
}
```

The shared dataflow library will use this summary to construct a fake definition
for `chomp`. The behaviour of this definition depends on the body of
`propagatesFlowExt`. In this case, the method will propagate taint flow from the
`self` argument (i.e. the receiver) to the return value.

If `preservesValue = true` then value flow is propagated. If it is `false` then
only taint flow is propagated.

Any call to `chomp` in the database will be translated, in the dataflow graph,
to a call to this fake definition.

`input` and `output` define the "from" and "to" locations in the flow summary.
They use a custom string-based syntax which is similar to that used in the
`path` column of the Models as Data format. These strings are often referred to
as access paths.

Note: The behaviour documented below is tested in
`dataflow/flow-summaries/behaviour.ql`. Where specific quirks exist, we may
reference a particular test case in this file which demonstrates the quirk.
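As a plain-Ruby aside on the `chomp` summary above, the following sketch shows one way to read the choice of `preservesValue = false`: the return value is derived from the receiver but is not, in general, the same value. The sample string is an assumption made for illustration; in the original example it would come from `gets`.

```ruby
# Illustrative input standing in for a line read by `gets`.
line = "ls -l\n"
stripped = line.chomp

# The result is derived from the receiver but differs from it, so a summary
# that propagates taint only (`preservesValue = false`) is a natural fit.
puts stripped == "ls -l" # true
puts stripped == line    # false
```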

# Syntax

Access paths consist of zero or more components separated by dots (`.`). The
permitted components differ for input and output paths. Each component is
interpreted relative to the context established by the components that precede
it. For example,

```
Argument[0].Element[1].ReturnValue
```

refers to the return value of the element at index 1 in the array at argument 0
of the method call.
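To make the shape of an access path concrete, here is a minimal Ruby sketch that splits a path into components and specifiers. `parse_access_path` is a hypothetical helper written for this document, not part of CodeQL, and it assumes that specifiers never contain a `]` character. Note that it cannot simply split on `.`, because a specifier such as `1..` contains dots of its own.

```ruby
# Hypothetical helper, for illustration only (not CodeQL's parser).
# Returns one hash per component: its name and its list of specifiers.
def parse_access_path(path)
  # Match `Name` or `Name[spec, spec]`. Everything between the brackets is
  # captured as one string, so dots inside them are not treated as separators.
  path.scan(/([A-Za-z]+)(?:\[([^\]]*)\])?/).map do |name, specifiers|
    { name: name, specifiers: specifiers.to_s.split(",").map(&:strip) }
  end
end

parse_access_path("Argument[0].Element[1].ReturnValue").each do |component|
  puts "#{component[:name]} #{component[:specifiers].inspect}"
end
```

Each component is printed with its specifiers; `ReturnValue` has none.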

## `Argument` and `Parameter`

The `Argument` and `Parameter` components refer respectively to an argument to a
call or a parameter of a callable. They contain one or more _specifiers_[^1] which
constrain the range of arguments/parameters that the component refers to. For
example, `Argument[0]` refers to the first argument.

If multiple specifiers are given then the result is a disjunction, meaning that
the component refers to any argument/parameter that satisfies at least one of
the specifiers. For example, `Argument[0, 1]` refers to the first and second
arguments.
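The disjunction can be sketched as a small Ruby model. Everything here is hypothetical and follows only the wording of this document, not CodeQL's implementation: an argument position is represented as `:self`, `:block`, an `Integer` for a positional argument, or a `String` holding a keyword name. The model uses the specifiers described in the "Specifiers" subsection and leaves out `hash-splat`, which refers to a synthesised node rather than a single argument.

```ruby
# Hypothetical model of a single specifier, following the prose definitions.
def specifier_matches?(specifier, position)
  case specifier
  when "self"            then position == :self
  when "block"           then position == :block
  when "any"             then position != :self && position != :block
  when "any-named"       then position.is_a?(String)
  when /\A(\d+)\.\.\z/   then position.is_a?(Integer) && position >= $1.to_i
  when /\A\d+\z/         then position == specifier.to_i
  when /\A(\w+):\z/      then position == $1
  else false
  end
end

# A component with several specifiers matches if at least one of them does.
def argument_matches?(specifiers, position)
  specifiers.any? { |specifier| specifier_matches?(specifier, position) }
end

puts argument_matches?(%w[0 1], 1)  # true: `Argument[0, 1]` covers position 1
puts argument_matches?(%w[0 1], 2)  # false
puts argument_matches?(["1.."], 3)  # true: position 3 is >= 1
```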

### Specifiers

#### `self`
The receiver of the call.

#### `<integer>`
The argument to the method call at the position given by the integer. For
example, `Argument[0]` refers to the first argument to the call.

#### `<integer>..`
An argument to the call at a position greater than or equal to the integer. For
example, `Argument[1..]` refers to all arguments except the first one. This
specifier is not available on `Parameter` components.

#### `<string>:`
A keyword argument to the call with the given name. For example,
`Argument[foo:]` refers to the keyword argument `foo:` in the call.

#### `block`
The block argument passed to the call, if any.

#### `any`
Any argument to the call, except `self` or `block` arguments.

#### `any-named`
Any keyword argument to the call.

#### `hash-splat`
The special "hash splat" argument/parameter, which is written as `**args`.
When used in an `Argument` component, this specifier refers to a special
dataflow node which is constructed at the call site, containing any elements in
a hash splat argument (`**args`) along with any explicit keyword arguments
(`foo: bar`). The node behaves like a normal dataflow node for a hash, meaning
that you can access specific elements of it using the `Element` component.

For example, the following flow summary states that values flow from any keyword
arguments (including those in a hash splat) to the return value:

```ql
input = "Argument[hash-splat].Element[any]" and
output = "ReturnValue" and
preservesValue = true
```

Assuming this summary is for a global method `foo`, the following test will pass:

```rb
a = source "a"
b = source "b"

h = {a: a}

x = foo(b: b, **h)

sink x # $ hasValueFlow=a hasValueFlow=b
```

If the method returns the hash itself, you will need to use `WithElement` in
order to preserve taint/value in its elements. For example:

```ql
input = "Argument[hash-splat].WithElement[any]" and
output = "ReturnValue" and
preservesValue = true
```
```rb
a = source "a"
x = foo(a: a)
sink x[:a] # $ hasValueFlow=a
```
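The `hash-splat` node has a loose run-time analogue in Ruby itself: a method with a `**kwargs` parameter receives explicit keyword arguments and splatted entries as a single hash. The sketch below shows only that Ruby behaviour. The definition of `foo` is a hypothetical stand-in for the method being summarised, chosen to return the hash itself as in the `WithElement` variant.

```ruby
# Hypothetical definition standing in for the summarised method `foo`.
def foo(**kwargs)
  kwargs
end

h = { a: "a" }
received = foo(b: "b", **h)

# Explicit keywords and splatted entries arrive together in one hash.
p received
```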

## `ReturnValue`
`ReturnValue` refers to the return value of the element identified in the
preceding access path. For example, `Argument[0].ReturnValue` refers to the
return value of the first argument. Of course this only makes sense if the first
argument is a callable.
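In Ruby terms, `Argument[0].ReturnValue` names the value produced when the first argument is called. A minimal sketch, where `call_first` is a hypothetical method invented for this illustration:

```ruby
# Hypothetical method whose first argument is a callable.
def call_first(callable)
  callable.call
end

# `Argument[0]` is the lambda; `Argument[0].ReturnValue` is what it returns.
result = call_first(-> { "from the lambda" })
puts result
```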

## `Element`
This component refers to elements inside a collection of some sort. Typically
this is an Array or Hash. Elements are considered to have an index, which is an
integer in arrays and a symbol or string in hashes (even though hashes can have
arbitrary objects as keys). Elements can also have an unknown index, which means
we know the element exists in the collection but we don't know where.

Many of the specifiers have an optional suffix `!`. If this suffix is used then
the specifier excludes elements at unknown indices. Otherwise, these are
included by default.
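The effect of the `!` suffix can be sketched with a toy Ruby model. It is a hypothetical reading of the paragraph above, restricted to integer specifiers, and uses the symbol `:unknown` as a stand-in for "an element at an unknown index".

```ruby
# Hypothetical model: does an integer specifier such as "0" or "0!" cover an
# element stored at `index`? `index` is an Integer or :unknown.
def element_matches?(specifier, index)
  exact = specifier.end_with?("!")
  wanted = Integer(specifier.delete_suffix("!"))
  index == wanted || (!exact && index == :unknown)
end

puts element_matches?("0", 0)         # true
puts element_matches?("0", :unknown)  # true: unknown indices are included
puts element_matches?("0!", :unknown) # false: `!` excludes unknown indices
```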

### Specifiers

#### `?`
If used in an input path: an element at an unknown index. If used in an output
path: an element at any known or unknown index. In other words, `?` in an output
path means the same as `any`.

#### `any`
An element at any known or unknown index.

#### `<integer>`, `<integer>!`
An element at the index given by the integer.

#### `<integer>..`, `<integer>..!`
Any element at a known index greater than or equal to the integer.

#### `<string>`, `<string>!`
An element at the index given by the string. The string should match the result
of `serialize()` on the `ConstantValue` that represents the index. For a string
with contents `foo` this is `"foo"` and for a symbol `:foo` it is `:foo`. The
Ruby values `true`, `false` and `nil` can be written verbatim. See tests 31-33
for examples.
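As a rough analogy only, Ruby's own `inspect` happens to produce the same spellings for these simple keys. This is an observation about Ruby, not a statement about how `serialize()` is implemented, and it should not be relied on for more complex keys.

```ruby
# A string key and a symbol key with the same name are distinct hash keys.
h = { "foo" => 1, :foo => 2 }
puts h.size # 2

# For these simple values, `inspect` gives the spellings used in the text.
keys = ["foo", :foo, true, false, nil]
spellings = keys.map(&:inspect)
puts spellings
```

Under that analogy, a hash element stored under the string `"foo"` would be addressed as `Element["foo"]` and one stored under the symbol as `Element[:foo]`.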

## `Field`
A "field" in the object. In practice this refers to a value stored in an
instance variable in the object. The only valid specifier is `@<string>`, where
`<string>` is the name of the instance variable. Currently we assume that a
setter call such as `x.foo = bar` means there is a field `foo` in `x`, backed by
an instance variable `@foo`.

For example, the access path `Argument[0].Field[@foo]` would refer to the value
`"foo"` in the following code:

```rb
x = SomeClass.new
x.foo = "foo"
some_call(x)
```
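The assumption that a setter is backed by an instance variable of the same name holds for the common `attr_accessor` case, as this small sketch shows. The definition of `SomeClass` is hypothetical; the original example does not say how the class is defined.

```ruby
# Hypothetical definition of the class used in the example above.
class SomeClass
  attr_accessor :foo # defines `foo` and `foo=`, both backed by `@foo`
end

x = SomeClass.new
x.foo = "foo"

# The setter call stored the value in the instance variable `@foo`.
p x.instance_variable_get(:@foo)
```

A hand-written setter could store the value elsewhere, which is why the text describes this as an assumption.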

## `WithElement`
This component restricts the set of elements that are included in the preceding
access path to those at a specific set of indices. The specifiers are the
same as those for `Element`. It is only valid in an input path.

This component has the effect of copying all relevant elements from the input to
the output. For example, in the following summary:

```ql
input = "Argument[0].WithElement[1, 2]" and
output = "ReturnValue"
```

any data in indices 1 and 2 of the first argument will be copied to indices 1
and 2 of the return value. We use this in many Hash summaries that return the
receiver, in order to preserve any data stored in it. For example, the summary
for `Hash#to_h` is

```ql
input = "Argument[self].WithElement[any]" and
output = "ReturnValue" and
preservesValue = true
```
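The copying behaviour can be pictured with a toy Ruby model in which a collection is a hash from index to the data stored there. `with_element` is a hypothetical helper for illustration, not how the dataflow library represents content. The last lines show the Ruby behaviour that motivates the `Hash#to_h` summary: called without a block on a plain `Hash`, it returns the receiver itself.

```ruby
# Hypothetical model of `WithElement[...]`: keep only the listed indices, and
# keep each surviving element at its original index.
def with_element(elements, indices)
  elements.select { |index, _data| indices.include?(index) }
end

argument = { 0 => "a", 1 => "b", 2 => "c" }
return_value = with_element(argument, [1, 2])
p return_value

# `Hash#to_h` without a block returns the receiver, elements and all.
h = { a: 1 }
same_object = h.to_h.equal?(h)
puts same_object
```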

## `WithoutElement`
This component is used to exclude certain elements from the set included in the
preceding access path. It takes the same specifiers as `WithElement` and
`Element`. It is only valid in an input path.

This component has the effect of excluding the relevant elements when copying
from input to output. It is useful for modelling methods that remove elements
from a collection. For example, a method that removes the first element from the
receiver can be modelled like this:

```ql
input = "Argument[self].WithoutElement[0]" and
output = "Argument[self]"
```

Note that both the input and output refer to the receiver. The effect of this
summary is that use-use flow between the receiver in the method call and a
subsequent use of the same receiver will be blocked:

```ruby
a[0] = source 0
a[1] = source 1

a.remove_first # use-use flow from `a` on this line to `a` below will be blocked.
               # there will still be flow from `[post-update] a` to `a` below.

sink a[0]
sink a[1] # $ hasValueFlow=1
```

It is also important to note that in a summary such as

```ql
input = "Argument[self].WithoutElement[0]" and
output = "ReturnValue"
```

if `Argument[self]` contains data, it will be copied to `ReturnValue`. If you
only want to copy data in elements, and not in the container itself, add
`WithElement[any]` to the input path:

```ql
input = "Argument[self].WithoutElement[0].WithElement[any]" and
output = "ReturnValue"
```

See tests 53 and 54 for examples of this behaviour.
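The difference between the last two summaries can be sketched with a toy Ruby model that separates data on the container itself from data in its elements. Both helpers and the hash layout (`:own_data`, `:elements`) are hypothetical and follow only the description above.

```ruby
# Hypothetical model of `WithoutElement[...]`: drop the excluded indices but
# keep any data attached to the container itself.
def without_element(container, excluded)
  kept = container[:elements].reject { |index, _data| excluded.include?(index) }
  { own_data: container[:own_data], elements: kept }
end

# Hypothetical model of `WithElement[any]`: copy the elements only.
def with_any_element(container)
  { own_data: nil, elements: container[:elements].dup }
end

receiver = { own_data: "tainted container", elements: { 0 => "x", 1 => "y" } }

# `Argument[self].WithoutElement[0]`: container data is still copied.
first = without_element(receiver, [0])
# `Argument[self].WithoutElement[0].WithElement[any]`: elements only.
second = with_any_element(without_element(receiver, [0]))

p first
p second
```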
274+
275+
276+
277+
[^1]: I've chosen this name to avoid overloading the word "argument".

ruby/ql/lib/codeql/ruby/frameworks/core/Hash.qll

Lines changed: 0 additions & 3 deletions
@@ -474,9 +474,6 @@ private class TransformKeysBangSummary extends SimpleSummarizedCallable {
     (
       input = "Argument[self].Element[any]" and
       output = "Argument[self].Element[?]"
-      or
-      input = "Argument[self].WithoutElement[any]" and
-      output = "Argument[self]"
     ) and
     preservesValue = true
   }
