@@ -12,12 +12,83 @@ import semmle.python.dataflow.new.DataFlow
private import semmle.python.internal.CachedStages

/**
- * Provides classes and predicates for working with APIs used in a database.
+ * Provides classes and predicates for working with the API boundary between the current
+ * codebase and external libraries.
+ *
+ * See `API::Node` for more in-depth documentation.
*/
module API {
/**
- * An abstract representation of a definition or use of an API component such as a function
- * exported by a Python package, or its result.
+ * A node in the API graph, representing a value that has crossed the boundary between this
+ * codebase and an external library (or in general, any external codebase).
+ *
+ * ### Basic usage
+ *
+ * API graphs are typically used to identify "API calls", that is, calls to an external function
+ * whose implementation is not necessarily part of the current codebase.
+ *
+ * The most basic use of API graphs is typically as follows:
+ * 1. Start with `API::moduleImport` for the relevant library.
+ * 2. Follow up with a chain of accessors such as `getMember` describing how to get to the relevant API function.
+ * 3. Map the resulting API graph nodes to data-flow nodes, using `asSource` or `asSink`.
+ *
+ * For example, a simplified way to get the first positional argument of calls to `json.dumps` would be:
+ * ```ql
+ * API::moduleImport("json").getMember("dumps").getParameter(0).asSink()
+ * ```
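+ *
+ * As a minimal sketch, assuming the standard `python` and `semmle.python.ApiGraphs` imports, such an
+ * expression could be wrapped in a complete query that lists the matching data-flow nodes (the `select`
+ * message is purely illustrative):
+ * ```ql
+ * import python
+ * import semmle.python.dataflow.new.DataFlow
+ * import semmle.python.ApiGraphs
+ *
+ * from DataFlow::Node arg
+ * where arg = API::moduleImport("json").getMember("dumps").getParameter(0).asSink()
+ * select arg, "First positional argument of a call to json.dumps"
+ * ```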
+ *
+ * The most commonly used accessors are `getMember`, `getParameter`, and `getReturn`.
+ *
+ * ### API graph nodes
+ *
+ * There are two kinds of nodes in the API graphs, distinguished by who is "holding" the value:
+ * - **Use-nodes** represent values held by the current codebase, which came from an external library.
+ *   (The current codebase is "using" a value that came from the library).
+ * - **Def-nodes** represent values held by the external library, which came from this codebase.
+ *   (The current codebase "defines" the value seen by the library).
+ *
+ * API graph nodes are associated with data-flow nodes in the current codebase.
+ * (Since external libraries are not part of the database, there is no way to associate with concrete
+ * data-flow nodes from the external library).
+ * - **Use-nodes** are associated with data-flow nodes where a value enters the current codebase,
+ *   such as the return value of a call to an external function.
+ * - **Def-nodes** are associated with data-flow nodes where a value leaves the current codebase,
+ *   such as an argument passed in a call to an external function.
+ *
+ *
+ * ### Access paths and edge labels
+ *
+ * Nodes in the API graph are associated with a set of access paths, describing a series of operations
+ * that may be performed to obtain that value.
+ *
+ * For example, the access path `API::moduleImport("json").getMember("dumps")` represents the action of
+ * importing `json` and then accessing the member `dumps` on the resulting object.
+ *
+ * Each edge in the graph is labelled by such an "operation". For an edge `A->B`, the type of the `A` node
+ * determines who is performing the operation, and the type of the `B` node determines who ends up holding
+ * the result:
+ * - An edge starting from a use-node describes what the current codebase is doing to a value that
+ *   came from a library.
+ * - An edge starting from a def-node describes what the external library might do to a value that
+ *   came from the current codebase.
+ * - An edge ending in a use-node means the result ends up in the current codebase (at its associated data-flow node).
+ * - An edge ending in a def-node means the result ends up in external code (its associated data-flow node is
+ *   the place where it was "last seen" in the current codebase before flowing out).
+ *
+ * Because the implementation of the external library is not visible, it is not known exactly what operations
+ * it will perform on values that flow there. Instead, the edges starting from a def-node are operations that would
+ * lead to an observable effect within the current codebase, without knowing for certain whether the library will actually perform
+ * those operations. (When constructing these edges, we assume the library is somewhat well-behaved).
+ *
+ * For example, given this snippet:
+ * ```python
+ * import foo
+ * foo.bar(lambda x: doSomething(x))
+ * ```
+ * A callback is passed to the external function `foo.bar`. We can't know if `foo.bar` will actually invoke this callback.
+ * But _if_ the library should decide to invoke the callback, then a value will flow into the current codebase via the `x` parameter.
+ * For that reason, an edge is generated representing the argument-passing operation that might be performed by `foo.bar`.
+ * This edge is going from the def-node associated with the callback to the use-node associated with the parameter `x`.
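+ *
+ * A rough sketch of how that use-node might be reached, assuming the callback is passed as the first
+ * positional argument (`foo` and `bar` are the placeholder names from the snippet, not a real library):
+ * ```ql
+ * API::moduleImport("foo").getMember("bar").getParameter(0).getParameter(0).asSource()
+ * ```
+ * Here the first `getParameter(0)` selects the def-node for the callback, and the second follows the
+ * argument-passing edge to the use-node for `x`.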
*/
class Node extends Impl::TApiNode {
/**