
Commit 5fe6a5a

author
Stephan Brandauer
authored
Merge pull request github#14487 from github/kaeluka/extraction-query-docs
Java: basic version of automodel extraction query docs
2 parents ec58b20 + cffcc73 commit 5fe6a5a

File tree

1 file changed

+183
-0
lines changed

java/ql/automodel/src/README.md

Lines changed: 183 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,183 @@
1+
# Automodel Java Extraction Queries
2+
3+
This pack contains the automodel extraction queries for Java. Automodel uses extraction queries to extract the information it needs in order to create a prompt for a large language model. There are extraction queries for positive examples (things that are known to be, e.g., a sink), for negative examples (things that are known not to be, e.g., a sink), and for candidates (things that we should ask the large language model to classify).
4+
5+
## Extraction Queries in `java/ql/automodel/src`
6+
7+
Included in this pack are queries for both application mode and framework mode.
8+
9+
| Kind | Mode | Query File |
10+
|------|------|------------|
11+
| Candidates | Application Mode | [AutomodelApplicationModeExtractCandidates.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractCandidates.ql) |
12+
| Positive Examples | Application Mode | [AutomodelApplicationModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractPositiveExamples.ql) |
13+
| Negative Examples | Application Mode | [AutomodelApplicationModeExtractNegativeExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractNegativeExamples.ql) |
14+
| Candidates | Framework Mode | [AutomodelFrameworkModeExtractCandidates.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractCandidates.ql) |
15+
| Positive Examples | Framework Mode | [AutomodelFrameworkModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractPositiveExamples.ql) |
16+
| Negative Examples | Framework Mode | [AutomodelFrameworkModeExtractNegativeExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractNegativeExamples.ql) |
17+
18+
## Running the Queries
19+
20+
The extraction queries are part of a separate query pack, `codeql/java-automodel-queries`. Use this pack to run them. The queries are tagged appropriately, so you can use the tags (example here: https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractNegativeExamples.ql#L8) to construct query suites.
21+
22+
For example, a query suite selecting all example extraction queries (positive and negative) for application mode looks like this:
23+
24+
```
# File: automodel-application-mode-extraction-examples.qls
# ---
# Query suite for extracting examples for automodel

- description: Automodel application mode examples extraction.
- queries: .
  from: codeql/java-automodel-queries
- include:
    tags contain all:
      - automodel
      - extract
      - application-mode
      - examples
```
39+
40+
## Important Software Design Concepts and Goals
41+
42+
### Concept: `Endpoint`
43+
44+
Endpoints are source code locations of interest. All positive examples, negative examples, and candidates are endpoints, but not all endpoints are examples or candidates. Each mode decides which endpoints are relevant. For instance, if the Java application mode wants to support candidates for sinks that are arguments passed to unknown method calls, then the Java application mode implementation needs to make sure that method arguments are endpoints. If you look at the `TApplicationModeEndpoint` implementation in [AutomodelApplicationModeCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeCharacteristics.qll), you can see that this is the case: `TExplicitArgument` implements this behavior.
45+
46+
Whether an endpoint is a positive example, a negative example, or a candidate depends on the individual extraction queries.
47+
48+
### Concept: `EndpointCharacteristic`
49+
50+
In the file [AutomodelSharedCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelSharedCharacteristics.qll), you will find the definition of the QL class `EndpointCharacteristic`.
51+
52+
An endpoint characteristic is a QL class that "tags" all endpoints for which the characteristic's `appliesToEndpoint` predicate holds. The characteristic defines a `hasImplications` predicate that declares whether the tagged endpoints should be considered sinks, sources, or negatives, and with what confidence.
53+
54+
The positive, negative, and candidate extraction queries largely[^1] use characteristics to decide which endpoint to select. For instance, if a characteristic exists that applies to an endpoint, and the characteristic implies (cf. `hasImplications`) that the endpoint is a sink with a high confidence – then that endpoint will be selected as a positive example. See the use of `isKnownAs` in [AutomodelFrameworkModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractPositiveExamples.ql).
55+
56+
[^1]: Candidate extraction queries are an exception: they treat `UninterestingToModelCharacteristic` differently.
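
To make the selection rule concrete, here is a minimal Java sketch of the idea. It is only an analogy: the real logic is written in QL, and every name below (`CharacteristicSketch`, the `Characteristic` class, the `HIGH_CONFIDENCE` threshold, endpoints represented as strings) is a hypothetical stand-in rather than part of this pack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical Java analogy of the QL concepts; none of these types exist in the pack.
final class CharacteristicSketch {
    enum EndpointType { SINK, SOURCE, NEGATIVE }

    // Stand-in for an endpoint characteristic: which endpoints it tags,
    // and what it implies about them.
    static final class Characteristic {
        final Predicate<String> appliesToEndpoint;
        final EndpointType impliedType;
        final double confidence;

        Characteristic(Predicate<String> appliesToEndpoint, EndpointType impliedType, double confidence) {
            this.appliesToEndpoint = appliesToEndpoint;
            this.impliedType = impliedType;
            this.confidence = confidence;
        }
    }

    // Assumed threshold, chosen only for this illustration.
    static final double HIGH_CONFIDENCE = 0.9;

    // An endpoint is a positive sink example if at least one characteristic
    // applies to it and implies "sink" with high confidence.
    static List<String> positiveSinkExamples(List<String> endpoints, List<Characteristic> characteristics) {
        List<String> selected = new ArrayList<>();
        for (String endpoint : endpoints) {
            for (Characteristic c : characteristics) {
                if (c.appliesToEndpoint.test(endpoint)
                        && c.impliedType == EndpointType.SINK
                        && c.confidence >= HIGH_CONFIDENCE) {
                    selected.add(endpoint);
                    break; // one matching characteristic is enough
                }
            }
        }
        return selected;
    }
}
```

In this sketch, an endpoint that no high-confidence sink characteristic applies to is simply not selected; the actual queries also take negative and "uninteresting" characteristics into account.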
57+
58+
#### :warning: Warning
59+
60+
Do not try to "fix" shortcomings that could be fixed by a better prompt or better example selection by adding language- or mode-specific characteristics. Those "fixes" tend to be confusing downstream, when questions like "why wasn't this location selected as a candidate?" become progressively harder to answer. It's best to rely on characteristics in the code that is shared across all languages and modes (see [Shared Code](#shared-code)).
61+
62+
## Shared Code
63+
64+
A significant part of the behavior of extraction queries is implemented in shared modules. When we add support for new languages, we expect to move the shared code to a separate QL pack. In the meantime, shared code modules must not import any Java libraries.
65+
66+
## Packaging
67+
68+
Automodel extraction queries come as a dedicated package. See [qlpack.yml](https://github.com/github/codeql/blob/main/java/ql/automodel/src/qlpack.yml). The [publish.sh](https://github.com/github/codeql/blob/main/java/ql/automodel/publish.sh) script is responsible for publishing a new version to the [package registry](https://github.com/orgs/codeql/packages/container/package/java-automodel-queries).
69+
70+
### Backwards Compatibility
71+
72+
We try to keep changes to extraction queries backwards-compatible whenever feasible. There are several reasons:
73+
74+
- It is a flawed assumption that automodel can always decide which version of the package to run: we don't have direct control over the version of the extraction queries running on the user's local machine.
75+
- An automodel deployment will sometimes require the extraction queries to be published. If the new version of the extraction queries works with the old version of automodel, then it is much easier to roll back deployments of automodel.
76+
77+
## Candidate Examples
78+
79+
This section contains a few examples of the kinds of candidates that our queries might select, and why.
80+
81+
:warning: For clarity, this section presents "candidates" that are **actual** sinks. In practice, the candidates presented here would therefore be selected as positive examples rather than as candidates.
82+
83+
### Framework Mode Candidates
84+
85+
Framework mode is special because in framework mode, we extract candidates (as well as examples) from the implementation of a framework or library while the resulting models are applied in code bases that are _using_ the framework or library.
86+
87+
In framework mode, endpoints currently can have a number of shapes (see: `newtype TFrameworkModeEndpoint` in [AutomodelFrameworkModeCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeCharacteristics.qll)). Depending on what kind of endpoint it is, the candidate is a candidate for one or several extensible types (e.g., `sinkModel`, `sourceModel`).
88+
89+
#### Framework Mode Sink Candidates
90+
91+
Sink candidates in framework mode are modeled as formal parameters of functions defined within the framework. We use these to represent the corresponding inputs of function calls in a client codebase, which would be passed into those parameters.
92+
93+
For example, customer code could call the `Files.copy` method:
94+
95+
```java
// customer code using a library
// ...
Files.copy(userInputPath, outStream);
// ...
```
101+
102+
In order for `userInputPath` to be modeled as a sink, the corresponding parameter must be selected as a candidate. In the following example, assuming they're not modeled yet, the parameters `source` and `out` would be candidates:
103+
104+
```java
// Files.java
// library code that's analyzed in framework mode
public class Files {
    public static void copy(Path source, OutputStream out) throws IOException {
        // ...
    }
}
```
113+
114+
#### Framework Mode Source Candidates
115+
116+
Source candidates are a bit more varied than sink candidates:
117+
118+
##### Parameters as Source Candidates
119+
120+
A parameter could be a source, e.g. when a framework passes user-controlled data to a handler defined in customer code.
121+
```java
// customer code using a library:
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

// WebSocket.Listener is an interface, so it is implemented rather than extended
final class MyListener implements WebSocket.Listener {
    @Override
    public CompletionStage<?> onText(WebSocket ws, CharSequence cs, boolean last) {
        // ... process data that was received from websocket
        return null;
    }
}
```
132+
133+
In this case, data passed to the program via a web socket connection is a source of remote data. Therefore, when we look at the implementation of `WebSocket.Listener` in framework mode, we need to produce a candidate for each parameter:
134+
135+
```java
// WebSocket.java
// library code that's analyzed in framework mode
interface Listener {
    // ...
    default CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        // <omitting default implementation>
    }
    // ...
}
```
146+
147+
For framework mode, all parameters of the `onText` method should be candidates. If the candidates result in a model, the corresponding parameters in classes implementing this interface will be recognized as sources of remote data.
148+
149+
:warning: A consequence of this is that we can have endpoints in framework mode that are both sink candidates and source candidates.
150+
151+
##### Return Values as Source Candidates
152+
153+
The other kind of source candidate we model is the return value of a method. For example:
154+
155+
```java
public class Socket {
    // ...
    public InputStream getInputStream() throws IOException {
        // ...
    }
    // ...
}
```
164+
165+
This method returns a source of remote data that should be modeled as a source. We therefore want to select the _method_ as a candidate.
166+
167+
### Application Mode Candidates
168+
169+
In application mode, we extract candidates from an application that is using various libraries.
170+
171+
#### Application Mode Source Candidates
172+
173+
##### Overridden Parameters as Source Candidates
174+
175+
In application mode, a parameter of a method that overrides another method is taken as a source candidate. This accounts for cases like the `WebSocket.Listener` example above, where an application implements a "handler" that receives remote data.
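
As a rough illustration, the sketch below shows the shape of application code this refers to. `MessageHandler` is a hypothetical library interface invented for this example (standing in for something like `WebSocket.Listener`), and `LoggingHandler` is hypothetical application code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical library interface; assume it lives in a dependency.
interface MessageHandler {
    void onMessage(String data);
}

// Application code that is analyzed in application mode.
final class LoggingHandler implements MessageHandler {
    final List<String> received = new ArrayList<>();

    @Override
    public void onMessage(String data) {
        // `data` is a parameter of an overriding method, so it would be
        // a source candidate: the library may pass remote data into it.
        received.add(data);
    }
}
```

Whether `data` actually carries remote data depends on the library; the candidate only asks the model to classify it.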
176+
177+
##### Return Values as Source Candidates
178+
179+
Just like in framework mode, application mode also has to consider the return value of a call as a source candidate. The difference is that in application mode, we extract from the application sources, not the library sources. Therefore, we use the invocation expression as a candidate (unlike in framework mode, where we use the method definition).
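
A minimal sketch of what this looks like in application code follows. `RemoteClient` and `ReportApp` are hypothetical names made up for this example; the point is only where the candidate sits.

```java
// Hypothetical library class; assume it lives in a dependency.
final class RemoteClient {
    String fetch() {
        return "payload from the network";
    }
}

// Application code that is analyzed in application mode.
final class ReportApp {
    static String handle(RemoteClient client) {
        // The invocation expression `client.fetch()` would be the source
        // candidate here, not the definition of `fetch` in the library.
        String body = client.fetch();
        return body.trim();
    }
}
```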
180+
181+
#### Application Mode Sink Candidates
182+
183+
In application mode, arguments to calls are sink candidates.
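
For instance, reusing the `Files.copy` call from the framework mode section, a small application could look like the sketch below. `CopyExample` and `readAll` are hypothetical names; `Files.copy(Path, OutputStream)` is the standard `java.nio.file.Files` method.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Application code that is analyzed in application mode.
final class CopyExample {
    static byte[] readAll(Path userInputPath) throws IOException {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        // Both arguments of this call are endpoints. If the callee's
        // parameters were not modeled yet, `userInputPath` and `outStream`
        // would be sink candidates.
        Files.copy(userInputPath, outStream);
        return outStream.toByteArray();
    }
}
```

As noted at the top of this section, `Files.copy` is already a known sink, so in practice its path argument would be selected as a positive example rather than as a candidate.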
