| 1 | +# Automodel Java Extraction Queries |
| 2 | + |
| 3 | +This pack contains the automodel extraction queries for Java. Automodel uses extraction queries to extract the information it needs in order to create a prompt for a large language model. There are extraction queries for positive examples (things that are known to be, e.g., a sink), for negative examples (things that are known not to be, e.g., a sink), and for candidates (things that we should ask the large language model to classify). |
| 4 | + |
| 5 | +## Extraction Queries in `java/ql/automodel/src` |
| 6 | + |
| 7 | +Included in this pack are queries for both application mode and framework mode. |
| 8 | + |
| 9 | +| Kind | Mode | Query File | |
| 10 | +|------|------|------------| |
| 11 | +| Candidates | Application Mode | [AutomodelApplicationModeExtractCandidates.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractCandidates.ql) | |
| 12 | +| Positive Examples | Application Mode | [AutomodelApplicationModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractPositiveExamples.ql) | |
| 13 | +| Negative Examples | Application Mode | [AutomodelApplicationModeExtractNegativeExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractNegativeExamples.ql) | |
| 14 | +| Candidates | Framework Mode | [AutomodelFrameworkModeExtractCandidates.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractCandidates.ql) | |
| 15 | +| Positive Examples | Framework Mode | [AutomodelFrameworkModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractPositiveExamples.ql) | |
| 16 | +| Negative Examples | Framework Mode | [AutomodelFrameworkModeExtractNegativeExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractNegativeExamples.ql) | |
| 17 | + |
| 18 | +## Running the Queries |
| 19 | + |
| 20 | +The extraction queries are part of a separate query pack, `codeql/java-automodel-queries`. Use this pack to run them. The queries are tagged appropriately, so you can use the tags (see [this example](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeExtractNegativeExamples.ql#L8)) to construct query suites. |
| 21 | + |
| 22 | +For example, a query suite selecting all example extraction queries (positive and negative) for application mode looks like this: |
| 23 | + |
| 24 | +``` |
| 25 | +# File: automodel-application-mode-extraction-examples.qls |
| 26 | +# --- |
| 27 | +# Query suite for extracting examples for automodel |
| 28 | + |
| 29 | +- description: Automodel application mode examples extraction. |
| 30 | +- queries: . |
| 31 | +  from: codeql/java-automodel-queries |
| 32 | +- include: |
| 33 | +    tags contain all: |
| 34 | +      - automodel |
| 35 | +      - extract |
| 36 | +      - application-mode |
| 37 | +      - examples |
| 38 | +``` |
| 39 | + |
| 40 | +## Important Software Design Concepts and Goals |
| 41 | + |
| 42 | +### Concept: `Endpoint` |
| 43 | + |
| 44 | +Endpoints are source code locations of interest. All positive examples, negative examples, and all candidates are endpoints, but not all endpoints are examples or candidates. Each mode decides which endpoints are relevant. For instance, if the Java application mode wants to support candidates for sinks that are arguments passed to unknown method calls, then the Java application mode implementation needs to make sure that method arguments are endpoints. If you look at the `TApplicationModeEndpoint` implementation in [AutomodelApplicationModeCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelApplicationModeCharacteristics.qll), you can see that this is the case: the `TExplicitArgument` implements this behavior. |
| 45 | + |
| 46 | +Whether an endpoint is a positive example, a negative example, or a candidate depends on the individual extraction queries. |
| 47 | + |
| 48 | +### Concept: `EndpointCharacteristics` |
| 49 | + |
| 50 | +In the file [AutomodelSharedCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelSharedCharacteristics.qll), you will find the definition of the QL class `EndpointCharacteristic`. |
| 51 | + |
| 52 | +An endpoint characteristic is a QL class that "tags" all endpoints for which the characteristic's `appliesToEndpoint` predicate holds. The characteristic defines a `hasImplications` predicate that declares whether all the endpoints should be considered as sinks/sources/negatives, and with which confidence. |
| 53 | + |
| 54 | +The positive, negative, and candidate extraction queries largely[^1] use characteristics to decide which endpoints to select. For instance, if a characteristic exists that applies to an endpoint, and the characteristic implies (cf. `hasImplications`) that the endpoint is a sink with high confidence, then that endpoint will be selected as a positive example. See the use of `isKnownAs` in [AutomodelFrameworkModeExtractPositiveExamples.ql](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeExtractPositiveExamples.ql). |
| 55 | + |
| 56 | +[^1]: Candidate extraction queries are an exception: they treat `UninterestingToModelCharacteristic` differently. |
| 57 | + |
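The selection logic described above can be pictured with a small Java model. This is only a conceptual sketch and not the QL implementation: the class `CharacteristicSketch`, the string-based endpoints, the `impliedKind` field, and the `HIGH_CONFIDENCE` threshold are all hypothetical names chosen for illustration. It shows a characteristic "tagging" endpoints and a positive example being selected only when an applicable characteristic implies "sink" with high confidence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Conceptual model only; the real EndpointCharacteristic is a QL class.
final class CharacteristicSketch {
    /** Hypothetical characteristic: a tag plus what it implies about tagged endpoints. */
    static final class Characteristic {
        final Predicate<String> appliesToEndpoint; // analogue of `appliesToEndpoint`
        final String impliedKind;                  // e.g. "sink" or "negative"
        final double confidence;                   // confidence part of `hasImplications`

        Characteristic(Predicate<String> appliesToEndpoint, String impliedKind, double confidence) {
            this.appliesToEndpoint = appliesToEndpoint;
            this.impliedKind = impliedKind;
            this.confidence = confidence;
        }
    }

    // Assumed threshold for "high confidence"; not taken from the QL code.
    static final double HIGH_CONFIDENCE = 0.9;

    /** Selects endpoints that some characteristic marks as a sink with high confidence. */
    static List<String> positiveSinkExamples(List<String> endpoints, List<Characteristic> characteristics) {
        List<String> selected = new ArrayList<>();
        for (String endpoint : endpoints) {
            for (Characteristic c : characteristics) {
                if (c.appliesToEndpoint.test(endpoint)
                        && c.impliedKind.equals("sink")
                        && c.confidence >= HIGH_CONFIDENCE) {
                    selected.add(endpoint);
                    break; // one matching characteristic is enough
                }
            }
        }
        return selected;
    }

    /** Runs the selection on three made-up endpoints. */
    static List<String> demo() {
        Characteristic knownSink = new Characteristic(e -> e.startsWith("Files.copy"), "sink", 1.0);
        Characteristic weakHint = new Characteristic(e -> e.contains("log"), "sink", 0.3);
        return positiveSinkExamples(
                Arrays.asList("Files.copy#source", "Logger.log#msg", "Math.abs#a"),
                Arrays.asList(knownSink, weakHint));
    }
}
```

In this sketch, `demo()` selects only `Files.copy#source`: the weak hint applies to `Logger.log#msg` but falls below the assumed confidence threshold, and no characteristic applies to `Math.abs#a`.
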
| 58 | +#### :warning: Warning |
| 59 | + |
| 60 | +Do not add language- or mode-specific characteristics to "fix" shortcomings that a better prompt or better example selection could address. Those "fixes" tend to be confusing downstream, because questions like "why wasn't this location selected as a candidate?" become progressively harder to answer. It's best to rely on characteristics in the code that is shared across all languages and modes (see [Shared Code](#shared-code)). |
| 61 | + |
| 62 | +## Shared Code |
| 63 | + |
| 64 | +A significant part of the behavior of extraction queries is implemented in shared modules. When we add support for new languages, we expect to move the shared code to a separate QL pack. In the meantime, shared code modules must not import any Java libraries. |
| 65 | + |
| 66 | +## Packaging |
| 67 | + |
| 68 | +Automodel extraction queries come as a dedicated package. See [qlpack.yml](https://github.com/github/codeql/blob/main/java/ql/automodel/src/qlpack.yml). The [publish.sh](https://github.com/github/codeql/blob/main/java/ql/automodel/publish.sh) script is responsible for publishing a new version to the [package registry](https://github.com/orgs/codeql/packages/container/package/java-automodel-queries). |
| 69 | + |
| 70 | +### Backwards Compatibility |
| 71 | + |
| 72 | +We try to keep changes to extraction queries backwards-compatible whenever feasible. There are several reasons: |
| 73 | + |
| 74 | + - It is a flawed assumption that automodel can always decide which version of the package to run. We don't have direct control over the version of the extraction queries running on the user's local machine. |
| 75 | + - An automodel deployment will sometimes require the extraction queries to be published. If the new version of the extraction queries works with the old version of automodel, then it is much easier to roll back deployments of automodel. |
| 76 | + |
| 77 | +## Candidate Examples |
| 78 | + |
| 79 | +This section contains a few examples of the kinds of candidates that our queries might select, and why. |
| 80 | + |
| 81 | +:warning: For clarity, this section presents "candidates" that are **actual** sinks. In practice, the candidates presented here would therefore be selected as positive examples rather than as candidates. |
| 82 | + |
| 83 | +### Framework Mode Candidates |
| 84 | + |
| 85 | +Framework mode is special because in framework mode, we extract candidates (as well as examples) from the implementation of a framework or library while the resulting models are applied in code bases that are _using_ the framework or library. |
| 86 | + |
| 87 | +In framework mode, endpoints currently can have a number of shapes (see: `newtype TFrameworkModeEndpoint` in [AutomodelFrameworkModeCharacteristics.qll](https://github.com/github/codeql/blob/main/java/ql/automodel/src/AutomodelFrameworkModeCharacteristics.qll)). Depending on what kind of endpoint it is, the candidate is a candidate for one or several extensible types (e.g., `sinkModel`, `sourceModel`). |
| 88 | + |
| 89 | +#### Framework Mode Sink Candidates |
| 90 | + |
| 91 | +Sink candidates in framework mode are modeled as formal parameters of functions defined within the framework. We use these to represent the corresponding inputs of function calls in a client codebase, which would be passed into those parameters. |
| 92 | + |
| 93 | +For example, customer code could call the `Files.copy` method: |
| 94 | + |
| 95 | +```java |
| 96 | +// customer code using a library |
| 97 | +... |
| 98 | +Files.copy(userInputPath, outStream); |
| 99 | +... |
| 100 | +``` |
| 101 | + |
| 102 | +In order for `userInputPath` to be modeled as a sink, the corresponding parameter must be selected as a candidate. In the following example, assuming they're not modeled yet, the parameters `source` and `out` would be candidates: |
| 103 | + |
| 104 | +```java |
| 105 | +// Files.java |
| 106 | +// library code that's analyzed in framework mode |
| 107 | +public class Files { |
| 108 | +    public static void copy(Path source, OutputStream out) throws IOException { |
| 109 | +        // ... |
| 110 | +    } |
| 111 | +} |
| 112 | +``` |
| 113 | + |
| 114 | +#### Framework Mode Source Candidates |
| 115 | + |
| 116 | +Source candidates are a bit more varied than sink candidates: |
| 117 | + |
| 118 | +##### Parameters as Source Candidates |
| 119 | + |
| 120 | +A parameter could be a source, e.g. when a framework passes user-controlled data to a handler defined in customer code. |
| 121 | +```java |
| 122 | +// customer code using a library: |
| 123 | +import java.net.http.WebSocket; |
| 124 | + |
| 125 | +final class MyListener implements WebSocket.Listener { |
| 126 | +    @Override |
| 127 | +    public CompletionStage<?> onText(WebSocket ws, CharSequence cs, boolean last) { |
| 128 | +        // ... process data that was received from websocket |
| 129 | +    } |
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +In this case, data passed to the program via a web socket connection is a source of remote data. Therefore, when we look at the implementation of `WebSocket.Listener` in framework mode, we need to produce a candidate for each parameter: |
| 134 | + |
| 135 | +```java |
| 136 | +// WebSocket.java |
| 137 | +// library code that's analyzed in framework mode |
| 138 | +interface Listener { |
| 139 | +    ... |
| 140 | +    default CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) { |
| 141 | +        // <omitting default implementation> |
| 142 | +    } |
| 143 | +    ... |
| 144 | +} |
| 145 | +``` |
| 146 | + |
| 147 | +For framework mode, all parameters of the `onText` method should be candidates. If the candidates result in a model, the corresponding parameters of methods overriding `onText` in classes that implement this interface will be recognized as sources of remote data. |
| 148 | + |
| 149 | +:warning: A consequence of this is that we can have endpoints in framework mode that are both sink candidates and source candidates. |
| 150 | + |
| 151 | +##### Return Values as Source Candidates |
| 152 | + |
| 153 | +The other kind of source candidate we model is the return value of a method. For example: |
| 154 | + |
| 155 | +```java |
| 156 | +public class Socket { |
| 157 | +    ... |
| 158 | +    public InputStream getInputStream() throws IOException { |
| 159 | +        ... |
| 160 | +    } |
| 161 | +    ... |
| 162 | +} |
| 163 | +``` |
| 164 | + |
| 165 | +This method returns a source of remote data that should be modeled as a source. We therefore want to select the _method_ as a candidate. |
| 166 | + |
| 167 | +### Application Mode Candidates |
| 168 | + |
| 169 | +In application mode, we extract candidates from an application that is using various libraries. |
| 170 | + |
| 171 | +#### Application Mode Source Candidates |
| 172 | + |
| 173 | +##### Overridden Parameters as Source Candidates |
| 174 | + |
| 175 | +In application mode, a parameter of a method that overrides another method is taken as a source candidate. This accounts for cases like the `WebSocket.Listener` example above, where an application implements a "handler" that receives remote data. |
| 176 | + |
| 177 | +##### Return Values as Source Candidates |
| 178 | + |
| 179 | +Just like in framework mode, application mode also has to consider the return value of a call as a source candidate. The difference is that in application mode, we extract from the application sources, not the library sources. Therefore, we use the invocation expression as a candidate (unlike in framework mode, where we use the method definition). |
| 180 | + |
| 181 | +#### Application Mode Sink Candidates |
| 182 | + |
| 183 | +In application mode, arguments to calls are sink candidates. |
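
As a rough illustration of the application mode sections above, the following sketch marks where candidates would sit in application code. `FakeConnection` is a hypothetical stand-in for a library class such as `Socket`, defined inline only so that the example is self-contained; the class and method names are assumptions, and whether a given endpoint is actually selected depends on the extraction queries and on what is already modeled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ApplicationModeSketch {
    /** Hypothetical stand-in for a library class that hands out remote data. */
    static final class FakeConnection {
        InputStream getInputStream() {
            return new ByteArrayInputStream("payload".getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Application code: reads from the "library" and writes to a temporary file. */
    static String demo() {
        try {
            FakeConnection connection = new FakeConnection();

            // Source candidate: the invocation expression `connection.getInputStream()`,
            // not the method definition (which lives in the library).
            InputStream in = connection.getInputStream();

            Path target = Files.createTempFile("automodel-sketch", ".txt");
            try {
                // Sink candidates: the arguments `in`, `target`, and the copy option
                // passed to this call.
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The contrast with framework mode is where the endpoint lives: here the candidates are expressions in the application (a call and its arguments), whereas framework mode would select the method definition and its formal parameters inside the library.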