(* semgrep_output_v1.atd, from the semgrep/semgrep-interfaces repository (main branch) *)

<doc text="
  Specification of the Semgrep CLI JSON output formats using ATD
  (see https://atd.readthedocs.io/en/latest/ for information on ATD).

  This file specifies mainly the JSON formats of:

   - the output of the {{semgrep scan --json}} command

   - the output of the {{semgrep test --json}} command

   - the messages exchanged with the Semgrep backend by the
     {{semgrep ci}} command

  It's also (ab)used to specify the JSON input and output of semgrep-core,
  some RPC between pysemgrep and semgrep-core, and a few more internal
  things. We should use separate .atd for those different purposes but
  ATD does not have a proper module system yet and many types are shared
  so it is simpler for now to have everything in one file.

  There are other important forms of output which are not specified here:

   - The semgrep metrics sent to https://metrics.semgrep.dev in
     semgrep_metrics.atd

   - The parsing stats of semgrep-core -parsing_stats -json have their own
     Parsing_stats.atd

  For the definition of the Semgrep input (the rules), see rule_schema_v2.atd

  This file has the _v1 suffix to explicitly represent the
  version of this JSON format. If you need to extend this file, please
  be careful because you may break consumers of this format (e.g., the
  Semgrep playground or Semgrep backend or external users of this JSON).
  See https://atd.readthedocs.io/en/latest/atdgen-tutorial.html#smooth-protocol-upgrades
  for more information on how to smoothly extend the types in this file.

  Any backward-incompatible change should require upgrading the major
  version of Semgrep as this JSON output is part of the \"API\" of Semgrep
  (any incompatible change to the rule format should also require a major
   version upgrade). Hopefully, we will always be backward compatible.
  However, a few fields are tagged with [EXPERIMENTAL] meaning external users
  should not rely on them as those fields may be changed or removed.
  They are not part of the \"API\" of Semgrep.

  Again, keep in mind that this file is used both by the CLI to *produce* a
  JSON output, and by our backends to *consume* the JSON, including to
  consume the JSON produced by old versions of the CLI. As of Nov 2024,
  our backend still supports versions as far back as Semgrep 1.50.0, released
  in Nov 2023 (see server/semgrep_app/util/cli_version_support.py in the
  semgrep-app repo).

  This file is translated into OCaml modules by atdgen. Look for the
  corresponding Semgrep_output_v1_[tj].ml[i] generated files
  under dune's _build/ folder. A few types below have the 'deriving show'
  decorator because those types are reused in semgrep core data structures
  and we make heavy use of 'deriving show' in OCaml to help debug things.

  This file is also translated into Python modules by atdpy.
  For Python, a few types have the 'dataclass(frozen=True)' decorator
  so that the class can be hashed and put in a set. Indeed, with 'frozen=True'
  the class is immutable and dataclass can autogenerate a hash function for it.

  Finally, this file is translated into a JSON Schema/OpenAPI spec by atdcat,
  and into TypeScript modules by atdts.

  History:

   - the types in this file were originally inferred from JSON_report.ml for
     use by spacegrep when it was separate from semgrep-core. It's now also
     used in JSON_report.ml (now called Core_json_output.ml)

   - it was extended to not only support semgrep-core JSON output but also
     (py)semgrep CLI output!

   - it was then simplified with the osemgrep migration effort by
     gradually removing the semgrep-core JSON output.

   - it was extended to support 'semgrep ci' output to type most messages
     sent between the Semgrep CLI and the Semgrep backend

   - we use this file to specify RPCs between pysemgrep and semgrep-core
     for the gradual migration effort of osemgrep

   - merged what was in Input_to_core.atd here
">

(*
  Maintenance:

  - Most comments should be placed under <doc text="..."> so that they
    can be translated to the target language.
    These annotations will be translated into ocamldoc comments or equivalent
    in other languages (if implemented).
    See https://atd.readthedocs.io/en/latest/atdgen.html#integration-with-ocamldoc
  - Comments that are relevant only to the reader of the source ATD file
    should use the (* ... *) comment syntax.
*)

type raw_json
  <ocaml module="JSON.Yojson" t="t">
  <ocaml attr="deriving eq, ord, show">
  <doc text="escape hatch"> = abstract


(*****************************************************************************)
(* String aliases *)
(*****************************************************************************)

(* File path.
   less: could convert directly to Path class of pathlib library for Python
   See libs/commons/ATD_string_wrap.ml for more info on those ATD_string_wrap.
 *)
type fpath
     <ocaml attr="deriving eq, ord, show">
     <python decorator="dataclass(frozen=True, order=True)"> =
   string wrap <ocaml module="ATD_string_wrap.Fpath">

type ppath
     <ocaml attr="deriving show, eq">
     <python decorator="dataclass(frozen=True, order=True)"> =
   string wrap <ocaml module="Ppath">

type fppath
     <ocaml attr="deriving show, eq">
     <doc text="Same as Fppath.t: a nice filesystem path
 + the path relative to the project root provided for pattern-based
 filtering purposes.">
 = {
  fpath: fpath;
  ppath: ppath;
}

type uri
  <ocaml attr="deriving ord"> =
  string wrap <ocaml module="ATD_string_wrap.Uri">

type sha1
  <ocaml attr="deriving ord"> =
  string wrap <ocaml module="ATD_string_wrap.Sha1">

type uuid
  <ocaml attr="deriving ord"> =
  string wrap <ocaml module="ATD_string_wrap.Uuidm">

type datetime
  <ocaml attr="deriving ord">
  <doc text="RFC 3339 format"> =
  string wrap <ocaml module="ATD_string_wrap.Datetime">

type glob = string

(*****************************************************************************)
(* Versioning *)
(*****************************************************************************)
type version
  <ocaml attr="deriving show">
  <doc text="e.g., '1.1.0'"> = string

(*****************************************************************************)
(* Location *)
(*****************************************************************************)

type position
    <ocaml attr="deriving ord, show">
    <python decorator="dataclass(frozen=True, order=True)">
    <doc text="Note that there is no filename here like in 'location' below">
=
{
  line: int; (* starts from 1 *)
  col: int; (* starts from 1 *)
  ~offset
    <doc text="
    Byte position from the beginning of the file, starts at 0.
    OCaml code sets it correctly. Python code sets it to a dummy value (-1).
    This uses '~' because pysemgrep < 1.30? was *producing* positions without
    offset sometimes, and we want the backend to still *consume* such
    positions.
    Note that pysemgrep 1.97 was still producing dummy positions without
    an offset so we might need this ~offset longer than expected?
">
  : int;
}

type location
    <ocaml attr="deriving ord, show">
    <python decorator="dataclass(frozen=True)">
    <doc text="a.k.a range">
  =
{
  path: fpath;
  start: position;
  end <ocaml name="end_">: position;
}
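
(* Illustrative sketch only (not taken from real Semgrep output): assuming the
   usual atdgen JSON mapping, where a record becomes a JSON object and a
   string wrap type such as fpath stays a plain JSON string, a location
   could be serialized roughly as follows. The path and numbers are made up.

   {
     "path": "src/app.py",
     "start": { "line": 7, "col": 5, "offset": 120 },
     "end": { "line": 7, "col": 12, "offset": 127 }
   }

   Because offset is declared with a leading ~, a consumer should also
   accept a position without it, e.g. { "line": 7, "col": 5 }, in which
   case the field takes its default value (0 for an int in ATD).
*)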

(*****************************************************************************)
(* Simple semgrep types *)
(*****************************************************************************)

type rule_id
     <ocaml attr="deriving show, eq, ord">
     <python decorator="dataclass(frozen=True)">
     <doc text="e.g., \"javascript.security.do-not-use-eval\"">
  =
  string wrap <ocaml module="Rule_ID">

(*
   coupling: with 'severity' in 'rule_schema_v1.yaml'
   coupling: with 'severity' in 'rule_schema_v2.atd'
*)
type match_severity
    <ocaml attr="deriving eq, ord, show">
    <python decorator="dataclass(frozen=True)">
    <doc text="
   This is used in rules to specify the severity of matches/findings.
   alt: could be called rule_severity, or finding_severity.
{{{
   Error = something wrong that must be fixed
   Warning = something wrong that should be fixed
   Info = some special condition worth knowing about
   Experiment = deprecated: guess what
   Inventory = deprecated: was used for the Code Asset Inventory (CAI) project
}}}
">
  =
[
  | Error <json name="ERROR">
  | Warning <json name="WARNING">
  | Experiment <json name="EXPERIMENT">
  | Inventory <json name="INVENTORY">
  | Critical
      <json name="CRITICAL">
      <doc text="since 1.72.0, meant to replace the cases above where
      Error -> High, Warning -> Medium. Critical/Low are the only really
      new categories here, without equivalent before.
      Experiment and Inventory above should be removed. Info can be kept.">
  | High <json name="HIGH">
  | Medium <json name="MEDIUM">
  | Low <json name="LOW">
  | Info
      <json name="INFO">
      <doc text="generic placeholder for non-risky things
      (including experiments)">
]

type error_severity
    <ocaml attr="deriving show, eq">
    <python decorator="dataclass(frozen=True)">
    <doc text="
    This is used to specify the severity of errors which
    happened during Semgrep execution (e.g., a parse error).
{{{
    Error = Always an error
    Warning = Only an error if \"strict\" is set
    Info = Nothing may be wrong
}}}

   alt: could reuse match_severity but seems cleaner to define its own type">
   =
[
  | Error <json name="error">
  | Warning <json name="warn">
  | Info <json name="info">
]

type pro_feature
    <ocaml attr="deriving ord, show">
    <python decorator="dataclass(frozen=True)">
    <doc text="
    Used for a best-effort report to users about what findings they get with
    the pro engine that they couldn't with the oss engine.
{{{
    interproc_taint = requires interprocedural taint
    interfile_taint = requires interfile taint
    proprietary_language = requires some non-taint pro feature
}}}">
  =
{
  interproc_taint: bool;
  interfile_taint: bool;
  proprietary_language: bool;
}

type engine_of_finding
   <ocaml attr="deriving ord, show">
   <python decorator="dataclass(frozen=True)">
   <doc text="Report the engine used to detect each finding. Additionally, if we are able
   to infer that the finding could only be detected using the pro engine,
   report that the pro engine is required and include basic information about
   which feature is required.

{{{
   OSS = ran with OSS
   PRO = ran with PRO, but we didn't infer that OSS couldn't have found this
   finding
   PRO_REQUIRED = ran with PRO and requires a PRO feature (see pro_feature_used)
}}}

   Note: OSS and PRO could have clearer names, but for backwards compatibility
   we're leaving them as is">
   =
[
  | OSS
  | PRO
  | PRO_REQUIRED <doc text="Semgrep 1.64.0 or later"> of pro_feature
]
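
(* Illustrative sketch only: assuming the standard atdgen JSON encoding of
   variants (a constructor without argument becomes a JSON string, and a
   constructor with an argument becomes a two-element array), the three
   cases above would look roughly like:

     "OSS"
     "PRO"
     ["PRO_REQUIRED",
      { "interproc_taint": true,
        "interfile_taint": false,
        "proprietary_language": false }]

   The boolean values are made up for the example.
*)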

type engine_kind
   <ocaml attr="deriving ord, show">
   <python decorator="dataclass(frozen=True)"> =
[
  | OSS
  | PRO
]

type rule_id_and_engine_kind <python decorator="dataclass(frozen=True)"> =
  (rule_id * engine_kind)

type product
    <ocaml attr="deriving eq, ord, show">
    <python decorator="dataclass(frozen=True)"> =
[
  | SAST <json name="sast"> <doc text="a.k.a. Code">
  | SCA <json name="sca"> <doc text="a.k.a. SSC">
  | Secrets <json name="secrets">
]

type match_based_id
  <ocaml attr="deriving show, eq">
  <doc text="e.g. \"ab023_1\""> = string

(*****************************************************************************)
(* Matches *)
(*****************************************************************************)

type cli_match = {
  check_id: rule_id;
  inherit location;
  extra: cli_match_extra;
}

type cli_match_extra = {
  ?metavars
    <doc text="Since 1.98.0, you need to be logged in to get this field.
     note: we also need ?metavars because of the dependency_aware code">
  : metavars option;

  message
    <doc text="Those fields are derived from the rule but the metavariables
     they contain have been expanded to their concrete value.">
  : string;

  ?fix
    <doc text="If present, semgrep was able to compute a string that should be
     inserted in place of the text in the matched range in order to fix the
     finding. Note that this is the result of applying either the fix: or
     fix_regex: in a rule.">
  : string option;
  (* TODO: done with monkey patching right now in the Python code,
     and seems to be used only when sending findings to the backend. *)
  ?fixed_lines: string list option;

  metadata <doc text="fields coming from the rule">: raw_json;
  severity: match_severity;

  fingerprint
    <doc text="Since 1.98.0, you need to be logged in to get those fields">
  : string;
  lines: string;

  ?is_ignored <doc text="for nosemgrep ">: bool option;

  ?sca_info
    <doc text="EXPERIMENTAL: added by dependency_aware code">
  : sca_match option;

  ?validation_state
    <doc text="EXPERIMENTAL: If present indicates the status of postprocessor validation.
     This field not being present should be equivalent to No_validator.
     Added in semgrep 1.37.0">
  : validation_state option;

  ?historical_info
    <doc text="EXPERIMENTAL: added by secrets post-processing & historical scanning code
     Since 1.60.0.">
  : historical_info option;

  ?dataflow_trace
    <doc text="EXPERIMENTAL: For now, present only for taint findings. May be extended to
     others later on.">
  : match_dataflow_trace option;

  ?engine_kind: engine_of_finding option;

  ?extra_extra
    <doc text="EXPERIMENTAL: see core_match_extra.extra_extra">
  : raw_json option;
}
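
(* Illustrative sketch only: since cli_match uses inherit location, the
   path/start/end fields sit at the same level as check_id rather than in a
   nested object. With only the required fields of cli_match_extra, a match
   could look roughly like this (all values are made up):

   {
     "check_id": "javascript.security.do-not-use-eval",
     "path": "src/app.js",
     "start": { "line": 3, "col": 1, "offset": 40 },
     "end": { "line": 3, "col": 8, "offset": 47 },
     "extra": {
       "message": "Avoid eval()",
       "metadata": {},
       "severity": "ERROR",
       "fingerprint": "0123abcd",
       "lines": "eval(x)"
     }
   }

   Fields declared with a leading ? (metavars, fix, is_ignored, ...) are
   simply absent when they have no value.
*)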

(*****************************************************************************)
(* Metavariables *)
(*****************************************************************************)

type metavars
  <ocaml attr="deriving ord">
  <doc text="Name/value map of the matched metavariables.
   The leading '$' must be included in the metavariable name.">
  =
  (string * metavar_value) list
    <json repr="object">
    <python repr="dict">
    <ts repr="map">
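
(* Illustrative sketch only: because of <json repr="object">, the list of
   (name, value) pairs is serialized as a JSON object keyed by metavariable
   name, not as an array of pairs. For example (values are made up):

   {
     "$X": {
       "start": { "line": 3, "col": 6, "offset": 45 },
       "end": { "line": 3, "col": 7, "offset": 46 },
       "abstract_content": "x"
     }
   }
*)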

(* TODO: should just inherit location. Maybe it was optimized to not contain
   the filename, which might be redundant with the information in core_match,
   but with deep-semgrep a metavar could also refer to code in another file,
   so simpler to generalize and 'inherit location'.
 *)
type metavar_value
  <ocaml attr="deriving ord">
  <python decorator="dataclass(frozen=True)"> = {
  start
    <doc text="
    for certain metavariables like $...ARGS, 'end' may be equal to 'start'
    to represent an empty metavariable value. The rest of the Python
    code (message metavariable substitution and autofix) works
    without change for empty ranges (when end = start).
    ">
  : position;
  end <ocaml name="end_">: position;
  abstract_content <doc text="value?">: string;
  ?propagated_value: svalue_value option;
}

type svalue_value
  <ocaml attr="deriving ord">
  <python decorator="dataclass(frozen=True)"> = {
  ?svalue_start: position option;
  ?svalue_end: position option;
  svalue_abstract_content <doc text="value?">: string
}

(*****************************************************************************)
(* Matching explanations *)
(*****************************************************************************)
(* coupling: semgrep-core/src/core/Matching_explanation.ml
   LATER: merge with Matching_explanation.t at some point *)
type matching_explanation
  <doc text="EXPERIMENTAL">
  = {
    op: matching_operation;
    children: matching_explanation list;
    matches
      <doc text="result matches at this node (can be empty when we reach a nomatch)">
    : core_match list;
    (*
     *)
    loc
      <doc text="location in the rule file! not target file.
       This tries to delimit the part of the rule relevant to the current
       operation (e.g., the position of the 'patterns:' token in the rule
       for the And operation).">
    : location;
    ?extra <doc text="NEW: since v1.79">: matching_explanation_extra option;
}

(*
 *)
type matching_explanation_extra
    <doc text="
   For any \"extra\" information that we cannot fit at the node itself.
   This is useful for kind-specific information, which we cannot put
   in the operation itself without giving up our ability to derive `show`
   (needed for `matching_operation` below).">
  = {
  before_negation_matches
    <doc text="
    Only present in And kind.
    This information is useful for determining the input matches
    to the first Negation node.">
  : core_match list option;
  before_filter_matches
    <doc text="
    Only present in nodes which have children Filter nodes.
    This information is useful for determining the input matches
    to the first Filter node, as there is otherwise no way of
    obtaining the post-intersection matches in an And node, for instance
    ">
  : core_match list option;
}

(* TODO:
   - Negation
   - Where filters (metavar-comparison, etc)
   - tainting source/sink/sanitizer
   - subpattern EllipsisAndStmt, ClassHeaderAndElems
 *)
type matching_operation
    <ocaml attr="deriving show { with_path = false}">
    <doc text="
   Note that this type is used in Matching_explanation.ml hence the need
   for deriving show below.">
   =
[
  | And
  | Or
  | Inside
  | Anywhere
  | XPat
      <doc text="XPat for eXtended pattern. Can be a spacegrep pattern, a
     regexp pattern, or a proper semgrep pattern.
     see semgrep-core/src/core/XPattern.ml">
      of string
  (* TODO *)
  | Negation
  (* TODO "metavar-regex:xxx" | "metavar-comparison:xxx" | "metavar-pattern" *)
  | Filter of string
  (* TODO tainting "operations" *)
  | Taint
  | TaintSource
  | TaintSink
  | TaintSanitizer
  (* TODO subpatterns *)
  | EllipsisAndStmts
  | ClassHeaderAndElems
] <ocaml repr="classic">


(*****************************************************************************)
(* Match dataflow trace *)
(*****************************************************************************)
(* EXPERIMENTAL *)

(* It's easier to understand the dataflow trace data structures on a simple
   example. Here is one simple Python target file:

   1:   def foo():
   2:     return source()
   3:
   4:   def bar(v):
   5:     sink(v)
   6:
   7:   x = foo()
   8:   y = x
   9:   bar(y)

   and here is roughly the generated match_dataflow_trace assuming
   a Semgrep rule where source() is a taint source and sink() the taint sink:

    taint_source = CliCall("foo() @l7", [], CliLoc "source() @l2")
    intermediate_vars = ["x", "y"]
    taint_sink = CliCall("bar() @l9", ["v"], CliLoc "sink(v) @l5")
 *)

type match_dataflow_trace
  <ocaml attr="deriving ord">
  <python decorator="dataclass(frozen=True)"> = {
  ?taint_source: match_call_trace option;
  ?intermediate_vars
    <doc text="Intermediate variables which are involved in the dataflow. This
     explains how the taint flows from the source to the sink.">
  : match_intermediate_var list option;
  ?taint_sink: match_call_trace option;
}

(*
 *)
type loc_and_content
     <ocaml attr="deriving ord">
     <doc text="
   The string attached to the location is the actual code from the file.
   This can contain sensitive information so be careful!

   TODO: the type seems redundant since location already specifies a range.
   Maybe this saves some effort for users of this type, who do not
   need to read the file to get the content.">
  = (location * string)

type match_call_trace
  <ocaml attr="deriving ord">
  <python decorator="dataclass(frozen=True, order=True)"> =
[
  | CliLoc of loc_and_content
  | CliCall of (loc_and_content * match_intermediate_var list * match_call_trace)
] <ocaml repr="classic">
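
(* Illustrative sketch only: assuming the standard atdgen JSON encoding
   (tuples as arrays, variants with an argument as two-element arrays), the
   taint_source of the Python example above, written there as
   CliCall("foo() @l7", [], CliLoc "source() @l2"), could be serialized
   roughly as:

   ["CliCall",
    [ [ { "path": "t.py",
          "start": { "line": 7, "col": 5, "offset": 57 },
          "end": { "line": 7, "col": 10, "offset": 62 } },
        "foo()" ],
      [],
      ["CliLoc",
       [ { "path": "t.py",
           "start": { "line": 2, "col": 10, "offset": 20 },
           "end": { "line": 2, "col": 18, "offset": 28 } },
         "source()" ] ] ] ]

   The file name, columns and offsets are made up.
*)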


(*
*)
type match_intermediate_var
    <ocaml attr="deriving ord">
    <python decorator="dataclass(frozen=True)">
    <doc text="
   This type happens to be mostly the same as a loc_and_content for now, but
   it's split out because Iago has plans to extend this with more information">
  = {
  location: location;
  content
    <doc text="Unlike abstract_content, this is the actual text read from the
     corresponding source file">
  : string;
}

(*****************************************************************************)
(* Software Composition Analysis (SCA) match info (SCA part1) *)
(*****************************************************************************)
(* This is also known as Semgrep Supply Chain (SSC) *)

(* EXPERIMENTAL *)

type ecosystem
    <python decorator="dataclass(frozen=True)">
    <ocaml attr="deriving eq, ord, show { with_path = false }">
    <doc text="
   both ecosystem and transitivity below have frozen=True so the generated
   classes can be hashed and put in sets (see calls to reachable_deps.add()
   in semgrep SCA code)

   alt: type package_manager">
   =
[
  | Npm <json name="npm">
  | Pypi  <json name="pypi">
  | Gem <json name="gem">
  | Gomod <json name="gomod">
  | Cargo <json name="cargo">
  | Maven <json name="maven">
  | Composer <json name="composer">
  | Nuget <json name="nuget">
  | Pub <json name="pub">
  | SwiftPM <json name="swiftpm">
  | Cocoapods <json name="cocoapods">
  | Mix
      <json name="mix">
      <doc text="Deprecated: Mix is a build system, should use Hex, which is the ecosystem ">
  | Hex <json name="hex">
  | Opam <json name="opam">
] <ocaml repr="classic">

type dependency_kind
    <python decorator="dataclass(frozen=True)">
    <ocaml attr="deriving ord, eq, show">
  =
[
  | Direct
      <json name="direct">
      <doc text="
     we depend directly on the 3rd-party library mentioned in the lockfile
     (e.g., use of log4j library and concrete calls to log4j in 1st-party code).
     log4j must be declared as a direct dependency in the manifest file.">

  (* TODO? add and detect shadow dependencies? *)
  | Transitive
      <json name="transitive">
      <doc text="we depend indirectly (transitively) on the 3rd-party library
     (e.g., if we use lodash, which itself uses log4j internally, then
     lodash is a Direct dependency and log4j a Transitive one)

     alt: Indirect">

  | Unknown
      <json name="unknown">
      <doc text="
     If there is insufficient information to determine the transitivity,
     such as a requirements.txt file without a requirements.in manifest,
     we leave it Unknown.">
] <ocaml repr="classic">


type sca_match
     <doc text="part of cli_match_extra, core_match_extra, and finding">
   = {
  reachability_rule
    <doc text="
     does the rule have a pattern part; otherwise it's a \"parity\"
     or \"upgrade-only\" rule.">
  : bool;
  sca_finding_schema: int;
  dependency_match: dependency_match;
  (* TODO: deprecate, we should use sca_match_kind instead *)
  reachable: bool;
  ?kind <doc text="EXPERIMENTAL since 1.108.0">: sca_match_kind option;
}

(*
   coupling: see also SCA_match.ml

   TODO? have a Direct of xxx and Transitive of sca_transitive_match_kind?
   better so can be reused in other types such as tr_cache_result?
*)
type sca_match_kind
    <ocaml attr="deriving ord">
    <doc text="
   Note that in addition to \"reachable\" there are also the notions of
   \"vulnerable\" and \"exploitable\".">
  = [
  | LockfileOnlyMatch
      <doc text="
     This is used for \"parity\" or \"upgrade-only\" rules. transitivity
     indicates whether the match is for a direct or transitive usage of
     the dependency; for a dependency that is both direct and transitive
     two findings should be generated.">
      of dependency_kind
  | DirectReachable
      <doc text="
     found the pattern-part of the SCA rule in 1st-party code
     (reachable as originally defined by Semgrep Inc.)
     the match location will be in some target code.">
  | TransitiveReachable
      <doc text="
     found the pattern-part of the SCA rule in third-party code
     and ultimately found a path from 1st party code to this vulnerable
     third-party code.
     The goal of transitive reachability analysis is to change
     some Undetermined or (LockfileOnlyMatch Transitive) into
     TransitiveReachable or TransitiveUnreachable">
      of transitive_reachable
  | TransitiveUnreachable
      <doc text="This is a \"positive\" finding in the sense that semgrep was
     able to prove that the transitive finding is \"safe\" and
     can be ignored because either there is no call to the pattern-part
     of the SCA rule in 3rd-party code or, if there is one,
     it's in third-party code that is not accessed from the
     1st-party code (e.g., via callgraph analysis).
     Note that there is no need for DirectUnreachable because semgrep
     would never generate such a finding. We have TransitiveUnreachable
     because semgrep first generates some Undetermined that we then
     retag as TransitiveUnreachable.">
      of transitive_unreachable
  | TransitiveUndetermined
      <doc text="
     could not decide because of the engine limitations (e.g.,
     found the use of a vulnerable library in the lockfile but
     could not find the pattern in first party code and could not
     access third-party code for further investigation).
     This is similar to (LockfileOnlyMatch Transitive).">
      of transitive_undetermined
] < ocaml repr="classic">

type transitive_reachable = {
  (* TODO: include matched code? 3rd party libraries are usually OSS
     so maybe ok to store code for once in our DBs? Otherwise The App will
     have to redownload and reaccess the package code (or maybe we
     could give a github URL?)
  *)
  matches
    <doc text="
     The matches we found in 3rd party libraries.
     Ideally the locations in cli_match are relative to the root of the project
     so one can display matches as package@/path/to/finding.py">
  : (found_dependency * cli_match list) list;
  callgraph_reachable
    <doc text="
     LATER: add callgraph information so one can see the path from 1st party
     code to the vulnerable intermediate 3rd party function.
     This is set to None for now.">
  : bool option;
  explanation
    <doc text="some extra explanation that the user can understand">
  : string option;
}

type transitive_unreachable <ocaml attr="deriving ord"> = {
  analyzed_packages
    <doc text="
     We didn't find any findings in all the 3rd party libraries that are using
     the 3rd party vulnerable library. This is a \"proof of work\".">
  : found_dependency list;
  explanation
    <doc text="some extra explanation that the user can understand">
  : string option;
}

type transitive_undetermined <ocaml attr="deriving ord"> = {
  explanation: string option;
}

type dependency_match <ocaml attr="deriving ord"> = {
  dependency_pattern: sca_pattern;
  found_dependency: found_dependency;
  lockfile: fpath;
}

type sca_pattern <ocaml attr="deriving ord"> = {
  ecosystem: ecosystem;
  package: string;
  semver_range: string;
}

(* alt: sca_dependency? *)
type found_dependency <ocaml attr="deriving ord"> = {
  package: string;
  version: string;
  ecosystem: ecosystem;
  allowed_hashes <doc text="???">: (string * string list) list
    <json repr="object"> <python repr="dict"> <ts repr="map">;
  ?resolved_url: string option;
  transitivity: dependency_kind;
  ?manifest_path
    <doc text="
     Path to the manifest file that defines the project containing this
     dependency. Examples: package.json, nested/folder/pom.xml">
  : fpath option;
  ?lockfile_path
    <doc text="Path to the lockfile that contains this dependency.
     Examples: package-lock.json, nested/folder/requirements.txt, go.mod.
     Since 1.87.0">
  : fpath option;
  ?line_number
    <doc text="
     The line number of the dependency in the lockfile. When combined with the
     lockfile_path, this can identify the location of the dependency in the
     lockfile.">
  : int option;
  ?children
    <doc text="
     If we have dependency relationship information for this dependency, this
     field will include the name and version of other found_dependency items
     that this dependency requires.
     These fields must match values in `package` and `version` of another
     `found_dependency` in the same set">
  : dependency_child list option;
  ?git_ref
    <doc text="
    Git ref of the dependency if the dependency comes directly from a git repo.
    Examples: refs/heads/main, refs/tags/v1.0.0, e5c704df4d308690fed696faf4c86453b4d88a95.
    Since 1.66.0">
  : string option;
}

type dependency_child
  <ocaml attr="deriving ord">
  <python decorator="dataclass(frozen=True)"> = {
  package: string;
  version: string;
}
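
(* Illustration only, not part of the interface: a sketch of two
   found_dependency values in JSON where the first lists the second as a
   child. Package names, versions, and paths are made up; the strings used
   for ecosystem and transitivity are assumptions, since the encodings of
   ecosystem and dependency_kind are defined elsewhere in this file.

     { "package": "app-lib", "version": "1.2.0", "ecosystem": "npm",
       "allowed_hashes": {}, "transitivity": "direct",
       "lockfile_path": "package-lock.json", "line_number": 12,
       "children": [ { "package": "leaf-lib", "version": "0.3.1" } ] }

     { "package": "leaf-lib", "version": "0.3.1", "ecosystem": "npm",
       "allowed_hashes": {}, "transitivity": "transitive",
       "lockfile_path": "package-lock.json", "line_number": 40 }

   The child entry matches the package and version of the second value,
   as the doc for the children field requires.
 *)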

(*****************************************************************************)
(* Semgrep Secrets match info *)
(*****************************************************************************)
(* EXPERIMENTAL *)

(* TODO: use <ocaml repr="classic"> *)
type validation_state
    <ocaml attr="deriving eq, ord, show">
    <python decorator="dataclass(frozen=True)">
    <doc text="
   This type is used by postprocessors for secrets to report back
   the validity of a finding. No_validator is currently also used when no
   validation has occurred yet; if that becomes confusing, we could
   add another state.">
  =
[
  | Confirmed_valid <json name="CONFIRMED_VALID">
  | Confirmed_invalid <json name="CONFIRMED_INVALID">
  | Validation_error <json name="VALIDATION_ERROR">
  | No_validator <json name="NO_VALIDATOR">
]

type historical_info
    <ocaml attr="deriving ord">
    <doc text="part of cli_match_extra">
   = {
  git_commit
    <doc text="
    Git commit at which the finding is present. Used by \"historical\" scans,
    which scan non-HEAD commits in the git history. Relevant for finding, e.g.,
    secrets that are buried in the git history and that we would not find at
    HEAD.
    ">
  : sha1;
  ?git_blob
    <doc text="
    Git blob at which the finding is present. Sent in addition to the commit
    since some SCMs have permalinks which use the blob sha, so this information
    is useful when generating links back to the SCM.">
  : sha1 option;
  git_commit_timestamp: datetime;
}

(*****************************************************************************)
(* Errors *)
(*****************************************************************************)

(* coupling: if you add a constructor here with arguments, you probably need
   to adjust _error_type_string() in error.py for pysemgrep and
   Error.string_of_error_type() for osemgrep.
 *)
type error_type
    <ocaml attr="deriving show">
    <python decorator="dataclass(frozen=True, order=True)">
  =
[
  | LexicalError
      <json name="Lexical error">
      <doc text="
      File parsing related errors;
      coupling: if you add a target parse error then metrics for
      cli need to be updated. See cli/src/semgrep/parsing_data.py.">
  | ParseError
      <json name="Syntax error">
      <doc text="a.k.a SyntaxError">
  | OtherParseError <json name="Other syntax error">
  | AstBuilderError <json name="AST builder error">
  (* TODO? should we move invalid_rule_error_kind here? *)
  | RuleParseError
      <json name="Rule parse error">
      <doc text="Pattern parsing related errors.
     More precise information about the error is available in
     Rule.invalid_rule_error_kind in Rule.ml.">
  (* TODO: some should take error_span in param *)
  | SemgrepWarning
      <json name="SemgrepWarning">
      <doc text="generated in pysemgrep only">
  | SemgrepError <json name="SemgrepError">
  | InvalidRuleSchemaError <json name="InvalidRuleSchemaError">
  | UnknownLanguageError <json name="UnknownLanguageError">
  | InvalidYaml <json name="Invalid YAML">
  (* matching (semgrep) related *)
  | MatchingError
      <json name="Internal matching error">
      <doc text="internal error, e.g., NoTokenLocation">
  | SemgrepMatchFound (* TODO of string (* check_id *) *)
      <json name="Semgrep match found">
  | TooManyMatches <json name="Too many matches">
  (* other *)
  | FatalError
      <json name="Fatal error">
      <doc text="missing file, OCaml errors, etc.">
  | Timeout <json name="Timeout">
  | OutOfMemory <json name="Out of memory">
  | FixpointTimeout
      <json name="Fixpoint timeout">
      <doc text="since semgrep 1.132.0">
  | StackOverflow
      <json name="Stack overflow">
      <doc text="since semgrep 1.86.0">
  (* pro-engine specific *)
  | TimeoutDuringInterfile
      <json name="Timeout during interfile analysis">
  | OutOfMemoryDuringInterfile
      <json name="OOM during interfile analysis">
  | MissingPlugin
      <json name="Missing plugin">
      <doc text="since semgrep 1.40.0">
  (* !constructors with arguments! *)
  | PatternParseError
      <doc text="
      the string list is the \"YAML path\" of the pattern,
      e.g. {{[\"rules\"; \"1\"; ...]}}"> of string list
  | PartialParsing
      <doc text="list of skipped tokens. Since semgrep 0.97."> of location list
  | IncompatibleRule
      <doc text="since semgrep 1.38.0"> of incompatible_rule
  | PatternParseError0
      <json name="Pattern parse error">
      <doc text="
     The Xxx0 variants were introduced in semgrep 1.45.0. They exist so that
     our backend can still read the cli_error.type_ produced by older semgrep
     versions, which encoded the PatternParseError _ and IncompatibleRule _
     cases above as a single string (instead of a list
     [\"PatternParseError\", ...] as now).
     There is no PartialParsing0 because this case was encoded as a ParseError
     instead.
     ">
  | IncompatibleRule0 <json name="Incompatible rule">
  | DependencyResolutionError
      <doc text="since semgrep 1.94.0"> of resolution_error_kind
] <ocaml repr="classic">
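
(* Illustration only, not part of the interface: a sketch of how error_type
   values are expected to appear in JSON, assuming the standard ATD JSON
   encoding of variants. A constructor without an argument is a JSON string
   (its <json name> when one is given); a constructor with an argument is a
   two-element array holding the constructor name and the argument. The
   values below are made up for the example.

     Timeout                               ->  "Timeout"
     LexicalError                          ->  "Lexical error"
     PatternParseError ["rules"; "1"]      ->  ["PatternParseError", ["rules", "1"]]
     PatternParseError0 (old-style string) ->  "Pattern parse error"
 *)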

type incompatible_rule
     <ocaml attr="deriving show">
     <python decorator="dataclass(frozen=True)"> =
{
  rule_id: rule_id;
  this_version: version;
  ?min_version: version option;
  ?max_version: version option;
}

(* TODO: type exit_code = ... *)

type cli_error
    <doc text="(called SemgrepError in error.py)">
  = {
  code <doc text="exit code for the type_ of error">: int;
  level: error_severity;
  type_
    <json name="type">
    <doc text="
     Before 1.45.0 the type below was 'string', but its value was the result
     of converting error_type into a string, so using 'error_type' directly
     below should be mostly backward compatible
     thanks to the <json name> annotations in error_type.
     To be fully backward compatible, we also introduced the
     PatternParseError0 and IncompatibleRule0 cases in error_type.">
  : error_type;

  (* LATER: use a variant instead of all those ?xxx types *)
  ?rule_id: rule_id option;

  (* these are set for most parsing errors *)
  ?message <doc text="contains error location">: string option;
  ?path: fpath option;