Commit 919951e

mnafees, mrkaye97, jishnundth, TehShrike, and grutt authored
Observability overhaul + traces support (#3213)
* [Python] Feat: Hatchet OTel (#2881)
* feat: first pass at auto otel impl
* refactor: clean up a bit, naming, etc.
* refactor: rm instance vars
* fix: rm one more instance var
* chore: notes to self
* traces view
* minor changes
* trace view by task external id
* go sdk instrumentation
* e2e tests for Py SDK trace
---------
Co-authored-by: Mohammed Nafees <hello@mnafees.me>
* fix CI
* fix black lint
* fix example
* fix lint
* inject traceparent in Go SDK
* ctx propagation
* Feat: Opentelemetry for TS SDK (#2828) (#3218)
* add: otel as optional dep on ts packages
* feat: opentelemetry instrumentor for TS sdk, with example
* fix: lint
* revert: debug print
* remove: trailing space
* fix: ts otel patch file path, throw handlesteprun error upstream, ts otel examples
* fix: lint
* feat: add schedule_workflow instrumentor, add otel conig loader tests
* add: more robust wrap unwrap for patched modules
* fix: lint, update version
* refactor: ts otel config type assertion
* revert: rebase issues
* fix: lint
* fix: update worker patch for ts otel with InternalWorker
* fix: lint
* refactor: parsejson on otel
* fix: pnpm-lock
* fix: lint
* docs: add otel instrumented method warnings
Co-authored-by: Jishnu <jishnun789@gmail.com>
* many many random spans
* refetch polling
* otel postgres traces
* add to rbac.yaml
* some refactor
* oAPI side typing impl
* TS SDK example
* span names
* fix lint
* fix comments
* fix lint
* fix lint
* Copilot comments
* o11y overhaul frontend work (Josh) (#3231)
* Fix agentprism theming
  - make the obvious links between the Hatchet theme variables and the agentprism theme variables
  - update the non-obvious agentprism theme variables from oklch to hsl to fit with the Hatchet theme colors (thanks Claude)
* Refactor to put waterfall and trace behind the same observability tab
* delete agent-prism files we don't need
  DetailsView: https://storybook.agent-prism.evilmartians.io/?path=/docs/main-components-detailsview--docs&utm_medium=social&utm_source=github
  TraceViewer: https://storybook.agent-prism.evilmartians.io/?path=/story/demo-traceviewer--trace-viewer-story&utm_medium=social&utm_source=github
  SearchInput: https://storybook.agent-prism.evilmartians.io/?path=/docs/atoms-searchinput--docs&utm_medium=social&utm_source=github
* Hide the status badge and dot
* unused variable
* Delete more agent-prism components we don't need now
* add success and danger colors from the designs
* use the colors from the design, albeit while still depending on agent-prism's alternate statuses
* move convertRawSpansToSpanTree into our repo so I can work on the types
* move agent-prism-data into this repo, remove the unnecessary statuses
* Remove more unused components
* remove avatar, "trace span category", and genai/openinference-specific stuff from the TraceSpan types
* switch to our usual chevron icons
* Use the OTEL statuses and kinds from data-contracts
* Remove a bunch of unnecessary conversions from our otel types to the agent-prism data structures
* Remove unnecessary intermediate type
* move agent-prism specific types and transform functions to the agent-prism component directory
* AgentPrismTraceSpan -> OtelSpanTree
* Only expand the hatchet runs when first displaying spans
  also remove unnecessary re-flattening of the tree structure and start treating the root of the tree like an actual root element
* make it so clicking the name expands the children
  also don't make the rest of the card clickable unless onSpanSelect was passed in
* make the bar bigger per the design
* PR comments
* API naming convention
* fix migration
* Observability front-end: a few last fixes (#3249)
* remove max height from the tree view
* Make it much harder for the SpanCardTimeline to get bumped to the right at deep nestings
* Use durationNs, remove the last custom property from OtelSpanTree (#3258)
* Drop unnecessary useEffect
* Only keep the properties we need from the trace endpoint #3213 (comment)
* Remove unused agent-prism components
* examples
* default hatchet enable collector
* fix spans
* fix lint
* fix renames
* fix spans
* env vars
* BSP options
* attempt py test fix
* func naming
* remove redundant examples
* task names attrs Py TS
* test fixture
* fix go sdk error span status
* bug fixes
* lint py
* bug fixes
* docs push
* restore comments
* delete older observability docs
* Fix: Misc. O11y bugs + nits (#3352)
* fix: rename hatchet o11y -> observability
* fix: don't disable test
* fix: more instrumentor example to top level
* fix: timedelta for retry
* fix: remove untyped kwargs from instrumentor
* fix: rm nested grpc import
* chore: gen python, start reworking otel tests
* fix: on demand
* feat: add trace poll helper
* fix: remove crufty instrumentor test
* fix: start expanding tests
* fix: remove n+1 query
* fix: trace table pk
* fix: bytea span and trace ids
* fix: simplify queries, remove a bunch of cruft
* fix: enum types
* fix: rm print cruft
* feat: continue extending tests
* fix: single api
* fix: api wiring
* fix: comment
* fix: api params
* fix: fe
* fix: py tests
* fix: rm unused query, fix texts
* fix: copilot feedback
* fix: simplify
* chore: gen
* ci: try to fix old-engine-new-sdk test
* fix: put env var in the right place
* fix: namespaces
* fix: naming, remove silly pagination thing
* fix: rename endpoint
* fix: remove in-memory pagination
* fix: rbac
* fix: api
* fix: clickhouse
* fix: revert that, it's broken
* Feat: Lookup table for trace ids (#3357)
* feat: add k-v table
* feat: query for lookup
* feat: wiring up lookup table writes
* feat: wiring for lookup spans
* fix: partitioning
* fix: use txns everywhere
* fix: trace deduping
* fix: deduplication
* feat: lookup query
* fix: wire up lookup table reads
* fix: naming
* fix: import
* fix: 404
* fix: rm comment
* fix: test
* fix: simplify query a bit
* chore: gen
* fix: attempt to fix py tests
* fix: test fix 2
* Feat o11y UI changes (#3333)
* feat: new component
* feat: err
* feat: mini map, filtering, zooming, oh my
* parent span is open
* feat: helpful popover
* feat: span detail view
* example changes
* dag name
* feat: batched spans
* feat: filters
* feat: queued time v0
* feat: table hover state
* feat: colors match badges
* fix: queued color matches state
* fix: error event on timeline
* feat: nearly complete engine spans
* feat: synthetic real time
* event focus
* fix: timelines and zooms
* fix: consistent expand affordance
* fix: polling
* fix: naming consistency
* fix: engine sdk span consistency and surrogate mapping
* fix: group treatment
* feat: time hint
* feat: match badge color even though i dont like it
* feat: queued synthetic
* fix: span styles
* fix: grouping and runtime
* feat: error state
* feat: shared time bar
* fix: rt times and statuses
* chore: lint
* fix: recursive open
* chore: lint
* chore: reuse existing filter/search code
* chore: initial component refactor
* chore: further refactor
* chore: first pass refactor of the span transformer
* chore: performance refactor
* chore: feedback
* fix: add task inserted at
* fix: filter engine spans for tests
* fix: lint
---------
Co-authored-by: matt <mrkaye97@gmail.com>
Co-authored-by: Jishnu <jishnun789@gmail.com>
Co-authored-by: Josh Duff <me@JoshDuff.com>
Co-authored-by: Gabe Ruttner <gabriel.ruttner@gmail.com>
1 parent b8c649f commit 919951e

File tree

185 files changed

+15328
-1667
lines changed

.github/workflows/sdk-python.yml

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ jobs:
 export SERVER_DEFAULT_ENGINE_VERSION=V1
 export SERVER_MSGQUEUE_RABBITMQ_URL="amqp://user:password@localhost:5672/"
 export SERVER_OPTIMISTIC_SCHEDULING_ENABLED=${{ matrix.optimistic-scheduling }}
+export SERVER_OBSERVABILITY_ENABLED=true

 go run ./cmd/hatchet-admin quickstart

.golangci.yml

Lines changed: 2 additions & 0 deletions
@@ -56,6 +56,7 @@ linters:
       - third_party$
       - builtin$
       - ^examples/
+      - ^sdks/go/examples/
       - '(.+)_test\.go'
       - "cmd/hatchet-loadtest/rampup/(.+).go"
 formatters:
@@ -72,3 +73,4 @@ formatters:
       - third_party$
       - builtin$
       - ^examples/
+      - ^sdks/go/examples/

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,4 +18,4 @@ repos:
     hooks:
       - id: golangci-lint
         args: ["--config=.golangci.yml", "--allow-parallel-runners"]
-        exclude: ^(examples/|sdks/guides/go/)
+        exclude: ^(examples/|sdks/guides/go/|sdks/go/examples/)

api-contracts/openapi/components/schemas/_index.yaml

Lines changed: 8 additions & 0 deletions
@@ -414,3 +414,11 @@ V1CELDebugResponse:
   $ref: "./v1/cel.yaml#/V1CELDebugResponse"
 V1CELDebugResponseStatus:
   $ref: "./v1/cel.yaml#/V1CELDebugResponseStatus"
+OtelSpan:
+  $ref: "./v1/otel.yaml#/OtelSpan"
+OtelSpanKind:
+  $ref: "./v1/otel.yaml#/OtelSpanKind"
+OtelStatusCode:
+  $ref: "./v1/otel.yaml#/OtelStatusCode"
+OtelSpanList:
+  $ref: "./v1/otel.yaml#/OtelSpanList"

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+OtelSpan:
+  type: object
+  properties:
+    traceId:
+      type: string
+    spanId:
+      type: string
+    parentSpanId:
+      type: string
+    spanName:
+      type: string
+    spanKind:
+      $ref: "#/OtelSpanKind"
+    serviceName:
+      type: string
+    statusCode:
+      $ref: "#/OtelStatusCode"
+    statusMessage:
+      type: string
+    durationNs:
+      type: integer
+      format: int64
+    createdAt:
+      type: string
+      format: date-time
+    resourceAttributes:
+      type: object
+      additionalProperties:
+        type: string
+    spanAttributes:
+      type: object
+      additionalProperties:
+        type: string
+    scopeName:
+      type: string
+    scopeVersion:
+      type: string
+    retryCount:
+      type: integer
+      format: int32
+  required:
+    - traceId
+    - spanId
+    - spanName
+    - spanKind
+    - serviceName
+    - statusCode
+    - durationNs
+    - createdAt
+    - retryCount
+
+OtelSpanKind:
+  type: string
+  enum:
+    - UNSPECIFIED
+    - INTERNAL
+    - SERVER
+    - CLIENT
+    - PRODUCER
+    - CONSUMER
+
+OtelStatusCode:
+  type: string
+  enum:
+    - UNSET
+    - OK
+    - ERROR
+
+OtelSpanList:
+  type: object
+  properties:
+    pagination:
+      $ref: "../metadata.yaml#/PaginationResponse"
+    retryCounts:
+      type: array
+      items:
+        type: integer
+        format: int32
+    rows:
+      type: array
+      items:
+        $ref: "#/OtelSpan"

api-contracts/openapi/openapi.yaml

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ paths:
     $ref: "./paths/v1/workflow-runs/workflow_run.yaml#/getWorkflowRunStatus"
   /api/v1/stable/workflow-runs/{v1-workflow-run}/task-events:
     $ref: "./paths/v1/workflow-runs/workflow_run.yaml#/listTaskEventsForWorkflowRun"
+  /api/v1/stable/tenants/{tenant}/traces:
+    $ref: "./paths/v1/observability/traces.yaml#/getTrace"
   /api/v1/stable/workflow-runs/{v1-workflow-run}/task-timings:
     $ref: "./paths/v1/workflow-runs/workflow_run.yaml#/getTimings"
   /api/v1/stable/tenants/{tenant}/task-metrics:

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+getTrace:
+  get:
+    x-resources: ["tenant"]
+    description: Get OTel trace for a workflow run
+    operationId: v1-observability:get-trace
+    parameters:
+      - description: The tenant id
+        in: path
+        name: tenant
+        required: true
+        schema:
+          type: string
+          format: uuid
+          minLength: 36
+          maxLength: 36
+      - description: The workflow run external id
+        in: query
+        name: run_external_id
+        required: true
+        schema:
+          type: string
+          format: uuid
+          minLength: 36
+          maxLength: 36
+      - description: The number of spans to skip
+        in: query
+        name: offset
+        required: false
+        schema:
+          type: integer
+          format: int64
+      - description: The number of spans to limit by
+        in: query
+        name: limit
+        required: false
+        schema:
+          type: integer
+          format: int64
+    responses:
+      "200":
+        content:
+          application/json:
+            schema:
+              $ref: "../../../components/schemas/_index.yaml#/OtelSpanList"
+        description: Successfully retrieved the OTel trace
+      "400":
+        content:
+          application/json:
+            schema:
+              $ref: "../../../components/schemas/_index.yaml#/APIErrors"
+        description: A malformed or bad request
+      "403":
+        content:
+          application/json:
+            schema:
+              $ref: "../../../components/schemas/_index.yaml#/APIErrors"
+        description: Forbidden
+      "404":
+        content:
+          application/json:
+            schema:
+              $ref: "../../../components/schemas/_index.yaml#/APIErrors"
+        description: Trace not found
+      "summary": Get OTel trace
+    tags:
+      - Observability
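
As an illustration of the path and query parameters defined above, the following sketch builds a request URL for the `/api/v1/stable/tenants/{tenant}/traces` route. It is not taken from any Hatchet SDK: `traceURL` is a hypothetical helper, the host and the two UUIDs are placeholders, and authentication is deliberately left out because the spec excerpt does not describe it.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// traceURL (hypothetical helper) assembles the getTrace URL from the
// path parameter `tenant` and the query parameters `run_external_id`,
// `offset`, and `limit` named in the OpenAPI definition.
func traceURL(baseURL, tenantID, runExternalID string, offset, limit int64) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	// Both IDs are UUIDs per the spec, so they need no path escaping.
	u.Path = strings.TrimSuffix(u.Path, "/") + "/api/v1/stable/tenants/" + tenantID + "/traces"

	q := url.Values{}
	q.Set("run_external_id", runExternalID)
	q.Set("offset", strconv.FormatInt(offset, 10))
	q.Set("limit", strconv.FormatInt(limit, 10))
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	return u.String(), nil
}

func main() {
	// Placeholder host and IDs, for illustration only.
	s, err := traceURL(
		"https://hatchet.example.com",
		"11111111-1111-1111-1111-111111111111",
		"22222222-2222-2222-2222-222222222222",
		0, 100,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Only `run_external_id` is required by the spec; `offset` and `limit` could be omitted, in which case the server-side defaults apply.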

api-contracts/openapi/paths/v1/workflow-runs/workflow_run.yaml

Lines changed: 1 addition & 0 deletions
@@ -485,6 +485,7 @@ trigger:
     tags:
       - Workflow Runs

+
 branchDurableTask:
   post:
     x-resources: ["tenant"]

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+package observability
+
+import (
+	"github.com/hatchet-dev/hatchet/pkg/config/server"
+)
+
+type V1ObservabilityService struct {
+	config *server.ServerConfig
+}
+
+func NewV1ObservabilityService(config *server.ServerConfig) *V1ObservabilityService {
+
+	return &V1ObservabilityService{
+		config: config,
+	}
+}

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+package observability
+
+import (
+	"errors"
+
+	"github.com/jackc/pgx/v5"
+	"github.com/labstack/echo/v4"
+
+	"github.com/hatchet-dev/hatchet/api/v1/server/oas/gen"
+	transformers "github.com/hatchet-dev/hatchet/api/v1/server/oas/transformers/v1"
+	"github.com/hatchet-dev/hatchet/pkg/repository/sqlcv1"
+)
+
+func (t *V1ObservabilityService) V1ObservabilityGetTrace(ctx echo.Context, request gen.V1ObservabilityGetTraceRequestObject) (gen.V1ObservabilityGetTraceResponseObject, error) {
+	if !t.config.Observability.Enabled {
+		return gen.V1ObservabilityGetTrace200JSONResponse(gen.OtelSpanList{}), nil
+	}
+
+	tenant := ctx.Get("tenant").(*sqlcv1.Tenant)
+
+	limit := int64(1000)
+	offset := int64(0)
+
+	if request.Params.Limit != nil {
+		limit = *request.Params.Limit
+	}
+
+	if request.Params.Offset != nil {
+		offset = *request.Params.Offset
+	}
+
+	if limit < 1 {
+		limit = 1000
+	}
+
+	if offset < 0 {
+		offset = 0
+	}
+
+	traceId, err := t.config.V1.OTelLookup().LookUpTraceId(ctx.Request().Context(), tenant.ID, request.Params.RunExternalId)
+
+	if errors.Is(err, pgx.ErrNoRows) {
+		return gen.V1ObservabilityGetTrace404JSONResponse(gen.APIErrors{
+			Errors: []gen.APIError{{Description: "Trace not found"}},
+		}), nil
+	} else if err != nil {
+		return nil, err
+	}
+
+	result, err := t.config.V1.OTelCollector().ListSpansByTraceId(ctx.Request().Context(), tenant.ID, traceId, offset, limit)
+	if err != nil {
+		return nil, err
+	}
+
+	return gen.V1ObservabilityGetTrace200JSONResponse(transformers.ToV1OtelSpanList(result.Rows, nil, limit, offset, result.Total)), nil
+}
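
The limit/offset handling in the handler above can be isolated as a small pure function, which makes the defaulting rules easier to see and to test. The sketch below is a standalone restatement of that logic, not code from the commit: `normalizePage`, `defaultLimit`, and `ptr` are hypothetical names introduced here.

```go
package main

import "fmt"

// defaultLimit matches the 1000-span default used in the handler.
const defaultLimit int64 = 1000

// normalizePage (hypothetical helper) applies the same rules as the
// handler: a missing or non-positive limit falls back to the default,
// and a missing or negative offset becomes 0.
func normalizePage(limit, offset *int64) (int64, int64) {
	l, o := defaultLimit, int64(0)
	if limit != nil {
		l = *limit
	}
	if offset != nil {
		o = *offset
	}
	if l < 1 {
		l = defaultLimit
	}
	if o < 0 {
		o = 0
	}
	return l, o
}

// ptr is a small convenience for building optional parameters.
func ptr(v int64) *int64 { return &v }

func main() {
	fmt.Println(normalizePage(nil, nil))          // 1000 0
	fmt.Println(normalizePage(ptr(0), ptr(-5)))   // 1000 0
	fmt.Println(normalizePage(ptr(50), ptr(100))) // 50 100
}
```

As in the handler, no upper bound is applied to the limit, so a caller can request more than 1000 spans in one page.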
