`sum` and `sumWith` (Type safe codecs for sum types with custom encodings) #62

m-bock · 2024-10-13T16:34:05Z

This PR provides a type safe way to define codecs for sum types.

Here we have a sum type with constructors which have different amounts of fields.

data Sample
  = Foo
  | Bar Int
  | Baz Boolean String Int

The codec gets defined by providing a record with a field for each Sum case. The codecs for the fields themselves get listed with nested tuples, with the special case of unit meaning "no fields".

import Data.Codec.Argonaut.Sum (sum)

codecSample ∷ JsonCodec Sample
codecSample = sum "Sample"
  { "Foo": unit
  , "Bar": C.int
  , "Baz": C.boolean /\ C.string /\ C.int
  }

Moreover, the actual encoding can be configured with a couple of options:

import Data.Codec.Argonaut.Sum (Encoding)

opts ∷ Encoding
opts =
  { tagKey: "tag"
  , valuesKey: "values"
  , unwrapSingleArguments: false
  , omitEmptyArguments: false
  }

The options can be passed to sumWith like so:

import Data.Codec.Argonaut.Sum (sumWith)

codecSample ∷ JsonCodec Sample
codecSample = sumWith opts "Sample"
  { "Foo": unit
  , "Bar": C.int
  , "Baz": C.boolean /\ C.string /\ C.int
  }

I also added some tests which may be worth looking into.

garyb · 2024-10-24T14:32:01Z

Sorry, I meant to get back to this sooner.

I was surprised when I saw the PR as I thought I'd already added something very similar to this, as I've already been using it in a project for years! I think I didn't get around to adding it because I my version doesn't take options about the representation, and I had intended to do something like what you have here.

I also have mine working without using the Nested tuple representation but maybe it's better to go with that, since it has the widely available (/\) operator rather than the (×) we have for Tuple in our custom prelude.

There is another encoding representation I'd like to support however, and that's using keys rather than tags. So instead of:

{ "tag": "Foo", "value": 42 }

It'd be:

{ "Foo": 42 }

But I imagine you might not care about that, so if you like we can merge it now and then I'll figure out adding support for this other representation (since it means changing a bunch of stuff including the options representation even - only unwrapSingleArguments would apply to the new case), and then release that once that's done? Or if you're happy to work on that too please be my guest. 😆

Also can you delete the spago files for now? I will migrate this to new spago, but want to use version ranges still, and unfortunately I'm not ready to drop bower support yet so would like to do something in CI to ensure they stay in sync or something.

m-bock · 2024-10-25T04:29:54Z

That's a good idea to implement the mentioned encoding as well. I can do it.

I'd change the encoding type to:

data Encoding
  = EncodeCtorAsTag
      { unwrapSingleArguments ∷ Boolean }
  | EncodeTagValue
      { tagKey ∷ String
      , valuesKey ∷ String
      , omitEmptyArguments ∷ Boolean
      , unwrapSingleArguments ∷ Boolean
      }

unwrapSingleArguments could be shared between both options somehow, but I think it's ok to duplicate for simplicity sake.

I am not sure if it's the best naming. Rust's serde library calls those different encodings

"Externally tagged" { "Foo": 42 }
"Adjacently tagged" { "tag": "Foo", "value": 42 }

See: https://serde.rs/enum-representations.html
But I find this difficult to remember.

Since we're at it, there are also 2 more encodings:

"Untagged" - I havn't thought about this before, but I think it should be possible, too. Decoding would just try all cases until the first one succeeds. It could be handled with a third Encoding option like: EncodeUntagged {unwrapSingleArguments :: Boolean}
"Internally tagged" { "tag": "Something", "fieldA": 42, "fieldB": "Me" } - This only works if the sum type has only one field. Also the single field must be a record and it cannot have a field that is also used as tag. The API for this would be different, since the tag must be given as type level symbol. I think it could be added in follow up PR. I have drafted something that is used like sumFlat @"tag" { "fieldA": codecA, "fieldB": codecB }.

Let me know if you have any suggestion towards this. But yeah, as I said, I can keep on working on this. I'll remove the spago files, and I think there's no need to merge now. We can merge, once everything is agreed on. There may be performance optimizations, too. I'll come back to that later.

garyb · 2024-10-28T14:54:16Z

I was using the term "nested" for { "Ctor": <value> } and "tagged" for { "tag": "Ctor", "value": <..> } but I don't have strong feelings about it. I suppose it's not very accurate since technically it's tagged either way. 😄

I would prefer to omit untagged, and force people to write codecs manually if they want that. One of the goals I had with the library is that if you only use the combinators it provides then the codecs are guaranteed to roundtrip, but that can easily be violated by the untagged representation. Admittedly there are some existing violations of that - the Compat version of Maybe can't represent Just Nothing, the Migration codecs can probably explode things, etc. If we were going to add an untagged one I'd probably prefer it be a separate thing so it's a very intentional choice to opt into it.

Thinking about it, I'd be tempted to split off a second "unsafe" library that provides things like that that are needed when you're not in control of the serialization format at both ends, but that can't be guaranteed to accurately roundtrip, as I do find myself hand writing codecs at times that could be made generic with some unsafety of that kind.

I think I understand what you mean with the internally tagged example, but just to check, it's for when you have something like this?

data Something
  = CtorA { fieldA :: Int, fieldB :: String }
  | CtorB { ... }
  | CtorC { ... }

Due to the potential for tag-field key collisions as you pointed out that maybe would be another one for the "unsafe" library, but it would be very useful indeed, that's probably the most common form of sum type representation I encounter when it's not in my control.

I've found a reasonably okay way of writing those where you leverage JPropCodec to write codecs for the record and then it's just the "tagging layer" that needs handling manually:

type DataIssueFieldsBase r =
  { pipelineId ∷ Pipeline.Id
  , datasetId ∷ Dataset.Id
  | r
  }

type DataIssueFields = DataIssueFieldsBase ()

type MissingColumnFields = DataIssueFieldsBase
  ( names ∷ NonEmptyArray String
  )

data Issue
  = MissingRows DataIssueFields
  | DuplicateRows DataIssueFields
  | MissingColumns MissingColumnFields

codecDataIssueFields ∷ CA.JPropCodec DataIssueFields
codecDataIssueFields = CAR.record
  { pipelineId: Pipeline.codecId
  , datasetId: Dataset.codecId
  }

codecMissingColumnFields ∷ CA.JPropCodec MissingColumnFields
codecMissingColumnFields =
  codecDataIssueFields
    # CA.recordProp (Proxy ∷ _ "names") (Codec.nonEmptyArray CA.string)

codec ∷ CA.JsonCodec Issue
codec = C.codec' decode encode
  where
  encode ∷ Issue → J.Json
  encode = J.fromObject ∘ case _ of
    MissingRows fields →
      Object.fromFoldable
        $ ("type" × J.fromString "missing-rows")
        : CA.encode codecDataIssueFields fields
    DuplicateRows fields →
      Object.fromFoldable
        $ ("type" × J.fromString "duplicate-rows")
        : CA.encode codecDataIssueFields fields
    MissingColumns fields →
      Object.fromFoldable
        $ ("type" × J.fromString "missing-columns")
        : CA.encode codecMissingColumnFields fields

  decode ∷ J.Json → Either CA.JsonDecodeError Issue
  decode j = do
    obj ← CA.decode CA.jobject j
    typ ← lmap
      (CA.AtKey "type")
      (CA.decode CA.string =<< note CA.MissingValue (Object.lookup "type" obj))
    case typ of
      "missing-rows" →
        MissingRows <$> CA.decode codecDataIssueFields obj
      "duplicate-rows" →
        DuplicateRows <$> CA.decode codecDataIssueFields obj
      "missing-columns" →
        MissingColumns <$> CA.decode codecMissingColumnFields obj
      other →
        Left (CA.AtKey "type" (CA.UnexpectedValue (J.fromString other)))

But we definitely should be able to eliminate the need for that.

m-bock · 2024-10-31T06:13:27Z

"I was using the term "nested" for { "Ctor": } and "tagged" for { "tag": "Ctor", "value": <..> } but I don't have strong feelings about it. I suppose it's not very accurate since technically it's tagged either way. 😄"

Ok, true. Honestly I have no idea what's the perfect name. I changed it now to Nested/Tagged as you suggest. At least the names are short and handy.
"I would prefer to omit untagged, and force people to write codecs manually if they want that...."
Makes sense. I dropped the untagged option. I never had a use case for this anyways.
"Thinking about it, I'd be tempted to split off a second "unsafe" library"
sounds like a good idea to me.
"I think I understand what you mean with the internally tagged example, but just to check, it's for when you have something like this?"
Yes, that's what I mean with internally tagged:
```
data Something
= CtorA { fieldA :: Int, fieldB :: String }
| CtorB { ... }
| CtorC { ... }
```
Encoded as:
{ tag: "CtorA", "fieldA": 1, "fieldB": "abc" }

I suggest to discuss this in a separate PR once this one is merged. I have something that checks for tag/field collision at compile time, thus I think it does not need to be regarded as "unsafe". I agree, I think there is a need for it, I think it's the most common encoding in JS world.
I mentioned a performance optimization. It's disallowed at compile time to specify non existing cases in the record. Same behavior as in the Record codec. To avoid that a new record needs to be built several times, the Record implementation uses an unsafeCoerce
I added soimething similar to the sum codec.

Ok, from my side, this PR would be ready to be merged. Do you have any other points to discuss?

garyb · 2024-11-09T14:55:33Z

Sorry about the delay, I was on holiday for a couple of days and then catching up with work stuff, etc. This is looking great though, thanks very much for your work on it!

m-bock · 2024-11-09T19:43:31Z

Great to see this merged, no worries about the delay!

m-bock added 3 commits October 12, 2024 12:25

Init

af6c2f0

Finish sum

db11d77

Cleanup

ea7d3bf

m-bock changed the title ~~sum and sumWith~~ sum and sumWith (Type safe codecs for sum types with custom encodings) Oct 13, 2024

m-bock added 4 commits October 25, 2024 07:08

Refactor encoding

019b2f7

Implement encodings

9dadffd

Remove spago files

d146777

Fix CI

94f51b3

m-bock added 3 commits October 31, 2024 06:41

Remove untagged

e9f8608

Use Nested/Tagged naming

49adfcd

Add performance optimization "unsafeDelete"

94113d4

m-bock added 5 commits October 31, 2024 07:14

Cleanup

394be76

Change README

a5828f8

change README

1a6a3b2

Change README

01e52a3

Change README

d26e86f

garyb merged commit 9dab273 into garyb:master Nov 9, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`sum` and `sumWith` (Type safe codecs for sum types with custom encodings) #62

`sum` and `sumWith` (Type safe codecs for sum types with custom encodings) #62

Uh oh!

m-bock commented Oct 13, 2024 •

edited

Loading

Uh oh!

garyb commented Oct 24, 2024

Uh oh!

m-bock commented Oct 25, 2024

Uh oh!

garyb commented Oct 28, 2024 •

edited

Loading

Uh oh!

m-bock commented Oct 31, 2024 •

edited

Loading

Uh oh!

garyb commented Nov 9, 2024

Uh oh!

Uh oh!

m-bock commented Nov 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sum and sumWith (Type safe codecs for sum types with custom encodings) #62

sum and sumWith (Type safe codecs for sum types with custom encodings) #62

Uh oh!

Conversation

m-bock commented Oct 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

garyb commented Oct 24, 2024

Uh oh!

m-bock commented Oct 25, 2024

Uh oh!

garyb commented Oct 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

m-bock commented Oct 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

garyb commented Nov 9, 2024

Uh oh!

Uh oh!

m-bock commented Nov 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`sum` and `sumWith` (Type safe codecs for sum types with custom encodings) #62

`sum` and `sumWith` (Type safe codecs for sum types with custom encodings) #62

m-bock commented Oct 13, 2024 •

edited

Loading

garyb commented Oct 28, 2024 •

edited

Loading

m-bock commented Oct 31, 2024 •

edited

Loading