Skip to content

Conversation

@m-bock
Copy link
Contributor

@m-bock m-bock commented Oct 13, 2024

This PR provides a type safe way to define codecs for sum types.

Here we have a sum type with constructors which have different amounts of fields.

data Sample
  = Foo
  | Bar Int
  | Baz Boolean String Int

The codec gets defined by providing a record with a field for each Sum case. The codecs for the fields themselves get listed with nested tuples, with the special case of unit meaning "no fields".

import Data.Codec.Argonaut.Sum (sum)

codecSample  JsonCodec Sample
codecSample = sum "Sample"
  { "Foo": unit
  , "Bar": C.int
  , "Baz": C.boolean /\ C.string /\ C.int
  }

Moreover, the actual encoding can be configured with a couple of options:

import Data.Codec.Argonaut.Sum (Encoding)

opts  Encoding
opts =
  { tagKey: "tag"
  , valuesKey: "values"
  , unwrapSingleArguments: false
  , omitEmptyArguments: false
  }

The options can be passed to sumWith like so:

import Data.Codec.Argonaut.Sum (sumWith)

codecSample  JsonCodec Sample
codecSample = sumWith opts "Sample"
  { "Foo": unit
  , "Bar": C.int
  , "Baz": C.boolean /\ C.string /\ C.int
  }

I also added some tests which may be worth looking into.

@m-bock m-bock changed the title sum and sumWith sum and sumWith (Type safe codecs for sum types with custom encodings) Oct 13, 2024
@garyb
Copy link
Owner

garyb commented Oct 24, 2024

Sorry, I meant to get back to this sooner.

I was surprised when I saw the PR as I thought I'd already added something very similar to this, as I've already been using it in a project for years! I think I didn't get around to adding it because I my version doesn't take options about the representation, and I had intended to do something like what you have here.

I also have mine working without using the Nested tuple representation but maybe it's better to go with that, since it has the widely available (/\) operator rather than the (×) we have for Tuple in our custom prelude.

There is another encoding representation I'd like to support however, and that's using keys rather than tags. So instead of:

{ "tag": "Foo", "value": 42 }

It'd be:

{ "Foo": 42 }

But I imagine you might not care about that, so if you like we can merge it now and then I'll figure out adding support for this other representation (since it means changing a bunch of stuff including the options representation even - only unwrapSingleArguments would apply to the new case), and then release that once that's done? Or if you're happy to work on that too please be my guest. 😆

Also can you delete the spago files for now? I will migrate this to new spago, but want to use version ranges still, and unfortunately I'm not ready to drop bower support yet so would like to do something in CI to ensure they stay in sync or something.

@m-bock
Copy link
Contributor Author

m-bock commented Oct 25, 2024

That's a good idea to implement the mentioned encoding as well. I can do it.

I'd change the encoding type to:

data Encoding
  = EncodeCtorAsTag
      { unwrapSingleArguments ∷ Boolean }
  | EncodeTagValue
      { tagKey ∷ String
      , valuesKey ∷ String
      , omitEmptyArguments ∷ Boolean
      , unwrapSingleArguments ∷ Boolean
      }

unwrapSingleArguments could be shared between both options somehow, but I think it's ok to duplicate for simplicity sake.

I am not sure if it's the best naming. Rust's serde library calls those different encodings

  • "Externally tagged" { "Foo": 42 }
  • "Adjacently tagged" { "tag": "Foo", "value": 42 }

See: https://serde.rs/enum-representations.html
But I find this difficult to remember.

Since we're at it, there are also 2 more encodings:

  • "Untagged" - I havn't thought about this before, but I think it should be possible, too. Decoding would just try all cases until the first one succeeds. It could be handled with a third Encoding option like: EncodeUntagged {unwrapSingleArguments :: Boolean}

  • "Internally tagged" { "tag": "Something", "fieldA": 42, "fieldB": "Me" } - This only works if the sum type has only one field. Also the single field must be a record and it cannot have a field that is also used as tag. The API for this would be different, since the tag must be given as type level symbol. I think it could be added in follow up PR. I have drafted something that is used like sumFlat @"tag" { "fieldA": codecA, "fieldB": codecB }.

Let me know if you have any suggestion towards this. But yeah, as I said, I can keep on working on this. I'll remove the spago files, and I think there's no need to merge now. We can merge, once everything is agreed on. There may be performance optimizations, too. I'll come back to that later.

@garyb
Copy link
Owner

garyb commented Oct 28, 2024

I was using the term "nested" for { "Ctor": <value> } and "tagged" for { "tag": "Ctor", "value": <..> } but I don't have strong feelings about it. I suppose it's not very accurate since technically it's tagged either way. 😄

I would prefer to omit untagged, and force people to write codecs manually if they want that. One of the goals I had with the library is that if you only use the combinators it provides then the codecs are guaranteed to roundtrip, but that can easily be violated by the untagged representation. Admittedly there are some existing violations of that - the Compat version of Maybe can't represent Just Nothing, the Migration codecs can probably explode things, etc. If we were going to add an untagged one I'd probably prefer it be a separate thing so it's a very intentional choice to opt into it.

Thinking about it, I'd be tempted to split off a second "unsafe" library that provides things like that that are needed when you're not in control of the serialization format at both ends, but that can't be guaranteed to accurately roundtrip, as I do find myself hand writing codecs at times that could be made generic with some unsafety of that kind.


I think I understand what you mean with the internally tagged example, but just to check, it's for when you have something like this?

data Something
  = CtorA { fieldA :: Int, fieldB :: String }
  | CtorB { ... }
  | CtorC { ... }

Due to the potential for tag-field key collisions as you pointed out that maybe would be another one for the "unsafe" library, but it would be very useful indeed, that's probably the most common form of sum type representation I encounter when it's not in my control.

I've found a reasonably okay way of writing those where you leverage JPropCodec to write codecs for the record and then it's just the "tagging layer" that needs handling manually:

type DataIssueFieldsBase r =
  { pipelineId  Pipeline.Id
  , datasetId  Dataset.Id
  | r
  }

type DataIssueFields = DataIssueFieldsBase ()

type MissingColumnFields = DataIssueFieldsBase
  ( names  NonEmptyArray String
  )

data Issue
  = MissingRows DataIssueFields
  | DuplicateRows DataIssueFields
  | MissingColumns MissingColumnFields

codecDataIssueFields  CA.JPropCodec DataIssueFields
codecDataIssueFields = CAR.record
  { pipelineId: Pipeline.codecId
  , datasetId: Dataset.codecId
  }

codecMissingColumnFields  CA.JPropCodec MissingColumnFields
codecMissingColumnFields =
  codecDataIssueFields
    # CA.recordProp (Proxy  _ "names") (Codec.nonEmptyArray CA.string)

codec  CA.JsonCodec Issue
codec = C.codec' decode encode
  where
  encode  Issue  J.Json
  encode = J.fromObject ∘ case _ of
    MissingRows fields →
      Object.fromFoldable
        $ ("type" × J.fromString "missing-rows")
        : CA.encode codecDataIssueFields fields
    DuplicateRows fields →
      Object.fromFoldable
        $ ("type" × J.fromString "duplicate-rows")
        : CA.encode codecDataIssueFields fields
    MissingColumns fields →
      Object.fromFoldable
        $ ("type" × J.fromString "missing-columns")
        : CA.encode codecMissingColumnFields fields

  decode  J.Json  Either CA.JsonDecodeError Issue
  decode j = do
    obj ← CA.decode CA.jobject j
    typ ← lmap
      (CA.AtKey "type")
      (CA.decode CA.string =<< note CA.MissingValue (Object.lookup "type" obj))
    case typ of
      "missing-rows"MissingRows <$> CA.decode codecDataIssueFields obj
      "duplicate-rows"DuplicateRows <$> CA.decode codecDataIssueFields obj
      "missing-columns"MissingColumns <$> CA.decode codecMissingColumnFields obj
      other →
        Left (CA.AtKey "type" (CA.UnexpectedValue (J.fromString other)))

But we definitely should be able to eliminate the need for that.

@m-bock
Copy link
Contributor Author

m-bock commented Oct 31, 2024

  • "I was using the term "nested" for { "Ctor": } and "tagged" for { "tag": "Ctor", "value": <..> } but I don't have strong feelings about it. I suppose it's not very accurate since technically it's tagged either way. 😄"

    Ok, true. Honestly I have no idea what's the perfect name. I changed it now to Nested/Tagged as you suggest. At least the names are short and handy.

  • "I would prefer to omit untagged, and force people to write codecs manually if they want that...."
    Makes sense. I dropped the untagged option. I never had a use case for this anyways.

  • "Thinking about it, I'd be tempted to split off a second "unsafe" library"
    sounds like a good idea to me.

  • "I think I understand what you mean with the internally tagged example, but just to check, it's for when you have something like this?"
    Yes, that's what I mean with internally tagged:

    data Something
    = CtorA { fieldA :: Int, fieldB :: String }
    | CtorB { ... }
    | CtorC { ... }
    

    Encoded as:
    { tag: "CtorA", "fieldA": 1, "fieldB": "abc" }

    I suggest to discuss this in a separate PR once this one is merged. I have something that checks for tag/field collision at compile time, thus I think it does not need to be regarded as "unsafe". I agree, I think there is a need for it, I think it's the most common encoding in JS world.

  • I mentioned a performance optimization. It's disallowed at compile time to specify non existing cases in the record. Same behavior as in the Record codec. To avoid that a new record needs to be built several times, the Record implementation uses an unsafeCoerce
    I added soimething similar to the sum codec.

Ok, from my side, this PR would be ready to be merged. Do you have any other points to discuss?

@garyb
Copy link
Owner

garyb commented Nov 9, 2024

Sorry about the delay, I was on holiday for a couple of days and then catching up with work stuff, etc. This is looking great though, thanks very much for your work on it!

@garyb garyb merged commit 9dab273 into garyb:master Nov 9, 2024
1 check passed
@m-bock
Copy link
Contributor Author

m-bock commented Nov 9, 2024

Great to see this merged, no worries about the delay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants