Skip to content

refactor: Use a common from_value method #472

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

polarathene
Copy link
Collaborator

@polarathene polarathene commented Oct 7, 2023

Inspired from the json5 feature (Original PR July 2020, revised PR May 2021) with the refactoring in the revision by @matthiasbeyer

It looked useful for other formats that use serde to simplify their logic (so long as test coverage is informative of no issues 🙏 )


DRY: The json, json5 and toml parsers all leverage serde and can share a common enum to deserialize data into, instead of individual methods performing roughly the same transformations.

  • While ron improves their support for serde untagged enums with v0.9, it is still not compatible with this approach (Their README details why). (UPDATE: At least for the current config-rs test coverage, this appears to be passing now)
  • The yaml support doesn't leverage serde thus is not compatible. (UPDATE: Since there is talk about the unmaintained status, it could be switched for serde-yaml, I've verified it can pass the tests with an extra deserializer method)

from_parsed_value() is based on the approached used by format/json5.rs:

  • It has been adjusted to reflect the ValueKind enum, which could not directly be used due to the Table and Array types using Value as their value storage type instead of self-referencing the enum.
  • Very similar to a impl From, but supports the complimentary uri parameter for each Value derived.

Resolves: #394

@polarathene

This comment was marked as outdated.

@polarathene polarathene force-pushed the refactor/common-parser-transform branch from 6ca707b to 1c8ed59 Compare October 9, 2023 08:46
Explicit `ValueKind` mapping instead of implicitly inferred `Value` type.

Matches the `format/json5.rs` logic.

Signed-off-by: Brennan Kinney <[email protected]>
DRY: The `json`, `json5` and `toml` parsers all leverage `serde`
and can share a common enum to deserialize data into,
instead of individual methods performing roughly the same transformations.

- While `ron` improves their support for serde untagged enums with `v0.9`,
  it is still not compatible with this approach (_Their README details why_).
- The `yaml` support doesn't leverage `serde`, thus is not compatible.

`from_parsed_value()` is based on the approached used by `format/json5.rs`:
- It has been adjusted to reflect the `ValueKind` enum,
  which could not directly be used due to the `Table` and `Array` types using
  `Value` as their value storage type instead of self-referencing the enum.
- Very similar to a `impl From`, but supports the complimentary `uri` parameter
  for each `Value` derived.

Signed-off-by: Brennan Kinney <[email protected]>
- The enum did not properly handle the `Datetime` TOML value which needed
to be converted into a `String` type.
- This workaround approach avoids duplicating `from_parsed_value()` logic
to support one enum variant.

Signed-off-by: Brennan Kinney <[email protected]>
The enum is only intended as a helper for this deserializer into `String` type,
it can be bundled inside. Likewise, no need to `impl Display`.

Signed-off-by: Brennan Kinney <[email protected]>
If no other variant could be deserialized into successfully,
the type is not supported and treated as the `Nil` type.

This better communicates failures within tests when a type is
compared and is not expected to be `Nil`.

Signed-off-by: Brennan Kinney <[email protected]>
@polarathene polarathene force-pushed the refactor/common-parser-transform branch from 1c8ed59 to cfabdba Compare October 10, 2023 12:15
- `Char` variant needed to be converted to `String`.
- `Option` variant could be introduced generically.

NOTE: With `v0.9` of Ron, more types are introduced. Without tests, if these
do not deserialize into a supported type they will be treated as `Nil` type.

Signed-off-by: Brennan Kinney <[email protected]>
@polarathene polarathene force-pushed the refactor/common-parser-transform branch 3 times, most recently from 68df817 to c82013e Compare October 10, 2023 23:05
- Required switching to `serde_yaml`.
- Like with `yaml-rust`, requires special handling for table keys.

Signed-off-by: Brennan Kinney <[email protected]>
@polarathene polarathene force-pushed the refactor/common-parser-transform branch from c82013e to d4f2f35 Compare October 10, 2023 23:47
@polarathene
Copy link
Collaborator Author

This PR has been updated to also include support for formats Ron and Yaml.

  • They're split into their own commits those can be easily dropped (and split into follow-up PRs if preferred, especially for the Yaml crate switch).
  • Alternative PRs have been provided that could optionally be merged before this PR, or if any portion of this PR is rejected.

Ron:

Yaml:


During the integration of Ron and Yaml into this PR (like with earlier TOML test for datetime), both failed running their tests due to the more implicit deserialization_any used by Serde to support deserializing for the untagged enum.

Resolving requires either:

  • Adding a new type to deserialize into for from_value() to then handle
  • Using the serde with attribute to provide a custom deserializer that better guides handling of each formats' types into the expected type for config-rs (eg: Ron Char or Toml Datetime conversion to String).

It's not too different from the prior approach, these types were in the formats own from_value() method, whereas this approach better identifies when this is actually necessary. They still leverage feature toggles for conditional compilation.


For TOML with the Datetime type that was easier to resolve thanks to the specific test failure. Ron and YAML failures were more cryptic usually, such as:

running 7 tests
test test_error_parse ... ok
test test_override_lowercase_value_for_enums ... ok
test test_override_uppercase_value_for_enums ... ok
test test_override_lowercase_value_for_struct ... ok
test test_override_uppercase_value_for_struct ... ok
test test_file ... ok

thread 'test_yaml_parsing_key' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass `--test file_yaml`

Which IIRC I had read was due to how deserialize_any works with the untagged enum, that is a situation of it recursively trying to resolve the enum incorrectly due to reaching a variant that references Self and performing deserialize_any on that and so forth.

  • That's the main maintenance gotcha, but I doubt it would be encountered much going forward, and this approach simplifies the maintenance as a whole quite well.
  • deserialize_any (due to untagged enum approach) won't be as performant as the previous match statements for each individual format, but unlikely a concern for the purpose of config-rs.

Note
Ron 0.9 will introduce a Bytes value variant (Vec<u8>) for byte strings.

  • I don't believe the current approach would match it? You'd get the fallback Nil.
  • Probably acceptable given it's unclear how config-rs would want to support that.

Copy link
Collaborator Author

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some inline context if it assists with review.

If it's easier to follow/digest, I've staged out changes into scoped commits. Each commit has an associated commit message describing the changes 👍

Comment on lines +165 to +169
let key = match key {
serde_yaml::Value::Number(k) => Some(k.to_string()),
serde_yaml::Value::String(k) => Some(k),
_ => None,
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I tried to use:

let key = serde_yaml::from_value::<String>(k).ok();

However this resulted in a "stack overflow" according to my notes at the time.

IIRC this failure wasn't restricted to the untagged enum approach here, and the explicit match is also required for the existing approach used (including for the map variant in the previous yaml-rust crate).


I also have this failure in the notes that was caused by the keys string being mismatched (eg: instead of the string expected_key, a similar string like unexpected_key) was not caught well by the tests failure output:

Test failure output
running 7 tests
test test_error_parse ... ok
test test_override_uppercase_value_for_struct ... FAILED
test test_override_uppercase_value_for_enums ... FAILED
test test_override_lowercase_value_for_struct ... FAILED
test test_file ... FAILED
test test_yaml_parsing_key ... FAILED
test test_override_lowercase_value_for_enums ... FAILED

failures:

---- test_override_uppercase_value_for_struct stdout ----
thread 'test_override_uppercase_value_for_struct' panicked at tests/file_yaml.rs:151:66:
called `Result::unwrap()` on an `Err` value: missing field `bar`

---- test_override_uppercase_value_for_enums stdout ----
thread 'test_override_uppercase_value_for_enums' panicked at tests/file_yaml.rs:203:54:
called `Result::unwrap()` on an `Err` value: value of enum EnumSettings should be represented by either string or table with exactly one key

---- test_override_lowercase_value_for_struct stdout ----
thread 'test_override_lowercase_value_for_struct' panicked at tests/file_yaml.rs:186:56:
called `Result::unwrap()` on an `Err` value: missing field `foo`

---- test_file stdout ----
thread 'test_file' panicked at tests/file_yaml.rs:43:43:
called `Result::unwrap()` on an `Err` value: missing field `debug`

---- test_yaml_parsing_key stdout ----
thread 'test_yaml_parsing_key' panicked at tests/file_yaml.rs:115:10:
called `Result::unwrap()` on an `Err` value: missing field `inner_string`
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test_override_lowercase_value_for_enums stdout ----
thread 'test_override_lowercase_value_for_enums' panicked at tests/file_yaml.rs:221:54:
called `Result::unwrap()` on an `Err` value: value of enum EnumSettings should be represented by either string or table with exactly one key

Comment on lines +172 to +176
// Option to Result:
match (key, value) {
(Some(k), Some(v)) => Ok((k, v)),
_ => Err(serde::de::Error::custom("should not be serialized to Map")),
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit verbose, but easier to grok?

An alternative with .zip().ok_or_else() was considered, including here if you prefer it.


When I completed my first iteration and was after some feedback for better handling this .map() call having mixed error return types (serde_yaml + de::Error::custom), it was advised:

  • Have the key + value conversions convert their Result to Option.

  • Then if either key or value was None use the desired serde deserialization error here (which deserialize_any uses to understand this variant was unsuccessful and to try deserializing the next variant of the ParsedValue enum).

  • They had refactored to use separate functions for key and value, instead of nested inline within the map() closure.

    While instead of this match they used zip() on the tuple:

    let key = key_to_string(k);
    let value = val_to_parsed_value(v);
    key.zip(value).ok_or_else(|| serde::de::Error::custom("should not be serialized to Map") )

Which accomplishes the same by creating an iterator on the single value of key and value:

  • Produces a tuple (a.zip(b)) of Some(key, value) until one of the iterators encounters a None.
  • Return the option as a Result (ok_or_else()).

Comment on lines +158 to +164
match ParsedMap::deserialize(deserializer)? {
ParsedMap::Table(v) => Ok(v),
#[cfg(feature = "yaml")]
ParsedMap::YamlMap(table) => {
table
.into_iter()
.map(|(key, value)| {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .map() will produce a collection of Result<String, ParsedValue> items, and .collect() will transform that into a single Result<T, E> (where T would be Map<String, ParsedValue>).

An implicit feature of Rust, no explicit transform of each Result to a collection required.


I wasn't sure if it'd short-circuit on first error encountered, but due to the return type with Result for collect() it appears this is supported via FromIterator trait, thus doesn't require try_collect().

@@ -27,12 +27,13 @@ async = ["async-trait"]
[dependencies]
lazy_static = "1.0"
serde = "1.0.8"
serde_with = "3"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required for the Nil variant fallback (scoped to it's own commit) if deserialize_any fails to match anything prior.

The crate could be avoided if you wanted to implement the equivalent directly 🤷‍♂️

Any format that supports a `char` type would now be matched
for conversion to the expected `String` type.

Signed-off-by: Brennan Kinney <[email protected]>
@polarathene
Copy link
Collaborator Author

@matthiasbeyer this is ready for review, I know you're quite busy so here's a TLDR:

@polarathene polarathene mentioned this pull request Oct 20, 2023
@polarathene
Copy link
Collaborator Author

For custom format support, the format method added here, one I previously contributed (extract_root_table()) and sometimes ParsedValue are required to leverage it (to call ParsedValue::deserialize(input_value)). This would require lib.rs to make the module pub, or some refactoring.

As it's all focused on using Serde with untagged enum, perhaps it shouldn't live in format.rs if it isn't useful to other formats (not that I expect those to be contributed much).


I also need to verify an assumption that other number types like u8 would correctly be cast to u64 supported by config-rs. Prior to this the mapping was more explicit, and I may have a misunderstanding with the untagged enum approach using deserialize_any() this way 😅

I will also share the other formats I integrated locally, if desired they can be contributed via follow-up PR.

@matthiasbeyer
Copy link
Member

So from what I read, we want this in.
@polarathene I'd like this PR to be rebased to latest master, so we get a CI run with latest master.
After that, I think you can merge this!

@polarathene
Copy link
Collaborator Author

After that, I think you can merge this!

I still need to address my concerns from my prior comment here. I will return to config-rs later today or tomorrow hopefully, presently tied up elsewhere for a bit.

After that's sorted locally, I'll rebase as requested and it'd be great to see this get merged :)

Regarding custom format support, how would you approach verifying that? Mostly so that you can catch that format module should be made public, incase someone were to revert that accidentally?

@polarathene
Copy link
Collaborator Author

Note to self, since this PR was raised the proposed yaml alternative (serde compatible) has been archived and a different yaml crate has been introduced instead, so this PR will need to drop that support (unfortunate, since the serde approach was preferrable I think).

// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
#[derive(serde::Deserialize, Debug)]
#[serde(untagged)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde(untagged) is a performance footgun in situations like this (easily noticeable in benchmarks when I made the same mistake in toml::Value).

We'd likely want to use serde-untagged.

Yes, json5 did this already but I care more if it spreads to all formats

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde(untagged) is a performance footgun in situations like this (easily noticeable in benchmarks when I made the same mistake in toml::Value).

Ah, I think when I contributed this I favored it for being more convenient maintenance wise and thinking of it from an application config viewpoint where you'd usually not have so much config to parse that it'd be a perf concern? 😅

We'd likely want to use serde-untagged.

Glancing over at the crate docs, I assume this means losing the convenience of serde(untagged) and requires a bit more explicit mapping to each variant?

I don't have a problem with that, but it was nice to have for simple mappings of types to their variant.


Yes, json5 did this already but I care more if it spreads to all formats

This wasn't clear to me at first where you were referencing as I thought you meant upstream json5 crate (that only uses serde(untagged) in their tests) 😅 but I see you meant the config-rs format implementation for json5 uses serde(untagged) (and is removed via this PR too) 👍 (EDIT: I see it's also the first thing my PR description mentioned, I forgot as it's been a while 😓)

So as a bonus perhaps, this would address your concern about the json5 format being deferred to this common serde handler for supported formats?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a significant problem we run into, I stand by using serde-untaggedas I don't want the other formats to regress by adding this performance hit.

I don't have a problem with that, but it was nice to have for simple mappings of types to their variant.

I tried to push for serde_derive to have attributes to force type-variant mappings to avoid the performance overhead of untagged but the maintainers weren't interested.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stand by using serde-untaggedas I don't want the other formats to regress by adding this performance hit.

That's fair.

It could be useful as an alternative format helper to implement additional formats, which although might not be used by any official format, if a user could extend config-rs support separately without requiring upstreaming of a format into the crate?

I'll look into adapting to serde-untagged when I can find the time, but if anyone else wants to tackle that it'd be appreciated 😅


I tried to push for serde_derive to have attributes to force type-variant mappings to avoid the performance overhead of untagged but the maintainers weren't interested.

That's unfortunate, sounds like a very reasonable request 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful as an alternative format helper to implement additional formats, which although might not be used by any official format, if a user could extend config-rs support separately without requiring upstreaming of a format into the crate?

They can with https://docs.rs/config/latest/config/trait.Format.html

They won't get our extension-agnostic loader but that isn't always necessary and can easily be implemented on its own.

Comment on lines +50 to +53
// Equivalent to ValueKind, except Table + Array store the same enum
// Useful for serde to serialize values into, then convert to Value.
// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this just replaces our existing type we are deserializing to, I'll give this a pass. Overall, I'd rather we deserialize directly into our needed type.

That would either require

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this just replaces our existing type we are deserializing to, I'll give this a pass. Overall, I'd rather we deserialize directly into our needed type.

I believe that I wanted to deserialize directly too, but I think the comment here highlighted the reasoning that did not happen was due to config::Value being a struct with a value: ValueKind field while config::ValueKind enum has the Table + Array variants reference config::Value instead of Self to self-reference the enum, which complicated adapting to that?

If you look at what this PR does, it effectively replaces the format specific methods (like from_yaml_value() / from_toml_value()) with a single from_parsed_value() method instead.

Most of those format from_*_value() methods (that are serde reliant) took in a deserialized format value type as a parameter, delegating away error handling involved (except partially for Ron and the non-serde Yaml format).

In both cases the before/after approach of these methods are only really performing the task mapping to config::Value. That logic doesn't belong in value.rs when it's format/type specific though? You can see at the end of format.rs in this PR how the additional serde deserializers are added for format specific support concerns, with types/crates not really relevant to bundle into value.rs?


Hopefully that clarifies why ParsedValue was introduced as an intermediary convenience for the deserializing + mapping step. With from_parsed_value() it effectively consolidates the associated formats it can support, making this logic in config-rs much easier to grok and maintain?

Last I recall the various from_*_value() implementations were (or at least at some point when I was contributing) not particularly consistent in style/quality despite having roughly the same logic. One could tidy that up, or just make it DRY (which this PR favored), such that any new serde compatible formats would be much simpler to add/support, and thus maintain without as much boilerplate prone to diverging.

Existing (and new) serde compatible formats would have this common call going forward:

let value = format::from_parsed_value(uri, serde_json::from_str(text)?);

If format.rs can be adapted to make needing ParsedValue redundant (by adapting the same improvement to config::Value / ValueKind directly), that'd be great too! This PR was just focused on that initial step to take a DRY approach without being too invasive of a change to the codebase 😅

That said some of the PR value was lost since as the switch to serde-yaml didn't age well. It would have been nice to shift the added error handling there to serde.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting a dummy value in for the origin

NOTE: I used uri in my responses since it's frequently the variable name used instead of origin and was what came to mind when writing out my thoughts 😅


from_parsed_value() (like how other format methods before it did) provides the uri value to use when mapping from a ParsedValue to a config::Value (with it's associated uri + ValueKind variant).

Adding a placeholder uri (effectively None?) would then need a step that updates the uri for each Value (along with it's nested Value items depending on type), and I'm not sure how you'd like to approach that in the context of a method like from_parsed_value() where the uri is already given?

ParsedValue enum leveraged the serde untagged convenience to map the formats own Value type, but to have that done for config::Value instead just to add the uri would be verbose of an implementation?

I understand if using an intermediary this way instead of skipping that may seem as a codesmell, and I think I may have originally intended to look into how I could improve that after this PR was resolved. Partly because of a lack of familiarity with the project as a whole and not wanting to introduce breaking changes or deal with extra churn / considerations that may cause vs this reduced scope that was easier for me to reason about and get through review 😅

As such, I still think the current approach with this PR using the intermediary enum would be valid and keeps the PR focused on the goal of unifying the serde compatible formats to a common parse method.

Refactoring that to then make ParsedValue redundant and work directly with config::Value + ValueKind seems better suited to a follow-up PR, and would be more pleasant in commit/merge history as context for these distinct changes to any interested party, as opposed to munged together (but that could just be more perspective as I tend to encourage more focused PRs with all commits squashed via the PR merge strategy).


All the uri value represents is the source, so for these formats the config::Value returned and any nested value produced, that uri is the same throughout.

As such it's not that meaningful during the format mapping to config::Value, but collections of config::Value need each item annotated with that same uri value/ref, so that merging/layering has that traceability of source.

If that were something better handled to avoid the need for ValueKind::Array and ValueKind::Table to require config::Value items just to track the associated uri, then this whole thing could be simplified as desired 😅

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building up my serde-basepath idea I've been toying with

I glanced through the linked ref but that still didn't click for me sorry.


In the current state of this PR, I could have from_parsed_value() assign None as the uri (in lieu of a serde-untagged implementation effectively transforming to Value::new(uri, kind) for deserializing to config::ValueKind on the Array/Table variants, or kind.into() if deserializing straight to config::Value?), and then I could call a separate helper to update all the uri origin to the desired one before returning the final config::Value result?

// In value.rs:
impl Value {
    pub fn set_origin_recursively(&mut self, origin: Option<&String>) {
        self.origin = origin.cloned();
    
        // If kind is a collection, update the `origin` of their `Value`s too:
        match self.kind {
            ValueKind::Array(ref mut collection) => {
                for v in collection.iter_mut() {
                  v.set_origin_recursively(origin)
                }
            }
            
            ValueKind::Table(ref mut collection) => {
                for (_, v) in collection.iter_mut() {
                  v.set_origin_recursively(origin)
                }
            }
    
            // This is presumably unacceptable for handling all other variants
            // of `ValueKind` that don't contain nested `Value`s, but short of listing
            // all remaining variants here I'm not sure how you'd want to approach it
            _ => {},
        };
    }
}

// format.rs:
// Nothing to map if already deserialized into Value:
pub fn from_parsed_value(uri: Option<&String>, mut value: Value) -> Value {
    value.set_origin_recursively(uri);
    value
}

// Alternatively for ValueKind (implies Array/Table variants already have `v.origin = None`):
pub fn from_parsed_value(uri: Option<&String>, value: ValueKind) -> Value {
    let mut v = Value::from(value);
    v.set_origin_recursively(uri);

    v
}

Both would make from_parsed_value() largely redundant then 😅

The above set_origin_recursively() utility method could be added via a separate PR in the meantime? (origin isn't public, so updating the origin after it's set isn't supported I think?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose with the more verbose addition of serde-untagged, handling the Array + Table types of ValueKind would be a non-issue as .map/.seq calls would could leverage .into() similar to how you could construct ValueKind like this:

// ValueKind::Table(T) storing `Vec<T> => ValueKind::Array(T) => Value`
let map_of_vecs: ValueKind = HashMap::<String, Value>::from([
    ("abc".to_string(), vec!["a", "b", "c"].into()),
    ("xyz".to_string(), vec!["x", "y", "z"].into()),
]).into();

// ValueKind::Table(T) => Value
let v = Value::from(map_of_vecs);

Value::from will provide the None placeholder for origin and the added recursive method can patch that afterwards 👍


I think these two collection variants of ValueKind were only problematic with serde(untagged) when paired with the from_parsed_value() recursively calling itself to map the enum value, since they were not wrapping Self 😅

Other than that, there's the Option value type from Ron which I'm not sure how you'd approach that with serde-untagged, for serde(untagged) with ParsedValue the recursive call through from_parsed_value() worked with the type Option<Box<Self>>.

Likewise, while serde-untagged and formats like Ron support char, ValueKind lacks it and casts to ValueKind::String. A similar concern exists for Toml support when parsing it's date value as there is no equivalent in ValueKind it is mapped to the string variant too. Both cases were handled in this PR with a separate ParsedString enum delegated to serde(untagged), all wrapped in a custom deserializer method for the ParsedValue::String variant and the deserialize_with annotation (that I've not looked into if that's compatible with serde-untagged, I assume it is?).

So a few gaps on how to proceed there with serde-untagged, but I think it might still be better deferred to a follow-up PR replacing serde(untagged). It's diff might then be more straight-forward?

Copy link
Collaborator Author

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the verbosity, limited time to condense all my comments but I'll try 💪


The gist of my response is as follows:

  • ParsedValue + from_parsed_value() unifies the equivalent code it replaces. Deserializing directly to config::Value would be a separate improvement, this PR just consolidates / streamlines existing mapping. It might not be ideal, but I think it's still worthwhile?
  • I don't think performance is the deciding factor for choosing config-rs over the competition. I'm open to adopting serde-untagged here if someone could fill the gaps in proceeding, but it might be better deferred to a separate PR at a later date?

ParsedValue is not replacing ValueKind

As this just replaces our existing type we are deserializing to, I'll give this a pass.

The PR DRYs up existing format implementations (which are inconsistent in style/approach for blocks that accomplish the same task).

Assuming that your comment was referring to the ValueKind type being replaced (but I could be mistaken and you actually meant each individual format value type):

  • ParsedValue substitutes each formats own Value type, while from_parsed_value() unifies the existing equivalent format methods mapping of a format specific Value type to config::ValueKind (and subsequent conversion to config::Value).
  • The from_*_value() format methods tend to be recursively called for mapping to the final config::Value type (optionally with nested config::Value when ValueKind is a collection type) and associating the fixed origin source to each mapped config::Value.
  • ValueKind contains collection variants that store items of config::Value not Self, which is problematic for recursive calls of from_*_value(). You would instead need to deserialize directly into config::Value AFAIK and then recursively update origin.

This transition of existing formats to a common ParsedValue enum + from_parsed_value() method is AFAIK an improvement for maintainability and anyone referencing the format support to contribute new formats.

When I authored this PR I was able to easily add a few other formats without copy/pasting boilerplate of one of the existing implementations. That seems worthwhile?


serde-untagged / performance

serde(untagged) is a performance footgun in situations like this (easily noticeable in benchmarks when I made the same mistake in toml::Value).

We'd likely want to use serde-untagged.

Personally that may not be as digestible of a change within this PR?

That said, should anyone find the performance an issue when parsing configs in their real-world projects, they'd likely be motivated to contribute that suggested improvement (which is either reverting to the current state we have now prior to this PR with it's drawbacks, or adapting from the changes of this PR to use serde-untagged and skip ParsedValue?), you're an example of that with your contributions to the toml crates, so I completely understand the reluctance 😅

Config parsing benchmarks are presumably a rather niche concern for the bulk of projects though (which might prefer a config crate that is easier to grok and maintain, as existing issues tend to be focused on).

As you're effectively the core maintainer of this crate now, that is totally up to you. I am in support of your requests if you're familiar enough with how to approach the concerns I raised in my feedback that are unclear to me, but I get the impression it'd still be better delegated to a subsequent PR? (it has already been suggested that this PR alone in it's current state would benefit from splitting)

// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
#[derive(serde::Deserialize, Debug)]
#[serde(untagged)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde(untagged) is a performance footgun in situations like this (easily noticeable in benchmarks when I made the same mistake in toml::Value).

Ah, I think when I contributed this I favored it for being more convenient maintenance wise and thinking of it from an application config viewpoint where you'd usually not have so much config to parse that it'd be a perf concern? 😅

We'd likely want to use serde-untagged.

Glancing over at the crate docs, I assume this means losing the convenience of serde(untagged) and requires a bit more explicit mapping to each variant?

I don't have a problem with that, but it was nice to have for simple mappings of types to their variant.


Yes, json5 did this already but I care more if it spreads to all formats

This wasn't clear to me at first where you were referencing as I thought you meant upstream json5 crate (that only uses serde(untagged) in their tests) 😅 but I see you meant the config-rs format implementation for json5 uses serde(untagged) (and is removed via this PR too) 👍 (EDIT: I see it's also the first thing my PR description mentioned, I forgot as it's been a while 😓)

So as a bonus perhaps, this would address your concern about the json5 format being deferred to this common serde handler for supported formats?

Comment on lines +50 to +53
// Equivalent to ValueKind, except Table + Array store the same enum
// Useful for serde to serialize values into, then convert to Value.
// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this just replaces our existing type we are deserializing to, I'll give this a pass. Overall, I'd rather we deserialize directly into our needed type.

I believe that I wanted to deserialize directly too, but I think the comment here highlighted the reasoning that did not happen was due to config::Value being a struct with a value: ValueKind field while config::ValueKind enum has the Table + Array variants reference config::Value instead of Self to self-reference the enum, which complicated adapting to that?

If you look at what this PR does, it effectively replaces the format specific methods (like from_yaml_value() / from_toml_value()) with a single from_parsed_value() method instead.

Most of those format from_*_value() methods (that are serde reliant) took in a deserialized format value type as a parameter, delegating away error handling involved (except partially for Ron and the non-serde Yaml format).

In both cases the before/after approach of these methods are only really performing the task mapping to config::Value. That logic doesn't belong in value.rs when it's format/type specific though? You can see at the end of format.rs in this PR how the additional serde deserializers are added for format specific support concerns, with types/crates not really relevant to bundle into value.rs?


Hopefully that clarifies why ParsedValue was introduced as an intermediary convenience for the deserializing + mapping step. With from_parsed_value() it effectively consolidates the associated formats it can support, making this logic in config-rs much easier to grok and maintain?

Last I recall the various from_*_value() implementations were (or at least at some point when I was contributing) not particularly consistent in style/quality despite having roughly the same logic. One could tidy that up, or just make it DRY (which this PR favored), such that any new serde compatible formats would be much simpler to add/support, and thus maintain without as much boilerplate prone to diverging.

Existing (and new) serde compatible formats would have this common call going forward:

let value = format::from_parsed_value(uri, serde_json::from_str(text)?);

If format.rs can be adapted to make needing ParsedValue redundant (by adapting the same improvement to config::Value / ValueKind directly), that'd be great too! This PR was just focused on that initial step to take a DRY approach without being too invasive of a change to the codebase 😅

That said some of the PR value was lost since as the switch to serde-yaml didn't age well. It would have been nice to shift the added error handling there to serde.

Comment on lines +50 to +53
// Equivalent to ValueKind, except Table + Array store the same enum
// Useful for serde to serialize values into, then convert to Value.
// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting a dummy value in for the origin

NOTE: I used uri in my responses since it's frequently the variable name used instead of origin and was what came to mind when writing out my thoughts 😅


from_parsed_value() (like how other format methods before it did) provides the uri value to use when mapping from a ParsedValue to a config::Value (with it's associated uri + ValueKind variant).

Adding a placeholder uri (effectively None?) would then need a step that updates the uri for each Value (along with it's nested Value items depending on type), and I'm not sure how you'd like to approach that in the context of a method like from_parsed_value() where the uri is already given?

ParsedValue enum leveraged the serde untagged convenience to map the formats own Value type, but to have that done for config::Value instead just to add the uri would be verbose of an implementation?

I understand if using an intermediary this way instead of skipping that may seem as a codesmell, and I think I may have originally intended to look into how I could improve that after this PR was resolved. Partly because of a lack of familiarity with the project as a whole and not wanting to introduce breaking changes or deal with extra churn / considerations that may cause vs this reduced scope that was easier for me to reason about and get through review 😅

As such, I still think the current approach with this PR using the intermediary enum would be valid and keeps the PR focused on the goal of unifying the serde compatible formats to a common parse method.

Refactoring that to then make ParsedValue redundant and work directly with config::Value + ValueKind seems better suited to a follow-up PR, and would be more pleasant in commit/merge history as context for these distinct changes to any interested party, as opposed to munged together (but that could just be more perspective as I tend to encourage more focused PRs with all commits squashed via the PR merge strategy).


All the uri value represents is the source, so for these formats the config::Value returned and any nested value produced, that uri is the same throughout.

As such it's not that meaningful during the format mapping to config::Value, but collections of config::Value need each item annotated with that same uri value/ref, so that merging/layering has that traceability of source.

If that were something better handled to avoid the need for ValueKind::Array and ValueKind::Table to require config::Value items just to track the associated uri, then this whole thing could be simplified as desired 😅

Comment on lines +50 to +53
// Equivalent to ValueKind, except Table + Array store the same enum
// Useful for serde to serialize values into, then convert to Value.
// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building up my serde-basepath idea I've been toying with

I glanced through the linked ref but that still didn't click for me sorry.


In the current state of this PR, I could have from_parsed_value() assign None as the uri (in lieu of a serde-untagged implementation effectively transforming to Value::new(uri, kind) for deserializing to config::ValueKind on the Array/Table variants, or kind.into() if deserializing straight to config::Value?), and then I could call a separate helper to update all the uri origin to the desired one before returning the final config::Value result?

// In value.rs:
impl Value {
    pub fn set_origin_recursively(&mut self, origin: Option<&String>) {
        self.origin = origin.cloned();
    
        // If kind is a collection, update the `origin` of their `Value`s too:
        match self.kind {
            ValueKind::Array(ref mut collection) => {
                for v in collection.iter_mut() {
                  v.set_origin_recursively(origin)
                }
            }
            
            ValueKind::Table(ref mut collection) => {
                for (_, v) in collection.iter_mut() {
                  v.set_origin_recursively(origin)
                }
            }
    
            // This is presumably unacceptable for handling all other variants
            // of `ValueKind` that don't contain nested `Value`s, but short of listing
            // all remaining variants here I'm not sure how you'd want to approach it
            _ => {},
        };
    }
}

// format.rs:
// Nothing to map if already deserialized into Value:
pub fn from_parsed_value(uri: Option<&String>, mut value: Value) -> Value {
    value.set_origin_recursively(uri);
    value
}

// Alternatively for ValueKind (implies Array/Table variants already have `v.origin = None`):
pub fn from_parsed_value(uri: Option<&String>, value: ValueKind) -> Value {
    let mut v = Value::from(value);
    v.set_origin_recursively(uri);

    v
}

Both would make from_parsed_value() largely redundant then 😅

The above set_origin_recursively() utility method could be added via a separate PR in the meantime? (origin isn't public, so updating the origin after it's set isn't supported I think?)

Comment on lines +50 to +53
// Equivalent to ValueKind, except Table + Array store the same enum
// Useful for serde to serialize values into, then convert to Value.
// NOTE: Order of variants is important. Serde will use whichever
// the input successfully deserializes into first.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose with the more verbose addition of serde-untagged, handling the Array + Table types of ValueKind would be a non-issue as .map/.seq calls would could leverage .into() similar to how you could construct ValueKind like this:

// ValueKind::Table(T) storing `Vec<T> => ValueKind::Array(T) => Value`
let map_of_vecs: ValueKind = HashMap::<String, Value>::from([
    ("abc".to_string(), vec!["a", "b", "c"].into()),
    ("xyz".to_string(), vec!["x", "y", "z"].into()),
]).into();

// ValueKind::Table(T) => Value
let v = Value::from(map_of_vecs);

Value::from will provide the None placeholder for origin and the added recursive method can patch that afterwards 👍


I think these two collection variants of ValueKind were only problematic with serde(untagged) when paired with the from_parsed_value() recursively calling itself to map the enum value, since they were not wrapping Self 😅

Other than that, there's the Option value type from Ron which I'm not sure how you'd approach that with serde-untagged, for serde(untagged) with ParsedValue the recursive call through from_parsed_value() worked with the type Option<Box<Self>>.

Likewise, while serde-untagged and formats like Ron support char, ValueKind lacks it and casts to ValueKind::String. A similar concern exists for Toml support when parsing it's date value as there is no equivalent in ValueKind it is mapped to the string variant too. Both cases were handled in this PR with a separate ParsedString enum delegated to serde(untagged), all wrapped in a custom deserializer method for the ParsedValue::String variant and the deserialize_with annotation (that I've not looked into if that's compatible with serde-untagged, I assume it is?).

So a few gaps on how to proceed there with serde-untagged, but I think it might still be better deferred to a follow-up PR replacing serde(untagged). It's diff might then be more straight-forward?

@polarathene
Copy link
Collaborator Author

I'm detailing this as a refresher for future me or anyone else who may be interested in pushing this PR forward.

My time is often sparse, so with the additional requirements to adopt serde-untagged and the lack of clarity of what some serde(untagged) equivalent functionality would look like with serde-untagged, my availability to resume this PR is hindered.

Remaining tasks:

@epage
Copy link
Contributor

epage commented Aug 12, 2025

Introduce the #472 (comment) to support deserializing straight into config::Value (making redundant both from_parsed_value() from this PR, and the format equivalent methods it would have replaced).

Note that I said this was non-blocking but we will eventually want to do this.

@epage
Copy link
Contributor

epage commented Aug 12, 2025

Ensure #675 (comment) have been addressed.

It is non-blocking to update all formats to this

@epage
Copy link
Contributor

epage commented Aug 12, 2025

Use of serde_untagged is blocked on dtolnay/serde-untagged#10 being merged

epage added a commit to epage/config-rs that referenced this pull request Aug 12, 2025
epage added a commit that referenced this pull request Aug 12, 2025
Prep for #472

`serde(untagged)` translates to an `if`/`else if` ladder that ends up
being pretty expensive, including the error generation for each branch
that gets skipped. This switches us to dispatch on type instead,
speeding things up.

This is also prep for speeding up builds by removing the need for
`serde_derive`. Once `serde_core` is published, we can depend on that
and not wait for `syn`, `quote`, etc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants