request to add typed text/csv to registry, in parity with major databases & languages #40

@coolaj86

Rationale

A variety of languages support CSV as a first-class citizen for objects - Swift and PowerShell come immediately to mind - but many (perhaps most) languages support typed object serializers/deserializers for CSV through 3rd-party libraries or simple map operations as well: Go, JavaScript w/ JSDoc, Python, etc.

Since CSV is much simpler, smaller, and less server-intensive than JSON - and even than many binary formats - for representing many kinds of typed objects, it's often the ideal choice for easy-to-cache, easy-to-update data.

Creating complex permutations in a database - which explodes the size of the data that needs to be loaded into RAM, transmitted to the application, and then sent across the Internet just to satisfy a frontend rendering contract - is a little crazy when you stop and think about it: it's trivial to iterate over flat rows and build in-memory graphs client-side (whether on the Web, iOS, etc.).

If you use CSV, however, you can write very simple queries that satisfy business logic, often at 1/5 or 1/10 the size before gzip - which reduces stress across the entire pipeline.

Not only that, but when API endpoints are exposed as CSV they can be integrated very easily with Excel and Google Sheets, which drastically increases user productivity and drives down development cost and time, depending on the data.

Typical CSV Hydration

In general, there are 3 ways in which CSV hydration is performed, which parallel how Postgres, MariaDB, etc. store and hydrate data:

1. Standard RFC

id,name,dob,ssn
1,John Smith,4/1/1969,12-34-5678
2,Jane Doe,5/5/1971,10-20-3456
  • fields separated by commas - or an alternate character, such as TAB or US (ASCII Unit Separator)
  • records separated by newlines - or an alternate character, such as RS (ASCII Record Separator)
  • values containing quote characters, field separators, or record separators are quoted
    (a doubled double quote "" escapes to a single double quote)
    (it's less common for the quote character itself to be replaced, but sometimes ` is used)
  • empty lines are ignored, or treated as comments
  • lines beginning with a prescribed comment character (usually #) are ignored, or treated as comments
  • most parsers do NOT allow multi-character sequences (e.g. //) as a separator or comment, with the exception of \r\n
  • query parameters may be used to choose delimiters
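To make those rules concrete, here's a minimal quote-aware parser sketch in TypeScript (the function name and the comma/double-quote defaults are my own assumptions; comment-line handling is omitted for brevity - this is illustrative, not a reference implementation):

// Minimal sketch of the quoting rules above (illustrative, not a spec)
function parseCsv(text: string, fieldSep = ',', quote = '"'): string[][] {
  const records: string[][] = [];
  let record: string[] = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === quote && text[i + 1] === quote) { field += quote; i++; } // "" -> "
      else if (ch === quote) inQuotes = false;
      else field += ch;
    } else if (ch === quote && field === '') {
      inQuotes = true; // a quote only opens a quoted field at the field's start
    } else if (ch === fieldSep) {
      record.push(field); field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // \r\n is the one allowed sequence
      record.push(field); field = '';
      if (record.length > 1 || record[0] !== '') records.push(record); // skip empty lines
      record = [];
    } else {
      field += ch;
    }
  }
  if (field !== '' || record.length > 0) { record.push(field); records.push(record); }
  return records;
}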

In this configuration fields map exactly to a typical JSON object - except that:

  • strings, booleans, numbers, and nulls require extra information to distinguish (which OpenAPI already provides)
  • (optional) the CSV header format may differ from the object key format
    • use index rather than name
    • map name to a different form of the name

Just for the purpose of illustration, here's a schema with csv_field_name and csv_field_index (CSV is 1-indexed):

components:
  schemas:
    Thingy:
      type: object
      properties:
        id: { type: integer, format: uint32, csv_field_name: "I.D." }
        slug: { type: string, csv_field_index: 2 }
        name: { type: string, csv_field_name: "Name" }
        is_unique: { type: boolean, csv_field_name: "Is Unique" }
        created_at: { type: string, format: date-time }
        updated_at: { type: string, format: date-time }

I realize that's probably not exactly where this belongs (it probably goes with something related to the text/csv registration itself), but it's just to show the idea.
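In the same illustrative spirit, here's a rough sketch of how a client might hydrate typed objects from that schema - csv_field_name / csv_field_index are the hypothetical keywords above, and the ColumnSpec shape is made up for this example:

type ColumnSpec = {
  key: string;                        // object property name
  type: 'integer' | 'string' | 'boolean';
  csvName?: string;                   // hypothetical csv_field_name
  csvIndex?: number;                  // hypothetical csv_field_index (1-indexed)
};

function hydrate(rows: string[][], specs: ColumnSpec[]): Record<string, unknown>[] {
  const header = rows[0];
  return rows.slice(1).map((row) => {
    const obj: Record<string, unknown> = {};
    for (const spec of specs) {
      const i = spec.csvIndex !== undefined
        ? spec.csvIndex - 1                          // an explicit index wins
        : header.indexOf(spec.csvName ?? spec.key);  // else match the (renamed) header
      const raw = row[i];
      obj[spec.key] =
        spec.type === 'integer' ? Number(raw) :
        spec.type === 'boolean' ? raw === 'true' :
        raw; // dates, nulls, etc. would need schema-driven rules as well
    }
    return obj;
  });
}

// e.g. hydrate(parseCsv(body), [
//   { key: 'id', type: 'integer', csvName: 'I.D.' },
//   { key: 'slug', type: 'string', csvIndex: 2 },
// ])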

2. RFC, but with Lists

id,name,friends,ssn
1,John Smith,"2,3",12-34-5678
2,Jane Doe,"1,3",10-20-3456
3,Bob McMaster,"1,2",44-33-5555

Another common format is simply to embed lists - perhaps using \t as the field separator and a space as the subrecord separator, or (as above) quoting a comma-separated list.
(This is also how OIDC scopes and many other simple list types use spaces to create lists in query params.)

This is similar to Array column types in a database.

For example, that might be represented something like this with a list_delimiter:

components:
  schemas:
    Thingy:
      type: object
      properties:
        # ...
        codes:
          type: array
          list_delimiter: "\t"
          items:
            type: integer
            format: int32

Again, not that I think that metadata belongs exactly there - it's just to illustrate the idea.
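As a sketch of what a parser would do with such a keyword - split the already-unquoted field on the declared delimiter (the function name and the comma default are assumptions, chosen to match the friends example above):

// Hypothetical list-field handling: split the unquoted field on list_delimiter
function parseListField(raw: string, listDelimiter = ','): number[] {
  if (raw === '') return []; // empty field -> empty list
  return raw.split(listDelimiter).map((s) => parseInt(s, 10));
}

parseListField('2,3'); // [2, 3] - the quoted "2,3" from the friends column above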

3. Nested JSON

id,name,options,ssn
1,John Smith,"{""foo"":""bar"",""n"":42}",12-34-5678
2,Jane Doe,"{""baz"":""qux"",""n"":37}",10-20-3456

Just as with a JSON column in a database, occasionally you need something free-form that can hold awkward or inconsistent information without adding a dozen fields that might typically be empty.

components:
  schemas:
    Thingy:
      type: object
      properties:
        # ...
        options:
          nested_json: true
          $ref: '#/components/schemas/Options'
    Options:
      type: object
      properties:
        color:
          type: string
          enum: [red, green, blue]
        size:
          type: integer
          minimum: 1
      required: [color]

All of the same quoting rules apply - the implementor simply needs to know that the field should be parsed as JSON, as opposed to it merely happening to start and end with [ and ] or { and }.
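In other words, the schema flag (the hypothetical nested_json above) drives the decision, and the parse itself is trivial once the field has been unquoted:

// The schema - not the leading character - says the field is JSON
function parseNestedJsonField(raw: string): unknown {
  return raw === '' ? null : JSON.parse(raw);
}

parseNestedJsonField('{"foo":"bar","n":42}'); // { foo: 'bar', n: 42 }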

Bonus: Base64 & Hex, like JSON

id,name,icon
1,John Smith,"data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="
2,Jane Doe,"data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' width='1' height='1'/>"

It's pretty common to use data URLs or raw Base64 or Hex to encode binary data. This is no different from how it would already be handled in JSON.
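For example, a browser client might decode such a field like this (a sketch using standard browser APIs - atob and TextEncoder; nothing CSV-specific is needed):

function dataUrlToBytes(dataUrl: string): Uint8Array {
  const comma = dataUrl.indexOf(',');
  const meta = dataUrl.slice(0, comma);     // e.g. "data:image/gif;base64"
  const payload = dataUrl.slice(comma + 1);
  if (meta.endsWith(';base64')) {
    const bin = atob(payload);
    return Uint8Array.from(bin, (c) => c.charCodeAt(0));
  }
  return new TextEncoder().encode(decodeURIComponent(payload)); // plain-text data URL
}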

Super Bonus: Multiplexed

This is a bit out there, but there's also an idea that's been floating around of allowing multiple data types in a single CSV - though I don't know of any actual implementations yet:

type,person,id,name
type,book,id,title,author
person,1,John Smith
book,1,The Art of Computer Programming,Donald E. Knuth

The theory is that you can sort the data before sending it downstream such that all data relevant to an object graph is frontloaded, and the UI can begin rendering while streaming.
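If it were implemented, demultiplexing could be as simple as routing each row by its first field - a purely hypothetical sketch, since (as noted) I don't know of real implementations:

function demux(rows: string[][]): Map<string, Record<string, string>[]> {
  const headers = new Map<string, string[]>();
  const out = new Map<string, Record<string, string>[]>();
  for (const row of rows) {
    if (row[0] === 'type') {
      headers.set(row[1], row.slice(2)); // e.g. 'person' -> ['id', 'name']
      out.set(row[1], []);
    } else {
      const cols = headers.get(row[0]) ?? [];
      const obj: Record<string, string> = {};
      cols.forEach((col, i) => { obj[col] = row[i + 1]; });
      out.get(row[0])?.push(obj);
    }
  }
  return out; // e.g. out.get('book') -> [{ id: '1', title: '...', author: '...' }]
}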

However, in my own testing of a similar concept, I don't think there's any significant savings here - in fact it may be worse.

Tens of thousands of gzipped records can be loaded in parallel via CSV endpoints and reconstructed into a graph on the client in a matter of dozens of milliseconds.

The extra work to reshape the data beyond what a database query's ORDER BY already does might even slow it down.

Considering how easily the data is cached and how easily just the updated records can be gathered - most of that time being spent waiting for the query to finish in the DB - it might be a mistake to go this far.

Other considerations

These types would be useful both for GETting data to render or list in a client and for POSTing bulk data (such as something a user copied from a spreadsheet, or a list of items that needs some properties adjusted).

Having first-class CSV support in OpenAPI 3 will make it easier to document, build, and support simpler, data-efficient APIs (and maybe even encourage them).
