
Plans for Data-Forge version 2 #108

@ashleydavis

Description

This issue is to discuss plans for version 2.

These are just ideas for the moment. I haven't started on this yet and am not sure when I will.

Plans for v2:

  • Minimize breaking changes
  • How do I make Data-Forge easier to use and easier to get started with?
    • Lazy evaluation is good for performance, but it makes DF hard to understand. Does lazy evaluation need to die? (See the lazy-vs-eager sketch below.)
    • If lazy evaluation were removed, the internals of DF could be massively simplified (getting rid of all the iterables/iterators).
    • If lazy evaluation were removed, you could look at a Series or DataFrame to see the current data that's in there (instead of, say, having to call toArray).
    • We could say that splitting data so that it fits in memory should happen above DF and is not DF's responsibility.
  • Move plugins to the same repo (plugins will be republished under the org, e.g. @data-forge/fs)
  • Revise, improve and integrate the documentation (supported by having all the code for plugins in the one repository)
  • Delegate all maths to a pluggable library. (See the maths-backend sketch below.)
    • This means we can swap between floating-point and decimal maths (for the people who need that).
  • Better support for statistics (e.g. linear regression, correlation, etc). I'm already working through this in v1. (See the correlation sketch below.)
  • Revise and overhaul serialization (e.g. support serialization/deserialization of JavaScript Date objects). (See the "$date" sketch below.)
    • Better support for mixed data types in columns (serializing the column type doesn't work for this; we might need to serialize a per-element type). I like the way MongoDB serializes dates to JSON ("$date").
  • Investigate replacing iterators with generator functions. I've investigated this now and it doesn't seem possible.
  • Add map, filter and reduce functions (this is done now) and deprecate the select and where functions, to make DF more JavaScript-like. (See the map/filter sketch below.)
  • Support streaming data (e.g. for processing massive CSV files). (See the streaming sketch below.)
    • Ideally DF would be async-first and be used to define pipelines for async streaming data, but does async go against the goal of making DF easier to use? Is there a way I can make async usage friendly?
    • I'm now thinking that async and parallelisation are higher-level concerns that exist above DF and are not DF's responsibility.
  • Define a format/convention for running transformations (map?) and accumulations (reduce?) over a Series or DataFrame.
  • It would be great if Series and DataFrame were somehow integrated. After all, a DataFrame is just a Series with columns attached. Having separate Series and DataFrame classes is good for easy-to-browse documentation, but it makes for a lot of duplicated code. If DataFrame could just derive from Series that would be quite nice, except that they have differing functionality. This needs some thought. (See the inheritance sketch below.)
Stretch goals:
  • Better performance (using TensorFlow.js?)
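
**Lazy-vs-eager sketch.** To make the lazy-evaluation trade-off concrete, here is v1's actual lazy Series API next to a hypothetical eager alternative. The EagerSeries class is illustrative only; nothing like it exists in DF today.

```typescript
import { Series } from 'data-forge';

// v1 (lazy): select() only records the transformation; nothing is
// computed until the pipeline is materialised with toArray().
const lazy = new Series([1, 2, 3]).select(x => x * 2);
console.log(lazy.toArray()); // [2, 4, 6], evaluated here

// Hypothetical eager v2: map() runs immediately, so the data is
// sitting right there to inspect, no materialising call needed.
class EagerSeries<T> {
    constructor(public readonly values: T[]) {}
    map<U>(fn: (value: T) => U): EagerSeries<U> {
        return new EagerSeries(this.values.map(fn)); // computed now
    }
}
const eager = new EagerSeries([1, 2, 3]).map(x => x * 2);
console.log(eager.values); // [2, 4, 6], already there
```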
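
**Maths-backend sketch.** One possible shape for delegating maths to a pluggable library. None of these names exist in Data-Forge today, and decimal.js is just one candidate backend.

```typescript
import Decimal from 'decimal.js';

// Hypothetical interface that DF operations would delegate to.
interface MathBackend {
    add(a: number, b: number): number;
    multiply(a: number, b: number): number;
}

// Default backend: native floating point.
const floatMath: MathBackend = {
    add: (a, b) => a + b,
    multiply: (a, b) => a * b,
};

// Alternative backend: exact decimal maths via decimal.js,
// for the people who need that.
const decimalMath: MathBackend = {
    add: (a, b) => new Decimal(a).plus(b).toNumber(),
    multiply: (a, b) => new Decimal(a).times(b).toNumber(),
};

// An operation like sum() just uses whichever backend was injected.
function sum(values: number[], math: MathBackend = floatMath): number {
    return values.reduce((acc, v) => math.add(acc, v), 0);
}

console.log(sum([0.1, 0.2]));              // 0.30000000000000004
console.log(sum([0.1, 0.2], decimalMath)); // 0.3
```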
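
**Correlation sketch.** The kind of statistics support meant here, using Pearson correlation as an example. This is a standalone illustration, not existing DF API.

```typescript
// Pearson correlation coefficient of two same-length samples.
function correlation(xs: number[], ys: number[]): number {
    const n = xs.length;
    const meanX = xs.reduce((a, b) => a + b, 0) / n;
    const meanY = ys.reduce((a, b) => a + b, 0) / n;
    let cov = 0, varX = 0, varY = 0;
    for (let i = 0; i < n; i++) {
        const dx = xs[i] - meanX;
        const dy = ys[i] - meanY;
        cov += dx * dy;   // accumulate covariance
        varX += dx * dx;  // and the two variances
        varY += dy * dy;
    }
    return cov / Math.sqrt(varX * varY);
}

console.log(correlation([1, 2, 3], [2, 4, 6])); // 1 (perfectly correlated)
```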
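
**"$date" sketch.** A minimal illustration of MongoDB-style per-element date tagging. The helper functions are hypothetical, not DF API.

```typescript
// Tag Date values on the way out...
function serializeValue(value: any): any {
    return value instanceof Date
        ? { $date: value.toISOString() }
        : value;
}

// ...and detect the tag on the way back in.
function deserializeValue(value: any): any {
    return value && typeof value === 'object' && typeof value.$date === 'string'
        ? new Date(value.$date)
        : value;
}

const row = { name: 'trade', when: new Date('2019-01-01T00:00:00Z') };

// Tag before stringifying (a JSON.stringify replacer won't work here,
// because Date.prototype.toJSON runs before the replacer sees the value).
const tagged = Object.fromEntries(
    Object.entries(row).map(([k, v]) => [k, serializeValue(v)]));
console.log(JSON.stringify(tagged));
// {"name":"trade","when":{"$date":"2019-01-01T00:00:00.000Z"}}

// A JSON.parse reviver can restore the Date.
const restored = JSON.parse(JSON.stringify(tagged), (_k, v) => deserializeValue(v));
console.log(restored.when instanceof Date); // true
```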
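
**map/filter sketch.** Both spellings side by side, assuming the map/filter/reduce additions mentioned in the list above; where/select are the LINQ-style names slated for deprecation.

```typescript
import { Series } from 'data-forge';

const series = new Series([1, 2, 3, 4]);

// LINQ-style names (to be deprecated):
const a = series.where(x => x % 2 === 0).select(x => x * 10).toArray();

// JavaScript-style names (the direction for v2):
const b = series.filter(x => x % 2 === 0).map(x => x * 10).toArray();

console.log(a); // [20, 40]
console.log(b); // [20, 40]
```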
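
**Streaming sketch.** One way streaming could sit above DF, as suggested: an async generator yields memory-sized chunks, and each chunk is wrangled with the ordinary synchronous API. readCsvChunks is hypothetical; only DataFrame is real Data-Forge.

```typescript
import { DataFrame } from 'data-forge';

// Hypothetical helper: streams a massive CSV and yields batches of
// parsed rows that comfortably fit in memory.
async function* readCsvChunks(path: string, rowsPerChunk: number): AsyncGenerator<object[]> {
    // ... stream the file, parse rows, yield them in batches ...
    yield [{ price: 1 }, { price: 2 }]; // placeholder batch
}

async function processHugeCsv(path: string): Promise<void> {
    for await (const rows of readCsvChunks(path, 10_000)) {
        const df = new DataFrame(rows);            // each chunk fits in memory
        console.log(df.getSeries('price').sum());  // ordinary synchronous DF
    }
}
```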
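
**Inheritance sketch.** A rough shape of "DataFrame derives from Series": a DataFrame as a Series of rows plus column-aware operations. Purely illustrative, and it shows the tension noted above: inherited methods like map return the base type, not a DataFrame.

```typescript
class SeriesV2<T> {
    constructor(public readonly values: T[]) {}
    map<U>(fn: (value: T) => U): SeriesV2<U> {
        return new SeriesV2(this.values.map(fn));
    }
}

type Row = Record<string, any>;

// A DataFrame is "just a Series with columns attached":
class DataFrameV2 extends SeriesV2<Row> {
    // The column-aware functionality that plain Series doesn't have.
    getSeries(columnName: string): SeriesV2<any> {
        return this.map(row => row[columnName]);
    }
}

const df = new DataFrameV2([{ a: 1, b: 2 }, { a: 3, b: 4 }]);
console.log(df.getSeries('a').values); // [1, 3]
```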
