Description
Not so long ago I discovered a nifty Spark feature: Spark's Data Source. You can read this article on Hackernoon about it.
Basically, you create a class called DefaultSource that mixes in RelationProvider and SchemaRelationProvider, and whose createRelation methods return an object of type BaseRelation with TableScan. This allows you to specify a Spark schema and a method that returns an RDD[Row] matching that schema, which is automagically converted to a DataFrame when you do something like:
val df = spark.
  read.
  format("com.example.foo.bar").
  load("hdfs://path/to/my/data")
where the DefaultSource class resides in the package com.example.foo.bar.
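For illustration, a minimal sketch of what such a DefaultSource could look like. This is not our actual reader: the MeasurementRelation class, the two-column schema (sensor, value), and the semicolon-separated text format are all made up for the example. Only the Spark traits and the DefaultSource naming convention come from Spark itself.

```scala
package com.example.foo.bar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider, TableScan}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package passed to format(...).
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Used when the caller does not supply a schema.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, MeasurementRelation.defaultSchema)

  // Used when the caller supplies a schema via spark.read.schema(...).
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    // load(path) hands the path to the source under the "path" option key.
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new MeasurementRelation(path, schema)(sqlContext)
  }
}

object MeasurementRelation {
  // Hypothetical schema, assumed only for this sketch.
  val defaultSchema: StructType = StructType(Seq(
    StructField("sensor", StringType, nullable = false),
    StructField("value", DoubleType, nullable = false)))
}

// Hypothetical relation: reads lines of the form "sensor;value".
class MeasurementRelation(path: String, declaredSchema: StructType)(
    @transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // The schema Spark will attach to the resulting DataFrame.
  override def schema: StructType = declaredSchema

  // The rows must line up with the schema; this parser always emits (String, Double),
  // so a caller-supplied schema has to describe those two columns.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map { line =>
      val Array(sensor, value) = line.split(";", 2)
      Row(sensor, value.toDouble)
    }
}
```

With a class like this on the classpath, the spark.read.format("com.example.foo.bar").load(...) call resolves to it and returns a DataFrame with the declared schema.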
With this, I hooked up all our reading logic for our special data formats (binary or text-based measuring data that is not always readable with the default CSV data source).
It would be really nice to have a data source in Seahorse that lets you specify the package of the DefaultSource class and the URL of the data as usual, and then pulls the data in via this mechanism.