Spark DataSource support

Not so long ago I discovered a nifty Spark feature: the Data Source API. [This article on Hackernoon](https://hackernoon.com/extending-our-spark-sql-query-engine-5f4a088de986) describes it.

Basically, you create a class called `DefaultSource` that mixes in `RelationProvider` and `SchemaRelationProvider`. Its `createRelation` methods return a `BaseRelation` that also mixes in `TableScan`. The relation specifies a Spark schema and a `buildScan` method that returns an `RDD[Row]` matching that schema, which is automagically converted to a `DataFrame` when you do something like:

```scala
val df = spark.
  read.
  format("com.example.foo.bar").
  load("hdfs://path/to/my/data")
```

where the `DefaultSource` class resides in the package `com.example.foo.bar`.
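For illustration, here is a minimal sketch of what such a `DefaultSource` could look like. It is not our actual reader: the `MeasurementRelation` class, the two-column schema, and the `sensor;value` text layout are all made up for this example. Only the Spark traits and the `path` option (which `load(...)` passes along) come from Spark itself.

```scala
package com.example.foo.bar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider, TableScan}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package passed to format(...).
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the user does not supply a schema: fall back to a default one.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, MeasurementRelation.defaultSchema)

  // Called when the user supplies a schema via .schema(...).
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    // load("...") hands the location over as the "path" option.
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new MeasurementRelation(path, schema)(sqlContext)
  }
}

// Hypothetical companion holding the assumed default schema.
object MeasurementRelation {
  val defaultSchema: StructType = StructType(Seq(
    StructField("sensor", StringType, nullable = false),
    StructField("value", DoubleType, nullable = false)))
}

// Hypothetical relation: reads text lines of the assumed form "sensor;value".
class MeasurementRelation(path: String, userSchema: StructType)(
    @transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan with Serializable {

  override def schema: StructType = userSchema

  // The parser below always emits (String, Double) rows, so a user-supplied
  // schema has to describe exactly those two columns.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map { line =>
      // Fails with a MatchError on lines that lack the ';' separator.
      val Array(sensor, value) = line.split(";", 2)
      Row(sensor, value.toDouble)
    }
}
```

A real reader would replace the body of `buildScan` with the format-specific parsing logic; everything else is just wiring.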

With this, I hooked up all the reading logic for our special data formats (binary or text-based measurement data that is not always readable with the default CSV data source).

It would be really nice to have a data source in Seahorse where you can specify the package of the `DefaultSource` class and the URL of the data as usual, with the data then pulled in via this mechanism.

Spark DataSource support #99

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Spark DataSource support #99

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions