This repository was archived by the owner on Dec 4, 2019. It is now read-only.

Commit 51c8ba1

Various README updates: remove outdated info, or info that was moved elsewhere (#91)

1 parent: 2171519

3 files changed: 20 additions, 123 deletions

README.md

Lines changed: 9 additions & 25 deletions
@@ -1,33 +1,26 @@
 # Scikit-learn integration package for Apache Spark
 
-This package contains some tools to integrate the [Spark computing framework](http://spark.apache.org/) with the popular [scikit-learn machine library](http://scikit-learn.org/stable/). Among other tools:
-- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the [multicore implementation](https://pythonhosted.org/joblib/parallel.html) included by default in [scikit-learn](http://scikit-learn.org/stable/).
-- convert Spark's Dataframes seamlessly into numpy `ndarray`s or sparse matrices.
-- (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors.
+This package contains some tools to integrate the [Spark computing framework](https://spark.apache.org/) with the popular [scikit-learn machine library](https://scikit-learn.org/stable/). Among other tools:
+- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the [multicore implementation](https://pythonhosted.org/joblib/parallel.html) included by default in [scikit-learn](https://scikit-learn.org/stable/).
+- convert Spark's Dataframes seamlessly into numpy `ndarray`s or sparse matrices.
+- (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors.
 
 It focuses on problems that have a small amount of data and that can be run in parallel.
 - for small datasets, it distributes the search for estimator parameters (`GridSearchCV` in scikit-learn), using Spark,
 - for datasets that do not fit in memory, we recommend using the [distributed implementation in Spark MLlib](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html).
-
-> NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
-
-**Difference with the [sparkit-learn project](https://github.com/lensacom/sparkit-learn)** The sparkit-learn project aims at a comprehensive integration between Spark and scikit-learn. In particular, it adds some primitives to distribute numerical data using Spark, and it reimplements some of the most common algorithms found in scikit-learn.
-
-## License
-
-This package is released under the Apache 2.0 license. See the LICENSE file.
+This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
 
 ## Installation
 
 This package is available on PYPI:
 
     pip install spark-sklearn
 
-This project is also available as as [Spark package](http://spark-packages.org/package/databricks/spark-sklearn).
+This project is also available as as [Spark package](https://spark-packages.org/package/databricks/spark-sklearn).
 
 The developer version has the following requirements:
 - a recent release of scikit-learn. Releases 0.18.1, 0.19.0 have been tested, older versions may work too.
-- Spark >= 2.1.1. Spark may be downloaded from the [Spark official website](http://spark.apache.org/). In order to use this package, you need to use the pyspark interpreter or another Spark-compliant python interpreter. See the [Spark guide](https://spark.apache.org/docs/latest/programming-guide.html#overview) for more details.
+- Spark >= 2.1.1. Spark may be downloaded from the [Spark official website](https://spark.apache.org/). In order to use this package, you need to use the pyspark interpreter or another Spark-compliant python interpreter. See the [Spark guide](https://spark.apache.org/docs/latest/programming-guide.html#overview) for more details.
 - [nose](https://nose.readthedocs.org) (testing dependency only)
 - Pandas, if using the Pandas integration or testing. Pandas==0.18 has been tested.
 
@@ -37,7 +30,7 @@ If you want to use a developer version, you just need to make sure the `python/`
 
 __Running tests__ You can directly run tests:
 
-    cd python && ./run-tests.sh
+    cd python && ./run-tests.sh
 
 This requires the environment variable `SPARK_HOME` to point to your local copy of Spark.
 
@@ -62,13 +55,4 @@ This classifier can be used as a drop-in replacement for any scikit-learn classi
 [API documentation](http://databricks.github.io/spark-sklearn-docs) is currently hosted on Github pages. To
 build the docs yourself, see the instructions in [docs/README.md](https://github.com/databricks/spark-sklearn/tree/master/docs).
 
-## Changelog
-
-- 2015-12-10 First public release (0.1)
-- 2016-08-16 Minor release (0.2.0):
-  1. the official Spark target is Spark 2.0
-  2. support for keyed models
-- 2017-09-20 Minor release (0.2.2):
-  1. The official Spark target is Spark >= 2.1
-- 2017-09-29 Minor release (0.2.3):
-  1. Fixes spark-package build of spark-sklearn.
+[![Build Status](https://travis-ci.org/databricks/spark-sklearn.svg?branch=master)](https://travis-ci.org/databricks/spark-sklearn)
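For orientation, the distributed grid search this README describes (and that the third hunk's context calls a drop-in replacement for any scikit-learn classifier) is used roughly as follows. This is a minimal sketch, not part of the commit; it assumes a running SparkContext `sc` (e.g. from the pyspark shell) and that `spark_sklearn.GridSearchCV` takes the context as its first argument:

    # Minimal sketch of the distributed grid search described in the README.
    # Assumes a live SparkContext `sc` and the spark_sklearn.GridSearchCV
    # signature that accepts the context as its first argument.
    from sklearn import datasets, svm
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

    # Each parameter combination is evaluated as a separate Spark task;
    # the search itself is distributed, the individual SVC fits are not.
    clf = GridSearchCV(sc, svm.SVC(), param_grid)
    clf.fit(iris.data, iris.target)
    print(clf.best_params_)

Because the object mirrors the scikit-learn estimator interface, it can stand in wherever scikit-learn's own `GridSearchCV` would be used.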

docs/README.md

Lines changed: 10 additions & 22 deletions
@@ -1,40 +1,28 @@
-Welcome to the spark-sklearn Spark Package documentation!
+# Generating the Documentation HTML
 
-This readme will walk you through navigating and building the spark-sklearn documentation, which is
-included here with the source code.
+## Installing Dependencies
 
-## Generating the Documentation HTML
-
-### Installing Dependencies
-
-The spark-sklearn documentation is built with [Jekyll](http://jekyllrb.com), which
+The spark-sklearn documentation is built with [Jekyll](https://jekyllrb.com), which
 can be installed as follows:
 
-    $ sudo gem install jekyll
-    $ sudo gem install jekyll-redirect-from
+    sudo gem install jekyll
 
 On macOS, with the default Ruby, please install Jekyll with Bundler as
-[instructed on offical website](https://jekyllrb.com/docs/quickstart/).
+[instructed on official website](https://jekyllrb.com/docs/quickstart/).
 Otherwise the build script might fail to resolve dependencies.
 
-    $ sudo gem install jekyll bundler
-    $ sudo gem install jekyll-redirect-from
+    sudo gem install jekyll bundler
 
 Install the python dependencies necessary for building the docs via (from project root):
 
-    $ pip install -r python/requirements-docs.txt
+    pip install -r python/requirements-docs.txt
 
-### Building the Docs
+## Building the Docs
 
 Execute `jekyll build` from the `docs/` directory to compile the site.
 When you run `jekyll build`, it will build (using Sphinx) the Python API
 docs, copying them into the `docs` directory (and then also into the `_site` directory).
 
-To serve the docs locally, run:
-
-    # Serve content locally on port 4000
-    $ jekyll serve --watch
+To serve the docs locally on port 4000, run:
 
-Note that `SPARK_HOME` must be set to your local Spark installation in order to generate the docs.
-To manually point to a specific `Spark` installation,
-    $ SPARK_HOME=<your-path-to-spark-home> PRODUCTION=1 jekyll build
+    SPARK_HOME=<your-path-to-spark-home> jekyll serve --watch

python/README.md

Lines changed: 0 additions & 76 deletions
This file was deleted.

python/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../README.md

The recreated file consists of the single line `../README.md`; on GitHub this delete-and-recreate shape typically means the file was turned into a symlink, so python/README.md now simply points at the top-level README — consistent with the commit message's "info that was moved elsewhere".
