# Scikit-learn integration package for Apache Spark

This package contains some tools to integrate the [Spark computing framework](https://spark.apache.org/) with the popular [scikit-learn machine learning library](https://scikit-learn.org/stable/). Among other things, it can:

- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the [multicore implementation](https://pythonhosted.org/joblib/parallel.html) included by default in [scikit-learn](https://scikit-learn.org/stable/).
- convert Spark's DataFrames seamlessly into numpy `ndarray`s or sparse matrices (see the sketch after this list).
- (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors.
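
As a rough illustration of the DataFrame conversion idea, the following sketch uses plain pyspark, numpy, and scipy rather than this package's own helpers, and assumes a DataFrame small enough to collect on the driver:

```python
import numpy as np
from scipy.sparse import csr_matrix
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 0.0), (0.0, 2.0)], ["x", "y"])

# Collect the (small) DataFrame to the driver and stack its rows into a
# numpy ndarray; pyspark Rows behave like tuples, so np.array handles them.
dense = np.array(df.collect())

# The same data as a scipy sparse matrix, for estimators that accept
# sparse input.
sparse = csr_matrix(dense)
```
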

It focuses on problems that have a small amount of data and that can be run in parallel:

- for small datasets, it distributes the search for estimator parameters (`GridSearchCV` in scikit-learn) using Spark (see the example below);
- for datasets that do not fit in memory, we recommend using the [distributed implementation in Spark MLlib](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html).

> NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
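
Here is a minimal usage sketch of the distributed grid search. It assumes a local Spark installation and that this package's `GridSearchCV` mirrors scikit-learn's interface while taking the `SparkContext` as its first argument:

```python
from pyspark import SparkContext
from sklearn import datasets, svm
from spark_sklearn import GridSearchCV

# In the pyspark shell a SparkContext named `sc` already exists;
# otherwise, create (or reuse) one here.
sc = SparkContext.getOrCreate()

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Candidate parameter settings are evaluated in parallel as Spark tasks,
# while each individual model is still trained by scikit-learn.
clf = GridSearchCV(sc, svm.SVC(gamma='auto'), parameters)
clf.fit(iris.data, iris.target)
```

The fitted object is meant to behave like scikit-learn's own `GridSearchCV`, so attributes such as `best_params_` should be available after fitting.
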
**Difference with the [sparkit-learn project](https://github.com/lensacom/sparkit-learn):** The sparkit-learn project aims at a comprehensive integration between Spark and scikit-learn. In particular, it adds some primitives to distribute numerical data using Spark, and it reimplements some of the most common algorithms found in scikit-learn.
## License
This package is released under the Apache 2.0 license. See the LICENSE file.
## Installation
This package is available on PyPI:

    pip install spark-sklearn

This project is also available as a [Spark package](https://spark-packages.org/package/databricks/spark-sklearn).

The developer version has the following requirements:
- a recent release of scikit-learn. Releases 0.18.1 and 0.19.0 have been tested; older versions may work too.
- Spark >= 2.1.1. Spark may be downloaded from the [Spark official website](https://spark.apache.org/). To use this package, you need to use the pyspark interpreter or another Spark-compliant Python interpreter. See the [Spark guide](https://spark.apache.org/docs/latest/programming-guide.html#overview) for more details.