| 1 | +Scikit-learn integration package for Apache Spark |
| 2 | +================================================= |
| 3 | + |
| 4 | +This package contains some tools to integrate the `Spark computing framework <https://spark.apache.org/>`_ |
| 5 | +with the popular `scikit-learn machine learning library <https://scikit-learn.org/stable/>`_. Among other things, it can: |
| 6 | + |
| 7 | +- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the |
| 8 | + `multicore implementation <https://pythonhosted.org/joblib/parallel.html>`_ included by default in ``scikit-learn`` |
| 9 | +- convert Spark's DataFrames seamlessly into NumPy ``ndarray`` or sparse matrices |
| 10 | +- (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors |
| 11 | + |
| 12 | +It focuses on problems that have a small amount of data and that can be run in parallel. |
| 13 | +For small datasets, it distributes the search for estimator parameters (``GridSearchCV`` in scikit-learn), |
| 14 | +using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in |
| 15 | +`Spark MLlib <https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html>`_. |
| 16 | + |
| 17 | +This package distributes simple tasks like grid-search cross-validation. |
| 18 | +It does not distribute individual learning algorithms (unlike Spark MLlib). |
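To make the distinction concrete, here is a minimal sketch of the kind of task that is distributed: every parameter combination in a grid is an independent job, so the candidates can be scored in parallel while each individual fit stays on a single worker. This is not the package's implementation. It uses only the Python standard library, a thread pool stands in for the Spark cluster, and ``expand_grid``, ``toy_score`` and ``parallel_grid_search`` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def toy_score(params):
    """Hypothetical stand-in for cross-validating one candidate model."""
    kernel_bonus = 1.0 if params["kernel"] == "rbf" else 0.0
    return kernel_bonus - abs(params["C"] - 10) / 10.0


def parallel_grid_search(param_grid, score_fn, max_workers=4):
    """Score every candidate independently and return the best one."""
    candidates = list(expand_grid(param_grid))
    # Each candidate is a separate task; a thread pool plays the role
    # that the Spark executors would play in a real deployment.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


best_params, best_score = parallel_grid_search(
    {"kernel": ("linear", "rbf"), "C": [1, 10]}, toy_score
)
```

The grid here has four candidates, and none of them needs to know about the others, which is why this workload parallelizes well on small datasets.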
| 19 | + |
| 20 | +Installation |
| 21 | +------------ |
| 22 | + |
| 23 | +This package is available on PyPI: |
| 24 | + |
| 25 | +:: |
| 26 | + |
| 27 | + pip install spark-sklearn |
| 28 | + |
| 29 | +This project is also available as a `Spark package <https://spark-packages.org/package/databricks/spark-sklearn>`_. |
| 30 | + |
| 31 | +The developer version has the following requirements: |
| 32 | + |
| 33 | +- a recent release of scikit-learn. Releases 0.18.1 and 0.19.0 have been tested; older versions may work too. |
| 34 | +- Spark >= 2.1.1. Spark may be downloaded from the `Spark website <https://spark.apache.org/>`_. |
| 35 | + To use this package, you need the pyspark interpreter or another Spark-compliant Python |
| 36 | + interpreter. See the `Spark guide <https://spark.apache.org/docs/latest/programming-guide.html#overview>`_ |
| 37 | + for more details. |
| 38 | +- `nose <https://nose.readthedocs.org>`_ (testing dependency only) |
| 39 | +- pandas, if using the pandas integration or testing. pandas==0.18 has been tested. |
| 40 | + |
| 41 | +If you want to use a developer version, you just need to make sure the ``python/`` subdirectory is in the |
| 42 | +``PYTHONPATH`` when launching the pyspark interpreter: |
| 43 | + |
| 44 | +:: |
| 45 | + |
| 46 | + PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark |
| 47 | + |
| 48 | +You can directly run tests: |
| 49 | + |
| 50 | +:: |
| 51 | + |
| 52 | + cd python && ./run-tests.sh |
| 53 | + |
| 54 | +This requires the environment variable ``SPARK_HOME`` to point to your local copy of Spark. |
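Before running the tests, it can help to check that ``SPARK_HOME`` is set correctly. The following is a minimal sketch, assuming the standard Spark layout in which the launcher lives at ``bin/pyspark`` under ``SPARK_HOME``; ``find_pyspark`` is a hypothetical helper and is not part of this package.

```python
import os


def find_pyspark(environ=None):
    """Return the path to bin/pyspark under SPARK_HOME, or raise an error."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME is not set")
    # Assumption: the launcher sits at $SPARK_HOME/bin/pyspark.
    launcher = os.path.join(spark_home, "bin", "pyspark")
    if not os.path.isfile(launcher):
        raise EnvironmentError("no pyspark launcher at " + launcher)
    return launcher
```

Calling ``find_pyspark()`` with no arguments reads the current process environment; passing a dictionary makes the check easy to exercise without touching the real environment.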
| 55 | + |
| 56 | +Example |
| 57 | +------- |
| 58 | + |
| 59 | +Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section |
| 60 | +on how to install the package. |
| 61 | + |
| 62 | +.. code:: python |
| 63 | + |
| 64 | + from sklearn import svm, datasets |
| 65 | + from spark_sklearn import GridSearchCV |
| 66 | + iris = datasets.load_iris() |
| 67 | + parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]} |
| 68 | + svr = svm.SVC() |
| 69 | + clf = GridSearchCV(sc, svr, parameters)  # sc: the SparkContext provided by the pyspark interpreter |
| 70 | + clf.fit(iris.data, iris.target) |
| 71 | + |
| 72 | +This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API. |
| 73 | + |
| 74 | +Documentation |
| 75 | +------------- |
| 76 | + |
| 77 | +`API documentation <http://databricks.github.io/spark-sklearn-docs>`_ is currently hosted on GitHub Pages. To |
| 78 | +build the docs yourself, see the instructions in ``docs/``. |
| 79 | + |
| 80 | +.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master |
| 81 | + :target: https://travis-ci.org/databricks/spark-sklearn |