This repository was archived by the owner on Dec 4, 2019. It is now read-only.

Commit b21cf10

Convert README to .rst for better rendering in Python docs (#93)

1 parent 51c8ba1

File tree

7 files changed: +85 -62 lines changed

README.md

Lines changed: 0 additions & 58 deletions
This file was deleted.

README.rst

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@

Scikit-learn integration package for Apache Spark
=================================================

This package contains some tools to integrate the `Spark computing framework <https://spark.apache.org/>`_
with the popular `scikit-learn machine learning library <https://scikit-learn.org/stable/>`_. Among other things, it can:

- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the
  `multicore implementation <https://pythonhosted.org/joblib/parallel.html>`_ included by default in ``scikit-learn``
- convert Spark's DataFrames seamlessly into numpy ``ndarray`` or sparse matrices
- (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors

It focuses on problems that have a small amount of data and that can be run in parallel.
For small datasets, it distributes the search for estimator parameters (``GridSearchCV`` in scikit-learn)
using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in
`Spark MLlib <https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html>`_.

This package distributes simple tasks like grid-search cross-validation.
It does not distribute individual learning algorithms (unlike Spark MLlib).

Installation
------------

This package is available on PyPI:

::

    pip install spark-sklearn

This project is also available as a `Spark package <https://spark-packages.org/package/databricks/spark-sklearn>`_.

The developer version has the following requirements:

- a recent release of scikit-learn. Releases 0.18.1 and 0.19.0 have been tested; older versions may work too.
- Spark >= 2.1.1. Spark may be downloaded from the `Spark website <https://spark.apache.org/>`_.
  In order to use this package, you need to use the pyspark interpreter or another Spark-compliant Python
  interpreter. See the `Spark guide <https://spark.apache.org/docs/latest/programming-guide.html#overview>`_
  for more details.
- `nose <https://nose.readthedocs.org>`_ (testing dependency only)
- pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the ``python/`` subdirectory is on the
``PYTHONPATH`` when launching the pyspark interpreter:

::

    PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

You can run the tests directly:

::

    cd python && ./run-tests.sh

This requires the environment variable ``SPARK_HOME`` to point to your local copy of Spark.

Example
-------

Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section
for how to install the package.

.. code:: python

    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV
    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC()
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.

Documentation
-------------

`API documentation <http://databricks.github.io/spark-sklearn-docs>`_ is currently hosted on GitHub Pages. To
build the docs yourself, see the instructions in ``docs/``.

.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master
    :target: https://travis-ci.org/databricks/spark-sklearn
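The README's example hands the parameter grid to ``GridSearchCV``, which evaluates each parameter setting as an independent Spark task. As a rough, stdlib-only sketch of what "searching the parameter grid" means (a toy scorer stands in for actually fitting an estimator; this is not the library's implementation):

```python
from itertools import product

def param_grid(parameters):
    """Expand a scikit-learn style parameter grid into concrete settings."""
    keys = sorted(parameters)
    for values in product(*(parameters[k] for k in keys)):
        yield dict(zip(keys, values))

# Toy stand-in for fitting and cross-validating an estimator; a real run
# would train a model per setting, which is the work Spark distributes.
def score(setting):
    return setting["C"] - (1 if setting["kernel"] == "rbf" else 0)

parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
settings = list(param_grid(parameters))
best = max(settings, key=score)
```

Because each setting is scored independently, the loop parallelizes trivially, which is why a small dataset with many candidate parameters is the sweet spot for this package.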

python/MANIFEST.in

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
- include README.md
+ include README.rst

python/README.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

python/README.rst

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ ../README.rst

python/setup.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
  [metadata]
- description-file = README.md
+ description-file = README.rst

python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ def read(*parts):
      maintainer="Tim Hunter",
      maintainer_email="[email protected]",
      keywords=KEYWORDS,
-     long_description=read("README.md"),
+     long_description=read("README.rst"),
      packages=PACKAGES,
      classifiers=CLASSIFIERS,
      zip_safe=False,
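The ``setup.py`` hunk swaps ``README.md`` for ``README.rst`` in ``long_description=read(...)``. The ``read`` helper appears only by name in the hunk header; a minimal sketch of the common pattern such helpers follow (the exact body in this repo may differ, and the ``base`` parameter here is an addition for testability):

```python
import codecs
import os

def read(*parts, base=None):
    """Return the text of a file addressed relative to setup.py's directory.

    `base` defaults to the directory containing this file; passing it
    explicitly makes the helper usable outside a setup.py context.
    """
    if base is None:
        base = os.path.abspath(os.path.dirname(__file__))
    with codecs.open(os.path.join(base, *parts), encoding="utf-8") as f:
        return f.read()
```

Reading the file at build time (rather than hard-coding the text) means the commit only has to rename the file argument, and PyPI renders the new RST long description automatically.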
