
Commit 9dde38a

Update doc, README and author
Change-Id: Ia625406e1c9acaa0601076c9707fc9d5a8e90114
1 parent 592efdf commit 9dde38a

7 files changed: 119 additions & 68 deletions


README.rst

Lines changed: 78 additions & 46 deletions
@@ -1,41 +1,91 @@
-# BluePyParallel: Bluebrain Python Embarassingly Parallel library
+BluePyParallel: Bluebrain Python Embarassingly Parallel library
+===============================================================
 
 
 Introduction
-============
+------------
+
+Provides an embarassingly parallel tool with sql backend, inspired by `BluePyMM <https://github.com/BlueBrain/BluePyMM>`_.
+
+
+Installation
+------------
+
+This package should be installed using pip:
+
+.. code-block:: bash
+
+    pip install bluepyparallel
+
+
+Usage
+-----
+
+General computation
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    factory_name = "multiprocessing"  # Can also be None, dask or ipyparallel
+    batch_size = 10  # This value is used to split the data into batches before processing them
+    chunk_size = 1000  # This value is used to gather the elements to process before sending them to the workers
+
+    # Setup the parallel factory
+    parallel_factory = init_parallel_factory(
+        factory_name,
+        batch_size=batch_size,
+        chunk_size=chunk_size,
+        processes=4,  # This parameter is specific to the multiprocessing factory
+    )
+
+    # Get the mapper from the factory
+    mapper = parallel_factory.get_mapper()
+
+    # Use the mapper to map the given function to each element of mapped_data and gather the results
+    result = sorted(mapper(function, mapped_data, *function_args, **function_kwargs))
+
+
+Working with Pandas and SQL backend
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This library provides a specific function working with large :class:`pandas.DataFrame`: :func:`bluepyparallel.evaluator.evaluate`.
+This function converts the DataFrame into a list of dicts (one for each row), then maps a given function to each element and finally gathers the results.
+As it aims at working with time-consuming functions, it also provides a checkpoint and resume mechanism using a SQL backend.
+The SQL backend uses the `SQLAlchemy <https://docs.sqlalchemy.org/en/latest>`_ library, so it can work with a large variety of database types (like SQLite, PostgreSQL, MySQL, ...).
+To activate this feature, just pass a `URL that can be processed by SQLAlchemy <https://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=url#database-urls>`_ to the ``db_url`` parameter of :func:`bluepyparallel.evaluator.evaluate`.
+
+.. note:: A specific driver might have to be installed to access the database (like `psycopg2 <https://www.psycopg.org/docs/>`_ for PostgreSQL, for example).
+
+Example:
+
+.. code-block:: python
+
+    # Use the mapper to map the given function to each element of the DataFrame
+    result_df = evaluate(
+        input_df,  # This is the DataFrame to process
+        evaluation_function,  # This is the function that should be applied to each row of the DataFrame
+        parallel_factory="multiprocessing",  # This could also be a Factory previously defined
+        db_url="sqlite:///db.sql",  # This could also just be "db.sql" and would be automatically turned into a SQLite URL
+    )
+
+Now, if the computation crashes for any reason, the partial result is stored in the ``db.sql`` file.
+If the crash was due to an external cause (therefore executing the code again should work), it is possible to resume the
+computation from the last computed element. Thus, only the missing elements are computed, which can save a lot of time.
 
-Provides an embarassingly parallel tool with sql backend.
 
 Running using Dask
-==================
+------------------
 
-This is an example of a sbatch script that can be adapted to execute the script using multiple nodes and workers.
+This is an example of a `sbatch <https://slurm.schedmd.com/sbatch.html>`_ script that can be adapted to execute the script using multiple nodes and workers.
+In this example, the code called by the ``<command>`` should be parallelized using BluePyParallel.
 
 Dask variables are not strictly required, but highly recommended, and they can be fine tuned.
 
 
 .. code:: bash
 
     #!/bin/bash -l
-    #SBATCH --nodes=2              # Number of nodes
-    #SBATCH --time=24:00:00        # Time limit
-    #SBATCH --partition=prod       # Submit to the production 'partition'
-    #SBATCH --constraint=cpu       # Constraint the job to run on nodes with/without SSDs. If you want SSD, use only "nvme". If you want KNLs then "knl"
-    #SBATCH --exclusive            # only if you need to allocate whole node
-    #SBATCH --mem=0
-    #SBATCH --ntasks-per-node=72   # no of mpi ranks to use per node
-    #SBATCH --account=projXX       # your project number
-    #SBATCH --job-name=myscript
-    #SBATCH --output=myscript_out_%j
-    #SBATCH --error=myscript_err_%j
-    set -e
-
-    module purge
-    module load unstable hpe-mpi
-    module unload unstable
-
-    unset PMI_RANK # for neuron
-
+
     # Dask configuration
     export DASK_DISTRIBUTED__LOGGING__DISTRIBUTED="info"
     export DASK_DISTRIBUTED__WORKER__USE_FILE_LOCKING=False
@@ -48,27 +98,9 @@ Dask variables are not strictly required, but highly recommended, and they can be fine tuned.
     # Reduce dask profile memory usage/leak (see https://github.com/dask/distributed/issues/4091)
     export DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms  # Time between statistical profiling queries
     export DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms  # Time between starting new profile
 
     # Split tasks to avoid some dask errors (e.g. Event loop was unresponsive in Worker)
     export PARALLEL_BATCH_SIZE=1000
-
-    # Script parameters
-    OUTPUT="/path/to/mecombo_emodel.tsv"
-    CIRCUIT_CONFIG="/gpfs/bbp.cscs.ch/project/proj68/circuits/Isocortex/20190307/CircuitConfig"
-    MORPHOLOGY_PATH="/gpfs/bbp.cscs.ch/project/proj68/circuits/Isocortex/20190307/morphologies"
-    RELEASE_PATH="emodel_release"
-    N_CELLS=100
-    MTYPE="L5_TPC:A"
-
-    # load the virtual env (alternatively, load the required modules)
-    source ~/venv/3.7.4-BluePyEModel/bin/activate
-
-    srun -v \
-        BluePyEModel -v get_me_combos_parameters \
-        --circuit-config "$CIRCUIT_CONFIG" \
-        --morphology-path "$MORPHOLOGY_PATH" \
-        --release-path "$RELEASE_PATH" \
-        --output "$OUTPUT" \
-        --n-cells "$N_CELLS" \
-        --mtype "$MTYPE" \
-        --parallel-lib dask
+
+    srun -v <command>

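For readers trying out the updated README, a minimal end-to-end sketch combining the two snippets above could look as follows. This is not part of the commit: the worker function, the input data and the top-level import of ``init_parallel_factory`` are illustrative assumptions; only ``bluepyparallel.evaluator.evaluate`` is documented with its full path above.

.. code-block:: python

    import pandas as pd

    from bluepyparallel import init_parallel_factory  # assumed import path
    from bluepyparallel.evaluator import evaluate


    def square_row(row):
        # Hypothetical evaluation function: per the README, each row is passed as a dict of
        # column values, and it must return a dict whose keys match the names in new_columns.
        return {"result": float(row["x"]) ** 2}


    input_df = pd.DataFrame({"x": range(10)})

    # Build a multiprocessing factory as in the "General computation" snippet
    factory = init_parallel_factory("multiprocessing", processes=4)

    # Evaluate each row and checkpoint partial results in a SQLite database
    result_df = evaluate(
        input_df,
        square_row,
        new_columns=[["result", 0.0]],  # column name and default value (see the evaluator.py docstring below)
        parallel_factory=factory,
        db_url="sqlite:///db.sql",
    )
    print(result_df[["x", "result"]])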
bluepyparallel/database.py

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ class DataBase:
 
     Args:
         url (str): The URL of the database following the RFC-1738 format (
-            https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls)
+            https://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls)
        create (bool): If set to True, the database will be automatically created by the
            constructor.
        args and kwargs: They will be passed to the :func:`sqlalchemy.create_engine` function.

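As context for the corrected link, SQLAlchemy database URLs in this RFC-1738 form look like the following sketch (illustrative placeholder values only; a driver such as psycopg2 must be installed separately for PostgreSQL).

.. code-block:: python

    # Illustrative URLs accepted by sqlalchemy.create_engine() (placeholder values)
    sqlite_relative = "sqlite:///db.sql"                               # SQLite file relative to the working directory
    sqlite_absolute = "sqlite:////absolute/path/to/db.sql"             # SQLite file with an absolute path
    postgres = "postgresql://user:password@db-host:5432/my_database"   # requires a driver such as psycopg2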
bluepyparallel/evaluator.py

Lines changed: 11 additions & 11 deletions
@@ -40,20 +40,20 @@ def evaluate(
     """Evaluate and save results in a sqlite database on the fly and return dataframe.
 
     Args:
-        df (DataFrame): each row contains information for the computation.
-        evaluation_function (function): function used to evaluate each row,
+        df (pandas.DataFrame): each row contains information for the computation.
+        evaluation_function (callable): function used to evaluate each row,
             should have a single argument as list-like containing values of the rows of df,
             and return a dict with keys corresponding to the names in new_columns.
         new_columns (list): list of names of new column and empty value to save evaluation results,
-            i.e.: [['result', 0.0], ['valid', False]].
-        resume (bool): if True, it will use only compute the empty rows of the database,
-            if False, it will ecrase or generate the database.
-        parallel_factory (ParallelFactory): parallel factory instance.
-        db_url (str): should be DB URL that can be interpreted by SQLAlchemy or can be a file path
-            that is interpreted as a SQLite database. If an URL is given, the SQL backend will be
-            enabled to store results and allowing future resume. Should not be used when
-            evaluations are numerous and fast, in order to avoid the overhead of communication with
-            SQL database.
+            i.e.: :code:`[['result', 0.0], ['valid', False]]`.
+        resume (bool): if :obj:`True` and ``db_url`` is provided, it will only compute the
+            missing rows of the database.
+        parallel_factory (ParallelFactory or str): parallel factory name or instance.
+        db_url (str): should be a DB URL that can be interpreted by :func:`sqlalchemy.create_engine`
+            or can be a file path that is interpreted as a SQLite database. If a URL is given,
+            the SQL backend will be enabled to store results, allowing future resume. Should
+            not be used when evaluations are numerous and fast, in order to avoid the overhead of
+            communication with the SQL database.
         func_args (list): the arguments to pass to the evaluation_function.
         func_kwargs (dict): the keyword arguments to pass to the evaluation_function.
         **mapper_kwargs: the keyword arguments are passed to the get_mapper() method of the

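A short sketch of the clarified ``resume`` behaviour, reusing the hypothetical ``square_row`` function, ``input_df`` and ``db.sql`` file from the README example above (not part of the commit):

.. code-block:: python

    # Re-running the same evaluation with resume=True only computes the rows that are
    # missing from the database written by the first run.
    result_df = evaluate(
        input_df,
        square_row,
        new_columns=[["result", 0.0]],
        resume=True,                         # skip rows already stored in db.sql
        parallel_factory="multiprocessing",  # a factory name is accepted as well as an instance
        db_url="sqlite:///db.sql",
    )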
bluepyparallel/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 """Package version"""
 # pragma: no cover
-VERSION = "0.0.3"
+VERSION = "0.0.4.dev0"

doc/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -79,4 +79,5 @@
     "pandas": ("https://pandas.pydata.org/docs", None),
     "dask": ("https://docs.dask.org/en/latest/", None),
     "ipyparallel": ("https://ipyparallel.readthedocs.io/en/latest/", None),
+    "sqlalchemy": ("https://docs.sqlalchemy.org/en/latest/", None),
 }

examples/run_large_dask.sh

Lines changed: 26 additions & 8 deletions
@@ -1,12 +1,30 @@
 #!/bin/bash -l
-#SBATCH --nodes=1               # Number of nodes
-#SBATCH --time=00:10:00         # Time limit
-#SBATCH --partition=prod
-#SBATCH --constraint=cpu
-#SBATCH --mem=0
-#SBATCH --cpus-per-task=1
-#SBATCH --account=proj82        # your project number
-#SBATCH --job-name=test_bpp
+
+# SBATCH --nodes=1               # Number of nodes
+# SBATCH --time=00:10:00         # Time limit
+# SBATCH --partition=prod
+# SBATCH --constraint=cpu
+# SBATCH --mem=0
+# SBATCH --cpus-per-task=1
+# SBATCH --account=proj82        # your project number
+# SBATCH --job-name=test_bpp
+
+# # Dask configuration
+# export DASK_DISTRIBUTED__LOGGING__DISTRIBUTED="info"
+# export DASK_DISTRIBUTED__WORKER__USE_FILE_LOCKING=False
+# export DASK_DISTRIBUTED__WORKER__MEMORY__TARGET=False  # don't spill to disk
+# export DASK_DISTRIBUTED__WORKER__MEMORY__SPILL=False  # don't spill to disk
+# export DASK_DISTRIBUTED__WORKER__MEMORY__PAUSE=0.80  # pause execution at 80% memory use
+# export DASK_DISTRIBUTED__WORKER__MEMORY__TERMINATE=0.95  # restart the worker at 95% use
+# export DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=spawn
+# export DASK_DISTRIBUTED__WORKER__DAEMON=True
+# # Reduce dask profile memory usage/leak (see https://github.com/dask/distributed/issues/4091)
+# export DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms  # Time between statistical profiling queries
+# export DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms  # Time between starting new profile
+
+# # Split tasks to avoid some dask errors (e.g. Event loop was unresponsive in Worker)
+# export PARALLEL_BATCH_SIZE=1000
+
 set -e
 
 
setup.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 
 setup(
     name="BluePyParallel",
-    author="BlueBrain cells",
+    author="bbp-ou-cells",
     author_email="[email protected]",
     version=VERSION,
     description="Provides an embarassingly parallel tool with sql backend",
