Provides an embarrassingly parallel tool with an SQL backend, inspired by `BluePyMM <https://github.com/BlueBrain/BluePyMM>`_.

Installation
------------

This package should be installed using pip:

.. code-block:: bash

    pip install bluepyparallel


Usage
-----

General computation
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from bluepyparallel import init_parallel_factory

    factory_name = "multiprocessing"  # Can also be None, dask or ipyparallel
    batch_size = 10  # This value is used to split the data into batches before processing them
    chunk_size = 1000  # This value is used to gather the elements to process before sending them to the workers

    # Set up the parallel factory
    parallel_factory = init_parallel_factory(
        factory_name,
        batch_size=batch_size,
        chunk_size=chunk_size,
        processes=4,  # This parameter is specific to the multiprocessing factory
    )

    # Get the mapper from the factory
    mapper = parallel_factory.get_mapper()

    # Use the mapper to map the given function to each element of mapped_data and gather the results
    result = sorted(mapper(function, mapped_data, *function_args, **function_kwargs))

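
Here, ``function``, ``mapped_data``, ``function_args`` and ``function_kwargs`` are placeholders from the snippet above. As a purely illustrative sketch following the call shown there, they could look like the following (the ``scale`` parameter and the data are hypothetical):

.. code-block:: python

    def function(element, scale=1):
        """Hypothetical per-element computation mapped over the data."""
        return element * scale

    mapped_data = range(10)  # Any iterable of elements to process

    # Each element of mapped_data is passed to `function` along with the extra arguments
    result = sorted(mapper(function, mapped_data, scale=2))
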

Working with Pandas and SQL backend
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This library provides a specific function to work with large :class:`pandas.DataFrame` objects: :func:`bluepyparallel.evaluator.evaluate`.
This function converts the DataFrame into a list of dicts (one for each row), maps a given function to each element and finally gathers the results.
As it is aimed at time-consuming functions, it also provides a checkpoint and resume mechanism using a SQL backend.
The SQL backend uses the `SQLAlchemy <https://docs.sqlalchemy.org/en/latest>`_ library, so it can work with a large variety of database types (SQLite, PostgreSQL, MySQL, ...).
To activate this feature, just pass a `URL that can be processed by SQLAlchemy <https://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=url#database-urls>`_ to the ``db_url`` parameter of :func:`bluepyparallel.evaluator.evaluate`.

.. note:: A specific driver might have to be installed to access the database (like `psycopg2 <https://www.psycopg.org/docs/>`_ for PostgreSQL for example).

Example:

.. code-block:: python

    from bluepyparallel.evaluator import evaluate

    # Use the mapper to map the given function to each element of the DataFrame
    result_df = evaluate(
        input_df,  # This is the DataFrame to process
        evaluation_function,  # This is the function that should be applied to each row of the DataFrame
        parallel_factory="multiprocessing",  # This could also be a Factory previously defined
        db_url="sqlite:///db.sql",  # This could also just be "db.sql" and would automatically be turned into a SQLite URL
    )

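
Note that ``input_df`` and ``evaluation_function`` are not defined above. Based on the description of :func:`bluepyparallel.evaluator.evaluate` (each row is converted to a dict before being passed to the function), a minimal sketch could look like the following; the column names and the shape of the returned value are assumptions, not the library's documented contract:

.. code-block:: python

    import pandas as pd

    input_df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})  # Hypothetical input data

    def evaluation_function(row):
        """Hypothetical evaluation applied to one row (received as a dict)."""
        return {"squared": row["value"] ** 2}
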

Now, if the computation crashes for any reason, the partial results are stored in the ``db.sql`` file.
If the crash was due to an external cause (so that executing the code again should work), it is possible to resume the
computation from the last computed element. Thus, only the missing elements are computed, which can save a lot of time.
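
If such a resume is needed, a sketch could look like the call below; the ``resume`` keyword argument is an assumption here and should be checked against the actual signature of :func:`bluepyparallel.evaluator.evaluate`:

.. code-block:: python

    # Re-run the same evaluation, pointing to the same database so that
    # rows already stored in db.sql are reused instead of being recomputed.
    result_df = evaluate(
        input_df,
        evaluation_function,
        parallel_factory="multiprocessing",
        db_url="sqlite:///db.sql",
        resume=True,  # Hypothetical flag enabling the checkpoint/resume mechanism
    )
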

Running using Dask
------------------

This is an example of a `sbatch <https://slurm.schedmd.com/sbatch.html>`_ script that can be adapted to execute the script using multiple nodes and workers.
In this example, the code called by the ``<command>`` should be parallelized using BluePyParallel.

Dask variables are not strictly required, but they are highly recommended and can be fine-tuned.

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --nodes=2                # Number of nodes
    #SBATCH --time=24:00:00          # Time limit
    #SBATCH --partition=prod         # Submit to the production partition
    #SBATCH --constraint=cpu         # Constrain the job to CPU nodes (use "nvme" for nodes with SSDs, "knl" for KNL nodes)
    #SBATCH --exclusive              # Only if you need to allocate a whole node
    #SBATCH --mem=0
    #SBATCH --ntasks-per-node=72     # Number of MPI ranks to use per node
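
    # The lines below are an illustrative sketch, not part of the original script:
    # they show how Dask behaviour could be tuned through its standard configuration
    # environment variables before launching the parallelized code.
    export DASK_DISTRIBUTED__WORKER__USE_FILE_LOCKING=False
    export DASK_DISTRIBUTED__WORKER__MEMORY__TARGET=False  # Don't spill to disk
    export DASK_DISTRIBUTED__WORKER__MEMORY__SPILL=False   # Don't spill to disk
    export DASK_TEMPORARY_DIRECTORY=$TMPDIR                # Where Dask writes its temporary files

    # Launch the code parallelized with BluePyParallel
    srun <command>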