
Commit 3b07d63

committed
Documentation: Add section ## When should I use pandas, pandarallel or pyspark?
1 parent 15fdc1e commit 3b07d63

File tree

1 file changed: +36 −6 lines


docs/docs/index.md

Lines changed: 36 additions & 6 deletions
@@ -45,13 +45,16 @@ On **Windows**, `pandarallel` will work only if the Python session
 (`python`, `ipython`, `jupyter notebook`, `jupyter lab`, ...) is executed from
 [Windows Subsystem for Linux (WSL)](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
 
-## Warnings
+!!! warning
 
-- Parallelization has a cost (instantiating new processes, sending data via shared memory,
-...), so parallelization is efficient only if the amount of computation to parallelize
-is high enough. For very little amount of data, using parallelization is not always
-worth it.
-- Displaying progress bars has a cost and may slighly increase computation time.
+    Parallelization has a cost (instantiating new processes, sending data via shared
+    memory, ...), so parallelization is efficient only if the amount of computation
+    to parallelize is high enough. For small amounts of data, parallelization is not
+    always worth it.
+
+!!! warning
+
+    Displaying progress bars has a cost and may slightly increase computation time.
 
 ## Examples
 
@@ -69,3 +72,30 @@ Computer used for this benchmark:
 ![Benchmark](https://github.com/nalepae/pandarallel/blob/3d470139d409fc2cf61bab085298011fefe638c0/docs/standard_vs_parallel_4_cores.png?raw=true)
 
 For those given examples, parallel operations run approximately 4x faster than the standard operations (except for `series.map` which runs only 3.2x faster).
+
+## When should I use `pandas`, `pandarallel` or `pyspark`?
+
+According to the [`pandas` documentation](https://pandas.pydata.org/):
+
+> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and
+> manipulation tool, built on top of the Python programming language.
+
+The main `pandas` drawback is that it uses only one core of your computer, even if
+multiple cores are available.
+
+`pandarallel` gets around this limitation by using all the cores of your computer.
+In return, `pandarallel` needs twice the memory that the standard `pandas` operation
+would normally use.
+
+==> `pandarallel` should **NOT** be used if your data cannot fit into memory with
+`pandas` itself. In such a case, `spark` (and its Python layer `pyspark`)
+is more suitable.
+
+The main drawback of `spark` is that its APIs are less convenient to use than the
+`pandas` APIs (even if this is improving), and that you need a JVM (Java Virtual
+Machine) on your computer.
+
+However, with `spark` you can:
+
+- Handle data much bigger than your memory
+- Distribute your computation over multiple nodes, using a `spark` cluster
