For the examples above, parallel operations run approximately 4x faster than the standard operations (except for `series.map`, which runs only about 3.2x faster).
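
For reference, here is a minimal sketch of how such a parallel operation is invoked (assuming `pandarallel` is installed; the sample `DataFrame` and the `work` function are illustrative placeholders):

```python
import math

import pandas as pd
from pandarallel import pandarallel

# Start the worker pool (one worker per core by default).
pandarallel.initialize(progress_bar=True)

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

def work(row):
    # Stand-in for a CPU-bound, per-row computation.
    return math.sqrt(row.a ** 2 + row.b ** 2)

standard = df.apply(work, axis=1)           # plain pandas: one core
parallel = df.parallel_apply(work, axis=1)  # pandarallel: all cores
```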

## When should I use `pandas`, `pandarallel` or `pyspark`?

According to the [`pandas` documentation](https://pandas.pydata.org/):

> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and
> manipulation tool, built on top of the Python programming language.

The main drawback of `pandas` is that it uses only one core of your computer, even if
multiple cores are available.

`pandarallel` gets around this limitation by using all cores of your computer.
But, in return, `pandarallel` needs twice the memory that a standard `pandas` operation
would normally use.
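
By default, `pandarallel` starts one worker per core. If you want to keep some cores
free, the number of workers can be capped at initialization (a minimal sketch; the
value `4` is arbitrary):

```python
from pandarallel import pandarallel

# Cap the pool at 4 worker processes instead of one per core.
pandarallel.initialize(nb_workers=4)
```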
89
+
90
+
==> `pandarallel` should **NOT** be used if your data cannot fit into memory with
91
+
`pandas` itself. In such a case, `spark` (and its `python` layer `pyspark`)
92
+
will be suitable.

The main drawback of `spark` is that its APIs are less convenient to use than
`pandas` APIs (even if this is improving), and you also need a JVM (Java Virtual
Machine) on your computer.

However, with `spark` you can:

- Handle data much bigger than your memory
- Distribute your computation over multiple nodes, using a `spark` cluster.
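
A minimal `pyspark` sketch of both points (assuming `pyspark` is installed; the file
path and column name are illustrative placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The session coordinates the work; on a cluster it would be configured
# with a master URL instead of running locally.
spark = SparkSession.builder.appName("example").getOrCreate()

# The CSV is read lazily and processed in partitions, so it can be far
# larger than the memory of any single machine.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Aggregations are distributed over the available executors.
df.groupBy("country").agg(F.count("*").alias("n_events")).show()
```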