Commit 7731a4c

Add a section in documentation describing code profiling
1 parent 43c341c commit 7731a4c

1 file changed: docs/contribute/contribute_code.rst (+70 additions, -0 deletions)

@@ -449,6 +449,76 @@ you want to know why we prefer tox, this
will tell you everything ;)


Code Profiling
--------------

If you want to profile your code, you can use the **profiling** module in the root directory. There you will find two
files, `profiling.py` and `profiling.sh`. Both files do the same thing, but in different ways. The `profiling.py` file
is a Python script containing a function that must be used as a decorator on the class method we want to profile.
The `profiling.sh` file is a bash/zsh script that you can run from the command line to profile a whole .py script file.
Let us see how to use them, starting with the `profiling.py` file.

I suspect that the `DropDuplicateFeatures` class takes more time than other classes, as it iterates over the columns
and checks whether they are duplicated. So, I will profile the `DropDuplicateFeatures` class.

First, I will find where this class resides and, at the top of the file with the other imports, add the following line::

    from profiling.profiling import profile_function

Now, I will decorate the `DropDuplicateFeatures.fit` method with the `profile_function` function::

    @profile_function(output_file="profile.html")
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        ...

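The real decorator lives in `profiling/profiling.py`, and since it writes an HTML report it presumably wraps a
dedicated profiler. As a rough illustration of the decorator pattern only, here is a minimal stand-in built on the
standard library's `cProfile` that writes plain-text stats instead; the names `square_sum` and `profile.txt` are
made up for this sketch and are not part of the project:

```python
import cProfile
import io
import pstats
from functools import wraps


def profile_function(output_file="profile.txt"):
    """Return a decorator that profiles the wrapped callable with cProfile
    and writes the collected statistics to output_file."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                # Render the stats sorted by cumulative time and dump to a file.
                stream = io.StringIO()
                pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
                with open(output_file, "w") as fh:
                    fh.write(stream.getvalue())
        return wrapper
    return decorator


# Hypothetical function to profile, standing in for DropDuplicateFeatures.fit.
@profile_function(output_file="profile.txt")
def square_sum(n):
    return sum(i * i for i in range(n))


result = square_sum(1000)
```

Calling `square_sum` once is enough to produce the report; the decorator returns the original result unchanged.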
The next step is to create a temporary .py file that will contain the code that we want to profile.

For example, I will create a file named `temp.py` and copy in the following code::

    import pandas as pd
    import numpy as np

    from feature_engine.selection import DropDuplicateFeatures


    if __name__ == "__main__":
        rows = 10000
        cols = 60000
        col_names = [f"col_{i}" for i in range(cols)]
        df = pd.DataFrame(np.random.randint(0, 100, size=(rows, cols)), columns=col_names)

        transformer = DropDuplicateFeatures()
        transformer.fit(df)

        train_t = transformer.transform(df)

Now, I will run the `temp.py` file from the command line::

    $ python temp.py

This will create a file named `profile.html` in the root directory of the project. This file contains the profiling
results. You can open it with your favorite browser and inspect the results.

If you prefer not to add extra imports and decorators, you can use the `profiling.sh` file instead. This file is a
bash/zsh script that you can run from the command line. Let us see how to use it.

Again, I will profile the `DropDuplicateFeatures` class. I need to create a temporary .py file with the same code as
above. After that, open a terminal in the root directory and run the following command::

    $ ./profiling/profiling.sh temp.py

This will create a directory named `profiles` in the root directory of the project. This directory contains two files:
the first is an .html file that you can open with any browser; the second is a .json file that you can load into
`speedscope <https://www.speedscope.app/>`_ to visualize the results.

.. note::
    To profile memory usage, you can use the `memray` package. You can find more information about it
    `here <https://bloomberg.github.io/memray/index.html>`_.

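As a quick pointer beyond the linked documentation, a typical memray session from the command line looks roughly
like the following. This assumes memray is installed (``pip install memray``); the file names are illustrative only::

```shell
# Run the script under memray, recording allocations to a binary capture file
memray run -o memray_output.bin temp.py

# Turn the capture into an HTML flame graph you can open in a browser
memray flamegraph memray_output.bin
```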
Review Process
--------------
