Perf: optimize postal_parse_address implementation #49

lmores wants to merge 3 commits into NickCrews:main from
Conversation
@lmores here's what I see when I run those benchmarks, with the following system specs:

============================= test session starts ==============================
platform linux -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/jimmy/Work/mismo
configfile: pyproject.toml
plugins: anyio-4.4.0, cov-5.0.0, benchmark-4.0.0
collected 24 items / 18 deselected / 6 selected
mismo/lib/geo/tests/test_postal_benchmark.py ...... [100%]
---------------------------------------------------------------------------------------------- benchmark 'nrows=1000000': 6 tests ---------------------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_parse[1m-noop] 12.8445 (1.0) 12.9714 (1.0) 12.9009 (1.0) 0.0570 (1.0) 12.9015 (1.0) 0.1046 (1.44) 1;0 0.0775 (1.0) 5 1
test_benchmark_parse[1m-python_only] 13.5212 (1.05) 13.7194 (1.06) 13.5844 (1.05) 0.0782 (1.37) 13.5659 (1.05) 0.0728 (1.0) 1;1 0.0736 (0.95) 5 1
test_benchmark_parse[1m-postal_only] 51.1747 (3.98) 51.4688 (3.97) 51.2672 (3.97) 0.1179 (2.07) 51.2128 (3.97) 0.1199 (1.65) 1;0 0.0195 (0.25) 5 1
test_benchmark_parse[1m-postal_parse_address] 56.0817 (4.37) 56.8488 (4.38) 56.2741 (4.36) 0.3264 (5.73) 56.1196 (4.35) 0.2928 (4.02) 1;1 0.0178 (0.23) 5 1
test_benchmark_parse[1m-postal_parse_address__direct_import] 56.2151 (4.38) 56.4120 (4.35) 56.3279 (4.37) 0.0975 (1.71) 56.3729 (4.37) 0.1837 (2.52) 2;0 0.0178 (0.23) 5 1
test_benchmark_parse[1m-postal_parse_address__initial_impl] 58.3636 (4.54) 58.5571 (4.51) 58.4671 (4.53) 0.0819 (1.44) 58.4389 (4.53) 0.1328 (1.82) 2;0 0.0171 (0.22) 5 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
================ 6 passed, 18 deselected in 1769.89s (0:29:29) =================

Like you, I also see a significant improvement in the runtime performance when using pytest-benchmark.
Hmm, really curious about this one, I'll dig in once I'm at a computer. Thanks for the effort. First thought: all the timings seem cut almost exactly in half; maybe we're accidentally doing the computation twice in the non-pytest harness?
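The "computing twice" hypothesis is easy to picture: a hand-rolled harness can accidentally materialize the result twice inside the timed region. A hypothetical sketch of the bug pattern, not the actual code in the `benchmarks` directory:

```python
import time

def timed_run_buggy(fn, *args):
    # Bug: the work is executed twice inside the timed region (e.g. executing
    # an expression once and then re-executing it to inspect the result), so
    # the reported time is roughly double the real per-run cost.
    start = time.perf_counter()
    result = fn(*args)
    _ = fn(*args)  # accidental second execution, still inside the timer
    return result, time.perf_counter() - start
```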
Quite sure this is not the case (also, my timings are a bit far from being exactly halved). The differences may be due to the fact that in the non-pytest benchmarks I was using a DuckDB database that writes to disk, not an in-memory one as with pytest-benchmark... however, this is what I get when running the non-pytest benchmarks using a memtable (it doesn't change much, still far from the other results):

At first I thought that the pytest-benchmark library did something as simple as
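One way a repeated-rounds harness can report lower numbers than a single manual run is result caching somewhere in the stack: only the first round pays the full cost, and the mean over 5 rounds drops. A self-contained illustration using `functools.lru_cache` as a stand-in for whatever layer might be caching (this is an assumption for illustration, not a claim about ibis/DuckDB internals):

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def parse_cached(address):
    # Simulate expensive work that is only paid on the first (cold) call.
    time.sleep(0.01)
    return address.upper()

def time_call(fn, arg):
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

cold = time_call(parse_cached, "123 main st")  # pays the sleep
warm = time_call(parse_cached, "123 main st")  # served from the cache
assert warm < cold
```

If something like this were happening, the manual single-shot timing would resemble the cold call while pytest-benchmark's mean over 5 rounds would be dominated by warm calls.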
Intro
In this PR:
- `mismo.lib.geo.postal_parse_address` has been rewritten using my implementation
- `mismo.lib.geo.tests.test_postal_benchmark.postal_parse_address__direct_import` is the same as the above one, but avoids the optional import of `postal` inside its body (@NickCrews: don't worry, I was just curious whether that import and the inner function declaration generated any relevant overhead... it seems irrelevant, I will remove it before merging)
- `mismo.lib.geo.tests.test_postal_benchmark.postal_parse_address__initial_impl` is the original implementation by @jstammers

In this branch I also temporarily restored the top-level `benchmarks` directory to manually gather execution times.

All benchmarks have been run on the full data set of 1M rows using
`duckdb 0.10.0`, `ibis-framework 9.1.0` and `python 3.12.3`, on the same laptop with no other programs running (except GNOME, etc.).

Results
Results obtained with manual timings are as follows.
On the other hand, pytest-benchmark reports the following.
Values in the Mean column are remarkably lower than those reported manually (the manual timings lie way outside the StdDev or the IQR).
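For reference, the summary columns pytest-benchmark reports can be recomputed from raw per-round timings. A minimal sketch with the standard library (the exact quartile method may differ slightly from pytest-benchmark's; the timing values are placeholders, not the actual benchmark data):

```python
import statistics

def summarize(timings):
    # Recompute the table's columns from a list of per-round timings (seconds).
    timings = sorted(timings)
    n = len(timings)
    q1 = statistics.median(timings[: n // 2])        # lower-half median
    q3 = statistics.median(timings[(n + 1) // 2 :])  # upper-half median
    mean = statistics.mean(timings)
    return {
        "min": timings[0],
        "max": timings[-1],
        "mean": mean,
        "stddev": statistics.stdev(timings),
        "median": statistics.median(timings),
        "iqr": q3 - q1,
        "ops": 1.0 / mean,  # OPS = 1 / Mean, as stated in the legend
    }
```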
Isn't this quite strange? As far as I can tell, `pytest-benchmark` runs each function 5 times: are we caching something that we should not, so that subsequent executions run faster than a single one (as in the manual timings)?

Also, in the pytest benchmarks the difference between
`postal_parse_address` and `postal_only` is more than 5s, which is much bigger than the difference seen in the manual timings (<1s).

Additionally, before gathering the above results I recreated my virtualenv from scratch to match mismo's new requirements (`python>=3.10`, `ibis-framework>=9.1.0`), and the results I get are also quite different from those I reported in PR #48, where I was using `ibis-framework 9.0.0` and `duckdb 1.0.0` (not sure which version of python, though).

Questions