11"""
Profiling your PyTorch Module
-----------------------------
**Author:** `Suraj Subramanian <https://github.com/suraj813>`_

**Translator:** `이재복 <http://github.com/zzaebok>`_

PyTorch includes a profiler API that is useful to identify the time and
memory costs of various PyTorch operations in your code. The profiler can be
easily integrated into your code, and the results can be printed as a table
or returned as a JSON trace file.

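As a minimal sketch of both output modes (assuming ``prof`` is the result
object produced by the ``profiler.profile`` context manager used later in
this tutorial):

.. code-block:: python

    # aggregated stats rendered as a table
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

    # JSON trace viewable in chrome://tracing
    prof.export_chrome_trace("trace.json")
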
.. note::
    Profiler supports multithreaded models. Profiler runs in the
    same thread as the operation but it will also profile child operators
    that might run in another thread. Concurrently-running profilers will be
    scoped to their own thread to prevent mixing of results.

.. note::
    PyTorch 1.8 introduces the new API that will replace the older profiler API
    in the future releases. Check the new API at `this page <https://pytorch.org/docs/master/profiler.html>`__.
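
    A minimal sketch of that newer API (names as on the linked page; the rest
    of this tutorial sticks with ``torch.autograd.profiler``):

    .. code-block:: python

        from torch.profiler import profile, ProfilerActivity

        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
            out, idx = model(input, mask)
        print(prof.key_averages().table(sort_by="cpu_time_total"))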

Head on over to `this
recipe <https://tutorials.pytorch.kr/recipes/recipes/profiler_recipe.html>`__
for a quicker walkthrough of Profiler API usage.


--------------
"""

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler


######################################################################
# Performance debugging using Profiler
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Profiler can be useful to identify performance bottlenecks in your
# models. In this example, we build a custom module that performs two
# sub-tasks:
#
# - a linear transformation on the input, and
# - use the transformation result to get indices on a mask tensor.
#
# We wrap the code for each sub-task in separate labelled context managers using
# ``profiler.record_function("label")``. In the profiler output, the
# aggregate performance metrics of all operations in the sub-task will
# show up under its corresponding label.
#
#
# Note that using Profiler incurs some overhead, and is best used only for investigating
# code. Remember to remove it if you are benchmarking runtimes.
#

class MyModule(nn.Module):
    # NOTE: the class body is elided in this excerpt; the sketch below is
    # reconstructed from the surrounding description, and the exact threshold
    # heuristic is an assumption.
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            # device-to-host copy for NumPy, then host-to-device copy back:
            # the costly steps the profiler flags below
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx


######################################################################
# Profile the forward pass
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# We initialize random input and mask tensors, and the model.
#
# Before we run the profiler, we warm-up CUDA to ensure accurate
# performance benchmarking. We wrap the forward pass of our module in the
# ``profiler.profile`` context manager. The ``with_stack=True`` parameter appends the
# file and line number of the operation in the trace.
#
# .. WARNING::
#     ``with_stack=True`` incurs an additional overhead, and is better suited for investigating code.
#     Remember to remove it if you are benchmarking performance.
#

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

######################################################################
# Print profiler results
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Finally, we print the profiler results. ``profiler.key_averages``
# aggregates the results by operator name, and optionally by input
# shapes and/or stack trace events.
# Grouping by input shapes is useful to identify which tensor shapes
# are utilized by the model.
#
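# As a sketch of the shape-grouping variant (note that it assumes the
# profiler was run with ``record_shapes=True``, which the runs in this
# tutorial do not pass):
#
# .. code-block:: python
#
#     with profiler.profile(record_shapes=True) as prof:
#         out, idx = model(input, mask)
#     print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
#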
# Here, we use ``group_by_stack_n=5``, which aggregates runtimes by the
# operation and its traceback (truncated to the most recent 5 events), and
# displays the events in the order they are registered. The table can also
# be sorted by passing a ``sort_by`` argument (refer to the
# `docs <https://pytorch.org/docs/stable/autograd.html#profiler>`__ for
# valid sorting keys).
#
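# For example, to surface the most memory-hungry operators instead, sort by
# a memory key (key names are listed in the linked docs):
#
# .. code-block:: python
#
#     print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_memory_usage', row_limit=5))
#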
# .. Note::
#     When running profiler in a notebook, you might see entries like ``<ipython-input-18-193a910735e8>(13): forward``
#     instead of filenames in the stacktrace. These correspond to ``<notebook-cell>(line number): calling-function``.

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(profiler output table elided from this excerpt)
"""

######################################################################
# Improve memory performance
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Note that the most expensive operations - in terms of memory and time -
# are at ``forward (10)``, representing the operations within MASK INDICES. Let's try to
# tackle the memory consumption first. We can see that the ``.to()``
# operation at line 12 consumes 953.67 Mb. This operation copies ``mask`` to the CPU.
# ``mask`` is initialized with a ``torch.double`` datatype. Can we reduce the memory footprint by casting
# it to ``torch.float`` instead?
#
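# A back-of-the-envelope check of why the cast should roughly halve the
# footprint: ``torch.double`` elements take 8 bytes, ``torch.float`` take 4.
#
# .. code-block:: python
#
#     torch.tensor(0, dtype=torch.double).element_size()  # 8 bytes
#     torch.tensor(0, dtype=torch.float).element_size()   # 4 bytes
#     500 * 500 * 500 * 8 / 2**20  # ~953.67, the Mb figure reported above
#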

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(profiler output table elided from this excerpt)
"""

######################################################################
#
# The CPU memory footprint for this operation has halved.
#
# Improve time performance
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# While the time consumed has also reduced a bit, it's still too high.
# Turns out copying a matrix from CUDA to CPU is pretty expensive!
# The ``aten::copy_`` operator in ``forward (12)`` copies ``mask`` to CPU
# so that it can use the NumPy ``argwhere`` function. ``aten::copy_`` at ``forward(13)``
# copies the array back to CUDA as a tensor. We could eliminate both of these if we use a
# ``torch`` function ``nonzero()`` here instead.
#
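# A toy sketch of the two index-finding paths (shapes here are made up for
# illustration; both produce the same indices, but ``nonzero()`` never
# leaves the GPU):
#
# .. code-block:: python
#
#     t = torch.rand(4, 4).cuda()
#     # round-trip: device-to-host copy, NumPy argwhere, host-to-device copy
#     idx_np = torch.from_numpy(np.argwhere(t.cpu().numpy() > 0.5)).cuda()
#     # single path that stays on the device
#     idx_t = (t > 0.5).nonzero()
#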

class MyModule(nn.Module):
    # NOTE: body elided in this excerpt; identical to the first version except
    # MASK INDICES, which now uses ``nonzero()`` as described above.
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = (mask > threshold).nonzero()

        return out, hi_idx

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(profiler output table elided from this excerpt)
"""


######################################################################
# Further Reading
# ~~~~~~~~~~~~~~~~~
# We have seen how Profiler can be used to investigate time and memory bottlenecks in PyTorch models.
# Read more about Profiler here:
#
# - `Profiler Usage Recipe <https://tutorials.pytorch.kr/recipes/recipes/profiler_recipe.html>`__
# - `Profiling RPC-Based Workloads <https://tutorials.pytorch.kr/recipes/distributed_rpc_profiling.html>`__
# - `Profiler API Docs <https://pytorch.org/docs/stable/autograd.html?highlight=profiler#profiler>`__