- "text": "The performance of Python applications that use TACO can be measured using\nPython's built-in \ntime.perf_counter\n function with minimal changes to the\napplications. As an example, we can benchmark the performance of the\nscientific computing application shown \nhere\n as\nfollows:\n\n\nimport pytaco as pt\nfrom pytaco import compressed, dense\nimport numpy as np\nimport time\n\ncsr = pt.format([dense, compressed])\ndv = pt.format([dense])\n\nA = pt.read(\"pwtk.mtx\", csr)\nx = pt.from_array(np.random.uniform(size=A.shape[1]))\nz = pt.from_array(np.random.uniform(size=A.shape[0]))\ny = pt.tensor([A.shape[0]], dv)\n\ni, j = pt.get_index_vars(2)\ny[i] = A[i, j] * x[j] + z[i]\n\n# Tell TACO to generate code to perform the SpMV computation\ny.compile()\n\n# Benchmark the actual SpMV computation\nstart = time.perf_counter()\ny.compute()\nend = time.perf_counter()\n\nprint(\"Execution time: {0} seconds\".format(end - start))\n\n\n\nIn order to accurately measure TACO's computational performance, \nonly the\ntime it takes to actually perform a computation should be measured. The time\nit takes to generate code under the hood for performing that computation should\nnot be measured\n, since this overhead can be quite variable but can often be\namortized in practice. By default though, TACO will only generate and compile\ncode it needs for performing a computation immediately before it has to\nactually perform the computation. As the example above demonstrates, by\nmanually calling the result tensor's \ncompile\n method, we can tell TACO to\ngenerate code needed for performing the computation before benchmarking starts,\nletting us measure only the performance of the computation itself.\n\n\n\n\nWarning\n\n\npytaco.evaluate\n and \npytaco.einsum\n should not be used to benchmark\nTACO's computational performance, since timing those functions will\ninclude the time it takes to generate code for performing the computation.\n\n\n\n\nThe time it takes to construct the initial input tensors should also not be\nmeasured\n, since again this overhead can often be amortized in practice. By\ndefault, \npytaco.read\n and functions for converting NumPy arrays and SciPy\nmatrices to TACO tensors return fully constructed tensors. If you add nonzero\nelements to an input tensor by calling \ninsert\n though, then \npack\n must also\nbe explicitly invoked before any benchmarking is done:\n\n\nimport pytaco as pt\nfrom pytaco import compressed, dense\nimport numpy as np\nimport random\nimport time\n\ncsr = pt.format([dense, compressed])\ndv = pt.format([dense])\n\nA = pt.read(\"pwtk.mtx\", csr)\nx = pt.tensor([A.shape[1]], dv)\nz = pt.tensor([A.shape[0]], dv)\ny = pt.tensor([A.shape[0]], dv)\n\n# Insert random values into x and z and pack them into dense arrays\nfor k in range(A.shape[1]):\n x.insert([k], random.random())\nx.pack()\nfor k in range(A.shape[0]):\n z.insert([k], random.random())\nz.pack()\n\ni, j = pt.get_index_vars(2)\ny[i] = A[i, j] * x[j] + z[i]\n\ny.compile()\n\nstart = time.perf_counter()\ny.compute()\nend = time.perf_counter()\n\nprint(\"Execution time: {0} seconds\".format(end - start))\n\n\n\nTACO avoids regenerating code for performing the same computation though as\nlong as the computation is redefined with the same index variables and with the\nsame operand and result tensors. 
TACO avoids regenerating code for the same computation, though, as long as the computation is redefined with the same index variables and with the same operand and result tensors. Thus, if your application executes the same computation many times in a loop and the computation is executed on sufficiently large data sets, TACO will naturally amortize the overhead associated with generating code for performing the computation. In such scenarios, it is acceptable to include the initial code generation overhead in the performance measurement:

```python
import pytaco as pt
from pytaco import compressed, dense
import numpy as np
import random
import time

csr = pt.format([dense, compressed])
dv = pt.format([dense])

A = pt.read("pwtk.mtx", csr)
x = pt.tensor([A.shape[1]], dv)
z = pt.tensor([A.shape[0]], dv)
y = pt.tensor([A.shape[0]], dv)

for k in range(A.shape[1]):
    x.insert([k], random.random())
x.pack()
for k in range(A.shape[0]):
    z.insert([k], random.random())
z.pack()

i, j = pt.get_index_vars(2)

# Benchmark the iterative SpMV computation, including the overhead of
# generating code in the first iteration to perform the computation
start = time.perf_counter()
for k in range(1000):
    y[i] = A[i, j] * x[j] + z[i]
    y.evaluate()
    x[i] = y[i]
    x.evaluate()
end = time.perf_counter()

print("Execution time: {0} seconds".format(end - start))
```

**Warning**

In order to avoid regenerating code for a computation, the computation must be redefined with the exact same index variable objects and with the exact same tensor objects for operands and result. In the example above, every loop iteration redefines the computation of `y` and `x` using the same tensor and index variable objects constructed outside the loop, so TACO will only generate code to compute `y` and `x` in the first iteration. If the index variables were instead constructed inside the loop, TACO would regenerate code to compute `y` and `x` in every loop iteration, and the compilation overhead would not be amortized (see the sketch after the note below).

**Note**

As a rough rule of thumb, if a computation takes on the order of seconds or more in total to perform across all invocations with identical operands and result (and is always redefined with identical index variables), then it is acceptable to include the overhead associated with generating code for performing the computation in performance measurements.
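To make the warning above concrete, here is a hypothetical variation of the benchmarking loop in which fresh index variable objects are constructed in every iteration. Per the behavior described above, TACO would then treat each redefinition of `y` as a new computation and regenerate code for it in all 1000 iterations, so the measured time would be dominated by compilation rather than by the SpMV itself:

```python
import pytaco as pt
from pytaco import compressed, dense
import random
import time

csr = pt.format([dense, compressed])
dv = pt.format([dense])

A = pt.read("pwtk.mtx", csr)
x = pt.tensor([A.shape[1]], dv)
z = pt.tensor([A.shape[0]], dv)
y = pt.tensor([A.shape[0]], dv)

for k in range(A.shape[1]):
    x.insert([k], random.random())
x.pack()
for k in range(A.shape[0]):
    z.insert([k], random.random())
z.pack()

start = time.perf_counter()
for k in range(1000):
    # Anti-pattern: new index variable objects every iteration, so TACO
    # regenerates and recompiles the SpMV code on each pass through the loop
    i, j = pt.get_index_vars(2)
    y[i] = A[i, j] * x[j] + z[i]
    y.evaluate()
end = time.perf_counter()

print("Execution time: {0} seconds".format(end - start))
```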