Commit e14a045

Merge pull request #380 from cleophass/AvoidIterativeMatrixOperations
GCI107 AvoidIterativeMatrixOperations #Python #DLG #RulesSpecifications
2 parents c0ab270 + 2a90b74 commit e14a045

File tree

7 files changed

+279
-0
lines changed

src/main/rules/GCI107/GCI107.json

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
{
2+
"title": "DATA : Avoid Iterative Matrix Operations",
3+
"type": "CODE_SMELL",
4+
"status": "ready",
5+
"remediation": {
6+
"func": "Constant\/Issue",
7+
"constantCost": "10min"
8+
},
9+
"tags": [
10+
"creedengo",
11+
"eco-design",
12+
"performance",
13+
"data",
14+
"ai",
15+
"vector",
16+
"pandas",
17+
"numpy"
18+
],
19+
"defaultSeverity": "Minor"
20+
}
21+
Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
Before going into detail, it helps to understand how vectorization works in Python. When performing a calculation on an array or matrix, there are two main approaches:
2+
3+
The first is to loop over the array and perform the calculation element by element; this is the iterative approach.
4+
The second applies the calculation to the entire array or matrix at once; this is known as vectorization.
5+
6+
Processing every element truly at once is not always feasible without real parallelism, for example on a GPU. In practice, however, we also speak of vectorization when we use the built-in functions of TensorFlow, NumPy or Pandas.
7+
8+
These functions still loop over the elements, but the loop runs in lower-level compiled code (C) rather than in the Python interpreter. As with built-in functions in general, this low-level code is optimized, so execution is typically much faster and therefore emits less CO2.
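
As a minimal sketch of the two approaches described above, the following scales an array first with a Python loop and then with a single NumPy expression. The function names and the sample data are illustrative assumptions, not part of the rule.

```python
import numpy as np


def scale_iterative(values, factor):
    """Iterative approach: the loop runs in the Python interpreter."""
    result = []
    for v in values:
        result.append(v * factor)
    return result


def scale_vectorized(values, factor):
    """Vectorized approach: NumPy runs the element loop in compiled code."""
    return np.asarray(values) * factor


data = [1.0, 2.0, 3.0]  # hypothetical sample data
print(scale_iterative(data, 2.0))   # [2.0, 4.0, 6.0]
print(scale_vectorized(data, 2.0))  # [2. 4. 6.]
```

Both functions return the same values; only the place where the loop executes differs.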
9+
10+
== Non compliant Code Example
11+
12+
[source,python]
13+
----
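# A and B are matrices stored as nested lists.
# results starts as a zero-filled matrix with len(A) rows and len(B[0]) columns.
results = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]

for i in range(len(A)):
    for j in range(len(B[0])):
        for k in range(len(B)):
            results[i][j] += A[i][k] * B[k][j]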
21+
----
22+
23+
== Compliant Solution
24+
25+
[source,python]
26+
----
27+
import numpy as np  # NumPy, the Python library for numerical array computing

results = np.dot(A, B)
29+
----
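
To check that the non-compliant and compliant forms compute the same product, a small sketch can run both on sample data and compare the results. The 2x3 and 3x2 matrices below are made up for illustration. For two-dimensional arrays, the `@` operator gives the same result as `np.dot`.

```python
import numpy as np

# Hypothetical sample data: A is 2x3, B is 3x2.
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]

# Iterative version: triple nested Python loop.
iterative = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
for i in range(len(A)):
    for j in range(len(B[0])):
        for k in range(len(B)):
            iterative[i][j] += A[i][k] * B[k][j]

# Vectorized versions: np.dot and the equivalent @ operator.
vectorized = np.dot(A, B)
with_operator = np.asarray(A) @ np.asarray(B)

print(iterative)                                  # [[58, 64], [139, 154]]
print(np.array_equal(iterative, vectorized))      # True
print(np.array_equal(vectorized, with_operator))  # True
```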
30+
31+
== Relevance Analysis
32+
33+
The following results were obtained through local experiments.
34+
35+
=== Configuration
36+
37+
* Processor: Intel(R) Core(TM) Ultra 5 135U, 2100 MHz, 12 cores, 14 logical processors
38+
* RAM: 16 GB
39+
* CO2 Emissions Measurement: Using CodeCarbon
40+
41+
=== Context
42+
43+
This study is divided into 3 parts, each comparing a vectorized and an iterative method:

* measuring the impact on a dot product between two vectors,
* measuring the impact on an outer product between two vectors,
* measuring the impact on a matrix product.
47+
48+
=== Impact Analysis
49+
50+
*1. Dot product:*
51+
52+
*Non compliant*
53+
[source,python]
54+
----
55+
def iterative_dot_product(x, y):
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total
60+
----
61+
*Compliant*
62+
[source,python]
63+
----
64+
def vectorized_dot_product(x, y):
    return np.dot(x, y)
66+
----
67+
image::dot.png[]
68+
69+
*2. Outer product:*
70+
71+
*Non compliant*
72+
[source,python]
73+
----
74+
def iterative_outer_product(x, y):
    o = np.zeros((len(x), len(y)))
    for i in range(len(x)):
        for j in range(len(y)):
            o[i][j] = x[i] * y[j]
    return o
80+
----
81+
*Compliant*
82+
[source,python]
83+
----
84+
def vectorized_outer_product(x, y):
    return np.outer(x, y)
86+
----
87+
image::outer.png[]
88+
89+
*3. Matrix product:*
90+
91+
*Non compliant*
92+
[source,python]
93+
----
94+
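def iterative_matrix_product(A, B):
    # Zero-filled result matrix with len(A) rows and len(B[0]) columns.
    results = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                results[i][j] += A[i][k] * B[k][j]
    return results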
100+
----
101+
*Compliant*
102+
[source,python]
103+
----
104+
def vectorized_matrix_product(A, B):
    return np.dot(A, B)
106+
----
107+
image::matrix.png[]
108+
109+
=== Conclusion
110+
111+
The results show that the vectorized method is significantly faster than the iterative method and emits less CO2. This illustrates how using built-in library functions leads to more efficient code, in terms of both performance and environmental impact.
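
A rough way to observe the speed difference locally is sketched below. It only measures wall-clock time with the standard `timeit` module, not CO2 emissions, and it is not the benchmark that produced the figures above. The helper `compare_timings`, the array size and the repeat count are all assumptions chosen for illustration.

```python
import timeit

import numpy as np


def iterative_dot_product(x, y):
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def vectorized_dot_product(x, y):
    return np.dot(x, y)


def compare_timings(size=10_000, repeats=20):
    """Return (iterative_seconds, vectorized_seconds) for `repeats` calls each."""
    rng = np.random.default_rng(seed=0)  # fixed seed for repeatable inputs
    x = rng.random(size)
    y = rng.random(size)
    t_iter = timeit.timeit(lambda: iterative_dot_product(x, y), number=repeats)
    t_vec = timeit.timeit(lambda: vectorized_dot_product(x, y), number=repeats)
    return t_iter, t_vec


if __name__ == "__main__":
    t_iter, t_vec = compare_timings()
    print(f"iterative:  {t_iter:.4f} s")
    print(f"vectorized: {t_vec:.4f} s")
```

Absolute timings depend on the machine and on the array size, so the numbers are only meaningful relative to each other.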
112+
113+
=== References
114+
115+
https://sciresol.s3.us-east-2.amazonaws.com/IJST/Articles/2024/Issue-24/IJST-2024-914.pdf
116+
117+
https://arxiv.org/pdf/2308.01269
118+
119+
https://www.db-thueringen.de/servlets/MCRFileNodeServlet/dbt_derivate_00062165/ilm1-2024200012.pdf

src/main/rules/GCI96/GCI96.json

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,26 @@
11
{
2+
<<<<<<< HEAD
3+
"title": "DATA : Avoid Iterative Matrix Operations",
4+
"type": "CODE_SMELL",
5+
"status": "ready",
6+
"remediation": {
7+
"func": "Constant\/Issue",
8+
"constantCost": "10min"
9+
},
10+
"tags": [
11+
"creedengo",
12+
"eco-design",
13+
"performance",
14+
"data",
15+
"ai",
16+
"vector",
17+
"pandas",
18+
"numpy"
19+
],
20+
"defaultSeverity": "Minor"
21+
}
22+
23+
=======
224
"title": "DATA/AI Pandas - Avoid Reading Unnecessary Columns in CSV Files",
325
"type": "CODE_SMELL",
426
"status": "ready",
@@ -16,3 +38,4 @@
1638
],
1739
"defaultSeverity": "Minor"
1840
}
41+
>>>>>>> main

src/main/rules/GCI96/python/GCI96.asciidoc

Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,24 @@
1+
<<<<<<< HEAD
2+
Before going into detail, it helps to understand how vectorization works in Python. When performing a calculation on an array or matrix, there are two main approaches:
3+
4+
The first is to loop over the array and perform the calculation element by element; this is the iterative approach.
5+
The second applies the calculation to the entire array or matrix at once; this is known as vectorization.
6+
7+
Processing every element truly at once is not always possible without real parallelism, for example on a GPU. In practice, however, we also speak of vectorization when we use the built-in functions of TensorFlow, NumPy or Pandas.
8+
9+
These functions still loop over the elements, but the loop runs in lower-level compiled code (C) rather than in the Python interpreter. As with built-in functions in general, this low-level code is optimized, so execution is typically much faster and therefore emits less CO2.
10+
11+
== Non compliant Code Example
12+
13+
[source,python]
14+
----
15+
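# A and B are matrices stored as nested lists.
# results starts as a zero-filled matrix with len(A) rows and len(B[0]) columns.
results = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]

for i in range(len(A)):
    for j in range(len(B[0])):
        for k in range(len(B)):
            results[i][j] += A[i][k] * B[k][j]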
19+
----
20+
21+
=======
122
This rule is specific to Python because it's related to the Pandas library, which is widely used for data manipulation and analysis in Python.
223
324
Reading CSV files without explicitly specifying which columns to load leads to unnecessary data loading and increases memory and energy consumption. This guidance is specific to the use of the Pandas library in Python, but it aligns with the more general GCI74: Avoid SELECT * from table in SQL. To ensure low environmental impact and optimal performance, always use the usecols parameter in pandas.read_csv() to select only the required columns.
@@ -14,10 +35,20 @@ df = pd.read_csv('data.csv')
1435
1536
In this case, **all columns** are read into memory, even if only one or two are needed.
1637
38+
>>>>>>> main
1739
== Compliant Solution
1840
1941
[source,python]
2042
----
43+
<<<<<<< HEAD
44+
import numpy as np  # NumPy, the Python library for numerical array computing

results = np.dot(A, B)
46+
----
47+
48+
== Relevance Analysis
49+
50+
The following results were obtained through local experiments.
51+
=======
2152
file_path = 'data.csv'
2253
df = pd.read_csv(file_path, usecols=['A', 'B']) # Only read needed columns
2354
----
@@ -27,11 +58,95 @@ This ensures only the necessary data is loaded, reducing memory usage and energy
2758
== Relevance Analysis
2859
2960
Local experiments were conducted to assess the environmental impact of reading CSV files with and without column selection.
61+
>>>>>>> main
3062
3163
=== Configuration
3264
3365
* Processor: Intel(R) Core(TM) Ultra 5 135U, 2100 MHz, 12 cores, 14 logical processors
3466
* RAM: 16 GB
67+
<<<<<<< HEAD
68+
* CO2 Emissions Measurement: Using CodeCarbon
69+
70+
=== Context
71+
72+
This study is divided into 3 parts, each comparing a vectorized and an iterative method:

* measuring the impact on a dot product between two vectors,
* measuring the impact on an outer product between two vectors,
* measuring the impact on a matrix product.
76+
77+
=== Impact Analysis
78+
79+
*1. Dot product:*
80+
81+
*Non compliant*
82+
[source,python]
83+
----
84+
def iterative_dot_product(x, y):
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total
89+
----
90+
*Compliant*
91+
[source,python]
92+
----
93+
def vectorized_dot_product(x, y):
    return np.dot(x, y)
95+
----
96+
image::dot.png[]
97+
98+
*2. Outer product:*
99+
100+
*Non compliant*
101+
[source,python]
102+
----
103+
def iterative_outer_product(x, y):
    o = np.zeros((len(x), len(y)))
    for i in range(len(x)):
        for j in range(len(y)):
            o[i][j] = x[i] * y[j]
    return o
109+
----
110+
*Compliant*
111+
[source,python]
112+
----
113+
def vectorized_outer_product(x, y):
    return np.outer(x, y)
115+
----
116+
image::outer.png[]
117+
118+
*3. Matrix product:*
119+
120+
*Non compliant*
121+
[source,python]
122+
----
123+
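def iterative_matrix_product(A, B):
    # Zero-filled result matrix with len(A) rows and len(B[0]) columns.
    results = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                results[i][j] += A[i][k] * B[k][j]
    return results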
129+
----
130+
*Compliant*
131+
[source,python]
132+
----
133+
def vectorized_matrix_product(A, B):
    return np.dot(A, B)
135+
----
136+
image::matrix.png[]
137+
138+
=== Conclusion
139+
140+
The results show that the vectorized method is significantly faster than the iterative method and emits less CO2. This illustrates how using built-in library functions leads to more efficient code, in terms of both performance and environmental impact.
141+
142+
=== References
143+
144+
https://sciresol.s3.us-east-2.amazonaws.com/IJST/Articles/2024/Issue-24/IJST-2024-914.pdf
145+
146+
https://arxiv.org/pdf/2308.01269
147+
148+
https://www.db-thueringen.de/servlets/MCRFileNodeServlet/dbt_derivate_00062165/ilm1-2024200012.pdf
149+
=======
35150
* CO₂ Emissions Measurement: https://mlco2.github.io/codecarbon/[CodeCarbon]
36151
37152
=== Context
@@ -72,3 +187,4 @@ This is especially critical when working with large datasets or in environments
72187
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
73188
https://medium.com/@amit25173/what-is-usecols-in-pandas-7a6a43885f4b
74189
190+
>>>>>>> main
