Before going into more detail, it's important to understand how vectorization works in Python. A calculation on an array or matrix can be carried out in two ways:

The first is to go through the data and perform the calculation element by element, which is known as an iterative approach.
The second is to apply the calculation to the entire array or matrix at once, which is known as vectorization.

Strictly speaking, processing every element at once requires real parallelism (using a GPU, for example), which is not possible in all cases. In practice, we still speak of vectorization when we use the built-in functions of TensorFlow, NumPy or Pandas.

These functions still run an iterative loop, but it is executed in lower-level compiled code (C). Because that code is optimized and avoids the per-element overhead of the Python interpreter, execution is typically much faster and therefore emits less CO2.

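As a minimal sketch of the two approaches described above, the following compares an element-by-element loop with the equivalent NumPy expression. The function names `scale_iterative` and `scale_vectorized` and the sample data are illustrative assumptions, not part of the rule.

```python
import numpy as np


def scale_iterative(values, factor):
    """Multiply each element by `factor` with an explicit Python loop."""
    result = []
    for v in values:
        result.append(v * factor)
    return result


def scale_vectorized(values, factor):
    """Same calculation, written as a single NumPy expression."""
    return np.asarray(values) * factor


data = [1.0, 2.0, 3.0, 4.0]
print(scale_iterative(data, 2.0))   # [2.0, 4.0, 6.0, 8.0]
print(scale_vectorized(data, 2.0))  # [2. 4. 6. 8.]
```

Both functions compute the same values; only the place where the loop runs differs (the Python interpreter in the first case, NumPy's compiled code in the second).
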
== Non compliant Code Example

[source,python]
----
# A and B are lists of lists with compatible shapes.
rows_A, cols_B = len(A), len(B[0])
results = [[0 for _ in range(cols_B)] for _ in range(rows_A)]

for i in range(rows_A):
    for j in range(cols_B):
        for k in range(len(B)):
            results[i][j] += A[i][k] * B[k][j]
----

== Compliant Solution

[source,python]
----
import numpy as np  # NumPy, the Python library for array computing

results = np.dot(A, B)
----

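To check that the two versions agree, here is a small self-contained sketch. The 2x2 matrices `A` and `B` are made-up sample data chosen so the result can be verified by hand.

```python
import numpy as np

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Iterative version: triple loop over plain Python lists.
loop_result = [[0] * len(B[0]) for _ in range(len(A))]
for i in range(len(A)):
    for j in range(len(B[0])):
        for k in range(len(B)):
            loop_result[i][j] += A[i][k] * B[k][j]

# Vectorized version: np.dot accepts nested lists and returns an ndarray.
dot_result = np.dot(A, B)

print(loop_result)          # [[19, 22], [43, 50]]
print(dot_result.tolist())  # [[19, 22], [43, 50]]
```
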
== Relevance Analysis

The following results were obtained through local experiments.
This study is divided into 3 parts, comparing a vectorized and an iterative method:

* measuring the impact on a dot product between two vectors,
* measuring the impact on an outer product between two vectors,
* measuring the impact on a matrix calculation.

=== Impact Analysis

*1. Dot product:*

*Non compliant*
[source,python]
----
def iterative_dot_product(x, y):
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total
----
*Compliant*
[source,python]
----
import numpy as np

def vectorized_dot_product(x, y):
    return np.dot(x, y)
----
image::dot.png[]

*2. Outer product:*

*Non compliant*
[source,python]
----
import numpy as np

def iterative_outer_product(x, y):
    o = np.zeros((len(x), len(y)))
    for i in range(len(x)):
        for j in range(len(y)):
            o[i][j] = x[i] * y[j]
    return o
----
*Compliant*
[source,python]
----
import numpy as np

def vectorized_outer_product(x, y):
    return np.outer(x, y)
----
image::outer.png[]

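For illustration only, the outer product can also be written with NumPy broadcasting. The sketch below uses made-up vectors `x` and `y`, and the broadcasting form is an alternative shown here for comparison, not something the experiments above measured.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([10, 20])

outer = np.outer(x, y)
# Broadcasting: a (3, 1) column times a (2,) row yields a (3, 2) array.
broadcast = x[:, np.newaxis] * y

print(outer.tolist())                    # [[10, 20], [20, 40], [30, 60]]
print(np.array_equal(outer, broadcast))  # True
```
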
*3. Matrix product:*

*Non compliant*
[source,python]
----
def iterative_matrix_product(A, B):
    # Start from a zero matrix of shape (rows of A) x (columns of B).
    results = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                results[i][j] += A[i][k] * B[k][j]
    return results
----
*Compliant*
[source,python]
----
import numpy as np

def vectorized_matrix_product(A, B):
    return np.dot(A, B)
----
image::matrix.png[]

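For 2-D arrays, `np.dot` performs matrix multiplication, so it gives the same result as the `@` operator and `np.matmul`. A short sketch with made-up sample matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# All three spellings compute the same 2-D matrix product.
print(np.array_equal(np.dot(A, B), A @ B))            # True
print(np.array_equal(np.dot(A, B), np.matmul(A, B)))  # True
```
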
=== Conclusion

In these experiments, the vectorized method was significantly faster than the iterative method, and its CO2 emissions were lower. This illustrates how built-in functions can make code more efficient, both in terms of performance and environmental impact.
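
A comparison of this kind could be timed roughly as follows. This is only a sketch under stated assumptions: the vector size, the number of runs and the use of `timeit` are choices made here for illustration, not the setup of the experiments above, and it measures execution time only, not energy or CO2.

```python
import timeit

import numpy as np


def iterative_dot_product(x, y):
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def vectorized_dot_product(x, y):
    return np.dot(x, y)


# Assumed benchmark settings: vector size and run count are arbitrary.
rng = np.random.default_rng(seed=0)
x = rng.random(10_000)
y = rng.random(10_000)

t_iter = timeit.timeit(lambda: iterative_dot_product(x, y), number=20)
t_vec = timeit.timeit(lambda: vectorized_dot_product(x, y), number=20)

print(f"iterative:  {t_iter:.4f} s for 20 runs")
print(f"vectorized: {t_vec:.4f} s for 20 runs")
```

Absolute timings depend on the machine and the NumPy build, so only the relative difference between the two lines is meaningful.
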
src/main/rules/GCI96/python/GCI96.asciidoc
116 additions & 0 deletions
@@ -1,3 +1,24 @@
+<<<<<<< HEAD
+Before going into more detail, it's important to understand how vectorization works in Python. When performing a calculation on an array/matrix, there are several possible methods:
+
+The first is to go through the list and perform the calculation element by element, known as an iterative approach.
+The second method consists of applying the calculation to the entire array/matrix at once, which is known as vectorization.
+
+Although it's not possible to do this in all cases without applying real parallelism using a GPU, for example, we speak of vectorization when we use the built-in functions of TensorFlow, NumPy or Pandas.
+
+We'll also have an iterative loop, but it will be executed in lower-level code (C). As with the use of built-in functions in general, since low-level languages like C are optimized, execution will be much faster and therefore emit less CO2.
+
+== Non compliant Code Example
+
+[source,python]
+----
+for i in range(len(A)):
+    for j in range(len(B[0])):
+        for k in range(len(B)):
+            results[i][j] += A[i][k] * B[k][j]
+----
+
+=======
 This rule is specific to Python because it's related to the Pandas library, which is widely used for data manipulation and analysis in Python.

 Reading CSV files without explicitly specifying which columns to load leads to unnecessary data loading and increases memory and energy consumption. This guidance is specific to the use of the Pandas library in Python, but it aligns with the more general GCI74: Avoid SELECT * from table in SQL. To ensure low environmental impact and optimal performance, always use the usecols parameter in pandas.read_csv() to select only the required columns.
@@ -14,10 +35,20 @@ df = pd.read_csv('data.csv')

 In this case, **all columns** are read into memory, even if only one or two are needed.

+>>>>>>> main
 == Compliant Solution

 [source,python]
 ----
+<<<<<<< HEAD
+results = np.dot(A, B)
+# np stands for NumPy, the Python library used to manipulate data series.
+----
+
+== Relevance Analysis
+
+The following results were obtained through local experiments.
+=======
 file_path = 'data.csv'
 df = pd.read_csv(file_path, usecols=['A', 'B']) # Only read needed columns
 ----
@@ -27,11 +58,95 @@ This ensures only the necessary data is loaded, reducing memory usage and energy
 == Relevance Analysis

 Local experiments were conducted to assess the environmental impact of reading CSV files with and without column selection.
+This study is divided into 3 parts, comparing a vectorized and an iterative method:
+measuring the impact on a dot product between two vectors,
+measuring the impact on an outer product between two vectors,
+measuring the impact on a matrix calculation.
+
+=== Impact Analysis
+
+*1. dot product:*
+
+*Non compliant*
+[source,python]
+----
+def iterative_dot_product(x,y):
+    total = 0
+    for i in range(len(x)):
+        total += x[i] * y[i]
+    return total
+----
+*Compliant*
+[source,python]
+----
+def vectorized_dot_product(x,y):
+    return np.dot(x,y)
+----
+image::dot.png[]
+
+*2. Outer product:*
+
+*Non compliant*
+[source,python]
+----
+def iterative_outer_product(x, y):
+    o = np.zeros((len(x), len(y)))
+    for i in range(len(x)):
+        for j in range(len(y)):
+            o[i][j] = x[i] * y[j]
+    return o
+----
+*Compliant*
+[source,python]
+----
+def vectorized_outer_product(x, y):
+    return np.outer(x, y)
+----
+image::outer.png[]
+
+*3. Matrix product:*
+
+*Non compliant*
+[source,python]
+----
+def iterative_matrix_product(A, B):
+    for i in range(len(A)):
+        for j in range(len(B[0])):
+            for k in range(len(B)):
+                results[i][j] += A[i][k] * B[k][j]
+    return results
+----
+*Compliant*
+[source,python]
+----
+def vectorized_outer_product(A, B):
+    return np.dot(A, B)
+----
+image::matrix.png[]
+
+=== Conclusion
+
+The results show that the vectorized method is significantly faster than the iterative method. The CO2 emissions are also lower. This is a clear example of how using built-in functions can lead to more efficient code, both in terms of performance and environmental impact.