To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded into a pandas DataFrame.

1. Download the data from Azure Blob storage with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:
    ```python
    from azure.storage.blob import BlockBlobService
    import time

    STORAGEACCOUNTNAME = <storage_account_name>
    STORAGEACCOUNTKEY = <storage_account_key>
    LOCALFILENAME = <local_file_name>
    CONTAINERNAME = <container_name>
    BLOBNAME = <blob_name>

    # download from blob
    t1 = time.time()
    blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
    t2 = time.time()
    print("It takes %s seconds to download %s" % (t2 - t1, BLOBNAME))
    ```
1. Read the data into a pandas DataFrame from the downloaded file.

    ```python
    import pandas as pd

    # LOCALFILENAME is the file path
    dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    ```
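    As a quick self-contained sanity check, the same call can be exercised on an in-memory CSV; the file contents and column names below are hypothetical stand-ins for the downloaded blob:

    ```python
    import io

    import pandas as pd

    # hypothetical CSV text standing in for the downloaded file
    csv_text = "trip_distance,passenger_count\n1.5,1\n3.2,2\n0.8,1\n"
    dataframe_blobdata = pd.read_csv(io.StringIO(csv_text))
    print(dataframe_blobdata.shape)  # (3, 2)
    ```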
Now you are ready to explore the data and generate features on this dataset.
Here are a few examples of ways to explore data using pandas:

1. Inspect the **number of rows and columns**:
    ```python
    print('The size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    ```
1. **Inspect** the first or last few **rows** of the dataset:
    ```python
    dataframe_blobdata.head(10)

    dataframe_blobdata.tail(10)
    ```
1. Check the **data type** each column was imported as using the following sample code:
    ```python
    for col in dataframe_blobdata.columns:
        print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    ```
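    The same per-column types are also available in a single call through the `dtypes` attribute; a minimal sketch on a hypothetical two-column frame:

    ```python
    import pandas as pd

    # hypothetical example frame; the column names are illustrative
    df = pd.DataFrame({"trip_distance": [1.5, 3.2], "passenger_count": [1, 2]})
    print(df.dtypes)  # trip_distance -> float64, passenger_count -> int64
    ```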
1. Check the **basic stats** for the columns in the dataset as follows:
    ```python
    dataframe_blobdata.describe()
    ```
1. Look at the number of entries for each column value as follows:
    ```python
    dataframe_blobdata['<column_name>'].value_counts()
    ```
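    `value_counts` returns the counts sorted in descending order of frequency, so the first entry is the most common value. A small sketch with hypothetical data:

    ```python
    import pandas as pd

    df = pd.DataFrame({"passenger_count": [1, 2, 1, 1, 3]})  # illustrative values
    counts = df["passenger_count"].value_counts()
    print(counts.idxmax(), counts.loc[1])  # value 1 is the most common, appearing 3 times
    ```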
1. **Count missing values** versus the actual number of entries in each column using the following sample code:
    ```python
    miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
    print(miss_num)
    ```
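    `count()` returns the number of non-missing entries per column, so subtracting it from the row count gives the missing counts. For example, on a hypothetical frame with one NaN:

    ```python
    import numpy as np
    import pandas as pd

    # illustrative frame with one missing value in column "a"
    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
    miss_num = df.shape[0] - df.count()
    print(miss_num["a"], miss_num["b"])  # 1 0
    ```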
1. If you have **missing values** for a specific column in the data, you can drop them as follows:
    ```python
    dataframe_blobdata_noNA = dataframe_blobdata.dropna()
    dataframe_blobdata_noNA.shape
    ```
    Another way to replace missing values is with the mode function:
    ```python
    dataframe_blobdata_mode = dataframe_blobdata.fillna(
        {'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})
    ```
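    `mode()` returns a Series (there can be ties), which is why the `[0]` index is needed to pick a single fill value. A sketch on hypothetical data where 2.0 is the most frequent value:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"passenger_count": [2.0, 2.0, np.nan, 1.0]})  # illustrative
    filled = df.fillna({"passenger_count": df["passenger_count"].mode()[0]})
    print(filled["passenger_count"].tolist())  # [2.0, 2.0, 2.0, 1.0]
    ```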
1. Create a **histogram** plot using a variable number of bins to plot the distribution of a variable:
    ```python
    import numpy as np

    dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

    np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
    ```
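    The `np.log(x + 1)` transform compresses a long right tail while keeping zero values defined; it is equivalent to `np.log1p(x)`, which is more accurate for values near zero. A quick check on hypothetical values:

    ```python
    import numpy as np
    import pandas as pd

    s = pd.Series([0.0, 1.0, 9.0])  # illustrative values
    # the two forms agree element-wise
    assert np.allclose(np.log(s + 1), np.log1p(s))
    print(np.log1p(s).tolist()[0])  # 0.0 (log1p of 0 is 0)
    ```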
1. Look at **correlations** between variables using a scatter plot or the built-in correlation function:
    ```python
    import matplotlib.pyplot as plt

    # relationship between column_a and column_b using scatter plot
    plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

    # correlation between column_a and column_b
    dataframe_blobdata[['<column_a>', '<column_b>']].corr()
    ```
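    `corr()` computes the Pearson correlation matrix by default; on two hypothetical, exactly linearly related columns the off-diagonal entry is 1.0:

    ```python
    import pandas as pd

    df = pd.DataFrame({"column_a": [1.0, 2.0, 3.0],
                       "column_b": [2.0, 4.0, 6.0]})  # illustrative data
    corr = df[["column_a", "column_b"]].corr()
    print(corr.loc["column_a", "column_b"])  # 1.0
    ```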