Commit 25f2652

Merge pull request #52447 from mpalumbo7/patch-1
Fixed formatting of documentation
2 parents: 4eaae40 + 0149c1c

File tree

1 file changed: +61 −61 lines

articles/machine-learning/team-data-science-process/explore-data-blob.md

To explore and manipulate a dataset, it must first be downloaded from the blob storage.

1. Download the data from the Azure blob with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:

    ```python
    from azure.storage.blob import BlockBlobService
    import time

    STORAGEACCOUNTNAME = "<storage_account_name>"
    STORAGEACCOUNTKEY = "<storage_account_key>"
    LOCALFILENAME = "<local_file_name>"
    CONTAINERNAME = "<container_name>"
    BLOBNAME = "<blob_name>"

    # download from blob
    t1 = time.time()
    blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
    t2 = time.time()
    print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
    ```
1. Read the data into a pandas DataFrame from the downloaded file:

    ```python
    import pandas as pd

    # LOCALFILENAME is the file path
    dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    ```
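If you want to try out this step without a blob download, note that `pd.read_csv` accepts any file-like object as well as a path. The following sketch uses a made-up in-memory CSV (the `age` and `city` columns are hypothetical, not part of the original dataset) and behaves the same as reading `LOCALFILENAME`:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the downloaded local file.
csv_text = "age,city\n34,Seattle\n29,Austin\n41,Seattle\n"

# read_csv treats the in-memory buffer exactly like a file path.
dataframe_blobdata = pd.read_csv(io.StringIO(csv_text))

print(dataframe_blobdata.shape)  # (3, 2)
```

The resulting DataFrame supports the same exploration calls shown in the rest of this article.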
Now you are ready to explore the data and generate features on this dataset.

Here are a few examples of ways to explore data using pandas:
1. Inspect the **number of rows and columns**

    ```python
    print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    ```

1. **Inspect** the first or last few **rows** in the following dataset:

    ```python
    dataframe_blobdata.head(10)

    dataframe_blobdata.tail(10)
    ```

1. Check the **data type** each column was imported as with the following sample code:

    ```python
    for col in dataframe_blobdata.columns:
        print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    ```

1. Check the **basic stats** for the columns in the data set as follows:

    ```python
    dataframe_blobdata.describe()
    ```

1. Look at the number of entries for each column value as follows:

    ```python
    dataframe_blobdata['<column_name>'].value_counts()
    ```

1. **Count missing values** versus the actual number of entries in each column with the following sample code:

    ```python
    miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
    print(miss_num)
    ```

1. If you have **missing values** for a specific column in the data, you can drop them as follows:

    ```python
    dataframe_blobdata_noNA = dataframe_blobdata.dropna()
    dataframe_blobdata_noNA.shape
    ```

    Another way to replace missing values is with the mode function:

    ```python
    dataframe_blobdata_mode = dataframe_blobdata.fillna(
        {'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})
    ```

1. Create a **histogram** plot using a variable number of bins to plot the distribution of a variable:

    ```python
    import numpy as np

    dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

    np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
    ```

1. Look at **correlations** between variables using a scatter plot or the built-in correlation function:

    ```python
    import matplotlib.pyplot as plt

    # relationship between column_a and column_b using scatter plot
    plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

    # correlation between column_a and column_b
    dataframe_blobdata[['<column_a>', '<column_b>']].corr()
    ```
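The exploration steps above can also be exercised end to end on a small synthetic DataFrame. In this sketch the `city` and `rides` columns are made up for illustration only; they stand in for `dataframe_blobdata` and its `<column_name>` placeholders:

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for dataframe_blobdata;
# the column names here are hypothetical.
df = pd.DataFrame({
    "city": ["Seattle", "Austin", "Seattle", "Seattle", None],
    "rides": [12, 7, 15, np.nan, 4],
})

# rows and columns
print("the size of the data is: %d rows and %d columns" % df.shape)

# missing entries per column
miss_num = df.shape[0] - df.count()
print(miss_num)  # city: 1, rides: 1

# drop rows that contain any missing value
df_noNA = df.dropna()
print(df_noNA.shape)  # (3, 2)

# fill the missing city with the column mode instead of dropping the row
df_mode = df.fillna({"city": df["city"].mode()[0]})
print(df_mode["city"].value_counts())
```

Note that `dropna` and `fillna` return new DataFrames by default, leaving the original untouched, which is why both cleaned variants can coexist here.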
