Commit 43d424a

Pulled HDInsight, as it's not supported anymore

1 parent 9c1922d

1 file changed (+4 −359 lines)

1 file changed

+4
-359
lines changed

articles/machine-learning/data-science-virtual-machine/vm-do-ten-things.md

Lines changed: 4 additions & 359 deletions
@@ -6,10 +6,10 @@ services: machine-learning
 ms.service: machine-learning
 ms.subservice: data-science-vm
 
-author: vijetajo
-ms.author: vijetaj
+author: lobrien
+ms.author: laobri
 ms.topic: conceptual
-ms.date: 09/24/2018
+ms.date: 05/08/2020
 
 ---
 
@@ -27,7 +27,7 @@ In this article, you'll learn how to use your DSVM to perform data science tasks
 - Administer your Azure resources by using the Azure portal or PowerShell.
 - Extend your storage space and share large-scale datasets/code across your whole team by creating an Azure Files share as a mountable drive on your DSVM.
 - Share code with your team by using GitHub. Access your repository by using the pre-installed Git clients: Git Bash and Git GUI.
-- Access Azure data and analytics services like Azure Blob storage, Azure Data Lake, Azure HDInsight (Hadoop), Azure Cosmos DB, Azure SQL Data Warehouse, and Azure SQL Database.
+- Access Azure data and analytics services like Azure Blob storage, Azure Data Lake, Azure Cosmos DB, Azure SQL Data Warehouse, and Azure SQL Database.
 - Build reports and a dashboard by using the Power BI Desktop instance that's pre-installed on the DSVM, and deploy them in the cloud.
 - Dynamically scale your DSVM to meet your project's needs.
 - Install additional tools on your virtual machine.
@@ -447,361 +447,6 @@ The file information appears:
 
 ![Screenshot of the file summary information](./media/vm-do-ten-things/USQL_tripdata_summary.png)
 
-### HDInsight Hadoop clusters
-Azure HDInsight is a managed Apache Hadoop, Spark, HBase, and Storm service in the cloud. You can work easily with Azure HDInsight clusters from the Data Science Virtual Machine.
-
-#### Prerequisites
-
-* Create your Azure Blob storage account from the [Azure portal](https://portal.azure.com). This storage account is used to store data for HDInsight clusters.
-
-![Screenshot of creating a storage account from the Azure portal](./media/vm-do-ten-things/Create_Azure_Blob.PNG)
-
-* Customize Azure HDInsight Hadoop clusters from the [Azure portal](../team-data-science-process/customize-hadoop-cluster.md).
-
-Link the storage account created with your HDInsight cluster when it's created. This storage account is used for accessing data that can be processed within the cluster.
-
-![Selections for linking the storage account created with an HDInsight cluster](./media/vm-do-ten-things/Create_HDI_v4.PNG)
-
-* Enable Remote Desktop access to the head node of the cluster after it's created. Remember the remote access credentials that you specify here, because you'll need them in the subsequent procedure.
-
-![Remote Desktop button for enabling remote access to the HDInsight cluster](./media/vm-do-ten-things/Create_HDI_dashboard_v3.PNG)
-
-* Create an Azure Machine Learning workspace. Your Machine Learning experiments are stored in this Machine Learning workspace. Select the highlighted options in the portal, as shown in the following screenshot:
-
-![Create an Azure Machine Learning workspace](./media/vm-do-ten-things/Create_ML_Space.PNG)
-
-* Enter the parameters for your workspace.
-
-![Enter Machine Learning workspace parameters](./media/vm-do-ten-things/Create_ML_Space_step2_v2.PNG)
-
-* Upload data by using IPython Notebook. Import required packages, plug in credentials, create a database in your storage account, and then load data into HDI clusters.
-
-```python
-# Import required packages
-import pyodbc
-import time as time
-import json
-import os
-import urllib
-import urllib2
-import warnings
-import re
-import pandas as pd
-import matplotlib.pyplot as plt
-from azure.storage.blob import BlobService
-warnings.filterwarnings("ignore", category=UserWarning, module='urllib2')
-
-
-# Create the connection to Hive by using ODBC
-SERVER_NAME = 'xxx.azurehdinsight.net'
-DATABASE_NAME = 'nyctaxidb'
-USERID = 'xxx'
-PASSWORD = 'xxxx'
-DB_DRIVER = 'Microsoft Hive ODBC Driver'
-driver = 'DRIVER={' + DB_DRIVER + '}'
-server = 'Host=' + SERVER_NAME + ';Port=443'
-database = 'Schema=' + DATABASE_NAME
-hiveserv = 'HiveServerType=2'
-auth = 'AuthMech=6'
-uid = 'UID=' + USERID
-pwd = 'PWD=' + PASSWORD
-CONNECTION_STRING = ';'.join(
-    [driver, server, database, hiveserv, auth, uid, pwd])
-connection = pyodbc.connect(CONNECTION_STRING, autocommit=True)
-cursor = connection.cursor()
-
-
-# Create the Hive database and tables
-queryString = "create database if not exists nyctaxidb;"
-cursor.execute(queryString)
-
-queryString = """
-    create external table if not exists nyctaxidb.trip
-    (
-        medallion string,
-        hack_license string,
-        vendor_id string,
-        rate_code string,
-        store_and_fwd_flag string,
-        pickup_datetime string,
-        dropoff_datetime string,
-        passenger_count int,
-        trip_time_in_secs double,
-        trip_distance double,
-        pickup_longitude double,
-        pickup_latitude double,
-        dropoff_longitude double,
-        dropoff_latitude double)
-    PARTITIONED BY (month int)
-    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\\n'
-    STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/trip' TBLPROPERTIES('skip.header.line.count'='1');
-    """
-cursor.execute(queryString)
-
-queryString = """
-    create external table if not exists nyctaxidb.fare
-    (
-        medallion string,
-        hack_license string,
-        vendor_id string,
-        pickup_datetime string,
-        payment_type string,
-        fare_amount double,
-        surcharge double,
-        mta_tax double,
-        tip_amount double,
-        tolls_amount double,
-        total_amount double)
-    PARTITIONED BY (month int)
-    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\\n'
-    STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/fare' TBLPROPERTIES('skip.header.line.count'='1');
-    """
-cursor.execute(queryString)
-
-
-# Upload data from Blob storage to an HDI cluster
-for i in range(1, 13):
-    queryString = "LOAD DATA INPATH 'wasb:///nyctaxitripraw2/trip_data_%d.csv' INTO TABLE nyctaxidb2.trip PARTITION (month=%d);" % (
-        i, i)
-    cursor.execute(queryString)
-    queryString = "LOAD DATA INPATH 'wasb:///nyctaxifareraw2/trip_fare_%d.csv' INTO TABLE nyctaxidb2.fare PARTITION (month=%d);" % (
-        i, i)
-    cursor.execute(queryString)
-```
-
-Alternatively, you can follow [this walkthrough](../team-data-science-process/hive-walkthrough.md) to upload NYC Taxi data to the HDI cluster. Major steps include:
-
-* Use AzCopy to download zipped CSVs from the public blob to your local folder.
-* Use AzCopy to upload unzipped CSVs from the local folder to an HDI cluster.
-* Log in to the head node of Hadoop cluster and prepare for exploratory data analysis.
-
-After the data is loaded into the HDI cluster, you can check your data in Azure Storage Explorer. And the nyctaxidb database has been created in the HDI cluster.
-
-#### Data exploration: Hive Queries in Python
-
-Because the data is in a Hadoop cluster, you can use the pyodbc package to connect to Hadoop clusters and query databases by using Hive to do exploration and feature engineering. You can view the existing tables that you created in the prerequisite step.
-
-```python
-queryString = """
-    show tables in nyctaxidb2;
-    """
-pd.read_sql(queryString, connection)
-```
-
-![View existing tables](./media/vm-do-ten-things/Python_View_Existing_Tables_Hive_v3.PNG)
-
-Let's look at the number of records in each month and the frequencies of tipped or not in the trip table:
-
-```python
-queryString = """
-    select month, count(*) from nyctaxidb.trip group by month;
-    """
-results = pd.read_sql(queryString, connection)
-
-%matplotlib inline
-
-results.columns = ['month', 'trip_count']
-df = results.copy()
-df.index = df['month']
-df['trip_count'].plot(kind='bar')
-```
-
-![Plot of number of records in each month](./media/vm-do-ten-things/Exploration_Number_Records_by_Month_v3.PNG)
-
-```python
-queryString = """
-    SELECT tipped, COUNT(*) AS tip_freq
-    FROM
-    (
-        SELECT if(tip_amount > 0, 1, 0) as tipped, tip_amount
-        FROM nyctaxidb.fare
-    )tc
-    GROUP BY tipped;
-    """
-results = pd.read_sql(queryString, connection)
-
-results.columns = ['tipped', 'trip_count']
-df = results.copy()
-df.index = df['tipped']
-df['trip_count'].plot(kind='bar')
-```
-
-![Plot of tip frequencies](./media/vm-do-ten-things/Exploration_Frequency_tip_or_not_v3.PNG)
-
-You can also compute the distance between pickup location and drop-off location, and then compare it to the trip distance.
-
-```python
-queryString = """
-    select pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, trip_distance, trip_time_in_secs,
-        3959*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-pickup_latitude)
-        *radians(180)/180/2),2)-cos(pickup_latitude*radians(180)/180)
-        *cos(dropoff_latitude*radians(180)/180)*pow(sin((dropoff_longitude-pickup_longitude)*radians(180)/180/2),2)))
-        /sqrt(pow(sin((dropoff_latitude-pickup_latitude)*radians(180)/180/2),2)
-        +cos(pickup_latitude*radians(180)/180)*cos(dropoff_latitude*radians(180)/180)*
-        pow(sin((dropoff_longitude-pickup_longitude)*radians(180)/180/2),2))) as direct_distance
-    from nyctaxidb.trip
-    where month=1
-        and pickup_longitude between -90 and -30
-        and pickup_latitude between 30 and 90
-        and dropoff_longitude between -90 and -30
-        and dropoff_latitude between 30 and 90;
-    """
-results = pd.read_sql(queryString, connection)
-results.head(5)
-```
-
-![Top rows of the pickup and drop-off table](./media/vm-do-ten-things/Exploration_compute_pickup_dropoff_distance_v2.PNG)
-
-```python
-results.columns = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
-                   'dropoff_latitude', 'trip_distance', 'trip_time_in_secs', 'direct_distance']
-df = results.loc[results['trip_distance'] <= 100]  # remove outliers
-df = df.loc[df['direct_distance'] <= 100]  # remove outliers
-plt.scatter(df['direct_distance'], df['trip_distance'])
-```
-
-![Plot of pickup/drop-off distance to trip distance](./media/vm-do-ten-things/Exploration_direct_distance_trip_distance_v2.PNG)
-
-Now let's prepare a downsampled (1 percent) set of data for modeling. You can use this data in the Machine Learning reader module.
-
-```python
-queryString = """
-    create table if not exists nyctaxi_downsampled_dataset_testNEW (
-        medallion string,
-        hack_license string,
-        vendor_id string,
-        rate_code string,
-        store_and_fwd_flag string,
-        pickup_datetime string,
-        dropoff_datetime string,
-        pickup_hour string,
-        pickup_week string,
-        weekday string,
-        passenger_count int,
-        trip_time_in_secs double,
-        trip_distance double,
-        pickup_longitude double,
-        pickup_latitude double,
-        dropoff_longitude double,
-        dropoff_latitude double,
-        direct_distance double,
-        payment_type string,
-        fare_amount double,
-        surcharge double,
-        mta_tax double,
-        tip_amount double,
-        tolls_amount double,
-        total_amount double,
-        tipped string,
-        tip_class string
-    )
-    row format delimited fields terminated by ','
-    lines terminated by '\\n'
-    stored as textfile;
-    """
-cursor.execute(queryString)
-```
-
-Now insert contents of the join into the preceding internal table.
-
-```python
-queryString = """
-    insert overwrite table nyctaxi_downsampled_dataset_testNEW
-    select
-        t.medallion,
-        t.hack_license,
-        t.vendor_id,
-        t.rate_code,
-        t.store_and_fwd_flag,
-        t.pickup_datetime,
-        t.dropoff_datetime,
-        hour(t.pickup_datetime) as pickup_hour,
-        weekofyear(t.pickup_datetime) as pickup_week,
-        from_unixtime(unix_timestamp(t.pickup_datetime, 'yyyy-MM-dd HH:mm:ss'),'u') as weekday,
-        t.passenger_count,
-        t.trip_time_in_secs,
-        t.trip_distance,
-        t.pickup_longitude,
-        t.pickup_latitude,
-        t.dropoff_longitude,
-        t.dropoff_latitude,
-        t.direct_distance,
-        f.payment_type,
-        f.fare_amount,
-        f.surcharge,
-        f.mta_tax,
-        f.tip_amount,
-        f.tolls_amount,
-        f.total_amount,
-        if(tip_amount>0,1,0) as tipped,
-        if(tip_amount=0,0,
-            if(tip_amount>0 and tip_amount<=5,1,
-                if(tip_amount>5 and tip_amount<=10,2,
-                    if(tip_amount>10 and tip_amount<=20,3,4)))) as tip_class
-    from
-    (
-        select
-            medallion,
-            hack_license,
-            vendor_id,
-            rate_code,
-            store_and_fwd_flag,
-            pickup_datetime,
-            dropoff_datetime,
-            passenger_count,
-            trip_time_in_secs,
-            trip_distance,
-            pickup_longitude,
-            pickup_latitude,
-            dropoff_longitude,
-            dropoff_latitude,
-            3959*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-pickup_latitude)
-            *radians(180)/180/2),2)-cos(pickup_latitude*radians(180)/180)
-            *cos(dropoff_latitude*radians(180)/180)*pow(sin((dropoff_longitude-pickup_longitude)*radians(180)/180/2),2)))
-            /sqrt(pow(sin((dropoff_latitude-pickup_latitude)*radians(180)/180/2),2)
-            +cos(pickup_latitude*radians(180)/180)*cos(dropoff_latitude*radians(180)/180)*pow(sin((dropoff_longitude-pickup_longitude)*radians(180)/180/2),2))) as direct_distance,
-            rand() as sample_key
-        from trip
-        where pickup_latitude between 30 and 90
-            and pickup_longitude between -90 and -30
-            and dropoff_latitude between 30 and 90
-            and dropoff_longitude between -90 and -30
-    )t
-    join
-    (
-        select
-            medallion,
-            hack_license,
-            vendor_id,
-            pickup_datetime,
-            payment_type,
-            fare_amount,
-            surcharge,
-            mta_tax,
-            tip_amount,
-            tolls_amount,
-            total_amount
-        from fare
-    )f
-    on t.medallion=f.medallion and t.hack_license=f.hack_license and t.pickup_datetime=f.pickup_datetime
-    where t.sample_key<=0.01
-    """
-cursor.execute(queryString)
-```
-
-After a while, you can see that the data has been loaded in Hadoop clusters:
-
-```python
-queryString = """
-    select * from nyctaxi_downsampled_dataset limit 10;
-    """
-cursor.execute(queryString)
-pd.read_sql(queryString, connection)
-```
-
-![Top rows of data from the table](./media/vm-do-ten-things/DownSample_Data_For_Modeling_v2.PNG)
-
 ### Azure SQL Data Warehouse and databases
 Azure SQL Data Warehouse is an elastic data warehouse as a service with an enterprise-class SQL Server experience.
 
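A note on the removed `direct_distance` expression: the `3959*2*2*atan(...)` formula in the deleted Hive queries is algebraically the haversine great-circle distance with an Earth radius of 3,959 miles, since `4*atan((1-sqrt(1-a))/sqrt(a))` simplifies to `2*asin(sqrt(a))` (a half-angle identity). A minimal Python sketch of the same computation, for sanity-checking; the helper name `direct_distance_miles` is illustrative and not from the original article:

```python
import math

EARTH_RADIUS_MILES = 3959  # radius constant used by the removed Hive queries


def direct_distance_miles(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Great-circle distance in miles via the haversine formula.

    Equivalent to the removed Hive expression, because
    4*atan((1-sqrt(1-a))/sqrt(a)) == 2*asin(sqrt(a)) for 0 <= a <= 1.
    """
    lat1, lon1, lat2, lon2 = map(
        math.radians, [pickup_lat, pickup_lon, dropoff_lat, dropoff_lon])
    # a is the haversine of the central angle between the two points
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_MILES * 2 * math.asin(math.sqrt(a))
```

One degree of latitude is roughly 69 miles at this radius (3959 × π/180 ≈ 69.1), which gives a quick check that the formula and the radians conversion are right.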