articles/machine-learning/data-science-virtual-machine/vm-do-ten-things.md
4 additions, 359 deletions
@@ -6,10 +6,10 @@ services: machine-learning
 ms.service: machine-learning
 ms.subservice: data-science-vm
-author: vijetajo
-ms.author: vijetaj
+author: lobrien
+ms.author: laobri
 ms.topic: conceptual
-ms.date: 09/24/2018
+ms.date: 05/08/2020
 ---
@@ -27,7 +27,7 @@ In this article, you'll learn how to use your DSVM to perform data science tasks
 - Administer your Azure resources by using the Azure portal or PowerShell.
 - Extend your storage space and share large-scale datasets/code across your whole team by creating an Azure Files share as a mountable drive on your DSVM.
 - Share code with your team by using GitHub. Access your repository by using the pre-installed Git clients: Git Bash and Git GUI.
-- Access Azure data and analytics services like Azure Blob storage, Azure Data Lake, Azure HDInsight (Hadoop), Azure Cosmos DB, Azure SQL Data Warehouse, and Azure SQL Database.
+- Access Azure data and analytics services like Azure Blob storage, Azure Data Lake, Azure Cosmos DB, Azure SQL Data Warehouse, and Azure SQL Database.
 - Build reports and a dashboard by using the Power BI Desktop instance that's pre-installed on the DSVM, and deploy them in the cloud.
 - Dynamically scale your DSVM to meet your project's needs.
 - Install additional tools on your virtual machine.
@@ -447,361 +447,6 @@ The file information appears:
 
-### HDInsight Hadoop clusters
-
-Azure HDInsight is a managed Apache Hadoop, Spark, HBase, and Storm service in the cloud. You can easily work with Azure HDInsight clusters from the Data Science Virtual Machine.
-
-#### Prerequisites
-
-* Create your Azure Blob storage account from the [Azure portal](https://portal.azure.com). This storage account is used to store data for HDInsight clusters.
-
-* Customize Azure HDInsight Hadoop clusters from the [Azure portal](../team-data-science-process/customize-hadoop-cluster.md).
-
-  Link the storage account you created to your HDInsight cluster when the cluster is created. This storage account is used for accessing data that can be processed within the cluster.
-
-* Enable Remote Desktop access to the head node of the cluster after it's created. Remember the remote access credentials that you specify here, because you'll need them in the subsequent procedure.
-
-* Create an Azure Machine Learning workspace. Your Machine Learning experiments are stored in this workspace. Select the highlighted options in the portal, as shown in the following screenshot:
-
-* Upload data by using IPython Notebook. Import the required packages, plug in your credentials, create a database in your storage account, and then load data into HDI clusters.
-```python
-queryString = "create database if not exists nyctaxidb;"
-cursor.execute(queryString)
-
-queryString = """
-    create external table if not exists nyctaxidb.trip
-    (
-        medallion string,
-        hack_license string,
-        vendor_id string,
-        rate_code string,
-        store_and_fwd_flag string,
-        pickup_datetime string,
-        dropoff_datetime string,
-        passenger_count int,
-        trip_time_in_secs double,
-        trip_distance double,
-        pickup_longitude double,
-        pickup_latitude double,
-        dropoff_longitude double,
-        dropoff_latitude double)
-    PARTITIONED BY (month int)
-    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\\n'
-    STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/trip' TBLPROPERTIES('skip.header.line.count'='1');
-"""
-cursor.execute(queryString)
-
-queryString = """
-    create external table if not exists nyctaxidb.fare
-    (
-        medallion string,
-        hack_license string,
-        vendor_id string,
-        pickup_datetime string,
-        payment_type string,
-        fare_amount double,
-        surcharge double,
-        mta_tax double,
-        tip_amount double,
-        tolls_amount double,
-        total_amount double)
-    PARTITIONED BY (month int)
-    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\\n'
-    STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/fare' TBLPROPERTIES('skip.header.line.count'='1');
-"""
-cursor.execute(queryString)
-
-# Upload data from Blob storage to an HDI cluster
-for i in range(1, 13):
-    queryString = "LOAD DATA INPATH 'wasb:///nyctaxitripraw2/trip_data_%d.csv' INTO TABLE nyctaxidb2.trip PARTITION (month=%d);" % (i, i)
-    cursor.execute(queryString)
-    queryString = "LOAD DATA INPATH 'wasb:///nyctaxifareraw2/trip_fare_%d.csv' INTO TABLE nyctaxidb2.fare PARTITION (month=%d);" % (i, i)
-    cursor.execute(queryString)
-```
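As a sanity check not present in the original article, the per-month load loop can be exercised without a live cluster by generating the statements it would execute. This minimal sketch reuses the snippet's table and path names, which are the only parts taken from the source:

```python
# Sketch: build the 24 Hive LOAD DATA statements (trip and fare for each of
# 12 months) so the generated SQL can be inspected before it is passed to
# cursor.execute() against a real HDInsight cluster.
def monthly_load_statements(months=range(1, 13)):
    statements = []
    for i in months:
        statements.append(
            "LOAD DATA INPATH 'wasb:///nyctaxitripraw2/trip_data_%d.csv' "
            "INTO TABLE nyctaxidb2.trip PARTITION (month=%d);" % (i, i))
        statements.append(
            "LOAD DATA INPATH 'wasb:///nyctaxifareraw2/trip_fare_%d.csv' "
            "INTO TABLE nyctaxidb2.fare PARTITION (month=%d);" % (i, i))
    return statements

statements = monthly_load_statements()
# Each element would then be run with cursor.execute(...) as in the loop above.
```
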
-
-Alternatively, you can follow [this walkthrough](../team-data-science-process/hive-walkthrough.md) to upload NYC Taxi data to the HDI cluster. The major steps are:
-
-* Use AzCopy to download zipped CSVs from the public blob to your local folder.
-* Use AzCopy to upload unzipped CSVs from the local folder to an HDI cluster.
-* Log in to the head node of the Hadoop cluster and prepare for exploratory data analysis.
-
-After the data is loaded into the HDI cluster, you can check it in Azure Storage Explorer, and the nyctaxidb database now exists in the HDI cluster.
-
-#### Data exploration: Hive queries in Python
-
-Because the data is in a Hadoop cluster, you can use the pyodbc package to connect to the cluster and query its databases with Hive for exploration and feature engineering. You can view the existing tables that you created in the prerequisite step.
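The snippets in this section assume an open `connection` and `cursor`. A minimal sketch of how those might be obtained with pyodbc follows; the driver name, host, and credentials are placeholder assumptions, not values from the article:

```python
# Build a Hive ODBC connection string of the form pyodbc.connect() accepts.
# Every value here (driver name, host, user, password) is a placeholder --
# substitute your cluster's actual settings.
def hive_connection_string(host, user, password, port=443):
    return ("DRIVER={Microsoft Hive ODBC Driver};"
            "Host=%s;Port=%d;UID=%s;PWD=%s" % (host, port, user, password))

conn_str = hive_connection_string("mycluster.azurehdinsight.net", "admin", "<password>")
# With the driver installed, the objects used by the snippets would be:
#   connection = pyodbc.connect(conn_str, autocommit=True)
#   cursor = connection.cursor()
```
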
-```python
-        + cos(pickup_latitude*radians(180)/180)*cos(dropoff_latitude*radians(180)/180)*pow(sin((dropoff_longitude-pickup_longitude)*radians(180)/180/2),2))) as direct_distance,
-        rand() as sample_key
-    from trip
-    where pickup_latitude between 30 and 90
-      and pickup_longitude between -90 and -30
-      and dropoff_latitude between 30 and 90
-      and dropoff_longitude between -90 and -30
-    ) t
-    join
-    (
-    select
-        medallion,
-        hack_license,
-        vendor_id,
-        pickup_datetime,
-        payment_type,
-        fare_amount,
-        surcharge,
-        mta_tax,
-        tip_amount,
-        tolls_amount,
-        total_amount
-    from fare
-    ) f
-    on t.medallion = f.medallion and t.hack_license = f.hack_license and t.pickup_datetime = f.pickup_datetime
-    where t.sample_key <= 0.01
-"""
-cursor.execute(queryString)
-```
-
-After a while, you can see that the data has been loaded into the Hadoop clusters:
-
-```python
-queryString = """
-    select * from nyctaxi_downsampled_dataset limit 10;
-"""
-cursor.execute(queryString)
-pd.read_sql(queryString, connection)
-```
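The `direct_distance` expression in the deleted query is the haversine great-circle formula, with `radians(180)/180` serving as the degrees-to-radians factor. A hedged Python restatement for reference; the Earth-radius constant is an assumption, since the query's own scaling constant is not visible in the fragment above:

```python
from math import radians, sin, cos, asin, sqrt

# Haversine distance mirroring the query's direct_distance expression.
# EARTH_RADIUS_KM is an assumed constant; the query fragment elides its value.
EARTH_RADIUS_KM = 6371.0

def direct_distance(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    lat1, lon1, lat2, lon2 = map(
        radians, (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude along the equator is roughly 111 km:
d = direct_distance(0.0, 0.0, 0.0, 1.0)
```
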
 
 ### Azure SQL Data Warehouse and databases
 
 Azure SQL Data Warehouse is an elastic data warehouse as a service with an enterprise-class SQL Server experience.