articles/hdinsight/spark/apache-spark-custom-library-website-log-analysis.md
---
title: Analyze website logs with Python libraries in Spark - Azure
description: This notebook demonstrates how to analyze log data using a custom library with Spark on Azure HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 11/28/2017
---

# Analyze website logs using a custom Python library with Apache Spark cluster on HDInsight
This notebook demonstrates how to analyze log data using a custom library with Apache Spark on HDInsight. The custom library we use is a Python library called **iislogparser.py**.
> [!TIP]
> This article is also available as a Jupyter notebook on a Spark (Linux) cluster that you create in HDInsight. The notebook experience lets you run the Python snippets from the notebook itself. To run the steps in this article from within a notebook, create a Spark cluster, launch a Jupyter notebook (`https://CLUSTERNAME.azurehdinsight.net/jupyter`), and then run the notebook **Analyze logs with Spark using a custom library.ipynb** under the **PySpark** folder.
**Prerequisites:**
Once your data is saved as an Apache Hive table, in the next section we will connect to the Hive table using BI tools such as Power BI and Tableau.
1. From the [Azure portal](https://portal.azure.com/), from the startboard, click the tile for your Spark cluster (if you pinned it to the startboard). You can also navigate to your cluster under **Browse All** > **HDInsight Clusters**.

2. From the Spark cluster blade, click **Cluster Dashboard**, and then click **Jupyter Notebook**. If prompted, enter the admin credentials for the cluster.
   > [!NOTE]
   > You may also reach the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace **CLUSTERNAME** with the name of your cluster:
   >
   > `https://CLUSTERNAME.azurehdinsight.net/jupyter`

3. Create a new notebook. Click **New**, and then click **PySpark**.

   ![Create a new Jupyter notebook](./media/apache-spark-custom-library-website-log-analysis/hdinsight-spark-create-jupyter-interactive-spark-sql-query.png "Create a new Jupyter notebook")

4. A new notebook is created and opened with the name Untitled.ipynb. Click the notebook name at the top, and enter a friendly name.

   ![Provide a name for the notebook](./media/apache-spark-custom-library-website-log-analysis/hdinsight-name-jupyter-notebook.png "Provide a name for the notebook")

5. Import the types that are required for this scenario by running the following snippet:

        from pyspark.sql import Row
        from pyspark.sql.types import *

6. Create an RDD using the sample log data already available on the cluster. You can access the data in the default storage account associated with the cluster at **\HdiSamples\HdiSamples\WebsiteLogSampleData\SampleLog\909f2b.log**.
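
   The article's exact snippet is not included in this excerpt. A minimal sketch of this step, assuming the sample file is read from the default storage account over a `wasbs:///` path and using `logs` as the variable name (both assumptions):

        # Create an RDD from the sample IIS log file in the cluster's default storage.
        logs = sc.textFile('wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log')

        # Peek at a few raw lines; the first lines contain the IIS header rows.
        logs.take(5)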
1. In the output above, the first couple of lines include the header information, and each remaining line matches the schema described in that header. Parsing such logs could be complicated, so we use a custom Python library (**iislogparser.py**) that makes parsing such logs much easier. By default, this library is included with your Spark cluster on HDInsight at **/HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py**.

   However, this library is not in the `PYTHONPATH`, so we cannot use it with an import statement like `import iislogparser`. To use this library, we must distribute it to all the worker nodes. Run the following snippet.
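
   The snippet itself is not reproduced in this excerpt. A minimal sketch of what it does, using PySpark's standard `sc.addPyFile` call and the library path named above (the `wasbs:///` prefix is an assumption):

        # Ship iislogparser.py to every worker node so it can be imported
        # inside RDD transformations.
        sc.addPyFile('wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py')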
1. The `LogLine` class, in turn, has some useful methods, like `is_error()`, which returns whether a log entry has an error code. Use this to compute the number of errors in the extracted log lines, and then log all the errors to a different file, as sketched below.
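
   A minimal sketch of this computation, assuming `logLines` is an RDD of parsed `LogLine` objects produced in the earlier steps (the variable name and the output path are assumptions, not the article's exact code):

        # Keep only the entries whose status code indicates an error.
        errors = logLines.filter(lambda line: line.is_error())

        numLines = logLines.count()
        numErrors = errors.count()
        print('There are %d errors and %d log entries' % (numErrors, numLines))

        # Write just the error entries to a separate file in the default storage account.
        errors.map(lambda line: str(line)).saveAsTextFile(
            'wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b-errors.log')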

   The output is similar to:

        # -----------------
        There are 30 errors and 646 log entries

1. You can also use **Matplotlib** to construct a visualization of the data. For example, if you want to isolate the cause of requests that run for a long time, you might want to find the files that take the most time to serve on average. The snippet below retrieves the top 25 resources that took the most time to serve a request.
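
   That snippet is not reproduced here. A sketch of the idea, assuming each `LogLine` exposes the requested path and the time taken to serve it as attributes (the attribute names below are assumptions):

        # Average time taken per requested resource.
        avg_time_by_path = (
            logLines
            .map(lambda line: (line.path, (line.time_taken, 1)))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda totals: totals[0] / float(totals[1]))
        )

        # The 25 resources with the highest average serve time.
        avg_time_by_path.top(25, key=lambda kv: kv[1])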

   The tail of the output looks like this (earlier entries omitted):

        (u'/blogposts/sqlvideos/sqlvideos.jpg', 102.0),
        (u'/blogposts/mvcrouting/step21.jpg', 101.0),
        (u'/blogposts/mvc4/step1.png', 98.0)]

1. You can also present this information in the form of a plot. As a first step to create a plot, let us first create a temporary table **AverageTime**. The table groups the logs by time to see if there were any unusual latency spikes at any particular time.
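
   The query itself is not shown in this excerpt. A sketch of the kind of `%%sql` cell used for this step, assuming the parsed logs were registered earlier as a table (the table and column names below are placeholders), and using the `-o` flag to persist the result locally as **averagetime**:

        %%sql -q -o averagetime
        SELECT datetime, AVG(time_taken) AS averagetime
        FROM weblogs
        GROUP BY datetime
        ORDER BY datetime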
   For more information about the `%%sql` magic, see [Parameters supported with the %%sql magic](apache-spark-jupyter-notebook-kernels.md#parameters-supported-with-the-sql-magic).

1. You can now use Matplotlib, a library used to construct visualizations of data, to create a plot. Because the plot must be created from the locally persisted **averagetime** dataframe, the code snippet must begin with the `%%local` magic. This ensures that the code is run locally on the Jupyter server.

        %%local
        %matplotlib inline
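        # (The rest of the article's snippet is not shown in this excerpt. The lines
        #  below are a minimal sketch, assuming 'averagetime' is the dataframe that
        #  the -o flag of the %%sql magic persisted locally, with an 'averagetime'
        #  column as in the placeholder query sketched above.)
        import matplotlib.pyplot as plt

        plt.plot(averagetime['averagetime'])
        plt.xlabel('Time')
        plt.ylabel('Average time taken to serve a request')
        plt.show()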

   ![hdinsight-apache-spark-web-log-analysis-plot](./media/apache-spark-custom-library-website-log-analysis/hdinsight-apache-spark-web-log-analysis-plot.png "Analyze website log data")

1. After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, click **Close and Halt**. This shuts down and closes the notebook.

## <a name="seealso"></a>See also

* [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)