
Commit d1f8b0a

Merge pull request #89153 from dagiro/cats157

cats157

2 parents 320a3f3 + fbde97d

File tree

1 file changed: +25 -23 lines changed


articles/hdinsight/spark/apache-spark-custom-library-website-log-analysis.md

Lines changed: 25 additions & 23 deletions
@@ -2,23 +2,20 @@
title: Analyze website logs with Python libraries in Spark - Azure
description: This notebook demonstrates how to analyze log data using a custom library with Spark on Azure HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 11/28/2017
---

# Analyze website logs using a custom Python library with Apache Spark cluster on HDInsight

This notebook demonstrates how to analyze log data using a custom library with Apache Spark on HDInsight. The custom library we use is a Python library called **iislogparser.py**.

> [!TIP]
> This article is also available as a Jupyter notebook on a Spark (Linux) cluster that you create in HDInsight. The notebook experience lets you run the Python snippets from the notebook itself. To work through the article from within a notebook, create a Spark cluster, launch a Jupyter notebook (`https://CLUSTERNAME.azurehdinsight.net/jupyter`), and then run the notebook **Analyze logs with Spark using a custom library.ipynb** under the **PySpark** folder.

**Prerequisites:**

@@ -33,18 +30,19 @@ In this section, we use the [Jupyter](https://jupyter.org) notebook associated w

Once your data is saved as an Apache Hive table, in the next section we will connect to the Hive table using BI tools such as Power BI and Tableau.

1. From the [Azure portal](https://portal.azure.com/), from the startboard, click the tile for your Spark cluster (if you pinned it to the startboard). You can also navigate to your cluster under **Browse All** > **HDInsight Clusters**.

2. From the Spark cluster blade, click **Cluster Dashboard**, and then click **Jupyter Notebook**. If prompted, enter the admin credentials for the cluster.

    > [!NOTE]
    > You may also reach the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace **CLUSTERNAME** with the name of your cluster:
    >
    > `https://CLUSTERNAME.azurehdinsight.net/jupyter`

3. Create a new notebook. Click **New**, and then click **PySpark**.

    ![Create a new Apache Jupyter notebook](./media/apache-spark-custom-library-website-log-analysis/hdinsight-create-jupyter-notebook.png "Create a new Jupyter notebook")

4. A new notebook is created and opened with the name Untitled.ipynb. Click the notebook name at the top, and enter a friendly name.

    ![Provide a name for the notebook](./media/apache-spark-custom-library-website-log-analysis/hdinsight-name-jupyter-notebook.png "Provide a name for the notebook")
@@ -53,15 +51,13 @@ Once your data is saved as an Apache Hive table, in the next section we will con
        from pyspark.sql import Row
        from pyspark.sql.types import *

6. Create an RDD using the sample log data already available on the cluster. You can access the data in the default storage account associated with the cluster at **\HdiSamples\HdiSamples\WebsiteLogSampleData\SampleLog\909f2b.log**.

        logs = sc.textFile('wasb:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log')

7. Retrieve a sample log set to verify that the previous step completed successfully.

        logs.take(5)
@@ -79,14 +75,14 @@ Once your data is saved as an Apache Hive table, in the next section we will con

        u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step4.png X-ARR-LOG-ID=4bea5b3d-8ac9-46c9-9b8c-ec3e9500cbea 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 72177 871 47']
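
    As an optional extra check (not part of the original snippet), you could also count the raw lines read from the sample file; a minimal sketch, assuming the same `logs` RDD:

        # Optional sanity check: total number of raw lines in the sample log file.
        logs.count()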

## Analyze log data using a custom Python library

1. In the output above, the first couple of lines include the header information and each remaining line matches the schema described in that header. Parsing such logs can be complicated, so we use a custom Python library (**iislogparser.py**) that makes parsing such logs much easier. By default, this library is included with your Spark cluster on HDInsight at **/HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py**.

    However, this library is not in the `PYTHONPATH`, so we cannot use it with an import statement like `import iislogparser`. To use this library, we must distribute it to all the worker nodes. Run the following snippet.

        sc.addPyFile('wasb:///HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py')

1. `iislogparser` provides a function `parse_log_line` that returns `None` if a log line is a header row, and returns an instance of the `LogLine` class if it encounters a log line. Use the `LogLine` class to extract only the log lines
@@ -96,7 +92,8 @@ Once your data is saved as an Apache Hive table, in the next section we will con
            import iislogparser
            return iislogparser.parse_log_line(l)
        logLines = logs.map(parse_line).filter(lambda p: p is not None).cache()
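
    The opening line of this snippet falls outside the diff hunk; as a sketch, the complete parsing step (reusing the `parse_line` name that appears in the `map` call) would look roughly like this:

        # Sketch of the complete snippet; only the function header is reconstructed here.
        def parse_line(l):
            # Import on the workers, where sc.addPyFile distributed iislogparser.py.
            import iislogparser
            return iislogparser.parse_log_line(l)

        # Keep only real log lines (parse_log_line returns None for header rows) and cache them.
        logLines = logs.map(parse_line).filter(lambda p: p is not None).cache()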

1. Retrieve a couple of extracted log lines to verify that the step completed successfully.

        logLines.take(2)
@@ -109,7 +106,8 @@ Once your data is saved as an Apache Hive table, in the next section we will con

        [2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step2.png X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 53175 871 46,
        2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step3.png X-ARR-LOG-ID=9eace870-2f49-4efd-b204-0d170da46b4a 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 51237 871 32]

1. The `LogLine` class, in turn, has some useful methods, like `is_error()`, which returns whether a log entry has an error code. Use this to compute the number of errors in the extracted log lines, and then log all the errors to a different file.
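
    The code for this step sits between diff hunks and is not shown above. A minimal sketch, assuming the `logLines` RDD from earlier and that `is_error()` returns a boolean (the output path below is illustrative, not the article's):

        # Split the parsed log lines into errors and count both sets.
        errors = logLines.filter(lambda p: p.is_error())
        numLines = logLines.count()
        numErrors = errors.count()
        print('There are %d errors and %d log entries' % (numErrors, numLines))

        # Persist the error lines to a separate (hypothetical) location for later inspection.
        errors.map(lambda p: str(p)).saveAsTextFile('wasb:///tmp/website-log-errors')
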
@@ -127,7 +125,7 @@ Once your data is saved as an Apache Hive table, in the next section we will con

        # -----------------

        There are 30 errors and 646 log entries

1. You can also use **Matplotlib** to construct a visualization of the data. For example, if you want to isolate the cause of requests that run for a long time, you might want to find the files that take the most time to serve on average. The snippet below retrieves the top 25 resources that took most time to serve a
@@ -172,15 +170,17 @@ Once your data is saved as an Apache Hive table, in the next section we will con

        (u'/blogposts/sqlvideos/sqlvideos.jpg', 102.0),
        (u'/blogposts/mvcrouting/step21.jpg', 101.0),
        (u'/blogposts/mvc4/step1.png', 98.0)]
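
    The helper that produces these averages is defined between diff hunks, so it does not appear above. A rough, hypothetical sketch of such a helper (the `p.path` and `p.time_taken` attribute names are assumptions, not confirmed by this diff):

        # Hypothetical average-time-per-key helper: accumulate (sum, count) per key,
        # then divide to get the mean time taken for each key.
        def avgTimeTakenByKey(rdd):
            return rdd.combineByKey(
                lambda line: (line.time_taken, 1),
                lambda acc, line: (acc[0] + line.time_taken, acc[1] + 1),
                lambda a, b: (a[0] + b[0], a[1] + b[1])
            ).map(lambda kv: (kv[0], float(kv[1][0]) / kv[1][1]))

        # Top 25 resources by average time to serve, keyed on the request path.
        avgTimeTakenByKey(logLines.map(lambda p: (p.path, p))).top(25, key=lambda x: x[1])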

1. You can also present this information in the form of a plot. As a first step to create a plot, let us first create a temporary table **AverageTime**. The table groups the logs by time to see if there were any unusual latency spikes at any particular time.

        avgTimeTakenByMinute = avgTimeTakenByKey(logLines.map(lambda p: (p.datetime.minute, p))).sortByKey()
        schema = StructType([StructField('Minutes', IntegerType(), True),
                             StructField('Time', FloatType(), True)])

        avgTimeTakenByMinuteDF = sqlContext.createDataFrame(avgTimeTakenByMinute, schema)
        avgTimeTakenByMinuteDF.registerTempTable('AverageTime')

1. You can then run the following SQL query to get all the records in the **AverageTime** table.

        %%sql -o averagetime
        SELECT * FROM AverageTime
@@ -189,10 +189,11 @@ Once your data is saved as an Apache Hive table, in the next section we will con

    You should see an output like the following:

    ![HDInsight Jupyter SQL query output](./media/apache-spark-custom-library-website-log-analysis/hdinsight-jupyter-sql-qyery-output.png "SQL query output")

    For more information about the `%%sql` magic, see [Parameters supported with the %%sql magic](apache-spark-jupyter-notebook-kernels.md#parameters-supported-with-the-sql-magic).

1. You can now use Matplotlib, a library used to construct visualizations of data, to create a plot. Because the plot must be created from the locally persisted **averagetime** dataframe, the code snippet must begin with the `%%local` magic. This ensures that the code is run locally on the Jupyter server.

        %%local
        %matplotlib inline
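        # The rest of this cell falls outside the diff hunk. As a hypothetical sketch:
        # `%%sql -o averagetime` above exported the query result to the local Jupyter
        # context as a pandas DataFrame named `averagetime`, which can be plotted directly.
        import matplotlib.pyplot as plt
        plt.plot(averagetime['Minutes'], averagetime['Time'], marker='o')
        plt.xlabel('Minutes')
        plt.ylabel('Time')
        plt.show()
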
@@ -204,8 +205,9 @@ Once your data is saved as an Apache Hive table, in the next section we will con

    You should see an output like the following:

    ![Apache Spark web log analysis plot](./media/apache-spark-custom-library-website-log-analysis/hdinsight-apache-spark-web-log-analysis-plot.png "Matplotlib output")

1. After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, click **Close and Halt**. This shuts down and closes the notebook.

## <a name="seealso"></a>See also
* [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)
