Commit 0c342c8 (merge of parents 180a075 and 9a2764c)

Merge pull request #102592 from hrasheed-msft/hdinsight_reinstate_pig

hdinsight restoring pig article

File tree

3 files changed: +114 −1 lines changed


articles/hdinsight/TOC.yml

Lines changed: 3 additions & 1 deletion
```diff
@@ -431,6 +431,8 @@
   items:
   - name: Use Apache Hadoop sandbox
     href: ./hadoop/apache-hadoop-emulator-get-started.md
+  - name: Use Apache Pig
+    href: use-pig.md
   - name: Develop
     items:
     - name: Use MapReduce with Apache Hadoop
@@ -477,7 +479,7 @@
     href: ./hadoop/apache-hadoop-mahout-linux-mac.md
   - name: Advanced analytics for HDInsight
     href: ./hadoop/apache-hadoop-deep-dive-advanced-analytics.md
-  - name: Manage
+  - name: Manage
```
Binary file (18.3 KB) not shown

articles/hdinsight/use-pig.md

Lines changed: 111 additions & 0 deletions
---
title: Use Apache Pig
titleSuffix: Azure HDInsight
description: Learn how to use Pig with Apache Hadoop on HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 01/28/2020
---

# Use Apache Pig with Apache Hadoop on HDInsight

Learn how to use [Apache Pig](https://pig.apache.org/) with HDInsight.

Apache Pig is a platform for creating programs for Apache Hadoop by using a procedural language known as *Pig Latin*. Pig is an alternative to Java for creating *MapReduce* solutions, and it's included with Azure HDInsight.
## <a id="why"></a>Why use Apache Pig

One of the challenges of processing data by using MapReduce in Hadoop is implementing your processing logic by using only a map and a reduce function. For complex processing, you often have to break the processing into multiple chained MapReduce operations to achieve the desired result.

Pig instead allows you to define processing as a series of transformations that the data flows through to produce the desired output.

The Pig Latin language lets you describe the data flow from raw input, through one or more transformations, to the desired output. Pig Latin programs follow this general pattern:

* **Load**: Read the data to be manipulated from the file system.

* **Transform**: Manipulate the data.

* **Dump or store**: Output the data to the screen or store it for later processing.
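As an illustration only (not Pig itself), the same load/transform/store pattern can be sketched in Python; the `run_pipeline` function and the sample file are hypothetical stand-ins:

```python
import os
import tempfile

def run_pipeline(input_path):
    # Load: read raw lines from the file system.
    with open(input_path) as f:
        lines = f.read().splitlines()
    # Transform: keep only non-empty lines and upper-case them.
    transformed = [line.upper() for line in lines if line.strip()]
    # Dump: output the result to the screen.
    for line in transformed:
        print(line)
    return transformed

# Usage with a temporary input file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("hello\n\nworld\n")
result = run_pipeline(path)
```

In a real Pig job the same three stages are expressed declaratively, and Pig compiles them into MapReduce operations for you.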
### User-defined functions

Pig Latin also supports user-defined functions (UDFs), which let you invoke external components that implement logic that is difficult to model in Pig Latin.

For more information about Pig Latin, see [Pig Latin Reference Manual 1](https://archive.cloudera.com/cdh/3/pig/piglatin_ref1.html) and [Pig Latin Reference Manual 2](https://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html).
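Pig can register UDFs written in several languages, including Python scripts run through Jython. As a sketch only, here is the kind of helper logic you might wrap in a UDF, written as a plain Python function (the `normalize_level` name and the mapping table are hypothetical):

```python
# Hypothetical helper logic a Pig UDF might implement:
# map free-form severity strings to a canonical log level.
CANONICAL = {"warning": "WARN", "err": "ERROR", "information": "INFO"}

def normalize_level(raw):
    """Return a canonical log level for a raw severity string."""
    cleaned = raw.strip().lower()
    return CANONICAL.get(cleaned, cleaned.upper())

print(normalize_level(" Warning "))  # WARN
```

In Pig Latin, such a script can be registered with a statement like `REGISTER 'script.py' USING jython AS myfuncs;` (a sketch; check your cluster's Jython support before relying on it).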
## <a id="data"></a>Example data

HDInsight provides various example data sets, which are stored in the `/example/data` and `/HdiSamples` directories. These directories are in the default storage for your cluster. The Pig example in this document uses the *log4j* file from `/example/data/sample.log`.

Each log entry in the file is a single line of fields, and each entry contains a `[LOG LEVEL]` field to show the type and severity. For example:

    2012-02-03 20:26:41 SampleClass3 [ERROR] verbose detail for id 1527353937

In the previous example, the log level is ERROR.
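To see how a log level can be pulled out of such a line, the following Python snippet (an illustration, not part of the Pig job) applies the same regular expression that the example job passes to `REGEX_EXTRACT`:

```python
import re

line = "2012-02-03 20:26:41 SampleClass3 [ERROR] verbose detail for id 1527353937"

# Capture the first occurrence of a known log level anywhere in the line,
# mirroring REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1).
match = re.search(r"(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)", line)
level = match.group(1) if match else None
print(level)  # ERROR
```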
> [!NOTE]
> You can also generate a log4j file by using the [Apache Log4j](https://en.wikipedia.org/wiki/Log4j) logging tool and then upload that file to your blob. See [Upload Data to HDInsight](hdinsight-upload-data.md) for instructions. For more information about how blobs in Azure storage are used with HDInsight, see [Use Azure Blob Storage with HDInsight](hdinsight-hadoop-use-blob-storage.md).

## <a id="job"></a>Example job

The following Pig Latin job loads the `sample.log` file from the default storage for your HDInsight cluster. Then it performs a series of transformations that result in a count of how many times each log level occurred in the input data. The results are written to STDOUT.

```pig
LOGS = LOAD 'wasb:///example/data/sample.log';
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
RESULT = order FREQUENCIES by COUNT desc;
DUMP RESULT;
```
The following image shows a summary of what each transformation does to the data.

![Graphical representation of the transformations][image-hdi-pig-data-transformation]
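To make the transformation chain concrete, here is an illustrative Python equivalent of the job above. It is not part of the article's Pig job and runs on a few in-memory sample lines rather than on the cluster; the extra log lines are invented for the sketch:

```python
import re
from collections import Counter

# In-memory stand-in for /example/data/sample.log.
logs = [
    "2012-02-03 20:26:41 SampleClass3 [ERROR] verbose detail for id 1527353937",
    "2012-02-03 20:26:41 SampleClass2 [INFO] everything normal for id 1577659818",
    "2012-02-03 20:26:41 SampleClass9 [INFO] detail for id 1991771431",
    "no level on this line",
]

# LEVELS / FILTEREDLEVELS: extract the log level, drop lines without one.
pattern = re.compile(r"(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)")
levels = [m.group(1) for m in map(pattern.search, logs) if m]

# GROUPEDLEVELS / FREQUENCIES: count occurrences of each level.
frequencies = Counter(levels)

# RESULT: order by count, descending.
result = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)

for level, count in result:  # DUMP RESULT
    print(level, count)
```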
## <a id="run"></a>Run the Pig Latin job

HDInsight can run Pig Latin jobs by using a variety of methods, such as an SSH session on the cluster, PowerShell, or the WebHCat REST API. Pick the method that best fits your environment.
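For example, WebHCat exposes a Pig endpoint on the cluster. As a hedged sketch (the cluster name, user name, and status directory are placeholders, and the request is only constructed here, never sent), the job-submission request can be built like this:

```python
import urllib.parse
import urllib.request

# Placeholder: substitute your cluster name.
cluster = "CLUSTERNAME"
url = f"https://{cluster}.azurehdinsight.net/templeton/v1/pig"

# WebHCat accepts a short Pig Latin program in the 'execute' form field.
statements = "LOGS = LOAD 'wasb:///example/data/sample.log'; DUMP LOGS;"
body = urllib.parse.urlencode({
    "user.name": "admin",          # placeholder cluster login
    "execute": statements,
    "statusdir": "/example/pigcurl",  # placeholder output directory
}).encode()

# Build (but do not send) the POST request.
req = urllib.request.Request(url, data=body, method="POST")
print(req.full_url)
```

Sending the request additionally requires the cluster's HTTP credentials; check the WebHCat reference for the full parameter list before using this in practice.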
## Pig and SQL Server Integration Services

You can use SQL Server Integration Services (SSIS) to run a Pig job. The Azure Feature Pack for SSIS provides the following components that work with Pig jobs on HDInsight:

* [Azure HDInsight Pig Task][pigtask]

* [Azure Subscription Connection Manager][connectionmanager]

For more information, see the [Azure Feature Pack for SSIS][ssispack] documentation.
83+
84+
## <a id="nextsteps"></a>Next steps
85+
86+
Now that you have learned how to use Pig with HDInsight, use the following links to explore other ways to work with Azure HDInsight.
87+
88+
* [Upload data to HDInsight](hdinsight-upload-data.md)
89+
* [Use Apache Hive with HDInsight](/hadoop/hdinsight-use-hive.md)
90+
* [Use Apache Sqoop with HDInsight](hdinsight-use-sqoop.md)
91+
* [Use MapReduce jobs with HDInsight](/hadoop/hdinsight-use-mapreduce.md)
92+
93+
[apachepig-home]: https://pig.apache.org/
94+
[putty]: https://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
95+
[curl]: https://curl.haxx.se/
96+
[pigtask]: https://msdn.microsoft.com/library/mt146781(v=sql.120).aspx
97+
[connectionmanager]: https://msdn.microsoft.com/library/mt146773(v=sql.120).aspx
98+
[ssispack]: https://msdn.microsoft.com/library/mt146770(v=sql.120).aspx
99+
[hdinsight-admin-powershell]: hdinsight-administer-use-powershell.md
100+
101+
[hdinsight-use-hive]:../hdinsight-use-hive.md
102+
103+
[hdinsight-provision]: hdinsight-hadoop-provision-linux-clusters.md
104+
[hdinsight-submit-jobs]:submit-apache-hadoop-jobs-programmatically.md#mapreduce-sdk
105+
106+
[Powershell-install-configure]: /powershell/azureps-cmdlets-docs
107+
108+
[powershell-start]: https://technet.microsoft.com/library/hh847889.aspx
109+
110+
111+
[image-hdi-pig-data-transformation]: ./media/use-pig/hdi-data-transformation.gif
