
Spark, Hive, Pig


Apache Spark

  • Open-source distributed processing system for big data.
  • Used for processing large data sets.
  • Uses in-memory execution and DAGs (directed acyclic graphs).
  • You need to write code to use Spark.
  • Comes with many libraries built in: batch processing, ML, graph processing, etc.
  • Stream processing using Spark Streaming.
  • Can do streaming analytics in a fault-tolerant way and write results to HDFS or S3.
  • MLlib for machine learning.
  • Spark SQL - used to run SQL queries (see the first sketch after this list).
  • Not used for OLTP; it's an analytics engine, not a transactional database.
  • Architecture
    • Driver Program (SparkContext) ---> Cluster Manager (Spark standalone, YARN) ---> Executors (cache, tasks)
  • Spark Components
    • Spark Streaming, Spark SQL, MLlib, GraphX
    • All sit on top of Spark Core.
  • Spark Streaming + Kinesis
    • Kinesis producer > Kinesis Data Streams > Spark Dataset, implemented via the KCL (Kinesis Client Library); sketched after this list.
  • Spark + Redshift
    • The spark-redshift package exposes Redshift tables as Spark datasets (see the sketch after this list).
    • It's a Spark SQL data source.
    • Useful for ETL using Spark.
    • S3 > Redshift > Spark / EMR > Redshift
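
A minimal PySpark sketch of the workflow above: read a DataFrame, then query it with Spark SQL. The S3 path, view name, and columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-demo").getOrCreate()

# Spark plans the read + aggregate below as a DAG and runs it
# in memory across the executors.
orders = spark.read.json("s3://my-bucket/orders/")   # hypothetical input path
orders.createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```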
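A hedged sketch of the classic Kinesis integration, assuming Spark 2.x with the spark-streaming-kinesis-asl package on the classpath; the app name, stream name, region, and S3 prefix are hypothetical.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-demo")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Under the hood this receiver uses the KCL (checkpointing in DynamoDB).
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-demo",        # KCL application / checkpoint name
    streamName="my-stream",               # hypothetical Kinesis Data Stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

# Fault-tolerant sink: persist each micro-batch to S3 (or HDFS).
records.saveAsTextFiles("s3://my-bucket/kinesis-out/batch")

ssc.start()
ssc.awaitTermination()
```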
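And a sketch of the spark-redshift data source. The format name varies by fork and version, and the JDBC URL, IAM role, and tempdir are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-etl").getOrCreate()

# Behind the scenes the package UNLOADs the table to the S3 tempdir and
# reads it from there, rather than fetching rows over JDBC one by one.
sales = (spark.read
         .format("io.github.spark_redshift_community.spark.redshift")
         .option("url", "jdbc:redshift://host:5439/dev?user=u&password=p")
         .option("dbtable", "public.sales")
         .option("tempdir", "s3a://my-bucket/tmp/")
         .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3")
         .load())

sales.groupBy("region").count().show()
```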


Apache Hive

  • Runs SQL-like queries on data stored in Hadoop, executing on YARN.
  • Components (a stack, top to bottom):
    • HIVE
    • MapReduce + Tez
    • Hadoop YARN
  • Uses a familiar SQL dialect called HiveQL (see the sketches after this list).
  • Interactive interface.
  • Scalable - works with 'big data' on a cluster.
    • Really most appropriate for data warehouse apps
  • Easy OLAP queries - way easier than writing MapReduce in Java
  • Highly optimized
  • Highly extensible
    • User-defined functions (UDFs).
    • Thrift server
    • JDBC/ODBC driver - External apps can communicate with HIVE.
  • HIVE is NOT for OLTP.
  • Hive Metastore
    • Imparts structure on unstructured data.
    • You define a table structure that is applied to the unstructured data stored on HDFS (see the first sketch after this list).
    • External Hive metastores.
      • Metastore is stored in MySQL on the master node by default.
      • External metastores offer better resiliency / integration:
        • AWS Glue data catalog
        • Amazon RDS.
        • Like Glue, the metastore simply holds structured information about unstructured data - i.e., metadata.
  • Other Hive / AWS integration points.
    • Load table partitions from S3
    • Write tables in S3
    • Load scripts from S3
    • DynamoDB as an external table.
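
A sketch of the "structure on unstructured data" idea, using HiveQL run through a Hive-enabled SparkSession (the same DDL works in the Hive CLI). The S3 location and columns are hypothetical; only the schema and location go into the metastore, and the raw files stay where they are. (The DynamoDB integration instead uses EMR Hive's DynamoDB storage handler, not shown here.)

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl")
         .enableHiveSupport()   # connect to the local or external metastore
         .getOrCreate())

# Registers schema + location in the metastore; the raw files on S3
# are untouched (schema on read).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
        ts     STRING,
        ip     STRING,
        url    STRING,
        status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 's3://my-bucket/logs/'
""")
```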
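Continuing that sketch, an OLAP-style HiveQL query over the same hypothetical table; far less work than the equivalent hand-written MapReduce job.

```python
# Assumes the Hive-enabled `spark` session and the hypothetical
# access_logs table from the previous sketch.
top_urls = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM access_logs
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20
""")
top_urls.show()
```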

Apache Pig

  • Writing mappers and reducers by hand takes a long time.
  • Pig introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps (see the sketch after this list).
  • Highly extensible with user-defined functions (UDFs).
  • Components (imagine building blocks, with HDFS as the lowest block):
    • PIG
    • MapReduce + TEZ
    • YARN
    • HDFS
  • Pig / AWS integration
    • Ability to use multiple file systems (not just HDFS).
    • Query data in S3.
    • Load JARs and scripts from S3
    • Kind of an older technology.
    • It's a higher-level alternative to writing MapReduce code by hand.
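
A hedged sketch of a Pig Latin dataflow (the classic word count), using Pig's Python (Jython) embedding so the script can be parameterized and run with something like `pig wordcount.py`; the S3 paths are hypothetical placeholders.

```python
# Runs under Jython inside Pig (Pig 0.9+ embedding), not CPython.
from org.apache.pig.scripting import Pig

script = Pig.compile("""
    lines   = LOAD '$input' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '$output';
""")

# Bind the $input / $output placeholders and run the dataflow once.
stats = script.bind({
    "input":  "s3://my-bucket/in/",
    "output": "s3://my-bucket/out/",
}).runSingle()

if not stats.isSuccessful():
    raise RuntimeError("Pig job failed")
```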