spark hive pig
abk edited this page Jul 27, 2020 · 1 revision
Apache Spark
- Open-source distributed processing system for big data.
- Used for processing large data sets.
- Uses in-memory execution and DAGs (directed acyclic graphs).
- You need to write code to use Spark.
- Comes with many code libraries built in: batch processing, ML, graph processing, etc.
- Stream processing using Spark streaming.
- Can do streaming analytics in fault tolerant way and write to HDFS or S3.
- MLlib for machine learning.
- Spark SQL - used to run SQL queries.
- Not suited for OLTP (transactional) workloads.
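The lazy-DAG idea above can be sketched in plain Python (this is NOT real Spark code, just an illustration of the model): transformations only record steps in a plan, and nothing executes until an action runs.

```python
# Pure-Python sketch of Spark's lazy evaluation model.
# Transformations (map, filter) build a plan; the action (collect) runs it.

class LazyDataset:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # the "DAG": a list of pending steps

    def map(self, fn):                   # transformation: just records the step
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):              # transformation: also lazy
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):                   # action: now the whole plan executes
        out = iter(self._data)
        for kind, fn in self._plan:
            out = (map if kind == "map" else filter)(fn, out)
        return list(out)

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# No work has happened yet; collect() runs the pipeline in one pass.
print(ds.collect())  # [12, 14, 16, 18]
```

Real Spark does the same bookkeeping at cluster scale: the driver builds the DAG of transformations and only ships work to executors when an action is called.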
- Architecture
- Driver Program (SparkContext) > Cluster Manager (Spark, YARN) > Executors (cache, tasks)
- Spark Components
- Spark Streaming, Spark SQL, MLlib, GraphX
- All built on top of Spark Core.
- Spark Streaming + Kinesis
- Kinesis (producer) > Kinesis Data Stream > Spark Dataset, implemented via the KCL (Kinesis Client Library).
- Spark + Redshift
- The spark-redshift package allows Spark datasets to be created from Redshift data.
- It's a Spark SQL data source.
- Useful for ETL using Spark.
- S3 > Redshift > Spark / EMR > Redshift
Apache Hive
- Runs SQL-like queries on data stored in Hadoop (on YARN).
- Components (layered, with Hadoop YARN at the bottom):
- HIVE
- MapReduce + Tez
- Hadoop YARN
- Use familiar SQL called HiveQL
- Interactive interface
- Scalable - works with 'big data' on a cluster.
- Really most appropriate for data warehouse apps
- Easy OLAP queries - way easier than writing MapReduce in Java
- Highly optimized
- Highly extensible
- User-defined functions (UDFs)
- Thrift server
- JDBC/ODBC driver - External apps can communicate with HIVE.
- HIVE is NOT for OLTP.
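HiveQL reads essentially like standard SQL. As a hedged illustration of the kind of OLAP aggregation described above (table and column names are made up, and SQLite stands in for Hive here), the same shape of query can be run against an in-memory SQLite database:

```python
import sqlite3

# Hypothetical "page_views" table standing in for a Hive table on HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "about"), ("u3", "home")],
)

# A HiveQL aggregation would look essentially identical to this SQL;
# the difference is that Hive translates it into MapReduce/Tez jobs.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()
print(rows)  # [('home', 3), ('about', 1)]
```

This is the "way easier than writing MapReduce in Java" point: one declarative query instead of hand-written map and reduce code.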
- Hive Metastore
- Imparts structure on unstructured data.
- You define a table structure on top of the unstructured data, which is stored on HDFS.
- External Hive metastores.
- Metastore is stored in MySQL on the master node by default.
- External metastores offer better resiliency / integration.
- AWS Glue data catalog
- Amazon RDS.
- It's similar to Glue: both provide structured information (metadata) describing the unstructured data.
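A minimal sketch (all names hypothetical) of what a metastore does conceptually: it stores only the schema and the location of each table, while the data itself stays as plain files, and the structure is imposed at read time ("schema on read").

```python
import csv
import io

# Toy "metastore": table name -> schema + data location (hypothetical names).
metastore = {
    "logs": {
        "location": "logs.csv",             # on a real cluster: an HDFS/S3 path
        "columns": ["ts", "level", "msg"],  # structure imposed at read time
    },
}

# The raw, schema-less data as it would sit in storage.
raw = "2020-07-27,INFO,started\n2020-07-27,ERROR,disk full\n"

def read_table(name, raw_text):
    """Apply the metastore's schema to raw file contents (schema on read)."""
    meta = metastore[name]
    reader = csv.reader(io.StringIO(raw_text))
    return [dict(zip(meta["columns"], row)) for row in reader]

rows = read_table("logs", raw)
print(rows[1]["level"])  # ERROR
```

Glue's Data Catalog or an RDS-backed metastore plays the role of the `metastore` dict here: keeping that mapping outside the cluster is what makes it survive cluster termination.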
- Other Hive / AWS integration points.
- Load table partitions from S3
- Write tables in S3
- Load scripts from S3
- DynamoDB as an external table.
Apache Pig
- Writing mappers and reducers by hand takes a long time.
- Pig introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
- Highly extensible with user-defined functions (UDFs).
- Components (imagine building blocks, with HDFS as the lowest block):
- Pig
- MapReduce + Tez
- YARN
- HDFS
- Pig / AWS integration
- Ability to use multiple FS (not just HDFS)
- Query data in S3.
- Load JARs and scripts from S3
- Kind of an older technology.
- It's a higher-level alternative to writing MapReduce code.
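To make the "writing mappers and reducers by hand" point concrete, here is a hand-rolled word count in plain Python following the map/shuffle/reduce shape — the boilerplate that a couple of lines of Pig Latin (or one HiveQL query) would replace. Input data is illustrative.

```python
from collections import defaultdict

lines = ["big data on yarn", "spark and pig on yarn"]

# Map phase: emit (word, 1) pairs for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum each group's values.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["yarn"])  # 2
```

Even this toy version needs three explicit phases; a real Java MapReduce job adds classes, job configuration, and serialization on top, which is exactly the overhead Pig abstracts away.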