
Spark, Hive, Pig


Apache Spark

  • Open-source distributed processing system for big data.
  • Used for processing large data sets.
  • Uses in-memory execution and DAGs (directed acyclic graphs).
  • You need to write code to use Spark.
  • Comes with many libraries built in: batch processing, ML, graph processing, etc.
  • Stream processing using Spark Streaming.
  • Can do streaming analytics in a fault-tolerant way and write results to HDFS or S3.
  • MLlib for machine learning.
  • Spark SQL - used to run SQL queries (see the first sketch after this list).
  • Not used for OLTP; it's an analytics engine, not a transactional database.
  • Architecture
    • Driver Program (SparkContext) ---> Cluster Manager (Spark standalone, YARN) ---> Executors (cache, tasks)
  • Spark Components
    • Spark Streaming, Spark SQL, MLlib, GraphX
    • All sit on top of Spark Core.
  • Spark Streaming + Kinesis
    • Kinesis producer > Kinesis Data Streams > Spark Dataset, implemented via the KCL (Kinesis Client Library); sketched after this list.
  • Spark + Redshift
    • The spark-redshift package exposes Redshift tables as Spark datasets (see the sketch after this list).
    • It's a Spark SQL data source.
    • Useful for ETL using Spark.
    • S3 > Redshift > Spark / EMR > Redshift
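
A minimal PySpark sketch of the workflow above: read a DataFrame, then query it with Spark SQL. The S3 path, view name, and columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-demo").getOrCreate()

# Spark plans the read + aggregate below as a DAG and runs it
# in memory across the executors.
orders = spark.read.json("s3://my-bucket/orders/")   # hypothetical input path
orders.createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```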
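A hedged sketch of the classic Kinesis integration, assuming Spark 2.x with the spark-streaming-kinesis-asl package on the classpath; the app name, stream name, region, and S3 prefix are hypothetical.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-demo")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Under the hood this receiver uses the KCL (checkpointing in DynamoDB).
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-demo",        # KCL application / checkpoint name
    streamName="my-stream",               # hypothetical Kinesis Data Stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

# Fault-tolerant sink: persist each micro-batch to S3 (or HDFS).
records.saveAsTextFiles("s3://my-bucket/kinesis-out/batch")

ssc.start()
ssc.awaitTermination()
```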
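And a sketch of the spark-redshift data source. The format name varies by fork and version, and the JDBC URL, IAM role, and tempdir are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-etl").getOrCreate()

# Behind the scenes the package UNLOADs the table to the S3 tempdir and
# reads it from there, rather than fetching rows over JDBC one by one.
sales = (spark.read
         .format("io.github.spark_redshift_community.spark.redshift")
         .option("url", "jdbc:redshift://host:5439/dev?user=u&password=p")
         .option("dbtable", "public.sales")
         .option("tempdir", "s3a://my-bucket/tmp/")
         .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3")
         .load())

sales.groupBy("region").count().show()
```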


Apache Hive

  • Runs SQL-like queries on data stored in Hadoop, executing on YARN.
  • Components (a stack, top to bottom):
    • HIVE
    • MapReduce + Tez
    • Hadoop YARN
  • Uses a familiar SQL dialect called HiveQL (see the sketches after this list).
  • Interactive interface.
  • Scalable - works with 'big data' on a cluster.
    • Really most appropriate for data warehouse apps
  • Easy OLAP queries - way easier than writing MapReduce in Java
  • Highly optimized
  • Highly extensible
    • User-defined functions (UDFs).
    • Thrift server
    • JDBC/ODBC driver - External apps can communicate with HIVE.
  • HIVE is NOT for OLTP.
  • Hive Metastore
    • Imparts structure on unstructured data.
    • You define a table structure that is applied to the unstructured data stored on HDFS (see the first sketch after this list).
    • External Hive metastores.
      • Metastore is stored in MySQL on the master node by default.
      • External metastores offer better resiliency / integration:
        • AWS Glue data catalog
        • Amazon RDS.
        • Like Glue, the metastore simply holds structured information about unstructured data - i.e., metadata.
  • Other Hive / AWS integration points.
    • Load table partitions from S3
    • Write tables in S3
    • Load scripts from S3
    • DynamoDB as an external table.
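
A sketch of the "structure on unstructured data" idea, using HiveQL run through a Hive-enabled SparkSession (the same DDL works in the Hive CLI). The S3 location and columns are hypothetical; only the schema and location go into the metastore, and the raw files stay where they are. (The DynamoDB integration instead uses EMR Hive's DynamoDB storage handler, not shown here.)

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl")
         .enableHiveSupport()   # connect to the local or external metastore
         .getOrCreate())

# Registers schema + location in the metastore; the raw files on S3
# are untouched (schema on read).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
        ts     STRING,
        ip     STRING,
        url    STRING,
        status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 's3://my-bucket/logs/'
""")
```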
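Continuing that sketch, an OLAP-style HiveQL query over the same hypothetical table; far less work than the equivalent hand-written MapReduce job.

```python
# Assumes the Hive-enabled `spark` session and the hypothetical
# access_logs table from the previous sketch.
top_urls = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM access_logs
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20
""")
top_urls.show()
```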

Apache Pig

  • Writing mappers and reducers by hand takes a long time.
  • Pig introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps (see the sketch after this list).
  • Highly extensible with user-defined functions (UDFs).
  • Components (imagine building blocks, with HDFS as the lowest block):
    • PIG
    • MapReduce + TEZ
    • YARN
    • HDFS
  • Pig / AWS integration
    • Ability to use multiple file systems (not just HDFS).
    • Query data in S3.
    • Load JARs and scripts from S3
    • Kind of an older technology.
    • It's a higher-level alternative to writing MapReduce code by hand.
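
A hedged sketch of a Pig Latin dataflow (the classic word count), using Pig's Python (Jython) embedding so the script can be parameterized and run with something like `pig wordcount.py`; the S3 paths are hypothetical placeholders.

```python
# Runs under Jython inside Pig (Pig 0.9+ embedding), not CPython.
from org.apache.pig.scripting import Pig

script = Pig.compile("""
    lines   = LOAD '$input' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '$output';
""")

# Bind the $input / $output placeholders and run the dataflow once.
stats = script.bind({
    "input":  "s3://my-bucket/in/",
    "output": "s3://my-bucket/out/",
}).runSingle()

if not stats.isSuccessful():
    raise RuntimeError("Pig job failed")
```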