
Apache Spark Beginner's Guide

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of data processing, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. Besides supporting all of these workloads in a single system, it reduces the management burden of maintaining separate tools.

Data Engineering

PySpark is one of the most loved and in-demand frameworks in the world of Big Data processing and Data Engineering. Learning PySpark will boost your profile for Data Engineering jobs.

Below is a list of free resources that I went through to learn Python and PySpark.

Coding Environment Setup

We don't need a cluster to start learning Spark. We can use Databricks or set up Spark locally.

  • Databricks Community Edition: Databricks is a cloud-based platform for working with Spark that provides automated cluster management and Jupyter-style notebooks.

    Databricks (the company) was founded by the creators of Spark.

    Databricks provides a Community Edition of their platform for developers to learn Spark. We get a basic cluster and a Jupyter-style notebook attached to it. Whenever we run code in that notebook, it is executed on the cluster. It's super easy to use and learn. We also get some good built-in datasets that we can import and practice with.

  • Anaconda Distribution: Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS.

Start Learning Python

Start Learning Spark

  • Spark Starter Kit - Udemy: A free Udemy course that covers Spark architecture and internal concepts. It visually explains what happens once we submit a job to Spark: how jobs, stages, and tasks are created and executed. A good starting point on the architecture side.
  • SparkByExamples: An awesome site of hands-on Spark tutorials in Python and Scala. Complete explanations with hands-on programs, arranged topic-wise. You can learn Spark transformations here.
  • GitHub: You can find endless resources on GitHub: articles, notes, code, Jupyter notebooks, and projects. You just need to search.
  • Medium: Medium is one of the most popular blogging sites amongst developers. You can find detailed articles on every topic here, written by developers with a focus on ease of understanding.

Learning Done. Practice Practice Practice...!

Python Practice

  • CodingBat: CodingBat provides a set of Python problems that focus on basic Python concepts like loops, functions, strings, collections, etc. A great place to get some Python practice.

  • Hackerrank: HackerRank's Python practice track, with problems from beginner to advanced level.

  • Hackerearth: HackerEarth provides tutorials and practice problems in areas ranging from basic programming to data structures, algorithms, and beyond. A good resource if you want to practice Python programming and sharpen your problem-solving and DSA skills.

Spark Practice

  • Databricks: Databricks provides a good number of datasets as part of the Community Edition. You can use them inside a notebook and start coding.

  • Kaggle: Your Home for Data Science: Kaggle is a platform for data engineers and data scientists. You can find a lot of resources here:

    • Datasets
    • Notebooks
    • Discussions etc.
  • BigQuery public datasets: Google Cloud hosts a collection of public datasets that you can query and export for practice.

Download sample datasets from any website like Kaggle and get your hands dirty with them.

Work on Pyspark Projects

There is a plethora of resources available on the internet. Try not to get confused. Pick one and stick to it till the end.

Hope you find this useful. Peace✌