Apache Spark is a cluster computing framework designed for fast computation. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of data processing, including interactive queries and stream processing. Its main feature is in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. By supporting all of these workloads in a single system, it reduces the management burden of maintaining separate tools.
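If the MapReduce model mentioned above is new to you, here is a minimal pure-Python sketch of the idea using a word count. It is not Spark or Hadoop code, just an illustration of the map, shuffle, and reduce steps; the function names and sample lines are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key (a framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Made-up input lines for illustration.
lines = ["spark is fast", "spark extends mapreduce"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

In a real cluster, the map and reduce steps run in parallel on many machines. Spark's in-memory processing avoids writing intermediate results to disk between such steps.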
PySpark is one of the most loved and in-demand frameworks in the world of Big Data Processing and Data Engineering. Learning PySpark will boost your profile for Data Engineering jobs.
Below is a list of free resources that I have gone through to learn Python and PySpark.
We don't need a cluster to start learning Spark. We can use Databricks or set up Spark locally.
- Databricks Community Edition: Databricks is a cloud-based platform for working with Spark that provides automated cluster management and Jupyter-style notebooks. Databricks (the company) was founded by the creators of Spark. It provides a Community Edition of its platform for developers to learn Spark. We get a basic cluster setup and a Jupyter-style notebook attached to the cluster. Whenever we run code in that notebook, it is executed on the cluster. It's super easy to use and learn. We also get some good built-in datasets that we can import and practice with.
- Anaconda Distribution: Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS.
- Programiz Python Tutorials: A clean website of well-organized Python tutorials. Simplified explanations with a great set of examples.
- Python for Everybody: This is the official site for the Python for Everybody Specialization on Coursera by University of Michigan professor Charles Russell Severance. It has videos, articles, documentation, etc. The course focuses on hands-on Python with live assignments and a capstone project. A great source that covers a wide range of concepts required in an actual job or project.
- Python Docs: The official Python documentation.
- RealPython: In-depth Python tutorials and discussions on Python topics. A compilation of great tutorials, courses, quizzes, learning paths, etc.
- Python Tutorial for Absolute Beginners by CS Dojo (YouTube): Easy tutorials for absolute beginners.
- Spark Starter Kit - Udemy: Free Udemy course that covers Spark architecture and internal concepts. It visually explains what happens once we submit a job to Spark, how jobs, stages, and tasks are created and executed, etc. Good starting point on the architecture side.
- SparkByExamples: An awesome site of hands-on Spark tutorials in Python and Scala. Complete explanations with hands-on programs arranged topic-wise. You can learn Spark transformations here.
- GitHub: GitHub has countless resources: articles, notes, code, Jupyter notebooks, and projects. You just need to search.
- Medium: Medium is a popular blogging site among developers. You can find detailed articles on almost every topic here, many of them written by developers with a focus on simplicity of understanding.
- CodingBat: CodingBat provides a set of Python problems that focus on understanding basic Python concepts like loops, functions, strings, collections, etc. A great place to do some Python practice.
- HackerRank: The HackerRank Python tutorial track for practicing Python problems from beginner to advanced level.
- HackerEarth: HackerEarth provides a set of tutorials and practice problems in different areas, ranging from Basic Programming to Data Structures to Algorithms and so on. A good resource if you want to practice Python programming and enhance your knowledge of problem-solving and DSA.
- Databricks: Databricks provides a good number of datasets as part of the Community Edition. You can use them inside a notebook and start coding.
- Kaggle: Your Home for Data Science: Kaggle is a platform for data engineers and data scientists. You can get a lot of resources here:
  - Datasets
  - Notebooks
  - Discussions, etc.

Download sample datasets from any website, such as Kaggle, and get your hands dirty with them.
- https://github.com/AlexIoannides/pyspark-example-project
- https://www.youtube.com/watch?v=5gK5eYwuKiM&ab_channel=learnbydoingit
- https://www.theseattledataguy.com/5-data-engineering-projects-to-add-to-your-resume/#page-content
There is a plethora of resources available on the internet. Try not to get confused. Pick one and stick to it till the end.
Hope you find this useful. Peace✌