What is PySpark: Clear Your Basics for Interview

Sayan Chowdhury
2 min read · Jun 27, 2022



What is PySpark?

PySpark is the Python interface to Apache Spark. It lets you develop Spark applications using Python APIs and provides the PySpark shell for interactively examining data in a distributed environment. PySpark supports most Spark features, including Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.
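For instance, a few lines of Python are enough to start a session and query data with Spark SQL. A minimal sketch (the file name people.json and the query are only illustrative):

```python
# Create a SparkSession, load a DataFrame, and query it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IntroExample").getOrCreate()

df = spark.read.json("people.json")      # hypothetical input file
df.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```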

What are the advantages and disadvantages of PySpark?

Advantages of PySpark:

  • Simple to write: parallelized code can be written quickly (see the sketch after this list).
  • Error Handling: The PySpark framework handles faults with ease.
  • Algorithms Included: PySpark includes a number of important algorithms for Machine Learning and Graphs.
  • Compared to Scala, Python has a much larger collection of libraries for data science and data visualisation.
  • PySpark is simple to learn and use.
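To illustrate the first point, distributing a computation across the cluster takes only a couple of lines. A minimal sketch (the app name and data are illustrative):

```python
# The same logic as a plain Python loop, but executed in parallel by Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParallelSketch").getOrCreate()
sc = spark.sparkContext

# Square one million numbers in parallel and sum the results.
total = sc.parallelize(range(1_000_000)).map(lambda x: x * x).sum()
print(total)
```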

PySpark’s drawbacks include:

  • It can sometimes be difficult to express a problem in terms of the MapReduce model.
  • Because Spark itself is written in Scala, PySpark programmes are noticeably less efficient, around 10 times slower than equivalent Scala programmes in some workloads. This can affect the performance of applications that process large amounts of data.
  • PySpark’s Spark Streaming API is not yet complete.

What is PySpark SparkContext?

PySpark SparkContext is the entry point to any Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs (Resilient Distributed Datasets) and broadcast variables on that cluster.

When we want to run a Spark application, we start a driver programme containing the main function, and the SparkContext we defined is initialised there. The driver programme then runs the operations inside executors on the worker nodes. Under the hood, Py4J is used to launch a JVM, which creates a JavaSparkContext. In the PySpark shell there is no need to create a new SparkContext, because a default one is already available as “sc”.
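In a standalone script (outside the shell), the SparkContext is created explicitly. A minimal sketch, assuming local mode and illustrative names:

```python
# Create a SparkContext in a driver programme, then use it to build an RDD
# and a broadcast variable shared with all workers.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("DriverSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)   # in the PySpark shell this already exists as `sc`

rdd = sc.parallelize([1, 2, 3, 4, 5])            # an RDD distributed across executors
lookup = sc.broadcast({"even": 0, "odd": 1})     # read-only variable sent to every worker

labels = rdd.map(lambda x: lookup.value["even"] if x % 2 == 0 else lookup.value["odd"])
print(labels.collect())

sc.stop()
```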

What are the different cluster manager types supported by PySpark?

A cluster manager is a platform, operating in cluster mode, that helps Spark run by allocating resources to worker nodes according to their needs.

The following cluster manager types are supported by PySpark:

Standalone — This is a straightforward cluster manager bundled with Spark.

Apache Mesos — A general-purpose cluster manager that can also run Hadoop MapReduce and PySpark applications.

Hadoop YARN — The resource manager used by Hadoop 2.

Kubernetes — This is an open-source cluster manager that helps with containerized app deployment, scaling, and management.

Local — This is a mode for executing Spark applications on laptops and desktop computers.
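Which cluster manager is used is controlled by the master URL passed when the application starts (either in code or via spark-submit --master). A minimal sketch with illustrative host names:

```python
# The master URL selects the cluster manager; the commented lines show the
# alternatives for the managers listed above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ClusterManagerSketch")
    .master("local[*]")                       # Local: all cores on this machine
    # .master("spark://host:7077")            # Standalone cluster manager
    # .master("yarn")                         # Hadoop YARN
    # .master("mesos://host:5050")            # Apache Mesos
    # .master("k8s://https://host:6443")      # Kubernetes
    .getOrCreate()
)
print(spark.sparkContext.master)
```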

Liked it? Give your feedback in the comments 😄

Originally published at https://www.linkedin.com.


Sayan Chowdhury

Design Engineer at Larsen and Toubro • Data Enthusiast • Google Cloud Platform