PySpark Intermediate to Advanced Questions and Answers

1. How do you create a PySpark DataFrame from a Python dictionary?

You can create a DataFrame by passing a list of dictionaries to the createDataFrame() method.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [{"name": "Alice", "age": 29}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)

df.show()

2. How can you add a new column to a PySpark DataFrame based on conditions?

Use the withColumn() method together with the when() function from pyspark.sql.functions.

from pyspark.sql.functions import when

df = df.withColumn("age_category", when(df.age > 30, "Senior").otherwise("Junior"))
df.show()

3. How do you remove duplicates from a PySpark DataFrame?

Use the dropDuplicates() method.

df.dropDuplicates().show()
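
To deduplicate on specific columns instead of entire rows, pass a list of column names:

df.dropDuplicates(["name"]).show()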

4. How can you join two DataFrames in PySpark?

Use the join() method to perform joins between DataFrames.

df1 = spark.createDataFrame([("Alice", 29), ("Bob", 30)], ["name", "age"])
df2 = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])

joined_df = df1.join(df2, on="name", how="inner")
joined_df.show()
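
You can also join on a column expression, which is useful when the key columns have different names; note that this form keeps both name columns in the result:

joined_df = df1.join(df2, df1["name"] == df2["name"], "inner")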

5. How do you filter rows in a PySpark DataFrame?

Use the filter() method (where() is an alias).

df.filter(df.age > 30).show()

6. How can you calculate the average value of a column in PySpark?

Use the agg() method with the avg() function from pyspark.sql.functions.

from pyspark.sql.functions import avg

df.agg(avg("age")).show()

7. How do you convert a PySpark DataFrame to a Pandas DataFrame?

Use the toPandas() method. Note that this collects the entire DataFrame into driver memory, so it is only safe for results that fit on the driver.

pandas_df = df.toPandas()
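
On Spark 3.x, Arrow can speed up this conversion considerably; a minimal sketch (the configuration key below is the Spark 3.x name; Spark 2.x used spark.sql.execution.arrow.enabled):

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandas_df = df.toPandas()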

8. How do you cache a DataFrame in PySpark?

Use the cache() method.

df.cache()
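
cache() is lazy: the data is only materialized when an action runs. A minimal sketch:

df.cache()
df.count()      # first action populates the cache
df.unpersist()  # release the cached data when done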

9. How do you repartition a PySpark DataFrame?

Use the repartition() method, which performs a full shuffle to redistribute the data across the given number of partitions.

df = df.repartition(5)
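
To merely reduce the number of partitions without a full shuffle, use coalesce(); you can also repartition by a column to co-locate related rows:

df = df.coalesce(2)
df = df.repartition(5, "age")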

10. How do you group data in PySpark?

Use the groupBy() method followed by aggregation functions.

df.groupBy("age").count().show()

11. How do you sort a PySpark DataFrame?

Use the orderBy() method.

df.orderBy(df.age.desc()).show()

12. How do you drop a column in PySpark?

Use the drop() method.

df = df.drop("age")

13. How can you read a CSV file in PySpark?

Use the read.csv() method.

df = spark.read.csv("file.csv", header=True, inferSchema=True)
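
Relying on inferSchema requires an extra pass over the data; for production jobs you can supply an explicit schema instead (column names here follow the running example):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("file.csv", header=True, schema=schema)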

14. How do you write a DataFrame to a Parquet file?

Use the write.parquet() method.

df.write.parquet("output.parquet")
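
By default the write fails if the path already exists. You can set a save mode and partition the output:

df.write.mode("overwrite").partitionBy("age").parquet("output.parquet")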

15. How do you use window functions in PySpark?

Use Window from pyspark.sql.window for windowed operations.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("age").orderBy("name")
df = df.withColumn("row_num", row_number().over(windowSpec))
df.show()

16. How do you handle missing data in PySpark?

Use the dropna() or fillna() methods.

df.dropna().show()
df.fillna({"age": 0}).show()

17. How do you union two DataFrames in PySpark?

Use the union() method. Note that union() matches columns by position, not by name.

df1.union(df2).show()
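
To match columns by name rather than position, use unionByName(); this assumes both DataFrames have the same set of columns:

df1.unionByName(df2).show()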

18. How do you pivot a DataFrame in PySpark?

Use the pivot() method.

df.groupBy("name").pivot("age").count().show()

19. How do you run SQL queries on a PySpark DataFrame?

First, register the DataFrame as a temporary view, then query it with spark.sql().

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 30").show()

20. How do you broadcast a variable in PySpark?

Use SparkContext.broadcast() to ship a read-only value to every executor once, rather than with each task.

broadcastVar = spark.sparkContext.broadcast([1, 2, 3])
print(broadcastVar.value)  # [1, 2, 3]
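
The broadcast value is read inside tasks via .value, for example in an RDD transformation:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x in broadcastVar.value).collect()  # [1, 2, 3]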

21. How do you convert RDD to DataFrame in PySpark?

Use the createDataFrame() method.

rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
df = spark.createDataFrame(rdd, ["id", "name"])
df.show()
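
An equivalent shortcut for an RDD of tuples is toDF():

df = rdd.toDF(["id", "name"])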

22. How do you calculate the correlation between two columns in PySpark?

Use the corr() method on DataFrame.stat. Both columns must be numeric; the example below assumes the DataFrame has numeric age and salary columns.

df.stat.corr("age", "salary")

23. How do you change the data type of a column in PySpark?

Use the cast() method.

df = df.withColumn("age", df["age"].cast("integer"))

24. How do you split a column into multiple columns in PySpark?

Use the split() function. It returns an array column, from which individual elements can be extracted into separate columns (see the sketch after the example).

from pyspark.sql.functions import split

df = df.withColumn("name_split", split(df["name"], " "))
df.show()
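
To produce genuinely separate columns, extract the array elements with getItem(); this assumes the names contain a space (e.g. "Alice Smith"), otherwise the second element is null:

df = df.withColumn("first_name", df["name_split"].getItem(0)) \
       .withColumn("last_name", df["name_split"].getItem(1))
df.show()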

25. How do you apply user-defined functions (UDFs) in PySpark?

Use pyspark.sql.functions.udf to define UDFs.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x * x

square_udf = udf(square, IntegerType())
df = df.withColumn("age_squared", square_udf(df["age"]))
df.show()
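
The same UDF can also be declared with decorator syntax:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def square(x):
    return x * x

df = df.withColumn("age_squared", square(df["age"]))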

26. How can you calculate the running total in PySpark?

Use a window specification with rowsBetween().

from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum  # alias to avoid shadowing Python's built-in sum

# Without partitionBy(), the whole DataFrame is processed in a single partition; fine for small data.
windowSpec = Window.orderBy("age").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("running_total", spark_sum("age").over(windowSpec))
df.show()

27. How do you collect DataFrame rows into a Python list?

Use the collect() method, which returns all rows to the driver as a list of Row objects.

rows = df.collect()
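
Each element is a Row whose fields can be accessed by name or attribute (assuming the DataFrame still has name and age columns):

first = rows[0]
print(first["name"], first.age)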

28. How do you broadcast a DataFrame in PySpark?

Use broadcast() from pyspark.sql.functions to hint that a DataFrame is small enough to be copied to all executors.

from pyspark.sql.functions import broadcast

broadcasted_df = broadcast(df)
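
The hint matters inside a join: Spark ships the small DataFrame to every executor and avoids shuffling the large side. A minimal sketch, assuming small_df fits in executor memory:

small_df = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])
joined = df.join(broadcast(small_df), on="name")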

29. How do you find the top N records in a PySpark DataFrame?

Use orderBy() followed by the limit() method.

df.orderBy(df.age.desc()).limit(5).show()

30. How do you perform a left anti join in PySpark?

Use the join() method with how="left_anti". This returns the rows of the left DataFrame that have no match in the right one.

df1.join(df2, on="name", how="left_anti").show()