Working with big data can initially feel overwhelming — with rows and columns stretching into millions, traditional tools often slow to a crawl. That’s where PySpark shines. It combines Python’s simplicity with Spark’s distributed power, letting you process massive datasets with ease. However, learning PySpark can feel like wandering through a giant toolbox without knowing which tools matter. You don’t need every single function to get real work done. What you need are the essentials — the ones you’ll use daily to clean, transform, and analyze data. This guide walks you through those key PySpark functions, with simple examples.
Top PySpark Functions with Examples
select()
The select() function is your go-to when you only need certain columns from a DataFrame. Instead of hauling around the whole table, you can keep just what matters.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.select("name").show()
Output:
+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+
You can also use selectExpr() to write SQL-like expressions when selecting columns.
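For example, a quick sketch that derives a column with SQL syntax (the age_next_year alias is just an illustrative name):
df.selectExpr("name", "age + 1 AS age_next_year").show()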
withColumn()
When you need to create a new column or modify an existing one, use withColumn(). You pass it the name of the column and the expression to compute.
Example:
from pyspark.sql.functions import col
df.withColumn("age_plus_10", col("age") + 10).show()
Output:
+-------+---+-----------+
|   name|age|age_plus_10|
+-------+---+-----------+
|  Alice| 34|         44|
|    Bob| 45|         55|
|Charlie| 29|         39|
+-------+---+-----------+
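withColumn() can also overwrite an existing column if you reuse its name. A small sketch on the same DataFrame (doubling the age is an arbitrary transformation, chosen only for illustration):
df.withColumn("age", col("age") * 2).show()  # replaces the values in "age" instead of adding a column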
filter() / where()
You’ll often need to work with a subset of your data. filter() or where() helps you keep only rows that match a condition.
Example:
df.filter(col("age") > 30).show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+
Both filter() and where() are interchangeable. Use whichever feels more readable to you.
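For instance, the same condition can be written with where(), either as a Column expression or as a SQL-style string:
df.where(col("age") > 30).show()  # Column expression
df.where("age > 30").show()       # SQL-style string, same result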
groupBy() and agg()
To summarize data, you’ll use groupBy() combined with aggregation functions. You can compute counts, averages, sums, etc.
Example:
from pyspark.sql.functions import avg
data = [("Math", "Alice", 85), ("Math", "Bob", 78),
("English", "Alice", 90), ("English", "Bob", 80)]
df2 = spark.createDataFrame(data, ["subject", "student", "score"])
df2.groupBy("subject").agg(avg("score").alias("avg_score")).show()
Output:
+-------+---------+
|subject|avg_score|
+-------+---------+
|   Math|     81.5|
|English|     85.0|
+-------+---------+
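agg() also accepts several aggregations at once, so you can build a small summary in a single pass. A minimal sketch on the same df2 (the aliases are just illustrative names):
from pyspark.sql.functions import count, max as max_
df2.groupBy("subject").agg(
    avg("score").alias("avg_score"),
    max_("score").alias("max_score"),
    count("score").alias("num_scores")
).show()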
orderBy() / sort()
If you want your results in a specific order, use orderBy() or sort(). Both do the same thing.
Example:
df.orderBy(col("age").desc()).show()
Output:
+-------+---+
|   name|age|
+-------+---+
|    Bob| 45|
|  Alice| 34|
|Charlie| 29|
+-------+---+
You can sort by multiple columns if needed.
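For example, a sketch that sorts by age descending and breaks ties by name ascending:
df.orderBy(col("age").desc(), col("name").asc()).show()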
drop()
Sometimes you want to remove a column you don’t need anymore.
Example:
df.drop("age").show()
Output:
+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+
distinct()
To get unique rows from your DataFrame, use distinct().
Example:
data = [("Alice", 34), ("Alice", 34), ("Bob", 45)]
df3 = spark.createDataFrame(data, ["name", "age"])
df3.distinct().show()
Output:
+-----+---+
| name|age|
+-----+---+
|  Bob| 45|
|Alice| 34|
+-----+---+
dropDuplicates()
This is like distinct(), but you can specify which columns to consider when checking for duplicates.
Example:
df3.dropDuplicates(["name"]).show()
Output:
+-----+---+
| name|age|
+-----+---+
|  Bob| 45|
|Alice| 34|
+-----+---+
join()
Combining two DataFrames is a common need. Use join() to merge on a common column.
Example:
data1 = [("Alice", "Math"), ("Bob", "English")]
df4 = spark.createDataFrame(data1, ["name", "subject"])
data2 = [("Alice", 85), ("Bob", 78)]
df5 = spark.createDataFrame(data2, ["name", "score"])
df4.join(df5, on="name").show()
Output:
+-----+-------+-----+
| name|subject|score|
+-----+-------+-----+
|Alice|   Math|   85|
|  Bob|English|   78|
+-----+-------+-----+
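By default this is an inner join; pass the how parameter if you need to keep unmatched rows. A quick sketch:
df4.join(df5, on="name", how="left").show()  # keeps every row from df4, even without a match in df5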
cache()
When working with large datasets, reusing the same DataFrame can get slow because Spark recomputes it for every action. cache() keeps it in memory for faster access.
Example:
df.cache()
df.count() # This action triggers caching
There’s no visible output here, but future operations on df will run faster.
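When you no longer need the cached data, you can release the memory:
df.unpersist()  # removes the DataFrame from the cache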
collect()
To get your results back to Python as a list of Row objects, use collect(). Be careful: if your data is huge, this can crash your driver.
Example:
rows = df.collect()
print(rows)
Output:
[Row(name='Alice', age=34), Row(name='Bob', age=45), Row(name='Charlie', age=29)]
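If you only need a handful of rows on the driver, take(n) is a safer alternative to collecting everything:
print(df.take(2))  # returns at most 2 Row objects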
show()
This one you’ve already seen throughout the examples. show() prints your DataFrame in a readable tabular format.
Example:
df.show()
count()
To quickly find out how many rows your DataFrame has, call count().
Example:
df.count()
Output:
3
replace()
To replace specific values in a DataFrame, use replace().
Example:
df.replace("Alice", "Alicia", "name").show()
Output:
+-------+---+
|   name|age|
+-------+---+
| Alicia| 34|
|    Bob| 45|
|Charlie| 29|
+-------+---+
fillna()
To fill in missing values with a default, use fillna().
Example:
data = [("Alice", None), ("Bob", 45)]
df6 = spark.createDataFrame(data, ["name", "age"])
df6.fillna(0).show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice|  0|
|  Bob| 45|
+-----+---+
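fillna() also accepts a dictionary, so you can set a different default per column. A small sketch (the replacement values are just examples):
df6.fillna({"age": 0, "name": "unknown"}).show()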
explode()
When working with columns that contain arrays or lists, you often need to turn each element of the array into its own row. That’s what explode() does: it flattens an array column into one row per element.
Example:
from pyspark.sql.functions import explode
data = [("Alice", ["Math", "English"]), ("Bob", ["History", "Science"])]
df7 = spark.createDataFrame(data, ["name", "subjects"])
df7.select("name", explode("subjects").alias("subject")).show()
Output:
+-----+-------+
| name|subject|
+-----+-------+
|Alice|   Math|
|Alice|English|
|  Bob|History|
|  Bob|Science|
+-----+-------+
Conclusion
Knowing which PySpark functions to focus on saves you time and makes your code much cleaner. You don’t need to memorize every single function in the library — the ones we’ve covered here are more than enough to handle most real-world tasks. They cover selecting and transforming columns, filtering and grouping data, joining DataFrames, dealing with duplicates and missing values, and working efficiently with cached data. As you get more practice, you’ll start using these almost without thinking. PySpark is powerful because of how much you can do with just a few well-chosen functions. Start with these, experiment with your datasets, and the rest will come naturally.
For more information on PySpark, you can visit the Apache Spark Documentation.