
PySpark Essential Training: Introduction to Building Data Pipelines
Published 08/2025
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz, 2 Ch
Language: English | Duration: 1h 18m | Size: 146 MB
Video description
PySpark is a powerful library that brings Apache Spark's distributed computing capabilities to Python, making it a key tool for processing large-scale data efficiently. In this course, data engineer and analyst Sam Bail provides a structured, hands-on introduction to PySpark, starting with an overview of Apache Spark, its architecture, and its ecosystem. Learn about Spark's core concepts, such as the DataFrame API, transformations, lazy evaluation, and actions, before setting up a lab environment and working with a real dataset. Plus, gain insights into how PySpark fits into the broader data engineering ecosystem and best practices for running PySpark in a production environment.
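
To illustrate the concepts named above, here is a minimal sketch (not taken from the course) of how a typical PySpark workflow combines the DataFrame API, lazy transformations, and an action. The file name and column names ("trips.csv", "fare", "passenger_count") are hypothetical placeholders.

# Minimal PySpark sketch: DataFrames, lazy transformations, and actions.
# All data and column names below are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-essentials-demo").getOrCreate()

# Reading a CSV produces a DataFrame; no heavy computation runs yet.
df = spark.read.csv("trips.csv", header=True, inferSchema=True)

# Transformations (filter, withColumn, groupBy/agg) are lazy:
# Spark only records them in a logical plan.
fares = (
    df.filter(F.col("passenger_count") > 0)
      .withColumn("fare_per_passenger", F.col("fare") / F.col("passenger_count"))
      .groupBy("passenger_count")
      .agg(F.avg("fare_per_passenger").alias("avg_fare_per_passenger"))
)

# An action (show, count, collect, write) triggers actual execution.
fares.show()

spark.stop()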