PySpark Tutorial (Part 3): How to Deploy PySpark Pipelines to Google DataProc
Author: Anton T. Ruberts
Uploaded: 2024-01-12
Views: 1869
Description:
After learning all the basics of PySpark, it's finally time to put it all together into one coherent pipeline. We can run this data and ML pipeline locally, but what happens when you need to scale it past your personal computer's capabilities? That's when services like DataProc come in.
DataProc is a managed Spark service that helps you create clusters quickly, manage them easily, and gives you the flexibility to turn them on and off on demand.
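To make the on-demand idea concrete, here is a minimal sketch of that workflow using the google-cloud-dataproc Python client: create a cluster, submit a PySpark job, then delete the cluster. The project ID, region, cluster name, machine types, and gs:// path are placeholders for illustration, not values from the video.

from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
cluster_name = "pyspark-demo"    # placeholder

# DataProc clients must point at the regional endpoint.
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# Turn the cluster "on": create it and block until the operation finishes.
cluster = {
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
clusters.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# Submit a pipeline script that was previously uploaded to Cloud Storage.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/pipeline.py"},
}
jobs.submit_job(request={"project_id": project_id, "region": region, "job": job})

# Turn the cluster "off" again so it stops costing money.
clusters.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()

The same three steps can also be driven from the command line with gcloud dataproc clusters create, gcloud dataproc jobs submit pyspark, and gcloud dataproc clusters delete, which may be closer to how the tutorial demonstrates it.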
This tutorial will show you how to put all the code from the previous parts (and some new code as well) into a PySpark pipeline, how UDFs can be used to extend Spark's functionality, how hyper-parameter tuning can be performed with Hyperopt and PySpark, how to create GCP infrastructure for running PySpark code, and how PySpark jobs can be submitted to your DataProc cluster.
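As a taste of the UDF section, here is a toy example of wrapping plain Python logic as a Spark UDF; the DataFrame contents and column names are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 61)], ["name", "age"])

# Plain Python logic, wrapped as a UDF so Spark can apply it to a column.
@F.udf(returnType=StringType())
def age_group(age):
    return "60+" if age >= 60 else "under 60"

df.withColumn("age_group", age_group(F.col("age"))).show()

Because a UDF ships every row through the Python interpreter, it is slower than Spark's built-in column functions, so it's best reserved for logic the built-ins can't express.

And here is a hedged sketch of how Hyperopt can drive hyper-parameter tuning of a Spark ML model. The model, search space, and tiny toy dataset are assumptions for illustration, not the video's actual setup.

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("hyperopt-demo").getOrCreate()

# Tiny toy dataset just so the sketch runs end to end.
rows = [
    (Vectors.dense([0.0, 1.0]), 0.0),
    (Vectors.dense([0.5, 0.8]), 0.0),
    (Vectors.dense([1.0, 0.1]), 1.0),
    (Vectors.dense([0.9, 0.2]), 1.0),
]
train_df = spark.createDataFrame(rows, ["features", "label"])
val_df = train_df  # reused for brevity; use a real holdout set in practice

def objective(params):
    # Fit a model with the sampled hyper-parameters and score it on AUC.
    model = LogisticRegression(
        regParam=params["reg_param"],
        elasticNetParam=params["elastic_net"],
    ).fit(train_df)
    auc = BinaryClassificationEvaluator().evaluate(model.transform(val_df))
    return {"loss": -auc, "status": STATUS_OK}  # Hyperopt minimises the loss

space = {
    "reg_param": hp.loguniform("reg_param", -5, 0),
    "elastic_net": hp.uniform("elastic_net", 0.0, 1.0),
}
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=10, trials=Trials())
print("Best parameters found:", best)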
Tutorial Part 1 - PySpark Tutorial for Beginners: Step-by-St...
Tutorial Part 2 - PySpark Tutorial for Beginners: Feature En...
GitHub Repository - https://github.com/aruberts/tutorials...
Dataset link - https://www.kaggle.com/datasets/agung...
DataProc Documentation - https://cloud.google.com/dataproc/doc...
00:00 - Introduction
00:26 - Project Setup
02:11 - PySpark Pipeline Overview
08:05 - User Defined Functions
11:14 - UDF example
14:34 - Hyper-parameter tuning
20:16 - Google Cloud Storage and DataProc setup
27:44 - Submit jobs to DataProc
30:07 - Outro