Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Author: Databricks
Uploaded: 2024-07-23
Views: 999
Description:
Efficient ACID table upserts are essential for today's Lakehouse. Important use cases, such as the GDPR Right to be Forgotten and Change Data Capture, rely heavily on them. While Delta Lake, Apache Iceberg, and Apache Hudi are widely adopted, upserts slow down as data volume scales up, particularly in copy-on-write mode. Slow upserts can become a blocker to meeting compliance deadlines. We introduced partial copy-on-write within Parquet, using a row-level index to skip unnecessary column chunks efficiently. "Partial" here means performing copy-on-write only for the chunks that need it and skipping unrelated ones. Generally, only a small portion of a file needs to be updated, so most of the data chunks can be skipped. We have observed speedups of up to 20x compared to existing upserts.
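The idea of partial copy-on-write can be illustrated with a small in-memory toy model. This is only a sketch under stated assumptions, not the speakers' implementation and not a real Parquet API: `Chunk`, `build_row_index`, and `partial_copy_on_write` are hypothetical names, a "chunk" is a plain tuple of rows rather than encoded bytes, and only updates to existing keys are handled (inserting new keys is left out).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for one encoded data chunk in a file."""
    rows: Tuple[Tuple[int, str], ...]  # (record key, value) pairs


def build_row_index(chunks: List[Chunk]) -> Dict[int, int]:
    """Row-level index: map each record key to the position of its chunk."""
    return {key: pos for pos, chunk in enumerate(chunks) for key, _ in chunk.rows}


def partial_copy_on_write(
    chunks: List[Chunk], updates: Dict[int, str]
) -> Tuple[List[Chunk], int]:
    """Build the chunk list of the new file version.

    Only chunks that hold an updated key are rewritten; every other chunk
    is reused unchanged (in a real file this would be a raw byte copy
    without decoding). Returns the new chunks and the rewrite count.
    """
    index = build_row_index(chunks)
    touched = {index[key] for key in updates if key in index}

    new_chunks: List[Chunk] = []
    for pos, chunk in enumerate(chunks):
        if pos in touched:
            # Rewrite path: apply updates row by row.
            new_rows = tuple((k, updates.get(k, v)) for k, v in chunk.rows)
            new_chunks.append(Chunk(new_rows))
        else:
            # Skip path: carry the old chunk over untouched.
            new_chunks.append(chunk)
    return new_chunks, len(touched)


old_file = [
    Chunk(((1, "a"), (2, "b"))),
    Chunk(((3, "c"), (4, "d"))),
    Chunk(((5, "e"), (6, "f"))),
]
new_file, rewritten = partial_copy_on_write(old_file, {4: "D"})
print(f"rewrote {rewritten} of {len(old_file)} chunks")
```

In this toy run a single-row update rewrites one chunk out of three, and the other two are carried over as the same objects. The old chunk list is never mutated, which is the copy-on-write part: readers of the previous version still see consistent data.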
Talk By: Mingmin Chen, Director of Engineering, Uber Technologies, Inc.; Xinli Shang, Engineering Manager, Uber
Here's more to explore:
Rise of the Data Lakehouse: https://dbricks.co/3NHT7CD
Lakehouse Fundamentals Training: https://dbricks.co/44ancQs
Connect with us: Website: https://databricks.com
Twitter: / databricks
LinkedIn: / data…
Instagram: / databricksinc
Facebook: / databricksinc