Apache Datasketches for Big Data Analysis
Author: Big Data LDN
Uploaded: 2023-10-31
Views: 888
Description:
16:00 - 16:30 | FAST DATA THEATRE
APACHE DATASKETCHES FOR BIG DATA ANALYSIS
WEDNESDAY 20 SEPTEMBER 2023
SPEAKER: CHARLIE DICKENS, YAHOO
Many businesses face queries such as counting unique identifiers, finding frequent items, and understanding data distributions. However, these tasks are extremely resource-intensive at large scale, particularly on streaming data or for real-time analytics. Given the rapid growth in dataset sizes, this type of analysis is now crucial to organisations of all sizes, not just large enterprises.
We present Apache Software Foundation (ASF) DataSketches, a high-performance library for efficient large-scale data analysis. Using DataSketches, analysis can be performed orders of magnitude faster than brute force. The sketches are extremely small compared to the original data and can be easily integrated into data cubes for efficient aggregate analysis. Our library is distributed in both Java and C++ and also has bindings to Python. It is compatible with Druid, Cloudera, Hive, Impala, PostgreSQL, Pinot, and Iceberg, and is used by companies such as Yahoo. Our open-source library is free for any person or organisation to use.
We will introduce the audience to the notion of data sketching and detail the key wins they can expect by deploying these approaches. We will demonstrate how to use the sketches for OLAP-type queries using the Python API. Finally, we will showcase the key mergeability feature of our sketches. Using this feature we will show how to include sketches in data cubes so that aggregate statistics can easily be found over varying time periods. This is an example of a type of analysis for which a brute-force approach simply would not scale.
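The mergeability idea described above can be illustrated with a toy example. The following is a minimal sketch, assuming a K-Minimum-Values (KMV) distinct counter written from scratch in standard-library Python. It is not the DataSketches API and not necessarily what the talk demonstrates; the class name `KMVSketch`, the parameter `k`, the dates, and the user IDs are all hypothetical and chosen only for illustration.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k            # sketch size: larger k means lower error
        self.hashes = set()   # the (at most) k smallest hash values seen

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self.hashes:
            return
        if len(self.hashes) < self.k:
            self.hashes.add(h)
            return
        largest = max(self.hashes)
        if h < largest:
            # Keep only the k smallest hash values.
            self.hashes.remove(largest)
            self.hashes.add(h)

    def estimate(self):
        # Below capacity the count is exact; otherwise use the KMV estimator.
        if len(self.hashes) < self.k:
            return float(len(self.hashes))
        return (self.k - 1) / max(self.hashes)

    def merge(self, other):
        # The union of two sketches is the k smallest values of both sets.
        k = min(self.k, other.k)
        merged = KMVSketch(k)
        merged.hashes = set(sorted(self.hashes | other.hashes)[:k])
        return merged


# Hypothetical "data cube": one small sketch per day, with overlapping users.
days = {
    "2023-09-18": range(0, 5000),
    "2023-09-19": range(2500, 7500),
    "2023-09-20": range(5000, 10000),
}
cube = {}
for day, users in days.items():
    sketch = KMVSketch()
    for u in users:
        sketch.update(f"user-{u}")
    cube[day] = sketch

# Roll up any time range by merging the stored sketches, not the raw data.
rollup = KMVSketch()
for daily_sketch in cube.values():
    rollup = rollup.merge(daily_sketch)

print("approximate distinct users over three days:", round(rollup.estimate()))
```

Each daily sketch stores at most 256 numbers regardless of how many users it saw, and the three-day figure comes from merging those sketches without revisiting the raw events. Summing the daily counts would double-count users who appear on more than one day, which is why a mergeable summary is needed here.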