Activity
-
We contributed to Apache Arrow making Parquet bloom filters smaller and more effective. Full write up by Adrian Garcia Badaracco on what it means…
Liked by Benjamin Wagner
-
Come find us at Iceberg Summit! Coolest looking team out here 🔥🔥 Win a PS5!!
Liked by Benjamin Wagner
Experience
Education
-
Technical University Munich
-
Master’s Thesis: “Incremental Fusion: Unifying Compiled and Vectorized Query Execution”
Publications
-
Assembling a Query Engine From Spare Parts
CDMS @ VLDB '22, September 9, 2022, Sydney, Australia
Building a new cloud data warehouse is a daunting challenge, requiring massive investments into both the query engine and surrounding cloud infrastructure. Given the mature space, it might seem like a Herculean task to enter the market as a small startup.
At Firebolt we assembled a working, high-performance cloud data warehouse in less than 18 months. We achieved this by building our query engine on top of existing projects and then investing heavily into differentiating features. This paper presents our decision-making and learned lessons along the way.
-
Incremental Fusion: Unifying Compiled and Vectorized Query Execution
ICDE'24, May 13-17, 2024, Utrecht, Netherlands
Modern high-performance analytical query engines follow one of two execution paradigms. Vectorized engines implement an interpreter for relational algebra operators that operates on batches of tuples to maximize performance. Compiling engines, on the other hand, generate optimized and specialized code for every query. This paper unifies these two approaches. We present Incremental Fusion, a novel execution paradigm for modern, high-performance query engines. An Incremental Fusion engine performs operator-fusing code generation – with a twist: The compiling engine generates its own vectorized interpreter. The engine uses a finite set of building blocks below relational algebra for code generation. It can enumerate each building block and generate a vectorized primitive for it. The vectorized interpreter becomes a free byproduct of carefully choosing the right abstraction for code generation. This allows an Incremental Fusion engine to dynamically switch between vectorized interpretation and operator-fusing code generation. We demonstrate Incremental Fusion in our open-source prototype engine InkFuse.
We measure InkFuse against the state-of-the-art vectorized and compiling engines DuckDB and Umbra. InkFuse is able to achieve competitive performance both for low-latency processing, and compute-intensive long-running queries.
-
Self-Tuning Query Scheduling for Analytical Workloads
SIGMOD ’21, June 20–25, 2021, Virtual Event, China
Most database systems delegate scheduling decisions to the operating system. While such an approach simplifies the overall database design, it also entails problems. Adaptive resource allocation becomes hard in the face of concurrent queries. Furthermore, incorporating domain knowledge to improve query scheduling is difficult. To mitigate these problems, many modern systems employ forms of task-based parallelism. The execution of a single query is broken up into small, independent chunks of work (tasks). Now, fine-grained scheduling decisions based on these tasks are the responsibility of the database system. Despite being commonplace, little work has focused on the opportunities arising from this execution model.
In this paper, we show how task-based scheduling in database systems opens up new areas for optimization. We present a novel lock-free, self-tuning stride scheduler that optimizes query latencies for analytical workloads. By adaptively managing query priorities and task granularity, we provide high scheduling elasticity. By incorporating domain knowledge into the scheduling decisions, our system is able to cope with workloads that other systems struggle with. Even at high load, we retain near optimal latencies for short running queries. Compared to traditional database systems, our design often improves tail latencies by more than 10x.
Languages
-
German
Native or bilingual proficiency
-
English
Professional working proficiency
-
French
Elementary proficiency