Jay Kreps
Mountain View, California, United States
12K followers
500+ connections
About
I am a co-founder and CEO at Confluent, a company built around real-time data streams and…
Publications
-
Building a Replicated Logging System with Apache Kafka
Very Large Data Base Endowment Inc. (VLDB Endowment)
Apache Kafka is a scalable publish-subscribe messaging system whose core architecture is a distributed commit log. It was originally built at LinkedIn as its centralized event-pipelining platform for online data integration tasks. Over the past years of developing and operating Kafka, we have extended its log-structured architecture into a replicated logging backbone for much wider application scopes in distributed environments. In this abstract, we discuss our design and engineering experience in replicating Kafka logs for various distributed data-driven systems at LinkedIn, including source-of-truth data storage and stream processing.
Other authors
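For a concrete feel of the replicated-log abstraction the abstract describes, here is a minimal sketch of a client appending to a Kafka log and waiting for the write to be replicated, using the third-party kafka-python client; the broker address and topic name are assumptions for illustration.

```python
# Minimal sketch (assumes the third-party kafka-python client and a
# broker at localhost:9092; the topic name "events" is hypothetical).
from kafka import KafkaProducer

# acks='all' asks the leader to wait until the write reaches all
# in-sync replicas before acknowledging, trading latency for the
# durability a replicated log is meant to provide.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

# Appends are ordered per partition; each record gets a monotonically
# increasing offset in that partition's commit log.
future = producer.send("events", b"user-42 viewed profile user-7")
metadata = future.get(timeout=10)  # blocks until the broker acknowledges
print(metadata.topic, metadata.partition, metadata.offset)

producer.flush()
```
-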
Serving Large-scale Batch Computed Data with Project Voldemort
FAST 2012
Project Voldemort is a general-purpose distributed storage and serving system inspired by Amazon's Dynamo. We present a novel pipeline for computing, deploying, and serving massive read-only data sets that we have integrated into Voldemort. This pipeline builds on the inherent fault tolerance and horizontal scalability of the Dynamo architecture to solve a common problem: performing massive data loads into an online system without impacting serving performance. The data generation is done offline using Hadoop, and our system effectively bridges the gap between batch-oriented clusters and real-time serving systems. As a production system at LinkedIn, this has helped us rapidly build out various data-intensive social products that are computed offline, and then publish the multi-TB result data to live production throughout the day.
Other authors
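The core pattern the paper describes, computing immutable data offline and then swapping it into the serving path without disturbing reads, can be sketched in plain Python. The ReadOnlyStore class, file layout, and paths below are hypothetical illustrations, not Voldemort's actual on-disk format.

```python
# Conceptual sketch of the batch-build-then-swap pattern; all names
# and layouts here are hypothetical, not Voldemort's real format.
import os

class ReadOnlyStore:
    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        self.current = os.path.join(data_dir, "current")  # symlink to live version

    def build_offline(self, version: str, records: dict) -> str:
        """Write an immutable, pre-sorted data set. In production this
        step would run as a Hadoop job, not on the serving node."""
        path = os.path.join(self.data_dir, version)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "data"), "w") as f:
            for key in sorted(records):
                f.write(f"{key}\t{records[key]}\n")
        return path

    def swap(self, new_version_path: str) -> None:
        """Atomically point serving at the new version; readers never
        see a half-loaded store, so serving latency is unaffected."""
        tmp = self.current + ".tmp"
        os.symlink(new_version_path, tmp)
        os.replace(tmp, self.current)  # atomic rename on POSIX

store = ReadOnlyStore("/tmp/voldemort-demo")
v1 = store.build_offline("version-1", {"alice": "profile-a", "bob": "profile-b"})
store.swap(v1)
```
-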
Kafka: A Distributed Messaging System for Log Processing
NetDB 2011
Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.
Other authors
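One design choice the abstract alludes to, consumers pulling from the log and tracking their own offsets so that offline and online readers can coexist, might look roughly like this with the kafka-python client; the topic and group names are assumptions for illustration.

```python
# Minimal sketch of a pull-based consumer (assumes kafka-python and a
# broker at localhost:9092; topic and group names are hypothetical).
from kafka import KafkaConsumer

# Each consumer group tracks its own position in the log, so an
# offline batch job and an online consumer can read the same topic
# independently, each at its own pace.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    group_id="metrics-aggregator",
    auto_offset_reset="earliest",  # replay from the start of the retained log
)

for record in consumer:
    # Offsets make consumption restartable: a consumer that crashes can
    # resume from its last committed position instead of losing data.
    print(record.partition, record.offset, record.value)
```
-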
Aerial LiDAR Data Classification Using Support Vector Machines
Third International Symposium on 3D Data Processing, Visualization, and Transmission
We classify 3D aerial LiDAR scattered height data into buildings, trees, roads, and grass using the support vector machine (SVM) algorithm. To do so we use five features: height, height variation, normal variation, LiDAR return intensity, and image intensity. We also use only LiDAR-derived features to organize the data into three classes (the road and grass classes are merged). We have implemented and experimented with several variations of the SVM algorithm with soft-margin classification to allow for the noise in the data. We have applied our results to classify aerial LiDAR data collected over approximately 8 square miles. We visualize the classification results along with the associated confidence using a variation of the SVM algorithm that produces probabilistic classifications. We observe that the results are stable and robust. We compare the results against the ground truth and obtain higher than 90% accuracy and convincing visual results.
Other authors
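A rough modern equivalent of the soft-margin, probability-producing SVM setup described above can be sketched with scikit-learn; the data here is synthetic, standing in for the paper's five features.

```python
# Sketch of a soft-margin SVM with probabilistic outputs, in the
# spirit of the paper, using scikit-learn on synthetic data. The five
# feature columns stand in for height, height variation, normal
# variation, LiDAR return intensity, and image intensity.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # synthetic 5-feature samples
y = rng.integers(0, 4, size=200)     # classes: building/tree/road/grass

# C controls the soft margin (a lower C tolerates more noisy points);
# probability=True fits a calibration step so predict_proba reports a
# per-class confidence, which the paper visualizes alongside labels.
clf = SVC(kernel="rbf", C=1.0, probability=True)
clf.fit(X, y)

labels = clf.predict(X[:3])
confidence = clf.predict_proba(X[:3])  # one probability per class
print(labels, confidence.round(2))
```
-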
Avatara: OLAP for Web-scale Analytics Products
VLDB 2012 - International Conference on Very Large Databases
Projects
-
Apache Incubator Samza
Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.
Other creators
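Conceptually, a Samza task is a callback invoked once per message, with the framework routing output back to pub-sub topics. The Python sketch below is only an analogue of that model (Samza's actual API is Java), and every name in it is illustrative.

```python
# Rough Python analogue of the Samza processing model described above:
# a per-message task callback, with output routed back to pub-sub
# topics. Samza's real API is Java; all names here are illustrative.
from collections import Counter

class PageViewCounterTask:
    """A stream task: consumes page-view events, emits running counts."""
    def __init__(self):
        self.counts = Counter()

    def process(self, message: dict, send) -> None:
        # The framework calls process() once per input message and
        # handles partitioning, checkpointing, and topic routing.
        self.counts[message["page"]] += 1
        send("page-view-counts",
             {"page": message["page"], "count": self.counts[message["page"]]})

# A toy "job runner" standing in for Samza plus Kafka:
def run(task, input_stream, outputs):
    for msg in input_stream:
        task.process(msg, lambda topic, m: outputs.append((topic, m)))

out = []
run(PageViewCounterTask(), [{"page": "/home"}, {"page": "/home"}], out)
print(out)  # counts for "/home" rise from 1 to 2 across the two events
```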
Organizations
-
Apache Software Foundation
Member
- Present
Recommendations received
1 person has recommended Jay