Jay Kreps

Mountain View, California, United States
12K followers · 500+ connections

About

I am a co-founder and CEO of Confluent, a company built around real-time data streams and…

Experience

  • Anthropic

    San Francisco Bay Area


Publications

  • Building a Replicated Logging System with Apache Kafka

    Very Large Data Base Endowment Inc. (VLDB Endowment)

    Apache Kafka is a scalable publish-subscribe messaging system
    whose core architecture is a distributed commit log.
    It was originally built at LinkedIn as its centralized event
    pipelining platform for online data integration tasks. Over
    the years of developing and operating Kafka, we have extended
    its log-structured architecture into a replicated logging
    backbone for a much wider range of distributed applications.
    In this abstract, we discuss our design and engineering
    experience in replicating Kafka logs for various distributed
    data-driven systems at LinkedIn, including source-of-truth
    data storage and stream processing.

    Other authors
  • The "Big Data" ecosystem at LinkedIn

    SIGMOD 2013 - Special Interest Group on Management Of Data

    Other authors
  • Serving Large-scale Batch Computed Data with Project Voldemort

    FAST 2012

    Project Voldemort is a general-purpose distributed storage and serving system inspired by Amazon's Dynamo. We present a novel pipeline for computing, deploying, and serving massive read-only data sets that we have integrated into Voldemort. This pipeline builds on the inherent fault tolerance and horizontal scalability of the Dynamo architecture to solve a common problem: performing massive data loads into an online system without impacting serving performance. The data generation is done offline using Hadoop, and our system effectively bridges the gap between batch-oriented clusters and real-time serving systems. As a production system at LinkedIn, this has helped us rapidly build out various data-intensive social products whose results are computed offline, and then publish the multi-TB result data to live production throughout the day.

    Other authors
  • Kafka: A Distributed Messaging System for Log Processing

    NetDB 2011

    Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.

    Other authors
  • Aerial LiDAR Data Classification Using Support Vector Machines

    Third International Symposium on 3D Data Processing, Visualization, and Transmission

    We classify 3D aerial LiDAR scattered height data into buildings, trees, roads, and grass using the support vector machine (SVM) algorithm. To do so we use five features: height, height variation, normal variation, LiDAR return intensity, and image intensity. We also use only LiDAR-derived features to organize the data into three classes (the road and grass classes are merged). We have implemented and experimented with several variations of the SVM algorithm with soft-margin classification to allow for the noise in the data. We have applied our results to classify aerial LiDAR data collected over approximately 8 square miles. We visualize the classification results along with the associated confidence using a variation of the SVM algorithm that produces probabilistic classifications. We observe that the results are stable and robust. We compare the results against the ground truth and obtain higher than 90% accuracy and convincing visual results.

    Other authors
    • Suresh Lodha
    • David Helmbold
    • D. Fitzpatrick
    See publication
  • Avatara: OLAP for Web-scale Analytics Products

    VLDB 2012 - International Conference on Very Large Databases

Projects

  • Apache Incubator Samza

    Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.

    Other creators

Organizations

  • Apache Software Foundation

    Member

    - Present
