Duo Ding

Menlo Park, California, United States
4K followers 500+ connections

About

Managing Research/Engineering teams at AI Startup:
LLM training: post-training [MoE…

Experience

  • Cresta

    Sunnyvale, California, United States

  • -

    San Francisco Bay Area

  • -

    Sunnyvale, California, United States

  • -

    Cupertino, California, United States

  • -

    Cupertino, CA

  • -

    Boston, Massachusetts, United States

  • -

    Shanghai Jiao Tong University

Education

  • Carnegie Mellon University

    Activities and Societies: Co-Director, CMU Summit New Venture Competition, April 2012; President, Chinese Student and Scholar Association (CSSA) at Carnegie Mellon University, 2012-2013; Organizing Committee, 2012 LTI Student Research Symposium, School of Computer Science, Carnegie Mellon University, August 2012.

Publications

  • Beyond Audio and Video Retrieval: Topic Oriented Multimedia Summarization

    International Journal of Multimedia Information Retrieval, 2012.

    Consumer-grade video is becoming abundant on the Internet, and it is now easier than ever to download multimedia material of any kind and quality. With cellphones now featuring video recording capability along with broadband connectivity, multimedia material can be recorded and distributed across the world just as easily as text could just a couple of years ago. The easy availability of vast amounts of text gave a huge boost to the Natural Language Processing (NLP) research community, which was critical in order to organize the amount of information that was suddenly available. The above-mentioned multimedia material is set to do the same for multi-modal audio and video analysis and generation, and in this paper we will argue that natural language can play a big role in organizing this information. We see this as a first step towards systems that will be able to discriminate visually similar, but semantically different videos, compare two videos and provide textual output or summarize a large number of videos at once. In this paper, we introduce our approach of solving the TOMS problem. We extract various visual concept features, environmental sounds and ASR transcription features from a given video, and develop a template-based natural language generation system to produce a textual recounting based on the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.

    Other authors
    • Florian Metze
    • Ehsan Younessian
    • Alexander Hauptmann
  • Informedia E-Lamp@TRECVID 2012 Multimedia Event Detection and Recounting (MED and MER)

    In Proceedings of the 2012 National Institute of Standards and Technology (NIST) TREC Video Retrieval Evaluation Workshop, Gaithersburg, MD, USA.

    We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. The MED system consists of three main steps: feature extraction, detector training, and fusion. In feature extraction, we extract many low-level features, high-level features, and text features. These features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model Super Vector. For detector training and fusion, two classifiers and three fusion methods are employed. Results from both the official sources and our internal evaluations show good performance of our system. Our MER system takes some of the features and detection results from the MED system and generates the recounting from them.

    Other authors
    • Shoou-I Yu
    • Zhongwen Xu
    • Waito Sze
    • Francisco Vicente
    • Zhenzhong Lan
    • etc.
  • Event-based Video Retrieval Using Audio.

    In Proceedings of the 13th Annual Conference of the International Speech Communication Association (INTERSPEECH-2012), Portland, Oregon, USA.

    Multimedia Event Detection (MED) is an annual task in the NIST TRECVID evaluation, and requires participants to build indexing and retrieval systems for locating videos in which certain predefined events are shown. Typical systems focus heavily on the use of visual data. Audio data, however, also contains rich information that can be effectively used for video retrieval, and MED could benefit from the attention of researchers in audio analysis. We present several systems for performing MED using only audio data, report the results of each system on the TRECVID MED 2011 development dataset, and compare the strengths and weaknesses of each approach.

    Other authors
    • Qin Jin
    • Peter F. Schulam
    • Shourabh Rawat
    • Susanne Burger
    • Florian Metze
  • Beyond Audio and Video Retrieval: Towards Multimedia Summarization.

    In Proceedings of the 2012 ACM International Conference on Multimedia Retrieval (ICMR-2012), Hong Kong. (Best Paper Nomination)

    Given the deluge of multimedia content that is becoming available over the Internet, it is increasingly important to be able to effectively examine and organize these large stores of information in ways that go beyond browsing or collaborative filtering. In this paper we review previous work on audio and video processing, and define the task of Topic-Oriented Multimedia Summarization (TOMS) using natural language generation: given a set of automatically extracted features from a video (such as visual concepts and ASR transcripts), a TOMS system will automatically generate a paragraph of natural language (“a recounting”), which summarizes the important information in a video belonging to a certain topic area, and provides explanations for why a video was matched and retrieved. We see this as a first step towards systems that will be able to discriminate visually similar, but semantically different videos, compare two videos and provide textual output or summarize a large number of videos at once. In this paper, we introduce our approach to solving the TOMS problem. We extract visual concept features and ASR transcription features from a given video, and develop a template-based natural language generation system to produce a textual recounting based on the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.

    Other authors
    • Florian Metze
    • Shourabh Rawat
    • Peter Franz Schulam
    • Susanne Burger
    • Ehsan Younessian
    • Lei Bao
    • Michael G. Christel
    • Alexander Hauptmann
  • Generating Natural Language Summaries for Multimedia.

    In Proceedings of the 7th International Natural Language Generation Conference (INLG-2012), Demo Session, Starved Rock, IL, USA.

    In this paper we introduce an automatic system that generates textual summaries of Internet-style video clips by first identifying suitable high-level descriptive features that have been detected in the video (e.g. visual concepts, recognized speech, actions, objects, persons, etc.). Then a natural language generator is constructed using SimpleNLG to compile the high-level features into a textual form. The generated summary contains information from both visual and acoustic sources, intending to give a general review and summary of the video. To reduce the complexity of the task, we restrict ourselves to work with videos that show a limited number of “events”. In this demo paper, we describe the design of the system and present example outputs generated by the video summarization system.

    Other authors
    • Florian Metze
    • Shourabh Rawat
    • Peter F. Schulam
    • Susanne Burger
  • Integrate Multilingual Web Search Results using Cross-Lingual Topic Models

    In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-2011), Workshop: Cross Lingual Information Access. Chiang Mai, Thailand.

    With the growth of the Internet, web users today have access to resources around the world in more than 200 different languages. How to effectively manage multilingual web search results has emerged as an essential problem. In this paper, we introduce ongoing work on leveraging a Cross-Lingual Topic Model (CLTM) to integrate multilingual search results. The CLTM detects the underlying topics of results in different languages and uses the topic distribution of each result to cluster them into topic-based classes. In CLTM, we unify distributions at the topic level by direct translation, which distinguishes it from other multilingual topic models that mainly concern parallelism at the document or sentence level (Mimno 2009; Ni, 2009). Experimental results suggest that our CLTM clustering method is effective and outperforms the 6 compared clustering approaches.

  • Tulsa: Web Search for Writing Assistance

    The 34th Annual International ACM SIGIR Conference

    Searching the web while authoring has become a common behavior for many users. Some search the web to research content, while others, especially those writing in a foreign language, search to learn if their usage is appropriate. Can we unify the experiences of search and writing to make authoring more productive? That’s the central question of project Tulsa, which puts the web at writers’ fingertips in a novel writing assistance experience based on implicit web search and natural language techniques. It provides assistance at three levels: word, phrase and paragraph. Tulsa offers web-mined, contextual reference information and suggestions for completing or revising words and phrases. Paragraph analysis is also provided which can detect outlier usage of language in larger chunks of text. Tulsa bases its suggestions and rankings on the Web as Corpus (WaC) through search engine queries, combined with a Support Vector Machine (SVM) trained on N-gram language features of a web-scale language model.

    Other authors
    • Xingping Jiang
    • Matthew R. Scott
    • Ming Zhou
    • Yong Yu

Courses

  • Advanced Algebra
  • Algorithm Analysis and Design
  • Algorithms for Natural Language Processing
  • Applied Machine Learning
  • Artificial Intelligence
  • Compiler Principles
  • Computational Models of Discourse Analysis
  • Computer Network
  • Computer Organization and Architecture
  • Data Structure
  • Digital Logic and Analog Circuit
  • Directed Research 2012
  • Directed Research 2013
  • Graph Theory and Combinatoric
  • Innovation of Science and Technology
  • Language Technologies Institute Colloquium 2012
  • Language Technologies Institute Colloquium 2013
  • Language and Statistics
  • Modern Calculus and Analysis
  • Neural Network Theory and Application
  • Object-Oriented Analysis and Design
  • Operating System
  • Physics
  • Principles of Database System
  • Programming
  • Research Design and Writing
  • Research Seminar in Machine Learning and Policy
  • Scientific and Engineering Computing
  • Self-Paced Lab: Rich Interaction in Virtual World
  • Set Theory and Mathematical Logic
  • Software Engineering for Information Systems I
  • Software Engineering for Information Systems II
  • Speech Recognition and Understanding
  • The Theory of Computability
  • Theory of Western Economics

Languages

  • English

    Professional working proficiency

  • Chinese

    Native or bilingual proficiency
