About
I work on AI systems for cybersecurity, with a focus on LLM agents, evaluation…
Articles by Cristian
Activity
5K followers
Experience
Education
-
Columbia University in the City of New York
4.1/4.0
-
Activities and Societies: Executive Member of Columbia Data Science Society; Associate of Applied Analytics Club; Associate of Business Management Club
Relevant Coursework: Machine Learning for Finance, Cloud Computing (AWS), Managing Data, Storytelling with Data, Applied Analytics Frameworks and Methods, Research Design, Analytics and Leading Change, Applied Analytics in Organizational Context, Strategy & Analytics
-
-
-
-
-
Licenses & Certifications
Volunteer Experience
Publications
-
SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only whether agents reach correct triage decisions, but whether they discover novel evidence through active investigation. To construct SIR-Bench, we develop Once Upon A Threat (OUAT), a framework that replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes. Our evaluation methodology introduces three complementary metrics: triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3), assessed through an adversarial LLM-as-Judge that inverts the burden of proof -- requiring concrete forensic evidence to credit investigations. Evaluating our SIR agent on the benchmark demonstrates 97.1% true positive (TP) detection, 73.4% false positive (FP) rejection, and 5.67 novel key findings per case, establishing a baseline against which future investigation agents can be measured.
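The three metrics can be illustrated with a small aggregation sketch. This is not the benchmark's code; the field names and per-case structure are hypothetical, standing in for the verdicts an LLM-as-Judge might emit (it assumes both TP and FP cases are present):

```python
def aggregate_metrics(cases):
    """Aggregate hypothetical per-case judge verdicts into SIR-Bench-style
    headline numbers: TP detection rate, FP rejection rate, and mean novel
    findings per case."""
    tp_cases = [c for c in cases if c["label"] == "true_positive"]
    fp_cases = [c for c in cases if c["label"] == "false_positive"]
    # M1: triage accuracy, reported separately for TP and FP populations
    m1_tp = sum(c["triage_correct"] for c in tp_cases) / len(tp_cases)
    m1_fp = sum(c["triage_correct"] for c in fp_cases) / len(fp_cases)
    # M2: novel evidence discovered through active investigation
    m2 = sum(len(c["novel_findings"]) for c in cases) / len(cases)
    return {"tp_detection": m1_tp, "fp_rejection": m1_fp,
            "novel_findings_per_case": m2}
```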
-
Survey of Attention Mechanisms in Encoder-Only Language Models
The introduction of the Transformer architecture in 2017 catalyzed a fundamental paradigm shift in natural language processing (Vaswani et al., 2017). While decoder-only autoregressive models have come to dominate generative AI, encoder-only bidirectional models have historically established and maintained state-of-the-art results in natural language understanding, dense information retrieval, and sequence classification. This paper presents an exhaustive survey of the self-attention mechanism within encoder-only models. We analyze its mathematical foundations, trace its architectural evolution from the original BERT (Devlin et al., 2018) through RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and ModernBERT (Warner et al., 2024), evaluate efficiency enhancements including sparse attention (Zaheer et al., 2020), linear kernelized attention, and hardware-aware FlashAttention (Dao et al., 2022), and dissect the ongoing theoretical debates surrounding attention-based interpretability. We further survey hybrid architectures that interleave state-space models with self-attention, and discuss the practical limits and deployment considerations of encoder attention in long-context regimes. To complement this survey, we open-source an interactive visualization tool for exploring these architectures at https://github.com/cristianleoo/attention-in-encoders.
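For readers new to the topic, the core operation the survey analyzes — bidirectional scaled dot-product self-attention — can be sketched in a few lines of NumPy (a minimal single-head illustration, not the survey's code):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Vaswani et al., 2017).
    X: (seq_len, d_model); Wq/Wk project to d_k, Wv projects to d_v.
    Bidirectional (encoder-style): no causal mask is applied."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (seq_len, d_v)
```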
-
Geometric Concept Spaces in Small Encoders: A Comparative Mechanistic Probing of ModernBERT and DeBERTa-v3
Bidirectional transformer encoders have bifurcated into two optimization paradigms: topological precision via disentangled attention (DeBERTa-v3) and hardware-aware scaling via rotary positional embeddings (ModernBERT). This study presents an exhaustive geometric and mechanistic investigation of these architectures using 100,000 activation samples. Through linear probing, Centered Kernel Alignment (CKA), and intrinsic dimensionality estimation, we reveal a 16.5% performance gap in linear concept separability favoring DeBERTa-v3 (p < 0.001). We identify an extreme "Topological Collapse" in ModernBERT's final layers, where concept manifolds condense from 30 dimensions to 2. We quantify a fundamental stability-precision trade-off: ModernBERT's RoPE provides 4.3x higher local positional stability but induces severe semantic entanglement, while DeBERTa-v3 utilizes sparse, specialized sub-circuits to maintain precise orthogonal boundaries. Our findings provide a rigorous geometric explanation for the "token classification anomaly" in modern encoders.
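Of the probing tools mentioned, linear CKA is compact enough to sketch. Below is a minimal version following the standard linear-CKA formula (not the study's own code); a score of 1.0 means the two representations agree up to orthogonal transform and isotropic scaling:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of
    shape (n_samples, features). Invariant to rotation and uniform scaling."""
    X = X - X.mean(axis=0)          # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```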
Courses
-
Applied Analytics in Organizational Context
APANPS 5100
-
Business Process Modeling
-
-
Cloud Computing
APANPS 5450
-
Financial Analysis
-
-
Frameworks & Methods
APANPS 5205
-
Managing Data
APANPS 5400
-
Negotiating in English
-
-
Persuading in English
-
-
Research Design
APANPS 5300
-
Social Media Management
-
-
Storytelling with Data
APANPS 5800
-
Strategy & Analytics
APANPS 5600
Projects
-
Scrap Metal Directional Price Prediction
-
This project focuses on analyzing financial news sentiment and using machine learning models to predict stock prices from that sentiment. It integrates data from various sources, including financial news articles, stock prices, economic indicators, and weather data. Sentiment analysis is performed with two models, FinBERT and GPT (Generative Pre-trained Transformer), and the price-prediction model employs the CatBoost algorithm.
-
Quant AI
-
As you're undoubtedly aware, the vast volume of news and rumors that emerge daily can be overwhelming, rendering it practically impossible for an individual to thoroughly process each piece of information.
To counter this challenge, we have integrated cutting-edge LLMs to develop an innovative application designed to assist users in comprehending market sentiment.
Our application harnesses the power of user-specified sources, processing and analyzing vast amounts of data with exceptional accuracy.
It leverages the capabilities of LSTM models, a type of recurrent neural network well-suited for sequence prediction problems, to predict market trends.
This integration of LLMs and LSTM models provides a robust and comprehensive solution to keep up with the pace of real-time information flow, resulting in a powerful tool for understanding and predicting market sentiment.
The ultimate goal is to empower our users to make informed decisions based on accurate, up-to-date, and predictive insights. (Initially built for Tribe AI Hackathon). -
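The LSTM component referenced above follows the standard cell equations; a minimal NumPy forward step (an illustration of the mechanism, not the project's implementation) looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell step. W: (4h, d), U: (4h, h), b: (4h,), with the four
    gate blocks stacked as [input gate, forget gate, cell candidate, output gate]."""
    z = W @ x + U @ h_prev + b
    h = h_prev.shape[0]
    i, f, g, o = z[:h], z[h:2*h], z[2*h:3*h], z[3*h:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # gated cell state update
    h_new = sigmoid(o) * np.tanh(c)                    # hidden state output
    return h_new, c
```

Iterating this step over a sequence of daily feature vectors, and reading a prediction off the final hidden state, is the essence of using an LSTM for market-trend sequence prediction.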
Prediction of Wild Blueberry Yield | Kaggle Competition | Top 1.5% Leaderboard
-
This data science project involves regression analysis using two models: LADRegression and LightGBM.
The project includes data preprocessing, feature engineering using Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression, hyperparameter tuning using grid search, and model evaluation.
In particular, the project performs several predictions using different models, and then stacks them using Least Absolute Deviation (LAD) regression constrained to positive coefficients.
The goal is to develop accurate regression models and optimize their performance on the given dataset.
I created this notebook for a Kaggle competition, where I placed in the top 1.5% of the leaderboard. -
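The positively-constrained stacking step can be illustrated with projected gradient descent, a generic sketch of nonnegative blending rather than the notebook's exact code (which uses LAD loss; squared error is substituted here for brevity):

```python
import numpy as np

def stack_nonnegative(preds, y, lr=0.05, steps=2000):
    """Blend base-model predictions with nonnegative weights by projected
    gradient descent on mean squared error.
    preds: (n_samples, n_models) out-of-fold predictions; y: (n_samples,)."""
    w = np.full(preds.shape[1], 1.0 / preds.shape[1])  # start from equal blend
    for _ in range(steps):
        grad = preds.T @ (preds @ w - y) / len(y)
        w = np.clip(w - lr * grad, 0.0, None)          # project onto w >= 0
    return w
```

The nonnegativity constraint is the key design choice: it prevents the stacker from shorting one base model against another, which tends to generalize better on small tabular competitions.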
The Tinder Of Food
-
This project is an interactive Flask web application that displays a map of New York City and allows users to query it, along with a recommendation algorithm that matches suppliers to restaurants.
The application uses a combination of Python, HTML, CSS, and JavaScript. The data is stored using Apache Spark and MongoDB. -
LSTM to predict the number of weekly appointments for Columbia University
-
In this data science project I predicted the number of weekly appointments for Columbia University. The goal was to determine whether temporary staff needed to be hired based on the predicted appointment volume. To achieve this, I first performed feature engineering, analyzing changes in the characteristics of students over time to identify patterns and trends that could affect the number of appointments. I then trained a Long Short-Term Memory (LSTM) deep learning model on the historical appointment data to learn from those patterns and trends. Using the LSTM model, I was able to predict the weekly appointment volume for the next two months, information Columbia University can use to make informed decisions about hiring temporary staff and managing resources efficiently.
-
Predicting Tesla Stock from Elon Musk's tweets
-
In this data science project I predicted the fluctuation of Tesla stock from Elon Musk's tweets.
The goal of this project was to predict if the stock would decrease or increase based on the previous day's Elon Musk tweets.
To achieve this, I first performed data preprocessing on the tweets column, which involved removing URLs, stripping whitespace, removing non-alphabetic characters, stemming words, and creating a TF-IDF matrix.
Secondly, I performed exploratory analysis to visualize correlations between words and stock fluctuations, the top words used in the tweets, and so on.
Then I engineered features: I right-merged the Tesla stock dataset (imported via the Yahoo Finance API), grouped the tweets by day, concatenating all tweets from the same day, and extracted additional information such as the number of tweets per day, the average tweet length, the number of emoji used, and so on.
After that, I performed sentiment analysis with VADER, which provided useful sentiment scores.
Finally, I performed data modeling: I used an LSTM model on the stock data and a neural network on the NLP data, then combined their outputs with a second neural network ending in a single output layer. -
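The cleanup-then-TF-IDF step described above can be sketched in pure Python. This is a toy version mirroring the listed preprocessing (URL removal, lowercasing, non-alphabetic stripping), with stemming omitted for brevity:

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Build a toy TF-IDF matrix: strip URLs, lowercase, keep alphabetic
    tokens only, then weight term frequency by log inverse document frequency."""
    clean = [re.sub(r"http\S+", " ", d.lower()) for d in docs]
    tokens = [re.findall(r"[a-z]+", d) for d in clean]
    vocab = sorted({t for doc in tokens for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokens for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in vocab}
    matrix = []
    for doc in tokens:
        tf = Counter(doc)
        matrix.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, matrix
```

Note that a term appearing in every document (here, "tesla") gets an IDF of zero, which is exactly why document frequency weighting helps separate signal words from boilerplate.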
Using Deep Learning to predict disasters from Twitter
-
In this project, I use the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to classify tweets as disaster-related or not. I train the model with a combination of cross-entropy loss and mixup regularization, and use early stopping to prevent overfitting. Overall, this project demonstrates transfer learning and fine-tuning with BERT for natural language processing tasks.
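Mixup itself is simple enough to sketch. This is the generic formulation (Zhang et al., 2018), shown on raw arrays for illustration; the project applies the same idea during BERT fine-tuning:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup regularization: a convex combination of an example pair and
    their one-hot labels, with the mixing coefficient drawn from Beta(a, a)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Training on these interpolated examples encourages linear behavior between classes, which acts as a regularizer and typically improves robustness on small labeled sets like disaster-tweet corpora.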
-
Building a Recommendation System with Machine Learning
-
The project predicts the popularity of recipes, informing whether the recommendation system should surface a given recipe more or less often. It involves data validation, data cleaning, exploratory analysis, feature engineering, and data modeling with genetic tuning algorithms, SGD classification, XGBoost classification, random forest classification, and model stacking.
-
API application (12-Twenty & Google Cloud) - Columbia University
-
Created an application to automate the extraction of data from the University database, data cleaning, feature engineering, and posting data to Google Data Studio. The application uses two main API endpoints: 12-Twenty and Google Cloud.
Languages
-
English
Full professional proficiency
-
Italian
Native or bilingual proficiency
-
Spanish
Full professional proficiency
-
Chinese
Limited working proficiency