<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Mez Gebre</title>
 <link href="http://mez.sh/atom.xml" rel="self"/>
 <link href="http://mez.sh/"/>
 <updated>2026-04-19T00:18:23+00:00</updated>
 <id>http://mez.sh/</id>
 <author>
   <name>Mez Gebre</name>
 </author>

 
 <entry>
   <title>Audio Classification: A Weekend of Experiments (FAAAAHH!)</title>
   <link href="http://mez.sh/2026/02/17/audio-classification-ssl-deep-dive/"/>
   <updated>2026-02-17T00:00:00+00:00</updated>
   <id>http://mez.sh/2026/02/17/audio-classification-ssl-deep-dive</id>
   <content type="html">&lt;p&gt;I went down an audio classification rabbit hole. What started as “how hard can ESC-50 be?” turned into a weekend of building CNNs, Transformers, and finally catching up on self-supervised learning, only a few years behind everyone else.&lt;/p&gt;

&lt;p&gt;This post documents everything I tried, what worked, what didn’t, and the lessons learned.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-dataset-esc-50&quot;&gt;The Dataset: ESC-50&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/karolpiczak/ESC-50&quot;&gt;ESC-50 dataset&lt;/a&gt; contains:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;2,000 audio clips&lt;/strong&gt; (5 seconds each)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;50 classes&lt;/strong&gt; of environmental sounds&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;5 folds&lt;/strong&gt; for cross-validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With only 40 samples per class, this is a challenging dataset that rewards good representations over brute-force memorization.&lt;/p&gt;
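
&lt;p&gt;For reference, here’s roughly how a fold-based split looks (a minimal sketch; it assumes the repo’s meta/esc50.csv layout):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd

# Hold out one fold for validation (assumes the ESC-50 repo&apos;s meta/esc50.csv layout)
meta = pd.read_csv(&apos;ESC-50/meta/esc50.csv&apos;)
val_fold = 1
train_df = meta[meta[&apos;fold&apos;] != val_fold]
val_df = meta[meta[&apos;fold&apos;] == val_fold]   # 400 clips held out, 8 per class
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;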

&lt;hr /&gt;

&lt;h2 id=&quot;the-supervised-baseline&quot;&gt;The Supervised Baseline&lt;/h2&gt;

&lt;h3 id=&quot;resnet-34-from-scratch&quot;&gt;ResNet-34 from Scratch&lt;/h3&gt;

&lt;p&gt;Classic approach: convert audio to mel spectrograms and treat them as images for a CNN.&lt;/p&gt;
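
&lt;p&gt;The preprocessing is roughly this (a minimal torchaudio sketch; the exact parameter values here are illustrative, not necessarily what I settled on):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torchaudio

# Waveform in, log-mel &quot;image&quot; out (parameter values are illustrative)
waveform, sr = torchaudio.load(&apos;clip.wav&apos;)   # ESC-50 clips are 5 seconds long
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=512, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
spec = to_db(to_mel(waveform))               # (1, n_mels, time), treated as a one-channel image
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;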

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;optimizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdamW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight_decay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;scheduler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OneCycleLR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 80.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;test-time-augmentation&quot;&gt;Test-Time Augmentation&lt;/h3&gt;

&lt;p&gt;I noticed validation performance was inconsistent, so I added TTA, averaging predictions across multiple augmented versions of each test sample.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;predict_with_tta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spectrogram&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_augments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_augments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;aug_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply_augmentation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spectrogram&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aug_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 83.5% accuracy&lt;/strong&gt; (+3% from TTA alone!)&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;transfer-learning-detour&quot;&gt;Transfer Learning Detour&lt;/h2&gt;

&lt;p&gt;Before diving into SSL, I tried transfer learning with ImageNet-pretrained EfficientNet-B0.&lt;/p&gt;
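
&lt;p&gt;The setup is the usual transfer-learning recipe: swap the first conv to accept a single channel and replace the classifier head (a minimal sketch assuming the timm library):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import timm

# in_chans=1 rebuilds the first conv for single-channel spectrograms; num_classes=50 swaps the head
model = timm.create_model(&apos;efficientnet_b0&apos;, pretrained=True, in_chans=1, num_classes=50)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;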

&lt;p&gt;&lt;strong&gt;Result: 81.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This performed &lt;em&gt;worse&lt;/em&gt; than my from-scratch ResNet. ImageNet features (edges, textures, objects) don’t transfer perfectly to spectrograms. Good to know.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;self-supervised-learning-experiments&quot;&gt;Self-Supervised Learning Experiments&lt;/h2&gt;

&lt;p&gt;This is where things got interesting. I wanted to understand &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; SSL works, not just use pretrained models.&lt;/p&gt;

&lt;h3 id=&quot;simclr-contrastive-learning&quot;&gt;SimCLR (Contrastive Learning)&lt;/h3&gt;

&lt;p&gt;The idea: learn representations by pulling augmented views of the same audio together while pushing different audios apart.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Audio → Mel Spectrogram → ResNet Encoder → Projection Head → Contrastive Loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key components:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;NT-Xent Loss:&lt;/strong&gt; Normalized Temperature-scaled Cross Entropy&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Projection Head:&lt;/strong&gt; 512 → 512 → 128 (discarded after pretraining)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Temperature:&lt;/strong&gt; 0.5&lt;/li&gt;
&lt;/ul&gt;
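
&lt;p&gt;The NT-Xent loss is short enough to show in full. A minimal sketch of the standard formulation (not my exact training code):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: (N, D) projections of two augmented views of the same N clips
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float(&apos;-inf&apos;))                    # ignore self-similarity
    # the positive for row i is its augmented twin: i+N for the first half, i-N for the second
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, targets)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;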

&lt;p&gt;&lt;strong&gt;Result: 82.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But here’s the catch: the contrastive task itself reached &lt;strong&gt;98% accuracy during pretraining&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model was solving a nearly trivial task.&lt;/p&gt;

&lt;p&gt;Environmental sounds in ESC-50 are already highly separable in the spectral domain (dog bark vs chainsaw vs rain, etc.). With relatively mild augmentations, the identity of each sound barely changes. That means the encoder can solve the contrastive objective using coarse cues (energy bands, temporal envelope) instead of learning robust, invariant representations.&lt;/p&gt;

&lt;p&gt;In other words, the model learned:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“these sounds are different”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;instead of:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“these two transformed versions of the same sound are meaningfully the same”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Contrastive learning only works when the task is hard enough. You need augmentations and negatives that force the model to learn invariances, not shortcuts.&lt;/p&gt;

&lt;p&gt;Even with this limitation, the learned representation still reached 82.5% accuracy, which is competitive with standard supervised CNN baselines on ESC-50.&lt;/p&gt;

&lt;h3 id=&quot;masked-prediction-beats-style&quot;&gt;Masked Prediction (BEATs-style)&lt;/h3&gt;

&lt;p&gt;Inspired by BERT and BEATs: hide patches of the spectrogram and predict discrete tokens.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This was a simplified, educational implementation to understand the BEATs framework, not a full reproduction. The real BEATs includes iterative tokenizer refinement, larger models (90M+ params), and training on millions of samples. My goal was to grasp the core concepts: patch embeddings, masking, and discrete token prediction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Spectrogram → Patch Embedding → Encoder → Predict Masked Tokens
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The approach:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Split spectrogram into patches (16×16)&lt;/li&gt;
  &lt;li&gt;Randomly mask 40% of patches&lt;/li&gt;
  &lt;li&gt;Predict the cluster ID of masked patches&lt;/li&gt;
&lt;/ol&gt;
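
&lt;p&gt;Step 2, the random masking, is the only mildly fiddly part. A minimal sketch with illustrative tensor names:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

# patches: (batch, num_patches, dim); mask_token: a learned (dim,) embedding
def mask_patches(patches, mask_token, mask_ratio=0.4):
    num_patches = patches.size(1)
    num_masked = int(mask_ratio * num_patches)
    idx = torch.randperm(num_patches)[:num_masked]   # which patches to hide
    masked = patches.clone()
    masked[:, idx] = mask_token                      # replace them with the [MASK] embedding
    return masked, idx                               # idx tells the loss which tokens to predict
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;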

&lt;p&gt;&lt;strong&gt;The tokenizer (codebook):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;K-means clustering on spectrogram patches&lt;/li&gt;
  &lt;li&gt;Each patch gets assigned to nearest centroid&lt;/li&gt;
  &lt;li&gt;Model predicts which cluster the masked patch belongs to&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Tokenizing a patch: patch is a (D,) tensor, codebook is a (K, D) tensor of k-means centroids
distances = ((patch - codebook) ** 2).sum(dim=-1)   # Squared distance to each centroid
token = distances.argmin()                          # Closest centroid = token ID
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 74.5% accuracy&lt;/strong&gt; (with CNN encoder)&lt;/p&gt;

&lt;p&gt;Underperformed the supervised baseline. Hmm.&lt;/p&gt;

&lt;h3 id=&quot;transformer-beats&quot;&gt;Transformer-BEATs&lt;/h3&gt;

&lt;p&gt;I tried a full Transformer encoder instead of CNN.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;6 Transformer layers&lt;/li&gt;
  &lt;li&gt;256 embedding dimension&lt;/li&gt;
  &lt;li&gt;8 attention heads&lt;/li&gt;
  &lt;li&gt;~5M parameters&lt;/li&gt;
&lt;/ul&gt;
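
&lt;p&gt;Nothing exotic, basically the stock PyTorch encoder (a minimal sketch of the configuration above; the feed-forward width is an assumption):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # roughly 5M parameters at this size
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;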

&lt;p&gt;&lt;strong&gt;Result: 40% accuracy&lt;/strong&gt; (nearly random, FAAAHH!!)&lt;/p&gt;

&lt;p&gt;This was humbling. Transformers need &lt;em&gt;massive&lt;/em&gt; amounts of data. With only 1,600 training samples, it couldn’t learn meaningful patterns. CNNs have stronger inductive biases that help with limited data.&lt;/p&gt;

&lt;hr /&gt;

&lt;div style=&quot;max-width: 400px; margin: 2rem auto;&quot;&gt;
  &lt;div class=&quot;tenor-gif-embed&quot; data-postid=&quot;13987821127347517247&quot; data-share-method=&quot;host&quot; data-aspect-ratio=&quot;1.74074&quot; data-width=&quot;100%&quot;&gt;&lt;a href=&quot;https://tenor.com/view/call-an-ambulance-but-not-for-me-but-not-for-me-call-an-ambulance-gif-13987821127347517247&quot;&gt;Call An Ambulance But Not For Me GIF&lt;/a&gt;from &lt;a href=&quot;https://tenor.com/search/call+an+ambulance+but+not+for+me-gifs&quot;&gt;Call An Ambulance But Not For Me GIFs&lt;/a&gt;&lt;/div&gt;
  &lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://tenor.com/embed.js&quot;&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-beats-revelation&quot;&gt;The BEATs Revelation&lt;/h2&gt;

&lt;p&gt;After all my experiments, I tried Microsoft’s BEATs model, pretrained on AudioSet (2 million+ clips).&lt;/p&gt;

&lt;p&gt;I tested two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frozen encoder:&lt;/strong&gt; Only train a new classifier head on top.&lt;br /&gt;
&lt;strong&gt;Result: 94.50% accuracy&lt;/strong&gt;&lt;/p&gt;
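
&lt;p&gt;“Frozen” here literally means a couple of lines of setup (a minimal sketch; the model layout is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

# Freeze the pretrained encoder, train only the new classifier head
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;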

&lt;p&gt;&lt;strong&gt;Fine-tuned with differential learning rates:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;param_groups&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;params&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encoder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;lr&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Slow for pretrained
&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;params&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;lr&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Fast for new head
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 95.25% accuracy&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;p&gt;The frozen approach gets you 94.5% with minimal compute. The AudioSet pretraining is &lt;em&gt;that&lt;/em&gt; good. Fine-tuning squeezes out another 0.75%, which matters if you’re chasing leaderboards but honestly? Frozen is probably fine for most use cases.&lt;/p&gt;

&lt;p&gt;Even though I knew I probably wouldn’t make a dent against the pretrained model, I had to try it.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;things-i-had-to-figure-out&quot;&gt;Things I Had to Figure Out&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What are “tokens” in audio?&lt;/strong&gt; In BEATs, tokens are discrete IDs representing audio patterns. K-means clustering groups similar spectrogram patches, and each patch gets a cluster ID. It’s like creating a vocabulary of audio building blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why iterate on the tokenizer?&lt;/strong&gt; Clever trick: the tokenizer and model improve each other. Each iteration creates more meaningful tokens, forcing the model to learn finer distinctions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why two learning rates?&lt;/strong&gt; Differential rates balance preservation and adaptation. Without this, you either destroy pretrained features (high LR) or the classifier never converges (low LR).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;final-results&quot;&gt;Final Results&lt;/h2&gt;

&lt;table style=&quot;width:100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;background: #2a2a2a; color: #fff;&quot;&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: left;&quot;&gt;Method&lt;/th&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;Accuracy&lt;/th&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: left;&quot;&gt;Approach&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;background: #e8f5e9;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px; font-weight: bold;&quot;&gt;BEATs (fine-tuned)&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center; font-weight: bold;&quot;&gt;95.25%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Pretrained + differential LR&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f1f8e9;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;BEATs (frozen)&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;94.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Pretrained, classifier only&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #fff;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;ResNet-34 + TTA&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;83.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Supervised baseline&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f5f5f5;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;SimCLR&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;82.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Contrastive SSL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #fff;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;EfficientNet-B0&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;81.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;ImageNet transfer&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f5f5f5;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;CNN-BEATs&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;74.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Masked prediction&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #ffebee;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Transformer-BEATs&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;40.00%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Tomfoolery 💀&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The gap between my from-scratch SSL attempts and the pretrained BEATs tells the whole story.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scale matters enormously for SSL.&lt;/strong&gt; My weekend experiments couldn’t match 2M-sample pretraining, but they taught me &lt;em&gt;how&lt;/em&gt; these methods work.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Contrastive learning needs hard negatives.&lt;/strong&gt; Trivial pretext tasks don’t force useful representations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transformers are data-hungry.&lt;/strong&gt; CNNs win on small datasets.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TTA is free accuracy.&lt;/strong&gt; 3% boost for minimal effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I love jumping between domains: contrastive learning from NLP, masked prediction from BERT, spectrogram tricks from signal processing. The dots connect in unexpected ways.&lt;/p&gt;

&lt;p&gt;The real takeaway? Running these experiments taught me more than any tutorial ever could. Sometimes you have to build the thing yourself to understand why it works, and why it doesn’t.&lt;/p&gt;

&lt;p&gt;What surprised me most is that self-supervised learning doesn’t automatically produce meaningful representations. It will happily learn shortcuts if the task allows it. Designing the right objective turns out to matter just as much as the model itself.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I'm Back</title>
   <link href="http://mez.sh/2026/02/16/im-back/"/>
   <updated>2026-02-16T00:00:00+00:00</updated>
   <id>http://mez.sh/2026/02/16/im-back</id>
   <content type="html">&lt;p&gt;It’s been a while. Almost nine years since my last post. Life happened, priorities shifted, and this blog sat quietly collecting dust.&lt;/p&gt;

&lt;p&gt;A lot has changed. I have a family now, including a 13-month-old daughter who’s been teaching me more about persistence and curiosity than any codebase ever could.&lt;/p&gt;

&lt;p&gt;But I’ve missed having a place to think out loud. So here I am, back at it.&lt;/p&gt;

&lt;p&gt;No grand promises, just an intention: write more, share ideas, and see where it goes.&lt;/p&gt;

&lt;p&gt;Let’s ship.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>HomographyNet: Deep Image Homography Estimation</title>
   <link href="http://mez.sh/2017/07/21/homographynet-deep-image-homography-estimation/"/>
   <updated>2017-07-21T00:00:00+00:00</updated>
   <id>http://mez.sh/2017/07/21/homographynet-deep-image-homography-estimation</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Today we are going to talk about a paper I read a month ago titled &lt;a href=&quot;https://arxiv.org/abs/1606.03798&quot;&gt;Deep Image Homography Estimation&lt;/a&gt;. It is a paper that presents a deep convolutional neural network for estimating the relative homography between a pair of images.&lt;/p&gt;

&lt;h3 id=&quot;what-is-a-homography&quot;&gt;&lt;em&gt;What is a Homography?&lt;/em&gt;&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;In projective geometry, a homography is an isomorphism of projective spaces, induced by an isomorphism of the vector spaces from which the projective spaces derive. - &lt;a href=&quot;https://en.wikipedia.org/wiki/Homography&quot;&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you understood that, go ahead and skip directly to the deep learning model discussion. For the rest of us, let’s go ahead and learn some terminology. If you can forget about that definition from Wikipedia for a second, let’s use some analogies to at least get a high-level idea.&lt;/p&gt;

&lt;h3 id=&quot;isomorphism&quot;&gt;Isomorphism?&lt;/h3&gt;
&lt;p&gt;If you decide to bake a cake, the process of taking the ingredients and following the recipe to create the cake could be seen as morphing. The recipe allowed you to morph/transition from ingredients to a cake. If you’re thinking wtf, just stick with me a bit longer. Now, imagine the same recipe could allow you to take a cake and morph/transition back to the ingredients if you follow it in reverse! In a nutshell, that is the definition of an isomorphism: a recipe that lets you transition between the two, loosely speaking.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Isomorphism in mathematics is morphism or a mapping (recipe) that can also give you the inverse, again loosely speaking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;projective-geometry&quot;&gt;Projective geometry?&lt;/h3&gt;
&lt;p&gt;I don’t have a clever analogy here, but I’ll give you a simple example to tie it all together. Imagine you’re driving and your dash cam snapped a picture of the road in front of you (let’s call this pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;); at the same time, imagine there was a drone right above you and it also took a picture of the road in front of you (let’s call this pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;). You can see that pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; are related, but how? They’re both pictures of the road in front of you; the only difference is the perspective! The big question….&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Is there a recipe/isomorphism that can take you from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; and vice versa?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There you go, the question you just asked is what a Homography tries to answer. Homography is an isomorphism of perspectives. A 2D homography between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; would give you the projective transformation between the two images! It is a 3x3 matrix that describes that perspective mapping. Entire books are written on these concepts, but hopefully we now have the general idea to continue.&lt;/p&gt;

&lt;h3 id=&quot;note&quot;&gt;&lt;em&gt;NOTE&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;There are some constraints about a homography I have not mentioned such as:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Usually both images are taken from the same camera.&lt;/li&gt;
  &lt;li&gt;Both images should be viewing the same plane.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;h3 id=&quot;where-is-homography-used&quot;&gt;&lt;em&gt;Where is homography used?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;To name a few, the homography is an essential part of the following&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Creating panoramas&lt;/li&gt;
  &lt;li&gt;Monocular SLAM&lt;/li&gt;
  &lt;li&gt;3D image reconstruction&lt;/li&gt;
  &lt;li&gt;Camera calibration&lt;/li&gt;
  &lt;li&gt;Augmented Reality&lt;/li&gt;
  &lt;li&gt;Autonomous Cars&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;how-do-you-get-the-homography&quot;&gt;&lt;em&gt;How do you get the homography?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;In traditional computer vision, the homography estimation process is done in two stages.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Corner estimation&lt;/li&gt;
  &lt;li&gt;Homography estimation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Due to the nature of this problem, these pipelines only produce estimates. To make the estimates more robust, practitioners go as far as manually engineering corner-ish features, line-ish features, etc., as mentioned in the paper.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness into the homography estimation step shows up as heavy use of RANSAC or robustification of the squared loss function. - &lt;a href=&quot;https://arxiv.org/abs/1606.03798&quot;&gt;Deep Image Homography Estimation&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is a very hard problem that is error-prone and requires heavy compute to get any sort of robustness. Here we are, finally ready to talk about the question this paper wants to answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Is there a single robust algorithm that, given a pair of images, simply returns the homography relating the pair?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;homographynet-the-model&quot;&gt;HomographyNet: The Model&lt;/h2&gt;

&lt;p&gt;HomographyNet is a VGG-style CNN that produces the homography relating two images. The model doesn’t require a two-stage process and all the parameters are trained in an end-to-end fashion!&lt;br /&gt;
&lt;img src=&quot;/public/img/hn/homographynet.png&quot; alt=&quot;alt HomographyNet Model&quot; /&gt;&lt;/p&gt;

&lt;p&gt;HomographyNet as described in the paper comes in two flavors: classification and regression. Each version has its own pros and cons, including a different loss function.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/hn/twoheads.png&quot; alt=&quot;alt Classification HomographyNet vs Regression HomographyNet.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The regression network produces eight real-valued numbers and uses the Euclidean (L2) loss as the final layer. The classification network uses a quantization scheme with a softmax as the final layer. Since they stick a continuous range into finite bins, they end up with quantization error; in exchange, the classification network also produces a confidence for each of the corners it predicts. The paper uses 21 quantization bins for each of the eight output dimensions, which results in a final layer of 168 output neurons. Each corner’s 2D grid of scores is interpreted as a distribution.&lt;/p&gt;
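
&lt;p&gt;To make that concrete, the two heads boil down to something like this (a minimal Keras sketch of my reading of the paper; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;features&lt;/code&gt; stands in for the flattened output of the VGG-style backbone):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.layers import Dense, Reshape, Activation

# Regression head: 8 real-valued corner offsets, trained with an L2 loss
regression_out = Dense(8, name=&apos;delta_corners&apos;)(features)

# Classification head: 21 quantization bins per offset, 8 x 21 = 168 logits in total
bins = Dense(8 * 21, name=&apos;bin_logits&apos;)(features)
classification_out = Activation(&apos;softmax&apos;)(Reshape((8, 21))(bins))   # one distribution per offset
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;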

&lt;p&gt;&lt;img src=&quot;/public/img/hn/classificationconfidence.png&quot; alt=&quot;alt Corner Confidences Measure&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The architecture seems simple enough, but how is the model trained? Where is the labeled dataset coming from? Glad you asked!&lt;/p&gt;

&lt;p&gt;As the paper states and I agree, the simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale. In the 3x3 homography matrix, the top-left 2x2 submatrix [H11, H12; H21, H22] is responsible for the rotational terms and [H13, H23] handles the translational offset. This way you can map each pixel at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u,v,1]&lt;/code&gt; from the image against the homography, like the figure below, to get the new projected position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u&apos;,v&apos;,1]&lt;/code&gt;.&lt;br /&gt;
&lt;img src=&quot;/public/img/hn/homographymatrix.png&quot; alt=&quot;alt HomographyNet 3x3 matrix&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However! If you try to unroll the 3x3 matrix and use it as the label (ground truth), you’d end up mixing the rotational and translational components. Creating a loss function that balances the mixed components would have been an unnecessary hurdle. Hence, it was better to use well known loss functions (L2 and Cross-Entropy) and instead figure out a way to &lt;em&gt;re-parameterize&lt;/em&gt; the homography matrix!&lt;/p&gt;

&lt;h3 id=&quot;the-4-point-homography-parameterization&quot;&gt;The 4-Point Homography Parameterization&lt;/h3&gt;

&lt;p&gt;The 4-Point parameterization is based on corner locations, removing the need to store rotational and translational terms in the label! This is not a new method by any means, but the use of it as a label was clever! Here is how it works.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/hn/4point.png&quot; alt=&quot;alt 4-Point Homography Parameterization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Earlier we saw how each pixel &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u,v,1]&lt;/code&gt; was transformed by the 3x3 matrix to produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u&apos;,v&apos;,1]&lt;/code&gt;.
Well if you have at least four pixels where you calculate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_u = u&apos; - u&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_v = v&apos; - v&lt;/code&gt;
then it is possible to reconstruct the 3x3 homography matrix that was used! You could for example use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPerspectiveTransform()&lt;/code&gt; method in OpenCV. From here on out we will call these four pixels (points) as corners.&lt;/p&gt;
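
&lt;p&gt;In code, that reconstruction is a one-liner (a minimal sketch; it assumes a 128x128 patch and a (4, 2) array of predicted offsets):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2
import numpy as np

# deltas: (4, 2) array of predicted (delta_u, delta_v) offsets for a 128x128 patch
corners = np.float32([[0, 0], [127, 0], [127, 127], [0, 127]])
perturbed = corners + deltas.astype(np.float32)
H_AB = cv2.getPerspectiveTransform(corners, perturbed)   # back to the 3x3 homography
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;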

&lt;h3 id=&quot;training-data-generation&quot;&gt;Training Data Generation&lt;/h3&gt;

&lt;p&gt;Now that we have a way to represent the homography such that we can use well known loss functions, we can start talking about how the training data is generated.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/public/img/hn/datagen.png&quot; alt=&quot;Training Data Generation&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As a sidenote, &lt;em&gt;H&lt;sup&gt;AB&lt;/sup&gt; denotes the homography matrix between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Creating the dataset was pretty straightforward, so I’ll only highlight some of the things that might get tricky. The steps, as described in the figure above, are (a minimal code sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Randomly crop a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128&lt;/code&gt; patch at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; from the grayscale image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt; and call it patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;. &lt;em&gt;Staying away from the edges!&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Randomly perturb the four corners of patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; within the range &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[-rho, rho]&lt;/code&gt; and let’s call this new position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&apos;&lt;/code&gt;. &lt;em&gt;The paper used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rho=32&lt;/code&gt;&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Compute the homography of patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; using position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&apos;&lt;/code&gt; and we’ll call this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;Take the inverse of the homography (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;)&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-1&lt;/code&gt;&lt;/sup&gt; which equals &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BA&lt;/code&gt;&lt;/sup&gt; and apply that to image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt;, calling this new image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&apos;&lt;/code&gt;. Crop a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128&lt;/code&gt; patch at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; from image &lt;strong&gt;&lt;em&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&apos;&lt;/code&gt;&lt;/em&gt;&lt;/strong&gt; and call it patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Finally, take patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; and stack them channel-wise. This will give you a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128x2&lt;/code&gt; image that you’ll use as input to the models! The label would be the 4-point parameterization of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ol&gt;
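
&lt;p&gt;Here is that five-step process as a minimal sketch (rho and the patch size come from the paper; the variable names are mine):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2
import numpy as np

rho, size = 32, 128
h, w = image.shape[:2]                                   # `image` is a grayscale MS-COCO image
x = np.random.randint(rho, w - size - rho)               # step 1: crop away from the edges
y = np.random.randint(rho, h - size - rho)
corners = np.float32([[x, y], [x + size, y], [x + size, y + size], [x, y + size]])
patch_A = image[y:y + size, x:x + size]

perturbed = corners + np.random.uniform(-rho, rho, (4, 2)).astype(np.float32)   # step 2
H_AB = cv2.getPerspectiveTransform(corners, perturbed)                          # step 3
warped = cv2.warpPerspective(image, np.linalg.inv(H_AB), (w, h))                # step 4: apply H_BA to I
patch_B = warped[y:y + size, x:x + size]

model_input = np.dstack([patch_A, patch_B])              # step 5: the 128x128x2 network input
label = (perturbed - corners).flatten()                  # the 4-point parameterization of H_AB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;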

&lt;p&gt;They cleverly used this five-step process on random images from the &lt;em&gt;MS-COCO&lt;/em&gt; dataset to create 500,000 training examples. Pretty damn cool (excuse my English).&lt;/p&gt;

&lt;h2 id=&quot;the-code&quot;&gt;The Code&lt;/h2&gt;

&lt;p&gt;If you’d like to see the regression network coded up in Keras &lt;em&gt;and&lt;/em&gt; the data generation process visualized, you’re in luck! Follow the links below to my Github repo.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/deep_homography_estimation/blob/master/HomograpyNET.ipynb&quot;&gt;HomographyNet Regression variant&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/deep_homography_estimation/blob/master/Dataset_Generation_Visualization.ipynb&quot;&gt;Dataset Generation Visualization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;That pretty much highlights the major parts of the paper. I am hoping you now have an idea of what the paper was about and learned something new! Go read the paper because I didn’t talk about the results etc.&lt;/p&gt;

&lt;p&gt;I’d like to think an easy future improvement would be to swap out the heavy VGG network for SqueezeNet, giving you the benefit of a smaller network. I’ll maybe experiment with this idea and see if I can match or improve on their results. As for possible uses today, I could see this network being used as a cascade classifier. The traditional methods are very compute-heavy, so if we could put this network in front to filter out the easy wins, we could cut down on compute cost.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next paper and please comment with any corrections or thoughts. That is all folks!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>SqueezeDet: Deep Learning for Object Detection</title>
   <link href="http://mez.sh/2017/04/21/squeezedet-deep-learning-for-object-detection/"/>
   <updated>2017-04-21T00:00:00+00:00</updated>
   <id>http://mez.sh/2017/04/21/squeezedet-deep-learning-for-object-detection</id>
   <content type="html">&lt;h3 id=&quot;why-bother-writing-this-post&quot;&gt;&lt;em&gt;Why bother writing this post?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Often, the examples you see around computer vision and deep learning are about classification. That class of problems asks: what do you see in the image? Object detection is another class of problems that asks: where in the image do you see it?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Classification answers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;what&lt;/code&gt; and Object Detection answers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Object detection has been making great advances in recent years. The &lt;a href=&quot;https://en.wikipedia.org/wiki/%22Hello,_World!%22_program&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;hello world&lt;/em&gt;&lt;/a&gt; of object detection would be using &lt;a href=&quot;https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients&quot; target=&quot;_blank&quot;&gt;HOG&lt;/a&gt; features combined with a classifier like &lt;a href=&quot;https://en.wikipedia.org/wiki/Support_vector_machine&quot; target=&quot;_blank&quot;&gt;SVM&lt;/a&gt; and using sliding windows to make predictions at different patches of the image. This complex pipeline has major drawbacks!&lt;/p&gt;

&lt;h3 id=&quot;cons&quot;&gt;Cons:&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Computationally expensive.&lt;/li&gt;
  &lt;li&gt;Multiple step pipeline.&lt;/li&gt;
  &lt;li&gt;Requires feature engineering.&lt;/li&gt;
  &lt;li&gt;Each step in the pipeline has parameters that need to be tuned individually, but can only be tested together, resulting in a complex trial-and-error process that is not unified.&lt;/li&gt;
  &lt;li&gt;Not realtime.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;pros&quot;&gt;Pros:&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Easy to implement, relatively speaking…&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Speed becomes a major concern when we are thinking of running these models on the edge (IoT, mobile, cars). For example, a car needs to detect where other cars, people, and bikes are, to name a few; I could go on… puppies, kittens… you get the idea. The major motivation for me is the need for speed given the constraints that edge devices have; we need compact models that can make quick predictions and are energy efficient.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-squeezedet-model&quot;&gt;The SqueezeDet Model&lt;/h2&gt;
&lt;p&gt;The latest in object detection is to use a convolutional neural network (CNN) that outputs a regression to predict the bounding boxes. This post is about &lt;a href=&quot;https://arxiv.org/abs/1612.01051&quot; target=&quot;_blank&quot;&gt;SqueezeDet&lt;/a&gt;. I got interested because they used one of my favorite CNNs, SqueezeNet! You can read my last post on &lt;a href=&quot;https://mez.github.io/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator/&quot; target=&quot;_blank&quot;&gt;SqueezeNet&lt;/a&gt; if you haven’t yet. To be fair, SqueezeDet is pretty much just the &lt;a href=&quot;https://pjreddie.com/media/files/papers/yolo.pdf&quot; target=&quot;_blank&quot;&gt;YOLO&lt;/a&gt; model that uses a SqueezeNet.&lt;/p&gt;

&lt;h3 id=&quot;highlevel-squeezedet&quot;&gt;Highlevel SqueezeDet&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/sd/squeezedet.png&quot; alt=&quot;alt SqueezeDet Model&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Inspired by YOLO, SqueezeDet is a single-stage detection pipeline that does region proposal and classification in one single network. The CNN first extracts feature maps from the input image and feeds them to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConvDet&lt;/code&gt; layer. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConvDet&lt;/code&gt; takes the feature maps, overlays them with a WxH grid, and at each cell evaluates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; pre-defined bounding boxes called anchors. Each bounding box has the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Four scalars (x, y, w, h)&lt;/li&gt;
  &lt;li&gt;A confidence score ( Pr(Object)xIOU )&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; conditional classes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hence SqueezeDet has a fixed output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WxHxK(4+1+C)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final step is to use non max suppression aka &lt;a href=&quot;http://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/&quot; target=&quot;_blank&quot;&gt;NMS&lt;/a&gt; to filter the bounding boxes to make the final predictions.&lt;/p&gt;
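
&lt;p&gt;If you have never implemented NMS, it is worth seeing how little code it takes. A minimal greedy sketch in plain NumPy (not the repo’s implementation):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    # boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) confidence scores
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou &amp;lt;= iou_thresh]   # drop boxes that overlap the kept box too much
    return keep
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;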

&lt;p&gt;&lt;img src=&quot;/public/img/sd/kanchors.png&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The network regresses and learns how to transform the highest-probability bounding boxes into the predictions. Since bounding boxes are generated at each cell of the grid, the top &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; bounding boxes sorted by the confidence score are kept as the predictions.&lt;/p&gt;

&lt;h3 id=&quot;the-composite-loss-function&quot;&gt;The Composite Loss function&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/public/img/sd/squeezedet_loss.png&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The figure above is the four part loss function that makes this entire model possible. Don’t get intimidated by it; let’s take it apart and see how it fits together. Each loss function is described below and highlighted:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The yellow that bleeds into the blue loss function is actually supposed to be blue, sorry!&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Yellow: Regression of the scalars for the anchors&lt;/li&gt;
  &lt;li&gt;Green: The confidence score regression which uses &lt;a href=&quot;http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/&quot; target=&quot;_blank&quot;&gt;IOU&lt;/a&gt; of ground and predicted bounding boxes.&lt;/li&gt;
  &lt;li&gt;Blue: Penalize anchors that are not responsible for detection by dropping their confidence score.&lt;/li&gt;
  &lt;li&gt;Pink: The cross entropy.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;using-squeezedet&quot;&gt;Using SqueezeDet&lt;/h2&gt;

&lt;p&gt;The authors of the paper did implement the model via TensorFlow!
Go check it out on github &lt;a href=&quot;https://github.com/BichenWuUCB/squeezeDet&quot; target=&quot;_blank&quot;&gt;https://github.com/BichenWuUCB/squeezeDet&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;thresholding-squeezedet&quot;&gt;Thresholding SqueezeDet&lt;/h3&gt;

&lt;p&gt;You have to tweak how confident or doubtful you want the model to be; the predictions are centered around &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; bounding boxes at each cell. So we have to use the top &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; bounding boxes, sorted by confidence score, and then you can do additional thresholding on the class conditional probability score.&lt;/p&gt;
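
&lt;p&gt;In other words, the filtering is a two-stage affair, roughly like this (a sketch with illustrative names and thresholds, not the repo’s config):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# confidences: (num_boxes,) objectness scores; class_probs: (num_boxes, C) conditional class scores
top_n = np.argsort(confidences)[::-1][:64]                  # keep the top N = 64 boxes
keep = [i for i in top_n if class_probs[i].max() &amp;gt; 0.5]   # then threshold the class probability
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;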

&lt;p&gt;Here is an example of a recent project I did where I tweak the params:&lt;/p&gt;

&lt;p&gt;Like the paper:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N = 64&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The image below shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mc.PLOT_PROB_THRESH      = 0.1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/sd/without_thres_test5.jpg&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The image below shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mc.PLOT_PROB_THRESH      = 0.5&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;&lt;img src=&quot;/public/img/sd/out_test5.jpg&quot; alt=&quot;alt text&quot; /&gt;&lt;/h2&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Reading through the paper was a real grind. Some of the math notation was a bit wonky, and the paper leans heavily on prior work, so reading it became a recursive process. I literally had to crunch through the entire history of object detection to understand this paper. At the very least I hope you were able to get a high-level understanding of it. Comment below with any corrections or questions!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>MainSqueeze: The 52 parameter model that drives in the Udacity simulator</title>
   <link href="http://mez.sh/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator/"/>
   <updated>2017-02-14T00:00:00+00:00</updated>
   <id>http://mez.sh/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;What a time to be alive! The year is 2017, Donald Trump is president of the United States of America and autonomous vehicles are all the rage. The field is still in its infancy, and the race for the winning solution to dominate the mass production of autonomous vehicles is ongoing. The two main factions currently are the robotics approach and the end-to-end neural networks approach. Like the four seasons, the AI winter has come and gone. It’s Spring and this is the story of one man’s attempt to explore the pros and cons of the end-to-end neural networks faction in a controlled environment. The hope is to draw some conclusions that will help the greater community advance as a whole.&lt;/p&gt;

&lt;h2 id=&quot;the-controlled-environment&quot;&gt;The Controlled Environment&lt;/h2&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/sim_image.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/udacity/self-driving-car-sim&quot;&gt;Udacity Simulator&lt;/a&gt; which is open sourced will be our controlled environment for this journey. It has two modes, training and autonomous mode. Training mode is for the human to drive and record/collect the driving. The result would be a directory of images from three cameras (left, center, right) and a driver log CSV file that records the image along with steering angle, speed etc. Autonomous mode requires a model that can send the simulator steering angle predictions.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/driver_log.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;The goals of this project are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Use the simulator to collect data of good driving behavior.&lt;/li&gt;
  &lt;li&gt;Construct a convolution neural network in &lt;a href=&quot;https://keras.io/&quot;&gt;Keras&lt;/a&gt; that predicts steering angles from images.&lt;/li&gt;
  &lt;li&gt;Train and validate the model with a training and validation set.&lt;/li&gt;
  &lt;li&gt;Test that the model successfully drives around track one without leaving the road!&lt;/li&gt;
  &lt;li&gt;Draw conclusions for future work.
&lt;!-- more --&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;data-exploration&quot;&gt;Data Exploration&lt;/h2&gt;

&lt;p&gt;If you think the Lewis and Clark expedition was tough, try exploring an unknown dataset. I am being dramatic. When you are exploring data, you want to keep thinking: what is the least amount of data you can sample that will still represent your problem (population, if you like stats)? For this problem we will be using the provided Udacity dataset. Let’s dive in!&lt;/p&gt;

&lt;p&gt;Our goal is steering angle prediction, so let’s take a look at what the dataset shows us! The plot below shows a few takeaways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Range is [-1,1]&lt;/li&gt;
  &lt;li&gt;Clustering around [-0.5, 0.5]&lt;/li&gt;
&lt;/ol&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data1.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;Let’s take a look at another angle, pun intended, of the steering data.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;This histogram is the makings of a sampling nightmare; this is how alternate facts are created! Training on such unbalanced data would leave our model very biased, so we will clean this up.&lt;/p&gt;

&lt;h2 id=&quot;clean-up-steps&quot;&gt;Clean up steps:&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Downsample the over represented examples&lt;/li&gt;
  &lt;li&gt;Upsample the under represented examples&lt;/li&gt;
  &lt;li&gt;Expose more varied examples to try and represent a uniform distribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;1-downsample&quot;&gt;1. Downsample&lt;/h3&gt;

&lt;p&gt;Steering angle zero is over-represented, so drop 90% of those examples. Easy!&lt;/p&gt;
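
&lt;p&gt;Roughly (a minimal sketch; the column names follow the simulator’s driving_log.csv):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd

log = pd.read_csv(&apos;driving_log.csv&apos;)
zero = log[log[&apos;steering&apos;] == 0.0]
nonzero = log[log[&apos;steering&apos;] != 0.0]
log = pd.concat([nonzero, zero.sample(frac=0.1, random_state=0)])   # keep only 10% of zero angles
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;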

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data3v2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;2-upsample&quot;&gt;2. Upsample&lt;/h3&gt;

&lt;p&gt;Time to augment so we can start upsampling under-represented examples. We start by flipping 40% of the examples that do not have a zero steering angle. So far so good!&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data4.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;3-expose-more-varied-examples&quot;&gt;3. Expose more varied examples&lt;/h3&gt;

&lt;p&gt;Variety is the spice of life; this is also true for training a well-generalized model. For this problem we will introduce more steering angles by shifting the examples horizontally and adding or subtracting the appropriate angles corresponding to the shift we performed. There was no magic shift amount; you get this by experimenting. The result is below:&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data5.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;Great, our dataset/sample is looking better. Next, let’s trim this to look more uniform. This was accomplished by the steps below (a minimal code sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Grabbing the bin ranges of the 100 bins.&lt;/li&gt;
  &lt;li&gt;Finding bins that have more than 400 examples.&lt;/li&gt;
  &lt;li&gt;Randomly sample and drop examples so we have no more than 400 examples.&lt;/li&gt;
&lt;/ol&gt;
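
&lt;p&gt;Here is that trimming step as a minimal sketch (the 100 bins and the 400-example cap come from the list above; the code itself is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# angles: the array of steering angles after the previous two steps
counts, edges = np.histogram(angles, bins=100)
bin_ids = np.digitize(angles, edges[1:-1])          # assign each angle to one of the 100 bins
keep = []
for b in range(100):
    in_bin = np.where(bin_ids == b)[0]
    if len(in_bin) &amp;gt; 400:
        in_bin = np.random.choice(in_bin, 400, replace=False)   # cap the bin at 400 examples
    keep.extend(in_bin.tolist())
angles = angles[keep]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;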

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data6v2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;How did I know 400 examples was enough? I simply kept reducing the number until I reached the point where my model could still produce stable results. Below 400, my model started getting unstable results.&lt;/p&gt;

&lt;h2 id=&quot;model-architecture-and-training-strategy&quot;&gt;Model Architecture and Training Strategy&lt;/h2&gt;

&lt;h4 id=&quot;main-goals-for-us-to-think-about-while-creating-a-model&quot;&gt;Main goals for us to think about while creating a model&lt;/h4&gt;
&lt;ol&gt;
  &lt;li&gt;Is it efficient for the task at hand?&lt;/li&gt;
  &lt;li&gt;If this were to be deployed onto hardware in a car, what would the power consumption and usability be like?
    &lt;ul&gt;
      &lt;li&gt;There is an interesting paper called &lt;a href=&quot;https://arxiv.org/abs/1510.00149&quot;&gt;Deep Compression&lt;/a&gt;, which we don’t implement here, but it is food for thought and shows that a model can be tiny and still work!&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started with a modified comma.ai model and I had a successful model, but it needed 300,000 trainable parameters. Reading the Deep Compression paper and a blog post titled &lt;a href=&quot;https://medium.com/@xslittlegrass/self-driving-car-in-a-simulator-with-a-tiny-neural-network-13d33b871234#.x1kdv5hgt&quot;&gt;Self-driving car in a simulator with a tiny neural network&lt;/a&gt;, I quickly realized that we could do better. The linked blog post shows a 63-parameter model that uses tiny images and a small network to get a stable model. I wanted to experiment and see if I could make it smaller and still stable. The solution developed over a number of experiments is a modified &lt;a href=&quot;https://arxiv.org/abs/1602.07360&quot;&gt;SqueezeNet&lt;/a&gt; implementation. With a SqueezeNet you get three additional hyperparameters that are used to generate the fire module:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;S1x1: Number of 1x1 kernels to use in the squeeze layer within the fire module&lt;/li&gt;
  &lt;li&gt;E1x1: Number of 1x1 kernels to use in the expand layer within the fire module&lt;/li&gt;
  &lt;li&gt;E3x3: Number of 3x3 kernels to use in the expand layer within the fire module&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;fire-module-zoomed-in&quot;&gt;Fire Module Zoomed In&lt;/h4&gt;

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/public/img/bc/fire_module.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
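
&lt;p&gt;In Keras, a fire module built from those three hyperparameters looks roughly like this (a minimal sketch, not the exact implementation from my repo):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.layers import Conv2D, concatenate

def fire_module(x, s1x1, e1x1, e3x3):
    squeeze = Conv2D(s1x1, (1, 1), activation=&apos;relu&apos;, padding=&apos;same&apos;)(x)
    expand_1x1 = Conv2D(e1x1, (1, 1), activation=&apos;relu&apos;, padding=&apos;same&apos;)(squeeze)
    expand_3x3 = Conv2D(e3x3, (3, 3), activation=&apos;relu&apos;, padding=&apos;same&apos;)(squeeze)
    return concatenate([expand_1x1, expand_3x3])   # channel-wise concat of the expand outputs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;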

&lt;p&gt;The fire module is the workhorse of SqueezeNet. SqueezeNet as described in the paper has around 700k trainable parameters. I went through a process where I kept reducing the number of parameters while all other variables were kept constant. You get the theme here? We are at the frontier and this calls for empirical testing. Through some experiments I went from 10k parameters to 1005, then 329, then 159, then &lt;strong&gt;63&lt;/strong&gt; and finally &lt;strong&gt;52&lt;/strong&gt;!!&lt;/p&gt;

&lt;h2 id=&quot;the-final-52-parameter-squeezenet-variant-model&quot;&gt;The final 52 parameter squeezenet variant Model!!&lt;/h2&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/SqueezeNet52.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;This model combats overfitting by being super tiny and for kicks I added a small dropout layer. The model works on both tracks and has a six second epoch on my late 2012 Macbook air!! From the comma.ai model, it was evident that a validation loss of around 0.03 on 30% of &lt;strong&gt;this&lt;/strong&gt; dataset results in a stable model that can handle the track at a throttle of around 0.2, which is a speed of around 20mph in the simulator. So, I didn’t bother worrying about the epoch hyperparameter. I simply created a custom early termination Keras callback that stopped the training when we hit our requirement.&lt;/p&gt;
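
&lt;p&gt;The callback itself is only a few lines; something along these lines (a sketch of the idea, not the exact callback from my repo):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.callbacks import Callback

class StopAtValLoss(Callback):
    def __init__(self, target=0.03):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get(&apos;val_loss&apos;, float(&apos;inf&apos;)) &amp;lt;= self.target:
            self.model.stop_training = True   # stop once the validation loss hits the requirement
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;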

&lt;blockquote&gt;
  &lt;p&gt;One good rule of thumb I developed from this project is &lt;strong&gt;to try and reduce the number of variables you are tuning to gain better results faster.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;training-strategy&quot;&gt;Training Strategy&lt;/h2&gt;

&lt;p&gt;To get the model to drive in the simulator, you need:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Show the model how to drive straight&lt;/li&gt;
  &lt;li&gt;How to recover if it drifts off track&lt;/li&gt;
  &lt;li&gt;How to handle turns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Udacity slack community was a huge help here; from their experience, I used the left and right camera images and adjusted the steering (+.25 for left, -.25 for right) angles to show the model how to correct steering back to center. Then I used the horizontal shifting to capture more angles. This was enough to get a stable model working on the fastest setting (means lowest resolution) on both tracks.&lt;/p&gt;

&lt;p&gt;Reducing input image size was the next challenge. We do this by first cropping off the top and bottom of the image, which would be noise to the model. Then we resize the image to (64,64), convert to HSV, and keep only the S channel. This takes us from (160,320,3) to (64, 64, 1)!!&lt;/p&gt;

&lt;h3 id=&quot;before-cropping-and-resizing&quot;&gt;Before Cropping and Resizing.&lt;/h3&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/crop1.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;after-cropping-and-resizing&quot;&gt;After Cropping and Resizing.&lt;/h3&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/crop2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
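
&lt;p&gt;Put together, the preprocessing is just a few lines (a minimal sketch; the exact crop rows here are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2

def preprocess(img):                          # img: a (160, 320, 3) frame read with cv2.imread (BGR)
    img = img[60:140, :, :]                   # crop the sky and the hood
    img = cv2.resize(img, (64, 64))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    return hsv[:, :, 1].reshape(64, 64, 1)    # keep only the S channel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;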

&lt;h2 id=&quot;hyperparameters&quot;&gt;Hyperparameters&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Learning rate: 1e-1 (very aggressive!)&lt;/li&gt;
  &lt;li&gt;Batch size: 128 (tried 64, 128, 256, 1024)&lt;/li&gt;
  &lt;li&gt;Adam optimizer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used Keras to do a validation split on 30% of the data and my custom early termination to stop the training when we reach a validation loss of around 0.03!&lt;/p&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/loss_plot.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Note: this loss plot is from a previous run without early termination&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This model was tiny (52 params!) and with only ~20k images I was able to train it on a 2012 Macbook air. An epoch was about six seconds. The memory requirements were small, so I just loaded the entire dataset!&lt;/p&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/callback.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;img src=&quot;/public/img/bc/train.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;pros&quot;&gt;Pros&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Can train without a GPU.&lt;/li&gt;
  &lt;li&gt;Enough to pass the challenge.&lt;/li&gt;
  &lt;li&gt;Smaller model meant I could experiment with more variables and different models to gain better intuition.&lt;/li&gt;
  &lt;li&gt;Learned that our current method of training with back prop is not efficient and that we can achieve a lot with a smaller network.&lt;/li&gt;
  &lt;li&gt;Get to use aggressive learning rate for faster convergence :D&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;cons&quot;&gt;Cons&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Can not handle highest resolution setting.&lt;/li&gt;
  &lt;li&gt;Can not go over .22 throttle and still be stable.&lt;/li&gt;
  &lt;li&gt;As the network increases in size, it becomes hard to understand why decisions are being made. I can see this being a &lt;strong&gt;huge&lt;/strong&gt; problem for legal reasons. The end-to-end neural network faction is not looking so good here!&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
  &lt;p&gt;This brings me to the other rule of thumb: &lt;strong&gt;every problem will have tradeoffs&lt;/strong&gt;. The question becomes: what are the tradeoffs you are willing to make? This will depend on the business case you are solving!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;the-github-repo-link&quot;&gt;The github repo link&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/carnd/blob/master/P3_behavioral_cloning/&quot;&gt;P3_behavioral_cloning&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1602.07360&quot;&gt;SqueezeNet&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1510.00149&quot;&gt;Deep Compression&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://chatbotslife.com/learning-human-driving-behavior-using-nvidias-neural-network-model-and-image-augmentation-80399360efee#.zbkgfcnoz&quot;&gt;Vivek Yadav’s Learning human driving behavior using NVIDIA’s neural network model and image augmentation.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/@xslittlegrass/self-driving-car-in-a-simulator-with-a-tiny-neural-network-13d33b871234#.x1kdv5hgt&quot;&gt;xslittlegrass’ Self-driving car in a simulator with a tiny neural network&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/commaai/research/blob/master/train_steering_model.py&quot;&gt;Comma.ai steering model&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</content>
 </entry>
 

</feed>
