<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Mez Gebre</title>
 <link href="http://mez.sh/atom.xml" rel="self"/>
 <link href="http://mez.sh/"/>
 <updated>2026-04-19T00:18:23+00:00</updated>
 <id>http://mez.sh/</id>
 <author>
   <name>Mez Gebre</name>
 </author>

 
 <entry>
   <title>Audio Classification: A Weekend of Experiments (FAAAAHH!)</title>
   <link href="http://mez.sh/2026/02/17/audio-classification-ssl-deep-dive/"/>
   <updated>2026-02-17T00:00:00+00:00</updated>
   <id>http://mez.sh/2026/02/17/audio-classification-ssl-deep-dive</id>
   <content type="html">&lt;p&gt;I went down an audio classification rabbit hole. What started as “how hard can ESC-50 be?” turned into a weekend of building CNNs, Transformers, and finally catching up on self-supervised learning, only a few years behind everyone else.&lt;/p&gt;

&lt;p&gt;This post documents everything I tried, what worked, what didn’t, and the lessons learned.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-dataset-esc-50&quot;&gt;The Dataset: ESC-50&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/karolpiczak/ESC-50&quot;&gt;ESC-50 dataset&lt;/a&gt; contains:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;2,000 audio clips&lt;/strong&gt; (5 seconds each)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;50 classes&lt;/strong&gt; of environmental sounds&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;5 folds&lt;/strong&gt; for cross-validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With only 40 samples per class, this is a challenging dataset that rewards good representations over brute-force memorization.&lt;/p&gt;
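
&lt;p&gt;For reference, here’s roughly how a fold-based split looks (a minimal sketch; it assumes the repo’s meta/esc50.csv layout):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd

# Hold out one fold for validation (assumes the ESC-50 repo&apos;s meta/esc50.csv layout)
meta = pd.read_csv(&apos;ESC-50/meta/esc50.csv&apos;)
val_fold = 1
train_df = meta[meta[&apos;fold&apos;] != val_fold]
val_df = meta[meta[&apos;fold&apos;] == val_fold]   # 400 clips held out, 8 per class
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;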

&lt;hr /&gt;

&lt;h2 id=&quot;the-supervised-baseline&quot;&gt;The Supervised Baseline&lt;/h2&gt;

&lt;h3 id=&quot;resnet-34-from-scratch&quot;&gt;ResNet-34 from Scratch&lt;/h3&gt;

&lt;p&gt;Classic approach: convert audio to mel spectrograms and treat them as images for a CNN.&lt;/p&gt;
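
&lt;p&gt;The preprocessing is roughly this (a minimal torchaudio sketch; the exact parameter values here are illustrative, not necessarily what I settled on):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torchaudio

# Waveform in, log-mel &quot;image&quot; out (parameter values are illustrative)
waveform, sr = torchaudio.load(&apos;clip.wav&apos;)   # ESC-50 clips are 5 seconds long
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=512, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
spec = to_db(to_mel(waveform))               # (1, n_mels, time), treated as a one-channel image
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;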

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;optimizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdamW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight_decay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;scheduler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OneCycleLR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 80.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;test-time-augmentation&quot;&gt;Test-Time Augmentation&lt;/h3&gt;

&lt;p&gt;I noticed validation performance was inconsistent, so I added TTA, averaging predictions across multiple augmented versions of each test sample.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;predict_with_tta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spectrogram&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_augments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_augments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;aug_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply_augmentation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spectrogram&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aug_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 83.5% accuracy&lt;/strong&gt; (+3% from TTA alone!)&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;transfer-learning-detour&quot;&gt;Transfer Learning Detour&lt;/h2&gt;

&lt;p&gt;Before diving into SSL, I tried transfer learning with ImageNet-pretrained EfficientNet-B0.&lt;/p&gt;
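
&lt;p&gt;The setup is the usual transfer-learning recipe: swap the first conv to accept a single channel and replace the classifier head (a minimal sketch assuming the timm library):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import timm

# in_chans=1 rebuilds the first conv for single-channel spectrograms; num_classes=50 swaps the head
model = timm.create_model(&apos;efficientnet_b0&apos;, pretrained=True, in_chans=1, num_classes=50)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;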

&lt;p&gt;&lt;strong&gt;Result: 81.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This performed &lt;em&gt;worse&lt;/em&gt; than my from-scratch ResNet. ImageNet features (edges, textures, objects) don’t transfer perfectly to spectrograms. Good to know.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;self-supervised-learning-experiments&quot;&gt;Self-Supervised Learning Experiments&lt;/h2&gt;

&lt;p&gt;This is where things got interesting. I wanted to understand &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; SSL works, not just use pretrained models.&lt;/p&gt;

&lt;h3 id=&quot;simclr-contrastive-learning&quot;&gt;SimCLR (Contrastive Learning)&lt;/h3&gt;

&lt;p&gt;The idea: learn representations by pulling augmented views of the same audio together while pushing different audios apart.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Audio → Mel Spectrogram → ResNet Encoder → Projection Head → Contrastive Loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key components:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;NT-Xent Loss:&lt;/strong&gt; Normalized Temperature-scaled Cross Entropy&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Projection Head:&lt;/strong&gt; 512 → 512 → 128 (discarded after pretraining)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Temperature:&lt;/strong&gt; 0.5&lt;/li&gt;
&lt;/ul&gt;
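
&lt;p&gt;The NT-Xent loss is short enough to show in full. A minimal sketch of the standard formulation (not my exact training code):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: (N, D) projections of two augmented views of the same N clips
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float(&apos;-inf&apos;))                    # ignore self-similarity
    # the positive for row i is its augmented twin: i+N for the first half, i-N for the second
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, targets)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;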

&lt;p&gt;&lt;strong&gt;Result: 82.5% accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But here’s the catch: the contrastive task itself reached &lt;strong&gt;98% accuracy during pretraining&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model was solving a nearly trivial task.&lt;/p&gt;

&lt;p&gt;Environmental sounds in ESC-50 are already highly separable in the spectral domain (dog bark vs chainsaw vs rain, etc.). With relatively mild augmentations, the identity of each sound barely changes. That means the encoder can solve the contrastive objective using coarse cues (energy bands, temporal envelope) instead of learning robust, invariant representations.&lt;/p&gt;

&lt;p&gt;In other words, the model learned:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“these sounds are different”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;instead of:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“these two transformed versions of the same sound are meaningfully the same”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Contrastive learning only works when the task is hard enough. You need augmentations and negatives that force the model to learn invariances, not shortcuts.&lt;/p&gt;

&lt;p&gt;Even with this limitation, the learned representation still reached 82.5% accuracy, which is competitive with standard supervised CNN baselines on ESC-50.&lt;/p&gt;

&lt;h3 id=&quot;masked-prediction-beats-style&quot;&gt;Masked Prediction (BEATs-style)&lt;/h3&gt;

&lt;p&gt;Inspired by BERT and BEATs: hide patches of the spectrogram and predict discrete tokens.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This was a simplified, educational implementation to understand the BEATs framework, not a full reproduction. The real BEATs includes iterative tokenizer refinement, larger models (90M+ params), and training on millions of samples. My goal was to grasp the core concepts: patch embeddings, masking, and discrete token prediction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Spectrogram → Patch Embedding → Encoder → Predict Masked Tokens
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The approach:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Split spectrogram into patches (16×16)&lt;/li&gt;
  &lt;li&gt;Randomly mask 40% of patches&lt;/li&gt;
  &lt;li&gt;Predict the cluster ID of masked patches&lt;/li&gt;
&lt;/ol&gt;
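
&lt;p&gt;Step 2, the random masking, is the only mildly fiddly part. A minimal sketch with illustrative tensor names:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

# patches: (batch, num_patches, dim); mask_token: a learned (dim,) embedding
def mask_patches(patches, mask_token, mask_ratio=0.4):
    num_patches = patches.size(1)
    num_masked = int(mask_ratio * num_patches)
    idx = torch.randperm(num_patches)[:num_masked]   # which patches to hide
    masked = patches.clone()
    masked[:, idx] = mask_token                      # replace them with the [MASK] embedding
    return masked, idx                               # idx tells the loss which tokens to predict
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;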

&lt;p&gt;&lt;strong&gt;The tokenizer (codebook):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;K-means clustering on spectrogram patches&lt;/li&gt;
  &lt;li&gt;Each patch gets assigned to nearest centroid&lt;/li&gt;
  &lt;li&gt;Model predicts which cluster the masked patch belongs to&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Tokenizing a patch: patch is a (D,) tensor, codebook is a (K, D) tensor of k-means centroids
distances = ((patch - codebook) ** 2).sum(dim=-1)   # Squared distance to each centroid
token = distances.argmin()                          # Closest centroid = token ID
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 74.5% accuracy&lt;/strong&gt; (with CNN encoder)&lt;/p&gt;

&lt;p&gt;Underperformed the supervised baseline. Hmm.&lt;/p&gt;

&lt;h3 id=&quot;transformer-beats&quot;&gt;Transformer-BEATs&lt;/h3&gt;

&lt;p&gt;I tried a full Transformer encoder instead of CNN.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;6 Transformer layers&lt;/li&gt;
  &lt;li&gt;256 embedding dimension&lt;/li&gt;
  &lt;li&gt;8 attention heads&lt;/li&gt;
  &lt;li&gt;~5M parameters&lt;/li&gt;
&lt;/ul&gt;
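
&lt;p&gt;Nothing exotic, basically the stock PyTorch encoder (a minimal sketch of the configuration above; the feed-forward width is an assumption):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # roughly 5M parameters at this size
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;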

&lt;p&gt;&lt;strong&gt;Result: 40% accuracy&lt;/strong&gt; (nearly random, FAAAHH!!)&lt;/p&gt;

&lt;p&gt;This was humbling. Transformers need &lt;em&gt;massive&lt;/em&gt; amounts of data. With only 1,600 training samples, it couldn’t learn meaningful patterns. CNNs have stronger inductive biases that help with limited data.&lt;/p&gt;

&lt;hr /&gt;

&lt;div style=&quot;max-width: 400px; margin: 2rem auto;&quot;&gt;
  &lt;div class=&quot;tenor-gif-embed&quot; data-postid=&quot;13987821127347517247&quot; data-share-method=&quot;host&quot; data-aspect-ratio=&quot;1.74074&quot; data-width=&quot;100%&quot;&gt;&lt;a href=&quot;https://tenor.com/view/call-an-ambulance-but-not-for-me-but-not-for-me-call-an-ambulance-gif-13987821127347517247&quot;&gt;Call An Ambulance But Not For Me GIF&lt;/a&gt;from &lt;a href=&quot;https://tenor.com/search/call+an+ambulance+but+not+for+me-gifs&quot;&gt;Call An Ambulance But Not For Me GIFs&lt;/a&gt;&lt;/div&gt;
  &lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://tenor.com/embed.js&quot;&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-beats-revelation&quot;&gt;The BEATs Revelation&lt;/h2&gt;

&lt;p&gt;After all my experiments, I tried Microsoft’s BEATs model, pretrained on AudioSet (2 million+ clips).&lt;/p&gt;

&lt;p&gt;I tested two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frozen encoder:&lt;/strong&gt; Only train a new classifier head on top.&lt;br /&gt;
&lt;strong&gt;Result: 94.50% accuracy&lt;/strong&gt;&lt;/p&gt;
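
&lt;p&gt;“Frozen” here literally means a couple of lines of setup (a minimal sketch; the model layout is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

# Freeze the pretrained encoder, train only the new classifier head
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;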

&lt;p&gt;&lt;strong&gt;Fine-tuned with differential learning rates:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;param_groups&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;params&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encoder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;lr&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Slow for pretrained
&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;params&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;lr&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Fast for new head
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: 95.25% accuracy&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;p&gt;The frozen approach gets you 94.5% with minimal compute. The AudioSet pretraining is &lt;em&gt;that&lt;/em&gt; good. Fine-tuning squeezes out another 0.75%, which matters if you’re chasing leaderboards but honestly? Frozen is probably fine for most use cases.&lt;/p&gt;

&lt;p&gt;Even though I knew I probably wouldn’t make a dent against the pretrained model, I had to try it.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;things-i-had-to-figure-out&quot;&gt;Things I Had to Figure Out&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What are “tokens” in audio?&lt;/strong&gt; In BEATs, tokens are discrete IDs representing audio patterns. K-means clustering groups similar spectrogram patches, and each patch gets a cluster ID. It’s like creating a vocabulary of audio building blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why iterate on the tokenizer?&lt;/strong&gt; Clever trick: the tokenizer and model improve each other. Each iteration creates more meaningful tokens, forcing the model to learn finer distinctions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why two learning rates?&lt;/strong&gt; Differential rates balance preservation and adaptation. Without this, you either destroy pretrained features (high LR) or the classifier never converges (low LR).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;final-results&quot;&gt;Final Results&lt;/h2&gt;

&lt;table style=&quot;width:100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;background: #2a2a2a; color: #fff;&quot;&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: left;&quot;&gt;Method&lt;/th&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;Accuracy&lt;/th&gt;
      &lt;th style=&quot;padding: 12px 15px; text-align: left;&quot;&gt;Approach&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;background: #e8f5e9;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px; font-weight: bold;&quot;&gt;BEATs (fine-tuned)&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center; font-weight: bold;&quot;&gt;95.25%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Pretrained + differential LR&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f1f8e9;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;BEATs (frozen)&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;94.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Pretrained, classifier only&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #fff;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;ResNet-34 + TTA&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;83.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Supervised baseline&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f5f5f5;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;SimCLR&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;82.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Contrastive SSL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #fff;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;EfficientNet-B0&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;81.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;ImageNet transfer&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #f5f5f5;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;CNN-BEATs&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;74.50%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Masked prediction&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;background: #ffebee;&quot;&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Transformer-BEATs&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px; text-align: center;&quot;&gt;40.00%&lt;/td&gt;
      &lt;td style=&quot;padding: 12px 15px;&quot;&gt;Tomfoolery 💀&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The gap between my from-scratch SSL attempts and the pretrained BEATs tells the whole story.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scale matters enormously for SSL.&lt;/strong&gt; My weekend experiments couldn’t match 2M-sample pretraining, but they taught me &lt;em&gt;how&lt;/em&gt; these methods work.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Contrastive learning needs hard negatives.&lt;/strong&gt; Trivial pretext tasks don’t force useful representations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transformers are data-hungry.&lt;/strong&gt; CNNs win on small datasets.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TTA is free accuracy.&lt;/strong&gt; 3% boost for minimal effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I love jumping between domains: contrastive learning from NLP, masked prediction from BERT, spectrogram tricks from signal processing. The dots connect in unexpected ways.&lt;/p&gt;

&lt;p&gt;The real takeaway? Running these experiments taught me more than any tutorial ever could. Sometimes you have to build the thing yourself to understand why it works, and why it doesn’t.&lt;/p&gt;

&lt;p&gt;What surprised me most is that self-supervised learning doesn’t automatically produce meaningful representations. It will happily learn shortcuts if the task allows it. Designing the right objective turns out to matter just as much as the model itself.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I'm Back</title>
   <link href="http://mez.sh/2026/02/16/im-back/"/>
   <updated>2026-02-16T00:00:00+00:00</updated>
   <id>http://mez.sh/2026/02/16/im-back</id>
   <content type="html">&lt;p&gt;It’s been a while. Almost nine years since my last post. Life happened, priorities shifted, and this blog sat quietly collecting dust.&lt;/p&gt;

&lt;p&gt;A lot has changed. I have a family now, including a 13-month-old daughter who’s been teaching me more about persistence and curiosity than any codebase ever could.&lt;/p&gt;

&lt;p&gt;But I’ve missed having a place to think out loud. So here I am, back at it.&lt;/p&gt;

&lt;p&gt;No grand promises, just an intention: write more, share ideas, and see where it goes.&lt;/p&gt;

&lt;p&gt;Let’s ship.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>HomographyNet: Deep Image Homography Estimation</title>
   <link href="http://mez.sh/2017/07/21/homographynet-deep-image-homography-estimation/"/>
   <updated>2017-07-21T00:00:00+00:00</updated>
   <id>http://mez.sh/2017/07/21/homographynet-deep-image-homography-estimation</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Today we are going to talk about a paper I read a month ago titled &lt;a href=&quot;https://arxiv.org/abs/1606.03798&quot;&gt;Deep Image Homography Estimation&lt;/a&gt;. It is a paper that presents a deep convolutional neural network for estimating the relative homography between a pair of images.&lt;/p&gt;

&lt;h3 id=&quot;what-is-a-homography&quot;&gt;&lt;em&gt;What is a Homography?&lt;/em&gt;&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;In projective geometry, a homography is an isomorphism of projective spaces, induced by an isomorphism of the vector spaces from which the projective spaces derive. - &lt;a href=&quot;https://en.wikipedia.org/wiki/Homography&quot;&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you understood that, go ahead and skip directly to the deep learning model discussion. For the rest of us, let’s go ahead and learn some terminology. If you can forget about that definition from Wikipedia for a second, let’s use some analogies to at least get a high-level idea.&lt;/p&gt;

&lt;h3 id=&quot;isomorphism&quot;&gt;Isomorphism?&lt;/h3&gt;
&lt;p&gt;If you decide to bake a cake, the process of taking the ingredients and following the recipe to create the cake could be seen as morphing. The recipe allowed you to morph/transition from ingredients to a cake. If you’re thinking wtf, just stick with me a bit longer. Now, imagine the same recipe could allow you to take a cake and morph/transition back to the ingredients if you follow it in reverse! In a nutshell, that is the definition of an isomorphism: a recipe that lets you transition between the two, loosely speaking.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Isomorphism in mathematics is morphism or a mapping (recipe) that can also give you the inverse, again loosely speaking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;projective-geometry&quot;&gt;Projective geometry?&lt;/h3&gt;
&lt;p&gt;I don’t have a clever analogy here, but I’ll give you a simple example to tie it all together. Imagine you’re driving and your dash cam snapped a picture of the road in front of you (let’s call this pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;); at the same time, imagine there was a drone right above you and it also took a picture of the road in front of you (let’s call this pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;). You can see that pic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; are related, but how? They’re both pictures of the road in front of you; the only difference is the perspective! The big question….&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Is there a recipe/isomorphism that can take you from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; and vice versa?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There you go, the question you just asked is what a Homography tries to answer. Homography is an isomorphism of perspectives. A 2D homography between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; would give you the projective transformation between the two images! It is a 3x3 matrix that describes that perspective mapping. Entire books are written on these concepts, but hopefully we now have the general idea to continue.&lt;/p&gt;

&lt;h3 id=&quot;note&quot;&gt;&lt;em&gt;NOTE&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;There are some constraints about a homography I have not mentioned such as:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Usually both images are taken from the same camera.&lt;/li&gt;
  &lt;li&gt;Both images should be viewing the same plane.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;h3 id=&quot;where-is-homography-used&quot;&gt;&lt;em&gt;Where is homography used?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;To name a few, the homography is an essential part of the following&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Creating panoramas&lt;/li&gt;
  &lt;li&gt;Monocular SLAM&lt;/li&gt;
  &lt;li&gt;3D image reconstruction&lt;/li&gt;
  &lt;li&gt;Camera calibration&lt;/li&gt;
  &lt;li&gt;Augmented Reality&lt;/li&gt;
  &lt;li&gt;Autonomous Cars&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;how-do-you-get-the-homography&quot;&gt;&lt;em&gt;How do you get the homography?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;In traditional computer vision, the homography estimation process is done in two stages.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Corner estimation&lt;/li&gt;
  &lt;li&gt;Homography estimation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Due to the nature of this problem, these pipelines only produce estimates. To make the estimates more robust, practitioners go as far as manually engineering corner-ish features, line-ish features, etc., as mentioned in the paper.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness into the homography estimation step shows up as heavy use of RANSAC or robustification of the squared loss function. - &lt;a href=&quot;https://arxiv.org/abs/1606.03798&quot;&gt;Deep Image Homography Estimation&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is a very hard problem that is error-prone and requires heavy compute to get any sort of robustness. Here we are, finally ready to talk about the question this paper wants to answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Is there a single robust algorithm that, given a pair of images, simply returns the homography relating the pair?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;homographynet-the-model&quot;&gt;HomographyNet: The Model&lt;/h2&gt;

&lt;p&gt;HomographyNet is a VGG-style CNN that produces the homography relating two images. The model doesn’t require a two-stage process and all the parameters are trained in an end-to-end fashion!&lt;br /&gt;
&lt;img src=&quot;/public/img/hn/homographynet.png&quot; alt=&quot;alt HomographyNet Model&quot; /&gt;&lt;/p&gt;

&lt;p&gt;HomographyNet as described in the paper comes in two flavors: classification and regression. Each version has its own pros and cons, including a different loss function.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/hn/twoheads.png&quot; alt=&quot;alt Classification HomographyNet vs Regression HomographyNet.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The regression network produces eight real-valued numbers and uses the Euclidean (L2) loss as the final layer. The classification network uses a quantization scheme with a softmax as the final layer. Since they stick a continuous range into finite bins, they end up with quantization error; in exchange, the classification network also produces a confidence for each of the corners it predicts. The paper uses 21 quantization bins for each of the eight output dimensions, which results in a final layer of 168 output neurons. Each corner’s 2D grid of scores is interpreted as a distribution.&lt;/p&gt;
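
&lt;p&gt;To make that concrete, the two heads boil down to something like this (a minimal Keras sketch of my reading of the paper; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;features&lt;/code&gt; stands in for the flattened output of the VGG-style backbone):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.layers import Dense, Reshape, Activation

# Regression head: 8 real-valued corner offsets, trained with an L2 loss
regression_out = Dense(8, name=&apos;delta_corners&apos;)(features)

# Classification head: 21 quantization bins per offset, 8 x 21 = 168 logits in total
bins = Dense(8 * 21, name=&apos;bin_logits&apos;)(features)
classification_out = Activation(&apos;softmax&apos;)(Reshape((8, 21))(bins))   # one distribution per offset
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;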

&lt;p&gt;&lt;img src=&quot;/public/img/hn/classificationconfidence.png&quot; alt=&quot;alt Corner Confidences Measure&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The architecture seems simple enough, but how is the model trained? Where is the labeled dataset coming from? Glad you asked!&lt;/p&gt;

&lt;p&gt;As the paper states and I agree, the simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale. In the 3x3 homography matrix, the top-left 2x2 submatrix [H11, H12; H21, H22] is responsible for the rotational terms and [H13, H23] handles the translational offset. This way you can map each pixel at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u,v,1]&lt;/code&gt; from the image against the homography, like the figure below, to get the new projected position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u&apos;,v&apos;,1]&lt;/code&gt;.&lt;br /&gt;
&lt;img src=&quot;/public/img/hn/homographymatrix.png&quot; alt=&quot;alt HomographyNet 3x3 matrix&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However! If you try to unroll the 3x3 matrix and use it as the label (ground truth), you’d end up mixing the rotational and translational components. Creating a loss function that balances the mixed components would have been an unnecessary hurdle. Hence, it was better to use well known loss functions (L2 and Cross-Entropy) and instead figure out a way to &lt;em&gt;re-parameterize&lt;/em&gt; the homography matrix!&lt;/p&gt;

&lt;h3 id=&quot;the-4-point-homography-parameterization&quot;&gt;The 4-Point Homography Parameterization&lt;/h3&gt;

&lt;p&gt;The 4-Point parameterization is based on corner locations, removing the need to store rotational and translational terms in the label! This is not a new method by any means, but the use of it as a label was clever! Here is how it works.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/hn/4point.png&quot; alt=&quot;alt 4-Point Homography Parameterization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Earlier we saw how each pixel &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u,v,1]&lt;/code&gt; was transformed by the 3x3 matrix to produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[u&apos;,v&apos;,1]&lt;/code&gt;.
Well if you have at least four pixels where you calculate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_u = u&apos; - u&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delta_v = v&apos; - v&lt;/code&gt;
then it is possible to reconstruct the 3x3 homography matrix that was used! You could for example use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPerspectiveTransform()&lt;/code&gt; method in OpenCV. From here on out we will call these four pixels (points) as corners.&lt;/p&gt;
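
&lt;p&gt;In code, that reconstruction is a one-liner (a minimal sketch; it assumes a 128x128 patch and a (4, 2) array of predicted offsets):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2
import numpy as np

# deltas: (4, 2) array of predicted (delta_u, delta_v) offsets for a 128x128 patch
corners = np.float32([[0, 0], [127, 0], [127, 127], [0, 127]])
perturbed = corners + deltas.astype(np.float32)
H_AB = cv2.getPerspectiveTransform(corners, perturbed)   # back to the 3x3 homography
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;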

&lt;h3 id=&quot;training-data-generation&quot;&gt;Training Data Generation&lt;/h3&gt;

&lt;p&gt;Now that we have a way to represent the homography such that we can use well known loss functions, we can start talking about how the training data is generated.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/public/img/hn/datagen.png&quot; alt=&quot;Training Data Generation&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As a sidenote, &lt;em&gt;H&lt;sup&gt;AB&lt;/sup&gt; denotes the homography matrix between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Creating the dataset was pretty straightforward, so I’ll only highlight some of the things that might get tricky. The steps, as described in the figure above, are (a minimal code sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Randomly crop a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128&lt;/code&gt; patch at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; from the grayscale image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt; and call it patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;. &lt;em&gt;Staying away from the edges!&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Randomly perturb the four corners of patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; within the range &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[-rho, rho]&lt;/code&gt; and let’s call this new position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&apos;&lt;/code&gt;. &lt;em&gt;The paper used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rho=32&lt;/code&gt;&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Compute the homography of patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; using position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&apos;&lt;/code&gt; and we’ll call this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;Take the inverse of the homography (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;)&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-1&lt;/code&gt;&lt;/sup&gt; which equals &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BA&lt;/code&gt;&lt;/sup&gt; and apply that to image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt;, calling this new image &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&apos;&lt;/code&gt;. Crop a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128&lt;/code&gt; patch at position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; from image &lt;strong&gt;&lt;em&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&apos;&lt;/code&gt;&lt;/em&gt;&lt;/strong&gt; and call it patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Finally, take patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; and stack them channel-wise. This will give you a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128x128x2&lt;/code&gt; image that you’ll use as input to the models! The label would be the 4-point parameterization of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;H&lt;/code&gt;&lt;sup&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AB&lt;/code&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ol&gt;
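
&lt;p&gt;Here is that five-step process as a minimal sketch (rho and the patch size come from the paper; the variable names are mine):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2
import numpy as np

rho, size = 32, 128
h, w = image.shape[:2]                                   # `image` is a grayscale MS-COCO image
x = np.random.randint(rho, w - size - rho)               # step 1: crop away from the edges
y = np.random.randint(rho, h - size - rho)
corners = np.float32([[x, y], [x + size, y], [x + size, y + size], [x, y + size]])
patch_A = image[y:y + size, x:x + size]

perturbed = corners + np.random.uniform(-rho, rho, (4, 2)).astype(np.float32)   # step 2
H_AB = cv2.getPerspectiveTransform(corners, perturbed)                          # step 3
warped = cv2.warpPerspective(image, np.linalg.inv(H_AB), (w, h))                # step 4: apply H_BA to I
patch_B = warped[y:y + size, x:x + size]

model_input = np.dstack([patch_A, patch_B])              # step 5: the 128x128x2 network input
label = (perturbed - corners).flatten()                  # the 4-point parameterization of H_AB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;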

&lt;p&gt;They cleverly used this five-step process on random images from the &lt;em&gt;MS-COCO&lt;/em&gt; dataset to create 500,000 training examples. Pretty damn cool (excuse my English).&lt;/p&gt;

&lt;h2 id=&quot;the-code&quot;&gt;The Code&lt;/h2&gt;

&lt;p&gt;If you’d like to see the regression network coded up in Keras &lt;em&gt;and&lt;/em&gt; the data generation process visualized, you’re in luck! Follow the links below to my Github repo.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/deep_homography_estimation/blob/master/HomograpyNET.ipynb&quot;&gt;HomographyNet Regression variant&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/deep_homography_estimation/blob/master/Dataset_Generation_Visualization.ipynb&quot;&gt;Dataset Generation Visualization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;That pretty much highlights the major parts of the paper. I am hoping you now have an idea of what the paper was about and learned something new! Go read the paper because I didn’t talk about the results etc.&lt;/p&gt;

&lt;p&gt;I’d like to think an easy future improvement would be to swap out the heavy VGG network for SqueezeNet, giving you the benefit of a smaller network. I’ll maybe experiment with this idea and see if I can match or improve on their results. As for possible uses today, I could see this network being used as a cascade classifier. The traditional methods are very compute-heavy, so if we could put this network in front to filter out the easy wins, we could cut down on compute cost.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next paper and please comment with any corrections or thoughts. That is all folks!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>SqueezeDet: Deep Learning for Object Detection</title>
   <link href="http://mez.sh/2017/04/21/squeezedet-deep-learning-for-object-detection/"/>
   <updated>2017-04-21T00:00:00+00:00</updated>
   <id>http://mez.sh/2017/04/21/squeezedet-deep-learning-for-object-detection</id>
   <content type="html">&lt;h3 id=&quot;why-bother-writing-this-post&quot;&gt;&lt;em&gt;Why bother writing this post?&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Often, the examples you see around computer vision and deep learning are about classification. That class of problems asks: what do you see in the image? Object detection is another class of problems that asks: where in the image do you see it?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Classification answers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;what&lt;/code&gt; and Object Detection answers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Object detection has been making great advances in recent years. The &lt;a href=&quot;https://en.wikipedia.org/wiki/%22Hello,_World!%22_program&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;hello world&lt;/em&gt;&lt;/a&gt; of object detection would be using &lt;a href=&quot;https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients&quot; target=&quot;_blank&quot;&gt;HOG&lt;/a&gt; features combined with a classifier like &lt;a href=&quot;https://en.wikipedia.org/wiki/Support_vector_machine&quot; target=&quot;_blank&quot;&gt;SVM&lt;/a&gt; and using sliding windows to make predictions at different patches of the image. This complex pipeline has major drawbacks!&lt;/p&gt;

&lt;h3 id=&quot;cons&quot;&gt;Cons:&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Computationally expensive.&lt;/li&gt;
  &lt;li&gt;Multiple step pipeline.&lt;/li&gt;
  &lt;li&gt;Requires feature engineering.&lt;/li&gt;
  &lt;li&gt;Each step in the pipeline has parameters that need to be tuned individually, but can only be tested together, resulting in a complex trial-and-error process that is not unified.&lt;/li&gt;
  &lt;li&gt;Not realtime.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;pros&quot;&gt;Pros:&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Easy to implement, relatively speaking…&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Speed becomes a major concern when we are thinking of running these models on the edge (IoT, mobile, cars). For example, a car needs to detect where other cars, people, and bikes are, to name a few; I could go on… puppies, kittens… you get the idea. The major motivation for me is the need for speed given the constraints that edge devices have; we need compact models that can make quick predictions and are energy efficient.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-squeezedet-model&quot;&gt;The SqueezeDet Model&lt;/h2&gt;
&lt;p&gt;The latest in object detection is to use a convolutional neural network (CNN) that outputs a regression to predict the bounding boxes. This post is about &lt;a href=&quot;https://arxiv.org/abs/1612.01051&quot; target=&quot;_blank&quot;&gt;SqueezeDet&lt;/a&gt;. I got interested because they used one of my favorite CNNs, SqueezeNet! You can read my last post on &lt;a href=&quot;https://mez.github.io/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator/&quot; target=&quot;_blank&quot;&gt;SqueezeNet&lt;/a&gt; if you haven’t yet. To be fair, SqueezeDet is pretty much just the &lt;a href=&quot;https://pjreddie.com/media/files/papers/yolo.pdf&quot; target=&quot;_blank&quot;&gt;YOLO&lt;/a&gt; model that uses a SqueezeNet.&lt;/p&gt;

&lt;h3 id=&quot;highlevel-squeezedet&quot;&gt;Highlevel SqueezeDet&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/sd/squeezedet.png&quot; alt=&quot;alt SqueezeDet Model&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Inspired by YOLO, SqueezeDet is a single-stage detection pipeline that does region proposal and classification in one single network. The CNN first extracts feature maps from the input image and feeds them to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConvDet&lt;/code&gt; layer. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConvDet&lt;/code&gt; takes the feature maps, overlays them with a WxH grid, and at each cell evaluates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; pre-defined bounding boxes called anchors. Each bounding box has the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Four scalars (x, y, w, h)&lt;/li&gt;
  &lt;li&gt;A confidence score ( Pr(Object)xIOU )&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; conditional classes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hence SqueezeDet has a fixed output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WxHxK(4+1+C)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final step is to use non max suppression aka &lt;a href=&quot;http://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/&quot; target=&quot;_blank&quot;&gt;NMS&lt;/a&gt; to filter the bounding boxes to make the final predictions.&lt;/p&gt;
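
&lt;p&gt;If you have never implemented NMS, it is worth seeing how little code it takes. A minimal greedy sketch in plain NumPy (not the repo’s implementation):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    # boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) confidence scores
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou &amp;lt;= iou_thresh]   # drop boxes that overlap the kept box too much
    return keep
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;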

&lt;p&gt;&lt;img src=&quot;/public/img/sd/kanchors.png&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The network regresses and learns how to transform the highest-probability bounding boxes into the predictions. Since bounding boxes are generated at each cell of the grid, the top &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; bounding boxes sorted by the confidence score are kept as the predictions.&lt;/p&gt;

&lt;h3 id=&quot;the-composite-loss-function&quot;&gt;The Composite Loss function&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/public/img/sd/squeezedet_loss.png&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The figure above is the four part loss function that makes this entire model possible. Don’t get intimidated by it; let’s take it apart and see how it fits together. Each loss function is described below and highlighted:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The yellow that bleeds into the blue loss function is actually supposed to be blue, sorry!&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Yellow: Regression of the scalars for the anchors&lt;/li&gt;
  &lt;li&gt;Green: The confidence score regression which uses &lt;a href=&quot;http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/&quot; target=&quot;_blank&quot;&gt;IOU&lt;/a&gt; of ground and predicted bounding boxes.&lt;/li&gt;
  &lt;li&gt;Blue: Penalize anchors that are not responsible for detection by dropping their confidence score.&lt;/li&gt;
  &lt;li&gt;Pink: The cross entropy.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;using-squeezedet&quot;&gt;Using SqueezeDet&lt;/h2&gt;

&lt;p&gt;The authors of the paper did implement the model via TensorFlow!
Go check it out on github &lt;a href=&quot;https://github.com/BichenWuUCB/squeezeDet&quot; target=&quot;_blank&quot;&gt;https://github.com/BichenWuUCB/squeezeDet&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;thresholding-squeezedet&quot;&gt;Thresholding SqueezeDet&lt;/h3&gt;

&lt;p&gt;You have to tweak how confident or doubtful you want the model to be; the predictions are centered around &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; bounding boxes at each cell. So we have to use the top &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; bounding boxes, sorted by confidence score, and then you can do additional thresholding on the class conditional probability score.&lt;/p&gt;
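
&lt;p&gt;In other words, the filtering is a two-stage affair, roughly like this (a sketch with illustrative names and thresholds, not the repo’s config):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# confidences: (num_boxes,) objectness scores; class_probs: (num_boxes, C) conditional class scores
top_n = np.argsort(confidences)[::-1][:64]                  # keep the top N = 64 boxes
keep = [i for i in top_n if class_probs[i].max() &amp;gt; 0.5]   # then threshold the class probability
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;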

&lt;p&gt;Here is an example of a recent project I did where I tweak the params:&lt;/p&gt;

&lt;p&gt;Like the paper:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N = 64&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The image below shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mc.PLOT_PROB_THRESH      = 0.1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/img/sd/without_thres_test5.jpg&quot; alt=&quot;alt text&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The image below shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mc.PLOT_PROB_THRESH      = 0.5&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;&lt;img src=&quot;/public/img/sd/out_test5.jpg&quot; alt=&quot;alt text&quot; /&gt;&lt;/h2&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Reading through the paper was a real grind. Some of the math notation was a bit wonky, and the paper leans heavily on prior work, so reading it became a recursive process. I literally had to crunch through the entire history of object detection to understand this paper. At the very least I hope you were able to get a high-level understanding of it. Comment below with any corrections or questions!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>MainSqueeze: The 52 parameter model that drives in the Udacity simulator</title>
   <link href="http://mez.sh/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator/"/>
   <updated>2017-02-14T00:00:00+00:00</updated>
   <id>http://mez.sh/deep%20learning/2017/02/14/mainsqueeze-the-52-parameter-model-that-drives-in-the-udacity-simulator</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;What a time to be alive! The year is 2017, Donald Trump is president of the United States of America and autonomous vehicles are all the rage. The field is still in its infancy, and the race for the winning solution to dominate the mass production of autonomous vehicles is ongoing. The two main factions currently are the robotics approach and the end-to-end neural networks approach. Like the four seasons, the AI winter has come and gone. It’s Spring and this is the story of one man’s attempt to explore the pros and cons of the end-to-end neural networks faction in a controlled environment. The hope is to draw some conclusions that will help the greater community advance as a whole.&lt;/p&gt;

&lt;h2 id=&quot;the-controlled-environment&quot;&gt;The Controlled Environment&lt;/h2&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/sim_image.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/udacity/self-driving-car-sim&quot;&gt;Udacity Simulator&lt;/a&gt; which is open sourced will be our controlled environment for this journey. It has two modes, training and autonomous mode. Training mode is for the human to drive and record/collect the driving. The result would be a directory of images from three cameras (left, center, right) and a driver log CSV file that records the image along with steering angle, speed etc. Autonomous mode requires a model that can send the simulator steering angle predictions.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/driver_log.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;The goals of this project are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Use the simulator to collect data of good driving behavior.&lt;/li&gt;
  &lt;li&gt;Construct a convolution neural network in &lt;a href=&quot;https://keras.io/&quot;&gt;Keras&lt;/a&gt; that predicts steering angles from images.&lt;/li&gt;
  &lt;li&gt;Train and validate the model with a training and validation set.&lt;/li&gt;
  &lt;li&gt;Test that the model successfully drives around track one without leaving the road!&lt;/li&gt;
  &lt;li&gt;Draw conclusions for future work.
&lt;!-- more --&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;data-exploration&quot;&gt;Data Exploration&lt;/h2&gt;

&lt;p&gt;If you think the Lewis and Clark expedition was tough, try exploring an unknown dataset. I am being dramatic. When you are exploring data, you want to keep thinking: what is the least amount of data you can sample that will still represent your problem (population, if you like stats)? For this problem we will be using the provided Udacity dataset. Let’s dive in!&lt;/p&gt;

&lt;p&gt;Our goal is steering angle prediction, so let’s take a look at what the dataset shows us! The plot below shows a few takeaways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Range is [-1,1]&lt;/li&gt;
  &lt;li&gt;Clustering around [-0.5, 0.5]&lt;/li&gt;
&lt;/ol&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data1.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;Let’s take a look at another angle, pun intended, of the steering data.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;This histogram is the makings of a sampling nightmare; this is how alternate facts are created! Training on such unbalanced data would leave our model very biased, so we will clean this up.&lt;/p&gt;

&lt;h2 id=&quot;clean-up-steps&quot;&gt;Clean up steps:&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Downsample the over represented examples&lt;/li&gt;
  &lt;li&gt;Upsample the under represented examples&lt;/li&gt;
  &lt;li&gt;Expose more varied examples to try and represent a uniform distribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;1-downsample&quot;&gt;1. Downsample&lt;/h3&gt;

&lt;p&gt;Steering angle zero is over-represented, so drop 90% of those examples. Easy!&lt;/p&gt;
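
&lt;p&gt;Roughly (a minimal sketch; the column names follow the simulator’s driving_log.csv):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd

log = pd.read_csv(&apos;driving_log.csv&apos;)
zero = log[log[&apos;steering&apos;] == 0.0]
nonzero = log[log[&apos;steering&apos;] != 0.0]
log = pd.concat([nonzero, zero.sample(frac=0.1, random_state=0)])   # keep only 10% of zero angles
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;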

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data3v2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;2-upsample&quot;&gt;2. Upsample&lt;/h3&gt;

&lt;p&gt;Time to augment so we can start upsampling under-represented examples. We start by flipping 40% of the examples that do not have a zero steering angle. So far so good!&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data4.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;3-expose-more-varied-examples&quot;&gt;3. Expose more varied examples&lt;/h3&gt;

&lt;p&gt;Variety is the spice of life; this is also true for training a well-generalized model. For this problem we will introduce more steering angles by shifting the examples horizontally and adding or subtracting the appropriate angles corresponding to the shift we performed. There was no magic shift amount; you get this by experimenting. The result is below:&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data5.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;Great, our dataset/sample is looking better. Next, let’s trim this to look more uniform. This was accomplished by the steps below (a minimal code sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Grabbing the bin ranges of the 100 bins.&lt;/li&gt;
  &lt;li&gt;Finding bins that have more than 400 examples.&lt;/li&gt;
  &lt;li&gt;Randomly sample and drop examples so we have no more than 400 examples.&lt;/li&gt;
&lt;/ol&gt;
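
&lt;p&gt;Here is that trimming step as a minimal sketch (the 100 bins and the 400-example cap come from the list above; the code itself is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# angles: the array of steering angles after the previous two steps
counts, edges = np.histogram(angles, bins=100)
bin_ids = np.digitize(angles, edges[1:-1])          # assign each angle to one of the 100 bins
keep = []
for b in range(100):
    in_bin = np.where(bin_ids == b)[0]
    if len(in_bin) &amp;gt; 400:
        in_bin = np.random.choice(in_bin, 400, replace=False)   # cap the bin at 400 examples
    keep.extend(in_bin.tolist())
angles = angles[keep]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;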

&lt;div align=&quot;center&quot;&gt;
   &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/data6v2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;How did I know 400 examples was enough? I simply kept reducing the number until I reached the point where my model could still produce stable results. Below 400, my model started getting unstable results.&lt;/p&gt;

&lt;h2 id=&quot;model-architecture-and-training-strategy&quot;&gt;Model Architecture and Training Strategy&lt;/h2&gt;

&lt;h4 id=&quot;main-goals-for-us-to-think-about-while-creating-a-model&quot;&gt;Main goals for us to think about while creating a model&lt;/h4&gt;
&lt;ol&gt;
  &lt;li&gt;Is it efficient for the task at hand?&lt;/li&gt;
  &lt;li&gt;If this were to be deployed onto hardware in a car, what would the power consumption and usability be like?
    &lt;ul&gt;
      &lt;li&gt;There is an interesting paper called &lt;a href=&quot;https://arxiv.org/abs/1510.00149&quot;&gt;Deep Compression&lt;/a&gt;, which we don’t implement here, but it is food for thought and shows that a model can be tiny and still work!&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started with a modified comma.ai model and I had a successful model, but it needed 300,000 trainable parameters. Reading the Deep Compression paper and a blog post titled &lt;a href=&quot;https://medium.com/@xslittlegrass/self-driving-car-in-a-simulator-with-a-tiny-neural-network-13d33b871234#.x1kdv5hgt&quot;&gt;Self-driving car in a simulator with a tiny neural network&lt;/a&gt;, I quickly realized that we could do better. The linked blog post shows a 63-parameter model that uses tiny images and a small network to get a stable model. I wanted to experiment and see if I could make it smaller and still stable. The solution developed over a number of experiments is a modified &lt;a href=&quot;https://arxiv.org/abs/1602.07360&quot;&gt;SqueezeNet&lt;/a&gt; implementation. With a SqueezeNet you get three additional hyperparameters that are used to generate the fire module:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;S1x1: Number of 1x1 kernels to use in the squeeze layer within the fire module&lt;/li&gt;
  &lt;li&gt;E1x1: Number of 1x1 kernels to use in the expand layer within the fire module&lt;/li&gt;
  &lt;li&gt;E3x3: Number of 3x3 kernels to use in the expand layer within the fire module&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;fire-module-zoomed-in&quot;&gt;Fire Module Zoomed In&lt;/h4&gt;

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/public/img/bc/fire_module.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
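
&lt;p&gt;In Keras, a fire module built from those three hyperparameters looks roughly like this (a minimal sketch, not the exact implementation from my repo):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.layers import Conv2D, concatenate

def fire_module(x, s1x1, e1x1, e3x3):
    squeeze = Conv2D(s1x1, (1, 1), activation=&apos;relu&apos;, padding=&apos;same&apos;)(x)
    expand_1x1 = Conv2D(e1x1, (1, 1), activation=&apos;relu&apos;, padding=&apos;same&apos;)(squeeze)
    expand_3x3 = Conv2D(e3x3, (3, 3), activation=&apos;relu&apos;, padding=&apos;same&apos;)(squeeze)
    return concatenate([expand_1x1, expand_3x3])   # channel-wise concat of the expand outputs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;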

&lt;p&gt;The fire module is the workhorse of SqueezeNet. SqueezeNet as described in the paper has around 700k trainable parameters. I went through a process where I kept reducing the number of parameters while all other variables were kept constant. You get the theme here? We are at the frontier and this calls for empirical testing. Through some experiments I went from 10k parameters to 1005, then 329, then 159, then &lt;strong&gt;63&lt;/strong&gt; and finally &lt;strong&gt;52&lt;/strong&gt;!!&lt;/p&gt;

&lt;h2 id=&quot;the-final-52-parameter-squeezenet-variant-model&quot;&gt;The final 52 parameter squeezenet variant Model!!&lt;/h2&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/SqueezeNet52.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;p&gt;This model combats overfitting by being super tiny and for kicks I added a small dropout layer. The model works on both tracks and has a six second epoch on my late 2012 Macbook air!! From the comma.ai model, it was evident that a validation loss of around 0.03 on 30% of &lt;strong&gt;this&lt;/strong&gt; dataset results in a stable model that can handle the track at a throttle of around 0.2, which is a speed of around 20mph in the simulator. So, I didn’t bother worrying about the epoch hyperparameter. I simply created a custom early termination Keras callback that stopped the training when we hit our requirement.&lt;/p&gt;
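
&lt;p&gt;The callback itself is only a few lines; something along these lines (a sketch of the idea, not the exact callback from my repo):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from keras.callbacks import Callback

class StopAtValLoss(Callback):
    def __init__(self, target=0.03):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get(&apos;val_loss&apos;, float(&apos;inf&apos;)) &amp;lt;= self.target:
            self.model.stop_training = True   # stop once the validation loss hits the requirement
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;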

&lt;blockquote&gt;
  &lt;p&gt;One good rule of thumb I developed from this project is &lt;strong&gt;to try and reduce the number of variables you are tuning to gain better results faster.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;training-strategy&quot;&gt;Training Strategy&lt;/h2&gt;

&lt;p&gt;To get the model to drive in the simulator, you need:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Show the model how to drive straight&lt;/li&gt;
  &lt;li&gt;How to recover if it drifts off track&lt;/li&gt;
  &lt;li&gt;How to handle turns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Udacity slack community was a huge help here; from their experience, I used the left and right camera images and adjusted the steering (+.25 for left, -.25 for right) angles to show the model how to correct steering back to center. Then I used the horizontal shifting to capture more angles. This was enough to get a stable model working on the fastest setting (means lowest resolution) on both tracks.&lt;/p&gt;

&lt;p&gt;Reducing input image size was the next challenge. We do this by first cropping off the top and bottom of the image, which would be noise to the model. Then we resize the image to (64,64), convert to HSV, and keep only the S channel. This takes us from (160,320,3) to (64, 64, 1)!!&lt;/p&gt;

&lt;h3 id=&quot;before-cropping-and-resizing&quot;&gt;Before Cropping and Resizing.&lt;/h3&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/crop1.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;after-cropping-and-resizing&quot;&gt;After Cropping and Resizing.&lt;/h3&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/crop2.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
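
&lt;p&gt;Put together, the preprocessing is just a few lines (a minimal sketch; the exact crop rows here are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import cv2

def preprocess(img):                          # img: a (160, 320, 3) frame read with cv2.imread (BGR)
    img = img[60:140, :, :]                   # crop the sky and the hood
    img = cv2.resize(img, (64, 64))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    return hsv[:, :, 1].reshape(64, 64, 1)    # keep only the S channel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;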

&lt;h2 id=&quot;hyperparameters&quot;&gt;Hyperparameters&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Learning rate: 1e-1 (very aggressive!)&lt;/li&gt;
  &lt;li&gt;Batch size: 128 (tried 64, 128, 256, 1024)&lt;/li&gt;
  &lt;li&gt;Adam optimizer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used Keras to do a validation split on 30% of the data and my custom early termination to stop the training when we reach a validation loss of around 0.03!&lt;/p&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/loss_plot.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Note: this loss plot is from a previous run without early termination&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This model was tiny (52 params!) and with only ~20k images I was able to train it on a 2012 Macbook air. An epoch was about six seconds. The memory requirements were small, so I just loaded the entire dataset!&lt;/p&gt;

&lt;div align=&quot;left&quot;&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/public/img/bc/callback.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;div align=&quot;left&quot;&gt;
  &lt;img src=&quot;/public/img/bc/train.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;pros&quot;&gt;Pros&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Can train without a GPU.&lt;/li&gt;
  &lt;li&gt;Enough to pass the challenge.&lt;/li&gt;
  &lt;li&gt;Smaller model meant I could experiment with more variables and different models to gain better intuition.&lt;/li&gt;
  &lt;li&gt;Learned that our current method of training with back prop is not efficient and that we can achieve a lot with a smaller network.&lt;/li&gt;
  &lt;li&gt;Get to use aggressive learning rate for faster convergence :D&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;cons&quot;&gt;Cons&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Can not handle highest resolution setting.&lt;/li&gt;
  &lt;li&gt;Can not go over .22 throttle and still be stable.&lt;/li&gt;
  &lt;li&gt;As the network increases in size, it becomes hard to understand why decisions are being made. I can see this being a &lt;strong&gt;huge&lt;/strong&gt; problem for legal reasons. The end-to-end neural network faction is not looking so good here!&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
  &lt;p&gt;This brings me to the other rule of thumb: &lt;strong&gt;every problem will have tradeoffs&lt;/strong&gt;. The question becomes: what are the tradeoffs you are willing to make? This will depend on the business case you are solving!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;the-github-repo-link&quot;&gt;The github repo link&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mez/carnd/blob/master/P3_behavioral_cloning/&quot;&gt;P3_behavioral_cloning&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1602.07360&quot;&gt;SqueezeNet&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1510.00149&quot;&gt;Deep Compression&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://chatbotslife.com/learning-human-driving-behavior-using-nvidias-neural-network-model-and-image-augmentation-80399360efee#.zbkgfcnoz&quot;&gt;Vivek Yadav’s Learning human driving behavior using NVIDIA’s neural network model and image augmentation.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/@xslittlegrass/self-driving-car-in-a-simulator-with-a-tiny-neural-network-13d33b871234#.x1kdv5hgt&quot;&gt;xslittlegrass’ Self-driving car in a simulator with a tiny neural network&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/commaai/research/blob/master/train_steering_model.py&quot;&gt;Comma.ai steering model&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</content>
 </entry>
 

</feed>
