Learning Notes of Contrastive Learning


Contrastive learning and metric learning

Related works

Loss

Others

Contrastive learning and metric learning

In the world of machine learning, there are various techniques to help computers understand and recognize patterns in data. Two such techniques, metric learning and contrastive learning, are popular methods for learning meaningful embeddings or representations of data points.

Metric Learning

Metric learning is a method that teaches a computer program to measure the similarity or difference between different data points, like images, text, or sounds. The goal is to learn a distance metric in the embedding space such that similar data points are close to each other, while dissimilar points are farther apart. This technique is often used in tasks like k-Nearest Neighbors classification, clustering, and image retrieval, where distance or similarity between data points is important.

Example:
Consider a scenario where you want to teach a program to recognize different types of animals, like cats, dogs, and birds. Using metric learning, you would provide the program with images of these animals and teach it to measure the similarities and differences between them. This understanding allows the program to correctly classify new images of animals it has never seen before.
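A minimal PyTorch sketch of this idea (my own illustration, not part of the original notes): train an embedding network so that an anchor sits closer to a same-class positive than to a different-class negative, here using the built-in triplet margin loss. The network architecture and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy embedding network; the architecture and feature sizes are illustrative assumptions.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

# Learn a metric where d(anchor, positive) < d(anchor, negative) by at least a margin.
criterion = nn.TripletMarginLoss(margin=1.0, p=2)

anchor   = embed(torch.randn(16, 128))   # e.g. cat images (already featurized)
positive = embed(torch.randn(16, 128))   # other cat images
negative = embed(torch.randn(16, 128))   # dog or bird images

loss = criterion(anchor, positive, negative)
loss.backward()   # gradients pull same-class embeddings together, push others apart
```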

Contrastive Learning

Contrastive learning is another technique that helps computer programs recognize and understand data points by comparing one thing to a group of other things. The main idea is to generate representations where similar pairs have similar embeddings, while dissimilar pairs have distinct embeddings. Contrastive learning has recently gained popularity in self-supervised and unsupervised learning, particularly for tasks like representation learning, pretraining for downstream tasks, and learning disentangled representations.

Example:
In the same animal recognition scenario, you would provide the program with pairs of images — some containing the same type of animal and others containing different types of animals. The program would then learn to compare the images and understand what makes the same-type pairs similar and the different-type pairs distinct. This way, when it encounters new images of animals, it can recognize the differences and classify them correctly.

Conclusion

Both metric learning and contrastive learning play crucial roles in machine learning, particularly in representation learning. While they share some similarities in their goals, they differ in their objectives, loss functions, and applications. Metric learning focuses on directly optimizing the embedding space based on the relationships between data points, while contrastive learning emphasizes the differences between similar and dissimilar pairs of data points.

Related works

SimCLR

(Figure: the SimCLR framework)
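Since the figure is not reproduced here, a hedged sketch of SimCLR's NT-Xent objective (my own simplified rendering, with assumed shapes and temperature): two augmented views of each image are encoded, and each view must identify its partner among all other samples in the batch.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Simplified NT-Xent loss. z1, z2: projections of two augmented views, shape (N, d)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d) unit-norm embeddings
    sim = (z @ z.t()) / temperature                              # (2N, 2N) scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-pairs
    # The positive for row i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```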

Supervised contrastive learning

The proposed loss uses the labels to treat all samples from the same class in a batch as positives and contrasts them against the negatives from the remainder of the batch.

(Figure: supervised contrastive learning)
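A hedged PyTorch sketch of this label-aware loss (my own simplified version, not the authors' reference implementation), assuming embeddings `z` of shape (N, d) and integer class labels:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    """Simplified supervised contrastive loss: all same-label samples in the batch are positives."""
    z = F.normalize(z, dim=1)
    sim = (z @ z.t()) / temperature                      # (N, N) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # log-probability of each candidate given anchor i, excluding the anchor itself
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
    # average log-prob over positives per anchor, then average over anchors that have positives
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.any(dim=1)].mean()

loss = supcon_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```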

CLIP

(Figure: the CLIP approach)
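The CLIP figure is not reproduced here; as a hedged sketch of the idea (my own simplification of the paper's pseudocode), the objective is a symmetric cross-entropy over an image-text similarity matrix whose diagonal holds the matched pairs. The embedding sizes and the fixed temperature below are illustrative assumptions (in CLIP the logit scale is learned).

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale=14.3):
    """Symmetric contrastive loss over the (N, N) image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = logit_scale * (image_emb @ text_emb.t())   # matched pairs lie on the diagonal
    targets = torch.arange(len(image_emb))
    loss_i = F.cross_entropy(logits, targets)           # each image picks its caption
    loss_t = F.cross_entropy(logits.t(), targets)       # each caption picks its image
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```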

Loss

Contrastive loss

Contrastive loss is one of the earliest training objectives used for deep metric learning in a contrastive fashion.

Given a list of input samples $\{x_i\}$, each with a corresponding label $y_i \in \{1, \dots, L\}$ among $L$ classes, we would like to learn a function $f_\theta(\cdot): \mathcal{X} \to \mathbb{R}^d$ that encodes $x_i$ into an embedding vector such that examples from the same class have similar embeddings and samples from different classes have very different ones.

Thus, the contrastive loss takes a pair of inputs $(x_i, x_j)$ and minimizes the embedding distance when they are from the same class but maximizes the distance otherwise:

$$\mathcal{L}_{\text{cl}}(x_i, x_j, \theta) = \mathbb{1}[y_i = y_j]\,\big\|f_\theta(x_i) - f_\theta(x_j)\big\|_2^2 + \mathbb{1}[y_i \neq y_j]\,\max\big(0,\, \epsilon - \|f_\theta(x_i) - f_\theta(x_j)\|_2\big)^2$$

where $\epsilon$ is a hyperparameter defining the lower-bound distance between samples of different classes.
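A direct (hedged) PyTorch translation of the formula above, with $\epsilon$ written as `margin` and the class-equality indicator passed in as a 0/1 tensor; shapes are illustrative:

```python
import torch

def contrastive_loss(emb_i, emb_j, same_class, margin=1.0):
    """Pairwise contrastive loss: pull same-class pairs together, push different-class
    pairs at least `margin` apart. `same_class` is a float tensor of 1s and 0s."""
    dist = torch.norm(emb_i - emb_j, p=2, dim=1)
    pos_term = same_class * dist.pow(2)
    neg_term = (1 - same_class) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos_term + neg_term).mean()

loss = contrastive_loss(torch.randn(16, 32), torch.randn(16, 32),
                        same_class=torch.randint(0, 2, (16,)).float())
```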

NCE

Noise Contrastive Estimation (NCE) is a method for estimating the parameters of a statistical model. The idea is to run logistic regression to tell apart the target data from noise.

Let $x$ be the target sample with $P(x \mid C=1; \theta) = p_\theta(x)$, and let $\hat{x}$ be the noise sample with $P(\hat{x} \mid C=0) = q(\hat{x})$. Note that logistic regression models the logit (i.e. log-odds), and in this case we would like to model the logit of a sample $u$ coming from the target data distribution rather than from the noise distribution:

$$\ell_\theta(u) = \log\frac{p_\theta(u)}{q(u)} = \log p_\theta(u) - \log q(u) \;\in\; (-\infty, +\infty)$$

After converting the logits into probabilities with the sigmoid $\sigma(\cdot)$, we can apply the cross-entropy loss:

$$\mathcal{L}_{\text{NCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\log \sigma\big(\ell_\theta(x_i)\big) + \log\big(1 - \sigma(\ell_\theta(\hat{x}_i))\big)\Big], \quad \text{where } \sigma(\ell) = \frac{1}{1 + \exp(-\ell)} = \frac{p_\theta}{p_\theta + q} \in (0, 1)$$
In many follow-up works, contrastive loss incorporating multiple negative samples is also broadly referred to as NCE.
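A hedged PyTorch sketch of the binary NCE loss defined above, assuming we can evaluate the model log-density $\log p_\theta$ and the noise log-density $\log q$ at the target and noise samples (the random values in the example call are placeholders, not real densities):

```python
import torch
import torch.nn.functional as F

def nce_loss(log_p_data, log_q_data, log_p_noise, log_q_noise):
    """Binary NCE loss: classify target samples (label 1) vs noise samples (label 0)
    using the log-odds l(u) = log p_theta(u) - log q(u)."""
    logit_data = log_p_data - log_q_data      # l_theta(x_i) for target samples
    logit_noise = log_p_noise - log_q_noise   # l_theta(x_hat_i) for noise samples
    loss_data = F.binary_cross_entropy_with_logits(logit_data, torch.ones_like(logit_data))
    loss_noise = F.binary_cross_entropy_with_logits(logit_noise, torch.zeros_like(logit_noise))
    return loss_data + loss_noise

# Illustrative call with made-up log-densities for N = 4 target and 4 noise samples.
loss = nce_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```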

InfoNCE

The InfoNCE loss uses a categorical cross-entropy loss to identify the positive sample amongst a set of unrelated noise samples.

Given a context vector $c$, the positive sample should be drawn from the conditional distribution $p(x \mid c)$, while the $N-1$ negative samples are drawn from the proposal distribution $p(x)$, independent of the context $c$.

The InfoNCE loss optimizes the negative log probability of classifying the positive sample correctly:
$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log\frac{f(x, c)}{\sum_{x' \in X} f(x', c)}\right]$$

where the score function $f(x, c) \propto \frac{p(x \mid c)}{p(x)}$ estimates the density ratio between the positive conditional and the proposal distribution.
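A hedged PyTorch sketch of InfoNCE, assuming the score $f$ is a temperature-scaled dot product between a context embedding and candidate embeddings, with the positive placed at index 0; shapes and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(context, candidates, temperature=0.1):
    """InfoNCE: for each context vector, candidates[:, 0] is the positive sample and
    candidates[:, 1:] are the N-1 negatives. Shapes: context (B, d), candidates (B, N, d)."""
    context = F.normalize(context, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    logits = torch.einsum('bd,bnd->bn', context, candidates) / temperature  # (B, N) scores
    targets = torch.zeros(len(context), dtype=torch.long)                   # positive is index 0
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 64), torch.randn(8, 16, 64))
```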

Others

When NOT CLR

You will find this work very insightful if you study contrastive learning. Unlike previous works in un-/self-supervised learning that propose learning augmentation-invariant representations, the authors stress the importance of preserving some style information (e.g., distinguishing red vs. yellow cars).

They demonstrate that the style-variant framework outperforms, by a decent margin, some SOTA methods that learn invariant representations. For example:

Adding rotation may help with view-independent aerial image recognition, but it significantly downgrades the capacity of a network to solve tasks such as detecting which way is up in a photograph for a display application.

(Figure: the LooC framework)

Key tips

Heavy Data Augmentation: strong, composed augmentations create the diverse positive views that contrastive objectives rely on.

Large Batch Size: a large batch supplies many in-batch negatives, which these losses need to be effective.

Hard Negative Mining: emphasizing negatives that lie close to the anchor in embedding space makes the task harder and the representations sharper.

LoRA

(Figure: LoRA)

What is catastrophic forgetting?
Catastrophic forgetting, in simple terms, is when a machine learning model, such as a neural network, forgets how to perform a task it previously learned after being trained on new, different tasks; this is especially a risk when fine-tuning a pretrained model.

How does LoRA avoid catastrophic forgetting?
Because LoRA keeps the pretrained weights frozen and only trains small low-rank update matrices, the model does not overwrite what it has already learned. In standard fine-tuning, the actual pretrained weights are updated, so there is a real chance of catastrophic forgetting.
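A minimal, hedged sketch of this idea (not the reference implementation): wrap a frozen pretrained linear layer and learn only a low-rank update $BA$, scaled by $\alpha / r$. The layer sizes, rank, and scaling below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad_(False)                     # pretrained knowledge stays untouched
        self.lora_a = nn.Parameter(torch.randn(r, pretrained.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(pretrained.out_features, r))  # BA starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.pretrained(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))    # only lora_a and lora_b receive gradients during fine-tuning
```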

Papers and blogs