Documentation

Documentation of the functions used in building the OOD Metric Class

A visual overview of the pipeline.

flowchart LR
  A[Feature\nEmbeddings] --> C{OOD Detection}
  B[In Distribution\nLabels] --> C
  C --> F[Uncertainty Score]
  F --> D[Out of Distribution]
  F --> E[In Distribution]

Utility Functions


source

compute_mean_and_covariance

 compute_mean_and_covariance (embdedding:numpy.ndarray,
                              labels:numpy.ndarray)

Computes class-specific means and a shared covariance matrix.

Type Details
embdedding ndarray (n_sample, n_dim) n_sample - sample size of training set, n_dim - dimension of the embedding
labels ndarray (n_sample, ) n_sample - sample size of training set
Returns Tuple Class means of dimension (n_class, n_dim) and a shared covariance matrix of dimension (n_dim, n_dim)
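
A minimal sketch of how such a function could be implemented, assuming the class means are stacked one row per class and the covariance is pooled (shared) across classes, as the shape checks in the example below suggest. _mean_and_covariance_sketch is an illustrative name, not the library's implementation.

import numpy as np

def _mean_and_covariance_sketch(embedding: np.ndarray, labels: np.ndarray):
    classes = np.unique(labels)
    # One mean vector per class: shape (n_class, n_dim).
    means = np.stack([embedding[labels == c].mean(axis=0) for c in classes])
    # Pool the per-class centred samples before estimating a single covariance.
    centered = np.concatenate([embedding[labels == c] - means[i]
                               for i, c in enumerate(classes)])
    covariance = centered.T @ centered / len(embedding)  # shape (n_dim, n_dim)
    return means, covariance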

source

compute_mahalanobis_distance

 compute_mahalanobis_distance (embdedding:numpy.ndarray,
                               means:numpy.ndarray,
                               covariance:numpy.ndarray)

Computes the Mahalanobis distance between the input and the fitted Gaussians. The Mahalanobis distance (Mahalanobis, 1936) is defined as

\[ \mathrm{distance}(x, \mu, \Sigma) = \sqrt{(x-\mu)^{T}\, \Sigma^{-1}\, (x-\mu)} \]

where x is an input vector, \(\mu\) is the mean vector for a Gaussian, and \(\Sigma\) is the covariance matrix. We compute the distance for all examples in embdedding, and across all classes in means.

Note that this function technically computes the squared Mahalanobis distance, i.e. the square root in the definition above is omitted.

Type Details
embdedding ndarray Embdedding of dimension (n_sample, n_dim)
means ndarray A matrix of size (num_classes, n_dim), where the ith row corresponds to the mean of the fitted Gaussian distribution for the i-th class.
covariance ndarray The shared covariance matrix of the size (n_dim, n_dim)
Returns ndarray A matrix of size (n_sample, n_class) where the (i, j) element corresponds to the Mahalanobis distance between the i-th sample and the j-th class Gaussian.
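
For intuition, the distances for every (sample, class) pair can be computed in one vectorised step. This is a sketch of the idea, not the library's implementation; matching the note above, it returns the squared distance.

import numpy as np

def _mahalanobis_sketch(embedding: np.ndarray, means: np.ndarray,
                        covariance: np.ndarray) -> np.ndarray:
    # diff has shape (n_sample, n_class, n_dim): every sample minus every class mean.
    diff = embedding[:, None, :] - means[None, :, :]
    inv_cov = np.linalg.pinv(covariance)  # pseudo-inverse guards against singular covariance
    # (x - mu)^T Sigma^{-1} (x - mu) for all pairs at once: shape (n_sample, n_class).
    return np.einsum('nkd,de,nke->nk', diff, inv_cov, diff)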

OOD Metric Computation Class


source

OODMetric

 OODMetric (train_embdedding:numpy.ndarray, train_labels:numpy.ndarray)

OOD Metric class that calculates OOD scores for a batch of input embeddings. Initialises the class by fitting class-conditional Gaussians and a class-independent (background) Gaussian to the training data.

Type Details
train_embdedding ndarray An array of size (n_sample, n_dim) where n_sample is the sample size of training set, n_dim is the dimension of the embedding.
train_labels ndarray An array of size (n_sample, ) of integer class labels for the training set.
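
Here is a hypothetical sketch of what the initialisation could look like, inferred from the attribute checks (means, covariance, means_bg, covariance_bg) in the example below. OODMetricSketch is an illustrative name, and the sketch assumes the compute_mean_and_covariance function documented above is in scope.

import numpy as np

class OODMetricSketch:
    def __init__(self, train_embedding: np.ndarray, train_labels: np.ndarray):
        # Class-conditional Gaussians: one mean per class, one shared covariance.
        self.means, self.covariance = compute_mean_and_covariance(
            train_embedding, train_labels)
        # Class-independent (background) Gaussian: treat every sample as one class,
        # which yields means_bg of shape (1, n_dim) as the checks below expect.
        self.means_bg, self.covariance_bg = compute_mean_and_covariance(
            train_embedding, np.zeros_like(train_labels))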

source

OODMetric.compute_rmd

 OODMetric.compute_rmd (embdedding:numpy.ndarray)

This function computes the OOD score for each input using the relative Mahalanobis distance (RMD).

Type Details
embdedding ndarray An array of size (n_sample, n_dim), where n_sample is the sample size of the test set, and n_dim is the size of the embeddings.
Returns ndarray An array of size (n_sample, ) where the ith element corresponds to the ood score of the ith data point.
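
One plausible reading of this computation, given the background Gaussian fitted at initialisation, is sketched below: the relative Mahalanobis distance subtracts the background distance from each class distance, and a sample is scored by its closest class. This is an assumption about the internals, not the library's verbatim code; it assumes compute_mahalanobis_distance from above is in scope.

import numpy as np

def _compute_rmd_sketch(self, embedding: np.ndarray) -> np.ndarray:
    # Squared distances to each class Gaussian: shape (n_sample, n_class).
    md_class = compute_mahalanobis_distance(embedding, self.means, self.covariance)
    # Squared distance to the single background Gaussian: shape (n_sample, 1).
    md_bg = compute_mahalanobis_distance(embedding, self.means_bg, self.covariance_bg)
    # Relative distance, scored with the closest class; larger values suggest OOD.
    return np.min(md_class - md_bg, axis=1)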

Example

Here is an example where we generate 1,000 samples with 1024-dimensional embeddings belonging to 10 clusters. The samples from the last 5 clusters are used as test (OOD) embeddings, and the rest as training embeddings.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from fastcore.test import test_eq

n_samples = 1000
n_centers = 10
n_features = 1024

x, y = make_blobs(n_samples=n_samples, n_features=n_features, centers=n_centers, random_state=0)
train_embedding = x[np.where(y < (n_centers - 5))]
train_labels = y[np.where(y < (n_centers - 5))]

test_embedding = x[np.where(y >= (n_centers - 5))]
test_labels = y[np.where(y >= (n_centers - 5))]
tsne = TSNE(n_components=2).fit_transform(x)

x_train_tsne = tsne[np.where(y < (n_centers - 5))]
x_test_tsne = tsne[np.where(y >= (n_centers - 5))]

plt.scatter(x_train_tsne[:, 0], x_train_tsne[:, 1], marker="o", c=train_labels)
plt.scatter(x_test_tsne[:, 0], x_test_tsne[:, 1], marker="+", c="red")
Scatter plot of the t-SNE projection: training clusters as coloured circles, test (OOD) samples as red crosses.

test_eq(type(train_embedding), np.ndarray) # check that embeddings are a numpy array
test_eq(type(train_labels), np.ndarray) # check that labels are numpy array
test_eq(train_labels.dtype, int) # check that labels are integers only
test_eq(train_labels.ndim, 1) # check that labels is one dimensional
test_eq(train_embedding.shape[0], train_labels.shape[0]) # check n_samples are same
ood = OODMetric(train_embedding, train_labels)
test_eq(ood.means.shape[0], len(np.unique(train_labels))) # for each unique class, we should get one mean embedding
test_eq(ood.means.shape[1], train_embedding.shape[1]) # size of mean vector should be the same of the size of embedding

test_eq(ood.covariance.shape[0], train_embedding.shape[1]) # covariance matrix should be of size n_dim, n_dim
test_eq(ood.covariance.shape[1], train_embedding.shape[1])

test_eq(ood.means_bg.shape[0], 1)
test_eq(ood.means_bg.shape[1], train_embedding.shape[1])

test_eq(ood.covariance_bg.shape[0], train_embedding.shape[1])
test_eq(ood.covariance_bg.shape[1], train_embedding.shape[1])
# testing on the train embedding itself
in_distribution_rmd = ood.compute_rmd(train_embedding)
ood_rmd = ood.compute_rmd(test_embedding)

By looking at the scores, we can get an idea of how to set a threshold for classifying a data point as out of distribution. The histogram below shows how the relative Mahalanobis distances of in-distribution and OOD samples separate.

plt.hist([in_distribution_rmd, ood_rmd], label=["In Distribution", "OOD"])
plt.legend()
plt.show()

test_eq(ood_rmd.shape[0], test_embedding.shape[0])
test_eq(ood_rmd.ndim, 1)
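
To turn the scores into a decision, one simple heuristic is to fix an acceptable false-positive rate on the in-distribution scores and flag anything above that percentile. The 95th percentile below is an illustrative choice, not a value prescribed by the library.

threshold = np.percentile(in_distribution_rmd, 95)  # illustrative cut-off
is_ood = ood_rmd > threshold
print(f"threshold={threshold:.2f}; flagged {is_ood.mean():.0%} of test samples as OOD")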