import numpy as np
Introduction
Information flows around us. It's everywhere: a well-known play, a painting, a bunch of numbers, or a video stream. For computers, all of these are represented by just two digits, 0 and 1, and they carry information. "Information theory studies the transmission, processing, extraction, and utilization of information." (Wikipedia) In simple words, with information theory, given different kinds of signals, we try to measure how much information is present in each of them. The theory itself originates from Claude Shannon's seminal work, A Mathematical Theory of Communication.
It will be helpful to see how machine learning and information theory are related. "Dive Into Deep Learning" (hereafter d2l) describes this relationship as follows:
Machine learning aims to extract interesting signals from data and make critical predictions. On the other hand, information theory studies encoding, decoding, transmitting, and manipulating information. As a result, information theory provides a fundamental language for discussing the information processing in machine learned systems. (source)
Information theory is tightly connected to mathematics and statistics. We will see later how, but before that, it's worth noting where the concepts of information theory appear in statistics and mathematics. We all know, or have heard, about random variables drawn from some probability distribution. From linear algebra, we also know how to measure the distance between two points or between two planes. But how can we measure the distance between two probability distributions? In other words, how similar or dissimilar are two probability distributions? Information theory gives us the ability to answer this question and quantify the similarity between two distributions. Before we continue, let me outline the measurement unit of information theory. Shannon introduced the bit as the unit of information. A series of 0s and 1s encodes any data. Accordingly, a sequence of binary digits of length $n$ carries $n$ bits of information.
There are a few main concepts in information theory, and I will go through each of them in detail. First in line is:
Self-Information
To understand this concept well, I will review two examples: one from statistics and probability, and the second from information theory. Let's start with statistics and probability. Imagine we conduct an experiment with several outcomes, each occurring with a different probability, for example rolling a fair die, where each face comes up with uniform probability $\frac{1}{6}$. The less probable an event is, the more surprising it is when it happens, and hence the more information its occurrence carries; among events with different probabilities, the rarer one always has the larger self-information. The formula makes this precise: for an event $X$ that occurs with probability $p(X)$, its self-information is

$$I(X) = -\log_2 p(X).$$

From an information theory perspective, if we have a series of binary digits of length 4, say 0101, then this particular string occurs with probability $\frac{1}{2^4}$, so its information content is 4 bits according to our formula:
def self_information(p):
    return -np.log2(p)

self_information(1 / 2**4)
4.0
The main takeaway here is that if a particular event has 100% probability, its self-information is $-\log_2 1 = 0$ bits: a certain event carries no information.
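As a quick sanity check, we can plug a probability of 1 into the function defined above; it should return zero bits:

# Self-information of a certain event: -log2(1) evaluates to zero bits
# (NumPy may display the result as -0.0).
self_information(1.0)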
We see that self-information only measures the information of a single event. To generalize this notion to any discrete and/or continuous random variable, we arrive at the idea of Entropy.
Entropy
If we have any random variable $X$ that follows a probability distribution with p.d.f. $p(x)$ if it's continuous, or p.m.f. $p(x)$ if it's discrete, can we calculate the average value of the self-information $-\log_2 p(x)$? Yes, and that average is exactly the entropy:

$$H(X) = -\int_x p(x) \log_2 p(x)\, dx \quad \text{(continuous case)}, \qquad H(X) = -\sum_x p(x) \log_2 p(x) \quad \text{(discrete case)},$$

where $p(x)$ is the p.d.f. or p.m.f. of $X$.
In Python, it looks like the following:
# np.nansum sums the values while treating NaNs as zero (this handles p = 0 terms).
def entropy(p):
    out = np.nansum(-p * np.log2(p))
    return out

entropy(np.array([0.1, 0.5, 0.1, 0.3]))
1.6854752972273346
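As a quick check with hypothetical probabilities, a uniform distribution over four outcomes should contain exactly 2 bits, since each outcome is as likely as one particular 2-digit binary string:

# Entropy of a uniform distribution over 4 outcomes: -4 * 0.25 * log2(0.25) = 2 bits.
entropy(np.array([0.25, 0.25, 0.25, 0.25]))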
Here, we only consider a single random variable, $X$. What happens when there are two of them?
Joint Entropy
To review this concept, let me introduce two random variables $X$ and $Y$ with joint p.m.f. $p(x, y)$ (or joint p.d.f. in the continuous case). Their joint entropy is defined as

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y).$$

Here are two important facts. If $X = Y$, then $H(X, Y) = H(X) = H(Y)$; and if $X$ and $Y$ are independent, then $H(X, Y) = H(X) + H(Y)$, as checked numerically below.
def joint_entropy(p_xy):
    out = np.nansum(-p_xy * np.log2(p_xy))
    return out

joint_entropy(np.array([[0.1, 0.5, 0.8], [0.1, 0.3, 0.02]]))
2.0558948969327187
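To see the independence fact in action, we can build a joint distribution as the outer product of two hypothetical marginals (so the variables are independent by construction) and compare the joint entropy with the sum of the individual entropies:

# For independent X and Y, the joint distribution is the outer product of the
# marginals, and H(X, Y) should equal H(X) + H(Y).
p_x = np.array([0.2, 0.8])
p_y = np.array([0.4, 0.6])
p_xy_indep = np.outer(p_x, p_y)
np.isclose(joint_entropy(p_xy_indep), entropy(p_x) + entropy(p_y))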
As we see, joint entropy indicates the amount of information in the pair of two random variables. What if we are interested in knowing how much information is contained, say, in $Y$ once we already know $X$?
Conditional Entropy
Conditional entropy measures how much information remains in one variable once the other is known. The following formula gives this measurement:

$$H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x), \quad \text{where } p(y \mid x) = \frac{p(x, y)}{p(x)}.$$

Let's investigate how conditional entropy is related to entropy and joint entropy. Using the above formula, we can conclude that

$$H(Y \mid X) = H(X, Y) - H(X),$$

meaning that the information contained in $Y$ given $X$ equals the total information in the pair $(X, Y)$ minus the information contained in $X$ alone.
def conditional_entropy(p_xy, p_x):
    p_y_given_x = p_xy / p_x
    out = np.nansum(-p_xy * np.log2(p_y_given_x))
    return out

conditional_entropy(np.array([[0.1, 0.5], [0.2, 0.3]]), np.array([0.2, 0.8]))
0.8635472023399721
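We can also verify the identity $H(Y \mid X) = H(X, Y) - H(X)$ numerically, using a hypothetical joint distribution whose columns are indexed by $x$, so that the marginal of $X$ is the column sum:

# Check H(Y | X) = H(X, Y) - H(X) on a consistent joint distribution.
p_xy = np.array([[0.1, 0.3], [0.2, 0.4]])
p_x = p_xy.sum(axis=0)  # marginal of X (column sums)
np.isclose(conditional_entropy(p_xy, p_x), joint_entropy(p_xy) - entropy(p_x))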
Knowing the conditional entropy means knowing the amount of information contained in $Y$ but not in $X$. How much information, then, do $X$ and $Y$ share?
Mutual Information
To find the mutual information between two random variables $X$ and $Y$, we can start with their joint entropy and remove the information that belongs exclusively to each of them:

$$I(X, Y) = H(X, Y) - H(Y \mid X) - H(X \mid Y).$$

This means that we have to subtract the information only contained in $X$ and the information only contained in $Y$ from the information of the pair. In terms of the distributions, this works out to

$$I(X, Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}.$$

The concept of mutual information, much like the correlation coefficient, allows us to measure the strength of the relationship between two random variables, as well as the maximum amount of information shared between them.
def mutual_information(p_xy, p_x, p_y):
    p = p_xy / (p_x * p_y)
    out = np.nansum(p_xy * np.log2(p))
    return out

mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]), np.array([0.2, 0.8]), np.array([[0.75, 0.25]]))
0.7194602975157967
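Since the function takes the joint distribution and both marginals explicitly, we can also verify the equivalent identity $I(X, Y) = H(X) + H(Y) - H(X, Y)$ on a hypothetical, self-consistent joint distribution (columns indexed by $x$, rows by $y$; the $Y$ marginal is reshaped to a column so that broadcasting forms the outer product $p(x)\,p(y)$):

# Check I(X, Y) = H(X) + H(Y) - H(X, Y) on a consistent joint distribution.
p_xy = np.array([[0.1, 0.3], [0.2, 0.4]])
p_x = p_xy.sum(axis=0)                  # marginal of X, shape (2,)
p_y = p_xy.sum(axis=1).reshape(-1, 1)   # marginal of Y, shape (2, 1)
np.isclose(mutual_information(p_xy, p_x, p_y),
           entropy(p_x) + entropy(p_y) - joint_entropy(p_xy))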
As in the case of the correlation coefficient, mutual information has some notable properties:
- Mutual information is symmetric: $I(X, Y) = I(Y, X)$.
- Mutual information is non-negative: $I(X, Y) \geq 0$.
- $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
We can interpret the mutual information $I(X, Y)$ as the amount of information the two random variables share: knowing one of them reduces our uncertainty about the other by exactly this amount.
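To illustrate the last property above, we can construct an independent pair by taking the outer product of two hypothetical marginals; the mutual information then comes out as zero:

# For independent variables, p(x, y) = p(x) * p(y), so I(X, Y) = 0.
p_x = np.array([0.3, 0.7])
p_y = np.array([[0.4], [0.6]])
p_xy_indep = p_x * p_y   # outer product via broadcasting, shape (2, 2)
mutual_information(p_xy_indep, p_x, p_y)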
Kullback–Leibler Divergence (Relative Entropy)
Earlier, I asked the question about measuring the distance between two probability distributions. The time has come to answer this question precisely. Suppose we have a random variable $X$ that follows the true distribution with p.d.f. or p.m.f. $p(x)$, and we approximate it with another distribution with p.d.f. or p.m.f. $q(x)$. The Kullback–Leibler (KL) divergence, or relative entropy, between them is

$$D_{KL}(P \| Q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = E_{x \sim P}\left[\log_2 \frac{p(x)}{q(x)}\right].$$

The lower the value of the KL divergence, the closer $Q$ is to $P$. It has two notable properties:
- The KL divergence is non-symmetric, or equivalently, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general.
- The KL divergence is non-negative, or equivalently, $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$.
def kl_divergence(p, q):
    kl = p * np.log2(p / q)
    out = np.nansum(kl)
    return np.abs(out)

p = np.random.normal(1, 2, size=1000)
q = np.random.normal(1, 2, size=1000)

kl_divergence(p, q)
RuntimeWarning: invalid value encountered in log2
  kl = p * np.log2(p / q)
810.477871653346
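As a small check of the two properties above, we can also feed the same function two explicitly specified discrete distributions (hypothetical values); both directions come out non-negative and, in general, unequal:

# KL divergence between two small discrete distributions: non-negative in both
# directions, and D(P||Q) != D(Q||P) in general.
p_disc = np.array([0.1, 0.6, 0.3])
q_disc = np.array([0.2, 0.5, 0.3])
kl_divergence(p_disc, q_disc), kl_divergence(q_disc, p_disc)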
Cross Entropy
To understand Cross-Entropy, let me reuse the setup from the KL divergence part. Now, imagine we perform a classification task, where $y$ is the true distribution over labels and $\hat{y}$ is the distribution predicted by our model. The cross-entropy is defined as

$$CE(y, \hat{y}) = H(y) + D_{KL}(y \| \hat{y}).$$

The two terms on the right-hand side are the entropy of the true distribution and the KL divergence between the true and predicted distributions. For one-hot encoded labels, this reduces to the familiar classification loss $-\sum_i y_i \log \hat{y}_i$, which is what the implementation below computes (using the natural logarithm and averaging over examples).
def cross_entropy(y_hat, y):
    ce = -np.log(y_hat[range(len(y_hat)), y])
    return ce.mean()

labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)
0.9485599924429406
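We can reproduce this number by hand: the probabilities assigned to the true classes (label 0 in the first row, label 2 in the second) are 0.3 and 0.5, and averaging their negative natural logarithms gives the same result:

# Manual check: average negative log-probability of the true classes.
-(np.log(0.3) + np.log(0.5)) / 2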
Conclusion
By reviewing these concepts from information theory, we now have a rough sense of how it relates to statistics and mathematics and how it is used in machine learning and deep learning. There is much more to discover, and it's up to you how far you want to go. Even more interesting is how information theory relates to coding theory, gambling, and musical composition.