## Matrix Completion

If we roll a dice and then flip a coin, the outcome of the dice roll shouldn’t affect whether the coin lands head or tails: the two events are independent. This is characterized by the equation

$\mathbb{P}(\text{dice} = i \text{ and} \text{ coin} = j) = \mathbb{P}(\text{dice} = i) \times \mathbb{P}(\text{coin} = j)$

where $i$ could be any of the possible outcomes of the die, $1, 2, 3, 4, 5, 6$, and $j$ could be either heads or tails. Now assume that we have two events, both of which could take 3 possible outcomes. For example, the weather is sunny, cloudy or rainy and I could arrive early, on-time or late. I want to work out whether my arrival time is independent of the weather.

To do this, we collect data on all the possible 9 outcomes, and obtain estimates for the probabilities of these 9 events. We depict them in a table:

$\begin{tabular}{| l || l | l | l |} \hline & Sunny (S) & Cloudy (C) & Rainy (R) \\ \hline \hline Early (E) & 0.05 & 0.03 & 0.02 \\ \hline On-time (O) & 0.2 & 0.12 & 0.08 \\ \hline Late (L) & 0.25 & 0.15 & 0.1 \\ \hline \end{tabular}$

We see from the shape that the data can naturally be given the structure of a $3 \times 3$ matrix,

$P = \left( \begin{matrix} p_{ES} & p_{EC} & p_{ER} \\ p_{OS} & p_{OC} & p_{OR} \\ p_{LS} & p_{LC} & p_{LR} \end{matrix} \right)$

where $p_{ES} = 0.05$ is the probability that I am early and it is sunny, $p_{LR} = 0.1$ is the probability that I am late and it is rainy, etc.

If the two events really are independent, then the probabilities will satisfy the multiplicative relation above, $p_{ij} = p_i p_j$ where $p_E$ is the probability that I am early, etc. In this case, we can write matrix $P$ as

$P = \left( \begin{matrix} p_E p_S & p_E p_C & p_E p_R \\ p_O p_S & p_O p_C & p_O p_R \\ p_L p_S & p_L p_C & p_L p_R \end{matrix} \right)$

This is a rank 1 matrix, because all the rows are scalar multiples of each other, and likewise so are all the columns. So, to test to see if the events are independent, we could see if our computed probabilities in the table above make a rank 1 matrix. This is a test for the events being independent.

What if instead we have the incomplete data

$\begin{tabular}{| l || l | l | l |} \hline & Sunny (S) & Cloudy (C) & Rainy (R) \\ \hline \hline Early (E) & 0.05 & ? & ? \\ \hline On-time (O) & ? & 0.12 & ? \\ \hline Late (L) & ? & ? & 0.1 \\ \hline \end{tabular}$

• Is it possible to show that our events must be independent? (No, think about why!)
• Is it possible to show the events cannot be independent? (Yes, see below)

It might be that there are no possible values for the remaining entries of the table such that the overall matrix of probabilities will be rank 1. This will depend on the known diagonal entries of the matrix.

The picture below (taken from this paper from 2014 called “Matrix Completion for the Independence Model” by Kaie Kubjas and Zvi Rosen) shows one restriction on what the possible diagonal entries could be in order that the rest of the entries can be filled to get something rank one.

The three axes are the possible values for the three diagonal entries and the filled in part of the shape gives the possible combinations that might come from two independent events.