Let’s imagine a hypothetical situation. There’s an infection going round, and we want to predict the future severity of someone’s illness.

There is a test that offers a good prediction. Let’s say the outcome of the test has a correlation of 0.78 with the patient’s severity of infection. The problem with the test is that it is expensive and time-consuming. But there’s an alternative test, which is much cheaper and faster. We don’t know how well the cheap test correlates with the severity of infection, but we know the correlation between the cheap test and the expensive test is quite high, at 0.89.

What can we say about how well the cheap test correlates with the severity of infection?

We might expect to be able to say something about the unknown correlation. For example, if the expensive test had a correlation of 1 with the severity of infection, and the cheap test also had a correlation of 1 with the expensive test, then everything would be perfectly correlated and the cheap test must also have a correlation of 1 with the severity of infection.

But, let’s assume the expensive test only has a correlation of 0.5 with the severity of infection, and that the two tests are also only correlated with correlation 0.5. Now it isn’t clear whether we can say *anything* about the correlation of the cheap test with the severity of infection.

Let’s go back to the numbers in the original example.

We can organise our correlations into a matrix. We have three variables:

- severity of infection
- expensive test outcome
- cheap test outcome

And we build the matrix:

$$R = \begin{pmatrix} 1 & 0.78 & \rho \\ 0.78 & 1 & 0.89 \\ \rho & 0.89 & 1 \end{pmatrix}$$

The entry $R_{ij}$ of the matrix gives the correlation between variable $i$ and variable $j$, with $\rho$ denoting the unknown correlation between the cheap test and the severity of infection. Each variable has a perfect correlation with itself, so the diagonal entries of the matrix are equal to 1. In addition, the correlation between $i$ and $j$ is the same as the correlation between $j$ and $i$, so the matrix is symmetric.

A matrix of correlations has to be positive semi-definite. This condition will allow us to find the range of possible values for the unknown correlation.
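As a quick numerical sketch of this condition, we can test positive semi-definiteness by checking that all eigenvalues of the matrix are non-negative. The helper name `is_psd` below is my own, not from the post:

```python
import numpy as np

def is_psd(R, tol=1e-10):
    # A symmetric matrix is positive semi-definite iff all its
    # eigenvalues are non-negative (up to floating-point tolerance).
    return bool(np.all(np.linalg.eigvalsh(R) >= -tol))

# Candidate correlation matrix with a trial value 0.7 for the
# unknown cheap-test/severity correlation:
R = np.array([[1.0, 0.78, 0.7],
              [0.78, 1.0, 0.89],
              [0.7, 0.89, 1.0]])
print(is_psd(R))  # True: 0.7 is a feasible value
```

Replacing 0.7 with, say, 0.0 makes `is_psd` return `False`: a triple of pairwise correlations is not automatically consistent.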

But first, let’s consider the case where all three correlations are unknown. We then have the correlation matrix

$$\begin{pmatrix} 1 & x & y \\ x & 1 & z \\ y & z & 1 \end{pmatrix}$$

The region of possible values for the correlations is the set of triples $(x, y, z)$ for which the matrix is positive semi-definite. This is a region of 3-dimensional space that looks like this:

It is called the elliptope, a name that dates back at least as far as 1996 (see here); it is also called the samosa, a name that dates back at least as far as 2011 (see here).

We can now fix values for our two known correlations, to see the possible values for the third correlation. The possibilities are all values that keep the triple of correlations inside the samosa.

We can find the upper and lower limits of the third correlation by seeing where the black line intersects the boundary of the samosa. The boundary is defined by setting the determinant of the matrix

$$\begin{pmatrix} 1 & 0.78 & \rho \\ 0.78 & 1 & 0.89 \\ \rho & 0.89 & 1 \end{pmatrix}$$

to be zero. We get a quadratic polynomial in $\rho$ with roots at approximately 0.41 and 0.98. So the third correlation has to lie in the range 0.41 to 0.98.
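Expanding the determinant along the first row gives the quadratic $-\rho^2 + 2(0.78)(0.89)\rho - (0.78^2 + 0.89^2 - 1)$, and its roots can be found numerically. A minimal sketch with NumPy:

```python
import numpy as np

# det(R) as a quadratic in rho, from expanding the 3x3 determinant:
# det = -rho^2 + 2*a*b*rho - (a^2 + b^2 - 1), with a = 0.78, b = 0.89.
a, b = 0.78, 0.89
lo, hi = sorted(np.roots([-1.0, 2 * a * b, -(a**2 + b**2 - 1)]).real)
print(round(lo, 2), round(hi, 2))  # roots near 0.41 and 0.98
```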

So for the infection example, although the cheap test has a high correlation with the expensive test, in the worst case it only offers a correlation of 0.41 with the severity of infection.

If the two known correlations had both been 0.5, a similar computation shows that the third correlation has to lie in the range -0.5 to 1. The third correlation could even be negative!
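More generally, setting the determinant to zero gives a closed-form interval for the third correlation: $r_{12}r_{23} \pm \sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}$. A small sketch (the function name `third_corr_bounds` is my own):

```python
import math

def third_corr_bounds(r12, r23):
    # Closed-form interval for the unknown correlation r13 that keeps
    # the 3x3 correlation matrix positive semi-definite.
    half_width = math.sqrt((1 - r12**2) * (1 - r23**2))
    return r12 * r23 - half_width, r12 * r23 + half_width

print(third_corr_bounds(0.5, 0.5))    # (-0.5, 1.0)
print(third_corr_bounds(0.78, 0.89))  # reproduces roughly (0.41, 0.98)
```

With both known correlations at 0.5, the centre of the interval is 0.25 but the half-width is 0.75, which is how a negative third correlation becomes possible.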

This is the first part of a small series of “correlated” posts that I hope to write about correlations – stay tuned for more!