## SIAGA: Topology of Data

### Seven pictures from Applied Algebra and Geometry: Picture #4

The Society for Industrial and Applied Mathematics, SIAM, has recently released a journal of Applied Algebra and Geometry called SIAGA. The poster for the journal features seven pictures. In this blog post I will talk about the fourth picture. See here for my blog posts on pictures onetwo and three. And see here for more information on the new journal.

In the first section of this post “The Context”, I’ll set the mathematical scene and in the second section “The Picture” I’ll talk about this particular image, representing Topology of Data.

# The Context

Topology offers a set of tools that can be used to understand the shape of data. The techniques detect intrinsic geometric structures that are robust to many common sources of error including noise and arbitrary choice of metric. For an introduction, see the book “Elementary Applied Topology” by Robert Ghrist (2014), or the article “Topology and Data” by Gunnar Carlsson (2009).

Say we have noisy data points coming from some unknown space $X$ which we believe possesses an interesting shape. We are interested in using the data to capture the topological invariants of the unknown space. These are its holes of different dimensions, unchanged by continuous squeezing and stretching.

The holes of different dimensions are the homology groups of the space $X$. They are denoted by $H_k(X)$ for $k$ some non-negative integer. The zeroth homology group tells us the number of zero-dimensional holes or, more intuitively, the connectedness of the space. For a space $X$ with $n$ connected components, it is
$H_0(X) = \mathbb{Z}^n$,
the free abelian group with $n$ generators. One-dimensional holes are counted by $H_1(X)$. For example, a circle $X = S^1$ has a single one-dimensional hole, so $H_1(S^1) = \mathbb{Z}$.

The connectedness properties of sampled data tell us a lot about the underlying space from which they are sampled. In some situations, such as for structural biological information, it is indispensable to know the structure of the holes too. These features are unchanged no matter which metric we use, or which space we embed the points into. The higher homology groups $H_k(X)$ for $k \geq 2$ similarly give us such summarizing features.

But there’s a problem: sampling $N$ points from a space gives us a collection of zero-dimensional pieces, which  – unless two points land in exactly the same place – are all unconnected. Let us call this data space $D_0$. The space $D_0$ has homology groups
$H_k(D_0) = \begin{cases} \mathbb{Z}^N & k = 0 \\ 0 & \text{otherwise.} \end{cases}$

It is usually the case that many points are very close together, and ought to be considered to come from the same connected component. To measure this we use persistent homology. We take balls of increasing size centered at the original data points, and measure the homology groups of the space consisting of the union of these balls. We call this space $D_\epsilon$, where $\epsilon$ is the radius of the balls. The important structural features are those that persist for large ranges of values of $\epsilon$. For some great illustrations of persistent homology, see this post Rachael wrote for our blog in July 2015.

# The Picture

This picture shows data points sampled from a torus, which we imagine to live in three-dimensional space. It was made by Dmitriy Morozov, who works at the Lawrence Berkeley National Lab. He applies topological methods in cosmology, climate modeling and material science.

The sampled points in the picture lie on the torus, and furthermore in a more specialized slinky-shaped zone of the torus. This is an important feature of the shape which topological methods will capture.

The original data consists of 5000 points, and our persistent homology approach involves taking three-dimensional balls $B_\epsilon(d_i)$ of radius $\epsilon$ centered at each data point $d_i$. When the radius $\epsilon$ is very extremely small, none of the balls will be connected, and the shape of our data is indistinguishable from any other collection of 5000 points in space.

Before long, the radius will exceed half the distance to all the points’ nearest neighbors. The 5000 balls join together to form a curled up circular piece of string. Topological invariants do not notice the curling, so topologically the shape obtained is a thickened circle with a one-dimensional hole $H_1(D_{\epsilon}) = \mathbb{Z}$. When the radius is large enough for the adjacent curls of the slinky to meet, but not to read the opposite side of each curl, we get a hollow torus with $H_1(D_{\epsilon}) = \mathbb{Z}^2$ and $H_2(D_\epsilon) = \mathbb{Z}$. Finally, the opposite sides of each curl of the slinky will meet, and they will meet up with the slinky-curls on the opposite side of the torus. Our shape then becomes a three-dimensional shape with no holes, and $H_1(D_R) = 0$.

In this example, the data points can be visualized and we are able to confirm that our intuition for the important structure of the shape agrees with the homological computations. For higher-dimensional examples it is these persistent features that will guide our understanding of the shape of the data.