SIAGA: Topology of Data

Seven pictures from Applied Algebra and Geometry: Picture #4

The Society for Industrial and Applied Mathematics, SIAM, has recently released a journal of Applied Algebra and Geometry called SIAGA. The poster for the journal features seven pictures. In this blog post I will talk about the fourth picture. See here for my blog posts on pictures onetwo and three. And see here for more information on the new journal.

In the first section of this post “The Context”, I’ll set the mathematical scene and in the second section “The Picture” I’ll talk about this particular image, representing Topology of Data.

fig4big

The Context

Topology offers a set of tools that can be used to understand the shape of data. The techniques detect intrinsic geometric structures that are robust to many common sources of error including noise and arbitrary choice of metric. For an introduction, see the book “Elementary Applied Topology” by Robert Ghrist (2014), or the article “Topology and Data” by Gunnar Carlsson (2009).

Say we have noisy data points coming from some unknown space X which we believe possesses an interesting shape. We are interested in using the data to capture the topological invariants of the unknown space. These are its holes of different dimensions, unchanged by continuous squeezing and stretching.

The holes of different dimensions are the homology groups of the space X. They are denoted by H_k(X) for k some non-negative integer. The zeroth homology group tells us the number of zero-dimensional holes or, more intuitively, the connectedness of the space. For a space X with n connected components, it is
H_0(X) = \mathbb{Z}^n,
the free abelian group with n generators. One-dimensional holes are counted by H_1(X). For example, a circle X = S^1 has a single one-dimensional hole, so H_1(S^1) = \mathbb{Z}.

The connectedness properties of sampled data tell us a lot about the underlying space from which they are sampled. In some situations, such as for structural biological information, it is indispensable to know the structure of the holes too. These features are unchanged no matter which metric we use, or which space we embed the points into. The higher homology groups H_k(X) for k \geq 2 similarly give us such summarizing features.

But there’s a problem: sampling N points from a space gives us a collection of zero-dimensional pieces, which  – unless two points land in exactly the same place – are all unconnected. Let us call this data space D_0. The space D_0 has homology groups
H_k(D_0) = \begin{cases} \mathbb{Z}^N & k = 0 \\ 0 & \text{otherwise.} \end{cases}

It is usually the case that many points are very close together, and ought to be considered to come from the same connected component. To measure this we use persistent homology. We take balls of increasing size centered at the original data points, and measure the homology groups of the space consisting of the union of these balls. We call this space D_\epsilon, where \epsilon is the radius of the balls. The important structural features are those that persist for large ranges of values of \epsilon. For some great illustrations of persistent homology, see this post Rachael wrote for our blog in July 2015.

The Picture

This picture shows data points sampled from a torus, which we imagine to live in three-dimensional space. It was made by Dmitriy Morozov, who works at the Lawrence Berkeley National Lab. He applies topological methods in cosmology, climate modeling and material science.

The sampled points in the picture lie on the torus, and furthermore in a more specialized slinky-shaped zone of the torus. This is an important feature of the shape which topological methods will capture.

The original data consists of 5000 points, and our persistent homology approach involves taking three-dimensional balls B_\epsilon(d_i) of radius \epsilon centered at each data point d_i. When the radius \epsilon is very extremely small, none of the balls will be connected, and the shape of our data is indistinguishable from any other collection of 5000 points in space.

Before long, the radius will exceed half the distance to all the points’ nearest neighbors. The 5000 balls join together to form a curled up circular piece of string. Topological invariants do not notice the curling, so topologically the shape obtained is a thickened circle with a one-dimensional hole H_1(D_{\epsilon}) = \mathbb{Z}. When the radius is large enough for the adjacent curls of the slinky to meet, but not to read the opposite side of each curl, we get a hollow torus with H_1(D_{\epsilon}) = \mathbb{Z}^2 and H_2(D_\epsilon) = \mathbb{Z}. Finally, the opposite sides of each curl of the slinky will meet, and they will meet up with the slinky-curls on the opposite side of the torus. Our shape then becomes a three-dimensional shape with no holes, and H_1(D_R) = 0.

In this example, the data points can be visualized and we are able to confirm that our intuition for the important structure of the shape agrees with the homological computations. For higher-dimensional examples it is these persistent features that will guide our understanding of the shape of the data.

Advertisements

One thought on “SIAGA: Topology of Data

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s