Published on June 8, 2020

Introduction to MSI data analysis

We recently published a review paper on unsupervised analysis of MSI data together with profs. Raf Van de Plas (TU Delft) and Richard Caprioli (Vanderbilt University). For in-depth information about the topics we touch upon in this blog series, please consult our review paper and its references.

These approaches are applied to the matrix representation commonly used for MSI data sets, as discussed in our introductory post on MSI data analysis. When representing MSI data in matrix form, D is an m × n data matrix where rows denote pixels and columns denote m/z bins. Hence, for a single MSI data set, m is the number of pixels and n is the number of spectral (m/z) bins, i.e. each row of D represents the mass spectrum of a pixel in the sample along a common m/z binning.
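To make the matrix form concrete, here is a minimal sketch of unfolding an image cube into D. The cube dimensions and the random values are made up purely for illustration, and the row-major pixel ordering is an assumption; real data sets may order pixels differently.

```python
import numpy as np

# Hypothetical toy data cube: 4 x 5 pixels with 6 m/z bins each (sizes are made up).
height, width, n_bins = 4, 5, 6
rng = np.random.default_rng(0)
cube = rng.random((height, width, n_bins))

# Unfold the two spatial dimensions so that each row holds one pixel's spectrum.
D = cube.reshape(height * width, n_bins)
m, n = D.shape  # m pixels, n m/z bins

# With row-major unfolding, row i corresponds to pixel (i // width, i % width).
pixel_index = 7
row, col = divmod(pixel_index, width)
spectrum = D[pixel_index]

# A column of D can be folded back into the ion image of that m/z bin.
ion_image = D[:, 2].reshape(height, width)
```

Keeping track of the pixel ordering matters because it is what lets us fold any per-pixel result (such as a component's scores) back into an image.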

Note that, while we started with all-positive data on axes 1 and 2, the origin of the new orthogonal axes defined by PCA is placed at the center of the data, i.e. at its mean (mean subtraction). This will result in negative values when we project our data onto these new axes, which will be an important point below.
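The effect of mean subtraction is easy to check numerically. The sketch below uses made-up two-dimensional data and computes PCA through a plain SVD, which is one common way to do it and not necessarily how the figures in this post were produced.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical all-positive data: 200 points on two correlated axes.
base = rng.random(200) * 10
X = np.column_stack([base + rng.random(200), 0.5 * base + rng.random(200)])

# Mean subtraction moves the origin to the centre of the data cloud.
mean = X.mean(axis=0)
X_centered = X - mean

# PCA via SVD: the rows of Vt are the new orthogonal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project the centered data onto the new axes.
scores = X_centered @ Vt.T

has_negative_input = bool((X < 0).any())
has_negative_scores = bool((scores < 0).any())
print("negative values in input: ", has_negative_input)
print("negative values in scores:", has_negative_scores)
```

Because each column of scores has zero mean, any component that varies at all must take both positive and negative values.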

Some caveats must be made: ideally, this pseudo-spectrum would give a straightforward list of which biomolecular ions are involved in each region in a biological sense. However, as we said before, the goal of PCA is to capture and summarize as much of the variation in the data as possible per principal component. This is not necessarily the same as correctly modeling the underlying biology or sample content. It’s a bit like having only one sheet of paper to summarize an entire book and going a bit overboard. While it may capture the essence of the book, the end result might not be too reader-friendly or clear on what everything means.

`FactorAnalyzer` module. Similar to the PCA results, we plotted the 10 components that we get out of the Varimax rotation. For each component, we again get a spatial expression and a pseudo-spectrum. As anticipated, the pseudo-spectra for many of the components clearly contain fewer and larger peaks, making the resulting components more readily interpretable. Some of these are now perhaps too sparse to give a proper summarization of the data; however, the peaks in the pseudo-spectra are now much more likely to have the spatial distribution shown on the left.

- Compute the PCA decomposition with a certain (relatively large) number of components, typically 100-500, though this depends on the dataset and use-case.
- Determine the number of PCs that must be retained to capture sufficient variation within the data, for instance using a Pareto chart as in Figure 5.
- Compute the Varimax rotation on the retained PCs to obtain the new basis (which spans the same subspace as the original PCs).
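The three steps above can be sketched as follows. This is only an illustration under several assumptions: the data are synthetic, the `varimax` helper is a textbook NumPy implementation rather than the `FactorAnalyzer` module, and the 95% cumulative-variance cut-off stands in for reading the number of PCs off a Pareto chart.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Textbook Varimax: find an orthogonal rotation R of the loadings."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        grad = loadings.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p)
        u, s, vh = np.linalg.svd(grad)
        R = u @ vh
        d_new = s.sum()
        if d != 0 and d_new / d < 1 + tol:  # criterion stopped improving
            break
        d = d_new
    return loadings @ R, R

# Hypothetical data: 300 pixels x 40 m/z bins from 3 sparse sources plus noise.
rng = np.random.default_rng(0)
sources = np.zeros((3, 40))
sources[0, 2:6] = 1
sources[1, 15:20] = 1
sources[2, 30:33] = 1
D = rng.random((300, 3)) @ sources + 0.01 * rng.random((300, 40))

# Step 1: PCA with a generous number of components (20 here, an arbitrary choice).
D_centered = D - D.mean(axis=0)
U, S, Vt = np.linalg.svd(D_centered, full_matrices=False)
n_initial = 20
explained = S**2 / np.sum(S**2)

# Step 2: keep the fewest PCs whose cumulative variance reaches the threshold.
threshold = 0.95  # assumed cut-off, not taken from the post
cumulative = np.cumsum(explained[:n_initial])
k = min(int(np.searchsorted(cumulative, threshold)) + 1, n_initial)

# Step 3: rotate the retained pseudo-spectra; the scores follow by projection.
basis = Vt[:k].T                      # shape (n_bins, k)
rotated_basis, R = varimax(basis)
rotated_scores = D_centered @ rotated_basis
```

Since R is orthogonal, the rotated pseudo-spectra remain orthonormal and span the same subspace as the retained PCs; only the way the variance is distributed over the components changes.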

There are two important caveats with using Varimax. First off, unlike PCA, the components are no longer ranked by variance, i.e. “importance”, so we would have to look at all 50 components to get a good picture of what is going on. Nonetheless, when applying a Varimax rotation to N principal components, the exact same amount of variance is explained as in PCA with N components. Secondly, unlike PCA, Varimax does not have an analytical solution; it is an iterative algorithm, and a global optimum is not guaranteed. This means that running the algorithm multiple times will generally result in different solutions of comparable quality.
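The first caveat can be checked numerically. In the sketch below, an arbitrary orthogonal rotation stands in for a Varimax result (an assumption made to keep the example short; Varimax also returns an orthogonal rotation), and the data are random numbers with made-up scales.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 500 samples, 8 variables with different spreads.
X = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 1, 1, 1])
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

N = 4
pcs = Vt[:N].T  # first N principal axes, shape (8, N)

# Stand-in for a Varimax result: a random orthogonal N x N rotation.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
rotated = pcs @ Q

# Variance captured by each component before and after rotation.
var_pca = (Xc @ pcs).var(axis=0)
var_rot = (Xc @ rotated).var(axis=0)

total_pca = var_pca.sum()
total_rot = var_rot.sum()
print("per-component variance, PCA:    ", np.round(var_pca, 2))
print("per-component variance, rotated:", np.round(var_rot, 2))
print("totals:", round(total_pca, 6), round(total_rot, 6))
```

The PCA variances come out sorted from largest to smallest, while the rotated ones carry no such guarantee, even though both sets add up to the same total.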

Caveats: Similar to Varimax, fastICA does not rank the components and does not have an analytical solution. Likewise, a global optimum is not guaranteed, and running the algorithm multiple times will generally result in different solutions.
