# Normalizing gene expression values for classification

What’s the best way to normalize an array of gene expression values? First question: Best for what?

This depends on what you want to do with the array. The requirements for sample classification seem to me to be a little different from those for, say, class discovery. What we want for sample classification is to normalize each array in a self-referential manner, without taking into account the rest of the samples, because we want to be able to apply the exact same normalization to any new sample that we might want to classify, given what we’ve learned from the training set.

So suppose we normalize by total mRNA. Now all samples will be points on some high-dimensional hypersurface, but if some gene has high expression values in disease samples, the other genes in disease samples will now have depressed values because of the overall normalization. That doesn’t mean that those genes are biomarkers for that disease.

On the other hand, we could arbitrarily declare that some genes are ‘housekeeping’ genes and normalize one (or several) to a fixed expression level, rescaling all the rest accordingly. This avoids the problem with total mRNA normalization, but exactly how do you decide that a gene is a housekeeping gene without taking into account the entire set of samples? And if you do use the entire set of samples to decide what is a housekeeping gene (i.e. one that shows no correlation with the known classification, for example), then you are not normalizing in a sample-specific manner.
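To make the total mRNA problem concrete, here is a minimal sketch. The expression values and the helper name `normalize_total` are made up for illustration; only gene 0 differs between the two hypothetical samples.

```python
import numpy as np

def normalize_total(sample):
    """Scale one sample so its values sum to 1, using only that sample."""
    sample = np.asarray(sample, dtype=float)
    return sample / sample.sum()

# Hypothetical expression values for four genes (not real data).
# Only gene 0 is elevated in the disease sample.
normal = np.array([10.0, 20.0, 30.0, 40.0])
disease = np.array([110.0, 20.0, 30.0, 40.0])

normal_n = normalize_total(normal)
disease_n = normalize_total(disease)

# Genes 1-3 are identical in the raw data, but after normalization
# they look depressed in the disease sample.
ratio = disease_n[1:] / normal_n[1:]
print(ratio)
```

The genes that never changed come out looking down-regulated purely because of the normalization, which is exactly the spurious biomarker signal to worry about.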

So, in some sense we want to live in the projective space associated with gene expression, where the relevant data are the gene expression values up to an overall multiplicative constant. We’re picking charts on this projective space with all these different normalizations, but what we should do is sample classification in a coordinate-invariant manner. So how do we do this? Take any fixed gene, something with a median value for the normal samples for example (this is just bowing to experimental reality, and has no theoretical justification), and normalize all samples $S_a$ so that this gene is 1. Now $d(S_1,S_2) \equiv \sum_i ((S_{1i}-S_{2i})/\sigma_i)^2$, where $S_{ai}$ is the normalized expression of gene $g_i$ in sample $S_a$ and $\sigma_i$ is the uncertainty in $g_i$ expression, is a measure of the distance in projective space. We could also compute the distance on a chart adapted to a sphere by normalizing each $S$ to unit length and then computing $\arccos(S_1\cdot S_2)$, with appropriate $\sigma_i$ in various places of course.
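A rough sketch of both distances, under some assumptions of mine: the function names are hypothetical, `sigma` is taken to be a per-gene uncertainty on the normalized values, and the angular version is left unweighted because the text does not say where the $\sigma_i$ should go.

```python
import numpy as np

def normalize_to_reference(sample, ref_index):
    """Rescale one sample so the chosen reference gene equals 1."""
    sample = np.asarray(sample, dtype=float)
    return sample / sample[ref_index]

def chart_distance(s1, s2, sigma, ref_index):
    """Sum over genes of ((S_1i - S_2i) / sigma_i)^2 in the reference-gene chart."""
    a = normalize_to_reference(s1, ref_index)
    b = normalize_to_reference(s2, ref_index)
    return float(np.sum(((a - b) / np.asarray(sigma, dtype=float)) ** 2))

def angular_distance(s1, s2):
    """Angle between two samples after scaling each to unit length (unweighted)."""
    a = np.asarray(s1, dtype=float)
    b = np.asarray(s2, dtype=float)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding pushing the cosine just outside [-1, 1].
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))

# Hypothetical three-gene samples (not real data); sigma = 1 for every gene.
s1 = np.array([2.0, 4.0, 6.0])
s2 = np.array([2.0, 5.0, 4.0])
sigma = np.ones(3)

# Rescaling a whole sample should not move it in projective space.
print(chart_distance(s1, 10.0 * s1, sigma, ref_index=0))
print(chart_distance(s1, s2, sigma, ref_index=0))
print(angular_distance(s1, s2))
```

Both functions give (numerically) zero for a sample and any positive rescaling of it, which is the property we want from anything calling itself a distance on projective space. Note that `chart_distance` still depends on which reference gene defines the chart.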

With such a measure of distance, the first thing one might do is compute the distribution of distances between all the normal samples. It’s a common view that healthy people are all alike; every unhealthy person is unhealthy in his/her own way (with apologies to Tolstoy). So I would imagine that one would get more insight from seeing how the distribution of disease-to-normal distances differs from the normal-to-normal distribution.
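Here is one way the comparison could be set up. Everything in this sketch is an assumption for illustration: the data are synthetic (a seeded random generator, with each hypothetical disease sample perturbed at its own random set of genes), the helper names are mine, and the reference gene is simply gene 0.

```python
import numpy as np
from itertools import combinations

def chart_distance(s1, s2, sigma, ref_index=0):
    """Sum of squared, sigma-scaled differences after setting the reference gene to 1."""
    a = s1 / s1[ref_index]
    b = s2 / s2[ref_index]
    return float(np.sum(((a - b) / sigma) ** 2))

def within_group_distances(group, sigma):
    """Distances between all unordered pairs of samples in one group."""
    return np.array([chart_distance(a, b, sigma) for a, b in combinations(group, 2)])

def between_group_distances(group_a, group_b, sigma):
    """Distances between every sample in group_a and every sample in group_b."""
    return np.array([chart_distance(a, b, sigma) for a in group_a for b in group_b])

# Synthetic data only: a shared baseline profile, small multiplicative noise,
# and an arbitrary per-array scale factor that the normalization should remove.
rng = np.random.default_rng(0)
n_genes = 50
baseline = rng.uniform(5.0, 15.0, n_genes)

def simulate(n_samples, perturb):
    samples = baseline * rng.lognormal(0.0, 0.05, (n_samples, n_genes))
    if perturb:
        for row in samples:
            # Each "disease" sample is unhealthy in its own way:
            # five random genes (never the reference gene 0) are tripled.
            hit = rng.choice(np.arange(1, n_genes), size=5, replace=False)
            row[hit] *= 3.0
    return samples * rng.uniform(0.5, 2.0, (n_samples, 1))

normal = simulate(20, perturb=False)
disease = simulate(10, perturb=True)
sigma = np.ones(n_genes)

nn = within_group_distances(normal, sigma)
dn = between_group_distances(disease, normal, sigma)
print("normal-to-normal median distance: ", np.median(nn))
print("disease-to-normal median distance:", np.median(dn))
```

On real data one would look at the full histograms of `nn` and `dn` rather than just the medians, and the choice of `sigma` would matter a great deal.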

Up next: PCA on projective space??