PCA on projective space

Projective spaces are spaces of lines in some ambient space. For example, {\mathbf R}P^2 is the space of straight lines through the origin in {\mathbf R}^3. What is this space? Well, it is the space you get when you identify points in {\mathbf R}^3\setminus\{0\} under the multiplicative group x \rightarrow \lambda x, \lambda\neq 0. In this particular case, it is the upper hemisphere of {\mathbf S}^2, with antipodal points on the boundary circle identified. All good stuff which Brock Fuller and Jack Conn taught me a long long time ago.

But now we’re dealing with trying to do something like PCA on projective space, so we want to ask: What is PCA in a linear space and why do we care? We’re trying to find the directions along which some cloud of points is most spread out. In particular, the rotation symmetry in this space is what underlies wanting to diagonalize \langle (X_i - \langle X_i\rangle)(X_j - \langle X_j\rangle)\rangle. So on a projective space, we still have lots of symmetry, and we should be using this symmetry to find something like a great circle along which the cloud of points in projective space is most spread out. The distance measure inherited from the underlying ambient space has the same rotation symmetry (except for a minor caveat about the identification of antipodal points, which affects the group by modding out a discrete Z_2).
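
For concreteness, ordinary PCA in a linear space is just this diagonalization of the covariance matrix; here is a minimal numpy sketch (the function name and interface are my own):

    import numpy as np

    def pca(X):
        """Ordinary PCA: diagonalize the covariance <(X_i - <X_i>)(X_j - <X_j>)>.

        X : (N, d) array of points in a linear space.
        Returns the eigenvalues (descending) and the corresponding directions.
        """
        Xc = X - X.mean(axis=0)            # subtract the mean point
        C = Xc.T @ Xc / (len(X) - 1)       # sample covariance matrix
        w, V = np.linalg.eigh(C)           # eigendecomposition of a symmetric matrix
        order = np.argsort(w)[::-1]
        return w[order], V[:, order]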

It’s not obvious to me how to do this as some sort of simple matrix manipulation, but as a maximization problem, we want to find a rotation in the ambient space such that the points in the projective space have the largest possible range of \phi\in [0,\pi). (Remember the identification of x and  -x.) Then the other directions have to be determined by rotations that don’t change the inverse image of this half-circle in the ambient space and so on. (Might be better to have some different orientation standard but this is more a matter of notation and calculation.)

Each line is represented by a direction vector. A 2-plane in the original space defines a circle on the projective space. A 2-plane could be defined by two data points in the original space that maximize the projective distance between pairs of data points. Obviously, picking two specific data points is too susceptible to outliers, so we want something like two orthonormal vectors n_1, n_2 such that the mean distance of the data points from the plane defined by these vectors (which is the angle between a data point and its projection onto the plane) is minimized and the standard deviation of the angles between the projections of the points within this plane is maximized. Just implement the second cost; the relevant object is the projector onto the plane,

P_{jk} = n_{1j}n_{1k} + n_{2j}n_{2k}.
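
Here is a minimal numpy sketch of how one might fit such a plane (my own construction, not something spelled out above): taking the top two eigenvectors of the sum of outer products of the unit direction vectors minimizes the mean squared sine of the angle between the points and the plane, which handles the first cost; the spread of the within-plane angles \phi can then be examined directly.

    import numpy as np

    def projective_pca_plane(X):
        """Fit the 2-plane closest (in angle) to a cloud of lines through the origin.

        X : (N, d) array of direction vectors; x and -x represent the same line.
        Returns orthonormal n1, n2 spanning the plane and the angles phi in [0, pi)
        of the projected points within that plane.
        """
        # Normalize; the sign of each row is irrelevant on projective space.
        U = X / np.linalg.norm(X, axis=1, keepdims=True)

        # Sum of outer products -- note there is no mean subtraction, since the
        # "mean" of a set of lines is not defined the way it is in a linear space.
        M = U.T @ U

        # The rank-2 projector P_{jk} = n_{1j}n_{1k} + n_{2j}n_{2k} maximizing
        # sum_i |P u_i|^2 (equivalently, minimizing the mean squared sine of the
        # angle to the plane) is spanned by the top two eigenvectors of M.
        w, V = np.linalg.eigh(M)
        n1, n2 = V[:, -1], V[:, -2]

        # Angle of each projected point within the plane, mod pi
        # (antipodal identification).
        phi = np.arctan2(U @ n2, U @ n1) % np.pi
        return n1, n2, phi

    # Toy usage: lines in RP^2 clustered near a great circle.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:, 2] *= 0.1                        # squash the third direction
    n1, n2, phi = projective_pca_plane(X)
    print(n1, n2, phi[:5])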

Of course, there are other sets of charts on the projective space, but the metric distances are not so trivially calculated. It is interesting that this projective PCA would reduce the dynamic range. Is this good? There are instances in image processing where I imagine that the important thing is not the absolute intensity but the relative intensity. I wonder if one would get different results from the Pearson correlation. The Pearson correlation subtracts the mean and divides by the standard deviation before computing the cosine of the angle between the two vectors. The projective distance is scale-independent but does take the mean of the two vectors into account. PCA can be generalized in many ways. There are nonlinear manifold PCA methods, neural-network PCA methods, and so on.

http://pca.narod.ru/contentsgkwz.htm

 


Normalizing gene expression values for classification

What’s the best way to normalize an array of gene expression values? First question: Best for what?

This depends on what you want to do with the array. The requirements for sample classification seem to me to be a little different from those for, say, class discovery. What we want for sample classification is to normalize each array in a self-referential manner without taking into account the rest of the samples, because we want to be able to do the exact same normalization on any new sample that we might want to classify, given what we’ve learned from the training set. So suppose we normalize by total mRNA. Now all samples will be points on some high-dimensional hypersurface, but if some gene had high expression values in disease samples, the other genes in disease samples will now have depressed values because of the overall normalization. That doesn’t mean that those genes are biomarkers for that disease. On the other hand, we could arbitrarily declare that some genes are `housekeeping’ genes and normalize one (or several) to a fixed expression level, with rescalings of all the rest. This avoids the problem with the total mRNA normalization, but exactly how do you decide that a gene is a housekeeping gene without taking into account the entire set of samples? And if you do pick the entire set of samples to decide what is a housekeeping gene (one that shows no correlation with the known classification, for example), then you are not normalizing in a sample-specific manner.

So, in some sense we want to live in the projective space associated with gene expression, and the relevant data are the gene expression values up to a constant. We’re picking charts on this projective space with all these different normalizations, but what we should do is sample classification in a coordinate-invariant manner. So how do we do this? Take any fixed gene, something with a median value for the normal samples for example (this is just bowing to experimental reality, and has no theoretical justification), and normalize all samples S_i so that this gene is 1. Now d(S_1,S_2) \equiv \sum_i ((S_{1i}-S_{2i})/\sigma_i)^2, with \sigma_i the uncertainty in the expression of gene i, is a measure of the distance in projective space. We could also compute the distance on a chart adapted to a sphere by normalizing each S and then computing \arccos S_1\cdot S_2, with appropriate \sigma_i in various places of course.
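
A minimal numpy sketch of these two distances (the choice of reference gene and the exact placement of the \sigma_i are assumptions on my part):

    import numpy as np

    def projective_distance(s1, s2, sigma, ref=0):
        """Distance between two expression profiles, up to overall scale.

        s1, s2 : expression vectors for two samples
        sigma  : per-gene uncertainties (on the scale of the normalized values)
        ref    : index of the reference gene set to 1 in each sample -- e.g. a
                 gene with a median value in the normal samples.
        """
        a = s1 / s1[ref]
        b = s2 / s2[ref]
        return np.sum(((a - b) / sigma) ** 2)

    def spherical_distance(s1, s2):
        """Chart adapted to the sphere: normalize each sample and take the angle."""
        a = s1 / np.linalg.norm(s1)
        b = s2 / np.linalg.norm(s2)
        return np.arccos(np.clip(a @ b, -1.0, 1.0))

    # Toy usage: s2 is roughly s1 scaled by 2, so both distances are small.
    s1 = np.array([1.0, 2.0, 0.5, 4.0])
    s2 = np.array([2.0, 4.1, 0.9, 7.9])
    sigma = np.full(4, 0.1)
    print(projective_distance(s1, s2, sigma), spherical_distance(s1, s2))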

With such a measure of distance, the first thing one might do is compute the distribution of distances between all the normal samples. It’s a common view that healthy people are all alike and every unhealthy person is unhealthy in his/her own way (with apologies to Tolstoy). So I would imagine that one would get more insight from seeing how the distribution of disease-to-normal distances differs from the normal-to-normal distribution.
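
As a toy illustration of that comparison (entirely synthetic numbers, just to show the bookkeeping):

    import numpy as np
    from itertools import combinations

    def angle(s1, s2):
        a, b = s1 / np.linalg.norm(s1), s2 / np.linalg.norm(s2)
        return np.arccos(np.clip(a @ b, -1.0, 1.0))

    rng = np.random.default_rng(1)
    normals  = rng.lognormal(mean=0.0, sigma=0.3, size=(30, 500))   # toy "normal" profiles
    diseased = rng.lognormal(mean=0.0, sigma=0.8, size=(30, 500))   # toy, more heterogeneous

    normal_to_normal  = [angle(x, y) for x, y in combinations(normals, 2)]
    disease_to_normal = [angle(d, n) for d in diseased for n in normals]

    # Compare the two distributions of distances (means here; histograms in practice).
    print(np.mean(normal_to_normal), np.mean(disease_to_normal))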

Up next: PCA on projective space??

Stochastic grammars and grammatical evolution

I’ve been wondering how to use grammatical evolution to generate signaling networks. So first we have to think up some sort of grammar for signaling networks. What would be appropriate start symbols? Productions? Terminals?

Start: Gene

Transcription: Gene > Gene + RNA (constitutive expression) | Gene*TF | Gene*Inhibitor

Transcription: Gene*TF > Gene + RNA | Gene*TF[*Cofactor]^n | Gene*TF*Inhibitor

Transcription: Gene*TF*Cofactor > Gene + RNA

Signaling: Receptor > Receptor*Signal | Receptor*Blocker

Degradation: Any > Nothing

and so on
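
A toy sketch of how a subset of these rules might be encoded and driven by a genome, grammatical-evolution style (the symbol names, the codon-to-production mapping, and the stopping rule are all my own illustrative choices):

    import random

    # A subset of the productions above; each codon of the genome picks one
    # alternative for the current symbol.
    GRAMMAR = {
        "Gene":             ["Gene + RNA", "Gene*TF", "Gene*Inhibitor"],
        "Gene*TF":          ["Gene + RNA", "Gene*TF*Cofactor", "Gene*TF*Inhibitor"],
        "Gene*TF*Cofactor": ["Gene + RNA"],
        "Receptor":         ["Receptor*Signal", "Receptor*Blocker"],
    }

    def derive(start, genome):
        """Return the chain of productions selected by the genome, starting from `start`."""
        symbol, history = start, []
        for codon in genome:
            alternatives = GRAMMAR.get(symbol)
            if alternatives is None:              # no rule for this symbol: derivation ends
                break
            chosen = alternatives[codon % len(alternatives)]
            history.append(f"{symbol} -> {chosen}")
            symbol = chosen                       # keep expanding while the result has a rule
        return history

    genome = [random.randrange(256) for _ in range(10)]
    print(*derive("Gene", genome), sep="\n")

Mutating the genome here just changes which alternative gets picked at each step, which is one concrete way to pose the question below about whether genetic operators on the production string produce biologically plausible network changes.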

People have done this sort of thing before, obviously, but I’m wondering about how applying genetic mutation operators to a string of such productions will lead to the same sort of changes to gene networks that are actually observed. Not obvious to me …

What happens if you use a stochastic grammar? What’s the difference between a stochastic grammar applied many times to a fixed genome vs a deterministic grammar applied to a population of genomes? In biology, the binding of TFs may actually be stochastic, so perhaps we should encode the probability of a symbol in the genome going to a particular production in the genome itself.
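
A minimal sketch of that last idea, with the production probabilities imagined to be read off the genome next to the rule itself (the encoding is purely an assumption):

    import random

    def stochastic_choice(alternatives, weights, rng=random):
        """Pick one production with probability proportional to genome-encoded weights."""
        return rng.choices(alternatives, weights=weights, k=1)[0]

    # e.g. a Gene transcribed 70% of the time, bound by a TF 20% of the time,
    # and bound by an inhibitor 10% of the time
    print(stochastic_choice(["Gene + RNA", "Gene*TF", "Gene*Inhibitor"], [0.7, 0.2, 0.1]))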

Robustness vs adaptability

So much of science depends on how well you can communicate. I’m now on the least favorite part of any research project from my perspective: The communication part.

Here’s the question that motivated this particular bit: Natural selection acts on the phenotype, i.e. on the adaptability and competence of a particular individual. The search for improvement works by mutating the genotype, and unfortunately(?), even if there is a beneficial mutation in the germline, the individual whose progeny will carry that mutation does not benefit at all in its lifetime. Now: If mutations alter the phenotype to be inherited, then the genotype that leads to the selected parent phenotype will not really play much of a role since mutations will provide the progeny with a phenotype that may not be as good. So there has to be some sort of interplay between mutation resistance and adaptability, but in principle a dynamical system that is adaptable may not be stable to parameter perturbations (due to mutations) and vice versa.

So I suggested a little thought experiment to my postdoc (Zeina Shreif): Imagine two parts of a network, one which receives some environmental input and another that gets some signal from this part and produces some output. If the system is adaptable, we expect that it should be able to maintain the same output for a change in input. However, how exactly can the output part of the network distinguish between a change in the input and a change in some parameter in the input part of the network? So I conjectured that these two properties must in fact be correlated: statistically, homeostatic networks will be robust with respect to parameter fluctuations.
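
A toy illustration of the conjecture (my own minimal example, not the random networks Zeina actually studied): an integral-feedback module returns its output to the set point whether the input or a parameter is perturbed, so the downstream part cannot tell the two apart.

    import numpy as np
    from scipy.integrate import solve_ivp

    def adapting_module(t, state, u, k, gamma, x_set):
        """Toy integral feedback: x is the output, z integrates the deviation from x_set."""
        x, z = state
        dx = u - k * z - gamma * x     # output part, driven by input u and feedback z
        dz = x - x_set                 # integrator accumulates the error
        return [dx, dz]

    def steady_output(u, k, gamma, x_set=1.0):
        sol = solve_ivp(adapting_module, [0, 200], [0.0, 0.0],
                        args=(u, k, gamma, x_set), rtol=1e-8)
        return sol.y[0, -1]

    # The steady-state output returns to the set point whether we change the
    # input u or the parameters k, gamma.
    print(steady_output(u=1.0, k=1.0, gamma=0.5))
    print(steady_output(u=3.0, k=1.0, gamma=0.5))   # input change
    print(steady_output(u=1.0, k=2.5, gamma=0.1))   # parameter change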

It turns out that this is true. Heroic work by Zeina has demonstrated this all the way up to 50-node random networks! So this is an exciting (to me) result, and it is worth trying to write it up in a way that people will understand. The problem is that there are so many exciting ramifications that I am very reluctant to do the careful writing. Must be done 😦

Genotype to phenotype – grammatical evolution?

We’re working on a new modeling framework where we can take evolution into account in developing the models.

  • We want to make models that are `robust’ in several senses (insensitivity to parameters, robustness to data uncertainties, and homeostatic adaptability, among others).
  • We want to be able to take data from different organisms and use all the data to constrain models, but the data come from distinct models with only evolution connecting them.
  • We want to restrict the model search space by considering only models that could have come from a genotype to phenotype mapping.

There’s loads of work that people have done on such maps, and today I’ve been learning about grammatical evolution, which is a new approach to genetic programming. The idea is that there is a fixed grammar, and the genome encodes the sequence of productions, starting from the start symbol, that leads to the actual code, which ends up being compilable if this is done right. Standard genetic programming works directly on the parse trees and, in some variants, doesn’t always lead to working end programs.

My postdoc, Junghyo Jo, and I have been thinking of a genotype – phenotype mapping as well, but wanting to encode a whole dynamical system in the genotype, parameters and all. We can set that up in a way that is pretty close to `nature’, but I’m still trying to get my head around why grammatical evolution is the correct genotype-phenotype map. Obviously, the GE algorithm generates correct code if the grammar is consistent, but is my genome sequentially encoding the code that is then compiled into the executable that is me? That is probably not the best way to phrase my confusion, but in all honesty I do not see why GE is biologically inspired. Yes, genes encode proteins, but transcribing a gene into an executable protein as a grammatical production is not quite what happens. The mRNA doesn’t get to the ribosome and start getting translated with the amino acid being added at each point caring about the amino acids that have previously been added. (There are control mechanisms such as secondary structure of the mRNA etc., but let’s keep it simple.) I think what people have in mind is that the executable is the working folded protein analog rather than a string of residues that needs to be folded etc. In that case the setup would make some sort of sense – a linear structure being mapped to a complicated active executable, with the compiler as some sort of ribosome – but I still feel that each succeeding base should not depend on what the preceding base did to the derivation (thus far) of the start symbol.

So what do we expect? I’m thinking this genotype-phenotype mapping is not a one-time thing. There should be many different go-to type entry points in the genotype, and the compiled code should execute something that activates some of these go-to points. Thus, there should be several start symbols, and several go-to points. The compiled code should execute and produce a new set of start symbols that then activate their associated go-to points. That’s a more amusing picture but I’m pretty sure that isn’t enough.

Inference and prediction: Dimon 0 Volcker 1

Jamie Dimon finally gets his. Chase posted a $2 billion loss because of a trade that they didn’t do enough risk management on. Too funny.

From DealBook:

>> On Wall Street, few have been more outspoken about the pitfalls of the Volcker Rule than JPMorgan’s chief executive, Jamie Dimon. Mr. Dimon not only attacked the rule, he personally criticized Paul Volcker, the former Federal Reserve chairman and the regulation’s namesake.

“Paul Volcker by his own admission has said he doesn’t understand capital markets,” Mr. Dimon told Fox Business earlier this year. “He has proven that to me.” ….

Even Mr. Dimon had to admit Thursday’s disclosure was a setback for JPMorgan and other banks that want more flexibility when the final version of the Volcker Rule is issued. “It plays into the hands of a bunch of pundits but you have to deal with that and that’s life,” Mr. Dimon said Thursday on a conference call with analysts.  …

“Just because we’re stupid doesn’t mean everybody else was,” he said. “There were huge moves in the marketplace but we made these positions more complex and they were badly monitored.”

“This may not have violated the Volcker Rule, but it violates the Dimon Principle.” <<

Apparently, Mr. Dimon doesn’t understand capital markets either.

On a constructive note, I think some of these problems arise due to an inadequate understanding of how to model probability distributions from a finite amount of data and how to automatically learn changing distributions. These issues are exactly the same as the conflict in evolution: You want stability to propagate your genome, but you need variation to allow for the emergence of new traits that can handle changes in the environment. It isn’t easy to come up with dynamical systems that meet both of these desiderata.
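
As a toy illustration of that tension (my own sketch): an exponentially forgetting estimator trades the stability of its estimate against the ability to track a distribution that changes.

    import random

    def forgetting_mean(stream, lam=0.05):
        """Exponentially forgetting estimate of a (possibly drifting) mean.

        Small lam: stable estimate, but slow to notice a regime change.
        Large lam: adapts quickly, but the estimate is noisy -- the same
        stability-vs-variation tension as in evolution.
        """
        m = None
        for x in stream:
            m = x if m is None else (1 - lam) * m + lam * x
            yield m

    # Toy usage: a distribution whose mean jumps halfway through.
    data = [random.gauss(0, 1) for _ in range(500)] + [random.gauss(3, 1) for _ in range(500)]
    estimates = list(forgetting_mean(data, lam=0.05))
    print(estimates[499], estimates[-1])   # before and after adapting to the jump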

Regeneration vs. reproduction

An intern in my group a couple of summers ago became interested in aging when he got to med school. I usually have a fun time with my interns, so sometimes they keep in touch. He called me and wanted to know what was known about aging. Knowing nothing, but always willing to be distracted, I looked into it and discovered all sorts of mathematical models of aging and death. These are usually at a pretty phenomenological level, but it is good to know about Gompertz and the Penna model and so on. In fact, these models don’t take evolution into account. Rephrasing: Why aren’t we all immortal with lots of regenerative capabilities? That’s something that Kurzweil might want to think about, no? So, given that evolution selects on the phenotype and evolves the genotype, I don’t see offhand how the appearance of programmed death in all multicellular species I know about is something that emerges without taking environmental variation and the whole group of individuals into account. In other words, why do we have children instead of continual self-renewal? Why doesn’t our homeostasis include regenerative homeostasis? Avoid all the diapers …

Something I wonder about a lot, especially after my father’s death a couple of weeks ago, is why it is so hard for most people to accept death. My father died a good death, no prolonged suffering, at a ripe old age, surrounded by his family. So why does my mother keep reiterating that she did not expect it to be like this? I couldn’t help asking her: How did she think it would happen? I’m more or less convinced that a large part of religion is simply to stop people from going nuts at the prospect of ceasing to exist by offering them an afterlife, with the bonus that if they conform in this life, the afterlife will feature many comforts (conform to being a jihadi and you get umpteen virgins and so on … which brings up: What happens if you’re a woman jihadi? Inquiring minds want to know …) Getting back to this lamentation of death, why is this horror of death a cultural/social survivor? Recall the easy way to induce opposition to healthcare reform: Death panels!! But actually, this isn’t at all unrelated to the concept of regeneration vs. reproduction. We have a certain tradeoff to make: We keep the present genes intact and cared for, or we produce new sets that will carry on while entropy wins over the present set. People should really be asked: This is a finite planet. Do you want aggressive treatment if you fall sick, or do you want your descendants to have the equivalent resources? Of course, as a society we’ve decided that we care more about our own well-being than about the economic well-being of our descendants, so I suppose we’re evolving into the Kurzweil model of keeping our present genes functioning, thank you very much.