Permutation symmetry and finding a ‘noise model’

If we have a time series of data points \{x_i\}, how can we tell whether these come from some stationary distribution or whether there is some secular process coupled to an uncertainty-inducing step? Many statistical theorems start out by assuming stationarity. Suppose we ask for the probability of a model P(f,\eta|x_i) where x_i = f(t_i) + \eta_i, t_i is the ‘time coordinate’ of the i^{\rm th} data point, and \eta comes from a stationary distribution. How do we disentangle f and \eta? By definition, the f likelihood should be invariant under replacing \eta_i \rightarrow \eta_{\sigma(i)}, where \sigma \in P_N, the permutation group on N data points. In other words, P(f,\eta|D) = P(f,\eta_\sigma|D), but this in turn requires P(D|f,\eta) = P(D|f,\eta_\sigma), which suggests the trivial solution of just defining P(D|f,\langle\eta\rangle) as the product over \sigma. That seems a bit heavy-handed; it is easily achieved, of course, but it would be a computational nightmare. If it could be done, though, I think it would be a pretty strong constraint on what \eta and f could be.

One way to try this would be to do Monte Carlo sampling over P_N when calculating the cost function at each step of the MCMC for finding f and \eta. We really want to evaluate something like

- \log P(x|f,\langle\eta\rangle) \equiv \sum_{\sigma\in P_N} \sum_i (x_i - f(t_i) - \eta_{\sigma(i)})^2,

and I think that for N large enough the sum over \sigma could be approximated by probabilistic sampling over some subset of permutations. The worry here is that \eta will come out all constant, so the preferred noise model will be the trivial distribution and all the variation will be fit by f. The other extreme is that f will be a constant and all the data will be fit by \eta.
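To make this concrete, here is a minimal sketch (in Python, with hypothetical names) of the permutation-averaged cost: f is stood in for by a polynomial with coefficients f_params, \eta by a vector eta of per-point noise values, and the sum over P_N is approximated by n_perms random permutations, as suggested above.

```python
import numpy as np

def permuted_cost(x, t, f_params, eta, n_perms=100, rng=None):
    """Approximate -log P(x | f, <eta>) by Monte Carlo over permutations.

    x, t     : observed values and their time coordinates
    f_params : coefficients of a polynomial stand-in for f (an arbitrary choice)
    eta      : current sample of the noise values, one per data point
    n_perms  : number of random permutations used to approximate the sum over P_N
    """
    rng = np.random.default_rng() if rng is None else rng
    f_t = np.polyval(f_params, t)           # f(t_i) for the current parameters
    total = 0.0
    for _ in range(n_perms):
        sigma = rng.permutation(len(x))     # a random element of P_N
        total += np.sum((x - f_t - eta[sigma]) ** 2)
    return total / n_perms                  # average in place of the full sum over P_N
```

In an MCMC over (f_params, eta), this quantity would replace the usual chi-squared at each step; dividing by n_perms only rescales the cost.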

The quick way might be to use something like singular spectrum analysis (SSA) on each permuted summand and demand that the SSA interpolations all agree: pick \eta, compute the SSA for some sample of permutations, and vary \eta in a standard MCMC to sample over possible \eta. The end result should be a de-noised prediction for the actual functional dependence f.
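A rough sketch of how the SSA-agreement criterion could be implemented; ssa_reconstruct is a bare-bones SSA (trajectory matrix, truncated SVD, diagonal averaging), and the window, rank, and the variance-across-reconstructions disagreement measure are illustrative choices, not anything fixed by the argument above.

```python
import numpy as np

def ssa_reconstruct(y, window, rank):
    """Basic singular spectrum analysis: embed, truncate the SVD, diagonally average.
    Requires window < len(y)."""
    n = len(y)
    k = n - window + 1
    X = np.column_stack([y[i:i + window] for i in range(k)])   # trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (U[:, :rank] * s[:rank]) @ Vt[:rank]                  # rank-truncated matrix
    recon = np.zeros(n)                                        # anti-diagonal averaging
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += Xr[:, j]
        counts[j:j + window] += 1
    return recon / counts

def ssa_disagreement(x, eta, n_perms=20, window=20, rank=3, rng=None):
    """Cost for a proposed eta: how much the SSA reconstructions of the permuted
    residual series x_i - eta_{sigma(i)} disagree with one another."""
    rng = np.random.default_rng() if rng is None else rng
    recons = []
    for _ in range(n_perms):
        sigma = rng.permutation(len(x))
        recons.append(ssa_reconstruct(x - eta[sigma], window, rank))
    recons = np.array(recons)
    return recons.var(axis=0).mean()   # small when all permutations give the same f
```

The MCMC over \eta would then try to minimize this disagreement, and the mean of the reconstructions at the accepted \eta would be the de-noised estimate of f.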


Ray Solomonoff’s universal distribution

Everyone knows about Kolmogorov complexity. Kolmogorov addressed randomness and came up with the Kolmogorov complexity C of a string (defined as the length of the shortest program that can reproduce the string) as a measure of its randomness. Solomonoff, two years before Kolmogorov, looked at the problem from a much deeper perspective, in my opinion. He asked, in drastic paraphrase: How should you weight different programs that could generate a given string? And the two are related because it turns out that the function Solomonoff picked is C. [So why isn’t it called Solomonoff complexity, even though Kolmogorov referenced Solomonoff in his paper? Because Kolmogorov was already famous.]

What I would like to understand is the following: How do you go about generating programs that could reproduce the string? Induction is a hoary field, but most of the work assumes that you already have different programs to decide between, and very little, in proportion, addresses the question of how to generate such programs in the first place. That is actually a problem because the search space is, to put it mildly, large. Is there a fundamental process that is the optimal way to induce programs given data? This must itself be a probabilistic process. In other words, the match between the program output and the desired string must be allowed to be imperfect, so the summation over the search space is not just over programs that reproduce a given string S

\sum_{{\rm programs\ reproducing\ } S} \exp(-{\rm length}({\rm program}))

but rather weighted in some way to balance fitting S and reducing program length. Why does everything involve a balance between energy and entropy? Because there is nothing more fundamental in all of science.

So I think there should be a way to write something like

\sum_{S',{\rm programs\ producing\ } S'} \exp(-{\rm length}({\rm program})) \exp(- {\rm mismatch}(S,S'))

and then the question is: What is the universal form of the mismatch function? An interesting point here is that this process might work even with a noisy S: if the correct string is 0101010101 but you read S = 0111010101, then you’ll find a very short program that reproduces the string S' = 0101010101, and since the mismatch is only one bit, this process would autocorrect your initially misread string.
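As a toy illustration of the weighted sum (not a real universal machine): take the ‘programs’ to be short bit patterns repeated out to length |S|, weight each by \exp(-\alpha\,{\rm length})\exp(-\beta\,{\rm mismatch}) with a Hamming-distance mismatch, and accumulate the weights on the outputs they produce. The pattern space, \alpha, and \beta are placeholders.

```python
import math
from itertools import product

def hamming(a, b):
    """Bitwise mismatch between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def weighted_outputs(S, max_pattern_len=4, alpha=1.0, beta=1.0):
    """Weight every toy 'program' (a short pattern repeated to length |S|) by
    exp(-alpha*length) * exp(-beta*mismatch(S, S')), summed over its output S'."""
    weights = {}
    for L in range(1, max_pattern_len + 1):
        for bits in product("01", repeat=L):
            pattern = "".join(bits)
            S_prime = (pattern * (len(S) // L + 1))[:len(S)]   # the program's output
            w = math.exp(-alpha * L) * math.exp(-beta * hamming(S, S_prime))
            weights[S_prime] = weights.get(S_prime, 0.0) + w
    return weights

# The misread string from the example above
S = "0111010101"
weights = weighted_outputs(S)
print(max(weights, key=weights.get))   # -> "0101010101"
```

On this example the short pattern "01" wins even though its output differs from S by one bit, which is exactly the autocorrection effect described above.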

We want symbols and rules for manipulation such that the resulting stream contains the known sequence of symbols. In other words, an inverse Gödel problem: given a stream of symbols, find a set of rules so that this stream is a proof.

Normalizing gene expression values for classification

What’s the best way to normalize an array of gene expression values? First question: Best for what?

This depends on what you want to do with the array. The requirements for sample classification seem to me to be a little different from those for, say, class discovery. What we want for sample classification is to normalize each array in a self-referential manner, without taking into account the rest of the samples, because we want to be able to do the exact same normalization on any new sample that we might want to classify, given what we’ve learned from the training set. So suppose we normalize by total mRNA. Now all samples will be points on some high-dimensional hypersurface, but if some gene had high expression values in the disease samples, the other genes in the disease samples will now have depressed values because of the overall normalization. That doesn’t mean that those genes are biomarkers for that disease. On the other hand, we could arbitrarily declare that some genes are ‘housekeeping’ genes and normalize one (or several) of them to a fixed expression level, with rescalings of all the rest. This avoids the problem of the total-mRNA normalization, but exactly how do you decide that a gene is a housekeeping gene without taking into account the entire set of samples? And if you do use the entire set of samples to decide what counts as a housekeeping gene (one that shows no correlation with the known classification, for example), then you are not normalizing in a sample-specific manner.

So, in some sense, we want to live in the projective space associated with gene expression: the relevant data are the gene expression values up to an overall constant. We’re picking charts on this projective space with all these different normalizations, but what we should do is sample classification in a coordinate-invariant manner. So how do we do this? Take any fixed gene, say one with a median expression value over the normal samples (this is just bowing to experimental reality, and has no theoretical justification), and normalize all samples S_i so that this gene is 1. Now d(S_1,S_2) \equiv \sum_i ((S_{1i}-S_{2i})/\sigma_i)^2, with \sigma_i the uncertainty in the expression of gene i, is a measure of the distance in projective space. We could also compute the distance on a chart adapted to a sphere by normalizing each S to unit length and then computing \arccos(S_1\cdot S_2), with appropriate factors of \sigma_i in various places, of course.
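A minimal sketch of the two distances, assuming the samples are NumPy arrays of expression values and sigma holds the per-gene uncertainties; the choice of reference gene and the omission of the \sigma_i weighting in the spherical version are simplifications of the discussion above.

```python
import numpy as np

def projective_distance(s1, s2, sigma, ref_gene=0):
    """Chi-squared-style distance after fixing a chart: rescale each sample so
    that a chosen reference gene equals 1, then compare component-wise."""
    s1n = s1 / s1[ref_gene]
    s2n = s2 / s2[ref_gene]
    return np.sum(((s1n - s2n) / sigma) ** 2)

def sphere_distance(s1, s2):
    """Chart adapted to a sphere: angle between the unit-normalized expression
    vectors (the sigma_i weighting is omitted here for simplicity)."""
    u1 = s1 / np.linalg.norm(s1)
    u2 = s2 / np.linalg.norm(s2)
    return np.arccos(np.clip(u1 @ u2, -1.0, 1.0))
```

Either function feeds directly into the comparison of distance distributions suggested next.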

With such a measure of distance, the first thing one might do is compute the distribution of distances between all the normal samples. It’s a common view that healthy people are all alike, while every unhealthy person is unhealthy in his/her own way (with apologies to Tolstoy). So I would imagine that one would get more insight by seeing how the distribution of disease-to-normal distances differs from the normal-to-normal distribution.

Up next: PCA on projective space??

Stochastic grammars and grammatical evolution

I’ve been wondering how to use grammatical evolution to generate signaling networks. So first we have to think up some sort of grammar for signaling networks. What would be appropriate start symbols? Productions? Terminals?

Start: Gene

Transcription: Gene > Gene + RNA (constitutive expression) | Gene*TF | Gene*Inhibitor

Transcription: Gene*TF > Gene + RNA | Gene*TF[*Cofactor]^n | Gene*TF*Inhibitor

Transcription: Gene*TF*Cofactor > Gene + RNA

Signaling: Receptor > Receptor*Signal | Receptor*Blocker

Degradation: Any > Nothing

and so on

People have done this sort of thing before, obviously, but I’m wondering whether applying genetic mutation operators to a string of such productions will lead to the same sort of changes to gene networks that are actually observed. Not obvious to me …

What happens if you use a stochastic grammar? What’s the difference between a stochastic grammar applied many times to a fixed genome and a deterministic grammar applied to a population of genomes? In biology, the binding of TFs may actually be stochastic, so perhaps we should encode, in the genome itself, the probability of a symbol going to a particular production.
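A toy sketch of what a stochastic version of such a grammar might look like; the rules, probabilities, and depth cutoff are illustrative placeholders, and in the scheme above the probabilities would themselves be read off the genome.

```python
import random

# Toy stochastic grammar for the productions sketched above.  Symbols without
# an entry here (RNA, Gene*Inhibitor, ...) are treated as terminals.
RULES = {
    "Gene":             [(0.5, ["Gene", "RNA"]),        # constitutive expression
                         (0.3, ["Gene*TF"]),
                         (0.2, ["Gene*Inhibitor"])],
    "Gene*TF":          [(0.6, ["Gene", "RNA"]),
                         (0.2, ["Gene*TF*Cofactor"]),
                         (0.2, ["Gene*TF*Inhibitor"])],
    "Gene*TF*Cofactor": [(1.0, ["Gene", "RNA"])],
    "Receptor":         [(0.7, ["Receptor*Signal"]),
                         (0.3, ["Receptor*Blocker"])],
}

def expand(symbol, depth=5, rng=random):
    """Stochastically expand a symbol, returning the terminals it produces."""
    if depth == 0 or symbol not in RULES:
        return [symbol]
    r, acc = rng.random(), 0.0
    for p, rhs in RULES[symbol]:
        acc += p
        if r <= acc:
            return [t for s in rhs for t in expand(s, depth - 1, rng)]
    return [symbol]

print(expand("Gene"))   # different runs give different derivations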

Robustness vs adaptability

So much of science depends on how well you can communicate. I’m now at my least favorite part of any research project: the communication part.

Here’s the question that motivated this particular bit: natural selection acts on the phenotype, i.e. on the adaptability and competence of a particular individual. The search for improvement works by mutating the genotype, and, unfortunately(?), even if there is a beneficial mutation in the germline, the individual whose progeny will carry that mutation does not benefit at all in its own lifetime. Now: if mutations alter the phenotype that is inherited, then the genotype that leads to the selected parent phenotype will not play much of a role, since mutations will provide the progeny with a phenotype that may not be as good. So there has to be some sort of interplay between mutation resistance and adaptability, but in principle a dynamical system that is adaptable may not be stable to parameter perturbations (due to mutations), and vice versa.

So I suggested a little thought experiment to my postdoc (Zeina Shreif): imagine two parts of a network, one that receives some environmental input and another that gets a signal from the first part and produces some output. If the system is adaptable, we expect it to maintain the same output under a change in input. But how exactly can the output part of the network distinguish between a change in the input and a change in some parameter of the input part of the network? So I conjectured that these two properties must in fact be correlated: that, statistically, homeostatic networks will be robust with respect to parameter fluctuations.
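One way to quantify the two properties for a single network, sketched here on a textbook perfectly-adapting motif (not the random networks studied in the work described below): measure the steady-state output shift under a change of input (adaptability) and under a perturbation of a parameter of the input-facing part (parameter robustness), and then ask, over an ensemble of networks, whether the two shifts are correlated. The motif, parameter values, and perturbation sizes here are arbitrary.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sniffer(t, z, S, k1, k2, k3, k4):
    """A standard perfectly-adapting motif: X integrates the input S, R is the output."""
    R, X = z
    return [k1 * S - k2 * X * R, k3 * S - k4 * X]

def steady_output(S=1.0, k1=1.0, k2=1.0, k3=1.0, k4=1.0):
    sol = solve_ivp(sniffer, (0, 200), [1.0, 1.0], args=(S, k1, k2, k3, k4))
    return sol.y[0, -1]   # output R at long times

base = steady_output()
print("output shift from doubling the input:    ", steady_output(S=2.0) - base)
print("output shift from perturbing a parameter:", steady_output(k3=1.1) - base)
```

The conjecture above is a statistical statement about ensembles of homeostatic networks, so any single motif like this one only illustrates how the two quantities would be measured, not the correlation itself.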

It turns out that this is true. Heroic work by Zeina has demonstrated this all the way up to 50-node random networks! So this is an exciting (to me) result, and it is worth trying to write it up in a way that people will understand. The problem is that there are so many exciting ramifications that I am very reluctant to do the careful writing. Must be done 😦

Inference and prediction: Dimon 0 Volcker 1

Jamie Dimon finally gets his. JPMorgan Chase posted a $2 billion loss because of a trade they didn’t do enough risk management on. Too funny.

From DealBook:

>> On Wall Street, few have been more outspoken about the pitfalls of the Volcker Rule than JPMorgan’s chief executive, Jamie Dimon. Mr. Dimon not only attacked the rule, he personally criticized Paul Volcker, the former Federal Reserve chairman and the regulation’s namesake.

“Paul Volcker by his own admission has said he doesn’t understand capital markets,” Mr. Dimon told Fox Business earlier this year. “He has proven that to me.” ….

Even Mr. Dimon had to admit Thursday’s disclosure was a setback for JPMorgan and other banks that want more flexibility when the final version of the Volcker Rule is issued. “It plays into the hands of a bunch of pundits but you have to deal with that and that’s life,” Mr. Dimon said Thursday on a conference call with analysts.  …

“Just because we’re stupid doesn’t mean everybody else was,” he said. “There were huge moves in the marketplace but we made these positions more complex and they were badly monitored.”

“This may not have violated the Volcker Rule, but it violates the Dimon Principle.” <<

Apparently, Mr. Dimon doesn’t understand capital markets either.

On a constructive note, I think some of these problems arise due to an inadequate understanding of how to model probability distributions from a finite amount of data and how to automatically learn changing distributions. These issues are exactly the same as the conflict in evolution: You want stability to propagate your genome, but you need variation to allow for the emergence of new traits that can handle changes in the environment. It isn’t easy to come up with dynamical systems that meet both of these desiderata.