Ray Solomonoff’s universal distribution

Everyone knows about Kolmogorov complexity. Kolmogorov addressed randomness and came up with the Kolmogorov complexity C of a string (defined as the length of the shortest program that can reproduce the string) as a measure of its randomness. Solomonoff, two years before Kolmogorov, looked at the problem from a much deeper perspective, in my opinion. He asked, in drastic paraphrase: How should you weight different programs that could generate a given string? And the two are related because it turns out that the function Solomonoff picked is C. [So why isn’t it called Solomonoff complexity, even though Kolmogorov referenced Solomonoff in his paper? Because Kolmogorov was already famous.]

What I would like to understand is the following: How do you go about generating programs that could reproduce the string? Induction is a hoary field, but most of the work assumes that you already have different programs to decide between, and very little, in proportion, addresses the question of how to go about generating such programs. That is actually a problem because the search space is, to put it mildly, large. Is there a fundamental process that is the optimal manner to induce programs given data? This must itself be a probabilistic process. In other words, the match between a program’s output and the desired string must be allowed to be imperfect, so the search space summation is not just over programs that exactly reproduce a given string S

\sum_{{\rm programs\ reproducing\ } S} \exp(-{\rm length}({\rm program}))

but rather weighted in some way to balance fitting S and reducing program length. Why does everything involve a balance between energy and entropy? Because there is nothing more fundamental in all of science.
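Just to make the first sum concrete, here is a brute-force toy version in Python. The ‘programs’ here are entirely my own placeholder construction, nothing canonical: a program is ‘repeat pattern p, truncated to the target length’, and length(program) is just len(p).

```python
import math
from itertools import product

def programs_reproducing(S, max_pat=10):
    """Enumerate toy programs of the form 'repeat pattern p, truncated to
    len(S)' whose output equals S exactly. length(program) is taken to be
    len(p), a crude stand-in for a real description length."""
    for k in range(1, max_pat + 1):
        for bits in product("01", repeat=k):
            p = "".join(bits)
            out = (p * (len(S) // k + 1))[:len(S)]
            if out == S:
                yield p

def universal_weight(S):
    """sum over programs reproducing S of exp(-length(program))."""
    return sum(math.exp(-len(p)) for p in programs_reproducing(S))

# The highly compressible string collects far more weight than the one
# with a flipped bit, which only its own verbatim copy reproduces.
print(universal_weight("0101010101"))
print(universal_weight("0111010101"))
```

Even in this tiny language, the shortest program ("01", length 2) dominates the sum for the alternating string, which is the sense in which the sum is controlled by C.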

So I think there should be a way to write something like

\sum_{S',{\rm programs\ producing\ } S'} \exp(-{\rm length}({\rm program})) \exp(- {\rm mismatch}(S,S'))

and then the question is: What is the universal form of the mismatch function? An interesting point here is that this process might work even with noisy S, since if the correct string is 0101010101 but you read S = 0111010101, then you’ll find a very short program that can reproduce the string S' = 0101010101, and since the mismatch is only one bit this process would autocorrect your initially misread string.
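The autocorrection effect can be brute-forced in a toy setting. Here the ‘repeat pattern p’ program language and the Hamming-distance mismatch are my placeholder choices, not a claim about the universal form of either:

```python
import math
from itertools import product

def hamming(a, b):
    """Bitwise mismatch between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def posterior_over_outputs(S, max_pat=6):
    """Weight each candidate output S' by exp(-length) * exp(-mismatch(S, S')),
    summed over toy 'repeat pattern p' programs producing S'."""
    n = len(S)
    scores = {}
    for k in range(1, max_pat + 1):
        for bits in product("01", repeat=k):
            p = "".join(bits)
            Sp = (p * (n // k + 1))[:n]
            scores[Sp] = scores.get(Sp, 0.0) + math.exp(-k) * math.exp(-hamming(S, Sp))
    return scores

S = "0111010101"                      # one bit misread from 0101010101
scores = posterior_over_outputs(S)
best = max(scores, key=scores.get)
print(best)                           # the autocorrected string, 0101010101
```

The one-bit mismatch penalty exp(-1) is cheaper than the extra program length needed to reproduce the noisy string exactly, so the cleaned-up alternating string wins.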

We want symbols and rules for manipulation such that the resulting stream contains the known sequence of symbols. In other words, an inverse Gödel problem: given a stream of symbols, find a set of rules such that this stream is a proof.


Stochastic grammars and grammatical evolution

I’ve been wondering how to use grammatical evolution to generate signaling networks. So first we have to think up some sort of grammar for signaling networks. What would be appropriate start symbols? Productions? Terminals?

Start: Gene

Transcription: Gene > Gene + RNA (constitutive expression) | Gene*TF | Gene*Inhibitor

Transcription: Gene*TF > Gene + RNA | Gene*TF[*Cofactor]^n | Gene*TF*Inhibitor

Transcription: Gene*TF*Cofactor > Gene + RNA

Signaling: Receptor > Receptor*Signal | Receptor*Blocker

Degradation: Any > Nothing

and so on

People have done this sort of thing before, obviously, but I’m wondering about how applying genetic mutation operators to a string of such productions will lead to the same sort of changes to gene networks that are actually observed. Not obvious to me …

What happens if you use a stochastic grammar? What’s the difference between a stochastic grammar applied many times to a fixed genome vs a deterministic grammar applied to a population of genomes? In biology, the binding of TFs may actually be stochastic, so perhaps we should encode, in the genome itself, the probability of a symbol going to a particular production.
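One minimal way to encode those production probabilities in the genome itself, sketched in Python; the weight-vector representation is just one assumption about how the encoding could work:

```python
import random

# Productions for one nonterminal, as in the grammar above.
GENE_RULES = ["Gene + RNA", "Gene*TF", "Gene*Inhibitor"]

def pick_production(rules, genome_weights, rng):
    """Sample a production with probabilities carried by the genome.
    genome_weights: nonnegative numbers, one per production. Mutations
    acting on these change the *statistics* of expression rather than
    the rule set itself."""
    total = sum(genome_weights)
    probs = [w / total for w in genome_weights]
    return rng.choices(rules, weights=probs, k=1)[0]

rng = random.Random(1)
genome = [8.0, 1.0, 1.0]   # mostly constitutive expression
draws = [pick_production(GENE_RULES, genome, rng) for _ in range(1000)]
print(draws.count("Gene + RNA") / 1000)   # near 0.8
```

With this encoding, a stochastic grammar applied many times to a fixed genome samples the weights repeatedly, while a deterministic grammar applied to a population puts the variability in the weights themselves, which is one way to make the question above concrete.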

Genotype to phenotype – grammatical evolution?

We’re working on a new modeling framework where we can take evolution into account in developing the models.

  • We want to make models that are `robust’ in several senses (insensitivity to parameters, robustness to data uncertainties, and homeostatic adaptability, among others).
  • We want to be able to take data from different organisms and use all the data to constrain models, but the data come from distinct models with only evolution connecting them.
  • We want to restrict the model search space by considering only models that could have come from a genotype to phenotype mapping.

There’s loads of work that people have done on such maps, and today I’ve been learning about grammatical evolution, which is a new approach to genetic programming. The idea is that there is a fixed grammar and the genome encodes the sequence of production choices that derives the actual code from the start symbol, which ends up being compilable if this is done right. Standard genetic programming works directly on the parse trees and, in some variants, doesn’t always lead to working end programs.
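The core GE mapping rule, each codon selecting a production via codon mod (number of productions for the current nonterminal), with wrapping when the genome runs out, can be sketched as follows; the toy expression grammar is my own choice:

```python
# Grammatical evolution genotype-to-phenotype map: repeatedly rewrite the
# leftmost nonterminal, with the genome's integer codons picking productions.
GRAMMAR = {
    "<expr>": ["<expr> + <expr>", "<expr> * <expr>", "x", "1"],
}

def ge_map(genome, start="<expr>", max_wraps=3):
    out = start
    i, wraps = 0, 0
    while "<" in out and wraps < max_wraps:
        if i >= len(genome):           # 'wrapping': reread genome from the start
            i, wraps = 0, wraps + 1
        nt_start = out.index("<")      # leftmost nonterminal
        nt_end = out.index(">", nt_start) + 1
        nt = out[nt_start:nt_end]
        rules = GRAMMAR[nt]
        choice = genome[i] % len(rules)
        out = out[:nt_start] + rules[choice] + out[nt_end:]
        i += 1
    return out   # may still hold nonterminals if wraps ran out (invalid individual)

print(ge_map([0, 2, 3]))   # derives "x + 1"
```

Note that this is exactly the feature questioned below: the effect of codon i depends on which nonterminal the first i-1 codons left as leftmost, so each succeeding codon very much cares what the preceding ones did to the derivation.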

My postdoc, Junghyo Jo, and I have been thinking of a genotype – phenotype mapping as well, but wanting to encode a whole dynamical system in the genotype, parameters and all. That we can set up in a way that is pretty close to `nature’, but I’m still trying to get my head around why grammatical evolution is the correct genotype-phenotype map. Obviously, the GE algorithm generates correct code if the grammar is consistent, but is my genome sequentially encoding the code that is then compiled into the executable that is me? Probably not the best way to phrase my confusion, but in all honesty I do not see why GE is biologically inspired. Yes, genes encode for proteins, but transcribing a gene into an executable protein as a grammatical production is not quite what happens. The mRNA doesn’t get to the ribosome and start getting translated with the amino acid added at each point depending on the amino acids that have previously been added. (There are control mechanisms such as secondary structure of the mRNA, etc., but let’s keep it simple.) I think what people have in mind is that the executable is the analog of the working folded protein, rather than of a string of residues that still needs to be folded. In that case the setup would make some sort of sense – a linear structure being mapped to a complicated active executable, with the compiler as some sort of ribosome – but I still feel that each succeeding base should not depend on what the preceding bases did to the derivation (thus far) of the start symbol.

So what do we expect? I’m thinking this genotype-phenotype mapping is not a one-time thing. There should be many different go-to type entry points in the genotype, and the compiled code should execute something that activates some of these go-to points. Thus, there should be several start symbols, and several go-to points. The compiled code should execute and produce a new set of start symbols that then activate their associated go-to points. That’s a more amusing picture but I’m pretty sure that isn’t enough.
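A deliberately cartoonish sketch of that picture, in Python; the entry-point table and all names are hypothetical, and ‘executing’ a phenotype is reduced to reading off which entry points it switches on:

```python
# Speculative sketch: the genotype is a table of go-to entry points; the
# phenotype produced at each active entry point activates the next set.
GENOME = {
    "S1": ("make A", ["S2"]),         # (phenotype, entry points it activates)
    "S2": ("make B", ["S1", "S3"]),
    "S3": ("make C", []),             # a dead end: activates nothing
}

def develop(start_symbols, steps):
    active = list(start_symbols)
    produced = []
    for _ in range(steps):
        next_active = []
        for s in active:
            phenotype, activates = GENOME[s]
            produced.append(phenotype)      # 'execute' the compiled code...
            next_active.extend(activates)   # ...which switches on new entry points
        active = next_active
        if not active:
            break
    return produced

print(develop(["S1"], 3))
```

Even this caricature already gives recurrent dynamics (S1 and S2 reactivate each other), which is at least closer to a regulatory network than a single one-shot derivation.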