We took over the lease of a Chevy Spark EV because our Prius was dying after 199,000 miles of trouble-free service. And now I can see what the EV craze is all about. This cheapest of EVs can blow every single gas-powered car off the line from a traffic light with no fuss. It makes driving in the city painless, which cannot be said for our 6-speed snoot-mobile. So now I’m convinced that we should replace every car, except for one kept for long-distance driving, with a cheap EV. The gas-equivalent cost is about 1/4 that of our Escape Hybrid, and the smug cloud over my head is therefore about 4 times as big as it used to be. The only surprise was how hard the steering wheel whips to the right if you stomp on the accelerator from a stop. Oops.
If we have a time series of data points, how can we tell if these come from some stationary distribution or if there is some secular process coupled to an uncertainty-inducing step? Many statistical theorems start out by assuming stationarity. Suppose we ask for the probability of a model $x_i = f(t_i) + \epsilon_i$, where $t_i$ is the `time coordinate' of the $i$-th data point and $\epsilon_i$ comes from a stationary distribution. How do we disentangle $f$ and $\epsilon$? By definition, the likelihood of the stationary part should be invariant under the permutation group acting on the data points. In other words, $P(\{x_i\}) = P(\{x_{\sigma(i)}\})$ for any permutation $\sigma$, but then this is going to require symmetrizing over permutations, which suggests the trivial solution of just defining $P$ as a product of identical single-point factors, which seems a bit heavy-handed. It is easily achieved, of course, but this will be a computational nightmare. If this could be done, I think it would be a pretty strong constraint on what $f$ and $\epsilon$ could be.
One way to try to do this would be to do Monte-Carlo sampling over permutations $\sigma$ when calculating the cost function at each step of the MCMC for finding $f$. We really want to evaluate something like
$$P(\{x_i\} \mid f) \;\propto\; \sum_{\sigma} \prod_i p\big(x_{\sigma(i)} - f(t_i)\big),$$
and I think that for large enough $N$ the sum over permutations could be approximated by probabilistic sampling of some permutations. The worry here is that $\epsilon$ will come out all constant, so the noise model preferred will be the trivial distribution, and all the variation will try to be fit by $f$. The other extreme is that $f$ will be a constant and all the data will be fit by $\epsilon$.
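As a sketch of the permutation-sampling idea (everything here is an assumption for illustration: the linear trend, the Gaussian noise model, and all the constants), one can estimate the permutation-averaged likelihood by Monte Carlo and compare a trend model against a trivial constant one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series: secular trend f(t) = 2t plus stationary Gaussian noise.
t = np.linspace(0.0, 1.0, 100)
x = 2.0 * t + rng.normal(0.0, 0.3, size=t.size)

def log_lik(xs, f_vals, sigma=0.3):
    """log P(x | f) for iid Gaussian stationary noise about the trend."""
    r = xs - f_vals
    return -0.5 * np.sum((r / sigma) ** 2) - r.size * np.log(sigma * np.sqrt(2 * np.pi))

def perm_averaged_log_lik(xs, f_vals, n_perms=200):
    """Monte-Carlo estimate of the log of the permutation-averaged likelihood:
    sample random permutations of the data instead of summing over all N! of them."""
    lls = np.array([log_lik(xs[rng.permutation(xs.size)], f_vals)
                    for _ in range(n_perms)])
    m = lls.max()
    return m + np.log(np.mean(np.exp(lls - m)))  # numerically stable log-mean-exp

ll_trend = perm_averaged_log_lik(x, 2.0 * t)                    # the true trend
ll_const = perm_averaged_log_lik(x, np.full_like(t, x.mean()))  # trivial constant f
```

Running this shows exactly the worry above: averaging over random permutations destroys the alignment between trend and data, so the trivial constant model scores higher and all the variation gets pushed into the noise model.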
The quick way might be to use something like singular spectrum analysis on each permuted summand and demand that the SSA interpolations all agree: pick $f$, compute the SSA for some sample of permutations, and vary $f$ in a standard MCMC to sample over possible trends. The end result should be a de-noised prediction for the actual functional dependence $f(t)$.
Everyone knows about Kolmogorov complexity. Kolmogorov addressed randomness and came up with the Kolmogorov complexity of a string (defined as the length of the shortest program that can reproduce the string) as a measure of its randomness. Solomonoff, two years before Kolmogorov, looked at the problem from a much deeper perspective, in my opinion. He asked, in drastic paraphrase: How should you weight different programs that could generate a given string? And the two are related because it turns out that the weight Solomonoff picked for a program $p$ is $2^{-\ell(p)}$, where $\ell(p)$ is the length of $p$, so the sum over programs is dominated by the shortest one. [So why isn’t it called Solomonoff complexity, though Kolmogorov referenced Solomonoff in his paper? Because Kolmogorov was already famous.]
What I would like to understand is the following: How do you go about generating programs that could reproduce the string? Induction is a hoary field, but most of the work is done assuming that you have to decide between different given programs, and very little, in proportion, addresses the question of how to go about generating such programs. That is actually a problem because the search space is, to put it mildly, large. Is there a fundamental process that is the optimal manner to induce programs from data? This must itself be a probabilistic process. In other words, the match between a program's output and the desired string must be allowed to be imperfect, so the search-space summation is not just over programs that exactly reproduce a given string,
but rather weighted in some way to balance fitting and reducing program length. Why does everything involve a balance between energy and entropy? Because there is nothing more fundamental in all of science.
So I think there should be a way to write something like
$$P(s) \;\propto\; \sum_p 2^{-\ell(p)}\, e^{-\mu\, d(O(p),\, s)},$$
where $O(p)$ is the output of program $p$ and $d$ is a mismatch function,
and then the question is: What is the universal form of the mismatch function? An interesting point here is that this process might work even with noisy input, since if the correct string is $s$ but you read $s'$, differing in one bit, then you’ll find a very short program that can reproduce $s$, and since the mismatch is only one bit, this process would autocorrect your initially misread string.
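A toy version of this weighted search shows the autocorrection effect. The "universal machine" here is made up for illustration (programs are bit-strings whose output is the program repeated cyclically), as is the mismatch weight $\mu$:

```python
import itertools
import math
from collections import defaultdict

def run(program, n):
    """Toy 'universal machine': a program is a bit-string, and its output
    is the program repeated cyclically to length n."""
    return (program * (n // len(program) + 1))[:n]

def hamming(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))

def best_output(s, max_len=6, mu=2.0):
    """Weight each program p by 2^-len(p) * exp(-mu * mismatch(out(p), s)),
    aggregate the weight behind each distinct output, and return the winner."""
    weight = defaultdict(float)
    for L in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=L):
            p = "".join(bits)
            out = run(p, len(s))
            weight[out] += 2.0 ** (-L) * math.exp(-mu * hamming(out, s))
    return max(weight, key=weight.get)

noisy = "0101110101"   # "0101010101" with one bit misread
recovered = best_output(noisy)
```

The short program `01` carries so much prior weight that the one-bit mismatch doesn't dethrone it, and the misread bit gets corrected, even though a longer program (here `010111`) reproduces the noisy string exactly. Make $\mu$ large enough and the exact-match program wins instead, which is the fitting-vs-length balance in action.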
We want symbols and rules for manipulating them such that the resulting stream contains the known sequence of symbols. In other words, an inverse Gödel problem: given a stream of symbols, find a set of rules so that this stream is a proof.
Given a finite set of points, a standard problem in Bayesian inference is to figure out which probability distributions, if any, these points could plausibly have been drawn from. In particular, supposing that we actually know that these points have no sequence-dependent variation (to which I’ll come back later), is the distribution likely to have finite or infinite support? I don’t know the statistical literature very well, but whatever I can find on density estimation assumes the density has infinite support and then transforms to finite support before inferring the likely density. This cannot be correct: a priori we don’t know if the density has finite or infinite support, and if it has finite support but we assume infinite support, then our inferred density will be non-zero in regions where it should be zero. This support-determination subproblem seems to me to require a balance between smoothness and complexity, but somehow all the action is really taking place at the edges of the distribution, i.e. in the regions where there are very few data points. There is no point in studying this problem as a perturbation of the infinite-number-of-data-points limit, since with an infinite number of data points we certainly know the distribution’s support.
What do we expect? The probability density is a field confined to a certain domain. The energy of a field configuration is a measure of how well it conforms to the observed data points, and the entropy is the a priori expectation of smoothness or continuity, which translates into a term in the Radon-Nikodym derivative proportional to some measure of how big the derivatives are. Then in a region with no data points this entropy term is essentially weighting vacuum fluctuations. In other words, is there something like a Casimir energy trying to push the boundaries apart, with the potential well set up by the data points responsible for keeping the boundaries closer? And for a distribution with infinite support, there should be a marginal mode allowing the boundary to fluctuate without changing the free energy.
Suppose we think in terms of a Thomas-Fermi picture. The data points are fixed nuclei and the density is a cloud of electrons with kinetic energy. The question is: What is the extent of the cloud? And how does it depend on the distribution of the fixed nuclei? The Thomas-Fermi energy functional takes into account the kinetic energy of the electrons, normalized so that $\int \rho = 1$:
$$E[\rho] \;=\; \lambda \int \rho^{5/3} \;-\; \frac{1}{2}\iint \rho(x)\,G(x-y)\,\rho(y) \;+\; \frac{1}{N}\sum_i \int \rho(x)\,G(x - x_i),$$
where the last term is the attraction to data points and the middle term is the mutual repulsion of electrons. $G$ is defined so that it increases as the distance increases. What is interesting here is that we can naturally consider two limits: $N$, the number of data points, and the number of `bins' or the resolution of the inferred pdf, because we have the freedom to change the ratio of the electron charge to the data-point charge, and the TF theory becomes exact in the limit of large numbers of data points, which is in fact what we want.
Another amusing thing about the TF theory is that it actually predicts that molecules will fly apart. This would seem to be a deal breaker because, in the way I’m trying to apply it here, that would make the pdf a sum of disjoint pdfs around each of the data points. However, this only happens when you put in a term involving nucleus-nucleus repulsion, which I did not do here. (A little nugget buried in the dim recesses from Barry Simon’s or Elliott Lieb’s lectures; misspent youth and all that.) Here there is no reason to put in such a term, and therefore the TF pdf should not fly apart. Of course, the dimension dependence of the Green function is amusing. For $d=1$ the Coulomb potential leads to a constant electric field. So if we’re thinking in terms of electrostatic analogies, we have perfect screening provided enough electrons are around each data point, and the only thing spreading them out is the density term, which doesn’t want accumulations. This is not electrostatic repulsion, just the Pauli exclusion principle.
Varying $\rho$, after scaling by $N$, we find at leading order in $N$ that the cloud screens the data-point charges. The question is: What about the term proportional to $\int \rho^{5/3}$? This term is a smoothing term trying to lessen accumulations of electrons, acting in concert with the electrostatic repulsion. The electrostatic repulsion can be screened, but this term cannot, so it plays a crucial role in actually getting a smooth cloud instead of lumping electrons to minimize electrostatic energy. Nevertheless, we do want the leading result to stay valid, so this brings us to the question of finding the appropriate $N$-dependence of the coefficient $\lambda$ in front of this term. Fluctuations about the expected distribution will be $O(1/\sqrt{N})$, so we will try scaling $\lambda$ accordingly and see if the term smooths out fluctuations. If you make this term dominant, then we would expect the uniform distribution to dominate the actual presence of data points. On the other hand, if $\lambda$ is too small, we would expect sharp peaks about each data point, close to the sum-of-delta-functions limit. The choice is actually dictated by the desire that our a priori probability of a given pdf should be independent of how many data points we expect. In other words, it is perfectly reasonable to alter the constant in front of the data-matching term, because that is all dependent on our notion of how closely we want to match the data, but it is not reasonable to alter the a priori probability of the pdf itself.
It turns out that the Green function that seems to work reasonably well is actually the logarithm, which is the inverse of the Laplacian in 2d.
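Here is a numerical sketch of this picture in 1d. The log Green function, the $\rho^{5/3}$ term, all coefficient values, and the softmax parametrization are choices made for illustration, not a canonical implementation: minimize a TF-style functional over densities on a grid and check that the electron cloud hugs the data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=40)       # the "nuclei" (observed data points)

pad = 2.0
grid = np.linspace(data.min() - pad, data.max() + pad, 200)
h = grid[1] - grid[0]

lam = 0.05                                  # weight of the rho^{5/3} smoothing term
# Log Green function (inverse Laplacian in 2d), regularized at zero separation:
G = np.log(np.abs(grid[:, None] - grid[None, :]) + h)
# Attraction of the cloud to the data points (total nuclear charge scaled to 1):
V = np.log(np.abs(grid[:, None] - data[None, :]) + h).mean(axis=1)

def density(theta):
    w = np.exp(theta - theta.max())
    w /= w.sum()                            # softmax keeps the density normalized
    return w, w / h

def energy(theta):
    w, rho = density(theta)
    kinetic = lam * h * np.sum(rho ** (5.0 / 3.0))
    repulse = -0.5 * h * h * (rho @ G @ rho)   # like charges: -log interaction repels
    attract = h * (rho @ V)                    # opposite charges: +log attracts
    return kinetic + repulse + attract

def grad(theta):
    w, rho = density(theta)
    dE_drho = lam * h * (5.0 / 3.0) * rho ** (2.0 / 3.0) - h * h * (G @ rho) + h * V
    g = dE_drho / h                            # chain rule: rho = w / h
    return w * (g - w @ g)                     # softmax Jacobian

theta = np.zeros(grid.size)
for _ in range(2000):
    theta -= 5.0 * grad(theta)

_, rho = density(theta)
E = energy(theta)
```

The screened equilibrium keeps the density finite everywhere: the $\rho^{5/3}$ term prevents collapse onto the data points, while the attraction keeps the cloud from leaking far past the support of the data.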
Projective spaces are spaces of lines in some ambient space. For example, $\mathbb{RP}^2$ is the space of straight lines through the origin in $\mathbb{R}^3$. What is this space? Well, it is the space you get when you identify points in $\mathbb{R}^3 \setminus \{0\}$ up to the multiplicative group $\mathbb{R}^*$. In this particular case, it is the upper hemisphere of $S^2$ with antipodal points on the boundary circle identified. All good stuff, which Brock Fuller and Jack Conn taught me a long, long time ago.
But now we’re dealing with trying to do something like PCA on projective space, so we want to ask: What is PCA in a linear space and why do we care? We’re trying to find the directions along which some cloud of points is most spread out. In particular, the rotation symmetry of the linear space is what underlies wanting to diagonalize the covariance matrix. So on a projective space we still have lots of symmetry, and we should be using this symmetry to find something like a great circle along which the cloud of points in projective space is most spread out. The distance measure inherited from the underlying ambient space has the same rotation symmetry (except for a minor caveat about the identification of antipodal points, which affects the group by modding out a discrete $\mathbb{Z}_2$).
It’s not obvious to me how to do this as some sort of simple matrix manipulation, but as a maximization problem, we want to find a rotation in the ambient space such that the points in the projective space have the largest possible spread along the image of a great circle. (Remember the identification of $x$ and $-x$.) Then the other directions have to be determined by rotations that don’t change the inverse image of this half-circle in the ambient space, and so on. (Might be better to have some different orientation standard, but this is more a matter of notation and calculation.)
Each line is represented by a direction vector. A 2-plane in the original space defines a circle on the projective space. A 2-plane could be defined by two data points in the original space that maximize the projective distance between pairs of data points. Obviously, picking two specific data points is too susceptible to outliers, so we want something like two orthonormal vectors such that the mean distance of the data points from the plane defined by these vectors (which is the angle between a data point and its projection onto the plane) is minimized, and the standard deviation of the angles between the projections of the points within this plane is maximized. Just implement the second cost:
$$\max_{\{e_1, e_2\}} \sum_i \big(\theta_i - \bar\theta\big)^2, \qquad \theta_i = \text{angle of the projection of point } i \text{ in the } (e_1, e_2)\text{-plane}.$$
Of course, there are other sets of charts on the projective space, but the metric distances are not so trivially calculated. It is interesting that this projective PCA would reduce the dynamic range. Is this good? There are instances in image processing where I imagine that the important thing is not the absolute intensity but the relative intensity. I wonder if one would get different results from the Pearson correlation. The Pearson correlation normalizes the mean and the standard deviation before computing the cosine of the angle between the two vectors. The projective distance is scale-independent but does take the mean of the two vectors into account. PCA can be generalized in many ways. There are nonlinear manifold PCA methods, neural network PCA methods and so on.
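A heuristic sketch of the plane-fitting step (an assumption-laden shortcut, not the full constrained maximization: the second-moment matrix of the normalized direction vectors is automatically insensitive to the antipodal identification, and its top eigenvectors give the best-fit plane for the first cost):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy cloud in R^3: points near the (e1, e2)-plane, with random overall
# scales and signs — exactly the redundancy projective space quotients out.
n = 200
coeffs = rng.normal(size=(n, 2))
pts = np.column_stack([coeffs[:, 0], coeffs[:, 1], 0.05 * rng.normal(size=n)])
pts *= rng.uniform(0.5, 5.0, size=(n, 1)) * rng.choice([-1.0, 1.0], size=(n, 1))

def proj_dist(u, v):
    """Projective distance: angle between lines, insensitive to scale and sign."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

# Represent each line by a unit vector; the second-moment matrix U^T U is
# sign-independent since (-x)(-x)^T = x x^T, and its top two eigenvectors
# span the 2-plane minimizing the mean squared sine of the angle to the plane.
U = pts / np.linalg.norm(pts, axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(U.T @ U)
basis = evecs[:, -2:].T                   # rows: orthonormal basis of the plane

# First cost: angle of each point to the fitted plane.
in_plane = U @ basis.T
off = np.sqrt(np.clip(1.0 - np.sum(in_plane ** 2, axis=1), 0.0, None))
mean_angle = float(np.arcsin(np.clip(off, 0.0, 1.0)).mean())
```

The second cost (maximizing the spread of the in-plane angles `np.arctan2(in_plane[:, 1], in_plane[:, 0])` over choices of 2-plane) could then be layered on top of this as a refinement.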
What’s the best way to normalize an array of gene expression values? First question: Best for what?
This depends on what you want to do with the array. The requirements for sample classification seem to me to be a little different than for, say, class discovery. What we want for sample classification is to normalize each array in a self-referential manner, without taking into account the rest of the samples, because we want to be able to do the exact same normalization on any new sample that we might want to classify, given what we’ve learned from the training set. So suppose we normalize by total mRNA. Now all samples will be points on some high-dimensional hypersurface, but if some gene had high expression values in disease samples, the other genes in disease samples will now have depressed values because of the overall normalization. That doesn’t mean that those genes are biomarkers for that disease. On the other hand, we could arbitrarily declare that some genes are `housekeeping' genes and normalize one (or several) of them to a fixed expression level, rescaling all the rest. This avoids the problem of the total-mRNA normalization, but exactly how do you decide that a gene is a housekeeping gene without taking into account the entire set of samples? And if you do use the entire set of samples to decide what is a housekeeping gene (i.e. one that shows no correlation with the known classification, for example), then you are not normalizing in a sample-specific manner.
So, in some sense, we want to live in the projective space associated with gene expression, and the relevant data is the set of gene expression values up to an overall constant. We’re picking charts on this projective space with all these different normalizations, but what we should do is sample classification in a coordinate-invariant manner. So how do we do this? Take any fixed gene, say one with a median value across the normal samples (this is just bowing to experimental reality, and has no theoretical justification), and normalize all samples so that this gene is 1. Now the difference between samples, weighted by the uncertainty in expression, is a measure of the distance in projective space. We could also compute the distance on a chart adapted to a sphere by normalizing each sample to unit length and then computing the angle between samples, with appropriate signs in various places of course.
With such a measure of distance, the first thing one might do is compute the distribution of distances between all the normal samples. It’s a common view that healthy people are all alike; every unhealthy person is unhealthy in his/her own way (with apologies to Tolstoy). So I would imagine that one would get more insight by seeing how the distribution of disease-to-normal distances differs from the normal-to-normal distribution.
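As a quick sketch of that comparison (the expression matrix is synthetic; all sizes, noise levels, and the 20-gene disease signature are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n_genes = 500
base = rng.lognormal(mean=2.0, sigma=0.5, size=n_genes)    # baseline expression

def sample(shift=0.0):
    expr = base * rng.lognormal(0.0, 0.2, size=n_genes)    # biological noise
    expr[:20] *= np.exp(shift)                             # disease perturbs 20 genes
    return expr * rng.uniform(0.5, 2.0)                    # arbitrary total-mRNA scale

normal = [sample() for _ in range(30)]
disease = [sample(shift=1.0) for _ in range(30)]

def proj_dist(x, y):
    """Scale-invariant distance: the angle between expression vectors."""
    c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))

nn = [proj_dist(normal[i], normal[j])
      for i in range(len(normal)) for j in range(i + 1, len(normal))]
dn = [proj_dist(d, m) for d in disease for m in normal]
```

The arbitrary overall scale applied to each sample drops out of the distance entirely, which is the point: no normalization choice has to be made before comparing samples, and the disease-to-normal distances sit visibly above the normal-to-normal ones.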
Up next: PCA on projective space??
I’ve been wondering how to use grammatical evolution to generate signaling networks. So first we have to think up some sort of grammar for signaling networks. What would be appropriate start symbols? Productions? Terminals?
Transcription: Gene > Gene + RNA (constitutive expression) | Gene*TF | Gene*Inhibitor
Transcription: Gene*TF > Gene + RNA | Gene*TF[*Cofactor]^n | Gene*TF*Inhibitor
Transcription: Gene*TF*Cofactor > Gene + RNA
Signaling: Receptor > Receptor*Signal | Receptor*Blocker
Degradation: Any > Nothing
and so on
People have done this sort of thing before, obviously, but I’m wondering whether applying genetic mutation operators to a string of such productions will lead to the same sorts of changes to gene networks that are actually observed. Not obvious to me …
What happens if you use a stochastic grammar? What’s the difference between a stochastic grammar applied many times to a fixed genome vs a deterministic grammar applied to a population of genomes? In biology, the binding of TFs may actually be stochastic, so perhaps we should encode the probability of a symbol in the genome going to a particular production in the genome itself.
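A minimal stochastic-grammar sketch of the productions above (the probabilities attached to each alternative are invented; choosing them, or encoding them in the genome itself, is exactly the open question; the degradation rule and the `[*Cofactor]^n` tower are omitted for brevity):

```python
import random

random.seed(4)

# Each non-terminal maps to weighted alternatives; complexes with no rule
# here (e.g. Gene*Inhibitor) act as terminals.
RULES = {
    "Gene":             [(0.5, ["Gene", "RNA"]),        # constitutive expression
                         (0.3, ["Gene*TF"]),
                         (0.2, ["Gene*Inhibitor"])],
    "Gene*TF":          [(0.5, ["Gene", "RNA"]),
                         (0.3, ["Gene*TF*Cofactor"]),
                         (0.2, ["Gene*TF*Inhibitor"])],
    "Gene*TF*Cofactor": [(1.0, ["Gene", "RNA"])],
    "Receptor":         [(0.6, ["Receptor*Signal"]),
                         (0.4, ["Receptor*Blocker"])],
}

def step(symbol):
    """Apply one stochastic production; symbols without rules are terminal."""
    if symbol not in RULES:
        return [symbol]
    r, acc = random.random(), 0.0
    for p, rhs in RULES[symbol]:
        acc += p
        if r < acc:
            return list(rhs)
    return list(RULES[symbol][-1][1])

def derive(start, n_steps=50):
    """Rewrite one random non-terminal per step; the RNA symbols that pile up
    are the 'stream' this grammar emits."""
    pool = [start]
    for _ in range(n_steps):
        live = [i for i, s in enumerate(pool) if s in RULES]
        if not live:
            break
        i = random.choice(live)
        pool[i:i + 1] = step(pool[i])
    return pool

runs = [derive("Gene") for _ in range(20)]
```

Mutation operators would then act on the rule table itself (or on a genome encoding it), and the question is whether small edits to `RULES` produce the kinds of network rewirings actually seen in evolution.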