# Living on the edge

Given a finite set of points $\{x_i\},$ a standard problem in Bayesian inference is to figure out which probability distributions, if any, these points are likely to have been drawn from. In particular, supposing we actually know that these points have no sequence-dependent variation (to which I’ll come back later), is the distribution likely to have finite or infinite support? I don’t know the statistical literature very well, but whatever I can find on density estimation assumes the density has infinite support and then transforms to finite support before inferring the likely density. This cannot be correct: a priori we don’t know whether the density has finite or infinite support, and if it has finite support but we assume infinite support, then our inferred density will be non-zero in regions where it should be zero. This support-determination subproblem seems to me to require a balance between smoothness and complexity, but somehow all the action takes place at the edges of the distribution, i.e. in the regions where there are very few data points. There is no point in studying this problem as a perturbation of the limit of infinitely many data points, since with infinitely many data points we certainly know the distribution’s support.
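To make the leakage problem concrete, here is a minimal numerical sketch (my own illustration, not from any particular paper): a plain Gaussian kernel density estimate, which implicitly assumes infinite support, fit to samples from the uniform distribution on $[0,1]$. The bandwidth and sample size are arbitrary choices.

```python
import numpy as np

# Samples from a density with finite support [0, 1].
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=500)

def gaussian_kde(x, data, bandwidth=0.05):
    """Plain Gaussian kernel density estimate evaluated at points x."""
    x = np.asarray(x, dtype=float)
    z = (x[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

# The estimate is strictly positive outside [0, 1], where the true density
# is exactly zero -- the infinite-support assumption leaks mass past the edges.
outside = gaussian_kde(np.array([-0.05, 1.05]), samples)
print(outside)
```

The leaked mass is small for a narrow bandwidth, but it is never zero, which is exactly the a priori error described above.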

What do we expect? The probability density is a field confined to a certain domain. The energy of a field configuration measures how well it conforms to the observed data points, and the entropy encodes an a priori expectation of smoothness or continuity, which translates into a term in the Radon-Nikodym derivative proportional to some measure of how large derivatives are. In a region with no data points, this entropy term is essentially weighting vacuum fluctuations. In other words, is there something like a Casimir energy trying to push the boundaries apart, with the potential well set up by the data points responsible for keeping the boundaries closer together? And for a distribution with infinite support, there should be a marginal mode allowing the boundary to fluctuate without changing the free energy.

Suppose we think in terms of a Thomas-Fermi picture. The data points are fixed nuclei and the density is a cloud of electrons with kinetic energy. The question is: What is the extent of the cloud? And how does it depend on the distribution of the fixed nuclei? The Thomas-Fermi energy functional takes into account the kinetic energy of the electrons, normalized so $\int \rho = N_e:$

$$E_{TF} = c\int d^3x\, \rho(x)^{5/3} \;-\; \frac{1}{2}\left(\frac{N}{N_e}\right)^2 \int\!\!\int d^3x\, d^3y\; \rho(x)\, G(x,y)\, \rho(y) \;+\; \frac{N}{N_e} \sum_{i=1}^{N} \int d^3x\; \rho(x)\, G(x,x_i)$$

where the last term is the attraction to data points and the middle term is the mutual repulsion of electrons. $G$ is defined so that $G(x,y)$ increases as $|x-y|$ increases. What is interesting here is that we can naturally consider two limits: $N,$ the number of data points, and $N_e,$ the number of 'bins' or the resolution of the inferred pdf. We have the freedom to change the ratio of the electron charge to the data-point charge, and TF theory becomes exact in the limit of a large number of data points, which is in fact what we want.
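Here is a hedged 1-D discretization of the functional above, with $G(x,y)=|x-y|$ (increasing with separation, as required) and kinetic exponent $1+2/D = 3$ for $D=1$. The grid, data points, and constants are illustrative choices of mine, not values from the text.

```python
import numpy as np

grid = np.linspace(-2.0, 2.0, 201)
dx = grid[1] - grid[0]
data = np.array([-0.5, 0.0, 0.7])   # the fixed "nuclei" x_i (arbitrary example)

def E_TF(rho, N_ratio=1.0, c=1.0):
    """Discretized energy: kinetic term + mutual repulsion + attraction to data."""
    G = np.abs(grid[:, None] - grid[None, :])          # G(x, y) = |x - y|
    G_data = np.abs(grid[:, None] - data[None, :])
    kinetic = c * np.sum(rho**3) * dx                  # rho^{1+2/D} with D = 1
    repulsion = -0.5 * N_ratio**2 * rho @ G @ rho * dx**2
    attraction = N_ratio * np.sum(rho[:, None] * G_data) * dx
    return kinetic + repulsion + attraction

def normalized(rho):
    return rho / (rho.sum() * dx)                      # enforce ∫rho = 1

rho_broad = normalized(np.exp(-grid**2))               # a smooth trial cloud
rho_peaked = normalized(np.exp(-grid**2 / (2 * 0.05**2)))  # cloud piled up at 0
# The Pauli-style rho^3 term heavily penalizes the accumulated cloud:
print(E_TF(rho_broad), E_TF(rho_peaked))
```

Evaluating trial clouds this way shows the kinetic term doing the work described below: accumulation is expensive even where the electrostatics would favor it.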

Another amusing thing about TF theory is that it actually predicts that molecules fly apart. This would seem to be a deal breaker, because in the way I’m trying to apply it here, that would make the pdf a sum of disjoint pdfs around each of the data points. However, this happens only when you put in a term for nucleus-nucleus repulsion, which I did not do here. (A little nugget buried in the dim recesses from Barry Simon’s or Elliott Lieb’s lectures, misspent youth and all that.) Here there is no reason to put in such a term, and therefore the TF pdf should not fly apart. Of course, the dimension dependence of the Green function $G$ is amusing. For $D=1,$ the Coulomb potential leads to a constant electric field. So if we’re thinking in terms of electrostatic analogies, we have perfect screening provided enough electrons are around each data point, and the only thing spreading them out is the $\rho^{1+2/D}$ density term, which doesn’t want accumulations. This is not electrostatic repulsion, just the Pauli exclusion principle.

Varying $\rho$ after scaling by $N_e,$ we find at leading order in $N$ that $\rho = {1\over N} \sum_i \delta(x-x_i).$ The question is: what about the term proportional to $N_e^{1+2/D}?$ This is a smoothing term that tries to lessen accumulations of electrons, acting in concert with the electrostatic repulsion. The electrostatic repulsion can be screened, but this term cannot, so it plays a crucial role in actually getting a smooth cloud instead of electrons lumping up to minimize the electrostatic energy. Nevertheless, we want the leading result to stay valid, which brings us to the question of finding the appropriate $N$ dependence of $N_e.$ Fluctuations about the expected distribution will be $O(N),$ so we will try $N_e^{1+2/D} = 1$ and see if this term smooths out fluctuations. If this term is made dominant, we would expect the uniform distribution to dominate over the actual presence of data points; if it is too small, we would expect sharp peaks about each data point, close to the sum-of-delta-functions limit. The choice is actually dictated by the desire that our a priori probability of a given $\rho$ should be independent of how many data points we expect. In other words, it is perfectly reasonable to alter the constant in front of $E_{TF},$ because that depends on our notion of how closely we want to match the data, but it is not reasonable to alter the a priori probability of $\rho.$
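The trade-off between the two regimes above (uniform-dominated versus delta-peaked) can be sketched numerically by minimizing a 1-D discretization of the functional and dialing the kinetic coefficient. This is my own illustration under arbitrary choices of grid, data points, step size, and coefficient $c$; projected gradient descent is just a convenient minimizer here, not anything the text prescribes.

```python
import numpy as np

grid = np.linspace(-2.0, 2.0, 201)
dx = grid[1] - grid[0]
data = np.array([-0.5, 0.0, 0.7])
G = np.abs(grid[:, None] - grid[None, :])                 # G(x, y) = |x - y|
G_data = np.abs(grid[:, None] - data[None, :]).sum(axis=1)

def energy(rho, c):
    """Discretized functional: kinetic + mutual repulsion + attraction to data."""
    return (c * np.sum(rho**3) * dx
            - 0.5 * rho @ G @ rho * dx**2
            + rho @ G_data * dx)

def minimize(c, steps=2000, lr=0.05):
    """Projected gradient descent keeping rho >= 0 and ∫rho = 1."""
    rho = np.full_like(grid, 0.25)                        # start from uniform
    for _ in range(steps):
        grad = 3 * c * rho**2 * dx - (G @ rho) * dx**2 + G_data * dx
        rho = np.clip(rho - lr * grad, 0.0, None)
        rho /= rho.sum() * dx
    return rho

rho_smooth = minimize(c=1.0)    # strong smoothing term: broad cloud
rho_peaked = minimize(c=0.01)   # weak smoothing term: cloud piles up near data
print(rho_smooth.max(), rho_peaked.max())
```

With the smoothing coefficient turned down, the minimizer concentrates sharply near the data points, heading toward the sum-of-delta-functions limit; turned up, it flattens toward uniform, matching the dichotomy described above.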

It turns out that the Green function that seems to work reasonably well is actually the logarithm, $G(x,y) = \log|x-y|,$ which is the inverse of the Laplacian in two dimensions.