# Tag: gbrain

We start from a large data-set $V=\{ k,l,m,n,\dots \}$ (texts, events, DNA-samples, …) with a suitable distance-function ($d(m,n) \geq 0~d(k,l)+d(l,m) \geq d(k.m)$) which measures the (dis)similarity between individual samples.

We’re after a set of unknown events $\{ p,q,r,s,\dots \}$ to explain the distances between the observed data. An example: let’s assume we’ve sequenced the DNA of a set of species, and computed a Hamming-like distance to measures the differences between these sequences.

(From Geometry of the space of phylogenetic trees by Billera, Holmes and Vogtmann)

Biology explains these differences from the fact that certain species may have had more recent common ancestors than others. Ideally, the measured distances between DNA-samples are a tree metric. That is, if we can determine the full ancestor-tree of these species, there should be numbers between ancestor-nodes (measuring their difference in DNA) such that the distance between two existing species is the sum of distances over the edges of the unique path in this phylogenetic tree connecting the two species.

Last time we’ve see that a necessary and sufficient condition for a tree-metric is that for every quadruple $k,l,m,n \in V$ we have that the maximum of the sum-distances

$$\{ d(k,l)+d(m,n),~d(k,m)+d(l,n),~d(k,n)+d(l,m) \}$$

is attained at least twice.

In practice, it rarely happens that the measured distances between DNA-samples are a perfect fit to this condition, but still we would like to compute the most probable phylogenetic tree. In the above example, there will be two such likely trees:

(From Geometry of the space of phylogenetic trees by Billera, Holmes and Vogtmann)

How can we find them? And, if the distances in our data-set do not have such a direct biological explanation, is it still possible to find such trees of events (or perhaps, a forest of event-trees) explaining our distance function?

Well, tracking back these ancestor nodes looks a lot like trying to construct colimits.

By now, every child knows that if their toy category $T$ does not allow them to construct all colimits, they can always beg for an upgrade to the presheaf topos $\widehat{T}$ of all contravariant functors from $T$ to $Sets$.

But then, the child can cobble together too many crazy constructions, and the parents have to call in the Grothendieck police who will impose one of their topologies to keep things under control.

Can we fall back on this standard topos philosophy in order to find these forests of the unconscious?

(Image credit)

We have a data-set $V$ with a distance function $d$, and it is fashionable to call this setting a $[0,\infty]$-‘enriched’ category. This is a misnomer, there’s not much ‘category’ in a $[0,\infty]$-enriched category. The only way to define an underlying category from it is to turn $V$ into a poset via $n \geq m$ iff $d(n,m)=0$.

Still, we can define the set $\widehat{V}$ of $[0,\infty]$-enriched presheaves, consisting of all maps
$$p : V \rightarrow [0,\infty] \quad \text{satisfying} \quad \forall m,n \in V : d(m,n)+p(n) \geq p(m)$$
which is again a $[0,\infty]$-enriched category with distance function
$$\hat{d}(p,q) = \underset{m \in V}{max} (q(m) \overset{.}{-} p(m)) \quad \text{with} \quad a \overset{.}{-} b = max(a-b,0)$$
so $\widehat{V}$ is a poset via $p \geq q$ iff $\forall m \in V : p(m) \geq q(m)$.

The good news is that $\widehat{V}$ contains all limits and colimits (because $[0,\infty]$ has sup’s and inf’s) and that $V$ embeds isometrically in $\widehat{V}$ via the Yoneda-map
$$m \mapsto y_m \quad \text{with} \quad y_m(n)=d(n,m)$$
The mental picture of a $[0,\infty]$-enriched presheaf $p$ is that of an additional ‘point’ with $p(m)$ the distance from $y_m$ to $p$.

But there’s hardly a subobject classifier to speak of, and so no Grothendieck topologies nor internal logic. So, how can we select from the abundance of enriched presheaves, the nodes of our event-forest?

We can look for special properties of the ancestor-nodes in a phylogenetic tree.

For any ancestor node $p$ and any $m \in V$ there is a unique branch from $p$ having $m$ as a leaf (picture above,left). Take another branch in $p$ and a leaf vertex $n$ of it, then the combination of these two paths gives the unique path from $m$ to $n$ in the phylogenetic tree, and therefore
$$\hat{d}(y_m,y_n) = d(m,n) = p(m)+p(n) = \hat{d}(p,y_m) + \hat{d}(p,y_n)$$
In other words, for every $m \in V$ there is another $n \in V$ such that $p$ lies on the geodesic from $m$ to $n$ (identifying elements of $V$ with their Yoneda images in $\widehat{V}$).

Compare this to Stephen Wolfram’s belief that if we looked properly at “what ChatGPT is doing inside, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics”.

Even if the distance on $V$ is symmetric, the extended distance function on $\widehat{V}$ is usually far from symmetric. But here, as we’re dealing with a tree-distance, we have for all ancestor-nodes $p$ and $q$ that $\hat{d}(p,q)=\hat{d}(q,p)$ as this is just the sum of the weights of the edges on the unique path from $p$ and $q$ (picture above, on the right).

Right, now let’s look at a non-tree distance function on $V$, and let’s look at those elements in $\widehat{V}$ having similar properties as the ancestor-nodes:

$$T_V = \{ p \in \widehat{V}~:~\forall n \in V~:~p(n) = \underset{m \in V}{max} (d(m,n) \overset{.}{-} p(m)) \}$$

Then again, for every $p \in T_V$ and every $n \in V$ there is an $m \in V$ such that $p$ lies on a geodesic from $n$ to $m$.

The simplest non-tree example is $V = \{ a,b,c,d \}$ with say

$$d(a,c)+d(b,d) > max(d(a,b)+d(c,d),d(a,d)+d(b,c))$$

In this case, $T_V$ was calculated by Andreas Dress in Trees, Tight Extensions of Metric Spaces, and the Cohomological Dimension of Certain Groups: A Note on Combinatorial Properties of Metric Spaces. Note that Dress writes $mn$ for $d(m,n)$.

If this were a tree-metric, $T_V$ would be the tree, but now we have a $2$-dimensional cell $T_0$ consisting of those presheaves lying on a geodesic between $a$ and $c$, and on the one between $b$ and $d$. Let’s denote this by $T_0 = \{ a—c,b—d \}$.

$T_V$ has eight $1$-dimensional cells, and with the same notation we have

Let’s say that $V= \{ a,b,c,d \}$ are four DNA-samples of species but failed to satisfy the tree-metric condition by an error in the measurements, how can we determine likely phylogenetic trees for them? Well, given the shape of the cell-complex $T_V$ there are four spanning trees (with root in $f_a,f_b,f_c$ or $f_d$) having the elements of $V$ as their only leaf-nodes. Which of these is most likely the ancestor-tree will depend on the precise distances.

For an arbitrary data-set $V$, the structure of $T_V$ has been studied extensively, under a variety of names such as ‘Isbell’s injective hull’, ‘tight span’ or ‘tropical convex hull’, in slightly different settings. So, in order to use results one sometimes have to intersect with some (un)bounded polyhedron.

It is known that $T_V$ is always a cell-complex with dimension of the largest cell bounded by half the number of elements of $V$. In this generality it will no longer be the case that there is a rooted spanning tree of teh complex having the elements of $V$ as its only leaves, but we can opt for the best forest of rooted trees in the $1$-skeleton having all of $V$ as their leaf-nodes. Theses are the ‘forests of the unconscious’ explaining the distance function on the data-set $V$.

Apart from the Dress-paper mentioned above, I’ve found these papers informative:

So far, we started from a data-set $V$ with a symmetric distance function, but for applications in LLMs one might want to drop that condition. In that case, Willerton proved that there is a suitable replacement of $T_V$, which is now called the ‘directed tight span’ and which coincides with the Isbell completion.

Recently, Simon Willerton gave a talk at the African Mathematical Seminar called ‘Looking at metric spaces as enriched categories’:

Willerton also posts a series(?) on this at the n-category cafe, starting with Metric spaces as enriched categories I.

(tbc?)

Previously in this series:

If machine learning, AI, and large language models are here to stay, there’s this inevitable conclusion:

At the start of this series, the hope was to find the topos of the unconscious. Pretty soon, attention turned to the shape of languages and LLMs.

In large language models all syntactic and semantic information is encoded is huge arrays of numbers and weights. It seems unlikely that $\mathbf{Set}$-valued presheaves will be useful in machine learning, but surely Huawei will prove me wrong.

$[0,\infty]$-enriched categories (aka generalised metric spaces) and associated $[0,\infty]$-enriched presheaves may be better suited to understand existing models.

But, as with ordinary presheaves, there are just too many $[0,\infty]$-enriched ones, So, how can we weed out the irrelevant ones?

For inspiration, let’s turn to evolutionary biology and their theory of phylogenetic trees. They want to trace back common (extinguished) ancestors of existing species by studying overlaps in the DNA.

(A tree of life, based on completely sequenced genomes, from Wikipedia)

The connection between phylogenetic trees and tropical geometry is nicely explained in the paper Tropical mathematics by David Speyer and Bernd Sturmfels.

The tropical semi-ring is the set $(-\infty,\infty]$, equipped with a new addition $\oplus$ and multiplication $\odot$

$$a \oplus b = min(a,b), \quad \text{and} \quad a \odot b = a+b$$

Because tropical multiplication is ordinary addition, a tropical monomial in $n$ variables

$$\underbrace{x_1 \odot \dots \odot x_1}_{j_1} \odot \underbrace{x_2 \odot \dots \odot x_2}_{j_2} \odot \dots$$

corresponds to the linear polynomial $j_1 x_1 + j_2 x_2 + \dots \in \mathbb{Z}[x_1,\dots,x_n]$. But then, a tropical polynomial in $n$ variables

$$p(x_1,\dots,x_n)=a \odot x_1^{i_1}\dots x_n^{i_n} \oplus b \odot x_1^{j_1} \dots x_n^{j_n} \oplus \dots$$

gives the piece-wise linear function on $p : \mathbb{R}^n \rightarrow \mathbb{R}$

$$p(x_1,\dots,x_n)=min(a+i_1 x_1 + \dots + i_n x_n,b+j_1 x_1 + \dots + j_n x_n, \dots)$$

The tropical hypersurface $\mathcal{H}(p)$ then consists of all points of $v \in \mathbb{R}^n$ where $p$ is not linear, that is, the value of $p(v)$ is attained in at least two linear terms in the description of $p$.

Now, for the relation to phylogenetic trees: let’s sequence the genomes of human, mouse, rat and chicken and compute the values of a suitable (necessarily symmetric) distance function between them:

From these distances we want to trace back common ancestors and their difference in DNA-profile in a consistent manner, that is, such that the distance between two nodes in the tree is the sum of the distances of the edges connecting them.

In this example, such a tree is easily found (only the weights of the two edges leaving the root can be different, with sum $0.8$):

In general, let’s sequence the genomes of $n$ species and determine their distance matrix $D=(d_{ij})_{i,j}$. Biology asserts that this distance must be a tree-distance, and those can be characterised by the condition that for all $1 \leq i,j,k,l \leq n$, among the three numbers

$$d_{ij}+d_{kl},~d_{ik}+d_{jl},~d_{il}+d_{jk}$$

the maximum is attained at least twice.

What has this to do with tropical geometry? Well, $D$ is a tree distance if and only if $-D$ is a point in the tropical Grassmannian $Gr(2,n)$.

Here’s why: let $e_{ij}=-d_{ij}$ then the above condition is that the minimum of

$$e_{ij}+e_{kl},~e_{ik}+e_{jl},~e_{il}+e_{jk}$$

is attained at least twice, or that $(e_{ij})_{i,j}$ is a point of the tropical hypersurface

$$\mathcal{H}(x_{ij} \odot x_{kl} \oplus x_{ik} \odot x_{jl} \oplus x_{il} \odot x_{jk})$$

and we recognise this as one of the defining quadratic Plucker relations of the Grassmannian $Gr(2,n)$.

More on this can be found in another paper by Speyer and Sturmfels The tropical Grassmannian, and the paper Geometry of the space of phylogenetic trees by Louis Billera, Susan Holmes and Karen Vogtmann.

What’s the connection with $[0,\infty]$-enriched presheaves?

The set of all species $V=\{ m,n,\dots \}$ , together with the distance function $d(m,n)$ between their DNA-sequences is a $[0,\infty]$-category. Recall that a $[0,\infty]$-enriched presheaf on $V$ is a function $p : V \rightarrow [0,\infty]$ satisfying for all $m,n \in V$

$$d(m,n)+p(n) \geq p(m)$$

For an ancestor node $p$ we can take for every $m \in V$ as $p(m)$ the tree distance from $p$ to $m$, so every ancestor is a $[0,\infty]$-enriched presheaf.

We also defined the distance between such $[0,\infty]$-enriched presheaves $p$ and $q$ to be

$$\hat{d}(p,q) = sup_{m \in V}~max(q(m)-p(m),0)$$

and this distance coincides with the tree distance between the nodes.

So, all ancestors nodes in a phylogenetic tree are very special $[0,\infty]$-enriched presheaves, optimal for the connection with the underlying $[0,\infty]$-enriched category (the species and their differences in genome).

We would like to garden out such exceptional $[0,\infty]$-enriched presheaves in general, but clearly the underlying distance of a generalised metric space, even when it is symmetric, is not a tree metric.

Still, there might be regions in the space where we can do the above. So, in general we might expect not one tree, but a forest of trees formed by the $[0,\infty]$-enriched presheaves, optimal for the metric we’re exploring.

If we think of the underlying $[0,\infty]$-category as the conscious manifestations, then this forest of presheaves are the underlying brain-states (or, if you want, the unconscious) leading up to these.

That’s why I like to call this mental picture the tropical brain-forest.

(Image credit)

Where’s the tropical coming from?

Well, I think that in order to pinpoint these ‘optimal’ $[0,\infty]$-enriched presheaves a tropical-like structure on these, already mentioned by Simon Willerton in Tight spans, Isbell completions and semi-tropical modules, will be relevant.

For any two $[0,\infty]$-enriched presheaves we can take $p \oplus q = p \wedge q$, and for every $s \in [0,\infty]$ we can define

$$s \odot p : V \rightarrow [0,\infty] \qquad m \mapsto max(p(m)-s,0)$$

and check that this is again a $[0,\infty]$-presheaf. The mental idea of $s \odot p$ is that of a fat point centered at $p$ with size $s$.

(tbc)

Previously in this series:

A month ago, Stephen Wolfram put out a little booklet (140 pages) What Is ChatGPT Doing … and Why Does It Work?.

It gives a gentle introduction to large language models and the architecture and training of neural networks.

The entire book is freely available:

The advantage of these online texts is that you can click on any of the images, copy their content into a Mathematica notebook, and play with the code.

This really gives a good idea of how an extremely simplified version of ChatGPT (based on GPT-2) works.

Downloading the model (within Mathematica) uses about 500Mb, but afterwards you can complete any prompt quickly, and see how the results change if you turn up the ‘temperature’.

You should’t expect too much from this model. Here’s what it came up with from the prompt “The major results obtained by non-commutative geometry include …” after 20 steps, at temperature 0.8:

 NestList[StringJoin[#, model[#, {"RandomSample", "Temperature" -> 0.8}]] &, "The major results obtained by non-commutative geometry include ", 20]

 

The major results obtained by non-commutative geometry include vernacular accuracy of math and arithmetic, a stable balance between simplicity and complexity and a relatively low level of violence. 

Lol.

In the more philosophical sections of the book, Wolfram speculates about the secret rules of language that ChatGPT must have found if we want to explain its apparent succes. One of these rules, he argues, must be the ‘logic’ of languages:

But is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.

What might this theory be like? Well, there’s one tiny corner that’s basically been known for two millennia, and that’s logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not.

Something else ChatGPT may have discovered are language’s ‘semantic laws of motion’, being able to complete sentences by following ‘geodesics’:

And, yes, this seems like a mess—and doesn’t do anything to particularly encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”. But perhaps we’re just looking at the “wrong variables” (or wrong coordinate system) and if only we looked at the right one, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics. But as of now, we’re not ready to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about how human language is “put together”.

So, the ‘hidden secret’ of successful large language models may very well be a combination of logic and geometry. Does this sound familiar?

If you prefer watching YouTube over reading a book, or if you want to see the examples in action, here’s a video by Stephen Wolfram. The stream starts about 10 minutes into the clip, and the whole lecture is pretty long, well over 3 hours (about as long as it takes to read What Is ChatGPT Doing … and Why Does It Work?).