Fun with Linear Transformations and Symbolic Maps

John Hewitt. June 8, 2025

One way I think about linear transformations is as a representation of a bunch of independent symbolic maps. Consider the following question, which I’ve used as a discussion and interview question when chatting with students interested in AI research.

Let $\mathbf{u}$ and $\mathbf{v}$ be orthonormal vectors in $\mathbb{R}^{d}$. Give a matrix $W$ such that $W\mathbf{u}=\mathbf{v}$, and $W\mathbf{v}=\mathbf{u}$, and everything orthogonal to the span of $\{\mathbf{u},\mathbf{v}\}$ maps to $0$.

I’ve loved discussing this question with students because it (and the extensions we talk about) really get at some simple, imo beautiful intuitions. Now that I’m writing this, I probably won’t use these questions again, so this is a forcing function for me to come up with new questions!

Breaking down the problem

$\mathbf{u}$ and $\mathbf{v}$ being orthonormal means that they are both unit norm:

\[\|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1,\]

and their dot product is zero:

\[\mathbf{u}^\top\mathbf{v} = 0\]

Students would sometimes get stuck by thinking too concretely: instantiating a $\mathbf{u}$ and a $\mathbf{v}$ such that the properties hold, like imagining that $\mathbf{u}=[1,0]$, $\mathbf{v}=[0,1]$, and $W = \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix}$. But we don’t get to assume which vectors we’re working with; all we have are the properties given.

The cute little mathematical bit here — and one I often hoped students would find when I told them that I’d be asking them this question in the interview — is the outer product, whose notation is delightfully transparent.

The outer product of $\mathbf{u}$ and $\mathbf{v}$ is:

\[\mathbf{u}\mathbf{v}^\top \in \mathbb{R}^{d\times d}\]

where any scalar entry $(\mathbf{u}\mathbf{v}^\top)_{ij}=\mathbf{u}_i \mathbf{v}_j$ for $i,j\in\{1,\dots,d\}$. What I love about the outer product is that its semantics in matrix-vector products is transparent in the notation, whereas the definition (each scalar’s value) seems non-obvious. Here’s what I mean:

\[(\mathbf{u}\mathbf{v}^\top)\mathbf{v} = \mathbf{u}(\mathbf{v}^\top \mathbf{v})\]

This transparency of notation is in how we can use the transpose and re-organize the parentheses here. So, my matrix $\mathbf{u}\mathbf{v}^\top$ when multiplied on the right with $\mathbf{v}$ produces... what? Well, go back to orthogonality and we get:

\[(\mathbf{u}\mathbf{v}^\top)\mathbf{v} = \mathbf{u}(\mathbf{v}^\top \mathbf{v}) = \mathbf{u}\|\mathbf{v}\|_2^2=\mathbf{u}*1= \mathbf{u}\]

Pretty cool! $\mathbf{u}\mathbf{v}^\top$ is a linear transformation that acts like a symbolic map; it takes the amount of $\mathbf{v}$ in an input and converts it to an equal amount of $\mathbf{u}$.
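
To make this concrete, here’s a minimal numpy sketch (the dimension, seed, and QR-based construction are arbitrary choices of mine): it draws two random orthonormal vectors and checks that $\mathbf{u}\mathbf{v}^\top$ really does send $\mathbf{v}$ to $\mathbf{u}$, and sends $\mathbf{u}$ (which is orthogonal to $\mathbf{v}$) to $0$.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
# The QR factorization of a random matrix gives orthonormal columns;
# take the first two as u and v.
Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
u, v = Q[:, 0], Q[:, 1]

outer = np.outer(u, v)            # u v^T, a d x d matrix of rank 1
print(np.allclose(outer @ v, u))  # True: the map converts v into u
print(np.allclose(outer @ u, 0))  # True: u, being orthogonal to v, maps to 0
```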

We’re already most of the way to a solution. The corresponding map from $\mathbf{u}$ to $\mathbf{v}$ is just $\mathbf{v}\mathbf{u}^\top$. But how do we do both? This is where we rely on the orthogonality of $\mathbf{u}$ and $\mathbf{v}$. It feels a little bit like cheating. We can just sum the two matrices:

\[W = \mathbf{u}\mathbf{v}^\top + \mathbf{v}\mathbf{u}^\top\]

Now let’s apply some vectors and see what happens:

\[\begin{aligned} &W\mathbf{u} = (\mathbf{u}\mathbf{v}^\top + \mathbf{v}\mathbf{u}^\top)\mathbf{u} = \mathbf{u}\mathbf{v}^\top\mathbf{u} + \mathbf{v}\mathbf{u}^\top \mathbf{u} = \mathbf{u}*0 + \mathbf{v}*\|\mathbf{u}\|_2^2 = \mathbf{v} \\ &W\mathbf{v} = (\mathbf{u}\mathbf{v}^\top + \mathbf{v}\mathbf{u}^\top)\mathbf{v} = \mathbf{u}\mathbf{v}^\top\mathbf{v} + \mathbf{v}\mathbf{u}^\top \mathbf{v} = \mathbf{u}\|\mathbf{v}\|_2^2 + \mathbf{v}*0 = \mathbf{u} \end{aligned}\]

Finally, let $\mathbf{z}$ be orthogonal to the span of $\mathbf{u}$ and $\mathbf{v}$. Then

\[W\mathbf{z} = (\mathbf{u}\mathbf{v}^\top + \mathbf{v}\mathbf{u}^\top)\mathbf{z} = \mathbf{u}\mathbf{v}^\top\mathbf{z} + \mathbf{v}\mathbf{u}^\top \mathbf{z} = \mathbf{u}*0 + \mathbf{v}*0 = 0\]

So, $W=\mathbf{u}\mathbf{v}^\top + \mathbf{v}\mathbf{u}^\top$ stores two independent symbolic maps, all for the price of one matrix. Neither is affected by the presence of the other.
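 
If you want to check this numerically, here’s a small sketch in the same style as above (again with arbitrary random orthonormal vectors built via QR):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
# Three orthonormal vectors: u, v, and a z orthogonal to the span of {u, v}.
Q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
u, v, z = Q[:, 0], Q[:, 1], Q[:, 2]

W = np.outer(u, v) + np.outer(v, u)

print(np.allclose(W @ u, v))            # True: u maps to v
print(np.allclose(W @ v, u))            # True: v maps to u
print(np.allclose(W @ z, np.zeros(d)))  # True: the orthogonal complement maps to 0
```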

Rank, periodicity, complexity

Around this point I’d often ask students if they were familiar with the concept of matrix rank, and if so, what the rank of $W$ is. The rank of $W$ is 2, and indeed, my favorite way of thinking about matrix rank is as the minimal number of outer products needed to sum to the matrix.

The rank of a matrix is sometimes thought of as relating to the complexity of the matrix. Certainly from the standpoint of a symbolic map, this makes sense — the more symbol-symbol relationships you want to store, the more complex the map is. Each relationship adds 1 to the rank.
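
A quick numerical sanity check of the rank claim (a sketch with the same arbitrary random setup as before):

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
u, v = Q[:, 0], Q[:, 1]

W = np.outer(u, v) + np.outer(v, u)
print(np.linalg.matrix_rank(np.outer(u, v)))  # 1: a single symbol-symbol map
print(np.linalg.matrix_rank(W))               # 2: two independent maps
```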

Consider the following follow-up question. What if I want to use $W$ to construct a matrix $M$ such that $M\mathbf{u}=\mathbf{u}$, and $M\mathbf{v}=\mathbf{v}$? (Sure, the identity matrix does this, but for our thought process we want $W$ to show up in the solution.) Thinking about this from the perspective of linear algebra, students would sometimes get stuck. I’d encourage them to go to the higher level of abstraction; just think of $W$ as a function that maps $\mathbf{u}$ to $\mathbf{v}$ and $\mathbf{v}$ to $\mathbf{u}$. If we call this function $f$, then $f(f(\mathbf{u})) = f(\mathbf{v}) = \mathbf{u}$. Then what should our matrix be?

\[M=W^2\]

We can check that $M\mathbf{u} = W W\mathbf{u} = W\mathbf{v} = \mathbf{u}$, etc. This then brings up an opportunity to talk about eigenvectors and eigenvalues. What are the eigenvectors of $M$? Eigenvectors are vectors $x$ such that $M x = \lambda x$ for scalar $\lambda$. So, we’ve got our eigenvectors already — $\mathbf{u}$ and $\mathbf{v}$!

So, eigenvectors with eigenvalue $1$, in our view of a matrix as a symbolic map, correspond to symbol-symbol mappings where the input and output symbols are the same. Cool!
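
Here’s a small sketch checking that $M = W^2$ fixes both vectors, i.e. that $\mathbf{u}$ and $\mathbf{v}$ are eigenvectors of $M$ with eigenvalue 1 (the random setup is, as before, arbitrary):

```python
import numpy as np

d = 8
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
u, v = Q[:, 0], Q[:, 1]

W = np.outer(u, v) + np.outer(v, u)
M = W @ W   # M = W^2

print(np.allclose(M @ u, u))  # True: eigenvector with eigenvalue 1
print(np.allclose(M @ v, v))  # True: eigenvector with eigenvalue 1
```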

Now I’d ask students to think about matrices $W^3$, $W^4$, etc. They’d note that all even powers of $W$ map $\mathbf{u}$ back to itself. We then would sometimes discuss linear “RNNs”, which have the following update rule:

\[\mathbf{h}_{t} = W\mathbf{h}_{t-1} + Ux_{t}\]

So some term $\mathbf{u}=U x_{0}$, after $k$ timesteps, would be represented in the network representation vector $\mathbf{h}_{k}$ as $W^k \mathbf{u}$. This is independent of anything else that has happened since then! (Ok, not quite, since due to finite precision floats, in practice, the other vectors you add in can matter. But this isn’t a blog post on finite precision considerations, which are fascinating as well.)
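
Here’s a toy sketch of that unrolling (the recurrence matrix, input matrix, and inputs are arbitrary random choices, just for illustration): the contribution of $x_0$ to $\mathbf{h}_k$ is exactly $W^k U x_0$, no matter what gets added afterwards.

```python
import numpy as np

d, k = 8, 5
rng = np.random.default_rng(4)
W = 0.1 * rng.normal(size=(d, d))                # an arbitrary recurrence matrix
U = rng.normal(size=(d, d))                      # an arbitrary input matrix
xs = [rng.normal(size=d) for _ in range(k + 1)]  # inputs x_0, ..., x_k

def run(inputs):
    """Run the linear recurrence h_t = W h_{t-1} + U x_t from h = 0."""
    h = np.zeros(d)
    for x in inputs:
        h = W @ h + U @ x
    return h

# Compare the run with all inputs to the run with x_0 zeroed out:
# the difference is exactly W^k (U x_0), up to floating point error.
diff = run(xs) - run([np.zeros(d)] + xs[1:])
print(np.allclose(diff, np.linalg.matrix_power(W, k) @ (U @ xs[0])))  # True
```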

How, I ask, would we construct a matrix such that instead of $W^2\mathbf{u}=\mathbf{u}$, we have something like $W^3\mathbf{u}=\mathbf{u}$? I’d also guarantee that the dimensionality $d$ of $\mathbf{u}$ is at least 3, so that we can pick a unit vector $\mathbf{z}$ orthogonal to both $\mathbf{u}$ and $\mathbf{v}$. The solution is epsilon from what we started with:

\[W = \mathbf{v}\mathbf{u}^\top + \mathbf{z}\mathbf{v}^\top + \mathbf{u}\mathbf{z}^\top\]

How does this work? Let’s see:

\[\begin{aligned} &W\mathbf{u} = (\mathbf{v}\mathbf{u}^\top + \mathbf{z}\mathbf{v}^\top + \mathbf{u}\mathbf{z}^\top)\mathbf{u} = \mathbf{v} + 0 + 0\\ &W\mathbf{v} = 0 + \mathbf{z} + 0\\ &W\mathbf{z} = 0 + 0 + \mathbf{u} \end{aligned}\]

So, we’ve got $W^3\mathbf{u} = \mathbf{u}$, and both $W\mathbf{u}$ and $W^2\mathbf{u}$ are orthogonal to $\mathbf{u}$. You could see this as $W$ “hiding” the presence of $\mathbf{u}$ for two steps, or you could see it as periodic behavior with period 3.
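
A numerical sketch of the period-3 behavior, with the same arbitrary random setup as before, now using three orthonormal vectors:

```python
import numpy as np

d = 8
rng = np.random.default_rng(5)
# Three orthonormal vectors u, v, z.
Q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
u, v, z = Q[:, 0], Q[:, 1], Q[:, 2]

W = np.outer(v, u) + np.outer(z, v) + np.outer(u, z)   # cycles u -> v -> z -> u

print(np.allclose(W @ u, v))                             # one step: u -> v
print(np.allclose(W @ (W @ u), z))                       # two steps: u -> z
print(np.allclose(np.linalg.matrix_power(W, 3) @ u, u))  # three steps: back to u
```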

The maximum period that you could construct this way in dimensionality $d$ is $d$, because that’s how many orthonormal symbols we can store without interference.

Non-linearity and storage

As a follow-up to these thoughts, I sometimes discuss with students what to do if I only have $d$ dimensions but want to store more than $d$ elements. What goes wrong if I try?

What goes wrong is that I get interference between “keys” — if I have a map $W$ with full rank and I try to add a new mapping to it as $W' = W + \mathbf{u}\mathbf{v}^\top$, I’m guaranteed that when I query for the key $\mathbf{v}$, I don’t just get $\mathbf{u}$; I also get some other vector that corresponds to something else.
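
Here’s a small sketch of that interference. The keys and values below are arbitrary random vectors of my choosing; the point is just that with more keys than dimensions, the keys can’t all be orthonormal, so querying one key also retrieves a bit of whatever was stored under the others.

```python
import numpy as np

d, n = 3, 4   # try to store n > d key-value mappings in d dimensions
rng = np.random.default_rng(6)

keys = rng.normal(size=(n, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit norm, but not orthogonal
values = rng.normal(size=(n, d))                     # arbitrary values to store

# Sum of outer products: value_i key_i^T for each mapping.
W = sum(np.outer(values[i], keys[i]) for i in range(n))

retrieved = W @ keys[-1]                   # query with the last key
print(np.allclose(retrieved, values[-1]))  # False: we don't get the value back cleanly
print(retrieved - values[-1])              # the interference from the other keys
```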

Inspiration and further reading

My thinking on linear transformations and symbolic maps was originally inspired by this great piece by Paul Smolensky:
Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems
Paul Smolensky. 1990.
Reading this really changed my intuitions about outer products. A lot of my thought processes about linear transformations and reasoning about neural mechanisms were poured into a piece I wrote on how RNNs can, with optimal memory, represent some formal languages that require stack manipulations:
RNNs can generate bounded hierarchical languages with optimal memory
John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, Christopher D. Manning. 2020.

This note was originally written in a mixture of plain text and TeX markup, which was then converted to HTML mainly by LLMs. I'm still figuring out this pipeline, so give me notes if things seem off.

Thanks to Houjun Liu for noting the error in an earlier draft: "outside the span of" isn't necessarily "orthogonal to the span of".
