The goal of this note is to give an overview of basic neural network components and considerations when modeling text.
Consider the text modeling problem we've discussed so far. We have sequences $x_1, \dots, x_n$ over a finite vocabulary $\mathcal{V}$. We want to define probability distributions:
$$ p(\cdot \mid x_{\lt i}) $$This $p(\cdot \mid x_{\lt i})$ notation denotes a $|\mathcal{V}|$-dimensional probability distribution, that is, a distribution over the vocabulary representing the probability of each word in the context of the prefix $x_{\lt i}$. We model this by (1) representing the prefix $x_{\lt i}$, and (2) projecting that representation to the space of the vocabulary, and (3) normalizing to a probability distribution using the softmax function. That is,
$$ \begin{align} p(\cdot \mid x_{\lt i}) &= \text{softmax}(U h_{\lt i}) \\ h_{\lt i} &= f(x_1, \dots, x_{i-1}) \end{align} $$where $h_{\lt i} \in \mathbb{R}^d$ for some fixed dimensionality $d$, and $U$, often called the "unembedding" or "softmax" matrix, is of shape $\mathbb{R}^{|\mathcal{V}| \times d}$. Holding this form constant, the question becomes: how do we compute the prefix representation $h_{\lt i}$?
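To make the shapes concrete, here's a tiny NumPy sketch of this softmax head. The vocabulary size, dimensionality, and all the values are toy placeholders, and $h_{\lt i}$ is just a random vector standing in for whatever representation function $f$ we end up choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                          # toy vocabulary size and hidden dimensionality

U = rng.normal(0, 0.1, size=(V, d))   # the "unembedding" / softmax matrix

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = rng.normal(size=d)                # placeholder prefix representation h_{<i}
p = softmax(U @ h)                    # a |V|-dimensional distribution over the vocabulary
print(p.shape, p.sum())               # (10,) 1.0
```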
One simple way to represent a prefix of text is to embed each token in $\mathbb{R}^d$, and then average the embeddings. To define this embedding, we often use a matrix $E \in \mathbb{R}^{d \times |\mathcal{V}|}$, and consider each word $x_i$ to be a one-hot vector in $\mathbb{R}^{|\mathcal{V}|}$, so that
$$ Ex_i $$is a vector in $\mathbb{R}^d$. Representing our prefix $x_{\lt i}$ as an average of all word embeddings is thus
$$ h_{\lt i} = \frac{1}{i-1} \sum_{j=1}^{i-1} Ex_j $$So, what's good and what's bad about this way of representing text? In the good column, it's (1) pretty cheap, and (2) parallelizable. We'll talk more about parallelization later, but for now, consider how I can compute $h_{\lt 4}$ at the same time as I compute $h_{\lt 3}$ if I want; representations of later prefixes don't depend on representations of earlier prefixes.
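Here's a minimal sketch of this averaged-embedding representation, assuming toy sizes, a randomly initialized $E$, and made-up token ids for the prefix; the one-hot bookkeeping is written out explicitly to match the math above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
E = rng.normal(0, 0.1, size=(d, V))   # embedding matrix

prefix = [3, 7, 2]                    # made-up token ids x_1, ..., x_{i-1}
onehots = np.eye(V)[prefix]           # each x_j as a one-hot vector in R^|V|
h = (E @ onehots.T).mean(axis=1)      # average of the word embeddings E x_j
print(h.shape)                        # (4,)
```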
There's one glaring issue here, though. This representation doesn't depend on the order of the words. That is, if I took the prefix "Uncle Iroh ran to Zuko" and the prefix "Zuko ran to Uncle Iroh", these would receive the same representation, despite certainly having different meanings.
Let's think through how we can incorporate the information of the positions of words into our representation of the prefix $x_{\lt i}$. Consider the following simple proposal. If we have a maximum sequence length $m$, then really, each position is an element from a finite vocabulary of positions $1, \dots, m$. Just like we embedded elements from our finite vocabulary $\mathcal{V}$, we can embed our positions! Let $p_1, \dots, p_m$ be vector embeddings of our positions $1, \dots, m$, each a vector in $\mathbb{R}^d$. (As always, imagine that they're randomly initialized.)
Now that we have position embeddings, let's try just adding in each position embedding to the corresponding word embedding in our average:
$$ h_{\lt i} = \frac{1}{i-1} \sum_{j=1}^{i-1} (Ex_j + p_j) $$Is this better than our previous position-independent average? Alas, no. In fact, this representation is also invariant to the ordering of the words in the prefix. Oops! Let's see why:
$$ \begin{align} h_{\lt i} &= \frac{1}{i-1} \sum_{j=1}^{i-1} (Ex_j + p_j) \\ &= \left( \frac{1}{i-1} \sum_{j=1}^{i-1} Ex_j \right) + \left( \frac{1}{i-1} \sum_{j=1}^{i-1} p_j \right) \quad \text{(due to additivity)} \end{align} $$What's happened here is that due to additivity, there's nothing tying each position (embedding) to each word (embedding). It all gets added together, so the fact that word $x_j$ appeared "with" position embedding $p_j$ is lost in the commutativity of addition. We need to combine the information of the word and its position in another way before we add things together and that pairing information is lost.
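We can check this numerically with a small sketch (toy sizes, random initialization, made-up token ids): permuting the prefix leaves the representation unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, m = 10, 4, 8
E = rng.normal(0, 0.1, size=(d, V))        # word embeddings (columns are E x for each word)
P = rng.normal(0, 0.1, size=(m, d))        # position embeddings p_1, ..., p_m (rows)

def rep(prefix):
    """(1/(i-1)) * sum_j (E x_j + p_j), for a list of token ids."""
    return np.mean([E[:, x] + P[j] for j, x in enumerate(prefix)], axis=0)

print(np.allclose(rep([3, 7, 2]), rep([2, 3, 7])))   # True: still order-invariant
```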
How about we combine with a linear transformation? Let $A \in \mathbb{R}^{d \times d}$, a linear transformation. (Again, imagine the entries in this matrix are just randomly sampled from small noise, like $\mathcal{N}(0, \epsilon)$.) Now consider:
$$ \begin{align} h_{\lt i} &= \frac{1}{i-1} \sum_{j=1}^{i-1} A(Ex_j + p_j) \\ &= \left( \frac{1}{i-1} \sum_{j=1}^{i-1} AEx_j \right) + \left( \frac{1}{i-1} \sum_{j=1}^{i-1} Ap_j \right) \quad \text{(due to additivity)} \end{align} $$Ok, that didn't work either. The linear transformation distributes, and then we're left with exactly the same additivity problem we had before. We have to non-linearly combine our word and position information. Non-linearity is powerful in that it binds variables together; it can compute new representations of objects that don't decompose into independent contributions from each element.
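The same numerical check confirms the algebra: with a randomly sampled $A$ thrown in, the representation is still invariant to permuting the prefix (again a toy sketch with made-up sizes and token ids).

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, m = 10, 4, 8
E = rng.normal(0, 0.1, size=(d, V))
P = rng.normal(0, 0.1, size=(m, d))
A = rng.normal(0, 0.1, size=(d, d))        # a random linear transformation

def rep(prefix):
    """(1/(i-1)) * sum_j A (E x_j + p_j)."""
    return np.mean([A @ (E[:, x] + P[j]) for j, x in enumerate(prefix)], axis=0)

print(np.allclose(rep([3, 7, 2]), rep([2, 3, 7])))   # True: A distributes over the sum
```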
What do we mean by a nonlinearity? Often we use this term to refer to operations that compute a (non-linear) elementwise transformation of a vector. By "elementwise" we mean that each dimension of the vector has a function performed on it independently of the values of all the other dimensions. Some famous examples are $\max(0,x)$ and the sigmoid, $\frac{e^x}{1+e^x}$. When we don't care exactly which nonlinearity is being used, we often just call the function $\sigma$. So, $\sigma(v)$ for some vector $v$ applies the scalar function $\sigma$ to each element of $v$.
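For concreteness, here are those two nonlinearities applied elementwise to a small made-up vector.

```python
import numpy as np

v = np.array([-2.0, 0.0, 3.0])

relu = np.maximum(0, v)            # max(0, x), applied to each element independently
sigmoid = 1 / (1 + np.exp(-v))     # e^x / (1 + e^x), also elementwise

print(relu)      # [0. 0. 3.]
print(sigmoid)   # [0.119..., 0.5, 0.952...]
```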
Let's use some nonlinearity $\sigma$ to bind the information in each word to its position before we average them together:
$$ h_{\lt i} = \frac{1}{i-1} \sum_{j=1}^{i-1} \sigma(Ex_j + p_j) $$Note how I can't decompose $\sigma(Ex_j + p_j)$ into $\sigma(Ex_j) + \sigma(p_j)$. (Why?) So, if I have a word $x \in \mathcal{V}$, it now matters whether it shows up at position $j$, resulting in representation $\sigma(Ex + p_j)$, compared to position $k$, resulting in representation $\sigma(Ex + p_k)$.
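Repeating the earlier numerical check on this new representation (same toy setup, with the sigmoid as the nonlinearity) shows that the order of the prefix now matters.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, m = 10, 4, 8
E = rng.normal(0, 0.1, size=(d, V))
P = rng.normal(0, 0.1, size=(m, d))
sigma = lambda z: 1 / (1 + np.exp(-z))     # any elementwise nonlinearity; sigmoid here

def rep(prefix):
    """(1/(i-1)) * sum_j sigma(E x_j + p_j)."""
    return np.mean([sigma(E[:, x] + P[j]) for j, x in enumerate(prefix)], axis=0)

print(np.allclose(rep([3, 7, 2]), rep([2, 3, 7])))   # False: the order of words now matters
```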
In the world of deep learning and neural networks, you'll often hear the term "expressivity": what is the set of functions that a model I'm specifying could represent? That is, can I express complicated (or, interesting) functions of the input? Our average of word embeddings lacked expressivity because it couldn't express any function that depended on the order of the words in the sequence.
Now that we've non-linearly combined each word with its position, are there any interesting kinds of functions that we still definitely can't express? Yes; this class of functions we've defined is still quite limited.
Intuitively, the issue is that words only interact linearly with each other.
Let's say we have two word sequences, $x_1,\dots,x_m$ and $x'_1,\dots,x'_m$.
Further, let them be the same except at position $k$, that is, for all $i\not=k$, $x_i=x'_i$, and $x_k\not=x'_k$.
These sequences could have very different meanings, like "I like pizza, so I went to buy" and "I hate pizza, so I went to buy".
But the representations differ only by the value of "like" vs. "hate", not by how they differ in context.
That is, if the first sequence were "I, like, don't love this movie", swapping out "like" for "hate" doesn't have the same effect on meaning as it does in the first example, but under this model it changes the representation in exactly the same way.
This lack of expressivity can be stated as follows.
If $h_{m}$ is the representation of $x_1,\dots,x_m$ and $h'_{m}$ is the representation of $x'_1,\dots,x'_m$, then the representations relate to each other as
$$ h'_{m} = h_{m} + \frac{1}{m}\left(\sigma(Ex'_k + p_k) - \sigma(Ex_k + p_k)\right), $$a fixed additive shift that doesn't depend on any of the other words in the sequence.
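A quick sketch verifies this relationship (toy sizes, made-up sequences that differ only at one position; positions are 0-indexed in the code).

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, m = 10, 4, 5
E = rng.normal(0, 0.1, size=(d, V))
P = rng.normal(0, 0.1, size=(m, d))
sigma = lambda z: 1 / (1 + np.exp(-z))

def rep(seq):
    """(1/m) * sum_j sigma(E x_j + p_j) over the whole sequence."""
    return np.mean([sigma(E[:, x] + P[j]) for j, x in enumerate(seq)], axis=0)

x  = [1, 4, 6, 2, 9]
xp = [1, 4, 8, 2, 9]                 # identical except at position k
k = 2

shift = (sigma(E[:, xp[k]] + P[k]) - sigma(E[:, x[k]] + P[k])) / m
print(np.allclose(rep(xp), rep(x) + shift))    # True: a fixed additive shift
```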
Intuitively, we want words to interact non-linearly with each other, so that the change of a single word can have impact on the representation of the sequence that can depend on the identities of the other words.
To combine each word with its position, we used a non-linearity. We can use the same trick here to have these word-position tuples interact with each other. Consider a very simple change to our existing representation model:
$$ h_{\lt i} = \sigma\left(\frac{1}{i-1} \sum_{j=1}^{i-1} \sigma(Ex_j + p_j)\right) $$Now we can't separate out the contribution of each word the way we did in the additive decompositions above.
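Putting the pieces together, here's a sketch of the full toy model so far, from made-up token ids to a distribution over the next word; all names, sizes, and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, m = 10, 4, 8
E = rng.normal(0, 0.1, size=(d, V))        # word embeddings
P = rng.normal(0, 0.1, size=(m, d))        # position embeddings
U = rng.normal(0, 0.1, size=(V, d))        # unembedding / softmax matrix
sigma = lambda z: 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rep(prefix):
    """sigma of the average of sigma(E x_j + p_j): words now interact non-linearly."""
    return sigma(np.mean([sigma(E[:, x] + P[j]) for j, x in enumerate(prefix)], axis=0))

p_next = softmax(U @ rep([3, 7, 2]))           # distribution over the next word
print(p_next.shape, round(p_next.sum(), 6))    # (10,) 1.0
```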
It may be useful to think of a non-linearity as synthesizing a new result from the components that were added (up to linear transformation) to form the input vector.
Is this a good model? Probably not. Writing out the intuition of this model, it says something like: (1) each word's meaning should be modified based on its position. Then (2) all words are combined at once, with equal weight. Then (3) a simple non-linear function (e.g., a sigmoid) non-linearly combines those components. And (4) the resulting mess is used to form a distribution over the vocabulary. This doesn't seem like a very good model, and much of this course will be spent developing better ones.
In future notes, we'll discuss many (better) intuitions for non-linear representation functions, and the properties that make them useful.