
Bounding Vectors: The Gate Means

Upon close observation, the above technique of taking derivatives of Q and setting them to zero turns out to be intractable for the case of the means. The resulting derivative contains transcendental functions which cannot be easily isolated to obtain an update equation. What we can do at this point is bound the Q function specifically for the means to obtain a simpler expression which can be easily maximized. We introduce the previously discussed parabolic bounds for the Q function as in Equation 7.21. Equivalently, each pair of gate and data point is individually bounded by a parabola as in Equation 7.22 (with height parameter $k_{im}$, width parameter $w_{im}$ and location parameter $\gamma_{im}$).


 
$\displaystyle \begin{array}{lll}
Q(\Theta^t,\Theta^{(t-1)}) & = & \sum_{i=1}^N \sum_{m=1}^M
Q(\Theta^t,\Theta^{(t-1)})_{im} \\
 & \geq & \sum_{i=1}^N \sum_{m=1}^M k_{im} -
w_{im} \Vert \mu_x^m-\gamma_{im} \Vert^2 \\
\end{array}$     (7.21)


 
$\displaystyle \begin{array}{ccc}
Q(\Theta^t,\Theta^{(t-1)})_{im} & \geq &
k_{im} - w_{im} \Vert \mu_x^m-\gamma_{im} \Vert^2
\:\:\:\:\: \Longrightarrow {\rm equality \:\: at \:\:} \mu_x^m = \mu_x^{m*}
\end{array}$     (7.22)

Once again, the contact point for this parabola (i.e. the $\mu_x^{m*}$) is the value under the previous model $\Theta^{(t-1)}$, namely the mean of the gate before the current iteration. Let us denote the previous parameters of the gate as $\mu_x^{m*}$ and $\Sigma_{xx}^{m*}$. To facilitate the derivation, we can perform what is commonly called a whitening operation on the coordinate system for each gate. For each gate, we can consider that its previous parameters were zero-mean with identity covariance. This is done by applying an affine transformation to the ${\bf x}$-space (i.e. shifting the data points ${\bf x}_i$ appropriately): the data is translated by the mean and scaled by the inverse of the square root of the covariance matrix. In principle nothing changes, except that each gate observes affinely displaced data instead and therefore appears to have a zero mean and identity covariance as its previous operating point. This whitening operation does not need to be performed explicitly (i.e. numerically) but helps simplify the following derivation of the bound.
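As an illustration of this conceptual step, the short Python sketch below whitens a data set against one gate's previous parameters by translating by the previous mean and scaling by the inverse square root of the previous covariance. The function name, the random example data and the use of an eigendecomposition for the matrix square root are illustrative choices, not part of the derivation.

\begin{verbatim}
import numpy as np

def whiten_for_gate(X, mu_prev, Sigma_prev):
    """Map data into a gate's whitened frame, where the gate's previous
    parameters become mean 0 and identity covariance:
    x_tilde = Sigma_prev^{-1/2} (x - mu_prev)."""
    # Inverse matrix square root via eigendecomposition of the symmetric,
    # positive definite previous covariance.
    vals, vecs = np.linalg.eigh(Sigma_prev)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return (X - mu_prev) @ inv_sqrt

# Example: 200 two-dimensional points whitened against one gate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X_tilde = whiten_for_gate(X, np.array([1.0, -2.0]),
                          np.array([[2.0, 0.3], [0.3, 0.5]]))
\end{verbatim}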

Thus, without loss of generality, we consider that each data point has been appropriately whitened for each gate. The result is that $\mu_{x}^{m*} := 0$ and $\Sigma_{xx}^m = \Sigma_{xx}^{m*} = I$. Using this assumption, we can compactly solve for the parameters of the bounding parabola ($k_{im}, w_{im}, \gamma_{im}$). This is done by meeting the three requirements (contact, tangentiality and tightness). The result is shown in Equation 7.23.


 
$\displaystyle \begin{array}{lll}
k_{im} & = & {\hat h}_{im} ( \log {\cal N} ({\bf x}_i;0,I) )
- r_i \alpha_m {\cal N} ({\bf x}_i;0,I) + w_{im} \Vert \gamma_{im} \Vert^2 \\
\gamma_{im} & = & \frac{ {\hat h}_{im} - r_i \alpha_m {\cal N} ({\bf x}_i;0,I) }
{2 w_{im}} \: {\bf x}_i \\
w_{im} & \geq & \frac{
r_i \alpha_m \left ( {\cal N} ({\bf x}_i;\mu_x^m,I) - {\cal N} ({\bf x}_i;0,I)
- {\cal N} ({\bf x}_i;0,I) \: {{\bf x}_i}^T{\mu_x^m} \right )
}
{{\mu_x^m}^T {\mu_x^m}} + \frac{{\hat h}_{im}}{2}
\end{array}$     (7.23)
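Assuming the bound parameters take the form given in Equation 7.23, the following minimal Python sketch evaluates the contact and tangentiality conditions for one gate/data-point pair in the whitened frame, given a width $w_{im}$ and the expectation-step quantities ${\hat h}_{im}$, $r_i$ and $\alpha_m$ (all names and example values here are illustrative).

\begin{verbatim}
import numpy as np

def parabola_params(x_i, h_im, r_i, alpha_m, w_im):
    """Contact height k_im and location gamma_im of the parabolic bound for
    one gate / data-point pair, in the whitened frame (previous mean 0,
    identity covariance)."""
    d = x_i.shape[0]
    # Whitened Gaussian evaluated at the previous mean (the origin).
    n0 = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * x_i @ x_i)
    # Tangentiality: the parabola's gradient at the origin matches that of Q_im.
    gamma_im = (h_im - r_i * alpha_m * n0) / (2.0 * w_im) * x_i
    # Contact: the parabola touches Q_im at the origin.
    k_im = h_im * np.log(n0) - r_i * alpha_m * n0 + w_im * (gamma_im @ gamma_im)
    return k_im, gamma_im

# Example with an arbitrary whitened point and expectation-step quantities.
k, gamma = parabola_params(np.array([0.7, -1.2]), h_im=0.5,
                           r_i=0.9, alpha_m=0.3, w_im=1.4)
\end{verbatim}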

As shown earlier, if a globally valid bound is desired, the value of $w_{im}$ must always be greater than or equal to the expression in Equation 7.23, regardless of the value of $\mu_x^m$. Thus we can find the minimum $w_{im}$ (i.e. $w_{im}^{\min}$) for which this is true and form a parabolic bound on the Q function that is as tight as possible. We shall now determine an efficient way to compute $w_{im}^{\min}$ without searching the high dimensional space of $\mu_x^m$ at each iteration.

Note in the expression for $w_{im}$ above that $\mu_x^m$ appears only in an inner product with itself (${\mu_x^m}^T {\mu_x^m}$) or in an inner product with the whitened data point (${{\bf x}_i}^T{\mu_x^m}$). Thus, regardless of the dimensionality of the vector space, these two quantities can be represented with at most two degrees of freedom: the magnitude of $\mu_x^m$ and the cosine of the angle between the vectors $\mu_x^m$ and ${\bf x}_i$. Let us denote the new variables as follows: $\lambda$ is the magnitude of $\mu_x^m$, $\rho$ is the magnitude of ${\bf x}_i$ and $\zeta = \cos(\phi)$ is the cosine of the angle between the two vectors, so that ${{\bf x}_i}^T{\mu_x^m} = \zeta\lambda\rho$ and ${\mu_x^m}^T{\mu_x^m} = \lambda^2$. We can then compactly rewrite the bound as in Equation 7.24.


 
$\displaystyle \begin{array}{lll}
w_{im} & \geq & r_i \alpha_m
\frac{
e^{-\frac{1}{2}\rho^2}
\left ( e^{\zeta\lambda\rho - \frac{1}{2}\lambda^2} - \zeta\lambda\rho -1 \right )
}
{\lambda^2} + \frac{{\hat h}_{im}}{2}
\end{array}$     (7.24)

The width $w_{im}$ is now lower bounded by the right hand side, which is a function of two free scalars $\lambda$ and $\zeta$. We wish to find the maximum of this lower bound over $\lambda$ and $\zeta$. By taking the derivative of the right hand side with respect to $\zeta$ and setting it to 0, we find an extremum of the function at $\zeta = \frac{\lambda}{2\rho}$. However, we also note that $\zeta = \cos(\phi)$ and so $\zeta \in [-1,1]$. If we set $\zeta = 1$, we note that the resulting value of the right hand side is always greater than its value at $\zeta = \frac{\lambda}{2\rho}$ (the right hand side is convex in $\zeta$), so the extremum we solved for is a minimum. Thus, the maximum occurs at the ends of the interval, i.e. $\zeta = \pm 1$. Therefore, we have the relationship $\mu_x^m \propto {\bf x}_i$; in other words, let us define $\mu_x^m = c{\bf x}_i$. The expression for $w_{im}$ then simplifies into a function of a single free variable $c$, as in Equation 7.25.


 
$\displaystyle \begin{array}{lll}
w_{im} & \geq & r_i \alpha_m e^{-\frac{1}{2} \rho^2} \: g(c,\rho) +
\frac{{\hat h}_{im}}{2} \\
g(c,\rho) & = & \frac{e^{c\rho^2 - \frac{1}{2}c^2\rho^2} - c\rho^2 - 1}{c^2 \rho^2} \\
w_{im}^{\min} & = & r_i \alpha_m e^{-\frac{1}{2} \rho^2} \:
\max_c \{ g(c,\rho) \} + \frac{{\hat h}_{im}}{2} \\
\end{array}$     (7.25)
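As a quick numerical sanity check of the end-point argument above (a sketch based on the right hand side of Equation 7.24; the sampling ranges are arbitrary), the bound evaluated at $\zeta = \pm 1$ always dominates its value at the stationary point $\zeta = \frac{\lambda}{2\rho}$:

\begin{verbatim}
import numpy as np

def bound_term(zeta, lam, rho):
    """The zeta- and lambda-dependent factor of the right hand side of Eq. 7.24."""
    return (np.exp(zeta * lam * rho - 0.5 * lam ** 2)
            - zeta * lam * rho - 1.0) / lam ** 2

rng = np.random.default_rng(0)
for lam, rho in rng.uniform(0.1, 3.0, size=(1000, 2)):
    zeta_star = min(lam / (2.0 * rho), 1.0)   # stationary point, clipped to [-1, 1]
    at_ends = max(bound_term(1.0, lam, rho), bound_term(-1.0, lam, rho))
    assert at_ends >= bound_term(zeta_star, lam, rho) - 1e-12
\end{verbatim}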

For each value of $\rho$ we find the value of $c$ which maximizes $g(c,\rho)$. This is done offline (taking a few seconds) using Brent's method and the result is stored in a one-dimensional lookup table (LUT). The values are shown in Figure 7.3(a), where we plot the maximum of $g$ as a function of $\rho$. Using this table, we can immediately obtain the minimum width $w_{im}^{\min}$ by looking up the value corresponding to the current data point's $\rho$ and can subsequently compute the rest of the parabolic bound's parameters.
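A possible offline construction of this table is sketched below in Python, assuming the form of $g(c,\rho)$ given in Equation 7.25. The grid range, the search brackets and the use of SciPy's bounded Brent-style scalar minimizer are illustrative choices rather than details prescribed here.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def g(c, rho):
    """g(c, rho) from Eq. 7.25; its c -> 0 limit is (rho**2 - 1) / 2."""
    if abs(c) < 1e-8:
        return 0.5 * (rho ** 2 - 1.0)
    return (np.exp(c * rho ** 2 - 0.5 * c ** 2 * rho ** 2)
            - c * rho ** 2 - 1.0) / (c ** 2 * rho ** 2)

def max_g(rho):
    """max_c g(c, rho), searching the c > 0 and c < 0 branches separately
    (the negative-branch maximiser moves out to roughly c = -2 / rho**2)."""
    pos = minimize_scalar(lambda c: -g(c, rho), bounds=(1e-6, 4.0),
                          method='bounded')
    neg = minimize_scalar(lambda c: -g(c, rho),
                          bounds=(-(2.0 / rho ** 2 + 10.0), -1e-6),
                          method='bounded')
    return max(-pos.fun, -neg.fun, g(0.0, rho))

# Offline 1D look-up table over a grid of whitened data magnitudes rho.
rho_grid = np.linspace(0.05, 6.0, 400)
lut = np.array([max_g(rho) for rho in rho_grid])

def w_min(rho, r_i, alpha_m, h_im):
    """Minimum valid parabola width for one gate / data-point pair."""
    return (r_i * alpha_m * np.exp(-0.5 * rho ** 2)
            * np.interp(rho, rho_grid, lut) + 0.5 * h_im)
\end{verbatim}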


  
Figure 7.3: Look-Up Table and Example Mean Parabolic Bound. (a) Parabolic Width LUT; (b) 1D Mean Bounding Example.

An example of a parabolic bound is shown in Figure 7.3(b) under a sub-component of the Q function. This sub-component corresponds to a single data point as we vary one gate's univariate mean. Many such parabolic bounds are computed (one for each combination of data point and gate) and are summed to bound the Q function in its entirety.
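The construction can also be checked numerically for such a one-dimensional sub-component, in the spirit of Figure 7.3(b). The sketch below (following Equations 7.22 and 7.23, with arbitrary example values) finds an adequate width by brute force over a grid of means, builds the parabola from the contact and tangency conditions, and asserts that it lower bounds the sub-component everywhere on the grid while touching it at the previous mean.

\begin{verbatim}
import numpy as np

# One whitened, one-dimensional data point and expectation-step quantities
# (arbitrary illustrative values).
x_i, h_im, r_i, alpha_m = 1.3, 0.6, 0.8, 0.25

def Q_im(mu):
    """Gate term of Q for one pair, in the whitened frame (previous mean 0)."""
    n = np.exp(-0.5 * (x_i - mu) ** 2) / np.sqrt(2.0 * np.pi)
    return h_im * np.log(n) - r_i * alpha_m * n

mu_grid = np.linspace(-6.0, 6.0, 2001)
n0 = np.exp(-0.5 * x_i ** 2) / np.sqrt(2.0 * np.pi)

# Width: maximum of the Eq. 7.23 requirement over the (nonzero) grid points.
nz = mu_grid[np.abs(mu_grid) > 1e-6]
n_mu = np.exp(-0.5 * (x_i - nz) ** 2) / np.sqrt(2.0 * np.pi)
w = np.max(r_i * alpha_m * (n_mu - n0 - n0 * x_i * nz) / nz ** 2) + 0.5 * h_im

# Contact and tangency at the previous mean (the origin).
gamma = (h_im - r_i * alpha_m * n0) * x_i / (2.0 * w)
k = Q_im(0.0) + w * gamma ** 2

parabola = k - w * (mu_grid - gamma) ** 2
assert np.all(parabola <= Q_im(mu_grid) + 1e-9)    # lower bound on the grid
assert np.isclose(k - w * gamma ** 2, Q_im(0.0))   # equality at the previous mean
\end{verbatim}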


  
Figure 7.4: Look-Up Table and Example Mean Piece-Wise Parabolic Bound. (a) Parabolic Width LUT; (b) 1D Mean Bounding Example.

To obtain a final answer for the update of the gate means $\mu_x^m$, we simply maximize the sum of parabolas which bounds Q in its entirety. For $M$ gates and $N$ data points, we obtain $M \times N$ parabolas, each with its own ($w_{im}^{\min}, k_{im}, \gamma_{im}$). The update is given in Equation 7.26.


 
$\displaystyle \begin{array}{lll}
\mu_x^m & := & \frac{ \sum_{i=1}^N w_{im}^{\min} \: \gamma_{im}}{\sum_{i=1}^N w_{im}^{\min}}
\end{array}$     (7.26)
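A compact sketch of this closed-form update is given below (the array names and shapes are illustrative; the widths and locations are assumed to have already been computed for every gate and data point in each gate's whitened frame). Since the whitening is only conceptual, an implementation that whitens explicitly would map the resulting means back through the corresponding inverse affine transform.

\begin{verbatim}
import numpy as np

def update_gate_means(w_min, gamma):
    """Maximiser of the summed parabolic bound (Eq. 7.26).

    w_min : (N, M) array of minimum parabola widths w_im.
    gamma : (N, M, d) array of parabola locations gamma_im.
    Returns an (M, d) array of updated gate means (whitened frame)."""
    num = np.einsum('im,imd->md', w_min, gamma)   # sum_i w_im * gamma_im
    den = w_min.sum(axis=0)[:, None]              # sum_i w_im
    return num / den
\end{verbatim}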

