The research team at SigOpt works to provide the best Bayesian optimization platform for our customers. In our spare time, we also engage in research projects in a variety of other fields. This blog post highlights one of those recent projects which will be presented Tuesday February 6 at AAAI 2018 (2:00pm ET, Poster 1127, Jefferson room). For those who cannot attend that session, we discuss here the topic of embeddings of vectors and the computational gains available when using circulant binary embeddings.

## General Embeddings

Embedding a vector consists of mapping it to a different space, with the goal of studying it in a lower dimension, finding a more fruitful representation of the vector, or transforming it to a structured space with desirable properties. Generally, an embedding can be expressed as a continuous function \(f\)

\(f: \mX \to \mY\)

where \(\mX \in \RR^{d \times N}\) and \(\mY \in \RR^{k \times N}\). Note that the input and output spaces are matrices with \(N\) columns, to account for the fact that generally we are interested in embedding several vectors simultaneously. An example embedding might be a linear embedding, which would be written as

\(\mY = \mW^\top \mX\)

where \(\mW \in \RR^{d \times k}\). The figures below show some data in 2 dimensions and sample embeddings in 1 dimension.

In applications such as hashing, it is commonly beneficial to consider random embeddings of vectors. The distribution of outcomes for randomly chosen embeddings on a set of input data can say something about the structure of the data.^{1} Similarly, certain properties might be preserved in the randomly embedded versions of vectors with high probability, such as a sense of distance. The figure below demonstrates that random embeddings can preserve a sense of relative distance with high probability.

## Binary Embeddings

While the idea of embedding vectors in lower dimensional space is a value proposition, akin to dimensionality reduction, the idea most commonly finds its home in embedding vectors in a binary space. Essentially, the goal is to encode data in \(\RR^d\) as \(k\)-length binary vector with values in \(\{-1, 1\}^k\).^{2} The standard assumption is that all vectors have 2-norm equal to 1, and are on the \(d-1\)-sphere \(\cS^{d-1}\).

This mapping is nonlinear, and thus the function defining it must involve more than a matrix-vector product. A standard mapping consists of applying the sign function to the result of the linear mapping

\(h(\xx) = \sgn(\mW^\top \xx).\)

This concept of random projections to create binary embeddings can provide a valuable mechanism for storing and operating on data efficiently (here is a great article discussing the topic). Additionally, an important theory exists regarding the quality of such embeddings by considering the relationship between distances in the physical space and the embedded space (related to the idea expressed in Figure 2).

The proximity of two vectors \(\xx_i,\xx_j\in\cS^{d-1}\) can be judged most naturally by considering the angle between them, denoted \(\theta_{\xx_i,\xx_j}\).^{3} Probably the most natural way to talk about the proximity of two vectors in \(\{-1,1\}^k\), Hamming distance, \(d_H(h(\xx_i), h(\xx_j))\), which measures the number of non-matching bits. (Jacques et al. 2013) showed that there is an important relationship between these two senses of distance for random binary embeddings.

**Theorem 1**. Given \(0\lt\epsilon\lt1\) and \(n\) vectors \(\xx_i \in \mathcal{S}^{d – 1}\) for \(1\leq i\leq n\), assume \(k = \cO(\epsilon^{-2} \log n)\). Then, with probability at least \(1 – \me^{-c \epsilon^2 k}\),

\(\left| d_{\textrm{H}}(h(\xx_i), h(\xx_j)) – \frac{1}{\pi}\theta_{\xx_i, \xx_j} \right| \leq \epsilon,\)

for some constant \({c > 0}\).

This theorem says that, with high probability, vectors that are close in physical space will be close in embedded space. This is a beneficial result because computations in the \(k\)-dimensional bit vector space are often much more efficient than in the original space. Some analyses on data, such as clustering, which rely on distance computations benefit from this efficiency. The condition \(k = \cO(\epsilon^{-2} \log n)\) is often referred to as the bit complexity: the number of bits required to achieve some desired accuracy in the binary space.

The computation of these embeddings is, however, a potentially costly process, primarily because a \(\mW\in\RR^{d \times k}\) matrix full of (generally) normal random values needs to be created and \(\mW^\top\mX\) must be computed. If computing the embedding costs more than the savings of working with the embedded version of the data, then there is a restricted benefit.^{4} Initial attempts to attack this problem involved bilinear projections, along with other structured matrix strategies, to reduce the computational and/or storage cost.

## Circulant Binary Embeddings

As indicated in the title, our research focused on enforcing a circulant structure on \(\mW\) for efficient computation in a strategy named, unsurprisingly, circulant binary embeddings. The original article (Yu et al. 2014) presented the idea of restricting the complexity of the embedding matrix to \(\mW^\top=\mG\mD\)^{5} where

- \(\mD\in\RR^{d\times d}\) is a diagonal matrix with \(\pm1\)randomly chosen on the diagonal (a Rademacher sequence), and
- \(\mG\in\RR^{d\times d}\) is a circulant matrix defined from the vector \(\gg\), which has standard normal distributed vectors.
^{6}

A circulant matrix \(\mG\in\RR^{d\times d}\) is a special type of Toeplitz matrix defined using \(\gg\in\RR^d\) such that

\(\gg=\begin{pmatrix}g_1\\g_2\\\vdots\\g_d\end{pmatrix},\qquad

\mG=\begin{pmatrix}g_1&g_d&\cdots&g_2 \\ g_2&g_1&\cdots&g_3 \\ \vdots&\vdots&\ddots&\vdots \\ g_d&g_{d-1}&\cdots&g_1\end{pmatrix}.\)

Circulant matrices have the benefit of being both cheaper to store and only requiring \(\cO(d\log d)\) to compute matrix-vector products rather than \(\cO(d^2)\). This can be done using the fast Fourier transform: \(\mG\xx=\cF^{-1}\left(\cF(\gg)\odot\cF(\xx)\right)\), where \(\odot\) denotes the elementwise product of two vectors. The full binary embedding function, then, is

\(h_{\text{circ}}(\xx) = \sgn\left(\cF^{-1}\left(\cF(\gg)\odot\cF(\mD\xx)\right)\right).\)

## Our Article: Optimality of Circulant Binary Embeddings

The benefit of working with circulant matrices comes at a cost, however, because the \(W\) matrix no longer has \(kd\) independent random variables; its values are dependent because of the relationship between the columns of a circulant matrix. As a result, the Theorem 1 described above no longer directly applies.

One standard condition that is required for analysis of circulant binary embeddings is a small norm of all the vectors \(\|\xx_i\|_\infty\). The articles below use that, and other assumptions, to study the impact of the reduction in random complexity from the unstructured case. They show that a greater bit complexity is needed to achieve the same probabilistic bound.

- (Yu et al. 2015) was able the prove that \(k=\cO(\epsilon^{-2}\log^2 n)\) would be sufficient.
- (Oymak 2016) improved on that result to require only \(k=\cO(\epsilon^{-3}\log n)\) many bits for the same accuracy.

In our presentation at AAAI (which can be viewed here) we prove that the **optimal** bit complexity \(k=\cO(\epsilon^{-2}\log n)\) (associated with an unstructured \(W\) in Theorem 1) can be achieved for circulant binary embeddings, subject to the same smallness of the \(\xx_i\) vectors needed for the other bounds. In the figures below we show the error described in Theorem 1 decreasing as a function of \(k\) and one application where the binary embedding can be used efficiently.

## Conclusion

As computational tools for machine learning and data science evolve, it will become increasingly necessary to develop intelligent and efficient ways to manage and access data. Circulant binary embeddings are one of many strategies which can be used to compress data while retaining some measure of its original properties. Here at SigOpt, we are interested in understanding and contributing to research on this and other topics, as well as getting out to conferences and learning about the latest and greatest ideas primed to change the world. I hope you will stop by and see my (Jungtaek Kim) presentation at AAAI on Tuesday!