Implementing the Gauss-Newton Algorithm for Sphere Fitting (1 of 3)

In some calibration strategies we encounter a sphere fitting problem: we need to find the map that does the best job sending the observed data to a unit sphere. This simple problem statement leaves two critical questions unanswered.

1. What family of maps are we picking from?
2. How do we measure “best”?

I’ll outline the wealth of options we have for answering these questions in later posts. This article will specify reasonable answers to these questions, use those answers to pose a clear problem, then describe an algorithm to solve the problem.

The naïve algorithm presented here is easy to implement but requires too much memory for many micro-controller applications. Some simple refinements give us a second algorithm that still requires space linear in the number of samples, but with a constant small enough to make it usable on an Arduino. Finally we observe that we can collect 25 summary statistics of our data and have all the information we need to carry out the Gauss-Newton algorithm, giving us a constant space, constant time streaming algorithm. These algorithms will be described in detail in later articles.

Although the focus here is on sphere fitting, the methods described, including the constant-space streaming algorithm, have straightforward (if tedious) extensions to very general polynomial mapping problems.

1. Notation

We denote real 3-vectors with boldface lower-case Latin letters and label the axes with the integers ${\{0, 1, 2\}}$. So ${\mathbf{x} = (x_0, x_1, x_2) \in {\mathbb R}^3}$. We assume that we are given a set of ${N}$ 3-vectors ${\mathbf{x}_i = (x_{i0},x_{i1}, x_{i2})}$.

We work with a real vector space of parameters — 6-dimensional in this application — and will denote these parameter vectors with Greek letters that have an overbar. For example ${\bar{\beta}, \bar{\Delta}}$ are parameter vectors.

We also work with ${N}$-vectors where ${N}$ is the number of samples. These are denoted by boldface lower-case Latin letters with an arrow above. The residual vector ${\vec{\mathbf{r}} \in {\mathbb R}^N}$ is one example we will encounter soon.

Matrices of any dimension will be denoted by boldface upper-case Latin letters and transpose is denoted by ${\top}$, so ${\mathbf{M}}$ is a matrix and ${\mathbf{M}^{\top}}$ is its transpose.

The Latin letter ${i}$ will be used to index the ${N}$ samples and other sets of size ${N}$. The letters ${j}$ and ${k}$ will index the dimensions of space: ${j,k \in \{0,1,2\}}$.

2. The Problem

To answer Question 1, we restrict our attention to a 6-parameter family of maps that allows each axis to be shifted and scaled independently.

Definition 1 (Axial Affine Transformation) Given a 3-vector ${\mathbf{x} = (x_0, x_1, x_2)}$ and parameters ${\bar{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5)}$, define the axial affine transformation of ${\mathbf{x}}$ for parameters ${\bar{\beta}}$ to be $\displaystyle A_{\bar{\beta}}(\mathbf{x}) = \left( \frac{x_0 - \beta_0}{\beta_3}, \frac{x_1 - \beta_1}{\beta_4}, \frac{x_2 - \beta_2}{\beta_5} \right)$

Three comments are in order here.

• This is a special case of a general affine transformation on ${{\mathbb R}^3}$; we have just ruled out shear. The techniques described below extend in a straightforward way to handle general affine transformations, but the calculations are more involved and more difficult to follow. Also, when calibrating MEMS sensors in practice, the axial parameters are far more important than the shear parameters. It is not clear that modeling shear is worthwhile, but this needs more investigation.
• At this point it would look nicer to name the parameters like we did the first time we looked at the Gauss-Newton method: ${m_x, m_y, m_z}$ for the offsets and ${\delta_x, \delta_y, \delta_z}$ for the sensitivities. In the calculations that follow, we study functions on the six-dimensional space of parameters and want to treat them uniformly. Special names for the parameters actually make those calculations more confusing (for me anyway, YMMV). We store them as one array in the code too, so it is best to get used to that early.
• You may wonder why we divide by ${\beta_3, \beta_4, \beta_5}$ instead of multiplying. In theory, both choices should work equally well. In practice, samples coming from a magnetometer will generally give us offset values on the order of 10–1000 and sensitivities on the order of 100–1000. If we multiplied by our parameters instead of dividing by them, we’d have to work with the inverse of the sensitivity, which will be on the order of 0.01–0.001. So we may have parameters differing by 6 orders of magnitude. In the intermediate matrices we’ll see in the Gauss-Newton calculations, this can lead to matrix entries differing by 12 orders of magnitude, creating “nearly singular” matrices that can’t be handled with 32-bit entries. I know this from experience: I did this my first time through and saw data come along and break it. Choosing parameters that are about the same order of magnitude leads to stable, reliable performance.

Now we address Question 2: how can we decide when one map is better than another? We are given a set of ${N}$ observations ${\{\mathbf{x}_i\}_{i=0}^{N-1}}$ and are trying to find an axial affine transformation that takes all of them onto the unit sphere. If the map ${A_{\bar{\beta}}}$ does this perfectly, then for all ${i = 0, \dots, N-1}$ we will have $\displaystyle |A_{\bar{\beta}}(\mathbf{x}_i)|^2 = 1$

If ${N > 6}$, we cannot hope to make this always happen but if our model is reasonable we can hope to get close. If the map is good, then ${1-|A_{\bar{\beta}}(\mathbf{x}_i)|^2 }$ should be small. So we define

Definition 2 (Residual) For sample ${\mathbf{x}_i}$ and parameters ${\bar{\beta}}$, the residual of ${\mathbf{x}_i}$ for ${\bar{\beta}}$ is $\displaystyle r_i(\bar{\beta}) = 1 - |A_{\bar{\beta}}(\mathbf{x}_i)|^2 = 1 - \sum_{j=0,1,2} \frac{(x_{ij} - \beta_{j})^2}{\beta_{3+j}^2}$

The residual vector is the ${N}$-vector ${\vec{\mathbf{r}}(\bar{\beta}) = \left(r_0(\bar{\beta}), r_1(\bar{\beta}), \dots, r_{N-1}(\bar{\beta})\right)}$

This measures the error for a single sample, but we need to combine all of the residuals to get one number representing the “total error”. We do this by taking the sum of square residuals. $\displaystyle S(\bar{\beta} | \{\mathbf{x}_i\}) = \sum_{i=0}^{N-1} r_i^2(\bar{\beta})$

We can now state our problem formally.

Problem 3 Given observations ${\mathbf{x}_i}$, find parameters that minimize ${S(\bar{\beta} | \{\mathbf{x}_i\})}$. That is, find $\displaystyle \bar{\beta}^{*} = \text{argmin}_{\bar{\beta} \in {\mathbb R}^6} S(\bar{\beta} | \{\mathbf{x}_i\})$

Alarms should be going off when you see this definition. Two apparently arbitrary decisions have been made that could have an important impact on our final results.

• Why did we use ${1 - |A_{\bar{\beta}}(\mathbf{x}_i)|^2}$ instead of ${1 - |A_{\bar{\beta}}(\mathbf{x}_i)|}$ or ${\log(|A_{\bar{\beta}}(\mathbf{x}_i)|)}$ or something else for the residual?
• Why choose the sum of squares of residuals instead of absolute values or ${p}$-th powers or some other function to combine residuals?

These decisions were made for computational convenience. Using the square of the magnitude in the residual makes the residual a polynomial function of ${\mathbf{x}_i}$, a critical fact when we develop a streaming Gauss-Newton algorithm for sphere fitting. Using the sum of squares of residuals is what allows us to use the Gauss-Newton algorithm in the first place. Making these decisions differently may lead to better (or worse) calibration results in practice — I don’t know and I would like to study it — but they will require different algorithms.

3. A Quick Recap of the Gauss-Newton Algorithm

Problem 3 can be solved using the Gauss-Newton algorithm. What we need to know about this method is

• The residual vector is a function ${\vec{\mathbf{r}}: {\mathbb R}^6 \rightarrow {\mathbb R}^N}$.

• The Jacobian matrix of this map lets us locally approximate ${\vec{\mathbf{r}}}$ with a linear function: $\displaystyle \vec{\mathbf{r}}(\bar{\beta} + \Delta\bar{\beta}) \approx \vec{\mathbf{r}}(\bar{\beta}) + \mathbf{J}_{\bar{\beta}}\Delta\bar{\beta}$

Where $\displaystyle \left[\mathbf{J}_{\bar{\beta}}\right]_{ij} = \frac{\partial r_i}{\partial \beta_j} \Big|_{\bar{\beta}} \ \ \ \ \ (1)$

This is just basic calculus – we’re approximating a smooth function with its first derivative.

• The first derivatives of ${S}$ are then ${\frac{\partial S}{\partial \beta_j} = \left[2\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}\right]_{j}}$
• The Hessian matrix of second derivatives of ${S}$ can be approximated by ${\frac{\partial^2 S}{\partial \beta_j\partial \beta_k} \approx \left[2\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}\right]_{jk}}$. These two facts follow from the form of ${S}$ as a sum of square residuals.
• We can use ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ and ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ to approximate ${S}$ with its second-order Taylor expansion. This is a quadratic function and we can find its minimum by solving the six-dimensional matrix equation $\displaystyle \mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}\bar{\Delta} = \mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}} \ \ \ \ \ (2)$

Then $\displaystyle \sum_{i=0}^{N-1} (r_i(\bar{\beta} - \bar{\Delta}))^2 \leq \sum_{i=0}^{N-1} (r_i(\bar{\beta}))^2$

So we can improve our parameters by replacing ${\bar{\beta}}$ with ${\bar{\beta} - \bar{\Delta}}$.

• This gives us the Gauss-Newton algorithm:
1. Guess ${\bar{\beta}}$. (You can just use the datasheet, but there is an easier way that I’ll discuss in another post.)
2. Solve Equation 2 to find the correction ${\bar{\Delta}}$.
3. Apply the correction ${\bar{\beta} \leftarrow \bar{\beta} - \bar{\Delta}}$
4. If ${\bar{\Delta}}$ was very small, stop. Otherwise go to (2).
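The two derivative facts used above follow from one-line computations. Differentiating ${S = \sum_i r_i^2}$ gives $\displaystyle \frac{\partial S}{\partial \beta_j} = \sum_{i=0}^{N-1} 2\, r_i(\bar{\beta})\, \frac{\partial r_i}{\partial \beta_j} = \left[ 2\,\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}} \right]_j$ and $\displaystyle \frac{\partial^2 S}{\partial \beta_j \partial \beta_k} = 2 \sum_{i=0}^{N-1} \frac{\partial r_i}{\partial \beta_j}\frac{\partial r_i}{\partial \beta_k} + 2 \sum_{i=0}^{N-1} r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k} \approx \left[ 2\,\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}} \right]_{jk}$ where the approximation drops the second sum, which is small when the residuals are small near a good fit.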

4. A Naïve Implementation

The iterative step of the Gauss-Newton algorithm consists of

1. computing the matrix ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ and the vector ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ for the observations ${\mathbf{x}_i}$ and current parameters ${\bar{\beta}}$,
2. solving Equation 2, and
3. updating ${\bar{\beta}}$ then deciding whether to continue.

Item 2 is a standard linear algebra problem — I do it using textbook Gaussian elimination in muCSense. Item 3 is even more straightforward. Everything interesting happens in item 1.

The obvious thing to do is compute ${\mathbf{J}_{\bar{\beta}}}$ and ${\vec{\mathbf{r}}}$ directly. Looking at the definition of ${\vec{\mathbf{r}}}$ and differentiating, we find that $\displaystyle r_i = 1-\sum_{j=0,1,2} \frac{(x_{ij} - \beta_{j})^2}{\beta_{3+j}^2} \ \ \ \ \ (3)$

and $\displaystyle \left[\mathbf{J}_{\bar{\beta}}\right]_{ij} = \left\{ \begin{array}{ll} \frac{2(x_{ij} - \beta_{j})}{\beta_{3+j}^2} & 0 \leq j < 3;\\ \frac{2(x_{i,j-3} - \beta_{j-3})^2}{\beta_{j}^3} & 3 \leq j < 6.\end{array} \right. \ \ \ \ \ (4)$

To do this in code, at some point we would need arrays in memory that look something like this:


float r[N];        // residual vector
float J[N][6];     // Jacobian matrix, one 6-entry row per sample
int16_t x[N][3];   // the observations

float JtJ[21];     // left-hand side of Equation 2; it is symmetric so
                   // we only need to store 21 of the 36 matrix entries
float JtR[6];      // right-hand side of Equation 2

I won’t write the code out for the implementation here because these declarations are enough to tell us that we are in trouble if we are implementing the algorithm on a micro-controller. These arrays require ${34N + 108}$ bytes of storage. If we take 10 samples we will be fine, but if we take 100 samples we need 3508 bytes. That will lead to some very undefined behavior on my Arduino Uno.

This will work for small samples, and the implementation is straightforward: loop over the observations and use Equations 3 and 4 to fill in the matrix entries. Then perform the needed matrix multiplications, solve the linear equation, update ${\bar{\beta}}$, and decide whether to continue. I will leave this as a warm-up exercise for the interested reader.

In the next post we will start writing some real code.