## Implementing the Gauss-Newton Algorithm for Sphere Fitting (2 of 3)

I sketched a naïve implementation of the Gauss-Newton algorithm for sphere fitting, and provided an implementation in Octave. While it works fine on a laptop, it uses far too much memory to handle more than a handful of calibration samples on a microcontroller.

What took up all of that memory? The matrix ${\mathbf{J}_{\bar{\beta}}}$, the vector ${\vec{\mathbf{r}}}$, and the set of observations ${\mathbf{x}_i}$.

It turns out that we don’t really need to keep any of these around. In this post I’ll describe how to implement the algorithm without materializing ${\mathbf{J}_{\bar{\beta}}}$ and ${\vec{\mathbf{r}}}$. This will reduce the memory footprint enough to make a practical algorithm that can handle ~100 observations. This is the algorithm used in the accelerometer calibration sketch I posted last year.

In the next post, I’ll show how to accumulate 25 statistics about the observations that have all of the information we need to run Gauss-Newton in constant space and time.

Let’s start by getting rid of the big arrays.

1. Only materialize what you need

When carrying out the Gauss-Newton algorithm as it was sketched in the last post, we never need direct access to the matrix entries of ${\mathbf{J}_{\bar{\beta}}}$ or ${\vec{\mathbf{r}}}$. We just need to work with ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ and ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ — two arrays with sizes that do not depend on the number of samples collected.

We will materialize these small matrices instead of the big ones.

This is possible because each matrix entry in ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ and ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ can be written as a sum of functions of the individual observations. This means we can look at an observation, evaluate those functions, add the results to the running matrix entries, then move on to the next observation.

Let’s make this more concrete and look at the 00 entry of ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$. Recall Equation 4 of part 1:

$\displaystyle \left[\mathbf{J}_{\bar{\beta}}\right]_{ij} = \left\{ \begin{array}{ll} 2\frac{x_{ij} - \beta_{j}}{\beta_{3+j}^2} & 0 \leq j < 3;\\ 2\frac{(x_{ij} - \beta_{j})^2}{\beta_{3+j}^3} & 3 \leq j < 6.\end{array} \right.$

from which the definition of matrix multiplication gives

$\displaystyle [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{00} = \sum_{i=0}^{N-1} 4\frac{(x_{i0} - \beta_{0})^2}{\beta_{3}^4}$

So to compute ${[\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{00}}$ we can just accumulate this sum observation by observation. There is no need to compute the big matrices ${\mathbf{J}_{\bar{\beta}}}$ and ${\vec{\mathbf{r}}}$.

The general matrix entries are as follows. For ${j,k < 3}$

$\displaystyle [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{jk} = [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{kj} = \sum_{i=0}^{N-1} 4\frac{(x_{ij} - \beta_{j})(x_{ik} - \beta_{k})}{\beta_{3+j}^2\beta_{3+k}^2} \ \ \ \ \ (1)$

for ${j < 3 \leq k}$

$\displaystyle [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{jk} = [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{kj} = \sum_{i=0}^{N-1} 4\frac{(x_{ij} - \beta_{j})(x_{i(k-3)} - \beta_{(k-3)})^2}{\beta_{3+j}^2\beta_{k}^3} \ \ \ \ \ (2)$

and for ${ 3 \leq j,k}$

$\displaystyle [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{jk} = [\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}]_{kj} = \sum_{i=0}^{N-1} 4\frac{(x_{i(j-3)} - \beta_{(j-3)})^2(x_{i(k-3)} - \beta_{(k-3)})^2}{\beta_{j}^3\beta_{k}^3} \ \ \ \ \ (3)$

${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ has a similar expression. Recall that

$\displaystyle r_i = 1-\sum_{j=0,1,2} \frac{(x_{ij} - \beta_{j})^2}{\beta_{3+j}^2}$

So for ${j < 3}$

$\displaystyle \left[\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}\right]_{j} = \sum_{i=0}^{N-1}2\frac{x_{ij} - \beta_{j}}{\beta_{3+j}^2} \left( 1 - \sum_{k=0,1,2} \frac{(x_{ik} - \beta_{k})^2}{\beta_{3+k}^2} \right) \ \ \ \ \ (4)$

and for ${j \geq 3}$

$\displaystyle \left[\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}\right]_{j} = \sum_{i=0}^{N-1}2\frac{(x_{i(j-3)} - \beta_{(j-3)})^2}{\beta_{j}^3} \left( 1 - \sum_{k=0,1,2} \frac{(x_{ik} - \beta_{k})^2}{\beta_{3+k}^2} \right) \ \ \ \ \ (5)$

So all of the matrix entries we need can be accumulated one observation at a time. We can store the observations, then at each iteration of the Gauss-Newton algorithm run through them and use the formulas above to accumulate the matrix entries. Once we have done that, it is simple linear algebra to solve the matrix equation, find ${\bar{\Delta}}$, and adjust ${\bar{\beta}}$. (Review the algorithm sketch in part 1 if needed.) This is essentially what is done in the sketch I posted last October, and I’ll use a slightly modified version of that code in the rest of this post.

2. Putting it into code

Now let’s translate this into C++. We won’t try to over-engineer the code here; we’ll just outline a simple Arduino sketch that does the job.

To start, we will need to declare some arrays to hold ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}, \mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}, \bar{\beta}}$, and the observations ${\mathbf{x}_i}$. I’ll add an array for ${\bar{\Delta}}$ too even though it isn’t strictly needed.

//matrices for Gauss-Newton computations
float JtJ[6][6];
float JtR[6];
float delta[6];
float beta[6];

//array of observations
int N;       //number of samples, set in setup()
int16_t* x;  //malloc this to length 3*N in setup().


This amounts to ${6N + (36+6+6+6)*4 = 6N+216}$ bytes, where ${N}$ is the number of samples (each observation is three int16_t values, or 6 bytes). If we only stored the upper triangular part of ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$, we could eliminate 15 floats, reducing the cost to ${6N + 156}$ bytes at the expense of making the code below a bit tougher to read. I will leave the modification as an exercise for the interested reader (but I’ll perform a similar one in the next post). The cost of the naïve algorithm was ${34N + 156}$ bytes (had we declared arrays for ${\bar{\beta}}$ and ${\bar{\Delta}}$), so this method uses significantly less memory for any number of samples. More importantly, the dependence on ${N}$ is dramatically reduced, and this algorithm can now be used on an Arduino with data sets of ~100 samples.

With those declared, let’s look at the calibrate() routine. It starts by setting up some general parameters — how small a ${\bar{\Delta}}$ is small enough to stop? (eps), after how many iterations should we give up? (num_iterations), how big was the last ${\bar{\Delta}}$? (change) — then enters the main loop. There is nothing subtle about this loop, just

• clear the matrices
• compute the calibration matrices ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ and ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$
• find ${\bar{\Delta}}$ by solving the matrix equation
• measure the size of ${\bar{\Delta}}$
• adjust ${\bar{\beta}}$

Here is what that looks like in code.


void calibrate() {
  int i;
  float eps = 0.000000001;
  int num_iterations = 20;
  float change = 100.0;
  while (--num_iterations >= 0 && change > eps) {

    //set all of the matrix entries to zero
    reset_calibration_matrices();

    //use the samples and the current values of beta to compute
    //the matrices JtJ and JtR
    compute_calibration_matrices();

    //solve the matrix equation JtJ*delta = JtR
    find_delta();

    //find the size of delta (using relative change for the scale parameters)
    change = delta[0]*delta[0] +
             delta[1]*delta[1] +
             delta[2]*delta[2] +
             delta[3]*delta[3]/(beta[3]*beta[3]) +
             delta[4]*delta[4]/(beta[4]*beta[4]) +
             delta[5]*delta[5]/(beta[5]*beta[5]);

    //adjust beta
    for (i = 0; i < 6; ++i) {
      beta[i] -= delta[i];
    }
  }
}


But this is really just a template for the algorithm. As we’ve just been discussing, the real work all falls to the routine compute_calibration_matrices(); everything else is simple. In this implementation, we compute the matrices by scanning the samples and updating the matrix entries as required by Equations 1–5. Here is an implementation.



void compute_calibration_matrices() {
  int i;

  reset_calibration_matrices();
  for (i = 0; i < N; i++) {
    update_calibration_matrices(x + 3*i);
  }
}

void update_calibration_matrices(const int16_t* data) {
  int j, k;
  float dx, b;
  float residual = 1.0;
  float jacobian[6];

  for (j = 0; j < 3; ++j) {
    b = beta[3+j];
    dx = ((float)data[j]) - beta[j];
    residual -= pow(dx/b, 2);
    jacobian[j] = 2.0*dx/pow(b, 2);
    jacobian[3+j] = 2.0*dx*dx/pow(b, 3);
  }

  for (j = 0; j < 6; ++j) {
    JtR[j] += jacobian[j]*residual;
    for (k = 0; k < 6; ++k) {
      JtJ[j][k] += jacobian[j]*jacobian[k];
    }
  }
}



3. Wrap-up and what’s next

By being careful about computing only the matrices we need, we were able to reduce the memory cost of the Gauss-Newton algorithm significantly and produce an algorithm that is viable on an inexpensive microcontroller. But we still need to keep the samples around and make a full pass through the sample set with each iterative pass — as ${\bar{\beta}}$ changes so do the matrices. This costs space and time and limits the viability of the algorithm.

In the next post we will see that even here we are storing too much. Thanks to the nature of our residual function, we will see that there are 25 statistics that we can compute in one pass through the observation set that contain all of the information we need to compute ${\mathbf{J}_{\bar{\beta}}^{\top}\mathbf{J}_{\bar{\beta}}}$ and ${\mathbf{J}_{\bar{\beta}}^{\top}\vec{\mathbf{r}}}$ for any given ${\bar{\beta}}$. Once we do this, we will have an algorithm with space costs that are independent of the number of observations. (Well, really logarithmic because with too many samples a float will not have enough precision to capture the information in the marginal sample, but I’m not worrying about that for now.)