Tuesday, May 3, 2016

Numerical pitfalls in computing variance

One of the most common tasks in statistical computing is computation of sample variance. This would seem to be straightforward; there are a number of algebraically equivalent ways of representing the sum of squares \(S\), such as \[ S = \sum_{k=1}^n ( x_k - \bar{x})^2 \] or \[ S = \sum_{k=1}^n x_k^2 + \frac{1}{n}\bar{x}^2 \] and the sample variance is simply \(S/(n-1)\).

What is straightforward algebraically, however, is sometimes not so straightforward in the floating-point arithmetic used by computers. Computers cannot represent numbers to infinite precision, and arithmetic operations can affect the precision of floating-point numbers in unexpected ways.




Consider the numbers .1 and .3/3. These two numbers are equal. However,
.1 - .3/3
## [1] 0.00000000000000001388

is not exactly 0, as one would expect it to be (for more, see "Falling into the Floating
Point Trap", Chapter 4 in the "R Inferno" by Patrick Burns. Multiple ways of computing the variance that are algebraically equivalent do not necessarily yield equal answers in software such as R, and some ways are better than others.

In a series of posts, John D. Cook shows that the seemingly reasonable, commonly used second method above, which he calls the "sum of squares" method, can be extremely unstable in certain circumstances, even giving impossible negative values. He also discusses how to compute the sample variance in a numerically stable way using Welford's method. Both of these posts are well worth reading.

When I read them, I thought two things. First, I was reminded that I use [used; this was written some time ago] the "sum of squares" method in the BayesFactor package. Secondly, I thought I would not be affected by the problem, because I represent numbers as logarithms internally for numerical stability and ease of division. Logarithms make many things easier: very large and very small numbers become easier to work with; exponentiation becomes multiplication; and multiplication and division become addition and subtraction. The tricky part of working with logarithms is addition and subtraction. If we have two numbers, \(\exp(a)\) and \(\exp(b)\) represented by their logarithms \(a\) and \(b\), and we want to know the logarithm of their sum, we can make use of the identities
\[
\begin{eqnarray*}
\log(\exp(a) + \exp(b)) = a + \log(1 + \exp(b - a))\\
\log(\exp(a) - \exp(b)) = a + \log(1 - \exp(b - a))
\end{eqnarray*}
\]
Now arithmetic with \(a\) and \(b\) is addition and subtraction, and we can use accurate floating point approximations of \(\log(1+\exp(x))\) and \(\log(1-\exp(x))\).

When logarithms don't help



But I wasn't really sure whether I would be affected by the instability of the "sum of squares" method, so I decided to check. It turns out, representing numbers logarithmically doesn't necessarily help. In order to demonstrate this easily, I created an R S4 class that eases arithmetic on logarithmically-represented values. First, we load necessary libraries/files:
# Install the BayesFactor and devtools packages, if you don't already have them

# Load my S4 class for representing real numbers with logarithms
# and performing arithmetic on them
# See the code at https://gist.github.com/richarddmorey/3c77d0065983e31241bff3807482443e
devtools::source_gist('3c77d0065983e31241bff3807482443e')

# set random seed so results are reproducible
set.seed(2)

[Click here to view the R file you'll be sourcing above]

To see the S4 class in action, we need to generate some numbers that are logarithmically-represented. The variables x and y below are equal to \(\exp(1)=2.718\) and \(\exp(2)=7.389\), respectively. The modulo argument gives the log-represented magnitude of the number, and the sign argument gives the sign (with 1L meaning the integer representation of 1):
x = logRepresentedReal(modulo = 1, sign = 1L)
y = logRepresentedReal(modulo = 2, sign = 1L)
We can add the two numbers together, for instance:
x + y
##  10.10734
Although the result does not look logarithmically-represented, we can verify that it is using the str function:
str( x + y )
## Formal class 'logRepresentedReal' [package ".GlobalEnv"] with 2 slots
##   ..@ modulo: num 2.31
##   ..@ sign  : int 1
The result is of class logRepresentedReal, and the modulo slot tells us \(\log(x+y)\). With the arithmetic on the logarithmically-represented numbers defined using the logRepresentedReal class, we can test whether our logarithms help stabilize the estimate of the variance. Following Cook, we will sample values from a uniform distribution, making use of the fact that if \(z\) has a uniform distribution, then \(−\log(z)\) has an standard exponential distribution:
runif2 = function(n){
  # Sample log values from exponential distribution
  x = -rexp(n)
  # represent all values logarithmically in a list
  lapply(x, logRepresentedReal, sign = 1L)
}

n = 100
z = runif2(n)
We sampled \(n=100\) values from a uniform distribution. We can now compute the variance in several ways. The first way is to use the “sum of squares” method on the exponentiated values:
# Sum of squares method
var.sumsq.exp = function(z)
{
  n = length(z)
  z = sapply(z, as.numeric)
  (sum(z^2) - n*mean(z)^2)/(n-1)
}

var.sumsq.exp(z)
## [1] 0.07419988
This presents no problem, since our uniformly-distributed values are rather moderate. We now use Welford’s method on the logarithmically-represented values to compute the variance:
var.welford <- function(z){
  n = length(z)
  M = list()
  S = list()
  M[[1]] = z[[1]]
  S[[1]] = 0

  for(k in 2:n){
    M[[k]] = M[[k-1]] + ( z[[k]] - M[[k-1]] ) / k
    S[[k]] = S[[k-1]] + ( z[[k]] - M[[k-1]] ) * ( z[[k]] - M[[k]] )
  }
  return(S[[n]] / (n - 1))
}

var.welford(z)
##  0.07419988
And finally, we can use the “sum of squares” method on the logarithmically-represented values:
var.sumsq = function(z){
  n = length(z)
  zsqr = sapply(z, function(x) x^2)

  sumz = 0
  sumz2 = 0
  for(k in 1:n){
    sumz = sumz + z[[k]]
    sumz2 = sumz2 + zsqr[[k]] 
  }
  mnz = sumz/n
  ssz = sumz2 - n * mnz^2 
  return(ssz / (n-1))  
}

var.sumsq(z)
##  0.07419988
Again, this presents no problem, since our uniformly-distributed values are moderate. So far, we see no signs of numerical instability, but none are expected. As Cook did in his example, we can add a very large number — in this case, one billion — to all sampled values. This makes the variance quite small compared to the mean, and would be expected to make the sum of squares estimates unstable.
const = 1e9
z = sapply(z, function(x) x + const)

var.sumsq.exp(z)
## [1] -165.4949
var.welford(z)
##  0.07419973
var.sumsq(z)
##  35886
Notice that both the sum of square estimates fail; the logarithmic representation used by var.sumsq does not help the numeric stability. The Welford method, on the other hand, yields an accurate value. 


Conclusion 


There are many ways of combating numerical instability. Representing numbers as logarithms is one important method. Although representing numbers as logarithms is effective at combating numerical instability from some sources, it does not necessarily help in all cases. The superiority of the Welford method of computing variance, even when numbers are represented logarithmically, shows this clearly.

15 comments:

  1. You left out the obvious tool to compare against: stats::sd, which is just sqrt(stats::var). I don't have the source handy, so perhaps you could check out .Call(C_cov, x, y, na.method, FALSE) which is the guts of the 'var' function?

    ReplyDelete
    Replies
    1. Hi cellocgw, since am interested mainly in working with logarithms directly, and stats:sd function doesn't do that (internally, I don't want to exponentiate ever because I would end up getting infinities). But using stats::var on the exponentiated values is very accurate. Not sure what method is used, though.

      Delete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
  4. Nice blog posts - thanks for giving the pitfalls in computing variance of the World.

    Discount Airline Tickets

    ReplyDelete
  5. This is a good website for my guy. I have to show it to him :D

    ReplyDelete
  6. I like this site, this is very informative article!! Please Visit:

    cloud pos
    point of sales softwares
    pos app

    ReplyDelete
  7. Looks fairly complicated, but thanks for such a clear explanation. By the way, I'm planning to start the implementation of Microsoft Dynamics erp software and I just found this offer: https://ax-dynamics.com/microsoft-dynamics-ax. Looks pretty good!

    ReplyDelete
  8. COEPD LLC- Center of Excellence for Professional Development is the most trusted online training platform to global participants. We are primarily a community of Business Analysts who have taken the initiative to facilitate professionals of IT or Non IT background with the finest quality training. Our trainings are delivered through interactive mode with illustrative scenarios, activities and case studies to help learners start a successful career. We impart knowledge keeping in view of the challenging situations individuals will face in the real time, so that they can handle their job deliverables with at most confidence.

    http://coepd.us/

    ReplyDelete
  9. Irrespective of receiving daily oral or future injectable depot therapies, these require health care visits for medication and monitoring of safety and response. If patients are treated early enough, before a lot of immune system damage has occurred, life expectancy is close to normal, as long as they remain on successful treatment. However, when patients stop therapy, virus rebounds to high levels in most patients, sometimes associated with severe illness because i have gone through this and even an increased risk of death. The aim of “cure”is ongoing but i still do believe my government made millions of ARV drugs instead of finding a cure. for ongoing therapy and monitoring. ARV alone cannot cure HIV as among the cells that are infected are very long-living CD4 memory cells and possibly other cells that act as long-term reservoirs. HIV can hide in these cells without being detected by the body’s immune system. Therefore even when ART completely blocks subsequent rounds of infection of cells, reservoirs that have been infected before therapy initiation persist and from these reservoirs HIV rebounds if therapy is stopped. “Cure” could either mean an eradication cure, which means to completely rid the body of reservoir virus or a functional HIV cure, where HIV may remain in reservoir cells but rebound to high levels is prevented after therapy interruption.Dr Itua Herbal Medicine makes me believes there is a hope for people suffering from,Parkinson's disease,Schizophrenia,Cancer,Scoliosis,Fibromyalgia,Fluoroquinolone Toxicity
    Syndrome Fibrodysplasia Ossificans Progressiva.Fatal Familial Insomnia Factor V Leiden Mutation ,Epilepsy Dupuytren's disease,Desmoplastic small-round-cell tumor Diabetes ,Coeliac disease,Creutzfeldt–Jakob disease,Cerebral Amyloid Angiopathy, Ataxia,Arthritis,Amyotrophic Lateral Sclerosis,Alzheimer's disease,Adrenocortical carcinoma.Asthma,Allergic diseases.Hiv_ Aids,Herpes,Inflammatory bowel disease ,Copd,Diabetes,Hepatitis,I read about him online how he cure Tasha and Tara so i contacted him on drituaherbalcenter@gmail.com even talked on whatsapps +2348149277967 believe me it was easy i drank his herbal medicine for two weeks and i was cured just like that isn't Dr Itua a wonder man? Yes he is! I thank him so much so i will advise if you are suffering from one of those diseases Pls do contact him he's a nice man.

    ReplyDelete
  10. My life is beautiful thanks to you, Mein Helfer. Lord Jesus in my life as a candle light in the darkness. You showed me the meaning of faith with your words. I know that even when I cried all day thinking about how to recover, you were not sleeping, you were dear to me. I contacted the herbal center Dr Itua, who lived in West Africa. A friend of mine here in Hamburg is also from Africa. She told me about African herbs but I was nervous. I am very afraid when it comes to Africa because I heard many terrible things about them because of my Christianity. god for direction, take a bold step and get in touch with him in the email and then move to WhatsApp, he asked me if I can come for treatment or I want a delivery, I told him I wanted to know him I buy ticket in 2 ways to Africa To meet Dr. Itua, I went there and I was speechless from the people I saw there. Patent, sick people. Itua is a god sent to the world, I told my pastor about what I am doing, Pastor Bill Scheer. We have a real battle beautifully with Spirit and Flesh. Adoration that same night. He prayed for me and asked me to lead. I spent 2 weeks and 2 days in Africa at Dr Itua Herbal Home. After the treatment, he asked me to meet his nurse for the HIV test when I did it. It was negative, I asked my friend to take me to another nearby hospital when I arrived, it was negative. I was overwhite with the result, but happy inside of me. We went with Dr. Itua, I thank him but I explain that I do not have enough to show him my appreciation, that he understands my situation, but I promise that he will testify about his good work. Thank God for my dear friend, Emma, I know I could be reading this now, I want to thank you. And many thanks to Dr. Itua Herbal Center. He gave me his calendar that I put on my wall in my house. Dr. Itua can also cure the following diseases ... Cancer, HIV, Herpes, Hepatitis B, Inflammatory Liver, Diabetis, Fribroid,Parkinson's disease,Inflammatory bowel disease ,Fibromyalgia, recover your ex. You can contact him by email or whatsapp, @ .. drituaherbalcenter@gmail.com, phone number .. + 2348149277967 .. He is a good doctor, talk to him kindly. I'm sure he will also listen to you.

    ReplyDelete
  11. i think this post of yours lack some great info man.
    Know about Microsoft Dynamic 365.

    ReplyDelete