Telekom Malaysia Bhd Billing Bug Explained

A lesson for Computer Science Students

WARNING: this page is the creation of a wandering mind thinking of ways to make bugs in their own software more real to Computer Science Students, with more than a touch of silliness.

It is amazing how fast modern computers are. Just think, in one second, a modern computer can multiply over a million times, a single human mistake. And if it gets on the Internet, no matter how embarrassing it is, it will never be forgotten.
Gilbert Healton

Let me tell you about a very embarrassing billing mistake of a major telephone company.

  1. A mistake so stupid only a computer could make it.
  2. A few basic words on why things like this can happen. Anyone with a decent high-school education should be able to follow most of it.
  3. A more technical description that a lot of people should still be able to follow as I try to keep the computer geek talk down.
  4. An even more technical description. Normal people can skip this without problem or embarrassment.

An Embarrassing Software Bug

The Associated Press ran a story in 2006-April about a Malaysian man whose father died the previous December. The son, Yahaya Wahab, nicely settled with Telekom Malaysia Bhd. in January to close his father's account. But the company computers decided a little more money was due. 806,400,000,000,000.01 ringgit to be exact. This was 8.064 x 10^14 ringgit, or $218 trillion. The final touch was that he had 10 days to pay up or face collection action.

MSNBC
Fox News

Looking at the CIA Factbook, Malaysia had a 2005 Gross Domestic Product of $287 billion. For reference, the GDP of the United States was $12.4 trillion and the world as a whole $60.6 trillion. Ten days? Right.

The Basics

Computers save the numbers they are working with in a type of cell called a variable. The number inside is always changing. While variables come in different sizes, each variable is limited to some maximum and minimum value. The limits depend on the size of the variable.

Strange things happen if you exceed the size of a variable. If you had a four digit variable you could hold numbers from 0000 to 9999. Note that these cannot hold negative values. To hold negative values in four characters you would use -999 to +999, spending one character on the sign. Computers pull similar tricks inside themselves, but using the "binary" numbers they are comfortable with.

If you add one to 9998 you get 9999. If you add one to 9999 you get, surprise, 0000 and not 10000 as you are stuck with four digits.

If you subtract one from 0002 you get 0001. Subtract one again and you get 0000. Subtract one again and you get 9999. Oops. Yes, you borrowed from nowhere. Start with 0012 and subtract 0023 and you get, with this four digit number, 9989.

As the midnight in question had a leap second, I think an all too easy to make bug in the program did the same trick. But it did it in binary, and with variables so large the numbers in them are only fit for computers. One part of the program used the leap second and the other did not, leaving a gap of one second... but 0000 less 0001 is, as you now know, 9999. The why gets a bit more esoteric and is covered next.

Thinking About The Problem

As a professional software developer I've been pushing computers around, or they pushing me around, for over thirty years. During this time I've seen a bug or two. Given my interest in time and date problems (see The Best Of Dates, The Worst Of Dates for hints), the only way my mind would let me rest was to work out a likely reason or two for the bug.

High on my list is the fact that 2005-December-31 had a leap second. Some operating systems handle these correctly, others don't. Some applications handle these correctly, others don't. Running applications that correctly handle leap seconds on an operating system that does not, or vice versa, is asking for trouble. Regardless of cause, this is a wonderful example of how easy it is to write software bugs and how they can blow up to make entire companies look totally silly.

Letting my mind wander on this problem produced the following tidbits beyond the leap second issue:

Recreating The Problem

Putting all of the above together I came up with a C program to show my way of forcing the error.

Computer Science students are urged to understand what every part of this program does as I've seen similar problems in countless programs. The "carry" condition is a repeat offender, in and out of dates. If you don't understand the bug herein you are very likely to repeat it, somewhere.

/* reproducing the Telekom Malaysia bug */

  /* DISCLAIMER: 
     I have no idea if something like this was the 
     problem, and have not checked. Just having fun 
     showing software people how easy it is to make 
     date bugs. */

  /* NOTE: this program was not written until it was noticed
     that the billed amount (RINGGIT) was close to 2 to the 64
     (2**64 is shorthand used in documentation, though it is
     NOT VALID C!). As this is very close to time_t values 
     on some newer computers I began to wonder about it.  */

#include <stdio.h>
#include <stdlib.h>

#define RINGGIT 8.064e20             /* amount billed */

typedef unsigned long long time_64t; /* simulate 64-bit time_t */
   /* as a 64-bit time_t is needed to express this bug, but
      the compiler/OS this sample is being run on may still be 
      using the historic 32-bit version, we make our own typedef 
      that forces 64-bits. */

int main ( void )
{

    /* starting and ending times of phone call */
    /* assuming a call started close to midnight
       of 2005-Dec-31 and then went just one
       second into 2006-Jan-01. A one-second interval
       immediately before or after midnight might
       have been the cause of the bug. Thus this
       program only looks at that one second,
       ignoring call time on the other
       side of midnight. */
    time_64t     startt   = 1;
    time_64t     endt     = 0;
               /* (assume leap second did something silly) */

    /* calculate length of call we are interested in */
    time_64t      minutes = endt - startt;
          /* unsigned numbers are often used in integer
             calculations that are known to be non-negative
             as they allow values twice as large as their signed
             counterparts to be used without any additional cost.

             Such unsigned values work great unless some special
             exception, such as a leap second, does something
             strange to result in a calculation that should yield
             a negative result (e.g., 0 - 1). But for unsigned
             values the resulting borrow wraps around to produce
             a really huge positive value. */

    /* calculate a cost per-minute from givens */
    float     costPerMin = RINGGIT / minutes;

    float     bill;    /* calculated billed amount */

    bill = minutes * costPerMin;    /* calculate bill */

    printf( "costPerMin %.3f\n", costPerMin );
    printf( "Minutes %llu (0x%llx)\n",
                       minutes,
                              (unsigned long long)minutes );
    printf( "Please pay %.2f\n", bill );

    exit( EXIT_SUCCESS );
}

The exact output you get on the Please Pay line depends on the run-time library of the compiler being used. Some will be smart enough to know that a float just doesn't have that many significant digits in it and start throwing zeros out once the available precision is reached. Others are not that smart and keep doing binary to decimal conversions on the floating point number even though only garbage is being produced. It looks like Telekom Malaysia has a smarter library.

I'll bet that even if Yahaya Wahab paid up, the computers at both the phone company and the banks would have expressed their own bugs over the amount.

How To Avoid Such Errors

First, be sure you fully understand date and time processing. Just because you use dates and times every day of your life does not mean you understand them, especially to the detailed levels used by computer software.

Second, be sure your application software handles leap years and leap seconds in the same way the OS it runs on does.

Third, test, test, test. Especially the boundary conditions. A five second range is best: two seconds before, one second before, right on, one second after, and two seconds after. Start and stop times must use all permutations and combinations of these.

Where the stakes are high, such as billing, have the software trap values exceeding some sanity threshold and hold them for manual review.

Think.

http://www.exit109.com/~ghealton/y2k/TelekomBug.html
$Id: TelekomBug.hmac,v 1.5 2009/04/30 01:50:28 ghealton Exp ghealton $
Last formatted 2009-09-13