WARNING: this page is the creation of a wandering mind thinking of ways to make bugs in their own software more real to Computer Science Students.
It is amazing how fast modern computers are. Just think, in one second, a modern computer can multiply over a million times, a single human mistake. And if it gets on the Internet, no matter how embarrassing it is, it will never be forgotten.
-- Gilbert Healton
The Associated Press ran a story in 2006-April about a Malaysian man whose father had died the previous December. The son, Yahaya Wahab, settled with Telekom Malaysia Bhd. in January to close his father's account. But the company computers decided more money was due: 806,400,000,000,000.01 ringgit (8.064 x 10^20 ringgit, or $218 trillion) to be exact. The final touch was that he had 10 days to pay up or face collection action. Looking at the CIA Factbook, Malaysia had a 2005 Gross Domestic Product of $287 billion. For reference, the GDP of the United States was $12.4 trillion and that of the world as a whole $60.6 trillion.
As a professional software developer of 30 years with an interest in time and date problems (see The Best Of Dates, The Worst Of Dates for hints), I've been wondering what might cause this particular problem. High on my list was the fact that 2005-December-31 had a leap second. Some operating systems handle these correctly, others don't. Some applications handle these correctly, others don't. If you run a problem application under an OS that knows what it is doing, you are asking for trouble. Regardless of cause, it is a good example of how easy it is to write software bugs.
Letting my mind wander on this problem, I came up with the following tidbits beyond the leap second issue:
- The fact that so many numbers near the end are zeros implies the computer only keeps so much precision, if only in the printing of numbers.
- The ".01" at the end was likely internally a zero that got rounded strangely: a common rounding error in some "floating point" numbers used by computers. Or it might be a genuine 0.01 added in for regulatory or other similar issues. In short, ignore it.
- First let's check a very common "carry" problem. 2^64 is 18,446,744,073,709,551,616, or about 1.844 x 10^19. Thus the 8.064 x 10^20 bill is not a power of two, nor is it a "floating point" power of 2.
- Suspect a date calculation in an unsigned integer went "negative" due to the leap second (unsigned 0 - 1 in 64 bits is that 2^64 number, less 1).
- 2^64 is very close, percentage wise, to the billed amount. This makes me suspicious the bug has something to do with 2^64, especially as
- The use of 64-bit time_t values is right and proper these days as it allows software to avoid the year 2038 bug.
Putting all of the above together I came up with a C program to show my way of forcing the error.
Computer Science students are urged to understand what every part of this program does. I've seen similar problems in countless programs. The "carry" condition is a repeat offender, in and out of dates. If you don't understand the bug herein you are very likely to repeat it, somewhere.
    /* reproducing the Telekom Malaysia bug */

    /* DISCLAIMER: I have no idea if something like this was the problem,
       and have not checked. Just having fun showing software people how
       easy it is to make date bugs. */

    /* NOTE: this program was not written until it was noticed that the
       billed amount (RINGGIT) was close to 2 to the 64 (2^64 is shorthand
       used in documentation, though it is NOT VALID C!). As this is very
       close to the range of time_t values on some newer computers I began
       to wonder about it. */

    #include <stdio.h>
    #include <stdlib.h>

    #define RINGGIT 8.064e20    /* amount billed */

    /* simulate 64-bit time_t: as a 64-bit time_t is needed to express
       this bug, but the compiler/OS this sample is being run on may still
       be using the historic 32-bit version, we make our own typedef that
       forces 64-bits. */
    typedef unsigned long long time_64t;

    int main ( void )
    {
        /* starting and ending times of phone call */
        /* assuming a call started close to midnight of 2005-Dec-31 and
           then went just one second into 2006-Jan-01. A one-second
           interval immediately before or after midnight might have been
           the cause of the bug. Thus this program only looks at that one
           second, ignoring call time on the other side of midnight. */
        time_64t startt = 1;
        time_64t endt = 0;  /* (assume leap second did something silly) */

        /* calculate length of call we are interested in */
        time_64t minutes = endt - startt;
        /* unsigned numbers are often used in integer calculations that
           are known to be non-negative as they allow values twice as
           large as their signed counterparts to be used without any
           additional cost. Such unsigned values work great unless some
           special exception, such as a leap second, does something
           strange to result in a calculation that should yield a negative
           result (e.g., 0 - 1). But for unsigned values the resulting
           borrow wraps around to a really huge positive value. */

        /* calculate a cost per-minute from givens */
        float costPerMin = RINGGIT / minutes;

        float bill;                     /* calculated billed amount */
        bill = minutes * costPerMin;    /* calculate bill */

        printf( "costPerMin %.3f\n", costPerMin );
        /* %llu and %llx both expect an unsigned long long argument */
        printf( "Minutes %llu (0x%llx)\n", minutes, minutes );
        printf( "Please pay %.2f\n", bill );

        exit( EXIT_SUCCESS );
    }
The exact output you get on the Please Pay line depends on the run-time library of the compiler being used. Some will be smart enough to know that a float just doesn't have that many significant digits in it and start throwing zeros out once the available precision is reached. Others are not that smart and keep doing binary to decimal conversions on the floating point number even though only garbage is being produced. It looks like Telekom Malaysia has a smarter library.
I'll bet that even if Yahaya Wahab paid up, the computers at both the phone company and the banks would have expressed their own bugs over the amount.
First, be sure you fully understand date and time processing. Just because you use dates and times every day of your life does not mean you understand them, especially to the detailed levels used by computer software.
Second, be sure your application software handles leap years and leap seconds in the same way the OS it runs on does.
Test test test. Especially the boundary conditions. A five second range is best: two seconds before, one second before, right on, one second after, and two seconds after. Start and stop times must use all permutations and combinations of these.
For important values such as billed amounts, have the software trap anything exceeding some sanity threshold and hold it for manual review.
Think.


http://www.exit109.com/~ghealton/y2k/TelekomBug.html
$Id: TelekomBug.hmac,v 1.4 2007/08/26 22:16:41 ghealton Exp $ Last formatted 2007-08-26