Niall’s virtual diary archives – Monday 8th April 2013

Monday 8th April 2013: 8.06pm. Link shared: http://en.wikipedia.org/wiki/Double-precision_floating-point_format

Just spent several hours trying to figure out how to write a truncation routine for double-precision floating point which can't use a C library, not even anything from <math.h>. I can't use compiler intrinsics either (don't ask!). I also can't cast to a 64-bit integer, as the double may well be too large to fit in one, and an out-of-range conversion is undefined behaviour in C which on my hardware would cause a floating point exception. Why do I need a truncation routine? Because printing a floating point number requires one, and remember, I have no C library.

First off I tried parsing the IEEE 754 format and using a fixed point 64-bit integer to reconstitute the number as a whole integer. That proved a bit tricky.
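For comparison, here is a minimal sketch of how a bit-level approach could go. It is not the attempt described above: instead of rebuilding the number in fixed point, it simply clears the mantissa bits that sit below the binary point. The name trunc_by_bits is hypothetical, and the sketch assumes double is an IEEE 754 binary64 and unsigned long long is 64 bits wide.

```c
// Hypothetical helper, not from the original entry: trunc() by clearing mantissa bits.
// Assumes double is IEEE 754 binary64 and unsigned long long is 64 bits wide.
static double trunc_by_bits(double x)
{
   // Reading a union member other than the one last written is permitted in C
   union { double d; unsigned long long u; } bits;
   int e;
   bits.d=x;
   // Bits 52..62 hold the exponent, stored with a bias of 1023
   e=(int)((bits.u>>52)&0x7FF)-1023;
   // |x| < 1 (including zero and subnormals): only the sign bit survives
   if(e<0) { bits.u&=1ULL<<63; return bits.d; }
   // |x| >= 2^52, infinity or NaN: there are no fractional bits to clear
   if(e>=52) return x;
   // The low (52-e) mantissa bits are the fraction, so zero them
   bits.u&=~((1ULL<<(52-e))-1);
   return bits.d;
}
```

Calling trunc_by_bits(2.7) gives 2.0 and trunc_by_bits(-1.3) gives -1.0. It needs no library and no integer conversion of the value itself, at the price of depending on the exact bit layout.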

Then, while staring at http://en.wikipedia.org/wiki/Double-precision_floating-point_format, I thought: the mantissa is just another integer really, and like any integer you can overflow it. So I came up with this:

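// Niall's C99 trunc() implementation which uses no C library
static double mytrunc(double _v)
{
   const double two52=(double)(1ULL<<52);
   // Work on the magnitude: a negative input plus 2^52 lands below 2^52, where doubles still have a 0.5 bit
   const int neg=_v<0;
   const double mag=neg ? -_v : _v;
   // From 2^52 upwards every double is already whole; NaN fails this comparison too and is passed through
   if(!(mag<two52)) return _v;
   // I really, really don't want the optimiser to elide this code
   volatile double v=mag;
   // Doubles in [2^52, 2^53) are spaced exactly 1 apart, so make the FP unit drop the fractional bits for us
   v+=two52;
   v-=two52;
   // The FP unit will round to nearest, so undo any rounding up.
   if(v>mag) v-=1;
   return neg ? -v : v;
}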

I think that ought to work no matter the input. It certainly seems to work. Anyway, I'm feeling quite proud of myself for such an elegant solution. It's probably old hat to many people, but for me that took a fair bit of thinking.
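To see what the add-then-subtract step does by itself, here is a minimal sketch that isolates it. The name snap_2p52 is hypothetical, and the sketch assumes IEEE 754 binary64 doubles with the default round-to-nearest mode.

```c
// Hypothetical helper, not from the original entry: the add-then-subtract step on its own.
// Assumes IEEE 754 binary64 doubles and the default round-to-nearest mode.
static double snap_2p52(double x)
{
   // volatile stops the compiler folding (x+c)-c back into x
   volatile double v=x;
   v+=4503599627370496.0;   // 2^52
   v-=4503599627370496.0;
   return v;
}
```

Under those assumptions snap_2p52(2.7) gives 3.0 and snap_2p52(1.5) gives 2.0, which is rounding rather than truncation, hence the need for a correction step afterwards. Doubles in [2^52, 2^53) are spaced exactly 1 apart, so a non-negative input below 2^52 always comes back as a whole number. A small negative input instead lands in [2^51, 2^52), where the spacing is 0.5, so snap_2p52(-1.3) gives -1.5. The snap is therefore only reliable on non-negative values below 2^52.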

Contact the webmaster: Niall Douglas @ webmaster2<at symbol>nedprod.com (Last updated: 2013-04-08 20:06:24 +0000 UTC)