Is it possible in floating point to get 0.0 by subtracting two different values?












41















Due to the approximate nature of floating point, it's possible for two different sets of operands to produce the same result.



Example:



#include <iostream>

int main() {
    std::cout.precision(100);

    double a = 0.5;
    double b = 0.5;
    double c = 0.49999999999999994; // the largest double below 0.5

    std::cout << a + b << std::endl; // output "exact" 1.0
    std::cout << a + c << std::endl; // output "exact" 1.0 (the sum rounds up to 1.0)
}


But is this also possible with subtraction? That is, are there two different pairs of values, sharing one operand, that both yield 0.0?



In other words: a - b = 0.0 and a - c = 0.0 for some a, b and c with b != c?






























  • 2





    I asked a related question a long time ago. It led to a lot of discussion, and I still couldn't decide which answer to accept: stackoverflow.com/questions/39108471/…

    – user463035818
    Feb 5 at 9:50











  • Well, 0.0 and -0.0 have different representations, but they compare equal. wandbox.org/permlink/YQJyZfLojKva9iHs

    – Bob__
    Feb 5 at 10:23











  • Do you mean different values, or values coming from different computations?

    – bruno
    Feb 5 at 10:37











  • @user463035818 I think my answer to a related question applies also to your question. The present question, though, is different.

    – Daniel Daranas
    Feb 5 at 11:24













  • @DanielDaranas it would be a perfect duplicate if it weren't for a different language. Some things are the same across languages, but tbh sometimes I don't care about other languages ;)

    – user463035818
    Feb 5 at 11:27
















c++ floating-point






edited Feb 5 at 9:46







asked Feb 5 at 9:41

markzzz








4 Answers


















60














The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.



Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.



A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)



Some systems with this behavior offer a way to disable it.



Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.



Some C++ implementations offer ways to disable or limit such behavior.







































    18














    The gradual underflow feature of the IEEE floating-point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point) and located between the smallest negative and positive normal numbers, with zero in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing sign (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than zero.



    If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):



    // needs <xmmintrin.h>, <limits>, <cmath>, <iostream>; x86 with SSE
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);    // system specific
    double d = std::numeric_limits<double>::min(); // smallest normal
    double n = std::nextafter(d, 10.0); // second smallest normal
    double z = d - n; // a negative subnormal (flushed to zero)
    std::cout << (z == 0) << '\n' << (d == n);


    This should print



    1
    0


    The first line (1) indicates that the result of the subtraction is exactly zero, while the second (0) indicates that the operands are not equal.
































    • "The smallest subnormal is (much) less than the smallest distance between normal numbers": no, the smallest denormal is exactly the same as the distance between normal numbers with the lowest allowed exponent.

      – plugwash
      Feb 6 at 3:41



















    6














    Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.



    To understand the answer to this question we must first understand how floating point numbers work.



    A naive floating point representation would have an exponent, a sign and a mantissa. Its value would be



    $(-1)^s \cdot 2^{e - e_0} \cdot (m / 2^M)$



    Where:




    • $s$ is the sign bit, with a value of 0 or 1.

    • $e$ is the exponent field.

    • $e_0$ is the exponent bias. It essentially sets the overall range of the floating point number.

    • $M$ is the number of mantissa bits.

    • $m$ is the mantissa, with a value between 0 and $2^M - 1$.


    This is similar in concept to the scientific notation you were taught in school.



    However, this format has many different representations of the same number, so nearly a whole bit's worth of encoding space is wasted. To fix this, we can add an "implicit 1" to the mantissa.



    $(-1)^s \cdot 2^{e - e_0} \cdot (1 + m / 2^M)$



    This format has exactly one representation of each number. However, there is a problem with it: it can't represent zero or numbers close to zero.



    To fix this, IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). With $E$ exponent bits, the definition now becomes:



    $(-1)^s \cdot 2^{1 - e_0} \cdot (m / 2^M)$ when $e = 0$

    $(-1)^s \cdot 2^{e - e_0} \cdot (1 + m / 2^M)$ when $0 < e < 2^E - 1$



    With this representation smaller numbers always have a step size that is less than or equal to that for larger ones. So provided the result of the subtraction is smaller in magnitude than both operands it can be represented exactly. In particular results close to but not exactly zero can be represented exactly.



    This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.



    Unfortunately FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly they either did not support (non-zero) subnormals at all or provided slow support for subnormals and then gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled and the number is too small to represent in normalized form then it will be "flushed to zero".



    So in the real world, under some systems and configurations, subtracting two different very small floating point numbers can result in a zero answer.
































    • So basically this problem concerns only subnormal numbers? I.e., if I don't work with subnormals, I'll never have two different floating point numbers whose difference is zero?

      – markzzz
      Feb 6 at 9:15






    • 1





      The problem concerns numbers where the result of the subtraction is subnormal. Subtracting two small normal numbers can produce a subnormal result, which some implementations will flush to zero.

      – plugwash
      Feb 6 at 15:01











    • So disabling DAZ and FTZ will show two different results?

      – markzzz
      Feb 6 at 17:48






    • 1





      The difference between "FTZ" and "DAZ" modes on Intel processors seems to be where the flushing happens. The former operates on the outputs of an operation while the latter operates on the inputs. So if I'm reading the documentation right, with FTZ the subtraction could produce a false zero, while with DAZ the subtraction would produce a correct denormal result, but the comparison would then treat that denormal result as zero.

      – plugwash
      Feb 6 at 18:13



















    2














    Excluding funny numbers like NaN, I don't think it's possible.



    Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).



    That means the exponent is <= both a's and b's, and so the absolute precision is at least as high, which makes the subtraction exactly representable. That means that if a - b == 0, then it is exactly zero, so a == b.





























          edited Feb 5 at 20:12

























          answered Feb 5 at 10:40









          Eric Postpischil

























              edited Feb 5 at 13:45

























              answered Feb 5 at 10:44









              eerorika













              • "The smallest subnormal is (much) less than the smallest distance between normal numbers": no, the smallest denormal is exactly the same as the distance between normal numbers with the lowest allowed exponent.

                – plugwash
                Feb 6 at 3:41






























              6














              Unfortunately, the answer depends on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behaviour. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.



              To understand the answer to this question we must first understand how floating point numbers work.



              A naive floating point representation would have an exponent, a sign and a mantissa. Its value would be

              $(-1)^s \cdot 2^{e - e_0} \cdot (m / 2^M)$



              Where:




              • $s$ is the sign bit, with a value of 0 or 1.

              • $e$ is the exponent field.

              • $e_0$ is the exponent bias. It essentially sets the overall range of the floating point number.

              • $M$ is the number of mantissa bits.

              • $m$ is the mantissa, with a value between 0 and $2^M - 1$.


              This is similar in concept to the scientific notation you were taught in school.



              However, this format has many different representations of the same number, so nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.

              $(-1)^s \cdot 2^{e - e_0} \cdot (1 + m / 2^M)$

              This format has exactly one representation of each number. However, it has a problem: it can't represent zero or numbers close to zero.
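A small sketch of both points, using a toy format rather than a real one. The sizes (`M = 4` mantissa bits, bias `e0 = 0`) and the function names are assumptions chosen for illustration, and the sign bit is left out for brevity.

```cpp
#include <cassert>
#include <cmath>

// Toy format for illustration only: M = 4 mantissa bits, exponent bias e0 = 0.
constexpr int M = 4;
constexpr int e0 = 0;

// Naive value: 2^(e - e0) * (m / 2^M).
double naive_value(int e, int m) {
    assert(m >= 0 && m < (1 << M));
    return std::ldexp(static_cast<double>(m), e - e0 - M);
}

// Value with an implicit leading 1: 2^(e - e0) * (1 + m / 2^M).
double implicit_one_value(int e, int m) {
    assert(m >= 0 && m < (1 << M));
    const double fraction = std::ldexp(static_cast<double>(m), -M);
    return std::ldexp(1.0 + fraction, e - e0);
}
```

In the naive scheme, `naive_value(1, 8)` and `naive_value(2, 4)` are two encodings of the same number. With the implicit 1, the same field pairs give different values, but the smallest magnitude at a given exponent is `implicit_one_value(e, 0)`, which is never zero.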



              To fix this, IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes:

              $(-1)^s \cdot 2^{1 - e_0} \cdot (m / 2^M)$ when $e = 0$

              $(-1)^s \cdot 2^{e - e_0} \cdot (1 + m / 2^M)$ when $0 < e < 2^E - 1$, where $E$ is the number of exponent bits
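The two-case definition can be checked against a real `double`. The sketch below assumes the IEEE 754 binary64 layout (11 exponent bits, 52 mantissa bits, bias 1023); the function name `value_from_fields` is hypothetical, and NaNs and infinities are deliberately excluded.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "sketch assumes IEEE 754 binary64 doubles");

// Rebuilds a finite double from its sign, exponent and mantissa fields.
double value_from_fields(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bit pattern

    const int s = static_cast<int>(bits >> 63);
    const int e = static_cast<int>((bits >> 52) & 0x7FF);
    const std::uint64_t m = bits & ((std::uint64_t{1} << 52) - 1);
    assert(e != 0x7FF && "NaN and infinity are not covered");

    const double fraction = std::ldexp(static_cast<double>(m), -52);  // m / 2^M
    const double magnitude = (e == 0)
        ? std::ldexp(fraction, 1 - 1023)         // subnormal (or zero)
        : std::ldexp(1.0 + fraction, e - 1023);  // normal
    return s ? -magnitude : magnitude;
}
```

For any finite input the function should return a value equal to that input, including for subnormals such as `std::numeric_limits<double>::denorm_min()`.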



              With this representation, smaller numbers always have a step size that is less than or equal to that of larger ones. So, provided the result of the subtraction is smaller in magnitude than both operands, it can be represented exactly. In particular, results close to but not exactly zero can be represented exactly.



              This does not apply if the result is larger in magnitude than one or both of the operands, for example when subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise, but it clearly can't be zero.
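One way to observe the exact and inexact cases is to measure the rounding error of a subtraction. The sketch below uses Knuth's TwoSum error-free transformation, which is a technique I am bringing in for illustration and is not part of the argument above. It assumes round-to-nearest double arithmetic, no overflow and no `-ffast-math`; the name `subtraction_error` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Rounding error of the computed difference a - b (TwoSum applied to a + (-b)).
// Returns (exact a - b) minus (computed a - b); 0.0 means the subtraction was exact.
double subtraction_error(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    const double nb = -b;
    const double s = a + nb;   // computed difference
    const double bv = s - a;   // part of s attributed to nb
    const double av = s - bv;  // part of s attributed to a
    const double eb = nb - bv; // what was lost from nb
    const double ea = a - av;  // what was lost from a
    return ea + eb;
}
```

With `a = 1.0` and `b = std::nextafter(1.0, 0.0)` the difference is smaller than both operands and the error should be zero. With `a = 1.0` and `b = 1e-20` the result is larger than one operand, so the error is nonzero, yet the computed difference is still not zero.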



              Unfortunately, FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly, they either did not support (non-zero) subnormals at all, or provided slow support for subnormals and gave the user the option to turn it on and off. If support for proper subnormal calculations is absent or disabled and a result is too small to represent in normalized form, it is "flushed to zero".



              So in the real world, under some systems and configurations, subtracting two different very small floating point numbers can result in a zero answer.














              edited Feb 6 at 18:14

























              answered Feb 6 at 4:28









              plugwash













              • So basically this problem concerns only subnormal numbers? I.e. if I don't work with subnormals, I'll never have two different floating point numbers whose subtraction results in zero?

                – markzzz
                Feb 6 at 9:15






              • 1





                The problem concerns cases where the result of the subtraction is subnormal. Subtracting two small normal numbers can produce a subnormal result, which some implementations will flush to zero.

                – plugwash
                Feb 6 at 15:01











              • So disabling DAZ and FTZ will show two different results?

                – markzzz
                Feb 6 at 17:48






              • 1





                The difference between "FTZ" and "DAZ" modes on Intel processors seems to be where the flushing happens. The former operates on the outputs of an operation while the latter operates on the inputs. So if I'm reading the documentation right, with FTZ the subtraction could produce a false zero, while with DAZ the subtraction would produce a correct denormal result, but the comparison would then treat that denormal result as zero.

                – plugwash
                Feb 6 at 18:13






























              2














              Excluding funny numbers like NaN, I don't think it's possible.



              Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).



              That means the exponent of the result is <= both a's and b's exponents, so its absolute precision is at least as fine as theirs, which makes the difference exactly representable. Hence if a - b == 0, the true difference is exactly zero, so a == b.
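This is easy to spot-check. The sketch below takes one normal value per power of two and verifies that it never subtracts to zero against its upper neighbour. It assumes IEEE 754 doubles in the default mode (subnormals enabled, no FTZ/DAZ, no `-ffast-math`); the function name is made up, and a sweep like this is only a sanity check, not a proof.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// For one normal value in each binade, check that the value and the next
// representable double above it are distinct and do not subtract to zero.
bool neighbours_never_subtract_to_zero() {
    using lim = std::numeric_limits<double>;
    assert(lim::is_iec559 && "sketch assumes IEEE 754 doubles");
    for (double x = lim::min(); x < lim::max() / 2; x *= 2) {
        const double next = std::nextafter(x, lim::infinity());
        if (next == x || next - x == 0.0) {
            return false;  // would indicate flushing or a non-IEEE mode
        }
    }
    return true;
}
```

With flush-to-zero enabled, the very first iteration (the smallest normal and its neighbour) is where I would expect this check to fail.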
















                  answered Feb 5 at 10:40









                  Joseph Ireland





























