Re: binary search debugging of compilers

Kaz Kylheku <864-117-4973@kylheku.com>
Mon, 15 May 2023 21:52:23 -0000 (UTC)

          From comp.compilers

Related articles
[2 earlier articles]
Re: binary search debugging of compilers pronesto@gmail.com (Fernando) (2023-05-13)
Re: binary search debugging of compilers 864-117-4973@kylheku.com (Kaz Kylheku) (2023-05-14)
Re: binary search debugging of compilers tkoenig@netcologne.de (Thomas Koenig) (2023-05-14)
Re: binary search debugging of compilers gah4@u.washington.edu (gah4) (2023-05-14)
Re: binary search debugging of compilers cameron.mcinally@nyu.edu (Cameron McInally) (2023-05-14)
Re: binary search debugging of compilers 864-117-4973@kylheku.com (Kaz Kylheku) (2023-05-15)
Re: binary search debugging of compilers 864-117-4973@kylheku.com (Kaz Kylheku) (2023-05-15)
Re: binary search debugging of compilers gah4@u.washington.edu (gah4) (2023-05-16)
Re: binary search debugging of compilers rsc@swtch.com (Russ Cox) (2023-05-17)
Re: binary search debugging of compilers cameron.mcinally@nyu.edu (Cameron McInally) (2023-05-17)
Re: binary search debugging of compilers 864-117-4973@kylheku.com (Kaz Kylheku) (2023-05-17)
Re: binary search debugging of compilers 864-117-4973@kylheku.com (Kaz Kylheku) (2023-05-17)
Re: binary search debugging of compilers gah4@u.washington.edu (gah4) (2023-05-17)
[10 later articles]
| List of all articles for this month |

From: Kaz Kylheku <864-117-4973@kylheku.com>
Newsgroups: comp.compilers
Date: Mon, 15 May 2023 21:52:23 -0000 (UTC)
Organization: A noiseless patient Spider
References: 23-05-003 23-05-005 23-05-006 23-05-008
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="21354"; mail-complaints-to="abuse@iecc.com"
Keywords: debug, tools
Posted-Date: 15 May 2023 21:54:04 EDT

On 2023-05-14, gah4 <gah4@u.washington.edu> wrote:
> On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Setup:
>>
>> We can (somehow) enumerate the N entities with integers, which we
>> express using a pure binary enumeration. Then, we can discover the
>> number of an erroneously processed element, one bit at a time.
>>
>> Procedure:
>>
>> First we treat all elements numbered ..XXXXXX0 using the new technique
>> technique, and all others using the the old technique. If the
>> system fails, we know that it's the ..XXXXX0 set which has one or more
>> badly processed elements. Otherwise it's the other set, whose
>> enumerations end in a 1. Either way, we have discovered the identity of
>> the last bit.
>
> There is the assumption that only one of the N is in error.


You might think, but it's not actually a problem with the algorithm.


When we bisect a change history, there is the assumption that the
bug showed up, so there are good commits before that, and bad commits
after. If the history is not like that: the bug disappears and
reappears, then the bisect will not identify *the* bad commit reliably,
because there isn't one.


I have a story like this; but it digresses from my present point, so
I will return to it below.


In this situation, we are bisecting a space of N entities. All we need
is to find one bad one. If there are three bad ones, it doesn't matter
which one we find.


Say we have a basket of apples with a rotten smell emanating from it.
We can subdivide it, and smell one half and the other. If both halves
smell, we know we have two or more rotten apples and they ended up
in different halves. This doesn't matter. We just pick one half and
keep subdividing. As long as we stay on the trail of the rotten scent, we will
get down to one rotten apple, and we can use that apple to analyze further:
what kind of mould or bacterium has infected it and so on. Probably
the other rotten apples have the same root cause. If they have different root
causes, we can do another search after fixing the one root cause we have found.


Now about the story. The conclusion of the story was that there was
a GC-related bug in an ash (arithmetic shift) function, dealing with
heap-allocated bignum integers.


A bignum integer was being prematurely reclaimed by the garbage
collector.


The fix looked like this:


commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5
Author: Kaz Kylheku <kaz@kylheku.com>
Date: Wed Jun 19 23:21:39 2019 -0700


        ash: gc problem.


        On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a
        miscompilation of the pset macro, due to ash abruptly
        returning zero, causing the op-code field of an instruction to
        be NOOP rather than GCALL.


        I'm committing this fix just for reference; I will immediately
        replace it with a refactoring of the function.


        * arith.c (ash): Add a gc_hint to prevent the a bignum from
        being reclaimed if the b = make_bignum() triggers gc.
        This happens in the case when the previous case computes
        a = make_bignum() and falls through.


diff --git a/arith.c b/arith.c
index fdb294b7..f374ddf4 100644
--- a/arith.c
+++ b/arith.c
@@ -3235,6 +3235,7 @@ val ash(val a, val bits)
              b = make_bignum();
              if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY)
                  break;
+ gc_hint(a);
              return normalize(b);
          default:
              goto bad3;


The problem only showed upon Solaris (first red flag). When I git bisected it,
it was randomly appearing and disappearing, so git bisect was of no use.


Not all problems introduce themselves in a way that there is a reproducible
test case.


Here, the garbage collector had to go off exactly at the wrong time during the
execution of the ash function. Small perturbations to completely unrelated code
can make it go away.


In the end, I had to debug it on Solaris (no working gdb or anything), using
the repro that I had.


Once I discovered the bad VM instruction, which was affecting the behavior of a
certain Lisp macro in the standard library, I traced it through the compiler
right down to the assembler, where the opcode field for a good assembly
instruction was being miscalculated. I could not reproduce the bad ash
calculation though in isolation.


In summary:


1. non-gc-aware, sloppy coding in ash function caused ...
2. rare problem in assembler's masking and shifting calculations, causing ...
3. Bad instruction when compiling the "pset" (parallel assignment) macro ...
4. Which failed some code relying pset.


All of this reproduced only one one platform, one particular commit.
Small perturbations of Lisp code just about anywhere made it go away.
Bisect was useless.


--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.