Bug Hunting 101 - Finding "The" Alpha Bug

J.C. Roberts-2
I've been told that "The" alpha bug has been around for quite some time
and no one has been able to find or fix it. I've also been told looking
for this bug has driven a few developers to drink, well, probably "drink
more" is a better description. Anyhow, since I could use a drink, I'm
going to give it a shot.

Since I don't have the skill to fix it myself, my goal is simply to
figure out when "The" alpha bug entered the tree. If I can just figure
out the `when' hopefully someone a lot smarter than me can figure out
the `what' of the problem. Basically I'm going to turn loose a half
dozen alpha systems compiling various versions of OpenBSD until I find
where the bug stops occurring.

As far as I can tell, the bug smells like a race condition of some sort
and if my wild guess is correct, it will be difficult to reproduce
consistently. With some (but not all) race conditions, you can increase
the chance of triggering them by increasing loads. Since I want the race
condition to occur, what is the best way to stress the systems while
also doing make build?

http://www.holm.cc/stress/
http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/stress/

I simply don't know and I'm only guessing but the prime suspects for
where the race might live seem to be physical memory management,
PAL/interrupt handling or even the scheduler.

Are there better ways to stress the system?
Are there better ways to increase the odds of a race occurring?

Since I needed to find a starting point, I went searching and reading
through the archives of misc@, tech@, alpha@ and bugs@, and even the netbsd
archives in hopes of finding a "patient zero" where the bug was first
reported. I found something interesting, namely a (more than once)
reported bug that looks very similar to "The" alpha bug. The primary
difference is you get "cpu_switch_queuescan" rather than "cpu_switch" in
the trace output.

2003-10-01 21:40:00
http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2

2003-08-03 12:00:14
http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2

There is also another, vaguer report, but since it is missing the
needed trace information, there's no way to tell if it's related.
2003-05-13 22:13:50
http://marc.theaimsgroup.com/?l=openbsd-bugs&m=105286536018393&w=2

From other bug reports in the archive I know 3.8, 3.7 and 3.6 are all
affected by "The" alpha bug. If my hunch is correct and the bugs linked
above are related to "The" alpha bug, then I should start the
compile-a-thon at OpenBSD v3.3 and work backwards.
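
As an aside, "working backwards" one release at a time can be shortened
by bisecting the release list instead. The following is only a minimal
sketch under stated assumptions: the release list, the function name
first_affected() and the is_affected() callback are hypothetical, the
newest release is taken to be affected, and the bug is assumed to stay
present in every release after the one that introduced it. In practice
is_affected() would mean "build and run that release until it either
panics or has run long enough to be trusted", which is the hard part.

```python
from typing import Callable, Sequence


def first_affected(releases: Sequence[str],
                   is_affected: Callable[[str], bool]) -> str:
    """Return the earliest release for which is_affected() is true.

    Assumes `releases` is ordered oldest to newest, that the newest
    release is affected, and that once the bug appears it never
    disappears again in a later release.
    """
    lo, hi = 0, len(releases) - 1  # invariant: releases[hi] is affected
    while lo < hi:
        mid = (lo + hi) // 2
        if is_affected(releases[mid]):
            hi = mid        # bug already present here; look earlier
        else:
            lo = mid + 1    # bug not seen here; it entered later
    return releases[lo]


# Hypothetical candidate list, oldest to newest.
RELEASES = ["3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "3.6", "3.7", "3.8"]

if __name__ == "__main__":
    # Stand-in predicate for demonstration only: pretend the bug
    # entered the tree in 3.2. A real check would be a long test run.
    pretend = lambda rel: RELEASES.index(rel) >= RELEASES.index("3.2")
    print(first_affected(RELEASES, pretend))
```

With nine candidate releases this needs at most four test runs rather
than up to nine, but every "not affected" answer is only as trustworthy
as the test run behind it.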

If you've got a better idea, please let me know.

Kind Regards,
jcr

Re: Bug Hunting 101 - Finding "The" Alpha Bug

Siegbert Marschall
Hi,

> As far as I can tell, the bug smells like a race condition of some sort
> and if my wild guess is correct, it will be difficult to reproduce
> consistently. With some (but not all) race conditions, you can increase
> the chance of triggering them by increasing loads. Since I want the race
> condition to occur, what is the best way to stress the systems while
> also doing make build?
well, I have three alphas in the basement where I am trying to figure
this one out, nothing provable yet but everything is pointing into
some hardware problem with the low-end alpha cpus and second-level cache.
llsc errors, stuck cachelines and stuff but I didn't dive deep enough
into the code and processor documentation to figure out what's going
on there and will not be in the next weeks/months since I have a few
more pressing issues to take care of first before having the spare
time for this ;)

only thing I can tell is that with netbsd the machines stay up for
weeks/months and with obsd they crash after a few days at the latest.
no flame, doesn't show that netbsd is better, probably just missing the
tripwire or doesn't care whether it blows.

good luck, siggi.

Re: Bug Hunting 101 - Finding "The" Alpha Bug

J.C. Roberts-2
On Wed, 21 Dec 2005 22:46:00 +0100 (CET), "Siegbert Marschall"
<[hidden email]> wrote:

>Hi,
>
>> As far as I can tell, the bug smells like a race condition of some sort
>> and if my wild guess is correct, it will be difficult to reproduce
>> consistently. With some (but not all) race conditions, you can increase
>> the chance of triggering them by increasing loads. Since I want the race
>> condition to occur, what is the best way to stress the systems while
>> also doing make build?
>>
>well, I have three alphas in the basement where I am trying to figure
>this one out, nothing provable yet but everything is pointing into
>some hardware problem with the low-end alpha cpus and second-level cache.

Due to the old bug reports which may or may not be related, I've been
looking into the changes in src/sys/arch/alpha/alpha/locore.s

>llsc errors, stuck cachelines and stuff but I didn't dive deep enough
>into the code and processor documentations to figure out what's going
>on there and will not be in the next weeks/months since I have a few
>more pressing issues to take care of first before having the spare
>time for this ;)
>

If I can figure out when the bug entered the tree, it will hopefully
make it easy for someone else to figure out the "what" of the problem.
Since I lack the skill and experience to deal with figuring out the
what, I'm just going to use brute force to figure out the when. ;-)

>only thing I can tell is that with netbsd the machines stay up for
>weeks/months and with obsd they crash latest after a few days.
>no flame, doesn't show that netbsd is better, probably just missing the
>tripwire or doesn't care whether it blows.
>
>good luck, siggi.

I've searched the netbsd list archives thoroughly and found no similar
bug reports. As far as I know netbsd is not affected.

jcr

Re: Bug Hunting 101 - Finding "The" Alpha Bug

ober-4
I know this is going to be OT, but since this bug seems to deal with only
OpenBSD on alpha, possibly in locore.s, and does not seem to affect netbsd,
I thought I might point out a coincidental, but most likely unrelated, "bug".

http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/alpha/alpha/locore.s.diff?r1=1.19&r2=1.20&f=h
Search on "OpenBSD". :D



-Ober

On Wed, 21 Dec 2005, J.C. Roberts wrote:

> On Wed, 21 Dec 2005 22:46:00 +0100 (CET), "Siegbert Marschall"
> <[hidden email]> wrote:
>
>> Hi,
>>
>>> As far as I can tell, the bug smells like a race condition of some sort
>>> and if my wild guess is correct, it will be difficult to reproduce
>>> consistently. With some (but not all) race conditions, you can increase
>>> the chance of triggering them by increasing loads. Since I want the race
>>> condition to occur, what is the best way to stress the systems while
>>> also doing make build?
>>>
>> well, I have three alphas in the basement where I am trying to figure
>> this one out, nothing provable yet but everything is pointing into
>> some hardware problem with the low-end alpha cpus and second-level cache.
>
> Due to the old bug reports which may or may not be related, I've been
> looking into the changes in src/sys/arch/alpha/alpha/locore.s
>
>> llsc errors, stuck cachelines and stuff but I didn't dive deep enough
>> into the code and processor documentations to figure out what's going
>> on there and will not be in the next weeks/months since I have a few
>> more pressing issues to take care of first before having the spare
>> time for this ;)
>>
>
> If I can figure out when the bug entered the tree, it will hopefully
> make it easy for someone else to figure out the "what" of the problem.
> Since I lack the skill and experience to deal with figuring out the
> what, I'm just going to use brute force to figure out the when. ;-)
>
>> only thing I can tell is that with netbsd the machines stay up for
>> weeks/months and with obsd they crash latest after a few days.
>> no flame, doesn't show that netbsd is better, probably just missing the
>> tripwire or doesn't care whether it blows.
>>
>> good luck, siggi.
>
> I've searched the netbsd list archives thoroughly and found no similar
> bug reports. As far as I know netbsd is not affected.
>
> jcr

Re: Bug Hunting 101 - Finding "The" Alpha Bug

J.C. Roberts-2
In reply to this post by J.C. Roberts-2
On Wed, 21 Dec 2005 12:13:54 -0800, "J.C. Roberts" <[hidden email]>
wrote:

>I found something interesting, namely a (more than once)
>reported bug that looks very similar to "The" alpha bug. The primary
>difference is you get "cpu_switch_queuescan" rather than "cpu_switch" in
>the trace output.
>
>2003-10-01 21:40:00
>http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2
>
>2003-08-03 12:00:14
>http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2
>
>There is also another report that is vague but since it is missing the
>needed trace information, there's no way to tell if it's related.
>2003-05-13 22:13:50
>http://marc.theaimsgroup.com/?l=openbsd-bugs&m=105286536018393&w=2


Yes, the two bugs, one which shows "cpu_switch" in the trace output and
the other that shows "cpu_switch_queuescan" in the trace output, are
definitely related.

I managed to reproduce the "cpu_switch_queuescan" output originally
reported from OpenBSD 3.3 while compiling 3.8-STABLE tonight.

The only change in the source files is that I enabled the

  #makeoptions DEBUG="-g"

line in the /src/sys/conf/GENERIC file. I'm going to try flipping this back
and forth a few times to see if it really is the deciding factor for
which output the bug displays.

JCR

Re: Bug Hunting 101 - Finding "The" Alpha Bug

Artur Grabowski
In reply to this post by J.C. Roberts-2
"J.C. Roberts" <[hidden email]> writes:

> Since I don't have the skill to fix it myself, my goal is simply to
> figure out when "The" alpha bug entered the tree. If I can just figure
> out the `when' hopefully someone a lot smarter than me can figure out
> the `what' of the problem. Basically I'm going to turn loose a half
> dozen alpha systems compiling various versions of OpenBSD until I find
> where the bug stops occurring.

Good luck. I spent two months doing this (out of 8-9 months of chasing
the bug). What will you change? gcc?  binutils? libc used to build
gcc/binutils? /usr/bin/config?

The bug isn't necessarily in the kernel code.

> As far as I can tell, the bug smells like a race condition of some sort
> and if my wild guess is correct, it will be difficult to reproduce
> consistently. With some (but not all) race conditions, you can increase
> the chance of triggering them by increasing loads. Since I want the race
> condition to occur, what is the best way to stress the systems while
> also doing make build?

Good luck. I never found a reliable way to reproduce it. Sometimes it
showed up seconds after boot, sometimes after a few weeks uptime.

> http://www.holm.cc/stress/
> http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/stress/

Stress tests never increased the probability of the bug popping up.
It often popped up when the machine was completely idle.

> I simply don't know and I'm only guessing but the prime suspects for
> where the race might live seem to be physical memory management,
> PAL/interrupt handling or even the scheduler.

Yawn.

> Are there better ways to stress the system?
> Are there better ways to increase the odds of a race occurring?

No.

> Since I needed to find a starting point, I went searching and reading
> through the archives of misc@, tech@, alpha@ and bugs@ even the netbsd
> archives in hopes of finding a "patient zero" where the bug was first
> reported. I found something interesting, namely a (more than once)
> reported bug that looks very similar to "The" alpha bug. The primary
> difference is you get "cpu_switch_queuescan" rather than "cpu_switch" in
> the trace output.

cpu_switch is just where it shows up most often nowadays. When I debugged it,
it was all over the place. Any debug printf I added to detect the condition
that caused the crash just moved the bug to another place.

> 2003-10-01 21:40:00
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2
>
> 2003-08-03 12:00:14
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2

It was definitely happening before that. At least since summer 2002, or
even earlier.

> From other bug reports in the archive I know 3.8, 3.7 and 3.6 are all
> affected by "The" alpha bug if my hunch is correct and the bugs linked
> above are related to "The" alpha bug, then I should start the
> compile-a-thon at OpenBSD v3.3 and work backwards.

Good luck. Since there is no way to reproduce the problem, there is also no
way to know that you have successfully found the bug unless you run
every one of your compiles for at least a few weeks with normal load.
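
To put rough numbers on that, here is a small sketch. It assumes
crashes arrive independently at a constant average rate (a Poisson
process), which nothing in this thread establishes, and the function
names and the 7-day figure below are purely illustrative. Under that
assumption a buggy build survives t days without a crash with
probability exp(-t / m), where m is the mean number of days between
crashes, so the soak time needed to keep the chance of wrongly calling
a buggy build "clean" below some limit is m * ln(1 / limit).

```python
import math


def survival_probability(days: float, mean_days_between_crashes: float) -> float:
    """Chance that a build which HAS the bug runs `days` without crashing,
    assuming crashes follow a Poisson process (an assumption)."""
    return math.exp(-days / mean_days_between_crashes)


def required_soak_days(mean_days_between_crashes: float,
                       max_false_clean: float) -> float:
    """Days a build must run crash-free before the chance of wrongly
    declaring a buggy build clean drops to `max_false_clean`."""
    return mean_days_between_crashes * math.log(1.0 / max_false_clean)


if __name__ == "__main__":
    mean_days = 7.0   # illustrative guess, not a measured value
    limit = 0.05      # accept a 5% chance of a false "clean" verdict
    days = required_soak_days(mean_days, limit)
    print(f"soak each build for about {days:.1f} days")
    print(f"check: survival probability = {survival_probability(days, mean_days):.3f}")
```

For a 5% limit the required run is about three times the mean time
between crashes (ln 20 is roughly 3.0), per build tested, which is
consistent with "at least a few weeks" if crashes show up about once a
week on average.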

//art

Re: Bug Hunting 101 - Finding "The" Alpha Bug

Nikns Siankin
In reply to this post by J.C. Roberts-2
Upgraded alphastation to 3.8 and, for the first time in my life, hit
the alpha bug. ;)
The kernel panicked while ungzipping src.tar.gz.
When I hit continue in ddb, I was dropped into
another panic.
Here are photos of the panics; maybe they will help someone
find the alpha bug :))

http://secure.lv/~nikns/alphabug/

Re: Bug Hunting 101 - Finding "The" Alpha Bug

J.C. Roberts-2
On Tue, 27 Dec 2005 09:01:00 +0200, nikns <[hidden email]> wrote:

>Upgraded alphastation to 3.8 and, for the first time in my life, hit
>the alpha bug. ;)
>The kernel panicked while ungzipping src.tar.gz.
>When I hit continue in ddb, I was dropped into
>another panic.
>Here are photos of the panics; maybe they will help someone
>find the alpha bug :))
>
>http://secure.lv/~nikns/alphabug/

Any chance you can post a dmesg for the box?

thanks,
jcr

Re: Bug Hunting 101 - Finding "The" Alpha Bug

Nikns Siankin
On Thu, Dec 29, 2005 at 01:51:34PM -0800, J.C. Roberts wrote:

>On Tue, 27 Dec 2005 09:01:00 +0200, nikns <[hidden email]> wrote:
>
>>Upgraded alphastation to 3.8 and, for the first time in my life, hit
>>the alpha bug. ;)
>>The kernel panicked while ungzipping src.tar.gz.
>>When I hit continue in ddb, I was dropped into
>>another panic.
>>Here are photos of the panics; maybe they will help someone
>>find the alpha bug :))
>>
>>http://secure.lv/~nikns/alphabug/
>
>Any chance you can post a dmesg for the box?

http://marc.theaimsgroup.com/?l=openbsd-alpha&m=113051046212041&w=2

Welcome!