gdb in uninterruptible wait

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

gdb in uninterruptible wait

Julian Smith
I've been finding egdb and gdb rather easily get stuck in an
uninterruptible wait, e.g. when running the 'next' command after
hitting a breakpoint.

So it's not possible to kill the debuggee or gdb and the only way to
kill the debuggee process and free up its listening sockets seems to be
to reboot the entire system.

Perhaps unsurprisingly one cannot attach a second invocation of gdb to
the uninterruptible gdb, so i don't know for sure what syscall is being
run that is getting stuck.

The debuggee is a local build of the flightgear flight simulator.

Here's the output of ps for the debugger and debuggee:

12419 p0  D        0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
63921 p0  TX+      0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)

I've tried using ktrace on egdb, and the kdump output ends like this:

 53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
 53950 egdb     RET   wait4 97562/0x17d1a
 53950 egdb     CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
 53950 egdb     RET   ptrace 0
 53950 egdb     CALL  ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
 53950 egdb     RET   ptrace 0
 53950 egdb     CALL  futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
 53950 egdb     RET   futex 0
 53950 egdb     CALL  futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
 53950 egdb     RET   futex 0
 53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
 53950 egdb     RET   ptrace 0
 53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
 53950 egdb     RET   ptrace 0
 53950 egdb     CALL  ptrace(PT_STEP,97562,0x1,0)
 53950 egdb     RET   ptrace 0
 53950 egdb     CALL  read(6,0x7f7ffffe9187,0x1)
 53950 egdb     RET   read -1 errno 35 Resource temporarily unavailable
 53950 egdb     CALL  poll(0x6441581e720,3,0)
 53950 egdb     STRU  struct pollfd [3] { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, events=0x1<POLLIN>, revents=0<> }
 53950 egdb     RET   poll 0
 53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)

Assuming that this is the actual end of the ktrace output and there
isn't some missing ktrace output in a buffer somewhere, this looks
like egdb is simply blocked in wait4(), which should be harmless and
certainly not uninterruptable?

Does anyone have any suggestions about how to investigate this further?

I'm running OpenBSD 6.7 GENERIC.MP#182 amd64.

Thanks,

- Jules

--
http://op59.net

Reply | Threaded
Open this post in threaded view
|

Re: gdb in uninterruptible wait

Visa Hankala-2
On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:

> I've been finding egdb and gdb rather easily get stuck in an
> uninterruptible wait, e.g. when running the 'next' command after
> hitting a breakpoint.
>
> So it's not possible to kill the debuggee or gdb and the only way to
> kill the debuggee process and free up its listening sockets seems to be
> to reboot the entire system.
>
> Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> the uninterruptible gdb, so i don't know for sure what syscall is being
> run that is getting stuck.
>
> The debuggee is a local build of the flightgear flight simulator.
>
> Here's the output of ps for the debugger and debuggee:
>
> 12419 p0  D        0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> 63921 p0  TX+      0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
>
> I've tried using ktrace on egdb, and the kdump output ends like this:
>
>  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
>  53950 egdb     RET   wait4 97562/0x17d1a
>  53950 egdb     CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
>  53950 egdb     RET   futex 0
>  53950 egdb     CALL  futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
>  53950 egdb     RET   futex 0
>  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_STEP,97562,0x1,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  read(6,0x7f7ffffe9187,0x1)
>  53950 egdb     RET   read -1 errno 35 Resource temporarily unavailable
>  53950 egdb     CALL  poll(0x6441581e720,3,0)
>  53950 egdb     STRU  struct pollfd [3] { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, events=0x1<POLLIN>, revents=0<> }
>  53950 egdb     RET   poll 0
>  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
>
> Assuming that this is the actual end of the ktrace output and there
> isn't some missing ktrace output in a buffer somewhere, this looks
> like egdb is simply blocked in wait4(), which should be harmless and
> certainly not uninterruptable?

The single-thread check done by wait4() is non-interruptible.
When the debugger gets stuck, is it blocked in "suspend" state?

However, I think there is a bug in the single-thread switch code.
It looks that ps_singlecount can be decremented too much. This probably
is a regression of making ps_singlecount unsigned and letting
single_thread_check() run without the kernel lock.

The bug might go away if single_thread_check() made sure that
P_SUSPSINGLE is set before the thread suspends.

Does the following patch help? Even if it does, it probably needs
some refining.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258
+++ kern/kern_sig.c 20 Jul 2020 04:27:30 -0000
@@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int
  return (EINTR);
  }
 
- if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
- wakeup(&pr->ps_singlecount);
+ SCHED_LOCK(s);
+ if (p->p_flag & P_SUSPSINGLE) {
+ if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
+ wakeup(&pr->ps_singlecount);
+ } else if ((p->p_flag & P_WEXIT) == 0) {
+ SCHED_UNLOCK(s);
+ CPU_BUSY_CYCLE();
+ continue;
+ }
  if (pr->ps_flags & PS_SINGLEEXIT) {
+ SCHED_UNLOCK(s);
  KERNEL_LOCK();
  exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
- KERNEL_UNLOCK();
+ /* NOTREACHED */
  }
 
  /* not exiting and don't need to unwind, so suspend */
- SCHED_LOCK(s);
  p->p_stat = SSTOP;
  mi_switch();
  SCHED_UNLOCK(s);

Reply | Threaded
Open this post in threaded view
|

Re: gdb in uninterruptible wait

Visa Hankala-2
On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:

> On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > I've been finding egdb and gdb rather easily get stuck in an
> > uninterruptible wait, e.g. when running the 'next' command after
> > hitting a breakpoint.
> >
> > So it's not possible to kill the debuggee or gdb and the only way to
> > kill the debuggee process and free up its listening sockets seems to be
> > to reboot the entire system.
> >
> > Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> > the uninterruptible gdb, so i don't know for sure what syscall is being
> > run that is getting stuck.
> >
> > The debuggee is a local build of the flightgear flight simulator.
> >
> > Here's the output of ps for the debugger and debuggee:
> >
> > 12419 p0  D        0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> > 63921 p0  TX+      0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
> >
> > I've tried using ktrace on egdb, and the kdump output ends like this:
> >
> >  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> >  53950 egdb     RET   wait4 97562/0x17d1a
> >  53950 egdb     CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
> >  53950 egdb     RET   ptrace 0
> >  53950 egdb     CALL  ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
> >  53950 egdb     RET   ptrace 0
> >  53950 egdb     CALL  futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> >  53950 egdb     RET   futex 0
> >  53950 egdb     CALL  futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> >  53950 egdb     RET   futex 0
> >  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> >  53950 egdb     RET   ptrace 0
> >  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> >  53950 egdb     RET   ptrace 0
> >  53950 egdb     CALL  ptrace(PT_STEP,97562,0x1,0)
> >  53950 egdb     RET   ptrace 0
> >  53950 egdb     CALL  read(6,0x7f7ffffe9187,0x1)
> >  53950 egdb     RET   read -1 errno 35 Resource temporarily unavailable
> >  53950 egdb     CALL  poll(0x6441581e720,3,0)
> >  53950 egdb     STRU  struct pollfd [3] { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, events=0x1<POLLIN>, revents=0<> }
> >  53950 egdb     RET   poll 0
> >  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> >
> > Assuming that this is the actual end of the ktrace output and there
> > isn't some missing ktrace output in a buffer somewhere, this looks
> > like egdb is simply blocked in wait4(), which should be harmless and
> > certainly not uninterruptable?
>
> The single-thread check done by wait4() is non-interruptible.
> When the debugger gets stuck, is it blocked in "suspend" state?
>
> However, I think there is a bug in the single-thread switch code.
> It looks that ps_singlecount can be decremented too much. This probably
> is a regression of making ps_singlecount unsigned and letting
> single_thread_check() run without the kernel lock.
>
> The bug might go away if single_thread_check() made sure that
> P_SUSPSINGLE is set before the thread suspends.

Below is an updated patch for testing. It extends the scope of
SCHED_LOCK() so that there are fewer chances of interleaving of
single_thread_set() and single_thread_check().

One problem is that once single_thread_set() sets ps_single, the other
threads can enter the suspension code in single_thread_check(). The
extended lock scope prevents the threads from taking action before
single_thread_set() has finished with the state updates.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258
+++ kern/kern_sig.c 20 Jul 2020 13:29:50 -0000
@@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int
  return (EINTR);
  }
 
+ SCHED_LOCK(s);
  if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
  wakeup(&pr->ps_singlecount);
  if (pr->ps_flags & PS_SINGLEEXIT) {
+ SCHED_UNLOCK(s);
  KERNEL_LOCK();
  exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
- KERNEL_UNLOCK();
+ /* NOTREACHED */
  }
 
  /* not exiting and don't need to unwind, so suspend */
- SCHED_LOCK(s);
  p->p_stat = SSTOP;
  mi_switch();
  SCHED_UNLOCK(s);
@@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
 {
  struct process *pr = p->p_p;
  struct proc *q;
- int error;
+ int error, s;
 
  KERNEL_ASSERT_LOCKED();
  KASSERT(curproc == p);
@@ -1974,26 +1975,22 @@ single_thread_set(struct proc *p, enum s
  panic("single_thread_mode = %d", mode);
 #endif
  }
+ SCHED_LOCK(s);
  pr->ps_singlecount = 0;
  membar_producer();
  pr->ps_single = p;
  TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
- int s;
-
  if (q == p)
  continue;
  if (q->p_flag & P_WEXIT) {
  if (mode == SINGLE_EXIT) {
- SCHED_LOCK(s);
  if (q->p_stat == SSTOP) {
  setrunnable(q);
  atomic_inc_int(&pr->ps_singlecount);
  }
- SCHED_UNLOCK(s);
  }
  continue;
  }
- SCHED_LOCK(s);
  atomic_setbits_int(&q->p_flag, P_SUSPSINGLE);
  switch (q->p_stat) {
  case SIDL:
@@ -2027,8 +2024,8 @@ single_thread_set(struct proc *p, enum s
  signotify(q);
  break;
  }
- SCHED_UNLOCK(s);
  }
+ SCHED_UNLOCK(s);
 
  if (mode != SINGLE_PTRACE)
  single_thread_wait(pr, 1);

Reply | Threaded
Open this post in threaded view
|

Re: gdb in uninterruptible wait

Julian Smith
On Mon, 20 Jul 2020 15:26:11 +0000
Visa Hankala <[hidden email]> wrote:

> On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:  
> > > I've been finding egdb and gdb rather easily get stuck in an
> > > uninterruptible wait, e.g. when running the 'next' command after
> > > hitting a breakpoint.

[...]

> > The single-thread check done by wait4() is non-interruptible.
> > When the debugger gets stuck, is it blocked in "suspend" state?

ps reports it to be in state 'D'.

> >
> > However, I think there is a bug in the single-thread switch code.
> > It looks that ps_singlecount can be decremented too much. This
> > probably is a regression of making ps_singlecount unsigned and
> > letting single_thread_check() run without the kernel lock.
> >
> > The bug might go away if single_thread_check() made sure that
> > P_SUSPSINGLE is set before the thread suspends.  
>
> Below is an updated patch for testing. It extends the scope of
> SCHED_LOCK() so that there are fewer chances of interleaving of
> single_thread_set() and single_thread_check().

Many thanks for these patches. I'll try to test in the next couple of
days. Though the last time i built an OpenBSD kernel was well over a
decade ago, so it might take me a little longer.

Thanks,

- Jules


>
> One problem is that once single_thread_set() sets ps_single, the other
> threads can enter the suspension code in single_thread_check(). The
> extended lock scope prevents the threads from taking action before
> single_thread_set() has finished with the state updates.
>
> Index: kern/kern_sig.c
> ===================================================================
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.258
> diff -u -p -r1.258 kern_sig.c
> --- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258
> +++ kern/kern_sig.c 20 Jul 2020 13:29:50 -0000
> @@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int
>   return (EINTR);
>   }
>  
> + SCHED_LOCK(s);
>   if (atomic_dec_int_nv(&pr->ps_singlecount)
> == 0) wakeup(&pr->ps_singlecount);
>   if (pr->ps_flags & PS_SINGLEEXIT) {
> + SCHED_UNLOCK(s);
>   KERNEL_LOCK();
>   exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
> - KERNEL_UNLOCK();
> + /* NOTREACHED */
>   }
>  
>   /* not exiting and don't need to unwind, so
> suspend */
> - SCHED_LOCK(s);
>   p->p_stat = SSTOP;
>   mi_switch();
>   SCHED_UNLOCK(s);
> @@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
>  {
>   struct process *pr = p->p_p;
>   struct proc *q;
> - int error;
> + int error, s;
>  
>   KERNEL_ASSERT_LOCKED();
>   KASSERT(curproc == p);
> @@ -1974,26 +1975,22 @@ single_thread_set(struct proc *p, enum s
>   panic("single_thread_mode = %d", mode);
>  #endif
>   }
> + SCHED_LOCK(s);
>   pr->ps_singlecount = 0;
>   membar_producer();
>   pr->ps_single = p;
>   TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
> - int s;
> -
>   if (q == p)
>   continue;
>   if (q->p_flag & P_WEXIT) {
>   if (mode == SINGLE_EXIT) {
> - SCHED_LOCK(s);
>   if (q->p_stat == SSTOP) {
>   setrunnable(q);
>   atomic_inc_int(&pr->ps_singlecount);
>   }
> - SCHED_UNLOCK(s);
>   }
>   continue;
>   }
> - SCHED_LOCK(s);
>   atomic_setbits_int(&q->p_flag, P_SUSPSINGLE);
>   switch (q->p_stat) {
>   case SIDL:
> @@ -2027,8 +2024,8 @@ single_thread_set(struct proc *p, enum s
>   signotify(q);
>   break;
>   }
> - SCHED_UNLOCK(s);
>   }
> + SCHED_UNLOCK(s);
>  
>   if (mode != SINGLE_PTRACE)
>   single_thread_wait(pr, 1);
>
>



--
http://op59.net

Reply | Threaded
Open this post in threaded view
|

Re: gdb in uninterruptible wait

Julian Smith
On Mon, 20 Jul 2020 17:18:19 +0100
Julian Smith <[hidden email]> wrote:

> On Mon, 20 Jul 2020 15:26:11 +0000
> Visa Hankala <[hidden email]> wrote:
>
> > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:  
> > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:    
> > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > uninterruptible wait, e.g. when running the 'next' command after
> > > > hitting a breakpoint.  
>
> [...]
>
> > > The single-thread check done by wait4() is non-interruptible.
> > > When the debugger gets stuck, is it blocked in "suspend" state?  
>
> ps reports it to be in state 'D'.
>
> > >
> > > However, I think there is a bug in the single-thread switch code.
> > > It looks that ps_singlecount can be decremented too much. This
> > > probably is a regression of making ps_singlecount unsigned and
> > > letting single_thread_check() run without the kernel lock.
> > >
> > > The bug might go away if single_thread_check() made sure that
> > > P_SUSPSINGLE is set before the thread suspends.    
> >
> > Below is an updated patch for testing. It extends the scope of
> > SCHED_LOCK() so that there are fewer chances of interleaving of
> > single_thread_set() and single_thread_check().  
>
> Many thanks for these patches. I'll try to test in the next couple of
> days. Though the last time i built an OpenBSD kernel was well over a
> decade ago, so it might take me a little longer.

I managed to build a patched kernel, and it seems to fix the problem -
i haven't been able to get egdb into an uninterruptable wait state.

Also, i've been running the patched kernel all day now and it doesn't
seem to be causing any problems elsewhere.

Thanks,

- Jules

--
http://op59.net


Reply | Threaded
Open this post in threaded view
|

Re: gdb in uninterruptible wait

Julian Smith
On Tue, 21 Jul 2020 19:23:44 +0100
Julian Smith <[hidden email]> wrote:

> On Mon, 20 Jul 2020 17:18:19 +0100
> Julian Smith <[hidden email]> wrote:
>
> > On Mon, 20 Jul 2020 15:26:11 +0000
> > Visa Hankala <[hidden email]> wrote:
> >  
> > > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:    
> > > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > >    
> > > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > > uninterruptible wait, e.g. when running the 'next' command
> > > > > after hitting a breakpoint.    
> >
> > [...]
> >  
> > > > The single-thread check done by wait4() is non-interruptible.
> > > > When the debugger gets stuck, is it blocked in "suspend" state?
> > > >    
> >
> > ps reports it to be in state 'D'.
> >  
> > > >
> > > > However, I think there is a bug in the single-thread switch
> > > > code. It looks that ps_singlecount can be decremented too much.
> > > > This probably is a regression of making ps_singlecount unsigned
> > > > and letting single_thread_check() run without the kernel lock.
> > > >
> > > > The bug might go away if single_thread_check() made sure that
> > > > P_SUSPSINGLE is set before the thread suspends.      
> > >
> > > Below is an updated patch for testing. It extends the scope of
> > > SCHED_LOCK() so that there are fewer chances of interleaving of
> > > single_thread_set() and single_thread_check().    
> >
> > Many thanks for these patches. I'll try to test in the next couple
> > of days. Though the last time i built an OpenBSD kernel was well
> > over a decade ago, so it might take me a little longer.  
>
> I managed to build a patched kernel, and it seems to fix the problem -
> i haven't been able to get egdb into an uninterruptable wait state.
>
> Also, i've been running the patched kernel all day now and it doesn't
> seem to be causing any problems elsewhere.

Unfortunately the same problem has just occurred again. I've run egdb
quite a few times since i updated the kernel, so the patch has
definitely reduced the problem, but it doesn't seem to have eliminated
it.

Let me know if there anything i could do to find out more information.

Thanks,

- Jules

--
http://op59.net