pool_do_get: scxspl free list modified on amd64 regress machine

pool_do_get: scxspl free list modified on amd64 regress machine

Alexander Bluhm
Hi,

I got a strange panic on my daily amd64 regress machine:

OpenBSD 6.5-current (GENERIC.MP) #129: Mon Jul 15 18:54:34 MDT 2019
    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP

reordering libraries:panic: pool_do_get: scxspl free list modified: page 0xfffffd823f71f000; item addr 0xfffffd823f71f550; offset 0x18=0x3f759407
Stopped at      db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 523205  98028      0         0x3  0x4000000    1  ld
*128313  42691      0     0x14000      0x200    2K update
 367004  19543      0     0x14000 0x40000200    0  softclock
db_enter() at db_enter+0x10
panic() at panic+0x128
pool_do_get(ffffffff81dc1c50,a,ffff800021ec5b94) at pool_do_get+0x36c
pool_get(ffffffff81dc1c50,a) at pool_get+0x8f
scsi_xs_io(ffff8000000bf000,ffff80000009a2c0,1) at scsi_xs_io+0x3b
scsi_xsh_ioh(ffff8000000949c0,ffff80000009a2c0) at scsi_xsh_ioh+0x29
scsi_iopool_run(ffff8000000961c8) at scsi_iopool_run+0xad
scsi_xsh_runqueue(ffff8000000bf000) at scsi_xsh_runqueue+0x161
scsi_xsh_add(ffff8000000949c0) at scsi_xsh_add+0x91
sdstrategy(fffffd82798eea00) at sdstrategy+0x112
spec_strategy(ffff800021ec5e30) at spec_strategy+0x55
ufs_strategy(ffff800021ec5e30) at ufs_strategy+0xf2
VOP_STRATEGY(fffffd82798eea00) at VOP_STRATEGY+0x4a
bwrite(fffffd82798eea00) at bwrite+0x126
end trace frame: 0xffff800021ec5ed0, count: 0

Full console log at
http://bluhm.genua.de/regress/results/2019-07-17T09%3A17%3A02Z/setup-ot6.log

bluhm


Re: pool_do_get: scxspl free list modified on amd64 regress machine

Alexander Bluhm
On Wed, Jul 17, 2019 at 01:26:29PM +0200, Alexander Bluhm wrote:
> I got a strange panic on my daily amd64 regress machine:
> reordering libraries:panic: pool_do_get: scxspl free list modified: page

I have been seeing more strange effects on my regress machines for a
while now.  The SSH connection that controls my tests fails with a
broken pipe.

packet_write_wait: Connection to 10.0.1.26 port 22: Broken pipe

Here is an excerpt from the previous two weeks.  I never see it on
i386; that architecture is stable for my use case.

arch  GENERIC.MP build date          result
amd64 Fri Jun 28 10:59:55 MDT 2019 ok
arm64 Fri Jun 28 12:09:23 MDT 2019 ok
amd64 Thu Jul 4 21:30:05 MDT 2019 ok
arm64 Thu Jul 4 23:42:09 MDT 2019 ok
amd64 Sat Jul 6 00:15:43 MDT 2019 broken pipe
arm64 Sat Jul 6 02:28:11 MDT 2019 broken pipe
amd64 Sat Jul 6 19:30:27 MDT 2019 ok
arm64 Sat Jul 6 21:42:52 MDT 2019 broken pipe
amd64 Sun Jul 7 22:21:36 MDT 2019 broken pipe
arm64 Mon Jul 8 00:34:07 MDT 2019 broken pipe
amd64 Mon Jul 8 11:54:20 MDT 2019 ok
arm64 Tue Jul 9 03:12:08 MDT 2019 broken pipe
amd64 Wed Jul 10 00:20:15 MDT 2019 ok
arm64 Wed Jul 10 02:33:40 MDT 2019 ok
amd64 Wed Jul 10 23:35:33 MDT 2019 broken pipe
arm64 Thu Jul 11 04:32:25 MDT 2019 ok
amd64 Thu Jul 11 15:44:36 MDT 2019 broken pipe
arm64 Thu Jul 11 17:56:31 MDT 2019 ok
amd64 Fri Jul 12 09:59:59 MDT 2019 broken pipe
arm64 Fri Jul 12 12:12:19 MDT 2019 ok
amd64 Fri Jul 12 09:59:59 MDT 2019 ok
arm64 Fri Jul 12 12:12:19 MDT 2019 ok
amd64 Fri Jul 12 09:59:59 MDT 2019 broken pipe
arm64 Fri Jul 12 12:12:19 MDT 2019 broken pipe
amd64 Mon Jul 15 18:54:34 MDT 2019 broken pipe
arm64 Mon Jul 15 21:07:06 MDT 2019 ok
amd64 Mon Jul 15 18:54:34 MDT 2019 scxspl free list modified

Sometimes the test before the broken pipe has an unusual result:

Sun Jul 7 22:21:36 MDT 2019
malloc_duel(65986) in free(): chunk canary corrupted 0x873508a1770 0x4@0x4

Tue Jul 9 03:12:08 MDT 2019
ln: fstest_fc3618a74baf021f5226c3cb938e91f0/32767: Too many links

Fri Jul 12 09:59:59 MDT 2019
malloc_duel(22164) in free(): chunk canary corrupted 0x4fd491216a0 0x4@0x4 (double free?)

Fri Jul 12 12:12:19 MDT 2019
ln: fstest_c7e1a3c857443619516751530ca0a85e/32767: Too many links

It is not reproducible; the broken pipe happens at different regress
tests.  But there are some regress tests where it is more likely:
during the first one below it happens quite often, the last two hit
it only once.

regress/sbin/iked/parser
regress/lib/libpthread/malloc_duel
regress/sys/ffs/ffs
regress/sys/ffs/nfs
regress/lib/libc/stdio_threading/fread

bluhm


uvm_fault pmap_page_remove+0x210: xchgq %rax,0(%rcx,%rdx,1)

Alexander Bluhm
On Wed, Jul 17, 2019 at 05:19:55PM +0200, Alexander Bluhm wrote:
> On Wed, Jul 17, 2019 at 01:26:29PM +0200, Alexander Bluhm wrote:
> > I got a strange panic on my daily amd64 regress machine:
> > reordering libraries:panic: pool_do_get: scxspl free list modified: page
>
> I have been seeing more strange effects on my regress machines for a
> while now.  The SSH connection that controls my tests fails with a
> broken pipe.

Unfortunately an SSH broken pipe is the common error whenever I lose
contact with the test machine.  Most of the time it was a local
fuckup: a pf on a bridge sending TCP RST.

> Fri Jul 12 09:59:59 MDT 2019
> malloc_duel(22164) in free(): chunk canary corrupted 0x4fd491216a0 0x4@0x4 (double free?)

This problem is real.  Running /usr/src/regress/lib/libpthread/malloc_duel
in a loop ends in a crash.

The reaper on CPU 0 does a NULL dereference when removing the page.
On CPU 1 the zerothread is waiting for the kernel lock.  CPUs 2 and
3 are idle.

uvm_fault(0xfffffd8240760cc8, 0x7f827ea48908, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at      pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)

version:        OpenBSD 6.5-current (GENERIC.MP) #129: Mon Jul 15 18:54:34 MDT 2019\012    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP

ddb{0}> trace
pmap_page_remove(fffffd8107d97b00) at pmap_page_remove+0x210
uvm_anfree(fffffd8214777360) at uvm_anfree+0x36
amap_wipeout(fffffd825d3dc130) at amap_wipeout+0xe5
uvm_unmap_detach(ffff800021eba2d8,1) at uvm_unmap_detach+0xef
uvm_map_teardown(fffffd8279b37340) at uvm_map_teardown+0x1c1
uvmspace_free(fffffd8279b37340) at uvmspace_free+0x57
uvm_exit(ffff8000ffff6d90) at uvm_exit+0x24
reaper(ffff8000ffff8770) at reaper+0x13b
end trace frame: 0x0, count: -8

*52315  402033      0      0  7     0x14200                reaper

ddb{0}> show register
rdi                              0xa
rsi               0xfffffd8107d97b68
rbp               0xffff800021eba1f0
rbx                                0
rdx                   0x7f8000000000
rcx                      0x20a8ea850
rax                                0
r8                0xfffffd810a6e5d80
r9                0xffffffff81d27ff0    cpu_info_full_primary+0x1ff0
r10               0x45e35718a213a84d
r11               0xd63ebed56fdc728a
r12               0xfffffd8107d97b00
r13               0xfffffd823df88940
r14                      0x27f7c2000
r15               0xfffffd8107d97b68
rip               0xffffffff817df8f0    pmap_page_remove+0x210
cs                               0x8
rflags                       0x10246    __ALIGN_SIZE+0xf246
rsp               0xffff800021eba190
ss                              0x10
pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)

ddb{1}> trace
x86_ipi_db(ffff800021c80ff0) at x86_ipi_db+0x12
x86_ipi_handler() at x86_ipi_handler+0x80
Xresume_lapic_ipi(a,ffff800021c80ff0,ffff8000ffff9640,0,0,ffff8000ffff9718) at Xresume_lapic_ipi+0x23
_kernel_lock() at _kernel_lock+0xae
timeout_del_barrier(ffff8000ffff9718) at timeout_del_barrier+0xa2
msleep(ffffffff81dacc6c,ffffffff81dacbb8,7f,ffffffff81a91b36,0) at msleep+0xf5
uvm_pagezero_thread(ffff8000ffff9640) at uvm_pagezero_thread+0xa2
end trace frame: 0x0, count: -7

*63535   48641      0      0  7     0x14200                zerothread

ddb{1}> show register
rdi               0xffff800021c80ff0
rsi                                0
rbp               0xffff800021ed2440
rbx               0xffffffff81d06168    ipifunc+0x38
rdx                                0
rcx                              0x7
rax                       0xffffff7f
r8                                 0
r9                                 0
r10                                0
r11               0x29c1412aeecf9c9a
r12                              0x7
r13                                0
r14               0xffff800021c80ff0
r15                                0
rip               0xffffffff813572e2    x86_ipi_db+0x12
cs                               0x8
rflags                         0x206
rsp               0xffff800021ed2430
ss                              0x10
x86_ipi_db+0x12:        leave

I will update the kernel and check whether the panic is reproducible.

bluhm


uvm_fault pmap_enter+0x1d6: movq __ALIGN_SIZE+0x3000(%rcx,%rsi,8),%rsi

Alexander Bluhm
On Wed, Jul 24, 2019 at 08:59:44PM +0200, Alexander Bluhm wrote:
> The reaper on CPU 0 does a NULL dereference when removing the page.
> On CPU 1 the zerothread is waiting for the kernel lock.  CPUs 2 and
> 3 are idle.
>
> uvm_fault(0xfffffd8240760cc8, 0x7f827ea48908, 0, 2) -> e
> kernel: page fault trap, code=0
> Stopped at      pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)

Forgot to mention: that was C source line pmap.c:1878

                opte = pmap_pte_set(&PTE_BASE[pl1_i(pve->pv_va)], 0);
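
(As a minimal sketch, assuming the usual amd64 definition via
x86_atomic_testset_u64: pmap_pte_set() boils down to an atomic 64-bit
exchange of the PTE, which is why the faulting instruction above is
an xchgq.)

#include <stdint.h>

/*
 * Sketch (assumed, not the kernel source): atomically swap in the new
 * PTE and return the old one.  The exchange compiles to the xchgq
 * seen in the faulting instruction.
 */
static inline uint64_t
pmap_pte_set_sketch(volatile uint64_t *ptep, uint64_t npte)
{
	return __atomic_exchange_n(ptep, npte, __ATOMIC_SEQ_CST);
}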

> I will update the kernel and check whether the panic is reproducible.

It is reproducible:

ddb{3}> x/s version
version:        OpenBSD 6.5-current (GENERIC.MP) #139: Wed Jul 24 05:11:28 MDT 2019\012    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012

ddb{3}> show panic
kernel page fault
uvm_fault(0xfffffd823efc7998, 0x7f8444c11f08, 0, 1) -> e
pmap_enter(fffffd823e1ce3f8,889823e1000,5f3c2000,3,22) at pmap_enter+0x1d6
end trace frame: 0xffff80002210ed30, count: 0

Now it happens in pmap.c:2624

        opte = PTE_BASE[pl1_i(va)];             /* old PTE */

Something in the PTE_BASE array is not mapped.

ddb{3}> trace
pmap_enter(fffffd823e1ce3f8,889823e1000,5f3c2000,3,22) at pmap_enter+0x1d6
uvm_fault(fffffd823efc7998,889823e1000,0,2) at uvm_fault+0xa2a
pageflttrap() at pageflttrap+0x145
usertrap(ffff80002210ee20) at usertrap+0x1e3
recall_trap(6,dfdfdfdfdfdfdfdf,0,6,1000,8890b6fc7c0) at recall_trap+0x8
end of kernel
end trace frame: 0x888fdfc9330, count: -5

Note that on June 11th I reported a similar trace in pmap to bugs@
when ld caused a crash.

ddb{3}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 76368  342680   5059      0  2         0x2                malloc_duel
 76368  101339   5059      0  7   0x4000002                malloc_duel
 76368  514296   5059      0  3   0x4000082  fsleep        malloc_duel
*76368  384915   5059      0  7   0x4000002                malloc_duel
 76368  221830   5059      0  7   0x4000002                malloc_duel
 76368  361827   5059      0  7   0x4000002                malloc_duel
 76368  480274   5059      0  3   0x4000082  fsleep        malloc_duel
 76368  468117   5059      0  3   0x4000082  fsleep        malloc_duel
 76368  461971   5059      0  3   0x4000082  fsleep        malloc_duel
 76368  266728   5059      0  2   0x4000002                malloc_duel
 76368   82327   5059      0  2   0x4000002                malloc_duel
  5059  194815   4702      0  3    0x10008a  pause         make
  4702  434789  57398      0  3    0x10008a  pause         sh
 57398  272052  80135      0  3    0x10008a  pause         make
 80135   83438  74843      0  3    0x10008a  pause         sh
 74843  269959  24644      0  3    0x10008a  pause         make
 71213   91038  31378      0  3    0x100082  piperd        gzip
 31378  297755  24644      0  3    0x100082  piperd        pax
 24644  139228  73204      0  3        0x82  piperd        perl
 73204  241400   3907      0  3    0x10008a  pause         ksh
  3907  427314  77842      0  3        0x92  select        sshd
 49732  259852      1      0  3    0x100083  ttyin         getty
 58444  180559      1      0  3    0x100083  ttyin         getty
 30659  289121      1      0  3    0x100083  ttyin         getty
  9656  108850      1      0  3    0x100083  ttyin         getty
 24203   10241      1      0  3    0x100083  ttyin         getty
 65063  251469      1      0  3    0x100083  ttyin         getty
 16142  523320      1      0  3    0x100098  poll          cron
 90805    3316      0      0  3     0x14280  nfsidl        nfsio
 11202  322177      0      0  3     0x14280  nfsidl        nfsio
 73491  331359      0      0  3     0x14280  nfsidl        nfsio
 37841  249018      0      0  3     0x14280  nfsidl        nfsio
  4136  428500      1     99  3    0x100090  poll          sndiod
 12112  519438      1    110  3    0x100090  poll          sndiod
 49306   97767    137     95  3    0x100092  kqread        smtpd
 70869  189393    137    103  3    0x100092  kqread        smtpd
 79867  131344    137     95  3    0x100092  kqread        smtpd
 66859  375509    137     95  3    0x100092  kqread        smtpd
 22396   48018    137     95  3    0x100092  kqread        smtpd
 16604   93317    137     95  3    0x100092  kqread        smtpd
   137  452544      1      0  3    0x100080  kqread        smtpd
 77842  219221      1      0  3        0x80  select        sshd
 88298  318549      0      0  3     0x14200  acct          acct
  7436  211089      1      0  3    0x100080  poll          ntpd
 15596  214430  72873     83  3    0x100092  poll          ntpd
 72873  423080      1     83  3    0x100092  poll          ntpd
   639  455748   5843     74  3    0x100092  bpf           pflogd
  5843  152563      1      0  3        0x80  netio         pflogd
 49089   65344  96782     73  3    0x100090  kqread        syslogd
 96782  134250      1      0  3    0x100082  netio         syslogd
 15309   57931      1     77  3    0x100090  poll          dhclient
 92131  300080      1      0  3        0x80  poll          dhclient
   440  434925  45137    115  3    0x100092  kqread        slaacd
 23230  157398  45137    115  3    0x100092  kqread        slaacd
 45137  283018      1      0  3    0x100080  kqread        slaacd
 11751  424885      0      0  3     0x14200  pgzero        zerothread
 94669  233757      0      0  3     0x14200  aiodoned      aiodoned
 39044  189625      0      0  3     0x14200  syncer        update
 11265  246421      0      0  3     0x14200  cleaner       cleaner
 86967  386950      0      0  3     0x14200  reaper        reaper
 48511  221734      0      0  3     0x14200  pgdaemon      pagedaemon
 27362  255648      0      0  3     0x14200  bored         crynlk
 58949  107875      0      0  3     0x14200  bored         crypto
 88305  317139      0      0  3     0x14200  bored         sensors
 62804  248570      0      0  3     0x14200  usbtsk        usbtask
   717  253829      0      0  3     0x14200  usbatsk       usbatsk
 48070  263826      0      0  3  0x40014200  acpi0         acpi0
 65386  442770      0      0  3  0x40014200                idle3
 33089  148765      0      0  3  0x40014200                idle2
 65055  498669      0      0  3  0x40014200                idle1
 10578  506553      0      0  3     0x14200  bored         softnet
 70559   53653      0      0  3     0x14200  bored         systqmp
  6788     104      0      0  3     0x14200  bored         systq
 23919  173929      0      0  3  0x40014200  bored         softclock
 87424  241507      0      0  3  0x40014200                idle0
 44349  256295      0      0  3     0x14200  bored         smr
     1  488173      0      0  3        0x82  wait          init
     0       0     -1      0  3     0x10200  scheduler     swapper

bluhm


Re: uvm_fault pmap_enter+0x1d6: movq __ALIGN_SIZE+0x3000(%rcx,%rsi,8),%rsi

Mike Larkin-2
On Wed, Jul 24, 2019 at 10:48:25PM +0200, Alexander Bluhm wrote:

> On Wed, Jul 24, 2019 at 08:59:44PM +0200, Alexander Bluhm wrote:
> > The reaper on CPU 0 does a NULL dereference when removing the page.
> > On CPU 1 the zerothread is waiting for the kernel lock.  CPUs 2 and
> > 3 are idle.
> >
> > uvm_fault(0xfffffd8240760cc8, 0x7f827ea48908, 0, 2) -> e
> > kernel: page fault trap, code=0
> > Stopped at      pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)
>
> Forgot to mention: that was C source line pmap.c:1878
>
>                 opte = pmap_pte_set(&PTE_BASE[pl1_i(pve->pv_va)], 0);
>
> > I will update the kernel and check whether the panic is reproducible.
>
> It is reproducible:
>
> ddb{3}> x/s version
> version:        OpenBSD 6.5-current (GENERIC.MP) #139: Wed Jul 24 05:11:28 MDT 2019\012    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
>
> ddb{3}> show panic
> kernel page fault
> uvm_fault(0xfffffd823efc7998, 0x7f8444c11f08, 0, 1) -> e
> pmap_enter(fffffd823e1ce3f8,889823e1000,5f3c2000,3,22) at pmap_enter+0x1d6
> end trace frame: 0xffff80002210ed30, count: 0
>
> Now it happens in pmap.c:2624
>
>         opte = PTE_BASE[pl1_i(va)];             /* old PTE */
>
> Something in the PTE_BASE array is not mapped.
>

I wrote a quick program to calculate what address this would be
(thinking maybe we had some overflow or something), but it does indeed
match the faulting address above (0x7f8444c11f08) for the VA
0x889823e1000. This address is in the PTE range, so it looks like the
page was never allocated or was possibly double-freed. A double free
would match the previous email's comment as well.
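
(A minimal sketch of such a calculation; the L4_SLOT_PTE value of 255
is an assumption based on the amd64 pmap headers of that era, and it
is consistent with PTE_BASE = 0x7f8000000000, the rdx value in the
pmap_page_remove register dump.)

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define L4_SLOT_PTE	255ULL			/* recursive PML4 slot (assumed) */
#define NBPD_L4		(1ULL << 39)		/* bytes mapped by one PML4 entry */
#define PTE_BASE	(L4_SLOT_PTE * NBPD_L4)	/* 0x7f8000000000 */

int
main(void)
{
	uint64_t va = 0x889823e1000ULL;	/* VA from the pmap_enter panic */
	uint64_t pte = PTE_BASE + (va >> PAGE_SHIFT) * sizeof(uint64_t);

	/* prints 0x7f8444c11f08, the faulting address above */
	printf("PTE for va 0x%llx is at 0x%llx\n",
	    (unsigned long long)va, (unsigned long long)pte);
	return 0;
}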

If this happens again, it might be interesting to see which pages
around that one are mapped; for example, for this particular instance,
whether 0x7f8444c10000 or 0x7f8444c12000 is mapped. ddb's 'x' command
can do that (see if you get another fault or if you get some data).
The data in the pages around it might provide a hint (although that's
a long shot).

-ml

> ddb{3}> trace
> pmap_enter(fffffd823e1ce3f8,889823e1000,5f3c2000,3,22) at pmap_enter+0x1d6
> uvm_fault(fffffd823efc7998,889823e1000,0,2) at uvm_fault+0xa2a
> pageflttrap() at pageflttrap+0x145
> usertrap(ffff80002210ee20) at usertrap+0x1e3
> recall_trap(6,dfdfdfdfdfdfdfdf,0,6,1000,8890b6fc7c0) at recall_trap+0x8
> end of kernel
> end trace frame: 0x888fdfc9330, count: -5
>
> Note that on June 11th I reported a similar trace in pmap to bugs@
> when ld caused a crash.
>
> ddb{3}> ps
>    PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
>  76368  342680   5059      0  2         0x2                malloc_duel
>  76368  101339   5059      0  7   0x4000002                malloc_duel
>  76368  514296   5059      0  3   0x4000082  fsleep        malloc_duel
> *76368  384915   5059      0  7   0x4000002                malloc_duel
>  76368  221830   5059      0  7   0x4000002                malloc_duel
>  76368  361827   5059      0  7   0x4000002                malloc_duel
>  76368  480274   5059      0  3   0x4000082  fsleep        malloc_duel
>  76368  468117   5059      0  3   0x4000082  fsleep        malloc_duel
>  76368  461971   5059      0  3   0x4000082  fsleep        malloc_duel
>  76368  266728   5059      0  2   0x4000002                malloc_duel
>  76368   82327   5059      0  2   0x4000002                malloc_duel
>   5059  194815   4702      0  3    0x10008a  pause         make
>   4702  434789  57398      0  3    0x10008a  pause         sh
>  57398  272052  80135      0  3    0x10008a  pause         make
>  80135   83438  74843      0  3    0x10008a  pause         sh
>  74843  269959  24644      0  3    0x10008a  pause         make
>  71213   91038  31378      0  3    0x100082  piperd        gzip
>  31378  297755  24644      0  3    0x100082  piperd        pax
>  24644  139228  73204      0  3        0x82  piperd        perl
>  73204  241400   3907      0  3    0x10008a  pause         ksh
>   3907  427314  77842      0  3        0x92  select        sshd
>  49732  259852      1      0  3    0x100083  ttyin         getty
>  58444  180559      1      0  3    0x100083  ttyin         getty
>  30659  289121      1      0  3    0x100083  ttyin         getty
>   9656  108850      1      0  3    0x100083  ttyin         getty
>  24203   10241      1      0  3    0x100083  ttyin         getty
>  65063  251469      1      0  3    0x100083  ttyin         getty
>  16142  523320      1      0  3    0x100098  poll          cron
>  90805    3316      0      0  3     0x14280  nfsidl        nfsio
>  11202  322177      0      0  3     0x14280  nfsidl        nfsio
>  73491  331359      0      0  3     0x14280  nfsidl        nfsio
>  37841  249018      0      0  3     0x14280  nfsidl        nfsio
>   4136  428500      1     99  3    0x100090  poll          sndiod
>  12112  519438      1    110  3    0x100090  poll          sndiod
>  49306   97767    137     95  3    0x100092  kqread        smtpd
>  70869  189393    137    103  3    0x100092  kqread        smtpd
>  79867  131344    137     95  3    0x100092  kqread        smtpd
>  66859  375509    137     95  3    0x100092  kqread        smtpd
>  22396   48018    137     95  3    0x100092  kqread        smtpd
>  16604   93317    137     95  3    0x100092  kqread        smtpd
>    137  452544      1      0  3    0x100080  kqread        smtpd
>  77842  219221      1      0  3        0x80  select        sshd
>  88298  318549      0      0  3     0x14200  acct          acct
>   7436  211089      1      0  3    0x100080  poll          ntpd
>  15596  214430  72873     83  3    0x100092  poll          ntpd
>  72873  423080      1     83  3    0x100092  poll          ntpd
>    639  455748   5843     74  3    0x100092  bpf           pflogd
>   5843  152563      1      0  3        0x80  netio         pflogd
>  49089   65344  96782     73  3    0x100090  kqread        syslogd
>  96782  134250      1      0  3    0x100082  netio         syslogd
>  15309   57931      1     77  3    0x100090  poll          dhclient
>  92131  300080      1      0  3        0x80  poll          dhclient
>    440  434925  45137    115  3    0x100092  kqread        slaacd
>  23230  157398  45137    115  3    0x100092  kqread        slaacd
>  45137  283018      1      0  3    0x100080  kqread        slaacd
>  11751  424885      0      0  3     0x14200  pgzero        zerothread
>  94669  233757      0      0  3     0x14200  aiodoned      aiodoned
>  39044  189625      0      0  3     0x14200  syncer        update
>  11265  246421      0      0  3     0x14200  cleaner       cleaner
>  86967  386950      0      0  3     0x14200  reaper        reaper
>  48511  221734      0      0  3     0x14200  pgdaemon      pagedaemon
>  27362  255648      0      0  3     0x14200  bored         crynlk
>  58949  107875      0      0  3     0x14200  bored         crypto
>  88305  317139      0      0  3     0x14200  bored         sensors
>  62804  248570      0      0  3     0x14200  usbtsk        usbtask
>    717  253829      0      0  3     0x14200  usbatsk       usbatsk
>  48070  263826      0      0  3  0x40014200  acpi0         acpi0
>  65386  442770      0      0  3  0x40014200                idle3
>  33089  148765      0      0  3  0x40014200                idle2
>  65055  498669      0      0  3  0x40014200                idle1
>  10578  506553      0      0  3     0x14200  bored         softnet
>  70559   53653      0      0  3     0x14200  bored         systqmp
>   6788     104      0      0  3     0x14200  bored         systq
>  23919  173929      0      0  3  0x40014200  bored         softclock
>  87424  241507      0      0  3  0x40014200                idle0
>  44349  256295      0      0  3     0x14200  bored         smr
>      1  488173      0      0  3        0x82  wait          init
>      0       0     -1      0  3     0x10200  scheduler     swapper
>
> bluhm
>


Re: uvm_fault pmap_enter+0x1d6: movq __ALIGN_SIZE+0x3000(%rcx,%rsi,8),%rsi

Alexander Bluhm
On Wed, Jul 24, 2019 at 03:22:31PM -0700, Mike Larkin wrote:
> If this happens again, it might be interesting to see what pages around
> that are mapped.

They are not mapped.

uvm_fault(0xffffffff81da8270, 0x7f826ae62830, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at      pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)
ddb{0}> trace
pmap_page_remove(fffffd810b60de00) at pmap_page_remove+0x210
uvm_anfree(fffffd816a8396d0) at uvm_anfree+0x36
amap_wipeout(fffffd827f101138) at amap_wipeout+0xe5
uvm_unmap_detach(ffff800021eba398,1) at uvm_unmap_detach+0xef
uvm_map_teardown(fffffd823df509a0) at uvm_map_teardown+0x1c1
uvmspace_free(fffffd823df509a0) at uvmspace_free+0x57
uvm_exit(ffff800021eee038) at uvm_exit+0x24
reaper(ffff8000ffff89e8) at reaper+0x13b
end trace frame: 0x0, count: -8

ddb{0}> x/x 0x7f826ae62000
0x7f826ae62000:uvm_fault(0xffffffff81da8270, 0x7f826ae62000, 0, 1) -> e
 kernel: page fault trap, code=0
Faulted in DDB; continuing...
ddb{0}> x/x 0x7f826ae61000
0x7f826ae61000:uvm_fault(0xffffffff81da8270, 0x7f826ae61000, 0, 1) -> e
 kernel: page fault trap, code=0
Faulted in DDB; continuing...
ddb{0}> x/x 0x7f826ae63000
0x7f826ae63000:uvm_fault(0xffffffff81da8270, 0x7f826ae63000, 0, 1) -> e
 kernel: page fault trap, code=0
Faulted in DDB; continuing...

Note that /usr/src/regress/lib/libpthread/malloc_duel/malloc_duel.c
does allocations in small malloc(sizeof(int)) chunks.  So I think
we are dealing with random page size mappings.
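
(A minimal sketch of that kind of workload, not the actual regress
source; the thread count and iteration count here are made up.
Several threads hammer malloc/free with tiny chunks, so the pages
behind the heap get mapped and unmapped constantly.)

#include <pthread.h>
#include <stdlib.h>

#define NTHREADS	8
#define ROUNDS		100000

static void *
duel(void *arg)
{
	(void)arg;
	for (int i = 0; i < ROUNDS; i++) {
		int *p = malloc(sizeof(int));

		if (p == NULL)
			exit(1);
		*p = i;
		free(p);
	}
	return NULL;
}

int
main(void)
{
	pthread_t t[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, duel, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}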

bluhm


Re: uvm_fault pmap_enter+0x1d6: movq __ALIGN_SIZE+0x3000(%rcx,%rsi,8),%rsi

Mike Larkin-2
On Fri, Jul 26, 2019 at 12:57:23AM +0200, Alexander Bluhm wrote:

> On Wed, Jul 24, 2019 at 03:22:31PM -0700, Mike Larkin wrote:
> > If this happens again, it might be interesting to see what pages around
> > that are mapped.
>
> They are not mapped.
>
> uvm_fault(0xffffffff81da8270, 0x7f826ae62830, 0, 2) -> e
> kernel: page fault trap, code=0
> Stopped at      pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)
> ddb{0}> trace
> pmap_page_remove(fffffd810b60de00) at pmap_page_remove+0x210
> uvm_anfree(fffffd816a8396d0) at uvm_anfree+0x36
> amap_wipeout(fffffd827f101138) at amap_wipeout+0xe5
> uvm_unmap_detach(ffff800021eba398,1) at uvm_unmap_detach+0xef
> uvm_map_teardown(fffffd823df509a0) at uvm_map_teardown+0x1c1
> uvmspace_free(fffffd823df509a0) at uvmspace_free+0x57
> uvm_exit(ffff800021eee038) at uvm_exit+0x24
> reaper(ffff8000ffff89e8) at reaper+0x13b
> end trace frame: 0x0, count: -8
>
> ddb{0}> x/x 0x7f826ae62000
> 0x7f826ae62000:uvm_fault(0xffffffff81da8270, 0x7f826ae62000, 0, 1) -> e
>  kernel: page fault trap, code=0
> Faulted in DDB; continuing...
> ddb{0}> x/x 0x7f826ae61000
> 0x7f826ae61000:uvm_fault(0xffffffff81da8270, 0x7f826ae61000, 0, 1) -> e
>  kernel: page fault trap, code=0
> Faulted in DDB; continuing...
> ddb{0}> x/x 0x7f826ae63000
> 0x7f826ae63000:uvm_fault(0xffffffff81da8270, 0x7f826ae63000, 0, 1) -> e
>  kernel: page fault trap, code=0
> Faulted in DDB; continuing...
>
> Note that /usr/src/regress/lib/libpthread/malloc_duel/malloc_duel.c
> does allocations in small malloc(sizeof(int)) chunks.  So I think
> we are dealing with random page size mappings.
>
> bluhm
>

Probably time to start bisecting the tree to find out when this error
was introduced. The information above wasn't too enlightening (although I
really didn't think it would be...)

-ml