ld.so speedup (part 2)

ld.so speedup (part 2)

Nathanael Rensen-3
The diff below speeds up ld.so library initialisation where the dependency
tree is broad and deep, such as samba's smbd, which links over 100 libraries.

See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2

See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
that speeds up library loading.

The timings below are for /usr/local/sbin/smbd --version:

Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system

Note that these timings are for a build of a recent samba master tree
(linked with kerberos) which is probably slower than the OpenBSD port.

Nathanael


Index: libexec/ld.so/loader.c
===================================================================
RCS file: /cvs/src/libexec/ld.so/loader.c,v
retrieving revision 1.177
diff -u -p -p -u -r1.177 loader.c
--- libexec/ld.so/loader.c 3 Dec 2018 05:29:56 -0000 1.177
+++ libexec/ld.so/loader.c 27 Apr 2019 13:24:02 -0000
@@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
 {
  struct dep_node *n;
 
- object->status |= STAT_VISITED;
+ int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
+
+ object->status |= visited_flag;
 
  TAILQ_FOREACH(n, &object->child_list, next_sib) {
- if (n->data->status & STAT_VISITED)
+ if (n->data->status & visited_flag)
  continue;
  _dl_call_init_recurse(n->data, initfirst);
  }
-
- object->status &= ~STAT_VISITED;
 
  if (object->status & STAT_INIT_DONE)
  return;
Index: libexec/ld.so/resolve.h
===================================================================
RCS file: /cvs/src/libexec/ld.so/resolve.h,v
retrieving revision 1.90
diff -u -p -p -u -r1.90 resolve.h
--- libexec/ld.so/resolve.h 21 Apr 2019 04:11:42 -0000 1.90
+++ libexec/ld.so/resolve.h 27 Apr 2019 13:24:02 -0000
@@ -125,8 +125,9 @@ struct elf_object {
 #define STAT_FINI_READY 0x10
 #define STAT_UNLOADED 0x20
 #define STAT_NODELETE 0x40
-#define STAT_VISITED 0x80
+#define STAT_VISITED_1 0x80
 #define STAT_GNU_HASH 0x100
+#define STAT_VISITED_2 0x200
 
  Elf_Phdr *phdrp;
  int phdrc;
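
To see why the old scheme hurts so much here, a minimal standalone
sketch (hypothetical code, not the ld.so sources): with a visited flag
that is cleared again on the way out, a shared subtree is re-walked
once for every path that reaches it, while a flag that persists for
the whole pass enters every node at most once.

/*
 * Hypothetical sketch, not ld.so code.  Build a "ladder" DAG two
 * nodes wide and DEPTH levels deep, where every node points to both
 * nodes on the next level, and count how often the walk enters a
 * node under each scheme.
 */
#include <stdio.h>

#define DEPTH	25
#define VISITED	0x1

struct node {
	struct node *child[2];
	int status;
};

static struct node level[DEPTH][2];
static unsigned long steps;

/*
 * Old scheme: the flag only guards the current path, so it merely
 * breaks cycles; shared subtrees are walked again and again.
 */
static void
walk_transient(struct node *n)
{
	int i;

	steps++;
	n->status |= VISITED;
	for (i = 0; i < 2; i++)
		if (n->child[i] && !(n->child[i]->status & VISITED))
			walk_transient(n->child[i]);
	n->status &= ~VISITED;
}

/*
 * New scheme: the flag persists for the whole pass, so every node
 * is entered at most once.
 */
static void
walk_persistent(struct node *n)
{
	int i;

	steps++;
	n->status |= VISITED;
	for (i = 0; i < 2; i++)
		if (n->child[i] && !(n->child[i]->status & VISITED))
			walk_persistent(n->child[i]);
}

int
main(void)
{
	int d, i, j;

	for (d = 0; d < DEPTH - 1; d++)
		for (i = 0; i < 2; i++)
			for (j = 0; j < 2; j++)
				level[d][i].child[j] = &level[d + 1][j];

	walk_transient(&level[0][0]);
	printf("transient flag:  %lu nodes entered\n", steps);

	for (d = 0; d < DEPTH; d++)
		for (i = 0; i < 2; i++)
			level[d][i].status = 0;
	steps = 0;
	walk_persistent(&level[0][0]);
	printf("persistent flag: %lu nodes entered\n", steps);
	return 0;
}

With DEPTH set to 25 the transient walk enters nodes 2^25 - 1 times,
while the persistent walk enters each of the 49 reachable nodes once;
a tree of 100+ libraries triggers the same shape of blow-up.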


Re: ld.so speedup (part 2)

Antoine Jacoutot-7
On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:

> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd, which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
>
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.
>
> Nathanael

Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
mail/evolution always took 10+ seconds to start for me and now it's almost
instant...
Crazy... But this sounds too good to be true ;-)
What are the potential regressions?


> [...]

--
Antoine


Re: ld.so speedup (part 2)

Otto Moerbeek
On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:

> On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the dependency
> > tree is broad and deep, such as samba's smbd, which links over 100 libraries.
> >
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >
> > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > that speeds up library loading.
> >
> > The timings below are for /usr/local/sbin/smbd --version:
> >
> > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> >
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
> >
> > Nathanael
>
> Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> mail/evolution always took 10+ seconds to start for me and now it's almost
> instant...
> Crazy... But this sounds too good to be true ;-)
> What are the potential regressions?

Speaking of regression tests, we have quite an extensive collection.
The tests in libexec/ld.so should all pass.

        -Otto




Re: ld.so speedup (part 2)

Otto Moerbeek
On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:

> On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
>
> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > The diff below speeds up ld.so library initialisation where the dependency
> > > tree is broad and deep, such as samba's smbd, which links over 100 libraries.
> > >
> > > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > >
> > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > > that speeds up library loading.
> > >
> > > The timings below are for /usr/local/sbin/smbd --version:
> > >
> > > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> > > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> > > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> > > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> > >
> > > Note that these timings are for a build of a recent samba master tree
> > > (linked with kerberos) which is probably slower than the OpenBSD port.
> > >
> > > Nathanael
> >
> > Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> > mail/evolution always took 10+ seconds to start for me and now it's almost
> > instant...
> > Crazy... But this sounds too good to be true ;-)
> > What are the potential regressions?
>
> Speaking of regression tests, we have quite an extensive collection.
> The tests in libexec/ld.so should all pass.

And they do on amd64.



Re: ld.so speedup (part 2)

Ian Mcwilliam-6


On 28/4/19, 12:56 am, "[hidden email] on behalf of Otto Moerbeek"
<[hidden email] on behalf of [hidden email]> wrote:

>On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
>
>> [...]
>>
>> Speaking of regression tests, we have quite an extensive collection.
>> The tests in libexec/ld.so should all pass.
>
>And they do on amd64.
>
>> -Otto

The results look good but it still doesn't resolve the root cause of the
issue.
Using both patches on old hardware helps speed up the process, but I still
see the rc script time out before smbd is loaded, causing the rest of the
samba processes to fail to load. This did not happen under 6.4 (amd64), so
the linker / compiler update is still potentially where the problem lies.

Starting smbd with both patches
 0m46.55s real     0m46.47s user     0m00.07s system


Would still be good to see this work committed though.

Ian McWilliam

OpenBSD 6.5 (GENERIC.MP) #0: Mon Apr 15 16:28:00 AEST 2019
   
[hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 6424494080 (6126MB)
avail mem = 6220148736 (5931MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xf0100 (55 entries)
bios0: vendor Award Software International, Inc. version "F10d" date
07/22/2010
bios0: Gigabyte Technology Co., Ltd. GA-MA790X-DS4
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S4 S5
acpi0: tables DSDT FACP SSDT HPET MCFG APIC
acpi0: wakeup devices USB0(S3) USB1(S3) USB2(S3) USB3(S3) USB4(S3)
USB5(S3) SBAZ(S4) P2P_(S5) PCE2(S4) PCE3(S4) PCE4(S4) PCE5(S4) PCE6(S4)
PCE7(S4) PCE8(S4) PCE9(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimcfg0 at acpi0
acpimcfg0: addr 0xe0000000, bus 0-255
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Phenom(tm) 9750 Quad-Core Processor, 2411.28 MHz, 10-02-03
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu0: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu0: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu0: AMD erratum 721 detected and fixed
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 200MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu1: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu1: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu1: AMD erratum 721 detected and fixed
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu2: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu2: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu2: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu2: AMD erratum 721 detected and fixed
cpu2: smt 0, core 3, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu3:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu3: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu3: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu3: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu3: AMD erratum 721 detected and fixed
cpu3: smt 0, core 2, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec00000, version 21, 24 pins, remapped
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 3 (P2P_)
acpiprt2 at acpi0: bus 1 (PCE2)
acpiprt3 at acpi0: bus -1 (PCE3)
acpiprt4 at acpi0: bus -1 (PCE4)
acpiprt5 at acpi0: bus -1 (PCE5)
acpiprt6 at acpi0: bus -1 (PCE6)
acpiprt7 at acpi0: bus -1 (PCE7)
acpiprt8 at acpi0: bus -1 (PCE8)
acpiprt9 at acpi0: bus -1 (PCE9)
acpiprt10 at acpi0: bus 2 (PCEA)
acpiprt11 at acpi0: bus -1 (PCEB)
acpiprt12 at acpi0: bus -1 (PCEC)
acpicpu0 at acpi0: C1(@1 halt!), PSS
acpicpu1 at acpi0: C1(@1 halt!), PSS
acpicpu2 at acpi0: C1(@1 halt!), PSS
acpicpu3 at acpi0: C1(@1 halt!), PSS
acpibtn0 at acpi0: PWRB
acpipci0 at acpi0 PCI0: _OSC failed
acpicmos0 at acpi0
cpu0: 2411 MHz: speeds: 2400 1200 MHz
pci0 at mainbus0 bus 0
0:0:0: mem address conflict 0xe0000000/0x20000000
pchb0 at pci0 dev 0 function 0 "ATI RD780 HT-PCIE" rev 0x00
ppb0 at pci0 dev 2 function 0 "ATI RD790 PCIE" rev 0x00
pci1 at ppb0 bus 1
vga1 at pci1 dev 0 function 0 "NVIDIA GeForce 8400 GS" rev 0xa1
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
ppb1 at pci0 dev 10 function 0 "ATI RD790 PCIE" rev 0x00
pci2 at ppb1 bus 2
re0 at pci2 dev 0 function 0 "Realtek 8168" rev 0x01: RTL8168 2 (0x3800),
msi, address 00:1f:d0:a4:c0:03
rgephy0 at re0 phy 7: RTL8169S/8110S/8211 PHY, rev. 2
ahci0 at pci0 dev 18 function 0 "ATI SB600 SATA" rev 0x00: apic 2 int 22,
AHCI 1.1
ahci0: port 0: 3.0Gb/s
ahci0: port 1: 3.0Gb/s
ahci0: port 2: 1.5Gb/s
scsibus1 at ahci0: 32 targets
sd0 at scsibus1 targ 0 lun 0: <ATA, Hitachi HDS72105, JP2O> SCSI3 0/direct
fixed naa.5000cca374f45c20
sd0: 476935MB, 512 bytes/sector, 976764911 sectors
sd1 at scsibus1 targ 1 lun 0: <ATA, SAMSUNG HD501LJ, CR10> SCSI3 0/direct
fixed naa.50000f00db208553
sd1: 476938MB, 512 bytes/sector, 976771055 sectors
cd0 at scsibus1 targ 2 lun 0: <HL-DT-ST, DVDRAM GH22NS50, TN03> ATAPI
5/cdrom removable
ohci0 at pci0 dev 19 function 0 "ATI SB600 USB" rev 0x00: apic 2 int 16,
version 1.0, legacy support
ohci1 at pci0 dev 19 function 1 "ATI SB600 USB" rev 0x00: apic 2 int 17,
version 1.0, legacy support
ohci2 at pci0 dev 19 function 2 "ATI SB600 USB" rev 0x00: apic 2 int 18,
version 1.0, legacy support
ohci3 at pci0 dev 19 function 3 "ATI SB600 USB" rev 0x00: apic 2 int 17,
version 1.0, legacy support
ohci4 at pci0 dev 19 function 4 "ATI SB600 USB" rev 0x00: apic 2 int 18,
version 1.0, legacy support
ehci0 at pci0 dev 19 function 5 "ATI SB600 USB2" rev 0x00: apic 2 int 19
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 configuration 1 interface 0 "ATI EHCI root hub" rev
2.00/1.00 addr 1
piixpm0 at pci0 dev 20 function 0 "ATI SBx00 SMBus" rev 0x14: SMI
iic0 at piixpm0
spdmem0 at iic0 addr 0x50: 2GB DDR2 SDRAM non-parity PC2-6400CL5
spdmem1 at iic0 addr 0x51: 2GB DDR2 SDRAM non-parity PC2-6400CL5
spdmem2 at iic0 addr 0x52: 1GB DDR2 SDRAM non-parity PC2-6400CL6
spdmem3 at iic0 addr 0x53: 1GB DDR2 SDRAM non-parity PC2-6400CL6
pciide0 at pci0 dev 20 function 1 "ATI SB600 IDE" rev 0x00: DMA, channel 0
configured to compatibility, channel 1 configured to compatibility
azalia0 at pci0 dev 20 function 2 "ATI SBx00 HD Audio" rev 0x00: apic 2
int 16
azalia0: codecs: Realtek ALC885
audio0 at azalia0
pcib0 at pci0 dev 20 function 3 "ATI SB600 ISA" rev 0x00
ppb2 at pci0 dev 20 function 4 "ATI SB600 PCI" rev 0x00
pci3 at ppb2 bus 3
"TI TSB43AB23 FireWire" rev 0x00 at pci3 dev 14 function 0 not configured
pchb1 at pci0 dev 24 function 0 "AMD AMD64 10h HyperTransport" rev 0x00
pchb2 at pci0 dev 24 function 1 "AMD AMD64 10h Address Map" rev 0x00
pchb3 at pci0 dev 24 function 2 "AMD AMD64 10h DRAM Cfg" rev 0x00
km0 at pci0 dev 24 function 3 "AMD AMD64 10h Misc Cfg" rev 0x00
pchb4 at pci0 dev 24 function 4 "AMD AMD64 10h Link Cfg" rev 0x00
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 configuration 1 interface 0 "ATI OHCI root hub" rev
1.00/1.00 addr 1
usb2 at ohci1: USB revision 1.0
uhub2 at usb2 configuration 1 interface 0 "ATI OHCI root hub" rev
1.00/1.00 addr 1
usb3 at ohci2: USB revision 1.0
uhub3 at usb3 configuration 1 interface 0 "ATI OHCI root hub" rev
1.00/1.00 addr 1
usb4 at ohci3: USB revision 1.0
uhub4 at usb4 configuration 1 interface 0 "ATI OHCI root hub" rev
1.00/1.00 addr 1
usb5 at ohci4: USB revision 1.0
uhub5 at usb5 configuration 1 interface 0 "ATI OHCI root hub" rev
1.00/1.00 addr 1
isa0 at pcib0
isadma0 at isa0
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5 irq 1 irq 12
pckbd0 at pckbc0 (kbd slot)
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
it0 at isa0 port 0x2e/2: IT8718F rev 5, EC port 0x228
vmm0 at mainbus0: SVM/RVI
vscsi0 at root
scsibus2 at vscsi0: 256 targets
softraid0 at root
scsibus3 at softraid0: 256 targets
root on sd1a (f0dd5c5b4051779c.a) swap on sd1b dump on sd1b







Re: ld.so speedup (part 2)

Otto Moerbeek
On Sun, Apr 28, 2019 at 01:57:46AM +0000, Ian McWilliam wrote:

> [...]
>
> The results look good but it still doesn't resolve the root cause of the
> issue.


Speedup of ld.so is nice in any circumstance, and samba issues should be
viewed separately.  In other words, please don't hijack the thread.

        -Otto



Re: ld.so speedup (part 2)

Stuart Henderson
In reply to this post by Ian Mcwilliam-6
> >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> >> > > The diff below speeds up ld.so library initialisation where the
> >> > > dependency tree is broad and deep, such as samba's smbd which
> >> > > links over 100 libraries.

Past experience with ld.so changes suggests it would be good to have
test reports from multiple arches, *especially* hppa.

On 2019/04/28 01:57, Ian McWilliam wrote:
> Using both patches on old hardware helps speed up the process, but I still
> see the rc script time out before smbd is loaded, causing the rest of the
> samba processes to fail to load. This did not happen under 6.4 (amd64), so
> the linker / compiler update is still potentially where the problem lies.
>
> Starting smbd with both patches
>  0m46.55s real     0m46.47s user     0m00.07s system

This doesn't match my experience:

$ time sudo rcctl start samba
smbd(ok)
nmbd(ok)
    0m00.81s real     0m00.31s user     0m00.31s system


Re: ld.so speedup (part 2)

Matthieu Herrb-3
On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > [...]
>
> Past experience with ld.so changes suggests it would be good to have
> test reports from multiple arches, *especially* hppa.

The regress tests seem to pass here on hppa.

--
Matthieu Herrb


Re: ld.so speedup (part 2)

Robert Nagy
On 28/04/19 12:01 +0200, Matthieu Herrb wrote:

> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > [...]
> >
> > Past experience with ld.so changes suggests it would be good to have
> > test reports from multiple arches, *especially* hppa.
>
> The regress tests seem to pass here on hppa.
>
> --
> Matthieu Herrb
>

This also fixes the component FLAVOR of chromium, which uses a gazillion
shared objects. Awesome work!


Re: ld.so speedup (part 2)

Charlene Wendling
On Sun, 28 Apr 2019 13:04:22 +0200
Robert Nagy <[hidden email]> wrote:

> On 28/04/19 12:01 +0200, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > [...]
> > >
> > > Past experience with ld.so changes suggests it would be good to
> > > have test reports from multiple arches, *especially* hppa.
> >
> > The regress tests seem to pass here on hppa.

It seems good on macppc as well, here is the log [0]. Startup time for
clang has been reduced from 3.2s to 0.11s with the two diffs applied!

> > --
> > Matthieu Herrb
> >
>
> This also fixes the component FLAVOR of chromium which uses a
> gazillion shared objects. Awesome work!
>

Charlène.

[0] http://0x0.st/zbUa.txt


Re: ld.so speedup (part 2)

Brian Callahan-5
In reply to this post by Matthieu Herrb-3


On 4/28/19 6:01 AM, Matthieu Herrb wrote:

> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
>>> [...]
>> Past experience with ld.so changes suggests it would be good to have
>> test reports from multiple arches, *especially* hppa.
> The regress tests seem to pass here on hppa.
>

Pass here too on hppa and macppc and armv7.

~Brian


Re: ld.so speedup (part 2)

Stuart Henderson
On 2019/04/28 09:45, Brian Callahan wrote:

>
>
> On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > [...]
> > > Past experience with ld.so changes suggests it would be good to have
> > > test reports from multiple arches, *especially* hppa.
> > The regress tests seem to pass here on hppa.
> >
>
> Pass here too on hppa and macppc and armv7.
>
> ~Brian
>

Regress is clean for me on i386 and I am using it on my current ports bulk
build there (halfway done, no issues seen yet).

Regress is also clean on arm64.


Re: ld.so speedup (part 2)

Chris Cappuccio
In reply to this post by Stuart Henderson
Stuart Henderson [[hidden email]] wrote:
>
> This doesn't match my experience:
>
> $ time sudo rcctl start samba
> smbd(ok)
> nmbd(ok)
>     0m00.81s real     0m00.31s user     0m00.31s system

He was linking Samba with Kerberos libs too.


Re: ld.so speedup (part 2)

Stuart Henderson
On 2019/04/29 09:47, Chris Cappuccio wrote:

> Stuart Henderson [[hidden email]] wrote:
> >
> > This doesn't match my experience:
> >
> > $ time sudo rcctl start samba
> > smbd(ok)
> > nmbd(ok)
> >     0m00.81s real     0m00.31s user     0m00.31s system
>
> He was linking Samba with Kerberos libs too.
>

OP was but I don't think Ian was.

That is with the ld.so diffs of course. Startup takes getting on for a
minute for me without them.


Re: ld.so speedup (part 2)

Jeremie Courreges-Anglas-2
In reply to this post by Stuart Henderson
On Mon, Apr 29 2019, Stuart Henderson <[hidden email]> wrote:

> On 2019/04/28 09:45, Brian Callahan wrote:
>>
>>
>> On 4/28/19 6:01 AM, Matthieu Herrb wrote:
>> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
>> > > > [...]
>> > > Past experience with ld.so changes suggests it would be good to have
>> > > test reports from multiple arches, *especially* hppa.
>> > The regress tests seem to pass here on hppa.
>> >
>>
>> Pass here too on hppa and macppc and armv7.
>>
>> ~Brian
>>
>
> Regress is clean for me on i386 and I am using it on my current ports bulk
> build there (halfway done, no issues seen yet).

Using this in current ports bulk on sparc64, no fallout.

> Regress is also clean on arm64.

and on sparc64.

--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE


Re: ld.so speedup (part 2)

Martin Pieuchot
In reply to this post by Nathanael Rensen-3
On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:

> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd, which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
>
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.

Nice numbers.  Could you explain in words what your diff is doing?  Why
does splitting the flag help?  Is it because some ctors/initarray are
being initialized multiple times currently?  Or is it just to prevent
some traversal?  In that case does that mean the `STAT_VISITED' flag
is removed too early?

> [...]


Re: ld.so speedup (part 2)

Nathanael Rensen-3
In reply to this post by Nathanael Rensen-3
On Sun, 5 May 2019 at 06:26, Martin Pieuchot <[hidden email]> wrote:

>
> On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:
> > [...]
>
> Nice numbers.  Could you explain in words what your diff is doing?  Why
> does splitting the flag help?  Is it because some ctors/initarray are
> being initialized multiple times currently?

No, the STAT_INIT_DONE flag prevents that.

> Or is it just to prevent some traversal?

Yes.

> In that case does that mean the `STAT_VISITED' flag is removed too
> early?

Yes, STAT_VISITED is removed too early. The visited flag is set on a node
while traversing the child nodes of that node and then removed. It serves
to protect against circular dependencies, but does not prevent repeatedly
traversing through a node that appears on separate branches.

The entire tree must be traversed twice - first to initialise the
DF_1_INITFIRST libraries, and secondly to initialise the others. This
is presumably why this diff contributes roughly twice as much speedup
as the part 1 diff. To be effective in avoiding repeated traversals
the visited flag must persist throughout an entire tree traversal but
it must either be cleared between first and second traversals or a
different flag used for the second traversal.
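
For reference, the call site does roughly this (paraphrased from
loader.c rather than quoted verbatim):

void
_dl_call_init(elf_object_t *object)
{
	/* Pass 1: ctors of DF_1_INITFIRST objects run first. */
	_dl_call_init_recurse(object, 1);
	/* Pass 2: everything else. */
	_dl_call_init_recurse(object, 0);
}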

My approach was to add a second visited flag and make them both
persistent. My rationale for why I believe the flags may be
persisted is as follows. dlopen() calls _dl_call_init() with the
newly loaded object and neither the newly loaded object nor any
newly loaded children of that object will have either visited flag
set. Already loaded children will have those flags set, but they
won't have gained any new children as a result of the dlopen().
If this reasoning is wrong then the diff is wrong and could lead to
uninitialised libraries (and an ld.so regress test should probably
be created to catch that situation).
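
Such a test might look roughly like the sketch below (hypothetical
names throughout, no such test exists yet): build a helper library
whose constructor sets an exported variable, dlopen() it, and fail if
the constructor never ran.

/*
 * Hypothetical regress sketch, not an existing test.  Assumes a
 * libfoo.so built from something like:
 *
 *	int foo_initialised;
 *	__attribute__((constructor)) static void
 *	init(void) { foo_initialised = 1; }
 */
#include <dlfcn.h>
#include <err.h>

int
main(void)
{
	void *h;
	int *flag;

	h = dlopen("libfoo.so", RTLD_NOW);
	if (h == NULL)
		errx(1, "dlopen: %s", dlerror());
	flag = dlsym(h, "foo_initialised");
	if (flag == NULL)
		errx(1, "dlsym: %s", dlerror());
	if (*flag != 1)
		errx(1, "constructor did not run");
	return 0;
}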

It occurs to me as I'm writing this that perhaps it's possible to
avoid a tree traversal entirely by walking the linearised grpsym_list
in reverse and relying only on the STAT_INIT_DONE flag.
        /*
         * grpsym_list is an ordered list of all child libs of the
         * _dl_loading_object with no dups. The order is equivalent
         * to a breadth-first traversal of the child list without dups.
         */
I don't think it is a true breadth-first traversal, not in the way I
understand breadth-first, but it does ensure that parent nodes appear
before child nodes. So in reverse, child nodes will appear before
parent nodes. While this is not the same as a depth-first traversal
it may be OK. There may be some specific requirements of DF_1_INITFIRST
that need to be taken into account.
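
Very roughly, that idea might look like the sketch below (untested,
and the TAILQ head and member names here are guesses, not the actual
resolve.h declarations):

static void
_dl_call_init_list(elf_object_t *object, int initfirst)
{
	struct dep_node *n;

	/*
	 * Children follow their parents in grpsym_list, so walking it
	 * in reverse runs ctors bottom-up with no recursion at all.
	 */
	TAILQ_FOREACH_REVERSE(n, &object->grpsym_list, dep_head, next_sib) {
		elf_object_t *obj = n->data;

		if (obj->status & STAT_INIT_DONE)
			continue;
		if (initfirst && !(obj->obj_flags & DF_1_INITFIRST))
			continue;
		/* run obj's DT_INIT / DT_INIT_ARRAY entries here */
		obj->status |= STAT_INIT_DONE;
	}
}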

Nathanael



Re: ld.so speedup (part 2)

Jeremie Courreges-Anglas-2
In reply to this post by Nathanael Rensen-3
On Sat, Apr 27 2019, Nathanael Rensen <[hidden email]> wrote:

> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd, which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system

First off, thanks a lot for solving this long outstanding issue.  The
use of ld --as-needed hides the problem but it looks like ld.lld isn't
as good as ld.bfd at eliminating extra inter-library references.
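
For anyone unfamiliar with the flag, with a link line along these
lines (a generic illustration, not taken from the samba port):

	cc -o smbd main.o ... -Wl,--as-needed -lkrb5 -lgssapi

the linker only records a DT_NEEDED entry for a library that actually
satisfies an undefined reference, which prunes the dependency tree
ld.so has to traverse in the first place.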

As I told mpi@ earlier today, I think your changes are correct as is,
and are good to be committed.  So this counts as an ok jca@.  But I'd
expect other developers to chime in soon, maybe they'll spot something
that I didn't.

--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE


Re: ld.so speedup (part 2)

Philip Guenther-2
On Tue, 7 May 2019, Jeremie Courreges-Anglas wrote:
> On Sat, Apr 27 2019, Nathanael Rensen <[hidden email]>
> wrote:
> > The diff below speeds up ld.so library initialisation where the
> > dependency tree is broad and deep, such as samba's smbd which links
> > over 100 libraries.
...
> As I told mpi@ earlier today, I think your changes are correct as is,
> and are good to be committed.  So this counts as an ok jca@.  But I'd
> expect other developers to chime in soon, maybe they'll spot something
> that I didn't.

drahn@ and I pulled on our ld.so waders and agreed it's good, so I've
committed it with some tweaking to the #defines to make them
self-explanatory and have contiguous bit-assignments.

Thank you for identifying this badly inefficient algorithm and spotting
how easy it was to fix!


Philip Guenther