powerpc64: 64-bit-ize memmove.S

Christian Weisgerber
Well, I suggested it, so here's my attempt to switch powerpc64's
libc memmove.S over to 64 bits:
* Treat length parameter as 64 bits (size_t) instead of 32 (u_int).
* Set up the main loop to copy 64-bit doublewords instead of 32-bit
  words.
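
In C terms, the new forward-copy strategy is roughly the following
sketch (illustrative only, not the committed code; the real routine
also has the reverse path for overlapping moves):

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the forward-copy path, assuming the caller has already
 * ruled out a destructive overlap.  Like the assembly, it makes no
 * alignment check; in C the unaligned uint64_t accesses would be
 * undefined behavior, which the hand-written assembly just tolerates.
 */
static void *
copy_fwd(void *dst, const void *src, size_t len)
{
	const uint64_t *s64 = src;
	uint64_t *d64 = dst;

	/* Main loop: len >> 3 doublewords (the srdi./ldu/stdu loop). */
	for (size_t n = len >> 3; n > 0; n--)
		*d64++ = *s64++;

	/* Tail: the last len & 7 bytes (the clrldi./lbzu/stbu loop). */
	const unsigned char *s8 = (const unsigned char *)s64;
	unsigned char *d8 = (unsigned char *)d64;
	for (size_t n = len & 7; n > 0; n--)
		*d8++ = *s8++;

	return dst;
}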

This definitely needs double and triple checking.

Things I wonder about, but didn't touch:
* Why is the memcpy entry point commented out?
* END: should that be the STRONG, WEAK, or BUILTIN variant?

Index: memmove.S
===================================================================
RCS file: /cvs/src/lib/libc/arch/powerpc64/string/memmove.S,v
retrieving revision 1.1
diff -u -p -r1.1 memmove.S
--- memmove.S 25 Jun 2020 02:34:22 -0000 1.1
+++ memmove.S 26 Jun 2020 22:22:51 -0000
@@ -39,7 +39,7 @@
  * ==========================================================================
  */
 
-#include "SYS.h"
+#include "DEFS.h"
 
         .text
 
@@ -64,45 +64,45 @@ ENTRY(memmove)
  /* start of dest */
 
 fwd:
- addi %r4, %r4, -4 /* Back up src and dst pointers */
- addi %r8, %r8, -4 /* due to auto-update of 'load' */
+ addi %r4, %r4, -8 /* Back up src and dst pointers */
+ addi %r8, %r8, -8 /* due to auto-update of 'load' */
 
- srwi. %r9,%r5,2 /* How many words in total cnt */
- beq- last1 /* Handle byte by byte if < 4 */
+ srdi. %r9,%r5,3 /* Doublewords in total count */
+ beq- last1 /* Handle byte by byte if < 8 */
  /* bytes total */
- mtctr %r9 /* Count of words for loop */
- lwzu %r7, 4(%r4) /* Preload first word */
+ mtctr %r9 /* Count of dwords for loop */
+ ldu %r7, 8(%r4) /* Preload first doubleword */
 
  b g1
 
 g0: /* Main loop */
 
- lwzu %r7, 4(%r4) /* Load a new word */
- stwu %r6, 4(%r8) /* Store previous word */
+ ldu %r7, 8(%r4) /* Load a new doubleword */
+ stdu %r6, 8(%r8) /* Store previous doubleword */
 
 g1:
 
  bdz- last /* Dec cnt, and branch if just */
- /* one word to store */
- lwzu %r6, 4(%r4) /* Load another word */
- stwu %r7, 4(%r8) /* Store previous word */
+ /* one doubleword to store */
+ ldu %r6, 8(%r4) /* Load another doubleword */
+ stdu %r7, 8(%r8) /* Store previous doubleword */
  bdnz+ g0 /* Dec cnt, and loop again if */
- /* more words */
- mr %r7, %r6 /* If word count -> 0, then... */
+ /* more doublewords */
+ mr %r7, %r6 /* If dword count -> 0, then... */
 
 last:
 
- stwu %r7, 4(%r8) /* ... store last word */
+ stdu %r7, 8(%r8) /* ... store last doubleword */
 
 last1: /* Byte-by-byte copy */
 
- clrlwi. %r5,%r5,30 /* If count -> 0, then ... */
+ clrldi. %r5,%r5,61 /* If count -> 0, then ... */
  beqlr /* we're done */
 
  mtctr %r5 /* else load count for loop */
 
- lbzu %r6, 4(%r4) /* 1st byte: update addr by 4 */
- stbu %r6, 4(%r8) /* since we pre-adjusted by 4 */
+ lbzu %r6, 8(%r4) /* 1st byte: update addr by 8 */
+ stbu %r6, 8(%r8) /* since we pre-adjusted by 8 */
  bdzlr- /* in anticipation of main loop */
 
 last2:
@@ -120,40 +120,40 @@ reverse:
 
  add %r4, %r4, %r5 /* Work from end to beginning */
  add %r8, %r8, %r5 /* so add count to string ptrs */
- srwi. %r9,%r5,2 /* Words in total count */
- beq- rlast1 /* Handle byte by byte if < 4 */
+ srdi. %r9,%r5,3 /* Doublewords in total count */
+ beq- rlast1 /* Handle byte by byte if < 8 */
  /* bytes total */
 
- mtctr %r9 /* Count of words for loop */
+ mtctr %r9 /* Count of dwords for loop */
 
- lwzu %r7, -4(%r4) /* Preload first word */
+ ldu %r7, -8(%r4) /* Preload first doubleword */
  b rg1
 
 rg0: /* Main loop */
 
- lwzu %r7, -4(%r4) /* Load a new word */
- stwu %r6, -4(%r8) /* Store previous word */
+ ldu %r7, -8(%r4) /* Load a new doubleword */
+ stdu %r6, -8(%r8) /* Store previous doubleword */
 
 rg1:
 
  bdz- rlast /* Dec cnt, and branch if just */
- /* one word to store */
+ /* one doubleword to store */
 
- lwzu %r6, -4(%r4) /* Load another word */
- stwu %r7, -4(%r8) /* Store previous word */
+ ldu %r6, -8(%r4) /* Load another doubleword */
+ stdu %r7, -8(%r8) /* Store previous doubleword */
 
  bdnz+ rg0 /* Dec cnt, and loop again if */
- /* more words */
+ /* more doublewords */
 
- mr %r7, %r6 /* If word count -> 0, then... */
+ mr %r7, %r6 /* If dword count -> 0, then... */
 
 rlast:
 
- stwu %r7, -4(%r8) /* ... store last word */
+ stdu %r7, -8(%r8) /* ... store last doubleword */
 
 rlast1: /* Byte-by-byte copy */
 
- clrlwi. %r5,%r5,30 /* If count -> 0, then... */
+ clrldi. %r5,%r5,61 /* If count -> 0, then... */
  beqlr /* ... we're done */
 
  mtctr %r5 /* else load count for loop */
--
Christian "naddy" Weisgerber                          [hidden email]

Re: powerpc64: 64-bit-ize memmove.S

Christian Weisgerber
Christian Weisgerber:

> Well, I suggested it, so here's my attempt to switch powerpc64's
> libc memmove.S over to 64 bits:

Actually, on second thought:

That function simply copies as many (double)words plus a tail of
bytes as the length argument specifies.  Neither the source nor the
destination is checked for alignment, so this will happily run a loop
of unaligned accesses, which doesn't sound very optimal.

I'm also intrigued by this aside in the PowerPC ISA documentation:
| Moreover, Load with Update instructions may take longer to execute
| in some implementations than the corresponding pair of a non-update
| Load instruction and an Add instruction.
What does clang generate?

I think we should consider dropping this "optimized" memmove.S on
both powerpc and powerpc64.

--
Christian "naddy" Weisgerber                          [hidden email]

Re: powerpc64: 64-bit-ize memmove.S

George Koehler
On Sat, 27 Jun 2020 01:27:14 +0200
Christian Weisgerber <[hidden email]> wrote:

> I'm also intrigued by this aside in the PowerPC ISA documentation:
> | Moreover, Load with Update instructions may take longer to execute
> | in some implementations than the corresponding pair of a non-update
> | Load instruction and an Add instruction.
> What does clang generate?

clang likes load/store with update instructions.  For example, the
powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:

        while (n-- > 0)
                *t++ = *f++;

clang uses lbzu and stbu:

memcpy: cmpldi r5,0x0
memcpy+0x4:     beqlr
memcpy+0x8:     addi r4,r4,-1
memcpy+0xc:     addi r6,r3,-1
memcpy+0x10:    mtspr ctr,r5
memcpy+0x14:    lbzu r5,1(r4)
memcpy+0x18:    stbu r5,1(r6)
memcpy+0x1c:    bdnz 0x26cd0d4 (memcpy+0x14)
memcpy+0x20:    blr

> I think we should consider dropping this "optimized" memmove.S on
> both powerpc and powerpc64.

I might want to benchmark memmove.S against memmove.c to check if
those unaligned accesses are too slow.  First I would have to write
a benchmark.

Re: powerpc64: 64-bit-ize memmove.S

Theo de Raadt
George Koehler <[hidden email]> wrote:

> On Sat, 27 Jun 2020 01:27:14 +0200
> Christian Weisgerber <[hidden email]> wrote:
>
> > I'm also intrigued by this aside in the PowerPC ISA documentation:
> > | Moreover, Load with Update instructions may take longer to execute
> > | in some implementations than the corresponding pair of a non-update
> > | Load instruction and an Add instruction.
> > What does clang generate?
>
> clang likes load/store with update instructions.  For example, the
> powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:
>
> while (n-- > 0)
> *t++ = *f++;

Keep in mind that in userland we use memcpy.c, which does backwards detection.
It's probably slower.

The original idea was to use the C version with backwards detection until
it stopped finding bugs, but... it keeps finding bugs........
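
(For anyone unfamiliar with the term: "backwards detection" means
flagging overlapping buffers passed to memcpy(), for which the
behavior is undefined.  A minimal sketch of such a check, assuming a
plain abort-on-overlap policy; the actual libc code reports the
misuse before aborting and may differ in detail:)

#include <stdlib.h>

/*
 * Hedged sketch: detect overlapping memcpy() arguments, where a
 * plain forward copy could corrupt the data.  Simplified; the real
 * libc check also logs a diagnostic before aborting.
 */
static void
check_overlap(const void *dst, const void *src, size_t len)
{
	const char *d = dst, *s = src;

	if ((d <= s && d + len > s) || (s <= d && s + len > d))
		abort();	/* overlapping memcpy(): use memmove() */
}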

Re: powerpc64: 64-bit-ize memmove.S

George Koehler
On Sat, 27 Jun 2020 01:27:14 +0200
Christian Weisgerber <[hidden email]> wrote:

> That function simply copies as many (double)words plus a tail of
> bytes as the length argument specifies.  Neither source nor destination
> are checked for alignment, so this will happily run a loop of
> unaligned accesses, which doesn't sound very optimal.

I made a benchmark and concluded that unaligned word copies are slower
than aligned word copies, but faster than byte copies.  In most cases,
memmove.S is faster than memmove.c, but if aligned word copies between
unaligned buffers are possible, then memmove.c is faster.
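
To spell out "aligned word copies between unaligned buffers": when
(src & 3) == (dst & 3), a C implementation can peel off leading bytes
until both pointers are word-aligned and then copy aligned words.  A
rough sketch of that peel (illustrative, not the actual memmove.c):

#include <stddef.h>
#include <stdint.h>

/*
 * Matched-misalignment case: when (src & 3) == (dst & 3), copy head
 * bytes until both pointers are word-aligned, run an aligned word
 * loop, then finish the tail byte by byte.
 */
static void
copy_matched(char *d, const char *s, size_t len)
{
	if (((uintptr_t)d & 3) == ((uintptr_t)s & 3)) {
		size_t head = -(uintptr_t)s & 3;  /* bytes to alignment */

		if (head > len)
			head = len;
		for (len -= head; head > 0; head--)
			*d++ = *s++;
		for (; len >= 4; len -= 4, d += 4, s += 4)
			*(uint32_t *)d = *(const uint32_t *)s;  /* aligned */
	}
	while (len-- > 0)	/* tail, or whole copy if mismatched */
		*d++ = *s++;
}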

The benchmark was on a 32-bit macppc G3 with
cpu0 at mainbus0: 750 (Revision 0x202): 400 MHz: 512KB backside cache

The benchmark has 4 implementations of memmove,
  stbu  =>  byte copy with lbzu,stbu loop
  stbx  =>  byte copy with lbzx,stbx,addi loop
  C     =>  aligned word copy or byte copy (libc/string/memmove.c)
  asm   =>  unaligned word copy (libc/arch/powerpc/string/memmove.S)

It shows time measured by mftb (move from timebase).
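
On 32-bit PowerPC the 64-bit timebase is read as two halves, with a
retry in case the upper half ticks over between the reads; a reader
along these lines is the usual way to do it (sketch only, the
benchmark's own harness may differ):

#include <stdint.h>

/*
 * Read the 64-bit timebase on 32-bit PowerPC.  mftbu/mftb fetch the
 * upper and lower halves; retry if the upper half changed between
 * the two reads.
 */
static uint64_t
read_timebase(void)
{
	uint32_t hi, lo, hi2;

	do {
		__asm__ __volatile__("mftbu %0" : "=r"(hi));
		__asm__ __volatile__("mftb  %0" : "=r"(lo));
		__asm__ __volatile__("mftbu %0" : "=r"(hi2));
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}

Each number below is presumably the difference between two such
reads, in timebase ticks.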

1st bench: move 10000 bytes up by 4 bytes, then down by 4 bytes, in
aligned buffer (offset 0).  asm wins:

$ ./bench 10000 4 0
        stbu    stbx    C       asm
        2639    2814    792     633
        2502    2814    784     628
        2501    2814    783     627
        2501    2814    784     626

2nd bench: unaligned buffer (offset 1), but (src & 3) == (dst & 3), so
C does aligned word copies, while asm does misaligned.  C wins:

$ ./bench 10000 4 1
        stbu    stbx    C       asm
        2638    3006    795     961
        2502    2814    786     938
        2501    2814    786     939
        2501    2813    785     939

3rd bench: move up then down by 5 bytes; (src & 3) != (dst & 3), so the
word copies can't be aligned.  C does byte copies.  asm wins:

$ ./bench 10000 5 0
        stbu    stbx    C       asm
        2675    2815    2514    809
        2501    2813    2504    782
        2502    2815    2504    782
        2501    2814    2503    782

I think that memmove.S is probably better than memmove.c on G3.
I haven't run the bench on POWER9.