VIA C7 Dual RNG


Henric Jungheim-2
I put together a little diff to get the 2nd generator
enabled on the VIA C7.  This should roughly double the
number of bits per second it generates without changing the
interface (both generators feed into the same queue).

While mucking with it, I also changed the entropy collection
to just grab what was available at the time rather than
blocking.  (What's the point?  Why hang the kernel if the
RNG barfs--or is disabled?) The entropy is now fed to the
system a block at a time, so there is a chance for some new
bits to trickle in to the CPU's RNG queue while the previous
ones are being enqueued.  The blocks are 4-byte aligned and
the xstore-rng happens to always do 4-byte writes
(regardless of the number of bytes that contain data from
the RNG), so there is no need for padding at the end.
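
For readers who want to see the "grab what is there, never block" idea in isolation, here is a rough sketch in plain C.  It is not the kernel code: fetch_fn, sink_fn, drain_available, fake_fetch and count_sink are all hypothetical names, and the fake source stands in for the xstore-rng instruction so the logic can run on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fetch callback: fills block[0..1] and returns the number
 * of valid bytes (8), or 0 when the hardware queue is currently empty. */
typedef unsigned (*fetch_fn)(uint32_t block[2], void *ctx);

/* Hypothetical sink standing in for add_true_randomness(). */
typedef void (*sink_fn)(uint32_t word, void *ctx);

/*
 * Take at most max_blocks blocks, stopping as soon as the source reports
 * that nothing is available.  Never waits for the source.
 * Returns the number of 32-bit words handed to the sink.
 */
static size_t
drain_available(fetch_fn fetch, void *fctx, sink_fn sink, void *sctx,
    size_t max_blocks)
{
    uint32_t block[2];
    size_t i, words = 0;

    assert(fetch != NULL && sink != NULL);
    for (i = 0; i < max_blocks; ++i) {
        if (fetch(block, fctx) == 0)
            break;      /* queue empty: give up instead of blocking */
        sink(block[0], sctx);
        sink(block[1], sctx);
        words += 2;
    }
    return words;
}

/* Fake source for illustration: *ctx blocks are queued, then it runs dry. */
static unsigned
fake_fetch(uint32_t block[2], void *ctx)
{
    unsigned *left = ctx;

    if (*left == 0)
        return 0;
    --*left;
    block[0] = 0x12345678u;
    block[1] = 0x9abcdef0u;
    return 8;
}

/* Fake sink for illustration: counts the words it receives. */
static void
count_sink(uint32_t word, void *ctx)
{
    (void)word;
    ++*(size_t *)ctx;
}
```

With four blocks queued and a cap of 16, the loop hands over eight words and returns; with an empty queue it returns immediately without touching the sink.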

The comments around 'viac3_rnd' were also a bit
suspicious--at least compared to the docs that I have
available.  My reading of the divisor is that it just tells
the thing how many bytes to write to memory and how many to
throw away (doing "value & 0xff" does the same thing as
asking for the full divisor).

Note that the machdep change ensures that the control bits
are at their recommended values (e.g., the whitener is
enabled).  Some of those bits are reserved on the C3...
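
As a sanity check on the bit twiddling, here is a small sketch of one way to force those fields.  The constant values are copied from the specialreg.h hunk of the diff and have not been checked against the VIA documentation; rng_ctl_recommended is a hypothetical helper name.  It assumes that "recommended values" means clearing the fields covered by C3_RNG_MASK and then setting the enable and dual-source bits.  Note that the machdep.c hunk as posted uses "& C3_RNG_MASK", which keeps those fields and clears everything else, so which of the two is intended is worth confirming against the docs.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the specialreg.h hunk of the diff (unverified). */
#define C3_RNG_ENBL      0x00000040u
#define C3_RNG_SRCE_M    0x00000300u
#define C3_RNG_SRCE_DUAL 0x00000200u
#define C3_RNG_BIAS_M    0x00001c00u
#define C3_RNG_RBTS      0x00200000u
#define C3_RNG_MASK      (C3_RNG_RBTS | C3_RNG_BIAS_M | C3_RNG_SRCE_M)

/*
 * Assumption: the fields in C3_RNG_MASK are cleared first, then the RNG
 * is enabled with both sources selected.  Bits outside the mask are
 * left exactly as they were read.
 */
static uint64_t
rng_ctl_recommended(uint64_t msr)
{
    /* The source selection must fit inside its own field. */
    assert((C3_RNG_SRCE_DUAL & ~C3_RNG_SRCE_M) == 0);

    msr &= ~(uint64_t)C3_RNG_MASK;
    msr |= C3_RNG_ENBL | C3_RNG_SRCE_DUAL;
    return msr;
}
```

Starting from an all-zero register this yields 0x240; a register with the raw-bits, bias and source-B bits set also ends up at 0x240.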

I moved the magic constants to specialreg.h, so things
should be a little easier to read.  However, it is so very
easy to screw up a list of #defines with lots of zeros that
a double check against the docs is definitely advised (at
least I seem to have a gift for miscounting bits, nibbles,
zeros, etc.).

I haven't tested this on a C3.  Oh, and the diff is against
-current as of last night sometime.


Index: arch/i386/i386/machdep.c
===================================================================
RCS file: /usr/cvs/openbsd/src/sys/arch/i386/i386/machdep.c,v
retrieving revision 1.373
diff -u -r1.373 machdep.c
--- arch/i386/i386/machdep.c 23 Dec 2006 22:46:13 -0000 1.373
+++ arch/i386/i386/machdep.c 22 Jan 2007 19:00:34 -0000
@@ -1249,11 +1249,11 @@
  if (val & C3_CPUID_HAS_RNG) {
  extern int viac3_rnd_present;
 
- if (!(val & C3_CPUID_DO_RNG)) {
- msreg = rdmsr(0x110B);
- msreg |= 0x40;
- wrmsr(0x110B, msreg);
- }
+ /* Make sure that both generators are enabled. */
+ msreg = rdmsr(0x110B) & C3_RNG_MASK;
+ msreg |= C3_RNG_ENBL | C3_RNG_SRCE_DUAL;
+ wrmsr(0x110B, msreg);
+
  viac3_rnd_present = 1;
  printf(" RNG");
  }
Index: arch/i386/i386/via.c
===================================================================
RCS file: /usr/cvs/openbsd/src/sys/arch/i386/i386/via.c,v
retrieving revision 1.8
diff -u -r1.8 via.c
--- arch/i386/i386/via.c 17 Nov 2006 07:47:56 -0000 1.8
+++ arch/i386/i386/via.c 22 Jan 2007 20:19:52 -0000
@@ -508,17 +508,15 @@
 /*
  * Note, the VIA C3 Nehemiah provides 4 internal 8-byte buffers, which
  * store random data, and can be accessed a lot quicker than waiting
- * for new data to be generated.  As we are using every 8th bit only
- * due to whitening. Since the RNG generates in excess of 21KB/s at
- * it's worst, collecting 64 bytes worth of entropy should not affect
- * things significantly.
+ * for new data to be generated.  Since the RNG generates in excess of
+ * 12Mbit/s at its worst (and ranging up to ~80Mbit/s), collecting 64
+ * bytes worth of entropy should not affect things significantly.
  *
- * Note, due to some weirdness in the RNG, we need at least 7 bytes
- * extra on the end of our buffer.  Also, there is an outside chance
- * that the VIA RNG can "wedge", as the generated bit-rate is variable.
- * We could do all sorts of startup testing and things, but
- * frankly, I don't really see the point.  If the RNG wedges, then the
- * chances of you having a defective CPU are very high.  Let it wedge.
+ * There is an outside chance that the VIA RNG can "wedge", as the
+ * generated bit-rate is variable.  We could do all sorts of startup
+ * testing and things, but frankly, I don't really see the point.  If
+ * the RNG wedges, then the chances of you having a defective CPU are
+ * very high.  Let it wedge.
  *
  * Adding to the whole confusion, in order to access the RNG, we need
  * to have FXSR support enabled, and the correct FPU enable bits must
@@ -526,7 +524,6 @@
  * mumbo-jumbo was not needed in order to use the RNG.  Oh well, life
  * does go on...
  */
-#define VIAC3_RNG_BUFSIZ 16 /* 32bit words */
 struct timeout viac3_rnd_tmo;
 int viac3_rnd_present;
 
@@ -534,25 +531,46 @@
 viac3_rnd(void *v)
 {
  struct timeout *tmo = v;
- unsigned int *p, i, rv, creg0, len = VIAC3_RNG_BUFSIZ;
- static int buffer[VIAC3_RNG_BUFSIZ + 2]; /* XXX why + 2? */
+ unsigned int rv, creg0;
+ int i;
+ int buffer[2];
 
  creg0 = rcr0(); /* Permit access to SIMD/FPU path */
  lcr0(creg0 & ~(CR0_EM|CR0_TS));
 
  /*
- * Here we collect the random data from the VIA C3 RNG.  We make
- * sure that we turn on maximum whitening (%edx[0,1] == "11"), so
- * that we get the best random data possible.
+ * Here we collect the random data from the VIA C3/7 RNG.  We
+ * select the entire buffer (one can optionally select a
+ * subset, but the whitepaper suggests that we are still looking
+ * at ~.75-.98 bits of entropy per bit).
+ *
+ * Since there are 4 buffers with 8 bytes of data, we will
+ * probably average 4*8*100Hz = 3200bytes/sec (this is plenty
+ * as long as we are not reading more than ~20kbits/s out
+ * of the entropy pool).
+ *
+ * We avoid the "rep" variant and divisors to keep things sane.
+ * This also gives the hardware a chance to replenish its buffers
+ * while add_true_randomness() does its thing.
  */
- __asm __volatile("rep xstore-rng"
-    : "=a" (rv) : "d" (3), "D" (buffer), "c" (len*sizeof(int))
-    : "memory", "cc");
 
- lcr0(creg0);
+ for (i = 0; i < 16; ++i)
+ {
+ int* p = buffer;
+
+ __asm __volatile("xstore-rng"
+    : "=a" (rv), "+D" (p)
+          : "d" (0)
+    : "memory" );
+
+ if (0 == (rv & 0xf))
+ break;
 
- for (i = 0, p = buffer; i < VIAC3_RNG_BUFSIZ; i++, p++)
- add_true_randomness(*p);
+ add_true_randomness(buffer[0]);
+ add_true_randomness(buffer[1]);
+ }
+
+ lcr0(creg0);
 
  timeout_add(tmo, (hz > 100) ? (hz / 100) : 1);
 }
Index: arch/i386/include/specialreg.h
===================================================================
RCS file: /usr/cvs/openbsd/src/sys/arch/i386/include/specialreg.h,v
retrieving revision 1.28
diff -u -r1.28 specialreg.h
--- arch/i386/include/specialreg.h 12 Jun 2006 13:18:18 -0000 1.28
+++ arch/i386/include/specialreg.h 22 Jan 2007 18:56:49 -0000
@@ -451,6 +451,21 @@
 #define C3_CPUID_HAS_PMM 0x001000
 #define C3_CPUID_DO_PMM 0x002000
 
+/* VIA C3 RNG Control (MSR 0x110b) */
+#define C3_RNG_BCNT 0x0000001f
+#define C3_RNG_ENBL 0x00000040
+#define C3_RNG_SRCE_M 0x00000300
+#define C3_RNG_SRCE_A 0x00000000
+#define C3_RNG_SRCE_B 0x00000100
+#define C3_RNG_SRCE_DUAL 0x00000200
+#define C3_RNG_BIAS_M 0x00001c00
+#define C3_RNG_BIAS 0x00000000
+#define C3_RNG_RBTS 0x00200000
+#define C3_RNG_FLTR 0x00400000
+#define C3_RNG_FAIL 0x00800000
+#define C3_RNG_FCNT_M 0x3f000000
+#define C3_RNG_MASK  (C3_RNG_RBTS | C3_RNG_BIAS_M | C3_RNG_SRCE_M)
+
 /* VIA C3 xcrypt-* instruction context control options */
 #define C3_CRYPT_CWLO_ROUND_M 0x0000000f
 #define C3_CRYPT_CWLO_ALG_M 0x00000070


Re: VIA C7 Dual RNG

Tobias Weingartner-2
On Monday, January 22, Henric Jungheim wrote:
>
> I put together a little diff to get the 2nd generator

Hmm...

> The comments around 'viac3_rnd' were also a bit
> suspicious--at least compared to the docs that I have
> available.

Which comments would those be?

> - * Note, due to some weirdness in the RNG, we need at least 7 bytes
> - * extra on the end of our buffer.  Also, there is an outside chance

I'd have to re-read the stuff I had, but those extra 7 bytes were
necessary, as the CPU could over-run the buffer given by up to 7 bytes.
IE: we want to provide the CPU with extra space to do its thing.  Now,
it may be that not doing the "rep" version gets around this...

> -#define VIAC3_RNG_BUFSIZ 16 /* 32bit words */
>  struct timeout viac3_rnd_tmo;
>  int viac3_rnd_present;
>  
> @@ -534,25 +531,46 @@
>  viac3_rnd(void *v)
>  {
>   struct timeout *tmo = v;
> - unsigned int *p, i, rv, creg0, len = VIAC3_RNG_BUFSIZ;
> - static int buffer[VIAC3_RNG_BUFSIZ + 2]; /* XXX why + 2? */

The "+2" was those extra 7 bytes.  (4 bytes * 2 == 8, which is more
than enough...)  :)

> + * We avoid the "rep" variant and divisors to keep things sane.
> +         * This also gives the hardware a chance to replenish its buffers
> + * while add_true_randomness() does its thing.
>   */
> - __asm __volatile("rep xstore-rng"
> -    : "=a" (rv) : "d" (3), "D" (buffer), "c" (len*sizeof(int))
> -    : "memory", "cc");
>  
> - lcr0(creg0);
> + for (i = 0; i < 16; ++i)
> + {
> + int* p = buffer;
> +
> + __asm __volatile("xstore-rng"
> +    : "=a" (rv), "+D" (p)
> +          : "d" (0)
> +    : "memory" );
> +
> + if (0 == (rv & 0xf))
> + break;
>  
> - for (i = 0, p = buffer; i < VIAC3_RNG_BUFSIZ; i++, p++)
> - add_true_randomness(*p);
> + add_true_randomness(buffer[0]);
> + add_true_randomness(buffer[1]);
> + }
> +
> + lcr0(creg0);

I'm not sure this is better or worse.  Could you point me at the
documentation you have so I can have a read please?

--Toby.


Re: VIA C7 Dual RNG

Henric Jungheim-2
On Mon, Jan 22, 2007 at 04:15:15PM -0700, Tobias Weingartner wrote:

> On Monday, January 22, Henric Jungheim wrote:
> >
> > I put together a little diff to get the 2nd generator
>
> Hmm...
>
> > The comments around 'viac3_rnd' were also a bit
> > suspicious--at least compared to the docs that I have
> > available.
>
> Which comments would those be?

The comments that suggest that there is some connection between
the whitener and the divider.  The whitener feeds the queue.  The
divider sets how many bytes to grab from the queue (the rest
are discarded).  The data rate also seemed way off.

>
> > - * Note, due to some weirdness in the RNG, we need at least 7 bytes
> > - * extra on the end of our buffer.  Also, there is an outside chance
>
> I'd have to re-read the stuff I had, but those extra 7 bytes were
> necessary, as the CPU could over-run the buffer given by up to 7 bytes.
> IE: we want to provide the CPU with extra space to do its thing.  Now,
> it may be that not doing the "rep" version gets around this...

As I read it, programming_guide.pdf defines xstore-rng as:

If RNG queue has data, pull the top entry from the queue
        if EDX == 0:
                *(int*)EDI = RRRR
                EDI += 4
                *(int*)EDI = RRRR
                EDI += 4
                EAX[3:0] = 8 (plus other leading bits)
        else if EDX == 1:
                *(int*)EDI = RRRR
                EDI += 4
                EAX[3:0] = 4 (plus other leading bits)
        else if EDX == 2:
                *(int*)EDI = 00RR
                EDI += 2
                EAX[3:0] = 2 (plus other leading bits)
        else if EDX == 3:
                *(int*)EDI = 000R
                EDI += 1
                EAX[3:0] = 1 (plus other leading bits)
else // no data
        EAX[3:0] = 0 (plus other leading bits)

where the "R"s are all from the same queue entry.  The unused
"R"s--if any--are then thrown away.

It will always write 4 bytes to the address in EDI.  If the
divider is not 0 or 1, then EDI will spend a good part of
its time unaligned.
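
To make that reading concrete, here is a small software model of a single (non-"rep") xstore-rng.  It only mirrors my reading of the pseudocode above, not the hardware itself: model_xstore and its parameters are made-up names, and the queue entry is passed in as 8 plain bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Software model of one non-"rep" xstore-rng, following the pseudocode.
 * entry holds the 8 bytes of one queue entry; have_data says whether the
 * queue is non-empty.  *dst plays the role of EDI.
 * Returns the value of EAX[3:0], i.e. the number of valid bytes stored.
 */
static unsigned
model_xstore(const uint8_t entry[8], int have_data, unsigned edx,
    uint8_t **dst)
{
    static const unsigned valid[4] = { 8, 4, 2, 1 };
    uint8_t word[4] = { 0, 0, 0, 0 };
    unsigned n;

    assert(edx < 4);
    if (!have_data)
        return 0;               /* nothing written, EDI unchanged */

    n = valid[edx];
    if (n == 8) {
        memcpy(*dst, entry, 8); /* two 4-byte stores */
    } else {
        memcpy(word, entry, n); /* unused bytes stay zero */
        memcpy(*dst, word, 4);  /* always a full 4-byte store */
    }
    *dst += n;                  /* EDI advances by the valid count only */
    return n;
}
```

With EDX == 3 the model advances the pointer by one byte but still overwrites four, which is why a destination buffer sized only for the valid bytes is too small.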

I found that the stack-smash gods get very angry if one
gets the buffering wrong even by a little bit, so the
behavior I'm seeing is consistent with the docs.  The other
hardware ops are all different in how they interact with
memory.

BTW, I spent some time mucking with the sample SDK
(padlock_demo) and I would not recommend it as an example
for either the RNG or the AES stuff (there is some worrying
AES-CTR bug work-around code in there that just seems to
want to crash).

Then again, those extra 7 bytes may have something to
do with an undocumented bug that just so happens to not
hurt my C7.  I have an old C3 lying around somewhere...

The magic "7" may also have come from the SDK example.

Anyone have technical contact info for VIA that could give a
definitive answer?  (Or even a link to an errata?  I didn't
go looking through the normal/non-padlock CPU docs.)

>
> > -#define VIAC3_RNG_BUFSIZ 16 /* 32bit words */
> >  struct timeout viac3_rnd_tmo;
> >  int viac3_rnd_present;
> >  
> > @@ -534,25 +531,46 @@
> >  viac3_rnd(void *v)
> >  {
> >   struct timeout *tmo = v;
> > - unsigned int *p, i, rv, creg0, len = VIAC3_RNG_BUFSIZ;
> > - static int buffer[VIAC3_RNG_BUFSIZ + 2]; /* XXX why + 2? */
>
> The "+2" was those extra 7 bytes.  (4 bytes * 2 == 8, which is more
> than enough...)  :)

Yup.

>
> > + * We avoid the "rep" variant and divisors to keep things sane.
> > +         * This also gives the hardware a chance to replenish its buffers
> > + * while add_true_randomness() does its thing.
> >   */
> > - __asm __volatile("rep xstore-rng"
> > -    : "=a" (rv) : "d" (3), "D" (buffer), "c" (len*sizeof(int))
> > -    : "memory", "cc");
> >  
> > - lcr0(creg0);
> > + for (i = 0; i < 16; ++i)
> > + {
> > + int* p = buffer;
> > +
> > + __asm __volatile("xstore-rng"
> > +    : "=a" (rv), "+D" (p)
> > +          : "d" (0)
> > +    : "memory" );
> > +
> > + if (0 == (rv & 0xf))
> > + break;
> >  
> > - for (i = 0, p = buffer; i < VIAC3_RNG_BUFSIZ; i++, p++)
> > - add_true_randomness(*p);
> > + add_true_randomness(buffer[0]);
> > + add_true_randomness(buffer[1]);
> > + }
> > +
> > + lcr0(creg0);
>
> I'm not sure this is better or worse.  Could you point me at the
> documentation you have so I can have a read please?

The PadLock Developer Center page
    http://www.via.com.tw/en/initiatives/padlock/developercenter.jsp
has a link to the "VIA C5J programming guide for Assembler"
    http://www.via.com.tw/en/downloads/whitepapers/initiatives/padlock/programming_guide.pdf
   
The "Expert's Guide to VIA PadLock"/"Via PadLock Security
Engine" page
    http://www.via.com.tw/en/initiatives/padlock/hardware.jsp
has a link to the "VIA C5J PadLock Security Engine"
    http://www.via.com.tw/en/downloads/whitepapers/initiatives/padlock/security_application_note.pdf
and to a whitepaper:
    http://www.via.com.tw/en/downloads/whitepapers/initiatives/padlock/evaluation_padlock_rng.pdf

They recommend not using the REP mode for important stuff.

There is also the Software Development Kit and the SDK guide
(less useful):
    http://www.viaarena.com/default.aspx?PageID=22&DSCat=161&DCatType=1
I spent some time mucking with it before I got it working--
all but AES-CTR, anyway.  (Lemme know if anyone wants to see
that code.)

One might also argue that the non-streamable SHA1/256
implementation is inherently a bug (i.e., the whole SHA op
must be in a contiguous block).  I would assume that if they
could fix it in a microcode update, they would already have
done so.


This all started with me wondering what it would take to get
the RSA acceleration going ...  <sigh>

>
> --Toby.


Re: VIA C7 Dual RNG

Henric Jungheim-2
Errr...  configure the divider for 8 bytes (EDX = 0) and ask
for one byte of output, then call "rep xstore-rng".  It will
write 8 bytes when only one was asked for, so it could
overwrite the output buffer by 7 bytes.  There's the
mysterious "7".  (Now I just need to do something about this
palm-print; I must remember to not slap my forehead like
that...  A Homer-esque, "Doh!" should be an effective
alternative.)


Re: VIA C7 Dual RNG

Shane J Pearson
In reply to this post by Henric Jungheim-2
Hello Henric,

On 23/01/2007, at 8:50 AM, Henric Jungheim wrote:

> I put together a little diff to get the 2nd generator
> enabled on the VIA C7.  This should roughly double the
> number of bits per second it generates without changing the
> interface (both generators feed into the same queue).

Out of interest, how fast do the new dual RNG C7's output random data?


Shane J Pearson
shanejp netspace net au


Re: VIA C7 Dual RNG

Berk D. Demir
Shane J Pearson wrote:
>> I put together a little diff to get the 2nd generator
>> enabled on the VIA C7.  This should roughly double the
>> number of bits per second it generates without changing the
>> interface (both generators feed into the same queue).
>
> Out of interest, how fast do the new dual RNG C7's output random data?

Did you read Henric's diff?

An excerpt from it.

+ * for new data to be generated.  Since the RNG generates in excess of
+ * 12Mbit/s at it's worst (and ranging up to ~80Mbit/s), collecting 64
+ * bytes worth of entropy should not affect things significantly.


Re: VIA C7 Dual RNG

Shane J Pearson
On 24/01/2007, at 12:41 AM, Berk D. Demir wrote:

>> Out of interest, how fast do the new dual RNG C7's output random data?
>
> Did you read Henric's diff?
>
> An excerpt from it.
>
> + * for new data to be generated.  Since the RNG generates in excess of
> + * 12Mbit/s at it's worst (and ranging up to ~80Mbit/s), collecting 64
> + * bytes worth of entropy should not affect things significantly.

No I didn't. I should have. Sorry.

Wow, it's really quick.


Shane J Pearson
shanejp netspace net au


Re: VIA C7 Dual RNG

Henric Jungheim-2
In reply to this post by Tobias Weingartner-2
Regarding the speed of the RNG...

The raw rate will vary over time even on the same chip.
Take a look at the security app note and the whitepaper I
mentioned in the previous mail.  They mention some numbers
there.

Of more practical interest is the rate that bits are
contributed to the kernel rng pool.  The current interface
is based on fixed rate polling (at ~100Hz).  Assuming 4
buffers of 8 bytes are queued up and that the poll happens
faster than a new buffer can be provided, then the RNG
should be supplying the kernel entropy pool at a rate of 8 *
4 * 100Hz = 3200 bytes/s = 25.6 kbit/s.  That is orders of
magnitude less than it could be providing but also likely
much more than the rate at which things are read from the
kernel pool.  For this source--and others--it may be useful
to give the rng pool driver a way to actively grab entropy
from the hardware RNG whenever someone tries to extract
entropy from the system (and perhaps using it like it
currently uses nanotime).  ...perhaps have the entropy
extraction code feed in whatever stray bits any hardware rng
happens to have lying around into the MD5 it does for
the extraction (around line 900 of sys/dev/rnd.c).
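
The polling estimate above is just a product of three numbers; spelled out as a throwaway helper (pool_feed_bytes_per_sec is a made-up name, and the result is only as good as the assumption that each poll drains exactly the queued entries):

```c
#include <assert.h>

/*
 * Estimated bytes per second handed to the kernel pool when every poll
 * drains queue_entries entries of bytes_per_entry bytes each.
 */
static unsigned long
pool_feed_bytes_per_sec(unsigned long queue_entries,
    unsigned long bytes_per_entry, unsigned long polls_per_sec)
{
    assert(polls_per_sec > 0);
    return queue_entries * bytes_per_entry * polls_per_sec;
}
```

Calling it with 4 entries, 8 bytes per entry and 100 polls per second gives 3200 bytes/s; multiplying by 8 gives 25600 bit/s.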

The C7 RNG can be used directly from userland, in which case
the raw rate becomes relevant.  However, the data rate also
depends on what one wants to do with those numbers.  The
software should do some processing on the stream before
using it for anything but the most trivial application (and
if the application is trivial, then it is simpler to just
use "random").  The effective data rate is then dependent on
the overhead of that processing (typically SHA1 or
some-such).  This is something that /dev/srandom or
/dev/random could (should?) be doing.

Aaaanyway...  None of this is going to make TCP port numbers
any more difficult to predict.


Re: VIA C7 Dual RNG

Tobias Weingartner-2
In reply to this post by Henric Jungheim-2
On Monday, January 22, Henric Jungheim wrote:
>
> Errr...  configure the divider for 8 bytes (EDX = 0) and ask
> for one byte of output, then call "rep xstore-rng".  It will
> write 8 bytes when only one was asked for, so it could
> overwrite the output buffer by 7 bytes.  There's the
> mysterious "7".  (Now I just need to do something about this
> palm-print; I must remember to not slap my forehead like
> that...  A Homer-esque, "Doh!" should be an effective
> alternative.)

Does the palm-print mean your original diff needs to be modified?  :)

--Toby.


Re: VIA C7 Dual RNG

Henric Jungheim-2
No, the palm-print just means that I realized where that,
"be careful of those extra 7 bytes," warning came from.  I
think the diff handles the buffering correctly as it has the
CPU do two 32 bit writes to an 8 byte buffer.  Furthermore,
in a previous variation of that diff, I managed to
demonstrate that writing past the end of the buffer causes
the system to panic very, very quickly.

The original code needs extra buffer space since the last
iteration of the "rep xstore-rng" does an unaligned 32-bit
write, containing the last random byte, to the address of
the last byte of the nominal read buffer, thereby going 3
bytes past the end--which is okay, since that "+ 2" provides
8 more bytes for the CPU to drool on.  It is not as
complicated as I just tried to make it sound...
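
Put as arithmetic, a rough model of the overrun looks like the sketch below.  It is based only on the behavior described in this thread, rep_xstore_overrun is a hypothetical name, and it assumes that "rep xstore-rng" keeps storing until at least the requested number of bytes has been delivered.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of how far "rep xstore-rng" can write past a buffer of
 * `requested` bytes:
 *   edx 0..3 -> 8, 4, 2, 1 bytes kept per store (the pointer advance);
 *   each store writes 8 bytes when edx == 0 and 4 bytes otherwise.
 * Assumption: stores continue until `requested` bytes are delivered.
 */
static size_t
rep_xstore_overrun(size_t requested, unsigned edx)
{
    static const size_t advance[4] = { 8, 4, 2, 1 };
    size_t stores, width, end;

    assert(edx < 4);
    if (requested == 0)
        return 0;
    stores = (requested + advance[edx] - 1) / advance[edx];
    width = (edx == 0) ? 8 : 4;
    /* Offset one past the last byte touched by the final store. */
    end = (stores - 1) * advance[edx] + width;
    return end > requested ? end - requested : 0;
}
```

Under this model, asking for 1 byte with EDX = 0 overruns by 7 bytes, and the old kernel call (64 bytes with EDX = 3) overruns by 3.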


On Tue, Jan 23, 2007 at 11:04:26PM -0700, Tobias Weingartner wrote:

> On Monday, January 22, Henric Jungheim wrote:
> >
> > Errr...  configure the divider for 8 bytes (EDX = 0) and ask
> > for one byte of output, then call "rep xstore-rng".  It will
> > write 8 bytes when only one was asked for, so it could
> > overwrite the output buffer by 7 bytes.  There's the
> > mysterious "7".  (Now I just need to do something about this
> > palm-print; I must remember to not slap my forehead like
> > that...  A Homer-esque, "Doh!" should be an effective
> > alternative.)
>
> Does the palm-print mean your original diff needs to be modified?  :)
>
> --Toby.


Re: VIA C7 Dual RNG

Shane J Pearson
In reply to this post by Henric Jungheim-2
Thanks Henric,

On 24/01/2007, at 9:57 AM, Henric Jungheim wrote:

> Regarding the speed of the RNG...

This is interesting. I might buy a Via EPIA EN15000 soon and have a  
play with it.

I'm curious to know if it would be safe to run the RNGs as fast as
they'll go through a compression function. I'll take a look at those  
docs...

Thanks,


Shane J Pearson
shanejp netspace net au


Re: VIA C7 Dual RNG

Henric Jungheim-2
In reply to this post by Henric Jungheim-2
More regarding the speed of the RNG...

I wrote a simple program to measure the raw bitrate of the
two Epia boards I have sitting around.  Both are running
4.0-current.  The 1.2GHz C7 with both sources enabled
provides ~35Mbit/s and the 1GHz C3 provides ~7Mbit/s.

I did not get a chance to try the C7 with a single entropy
source enabled.

Note that the program doesn't process the data in any way.  It just
measures how fast the RNG spits out bits.  Real apps are
likely to hash/encrypt/whatever the data before considering
it "random enough."

After taking a quick look at the raw numbers, I got
interested in just how much CPU time the current code spends
in "rep xstore-rng".  Since the first four requests can be
satisfied by the RNG's hardware queue, I had the simulation
code skip them entirely (otherwise it is doing the same
thing the kernel does 100 times per second).  From this I
got ~1% of the total available CPU time for the C7 and ~5%
for the C3.  That seems a little excessive.  Compensating
for the kernel's 100 iterations per second does not change
the numbers significantly (e.g., during the C3's 11.24s test
run, the kernel would have run ~11.24s * 100Hz = 1124
iterations in parallel with the 20000 iterations the
test ran itself).

(Did someone say something about beating dead horses...?)

Here's the actual program output:

1.2GHz C7 (with both RNGs enabled):
Raw rate:
3.64 s total time
1.82 us per 8 byte iteration
35.1648 Mbit/s
Kernel polling:
2.13 s total time
106.5 us per 64-byte iteration
1.065% CPU

1GHz C3:
Raw rate:
18.65 s total time
9.325 us per 8 byte iteration
6.86327 Mbit/s
Kernel polling:
11.24 s total time
562 us per 64-byte iteration
5.62% CPU
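
For anyone checking the figures by hand, the derived numbers come from the total times like this.  The helper names are made up for illustration; the iteration counts (2,000,000 for the raw test, 20,000 for the polling test) are the ones used by the test program, and 100 Hz is the kernel polling rate assumed throughout.

```c
#include <assert.h>

/* Mbit/s for `iterations` reads of `bytes_per_iter` bytes in `seconds`. */
static double
raw_mbit_per_sec(double seconds, double iterations, double bytes_per_iter)
{
    assert(seconds > 0.0);
    return iterations * bytes_per_iter * 8.0 / seconds / 1e6;
}

/*
 * Percent of one CPU used when a call that costs seconds/iterations
 * is made hz times per second.
 */
static double
polling_cpu_percent(double seconds, double iterations, double hz)
{
    assert(iterations > 0.0);
    return seconds / iterations * hz * 100.0;
}
```

For the C7, raw_mbit_per_sec(3.64, 2e6, 8) comes out at about 35.16 and polling_cpu_percent(2.13, 20000, 100) at about 1.065; the C3 totals give about 6.86 and 5.62.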


Compiled with, "gcc -O3 -o via_rng via_rng_speed.c" (but -O0
gave the same results):

#include <stdio.h>
#include <time.h>

#define SIMPLE_ITERATIONS (2 * 1000 * 1000)
#define REP_ITERATIONS (20 * 1000)

void f()
{
        int i;
        unsigned int rv;
        unsigned int buffer[2];

        for (i = SIMPLE_ITERATIONS; i > 0; --i)
        {
                do
                {
                        unsigned int *p = buffer;

                        __asm __volatile("xstore-rng"
                            : "=a" (rv), "+D" (p)
                            : "d" (0)
                            : "memory" );
                } while (8 != (rv & 0xf));
        }
}

#define VIAC3_RNG_BUFSIZ 16

void viac3_rnd()          
{
        unsigned int i, rv, len = VIAC3_RNG_BUFSIZ;
        static unsigned int buffer[VIAC3_RNG_BUFSIZ + 2];        /* XXX why + 2? */
         
        /*
         * Here we collect the random data from the VIA C3 RNG.  We make
         * sure that we turn on maximum whitening (%edx[0,1] == "11"), so
         * that we get the best random data possible.
         */

        for (i = REP_ITERATIONS; i > 0; --i)
        {
                unsigned int *p = buffer;
                /*
                 * We subtract 4 from count since the CPU's
                 * 4 buffers are almost certainly full by the
                 * time the kernel gets here.
                 */
                int count = len * sizeof(int) - 4;
               
                __asm __volatile("rep xstore-rng"
                    : "=a" (rv), "+D" (p), "+c" (count)
                    : "d" (3)
                    : "memory", "cc");  
        }
}

int
main()
{
        clock_t start, end;
        double elapsed;
        double time_per_second;

        printf("Raw rate:\n");

        start = clock();
        f();
        end = clock();

        elapsed = (end - start) / (double)CLOCKS_PER_SEC;
       
        printf("%g s total time\n", elapsed);

        elapsed /= SIMPLE_ITERATIONS;

        printf("%g us per 8 byte iteration\n", 1000 * 1000 * elapsed);
        printf("%g Mbit/s\n", (8 * 8.0 / 1000 / 1000) / elapsed);
       
        printf("Kernel polling:\n");

        start = clock();
        viac3_rnd();
        end = clock();

        elapsed = (end - start) / (double)CLOCKS_PER_SEC;
       
        printf("%g s total time\n", elapsed);

        elapsed /= REP_ITERATIONS;

        printf("%g us per 64-byte iteration\n", 1000 * 1000 * elapsed);

        /* The kernel calls viac3_rnd 100 times per second */
        time_per_second = 100 * elapsed;
        printf("%g%% CPU\n", 100 * time_per_second);
       
        return 0;
}