[PATCH] let the mbufs use more than 4gb of memory

[PATCH] let the mbufs use more than 4gb of memory

Simon Mages
On a system that uses the maximum socket buffer size of 256 KB you can
run out of memory with fewer than 9k open sockets.

My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
I chose this range after reading the comments in sys/arch/amd64/include/pmap.h.
The patch also raises the maximum socket buffer size from 256 KB to 1 GB, as
described in RFC 1323 S2.3.

I tested this diff with the ix, em and urndis drivers. I know that the diff
only works on amd64 right now, but I wanted to send it as a proposal for what
could be done. Maybe somebody has a different solution for this problem, or
can tell me why this is a bad idea.
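
Back of the envelope: every socket has both a send and a receive buffer, so at
the 256 KB maximum

	2 buffers/socket * 256 KB = 512 KB per socket
	4 GB / 512 KB             = 8192 sockets

which is why the 4 GB low region runs out below 9k sockets.  For scale, the new
mbuf_constraint range 0x0-0xfffffffff in the diff below covers the first 64 GB
of physical address space.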


Index: arch/amd64/amd64/bus_dma.c
===================================================================
RCS file: /openbsd/src/sys/arch/amd64/amd64/bus_dma.c,v
retrieving revision 1.49
diff -u -p -u -p -r1.49 bus_dma.c
--- arch/amd64/amd64/bus_dma.c 17 Dec 2015 17:16:04 -0000 1.49
+++ arch/amd64/amd64/bus_dma.c 22 Jun 2016 11:33:17 -0000
@@ -584,7 +584,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t,
  */
  pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);

- if (curaddr > dma_constraint.ucr_high)
+ if (curaddr > mbuf_constraint.ucr_high)
  panic("Non dma-reachable buffer at curaddr %#lx(raw)",
     curaddr);

Index: arch/amd64/amd64/machdep.c
===================================================================
RCS file: /openbsd/src/sys/arch/amd64/amd64/machdep.c,v
retrieving revision 1.221
diff -u -p -u -p -r1.221 machdep.c
--- arch/amd64/amd64/machdep.c 21 May 2016 00:56:43 -0000 1.221
+++ arch/amd64/amd64/machdep.c 22 Jun 2016 11:33:17 -0000
@@ -202,9 +202,11 @@ struct vm_map *phys_map = NULL;
 /* UVM constraint ranges. */
 struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
 struct uvm_constraint_range  dma_constraint = { 0x0, 0xffffffffUL };
+struct uvm_constraint_range  mbuf_constraint = { 0x0, 0xfffffffffUL };
 struct uvm_constraint_range *uvm_md_constraints[] = {
     &isa_constraint,
     &dma_constraint,
+    &mbuf_constraint,
     NULL,
 };

Index: kern/uipc_mbuf.c
===================================================================
RCS file: /openbsd/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.226
diff -u -p -u -p -r1.226 uipc_mbuf.c
--- kern/uipc_mbuf.c 13 Jun 2016 21:24:43 -0000 1.226
+++ kern/uipc_mbuf.c 22 Jun 2016 11:33:18 -0000
@@ -153,7 +153,7 @@ mbinit(void)

  pool_init(&mbpool, MSIZE, 0, 0, 0, "mbufpl", NULL);
  pool_setipl(&mbpool, IPL_NET);
- pool_set_constraints(&mbpool, &kp_dma_contig);
+ pool_set_constraints(&mbpool, &kp_mbuf_contig);
  pool_setlowat(&mbpool, mblowat);

  pool_init(&mtagpool, PACKET_TAG_MAXSIZE + sizeof(struct m_tag),
@@ -166,7 +166,7 @@ mbinit(void)
  pool_init(&mclpools[i], mclsizes[i], 0, 0, 0,
     mclnames[i], NULL);
  pool_setipl(&mclpools[i], IPL_NET);
- pool_set_constraints(&mclpools[i], &kp_dma_contig);
+ pool_set_constraints(&mclpools[i], &kp_mbuf_contig);
  pool_setlowat(&mclpools[i], mcllowat);
  }

Index: sys/socketvar.h
===================================================================
RCS file: /openbsd/src/sys/sys/socketvar.h,v
retrieving revision 1.60
diff -u -p -u -p -r1.60 socketvar.h
--- sys/socketvar.h 25 Feb 2016 07:39:09 -0000 1.60
+++ sys/socketvar.h 22 Jun 2016 11:33:18 -0000
@@ -112,7 +112,7 @@ struct socket {
  short sb_flags; /* flags, see below */
  u_short sb_timeo; /* timeout for read/write */
  } so_rcv, so_snd;
-#define SB_MAX (256*1024) /* default for max chars in sockbuf */
+#define SB_MAX (1024*1024*1024)/* default for max chars in sockbuf */
 #define SB_LOCK 0x01 /* lock on data queue */
 #define SB_WANT 0x02 /* someone is waiting to lock */
 #define SB_WAIT 0x04 /* someone is waiting for data/space */
Index: uvm/uvm_extern.h
===================================================================
RCS file: /openbsd/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.139
diff -u -p -u -p -r1.139 uvm_extern.h
--- uvm/uvm_extern.h 5 Jun 2016 08:35:57 -0000 1.139
+++ uvm/uvm_extern.h 22 Jun 2016 11:33:18 -0000
@@ -234,6 +234,7 @@ extern struct uvmexp uvmexp;
 /* Constraint ranges, set by MD code. */
 extern struct uvm_constraint_range  isa_constraint;
 extern struct uvm_constraint_range  dma_constraint;
+extern struct uvm_constraint_range  mbuf_constraint;
 extern struct uvm_constraint_range  no_constraint;
 extern struct uvm_constraint_range *uvm_md_constraints[];

@@ -398,6 +399,7 @@ extern const struct kmem_pa_mode kp_zero
 extern const struct kmem_pa_mode kp_dma;
 extern const struct kmem_pa_mode kp_dma_contig;
 extern const struct kmem_pa_mode kp_dma_zero;
+extern const struct kmem_pa_mode kp_mbuf_contig;
 extern const struct kmem_pa_mode kp_pageable;
 extern const struct kmem_pa_mode kp_none;

Index: uvm/uvm_km.c
===================================================================
RCS file: /openbsd/src/sys/uvm/uvm_km.c,v
retrieving revision 1.128
diff -u -p -u -p -r1.128 uvm_km.c
--- uvm/uvm_km.c 26 Sep 2015 17:55:00 -0000 1.128
+++ uvm/uvm_km.c 22 Jun 2016 11:33:18 -0000
@@ -1016,6 +1016,11 @@ const struct kmem_pa_mode kp_dma_zero =
  .kp_zero = 1
 };

+const struct kmem_pa_mode kp_mbuf_contig = {
+ .kp_constraint = &mbuf_constraint,
+ .kp_maxseg = 1
+};
+
 const struct kmem_pa_mode kp_zero = {
  .kp_constraint = &no_constraint,
  .kp_zero = 1

Re: [PATCH] let the mbufs use more than 4gb of memory

David Gwynne-5
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:

> On a System where you use the maximum socketbuffer size of 256kbyte you
> can run out of memory after less then 9k open sockets.
>
> My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
> I choose this area after reading the comments in sys/arch/amd64/include/pmap.h.
> This patch further changes the maximum sucketbuffer size from 256k to 1gb as
> it is described in the rfc1323 S2.3.
>
> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works
> for amd64 right now, but i wanted to send this diff as a proposal what could be
> done. Maybe somebody has a different solution for this Problem or can me why
> this is a bad idea.

hey simon,

first, some background.

the 4G watermark is less about limiting the amount of memory used
by the network stack and more about making the memory addressable
by as many devices, including network cards, as possible. we support
older chips that only deal with 32 bit addresses (and one or two
stupid ones with an inability to address over 1G), so we took the
conservative option and made the memory generally usable without
developers having to think about it much.

you could argue that you should be able to give big addresses
to modern cards, but that falls down if you are forwarding packets
between a modern and old card, cos the old card will want to dma
the packet the modern card rxed, but it needs it below the 4g line.
even if you dont have an old card, in todays hotplug world you might
plug an old device in. either way, the future of an mbuf is very
hard for the kernel to predict.

secondly, allocating more than 4g at a time to socket buffers is
generally a waste of memory. in practice you should scale the amount
of memory available to sockets according to the size of the tcp
windows you need to saturate the bandwidth available to the box.
this means if you want to sustain a gigabit of traffic with a 300ms
round trip time for packets, you'd "only" need ~37.5 megabytes of
buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is
still below 4G. allowing more use of memory for buffers would likely
induce latency.
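
spelling out the bandwidth-delay product behind those numbers:

	1 Gbit/s  * 0.3 s = 300 Mbit = 37.5 MB in flight
	40 Gbit/s * 0.3 s = 12 Gbit  = 1.5 GB in flight

so even a single saturated 40G stream with a 300ms rtt still fits under the
4G line.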

the above means that if you want to sustain a single 40G tcp
connection to that host you'd need to be able to place 1.5G on the
socket buffer, which is above the 1G you mention above. however,
if you want to sustain 2 connections, you ideally want to fairly
share the 1.5G between both sockets. they should get 750M each.

fairly sharing buffers between the sockets may already be in place
in openbsd. when i reworked the pools subsystem i set it up so
things sleeping on memory were woken up in order.

it occurs to me that perhaps we should limit mbufs by the bytes
they can use rather than the number of them. that would also work
well if we moved to per cpu caches for mbufs and clusters, cos the
number of active mbufs in the system becomes hard to limit accurately
if we want cpus to run independently.

if you want something to work on in this area, could you look at
letting sockets use the "jumbo" clusters instead of assuming
everything has to be in 2k clusters? i started on this with the
diff below, but it broke ospfd and i never got back to it.

if you get it working, it would be interesting to test creating even
bigger cluster pools, eg, a 1M or 4M mbuf cluster.

cheers,
dlg

Index: uipc_socket.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 uipc_socket.c
--- uipc_socket.c 11 Dec 2014 19:21:57 -0000 1.135
+++ uipc_socket.c 22 Dec 2014 01:11:03 -0000
@@ -493,15 +493,18 @@ restart:
  mlen = MLEN;
  }
  if (resid >= MINCLSIZE && space >= MCLBYTES) {
- MCLGET(m, M_NOWAIT);
+ MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+    lmin(space, MAXMCLBYTES)));
  if ((m->m_flags & M_EXT) == 0)
  goto nopages;
  if (atomic && top == 0) {
- len = lmin(MCLBYTES - max_hdr,
-    resid);
+ len = lmin(resid,
+    m->m_ext.ext_size -
+    max_hdr);
  m->m_data += max_hdr;
  } else
- len = lmin(MCLBYTES, resid);
+ len = lmin(resid,
+    m->m_ext.ext_size);
  space -= len;
  } else {
 nopages:
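
for reference, a minimal sketch of what the MCLGETI() call above amounts to
(illustrative only, assuming the MCLGETI(m, how, ifp, len) interface of the
time, which attaches a cluster from the smallest pool that covers the
requested length):

	/* illustrative: get an mbuf with a cluster of at least 9000 bytes */
	m = m_gethdr(M_NOWAIT, MT_DATA);
	if (m != NULL) {
		MCLGETI(m, M_NOWAIT, NULL, 9000);
		if ((m->m_flags & M_EXT) == 0) {
			/* no cluster could be attached */
			m_freem(m);
			m = NULL;
		}
	}

the (m->m_flags & M_EXT) check is the same fallback the diff uses to take the
nopages path when no cluster is available.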

Re: [PATCH] let the mbufs use more than 4gb of memory

Claudio Jeker
In reply to this post by Simon Mages
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a System where you use the maximum socketbuffer size of 256kbyte you
> can run out of memory after less then 9k open sockets.
>
> My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
> I choose this area after reading the comments in sys/arch/amd64/include/pmap.h.
> This patch further changes the maximum sucketbuffer size from 256k to 1gb as
> it is described in the rfc1323 S2.3.

You read that RFC wrong. I see no reason to increase the socket buffer size
to such a huge value. A change like this is currently not acceptable.
 
> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works
> for amd64 right now, but i wanted to send this diff as a proposal what could be
> done. Maybe somebody has a different solution for this Problem or can me why
> this is a bad idea.
>

Are you sure that all drivers are able to handle memory with physical
addresses that are more than 32 bits long? I doubt it. I think a lot more
is needed than this diff to make this work, even just for amd64.

>
> Index: arch/amd64/amd64/bus_dma.c
> ===================================================================
> RCS file: /openbsd/src/sys/arch/amd64/amd64/bus_dma.c,v
> retrieving revision 1.49
> diff -u -p -u -p -r1.49 bus_dma.c
> --- arch/amd64/amd64/bus_dma.c 17 Dec 2015 17:16:04 -0000 1.49
> +++ arch/amd64/amd64/bus_dma.c 22 Jun 2016 11:33:17 -0000
> @@ -584,7 +584,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t,
>   */
>   pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
>
> - if (curaddr > dma_constraint.ucr_high)
> + if (curaddr > mbuf_constraint.ucr_high)
>   panic("Non dma-reachable buffer at curaddr %#lx(raw)",
>      curaddr);
>
> Index: arch/amd64/amd64/machdep.c
> ===================================================================
> RCS file: /openbsd/src/sys/arch/amd64/amd64/machdep.c,v
> retrieving revision 1.221
> diff -u -p -u -p -r1.221 machdep.c
> --- arch/amd64/amd64/machdep.c 21 May 2016 00:56:43 -0000 1.221
> +++ arch/amd64/amd64/machdep.c 22 Jun 2016 11:33:17 -0000
> @@ -202,9 +202,11 @@ struct vm_map *phys_map = NULL;
>  /* UVM constraint ranges. */
>  struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
>  struct uvm_constraint_range  dma_constraint = { 0x0, 0xffffffffUL };
> +struct uvm_constraint_range  mbuf_constraint = { 0x0, 0xfffffffffUL };
>  struct uvm_constraint_range *uvm_md_constraints[] = {
>      &isa_constraint,
>      &dma_constraint,
> +    &mbuf_constraint,
>      NULL,
>  };
>
> Index: kern/uipc_mbuf.c
> ===================================================================
> RCS file: /openbsd/src/sys/kern/uipc_mbuf.c,v
> retrieving revision 1.226
> diff -u -p -u -p -r1.226 uipc_mbuf.c
> --- kern/uipc_mbuf.c 13 Jun 2016 21:24:43 -0000 1.226
> +++ kern/uipc_mbuf.c 22 Jun 2016 11:33:18 -0000
> @@ -153,7 +153,7 @@ mbinit(void)
>
>   pool_init(&mbpool, MSIZE, 0, 0, 0, "mbufpl", NULL);
>   pool_setipl(&mbpool, IPL_NET);
> - pool_set_constraints(&mbpool, &kp_dma_contig);
> + pool_set_constraints(&mbpool, &kp_mbuf_contig);
>   pool_setlowat(&mbpool, mblowat);
>
>   pool_init(&mtagpool, PACKET_TAG_MAXSIZE + sizeof(struct m_tag),
> @@ -166,7 +166,7 @@ mbinit(void)
>   pool_init(&mclpools[i], mclsizes[i], 0, 0, 0,
>      mclnames[i], NULL);
>   pool_setipl(&mclpools[i], IPL_NET);
> - pool_set_constraints(&mclpools[i], &kp_dma_contig);
> + pool_set_constraints(&mclpools[i], &kp_mbuf_contig);
>   pool_setlowat(&mclpools[i], mcllowat);
>   }
>
> Index: sys/socketvar.h
> ===================================================================
> RCS file: /openbsd/src/sys/sys/socketvar.h,v
> retrieving revision 1.60
> diff -u -p -u -p -r1.60 socketvar.h
> --- sys/socketvar.h 25 Feb 2016 07:39:09 -0000 1.60
> +++ sys/socketvar.h 22 Jun 2016 11:33:18 -0000
> @@ -112,7 +112,7 @@ struct socket {
>   short sb_flags; /* flags, see below */
>   u_short sb_timeo; /* timeout for read/write */
>   } so_rcv, so_snd;
> -#define SB_MAX (256*1024) /* default for max chars in sockbuf */
> +#define SB_MAX (1024*1024*1024)/* default for max chars in sockbuf */
>  #define SB_LOCK 0x01 /* lock on data queue */
>  #define SB_WANT 0x02 /* someone is waiting to lock */
>  #define SB_WAIT 0x04 /* someone is waiting for data/space */
> Index: uvm/uvm_extern.h
> ===================================================================
> RCS file: /openbsd/src/sys/uvm/uvm_extern.h,v
> retrieving revision 1.139
> diff -u -p -u -p -r1.139 uvm_extern.h
> --- uvm/uvm_extern.h 5 Jun 2016 08:35:57 -0000 1.139
> +++ uvm/uvm_extern.h 22 Jun 2016 11:33:18 -0000
> @@ -234,6 +234,7 @@ extern struct uvmexp uvmexp;
>  /* Constraint ranges, set by MD code. */
>  extern struct uvm_constraint_range  isa_constraint;
>  extern struct uvm_constraint_range  dma_constraint;
> +extern struct uvm_constraint_range  mbuf_constraint;
>  extern struct uvm_constraint_range  no_constraint;
>  extern struct uvm_constraint_range *uvm_md_constraints[];
>
> @@ -398,6 +399,7 @@ extern const struct kmem_pa_mode kp_zero
>  extern const struct kmem_pa_mode kp_dma;
>  extern const struct kmem_pa_mode kp_dma_contig;
>  extern const struct kmem_pa_mode kp_dma_zero;
> +extern const struct kmem_pa_mode kp_mbuf_contig;
>  extern const struct kmem_pa_mode kp_pageable;
>  extern const struct kmem_pa_mode kp_none;
>
> Index: uvm/uvm_km.c
> ===================================================================
> RCS file: /openbsd/src/sys/uvm/uvm_km.c,v
> retrieving revision 1.128
> diff -u -p -u -p -r1.128 uvm_km.c
> --- uvm/uvm_km.c 26 Sep 2015 17:55:00 -0000 1.128
> +++ uvm/uvm_km.c 22 Jun 2016 11:33:18 -0000
> @@ -1016,6 +1016,11 @@ const struct kmem_pa_mode kp_dma_zero =
>   .kp_zero = 1
>  };
>
> +const struct kmem_pa_mode kp_mbuf_contig = {
> + .kp_constraint = &mbuf_constraint,
> + .kp_maxseg = 1
> +};
> +
>  const struct kmem_pa_mode kp_zero = {
>   .kp_constraint = &no_constraint,
>   .kp_zero = 1
>

--
:wq Claudio

Re: [PATCH] let the mbufs use more than 4gb of memory

Theo de Raadt
In reply to this post by David Gwynne-5
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory.

and there is one further problem.

Eventually, this subsystem will starve the system.  Other subsystems
which also need large amounts of memory then have to scramble.  There
have to be backpressure mechanisms in each subsystem to force out
memory.

There is no such mechanism in socket buffers.

The mechanisms in the remaining parts of the kernel have always proven
to be weak, as in, they don't interact as nicely as we want, to create
space.  There has been much work to make them work better.

However in socket buffers, there is no such mechanism.  What are
you going to do?  Throw data away?  You can't do that.  Therefore,
you are holding the remaining system components hostage, and your
diff creates deadlock.

You probably tested your diff under ideal conditions with gobs of
memory...

 

Re: [PATCH] let the mbufs use more than 4gb of memory

Alexander Bluhm
In reply to this post by David Gwynne-5
On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory. in practice you should scale the amount
> of memory available to sockets according to the size of the tcp
> windows you need to saturate the bandwidth available to the box.

Currently OpenBSD limits the socket buffer size to 256k.
#define SB_MAX          (256*1024)      /* default for max chars in sockbuf */

For downloading large files from the internet this is not sufficient
anymore.  After customer complaints we have increased the limit to
1MB.  This still does not give maximum throughput, but granting
more could easily result in running out of mbufs.  16MB would be
sufficient.

Besides single connections with high throughput we also have
a lot of long-running connections, say some 10000.  Each connection
over a relay needs two sockets and four socket buffers.  With a 1MB
limit and 10000 connections the theoretical maximum is 40GB.
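
Spelled out: 10000 connections * 4 socket buffers * 1MB each = 40GB in the
worst case, an order of magnitude beyond the 4GB DMA region.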

It is hard to figure out in advance which connections need socket buffer
space.  tcp_update_{snd,rcv}space() adjusts it dynamically, and
sbchecklowmem() there has a first-come, first-served policy.  Another
challenge is that the peers on both sides of the relay can decide
whether they fill our buffers.

Besides finding a smarter algorithm to distribute the socket
buffer space, increasing the number of mbufs could be a solution.
Our server machines mostly relay connection data, so it seems
tempting to use much more mbuf memory to speed up TCP connections.
Without 64-bit DMA most of the machine's memory is unused.

Also, a modern BIOS maps only 2GB in the low region.  All DMA devices must
share it.  Putting mbufs high should reduce the pressure.

Of course there are problems with network adapters that support
a smaller DMA address range, and with hotplug configurations.  For a
general solution we could implement bounce buffers, disable the feature
on such machines, or add a knob.

bluhm

Re: [PATCH] let the mbufs use more than 4gb of memory

Mark Kettenis
> Date: Thu, 23 Jun 2016 13:09:28 +0200
> From: Alexander Bluhm <[hidden email]>
>
> On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > secondly, allocating more than 4g at a time to socket buffers is
> > generally a waste of memory. in practice you should scale the amount
> > of memory available to sockets according to the size of the tcp
> > windows you need to saturate the bandwidth available to the box.
>
> Currently OpenBSD limits the socket buffer size to 256k.
> #define SB_MAX          (256*1024)      /* default for max chars in sockbuf */
>
> For downloading large files from the internet this is not sufficinet
> anymore.  After customer complaints we have increased the limit to
> 1MB.  This still does not give maximum throughput, but granting
> more could easily result in running out of mbufs.  16MB would be
> sufficent.
>
> Besides from single connections with high throughput we also have
> a lot of long running connections, say some 10000.  Each connection
> over a relay needs two sockets and four socket buffers.  With 1MB
> limit and 10000 connections the theoretical maximum is 40GB.
>
> It is hard to figure out which connections need socket buffer space
> in advance.  tcp_update_{snd,rcv}space() adjusts it dynamically,
> there sbchecklowmem() has a first come first serve policy.  Another
> challenge is, that the peers on both sides of the relay can decide
> wether they fill our buffers.
>
> Besides from finding a smarter algorithm to distribute the socket
> buffer space, increasing the number of mbufs could be a solution.
> Our server machines mostly relay connection data, there I seems
> seductive to use much more mbuf memory to speed up TCP connetions.
> Without 64 bit DMA most memory of the machine is unused.
>
> Also modern BIOS maps only 2GB in low region.  All DMA devices must
> share these.  Putting mbufs high should reduce pressure.
>
> Of course there are problems with network adaptors that support
> less DMA space and with hotplug configurations.  For a general
> solution we can implement bounce buffers, disable the feature on
> such machines or have a knob.

We really don't want to implement bounce-buffers.  Adding IOMMU
support is probably a better approach as it also brings some security
benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
that does support it doesn't always have it enabled.  But for modern
hardware an iommu is pretty much standard, except for the absolute
low-end.  But those low-end machines tend to have only 2GB of memory
anyway.

Re: [PATCH] let the mbufs use more than 4gb of memory

Chris Cappuccio
Mark Kettenis [[hidden email]] wrote:
>
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
> that does support it doesn't always have it enabled.  But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end.  But those low-end machines tend to have only 2GB of memory
> anyway.

Is the sparc64 iommu code port usable for this purpose?

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/Attic/sg_dma.c

Re: [PATCH] let the mbufs use more than 4gb of memory

Stefan Fritsch
In reply to this post by Mark Kettenis
On Thursday 23 June 2016 14:41:53, Mark Kettenis wrote:
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some
> security benefits.  Not all amd64 hardware supports an IOMMU.  And
> hardware that does support it doesn't always have it enabled.  But
> for modern hardware an iommu is pretty much standard, except for
> the absolute low-end.  But those low-end machines tend to have only
> 2GB of memory anyway.

On amd64, modern would mean skylake or newer. At least until haswell
(not sure about broadwell), Intel considered vt-d to be a high-end
feature and many desktop CPUs don't have it enabled. It is easy to
find systems with >=16 GB RAM without IOMMU.

Stefan

Re: [PATCH] let the mbufs use more than 4gb of memory

Claudio Jeker
In reply to this post by Mark Kettenis
On Thu, Jun 23, 2016 at 02:41:53PM +0200, Mark Kettenis wrote:

> > Date: Thu, 23 Jun 2016 13:09:28 +0200
> > From: Alexander Bluhm <[hidden email]>
> >
> > On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > > secondly, allocating more than 4g at a time to socket buffers is
> > > generally a waste of memory. in practice you should scale the amount
> > > of memory available to sockets according to the size of the tcp
> > > windows you need to saturate the bandwidth available to the box.
> >
> > Currently OpenBSD limits the socket buffer size to 256k.
> > #define SB_MAX          (256*1024)      /* default for max chars in sockbuf */
> >
> > For downloading large files from the internet this is not sufficinet
> > anymore.  After customer complaints we have increased the limit to
> > 1MB.  This still does not give maximum throughput, but granting
> > more could easily result in running out of mbufs.  16MB would be
> > sufficent.
> >
> > Besides from single connections with high throughput we also have
> > a lot of long running connections, say some 10000.  Each connection
> > over a relay needs two sockets and four socket buffers.  With 1MB
> > limit and 10000 connections the theoretical maximum is 40GB.
> >
> > It is hard to figure out which connections need socket buffer space
> > in advance.  tcp_update_{snd,rcv}space() adjusts it dynamically,
> > there sbchecklowmem() has a first come first serve policy.  Another
> > challenge is, that the peers on both sides of the relay can decide
> > wether they fill our buffers.
> >
> > Besides from finding a smarter algorithm to distribute the socket
> > buffer space, increasing the number of mbufs could be a solution.
> > Our server machines mostly relay connection data, there I seems
> > seductive to use much more mbuf memory to speed up TCP connetions.
> > Without 64 bit DMA most memory of the machine is unused.
> >
> > Also modern BIOS maps only 2GB in low region.  All DMA devices must
> > share these.  Putting mbufs high should reduce pressure.
> >
> > Of course there are problems with network adaptors that support
> > less DMA space and with hotplug configurations.  For a general
> > solution we can implement bounce buffers, disable the feature on
> > such machines or have a knob.
>
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
> that does support it doesn't always have it enabled.  But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end.  But those low-end machines tend to have only 2GB of memory
> anyway.

Another option is to use m_defrag() to move the mbuf from high memory down
when needed. I think this is much simpler to implement, and the devices
that need it can be identified fairly easily. This only solves the TX side;
on the RX side the bouncing would need to be done in the socket buffers (it
would make sense to use large mbuf clusters in the socket buffers and copy
the data over).
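
Roughly, the TX idiom this would extend looks like the sketch below (just an
illustration; sc->sc_dmat and map stand for whatever DMA tag and map a driver
keeps per TX slot).  A constrained device would additionally have to defrag,
or bounce, whenever the loaded segments end up above its reachable range, not
only on EFBIG:

	/* sketch of a typical driver TX encap path */
	switch (bus_dmamap_load_mbuf(sc->sc_dmat, map, m, BUS_DMA_NOWAIT)) {
	case 0:
		break;				/* mapped fine */
	case EFBIG:				/* too many segments */
		if (m_defrag(m, M_DONTWAIT) == 0 &&
		    bus_dmamap_load_mbuf(sc->sc_dmat, map, m,
		    BUS_DMA_NOWAIT) == 0)
			break;			/* now one contiguous cluster */
		/* FALLTHROUGH */
	default:
		m_freem(m);			/* give up on this packet */
		return (ENOBUFS);
	}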

--
:wq Claudio

Fwd: [PATCH] let the mbufs use more than 4gb of memory

Simon Mages
In reply to this post by David Gwynne-5
I sent this message to dlg@ directly to discuss my modification of his
diff to make the bigger mbuf clusters work. I got no response so far,
which is why I decided to post it on tech@ directly. Maybe this way I
get some feedback faster :)

BR
Simon

### Original Mail:

---------- Forwarded message ----------
From: Simon Mages <[hidden email]>
Date: Fri, 22 Jul 2016 13:24:24 +0200
Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
To: David Gwynne <[hidden email]>

Hi,

I think I found the problem with your diff regarding the bigger mbuf clusters.

You choose a buffer size based on space and resid, but what happens when resid
is larger than space and space is, for example, 2050? The cluster chosen then
has size 4096, which is too large for the socket buffer. In the past this was
never a problem, because external clusters were only allocated at size
MCLBYTES, and only when space was larger than MCLBYTES.
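
Worked through with those numbers and the diff below: resid = 6392 and
space = 2050 make MCLGETI() pick a 4096 byte cluster, but the additional
clamp against space keeps the copy length inside the socket buffer:

	len = lmin(lmin(resid, space), m->m_ext.ext_size)
	    = lmin(lmin(6392, 2050), 4096)
	    = 2050

so no more data is queued than the receiving buffer can accept.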

diff:
Index: kern/uipc_socket.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.152
diff -u -p -u -p -r1.152 uipc_socket.c
--- kern/uipc_socket.c 13 Jun 2016 21:24:43 -0000 1.152
+++ kern/uipc_socket.c 22 Jul 2016 10:56:02 -0000
@@ -496,15 +496,18 @@ restart:
  mlen = MLEN;
  }
  if (resid >= MINCLSIZE && space >= MCLBYTES) {
- MCLGET(m, M_NOWAIT);
+ MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+ lmin(space, MAXMCLBYTES)));
  if ((m->m_flags & M_EXT) == 0)
  goto nopages;
  if (atomic && top == 0) {
- len = ulmin(MCLBYTES - max_hdr,
-    resid);
+ len = lmin(lmin(resid, space),
+    m->m_ext.ext_size -
+    max_hdr);
  m->m_data += max_hdr;
  } else
- len = ulmin(MCLBYTES, resid);
+ len = lmin(lmin(resid, space),
+    m->m_ext.ext_size);
  space -= len;
  } else {
 nopages:

I have been using this diff for a while now on my notebook and everything works
as expected. But I had no time yet to really test it or to test the performance.
That will be my next step.

I reproduced the unix socket problem you mentioned with the following little
program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/wait.h>

#define FILE "/tmp/afile"

int senddesc(int fd, int so);
int recvdesc(int so);

int
main(void)
{
        struct stat sb;
        int sockpair[2];
        pid_t pid = 0;
        int status;
        int newfile;

        if (unlink(FILE) < 0)
                warn("unlink: %s", FILE);

        int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

        if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
                err(1, "socketpair");

        if ((pid =fork())) {
                senddesc(file, sockpair[0]);
                if (waitpid(pid, &status, 0) < 0)
                        err(1, "waitpid");
        } else {
                newfile = recvdesc(sockpair[1]);
                if (fstat(newfile, &sb) < 0)
                        err(1, "fstat");
        }

        return 0;
}

int
senddesc(int fd, int so)
{
        struct msghdr msg;
        struct cmsghdr *cmsg;
        union {
                struct cmsghdr hdr;
                unsigned char buf[CMSG_SPACE(sizeof(int))];
        } cmsgbuf;

        char *cbuf = calloc(6392, sizeof(char));
        memset(cbuf, 'K', 6392);
        struct iovec iov = {
                .iov_base = cbuf,
                .iov_len = 6392,
        };

        memset(&msg, 0, sizeof(struct msghdr));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = &cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        *(int *)CMSG_DATA(cmsg) = fd;

        struct pollfd pfd[1];
        int nready;
        int wrote = 0;
        int wrote_total = 0;
        pfd[0].fd = so;
        pfd[0].events = POLLOUT;

        while (1) {
                nready = poll(pfd, 1, -1);
                if (nready == -1)
                        err(1, "poll");
                if ((pfd[0].revents & (POLLERR|POLLNVAL)))
                        errx(1, "bad fd %d", pfd[0].fd);
                if ((pfd[0].revents & (POLLOUT|POLLHUP))) {
                        if ((wrote = sendmsg(so, &msg, 0)) < 0)
                                err(1, "sendmsg");
                }
                wrote_total += wrote;
                iov.iov_len -= wrote * sizeof(char);
                if (iov.iov_len <= 0) {
                        printf("send all data: %d byte\n", wrote_total);
                        break;
                }
        }
        return 0;
}

int
recvdesc(int so)
{
        int fd = -1;

        struct msghdr    msg;
        struct cmsghdr  *cmsg;
        union {
                struct cmsghdr hdr;
                unsigned char    buf[CMSG_SPACE(sizeof(int))];
        } cmsgbuf;
        struct iovec iov;
        iov.iov_base = calloc(6392, sizeof(char));
        iov.iov_len = 6392 * sizeof(char);

        memset(&msg, 0, sizeof(struct msghdr));
        msg.msg_control = &cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        struct pollfd pfd[1];
        int nready;
        int read_data = 0;
        int total_read_data = 0;
        pfd[0].fd = so;
        pfd[0].events = POLLIN;

        while (1) {
                nready = poll(pfd, 1, -1);
                if (nready == -1)
                        err(1, "poll");
                if ((pfd[0].revents & (POLLERR|POLLNVAL)))
                        errx(1, "bad fd %d", pfd[0].fd);
                if ((pfd[0].revents & (POLLIN|POLLHUP))) {
                        if ((read_data = recvmsg(so, &msg, 0)) < 0)
                                err(1, "recvmsg");
                }
                total_read_data += read_data;
                iov.iov_len -= read_data * sizeof(char);
                if (iov.iov_len <= 0) {
                        printf("received all data: %d byte\n", total_read_data);
                        break;
                }
        }

        if ((msg.msg_flags & MSG_CTRUNC))
                errx(1, "control message truncated");

        if ((msg.msg_flags & MSG_TRUNC))
                errx(1, "message truncated");

        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
             cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                if (cmsg->cmsg_len == CMSG_LEN(sizeof(int)) &&
                    cmsg->cmsg_level == SOL_SOCKET &&
                    cmsg->cmsg_type == SCM_RIGHTS) {
                        fd = *(int *)CMSG_DATA(cmsg);
                }
        }

        return fd;
}

Without the fix the following happens. MCLGETI will choose a cluster which is
too large for the receive buffer of the second domain socket and will then also
move too much data. The data then goes down the pr_usrreq path, which for a unix
socket is uipc_usrreq. There it uses sbappendcontrol to move the data into the
socket buffer, which is not possible, so sbappendcontrol just returns zero. In
that case the sender thinks it sent all the data, but the receiver gets nothing,
or only a later part that was small enough. There is also no error signaling
there, so the mbufs just leak. Maybe this should be handled in a different way,
but I did not really think about that so far.
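
One possible shape for the missing error path, purely as a sketch (this is
not the actual uipc_usrreq code; so2 stands for the peer socket and all the
surrounding locking and bookkeeping is left out):

	if (sbappendcontrol(&so2->so_rcv, m, control) == 0) {
		/* nothing was queued: free instead of leaking */
		m_freem(control);
		m_freem(m);
		error = ENOBUFS;	/* and report it to the sender */
	} else {
		sorwakeup(so2);
		m = control = NULL;	/* now owned by the receive buffer */
	}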

BR

Simon


2016-07-01 2:45 GMT+02:00, David Gwynne <[hidden email]>:

>
>> On 1 Jul 2016, at 04:44, Simon Mages <[hidden email]> wrote:
>>
>> Do you remember what the problem was you encounter with ospfd and your
>> kern_socket diff?
>
> the unix socket pair between two of the processes closed unexpectedly or had
> the wrong amount of data available. it was too long ago for me to recall
> correctly :(
>
>>
>> I'm not a user of ospf :)
>
> thats fine. id hope more of the multiprocess daemons like ntpd and smtpd
> would exhibit the behaviour too.
>
> do you have a way of testing performance of sockets?
>
> dlg
>
>>
>> 2016-06-22 14:54 GMT+02:00, David Gwynne <[hidden email]>:
>>> On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
>>>> On a System where you use the maximum socketbuffer size of 256kbyte you
>>>> can run out of memory after less then 9k open sockets.
>>>>
>>>> My patch adds a new uvm_constraint for the mbufs with a bigger memory
>>>> area.
>>>> I choose this area after reading the comments in
>>>> sys/arch/amd64/include/pmap.h.
>>>> This patch further changes the maximum sucketbuffer size from 256k to
>>>> 1gb
>>>> as
>>>> it is described in the rfc1323 S2.3.
>>>>
>>>> I tested this diff with the ix, em and urndis driver. I know that this
>>>> diff only works
>>>> for amd64 right now, but i wanted to send this diff as a proposal what
>>>> could be
>>>> done. Maybe somebody has a different solution for this Problem or can
>>>> me
>>>> why
>>>> this is a bad idea.
>>>
>>> hey simon,
>>>
>>> first, some background.
>>>
>>> the 4G watermark is less about limiting the amount of memory used
>>> by the network stack and more about making the memory addressable
>>> by as many devices, including network cards, as possible. we support
>>> older chips that only deal with 32 bit addresses (and one or two
>>> stupid ones with an inability to address over 1G), so we took the
>>> conservative option and made made the memory generally usable without
>>> developers having to think about it much.
>>>
>>> you could argue that if you should be able to give big addresses
>>> to modern cards, but that falls down if you are forwarding packets
>>> between a modern and old card, cos the old card will want to dma
>>> the packet the modern card rxed, but it needs it below the 4g line.
>>> even if you dont have an old card, in todays hotplug world you might
>>> plug an old device in. either way, the future of an mbuf is very
>>> hard for the kernel to predict.
>>>
>>> secondly, allocating more than 4g at a time to socket buffers is
>>> generally a waste of memory. in practice you should scale the amount
>>> of memory available to sockets according to the size of the tcp
>>> windows you need to saturate the bandwidth available to the box.
>>> this means if you want to sustain a gigabit of traffic with a 300ms
>>> round trip time for packets, you'd "only" need ~37.5 megabytes of
>>> buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is
>>> still below 4G. allowing more use of memory for buffers would likely
>>> induce latency.
>>>
>>> the above means that if you want to sustain a single 40G tcp
>>> connection to that host you'd need to be able to place 1.5G on the
>>> socket buffer, which is above the 1G you mention above. however,
>>> if you want to sustain 2 connections, you ideally want to fairly
>>> share the 1.5G between both sockets. they should get 750M each.
>>>
>>> fairly sharing buffers between the sockets may already be in place
>>> in openbsd. when i reworked the pools subsystem i set it up so
>>> things sleeping on memory were woken up in order.
>>>
>>> it occurs to me that perhaps we should limit mbufs by the bytes
>>> they can use rather than the number of them. that would also work
>>> well if we moved to per cpu caches for mbufs and clusters, cos the
>>> number of active mbufs in the system becomes hard to limit accurately
>>> if we want cpus to run independently.
>>>
>>> if you want something to work on in this area, could you look at
>>> letting sockets use the "jumbo" clusters instead of assuming
>>> everything has to be in 2k clusters? i started on thsi with the
>>> diff below, but it broke ospfd and i never got back to it.
>>>
>>> if you get it working, it would be interested to test creating even
>>> bigger cluster pools, eg, a 1M or 4M mbuf cluster.
>>>
>>> cheers,
>>> dlg
>>>
>>> Index: uipc_socket.c
>>> ===================================================================
>>> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
>>> retrieving revision 1.135
>>> diff -u -p -r1.135 uipc_socket.c
>>> --- uipc_socket.c 11 Dec 2014 19:21:57 -0000 1.135
>>> +++ uipc_socket.c 22 Dec 2014 01:11:03 -0000
>>> @@ -493,15 +493,18 @@ restart:
>>> mlen = MLEN;
>>> }
>>> if (resid >= MINCLSIZE && space >= MCLBYTES) {
>>> - MCLGET(m, M_NOWAIT);
>>> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
>>> +    lmin(space, MAXMCLBYTES)));
>>> if ((m->m_flags & M_EXT) == 0)
>>> goto nopages;
>>> if (atomic && top == 0) {
>>> - len = lmin(MCLBYTES - max_hdr,
>>> -    resid);
>>> + len = lmin(resid,
>>> +    m->m_ext.ext_size -
>>> +    max_hdr);
>>> m->m_data += max_hdr;
>>> } else
>>> - len = lmin(MCLBYTES, resid);
>>> + len = lmin(resid,
>>> +    m->m_ext.ext_size);
>>> space -= len;
>>> } else {
>>> nopages:
>>>
>
>

Re: [PATCH] let the mbufs use more than 4gb of memory

David Gwynne-5

> On 1 Aug 2016, at 21:07, Simon Mages <[hidden email]> wrote:
>
> I sent this message to dlg@ directly to discuss my modification of his
> diff to make the
> bigger mbuf clusters work. i got no response so far, thats why i
> decided to post it on tech@
> directly. Maybe this way i get faster some feedback :)

hey simon,

i was travelling when you sent your mail to me and then it fell out of my head. sorry about that.

if this is working correctly then i would like to put it in the tree. from the light testing i have done, it is working correctly. would anyone object?

some performance measurement would also be interesting :)
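
one way to get rough numbers would be tcpbench(1), assuming its -S option
still sets the socket buffer size on both ends:

	# on the receiver
	tcpbench -s -S 1000000
	# on the sender
	tcpbench -S 1000000 <receiver>

comparing throughput with and without the diff for a few -S values should
show whether the bigger clusters help or hurt.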

dlg

>
> BR
> Simon
>
> ### Original Mail:
>
> ---------- Forwarded message ----------
> From: Simon Mages <[hidden email]>
> Date: Fri, 22 Jul 2016 13:24:24 +0200
> Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> To: David Gwynne <[hidden email]>
>
> Hi,
>
> I think i found the problem with your diff regarding the bigger mbuf clusters.
>
> You choose a buffer size based on space and resid, but what happens when resid
> is larger then space and space is for example 2050? The cluster choosen has
> then the size 4096. But this size is to large for the socket buffer. In the
> past this was never a problem because you only allocated external clusters
> of size MCLBYTES and this was only done when space was larger then MCLBYTES.
>
> diff:
> Index: kern/uipc_socket.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.152
> diff -u -p -u -p -r1.152 uipc_socket.c
> --- kern/uipc_socket.c 13 Jun 2016 21:24:43 -0000 1.152
> +++ kern/uipc_socket.c 22 Jul 2016 10:56:02 -0000
> @@ -496,15 +496,18 @@ restart:
> mlen = MLEN;
> }
> if (resid >= MINCLSIZE && space >= MCLBYTES) {
> - MCLGET(m, M_NOWAIT);
> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> + lmin(space, MAXMCLBYTES)));
> if ((m->m_flags & M_EXT) == 0)
> goto nopages;
> if (atomic && top == 0) {
> - len = ulmin(MCLBYTES - max_hdr,
> -    resid);
> + len = lmin(lmin(resid, space),
> +    m->m_ext.ext_size -
> +    max_hdr);
> m->m_data += max_hdr;
> } else
> - len = ulmin(MCLBYTES, resid);
> + len = lmin(lmin(resid, space),
> +    m->m_ext.ext_size);
> space -= len;
> } else {
> nopages:
>
> Im using this diff no for a while on my notebook and everything works as
> expected. But i had no time to realy test it or test the performance. This will
> be my next step.
>
> I reproduced the unix socket problem you mentioned with the following little
> programm:
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <err.h>
> #include <fcntl.h>
> #include <poll.h>
> #include <unistd.h>
>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
>
> #define FILE "/tmp/afile"
>
> int senddesc(int fd, int so);
> int recvdesc(int so);
>
> int
> main(void)
> {
> struct stat sb;
> int sockpair[2];
> pid_t pid = 0;
> int status;
> int newfile;
>
> if (unlink(FILE) < 0)
> warn("unlink: %s", FILE);
>
> int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC);
>
> if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
> err(1, "socketpair");
>
> if ((pid =fork())) {
> senddesc(file, sockpair[0]);
> if (waitpid(pid, &status, 0) < 0)
> err(1, "waitpid");
> } else {
> newfile = recvdesc(sockpair[1]);
> if (fstat(newfile, &sb) < 0)
> err(1, "fstat");
> }
>
> return 0;
> }
>
> int
> senddesc(int fd, int so)
> {
> struct msghdr msg;
> struct cmsghdr *cmsg;
> union {
> struct cmsghdr hdr;
> unsigned char buf[CMSG_SPACE(sizeof(int))];
> } cmsgbuf;
>
> char *cbuf = calloc(6392, sizeof(char));
> memset(cbuf, 'K', 6392);
> struct iovec iov = {
> .iov_base = cbuf,
> .iov_len = 6392,
> };
>
> memset(&msg, 0, sizeof(struct msghdr));
> msg.msg_iov = &iov;
> msg.msg_iovlen = 1;
> msg.msg_control = &cmsgbuf.buf;
> msg.msg_controllen = sizeof(cmsgbuf.buf);
>
> cmsg = CMSG_FIRSTHDR(&msg);
> cmsg->cmsg_len = CMSG_LEN(sizeof(int));
> cmsg->cmsg_level = SOL_SOCKET;
> cmsg->cmsg_type = SCM_RIGHTS;
> *(int *)CMSG_DATA(cmsg) = fd;
>
> struct pollfd pfd[1];
> int nready;
> int wrote = 0;
> int wrote_total = 0;
> pfd[0].fd = so;
> pfd[0].events = POLLOUT;
>
> while (1) {
> nready = poll(pfd, 1, -1);
> if (nready == -1)
> err(1, "poll");
> if ((pfd[0].revents & (POLLERR|POLLNVAL)))
> errx(1, "bad fd %d", pfd[0].fd);
> if ((pfd[0].revents & (POLLOUT|POLLHUP))) {
> if ((wrote = sendmsg(so, &msg, 0)) < 0)
> err(1, "sendmsg");
> }
> wrote_total += wrote;
> iov.iov_len -= wrote * sizeof(char);
> if (iov.iov_len <= 0) {
> printf("send all data: %d byte\n", wrote_total);
> break;
> }
> }
> return 0;
> }
>
> int
> recvdesc(int so)
> {
> int fd = -1;
>
> struct msghdr    msg;
> struct cmsghdr  *cmsg;
> union {
> struct cmsghdr hdr;
> unsigned char    buf[CMSG_SPACE(sizeof(int))];
> } cmsgbuf;
> struct iovec iov;
> iov.iov_base = calloc(6392, sizeof(char));
> iov.iov_len = 6392 * sizeof(char);
>
> memset(&msg, 0, sizeof(struct msghdr));
> msg.msg_control = &cmsgbuf.buf;
> msg.msg_controllen = sizeof(cmsgbuf.buf);
> msg.msg_iov = &iov;
> msg.msg_iovlen = 1;
>
> struct pollfd pfd[1];
> int nready;
> int read_data = 0;
> int total_read_data = 0;
> pfd[0].fd = so;
> pfd[0].events = POLLIN;
>
> while (1) {
> nready = poll(pfd, 1, -1);
> if (nready == -1)
> err(1, "poll");
> if ((pfd[0].revents & (POLLERR|POLLNVAL)))
> errx(1, "bad fd %d", pfd[0].fd);
> if ((pfd[0].revents & (POLLIN|POLLHUP))) {
> if ((read_data = recvmsg(so, &msg, 0)) < 0)
> err(1, "recvmsg");
> }
> total_read_data += read_data;
> iov.iov_len -= read_data * sizeof(char);
> if (iov.iov_len <= 0) {
> printf("received all data: %d byte\n", total_read_data);
> break;
> }
> }
>
> if ((msg.msg_flags & MSG_CTRUNC))
> errx(1, "control message truncated");
>
> if ((msg.msg_flags & MSG_TRUNC))
> errx(1, "message truncated");
>
> for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
>     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> if (cmsg->cmsg_len == CMSG_LEN(sizeof(int)) &&
>    cmsg->cmsg_level == SOL_SOCKET &&
>    cmsg->cmsg_type == SCM_RIGHTS) {
> fd = *(int *)CMSG_DATA(cmsg);
> }
> }
>
> return fd;
> }
>
> Without the fix the following happens. MCLGETI will choose a cluster which is
> to large for the receive buffer of the second domain socket and will then also
> move to much data. Then it goes down the pr_usrreq path, in case of a unix
> socket uipc_usrreq. There it will then use sbappendcontrol to move the data
> into the socket buffer, which is not possible, so it just returns zero. In that
> case the sender thinks he send all the data but the receiver gets nothing, or
> another part which was small enough later on. There is also no error signaling
> there, so what happens is that the mbufs just leak. Maybe this should be
> handled in a different way, but i did not realy think about that so far.
>
> BR
>
> Simon
>
>
> 2016-07-01 2:45 GMT+02:00, David Gwynne <[hidden email]>:
>>
>>> On 1 Jul 2016, at 04:44, Simon Mages <[hidden email]> wrote:
>>>
>>> Do you remember what the problem was you encounter with ospfd and your
>>> kern_socket diff?
>>
>> the unix socket pair between two of the processes closed unexpectedly or had
>> the wrong amount of data available. it was too long ago for me to recall
>> correctly :(
>>
>>>
>>> I'm not a user of ospf :)
>>
>> thats fine. id hope more of the multiprocess daemons like ntpd and smtpd
>> would exhibit the behaviour too.
>>
>> do you have a way of testing performance of sockets?
>>
>> dlg
>>
>>>
>>> 2016-06-22 14:54 GMT+02:00, David Gwynne <[hidden email]>:
>>>> On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
>>>>> On a System where you use the maximum socketbuffer size of 256kbyte you
>>>>> can run out of memory after less then 9k open sockets.
>>>>>
>>>>> My patch adds a new uvm_constraint for the mbufs with a bigger memory
>>>>> area.
>>>>> I choose this area after reading the comments in
>>>>> sys/arch/amd64/include/pmap.h.
>>>>> This patch further changes the maximum sucketbuffer size from 256k to
>>>>> 1gb
>>>>> as
>>>>> it is described in the rfc1323 S2.3.
>>>>>
>>>>> I tested this diff with the ix, em and urndis driver. I know that this
>>>>> diff only works
>>>>> for amd64 right now, but i wanted to send this diff as a proposal what
>>>>> could be
>>>>> done. Maybe somebody has a different solution for this Problem or can
>>>>> me
>>>>> why
>>>>> this is a bad idea.
>>>>
>>>> hey simon,
>>>>
>>>> first, some background.
>>>>
>>>> the 4G watermark is less about limiting the amount of memory used
>>>> by the network stack and more about making the memory addressable
>>>> by as many devices, including network cards, as possible. we support
>>>> older chips that only deal with 32 bit addresses (and one or two
>>>> stupid ones with an inability to address over 1G), so we took the
>>>> conservative option and made made the memory generally usable without
>>>> developers having to think about it much.
>>>>
>>>> you could argue that if you should be able to give big addresses
>>>> to modern cards, but that falls down if you are forwarding packets
>>>> between a modern and old card, cos the old card will want to dma
>>>> the packet the modern card rxed, but it needs it below the 4g line.
>>>> even if you dont have an old card, in todays hotplug world you might
>>>> plug an old device in. either way, the future of an mbuf is very
>>>> hard for the kernel to predict.
>>>>
>>>> secondly, allocating more than 4g at a time to socket buffers is
>>>> generally a waste of memory. in practice you should scale the amount
>>>> of memory available to sockets according to the size of the tcp
>>>> windows you need to saturate the bandwidth available to the box.
>>>> this means if you want to sustain a gigabit of traffic with a 300ms
>>>> round trip time for packets, you'd "only" need ~37.5 megabytes of
>>>> buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is
>>>> still below 4G. allowing more use of memory for buffers would likely
>>>> induce latency.
>>>>
>>>> the above means that if you want to sustain a single 40G tcp
>>>> connection to that host you'd need to be able to place 1.5G on the
>>>> socket buffer, which is above the 1G you mention above. however,
>>>> if you want to sustain 2 connections, you ideally want to fairly
>>>> share the 1.5G between both sockets. they should get 750M each.
>>>>
>>>> fairly sharing buffers between the sockets may already be in place
>>>> in openbsd. when i reworked the pools subsystem i set it up so
>>>> things sleeping on memory were woken up in order.
>>>>
>>>> it occurs to me that perhaps we should limit mbufs by the bytes
>>>> they can use rather than the number of them. that would also work
>>>> well if we moved to per cpu caches for mbufs and clusters, cos the
>>>> number of active mbufs in the system becomes hard to limit accurately
>>>> if we want cpus to run independently.
>>>>
>>>> if you want something to work on in this area, could you look at
>>>> letting sockets use the "jumbo" clusters instead of assuming
>>>> everything has to be in 2k clusters? i started on thsi with the
>>>> diff below, but it broke ospfd and i never got back to it.
>>>>
>>>> if you get it working, it would be interested to test creating even
>>>> bigger cluster pools, eg, a 1M or 4M mbuf cluster.
>>>>
>>>> cheers,
>>>> dlg
>>>>
>>>> Index: uipc_socket.c
>>>> ===================================================================
>>>> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
>>>> retrieving revision 1.135
>>>> diff -u -p -r1.135 uipc_socket.c
>>>> --- uipc_socket.c 11 Dec 2014 19:21:57 -0000 1.135
>>>> +++ uipc_socket.c 22 Dec 2014 01:11:03 -0000
>>>> @@ -493,15 +493,18 @@ restart:
>>>> mlen = MLEN;
>>>> }
>>>> if (resid >= MINCLSIZE && space >= MCLBYTES) {
>>>> - MCLGET(m, M_NOWAIT);
>>>> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
>>>> +    lmin(space, MAXMCLBYTES)));
>>>> if ((m->m_flags & M_EXT) == 0)
>>>> goto nopages;
>>>> if (atomic && top == 0) {
>>>> - len = lmin(MCLBYTES - max_hdr,
>>>> -    resid);
>>>> + len = lmin(resid,
>>>> +    m->m_ext.ext_size -
>>>> +    max_hdr);
>>>> m->m_data += max_hdr;
>>>> } else
>>>> - len = lmin(MCLBYTES, resid);
>>>> + len = lmin(resid,
>>>> +    m->m_ext.ext_size);
>>>> space -= len;
>>>> } else {
>>>> nopages:
>>>>
>>
>>

Re: [PATCH] let the mbufs use more than 4gb of memory

tinkr
In reply to this post by Theo de Raadt
On 2016-06-23 05:42, Theo de Raadt wrote:

>> secondly, allocating more than 4g at a time to socket buffers is
>> generally a waste of memory.
>
> and there is one further problem.
>
> Eventually, this subsystem will starve the system.  Other subsystems
> which also need large amounts of memory, then have to scramble.  There
> have to be backpressure mechanisms in each subsystem to force out
> memory.
>
> There is no such mechanism in socket buffers.
>
> The mechanisms in the remaining parts of the kernel have always proven
> to be weak, as in, they don't interact as nicely as we want, to create
> space.  There has been much work to make them work better.
>
> However in socket buffers, there is no such mechanism.  What are
> you going to do.  Throw data away?  You can't do that.  Therefore,
> you are holding the remaining system components hostage, and your
> diff creates deadlock.
>
> You probably tested your diff under ideal conditions with gobs of
> memory...

The backpressure mechanism to free up [disk IO] buffer cache content is
really effective though, so 90 is a mostly suitable bufcachepercent
sysctl setting, right?
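
(That knob being, for example:

	# sysctl kern.bufcachepercent=90

or the equivalent line in /etc/sysctl.conf.)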

Re: [PATCH] let the mbufs use more than 4gb of memory

Mark Kettenis
In reply to this post by David Gwynne-5
> From: David Gwynne <[hidden email]>
> Date: Fri, 12 Aug 2016 16:38:45 +1000
>
> > On 1 Aug 2016, at 21:07, Simon Mages <[hidden email]> wrote:
> >
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the
> > bigger mbuf clusters work. i got no response so far, thats why i
> > decided to post it on tech@
> > directly. Maybe this way i get faster some feedback :)
>
> hey simon,
>
> i was travelling when you sent your mail to me and then it fell out
> of my head. sorry about that.
>
> if this is working correctly then i would like to put it in the tree. from the light testing i have done, it is working correctly. would anyone object?
>
> some performance measurement would also be interesting :)

Hmm, during debugging I've relied on the fact that only drivers
allocate the larger mbuf clusters for their rx rings.

Anyway, shouldn't the diff be using ulmin()?
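
(The difference being that lmin() compares as signed long while ulmin()
compares as unsigned long, so it comes down to the real types of resid and
space.  For illustration only:

	long n = -1;		/* e.g. a huge size_t reinterpreted as long */
	lmin(n, 4096);		/* -1: the negative value wins */
	ulmin(n, 4096);		/* 4096: (u_long)-1 wraps to ULONG_MAX */

With the wrong variant an over-large length could slip past the clamping.)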


> dlg
>
> >
> > BR
> > Simon
> >
> > ### Original Mail:
> >
> > ---------- Forwarded message ----------
> > From: Simon Mages <[hidden email]>
> > Date: Fri, 22 Jul 2016 13:24:24 +0200
> > Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> > To: David Gwynne <[hidden email]>
> >
> > Hi,
> >
> > I think i found the problem with your diff regarding the bigger mbuf clusters.
> >
> > You choose a buffer size based on space and resid, but what happens when resid
> > is larger then space and space is for example 2050? The cluster choosen has
> > then the size 4096. But this size is to large for the socket buffer. In the
> > past this was never a problem because you only allocated external clusters
> > of size MCLBYTES and this was only done when space was larger then MCLBYTES.
> >
> > diff:
> > Index: kern/uipc_socket.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> > retrieving revision 1.152
> > diff -u -p -u -p -r1.152 uipc_socket.c
> > --- kern/uipc_socket.c 13 Jun 2016 21:24:43 -0000 1.152
> > +++ kern/uipc_socket.c 22 Jul 2016 10:56:02 -0000
> > @@ -496,15 +496,18 @@ restart:
> > mlen = MLEN;
> > }
> > if (resid >= MINCLSIZE && space >= MCLBYTES) {
> > - MCLGET(m, M_NOWAIT);
> > + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> > + lmin(space, MAXMCLBYTES)));
> > if ((m->m_flags & M_EXT) == 0)
> > goto nopages;
> > if (atomic && top == 0) {
> > - len = ulmin(MCLBYTES - max_hdr,
> > -    resid);
> > + len = lmin(lmin(resid, space),
> > +    m->m_ext.ext_size -
> > +    max_hdr);
> > m->m_data += max_hdr;
> > } else
> > - len = ulmin(MCLBYTES, resid);
> > + len = lmin(lmin(resid, space),
> > +    m->m_ext.ext_size);
> > space -= len;
> > } else {
> > nopages:
> >
> > Im using this diff no for a while on my notebook and everything works as
> > expected. But i had no time to realy test it or test the performance. This will
> > be my next step.
> >
> > I reproduced the unix socket problem you mentioned with the following little
> > programm:
> >
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <err.h>
> > #include <fcntl.h>
> > #include <poll.h>
> > #include <unistd.h>
> >
> > #include <sys/socket.h>
> > #include <sys/stat.h>
> > #include <sys/wait.h>
> >
> > #define FILE "/tmp/afile"
> >
> > int senddesc(int fd, int so);
> > int recvdesc(int so);
> >
> > int
> > main(void)
> > {
> > struct stat sb;
> > int sockpair[2];
> > pid_t pid = 0;
> > int status;
> > int newfile;
> >
> > if (unlink(FILE) < 0)
> > warn("unlink: %s", FILE);
> >
> > int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC, 0644);
> > if (file < 0)
> > err(1, "open: %s", FILE);
> >
> > if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
> > err(1, "socketpair");
> >
> > if ((pid = fork())) {
> > senddesc(file, sockpair[0]);
> > if (waitpid(pid, &status, 0) < 0)
> > err(1, "waitpid");
> > } else {
> > newfile = recvdesc(sockpair[1]);
> > if (fstat(newfile, &sb) < 0)
> > err(1, "fstat");
> > }
> >
> > return 0;
> > }
> >
> > int
> > senddesc(int fd, int so)
> > {
> > struct msghdr msg;
> > struct cmsghdr *cmsg;
> > union {
> > struct cmsghdr hdr;
> > unsigned char buf[CMSG_SPACE(sizeof(int))];
> > } cmsgbuf;
> >
> > char *cbuf = calloc(6392, sizeof(char));
> > memset(cbuf, 'K', 6392);
> > struct iovec iov = {
> > .iov_base = cbuf,
> > .iov_len = 6392,
> > };
> >
> > memset(&msg, 0, sizeof(struct msghdr));
> > msg.msg_iov = &iov;
> > msg.msg_iovlen = 1;
> > msg.msg_control = &cmsgbuf.buf;
> > msg.msg_controllen = sizeof(cmsgbuf.buf);
> >
> > cmsg = CMSG_FIRSTHDR(&msg);
> > cmsg->cmsg_len = CMSG_LEN(sizeof(int));
> > cmsg->cmsg_level = SOL_SOCKET;
> > cmsg->cmsg_type = SCM_RIGHTS;
> > *(int *)CMSG_DATA(cmsg) = fd;
> >
> > struct pollfd pfd[1];
> > int nready;
> > int wrote = 0;
> > int wrote_total = 0;
> > pfd[0].fd = so;
> > pfd[0].events = POLLOUT;
> >
> > while (1) {
> > nready = poll(pfd, 1, -1);
> > if (nready == -1)
> > err(1, "poll");
> > if ((pfd[0].revents & (POLLERR|POLLNVAL)))
> > errx(1, "bad fd %d", pfd[0].fd);
> > if ((pfd[0].revents & (POLLOUT|POLLHUP))) {
> > if ((wrote = sendmsg(so, &msg, 0)) < 0)
> > err(1, "sendmsg");
> > }
> > wrote_total += wrote;
> > iov.iov_len -= wrote * sizeof(char);
> > if (iov.iov_len <= 0) {
> > printf("send all data: %d byte\n", wrote_total);
> > break;
> > }
> > }
> > return 0;
> > }
> >
> > int
> > recvdesc(int so)
> > {
> > int fd = -1;
> >
> > struct msghdr    msg;
> > struct cmsghdr  *cmsg;
> > union {
> > struct cmsghdr hdr;
> > unsigned char    buf[CMSG_SPACE(sizeof(int))];
> > } cmsgbuf;
> > struct iovec iov;
> > iov.iov_base = calloc(6392, sizeof(char));
> > iov.iov_len = 6392 * sizeof(char);
> >
> > memset(&msg, 0, sizeof(struct msghdr));
> > msg.msg_control = &cmsgbuf.buf;
> > msg.msg_controllen = sizeof(cmsgbuf.buf);
> > msg.msg_iov = &iov;
> > msg.msg_iovlen = 1;
> >
> > struct pollfd pfd[1];
> > int nready;
> > int read_data = 0;
> > int total_read_data = 0;
> > pfd[0].fd = so;
> > pfd[0].events = POLLIN;
> >
> > while (1) {
> > nready = poll(pfd, 1, -1);
> > if (nready == -1)
> > err(1, "poll");
> > if ((pfd[0].revents & (POLLERR|POLLNVAL)))
> > errx(1, "bad fd %d", pfd[0].fd);
> > if ((pfd[0].revents & (POLLIN|POLLHUP))) {
> > if ((read_data = recvmsg(so, &msg, 0)) < 0)
> > err(1, "recvmsg");
> > }
> > total_read_data += read_data;
> > iov.iov_len -= read_data * sizeof(char);
> > if (iov.iov_len <= 0) {
> > printf("received all data: %d byte\n", total_read_data);
> > break;
> > }
> > }
> >
> > if ((msg.msg_flags & MSG_CTRUNC))
> > errx(1, "control message truncated");
> >
> > if ((msg.msg_flags & MSG_TRUNC))
> > errx(1, "message truncated");
> >
> > for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> >     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> > if (cmsg->cmsg_len == CMSG_LEN(sizeof(int)) &&
> >    cmsg->cmsg_level == SOL_SOCKET &&
> >    cmsg->cmsg_type == SCM_RIGHTS) {
> > fd = *(int *)CMSG_DATA(cmsg);
> > }
> > }
> >
> > return fd;
> > }
> >
> > Without the fix the following happens: MCLGETI will choose a cluster that is
> > too large for the receive buffer of the second domain socket, and will then
> > also move too much data. The data then goes down the pr_usrreq path, which
> > for a unix socket is uipc_usrreq. There sbappendcontrol is used to move the
> > data into the socket buffer, which is not possible, so it just returns zero.
> > In that case the sender thinks it sent all the data, but the receiver gets
> > nothing, or only a later chunk that happened to be small enough. There is
> > also no error signaling, so the mbufs simply leak. Maybe this should be
> > handled in a different way, but I have not really thought about that yet.
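
A sketch of the kind of check that could avoid the silent drop, assuming the
4.4BSD-style sbappendcontrol(sb, m, control) that returns 0 when the chain does
not fit; this is only an illustrative fragment, not a tested patch, and so2,
error and the surrounding switch are placeholders for whatever the real
PRU_SEND code looks like:

	if (control) {
		if (sbappendcontrol(&so2->so_rcv, m, control) == 0) {
			/* did not fit: free both chains and report it */
			m_freem(m);
			m_freem(control);
			m = control = NULL;
			error = ENOBUFS;
			break;
		}
		m = control = NULL;	/* consumed by the socket buffer */
	} else {
		sbappend(&so2->so_rcv, m);
		m = NULL;
	}
	sorwakeup(so2);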
> >
> > BR
> >
> > Simon
> >
> >
> > 2016-07-01 2:45 GMT+02:00, David Gwynne <[hidden email]>:
> >>
> >>> On 1 Jul 2016, at 04:44, Simon Mages <[hidden email]> wrote:
> >>>
> >>> Do you remember what the problem was you encounter with ospfd and your
> >>> kern_socket diff?
> >>
> >> the unix socket pair between two of the processes closed unexpectedly or had
> >> the wrong amount of data available. it was too long ago for me to recall
> >> correctly :(
> >>
> >>>
> >>> I'm not a user of ospf :)
> >>
> >> thats fine. id hope more of the multiprocess daemons like ntpd and smtpd
> >> would exhibit the behaviour too.
> >>
> >> do you have a way of testing performance of sockets?
> >>
> >> dlg
> >>
> >>>
> >>> 2016-06-22 14:54 GMT+02:00, David Gwynne <[hidden email]>:
> >>>> On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> >>>>> On a System where you use the maximum socketbuffer size of 256kbyte you
> >>>>> can run out of memory after less then 9k open sockets.
> >>>>>
> >>>>> My patch adds a new uvm_constraint for the mbufs with a bigger memory
> >>>>> area.
> >>>>> I choose this area after reading the comments in
> >>>>> sys/arch/amd64/include/pmap.h.
> >>>>> This patch further changes the maximum sucketbuffer size from 256k to
> >>>>> 1gb
> >>>>> as
> >>>>> it is described in the rfc1323 S2.3.
> >>>>>
> >>>>> I tested this diff with the ix, em and urndis driver. I know that this
> >>>>> diff only works
> >>>>> for amd64 right now, but i wanted to send this diff as a proposal what
> >>>>> could be
> >>>>> done. Maybe somebody has a different solution for this Problem or can
> >>>>> me
> >>>>> why
> >>>>> this is a bad idea.
> >>>>
> >>>> hey simon,
> >>>>
> >>>> first, some background.
> >>>>
> >>>> the 4G watermark is less about limiting the amount of memory used
> >>>> by the network stack and more about making the memory addressable
> >>>> by as many devices, including network cards, as possible. we support
> >>>> older chips that only deal with 32 bit addresses (and one or two
> >>>> stupid ones with an inability to address over 1G), so we took the
> >>>> conservative option and made the memory generally usable without
> >>>> developers having to think about it much.
> >>>>
> >>>> you could argue that you should be able to give big addresses
> >>>> to modern cards, but that falls down if you are forwarding packets
> >>>> between a modern and old card, cos the old card will want to dma
> >>>> the packet the modern card rxed, but it needs it below the 4g line.
> >>>> even if you dont have an old card, in todays hotplug world you might
> >>>> plug an old device in. either way, the future of an mbuf is very
> >>>> hard for the kernel to predict.
> >>>>
> >>>> secondly, allocating more than 4g at a time to socket buffers is
> >>>> generally a waste of memory. in practice you should scale the amount
> >>>> of memory available to sockets according to the size of the tcp
> >>>> windows you need to saturate the bandwidth available to the box.
> >>>> this means if you want to sustain a gigabit of traffic with a 300ms
> >>>> round trip time for packets, you'd "only" need ~37.5 megabytes of
> >>>> buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is
> >>>> still below 4G. allowing more use of memory for buffers would likely
> >>>> induce latency.
> >>>>
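For reference, those figures are just the bandwidth-delay product; a trivial
program reproduces them:

#include <stdio.h>

int
main(void)
{
	const double rtt = 0.3;			/* 300 ms round trip */
	const double gbits[] = { 1.0, 10.0, 40.0 };
	int i;

	for (i = 0; i < 3; i++) {
		double bytes = gbits[i] * 1e9 / 8.0 * rtt;
		printf("%5.1f Gb/s x 300 ms = %8.1f MB of buffer\n",
		    gbits[i], bytes / 1e6);	/* 37.5, 375, 1500 MB */
	}
	return 0;
}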
> >>>> the above means that if you want to sustain a single 40G tcp
> >>>> connection to that host you'd need to be able to place 1.5G on the
> >>>> socket buffer, which is above the 1G you mention above. however,
> >>>> if you want to sustain 2 connections, you ideally want to fairly
> >>>> share the 1.5G between both sockets. they should get 750M each.
> >>>>
> >>>> fairly sharing buffers between the sockets may already be in place
> >>>> in openbsd. when i reworked the pools subsystem i set it up so
> >>>> things sleeping on memory were woken up in order.
> >>>>
> >>>> it occurs to me that perhaps we should limit mbufs by the bytes
> >>>> they can use rather than the number of them. that would also work
> >>>> well if we moved to per cpu caches for mbufs and clusters, cos the
> >>>> number of active mbufs in the system becomes hard to limit accurately
> >>>> if we want cpus to run independently.
> >>>>
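To make the "limit mbufs by bytes rather than count" idea concrete, a purely
illustrative userland sketch (none of these names exist in the tree): each
allocation reserves bytes against a shared cap and returns them on free, so the
cap stays meaningful no matter how many clusters of whatever size are in
flight:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong mbuf_bytes_used;
static unsigned long mbuf_bytes_limit = 64UL * 1024 * 1024;	/* example cap */

static bool
mbuf_bytes_reserve(unsigned long n)
{
	unsigned long used = atomic_fetch_add(&mbuf_bytes_used, n) + n;

	if (used > mbuf_bytes_limit) {
		atomic_fetch_sub(&mbuf_bytes_used, n);	/* back out */
		return false;
	}
	return true;
}

static void
mbuf_bytes_release(unsigned long n)
{
	atomic_fetch_sub(&mbuf_bytes_used, n);
}

int
main(void)
{
	if (mbuf_bytes_reserve(9216))
		printf("got a 9k cluster's worth of budget\n");
	mbuf_bytes_release(9216);
	return 0;
}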
> >>>> if you want something to work on in this area, could you look at
> >>>> letting sockets use the "jumbo" clusters instead of assuming
> >>>> everything has to be in 2k clusters? i started on this with the
> >>>> diff below, but it broke ospfd and i never got back to it.
> >>>>
> >>>> if you get it working, it would be interesting to test creating even
> >>>> bigger cluster pools, eg, a 1M or 4M mbuf cluster.
> >>>>
> >>>> cheers,
> >>>> dlg
> >>>>
> >>>> Index: uipc_socket.c
> >>>> ===================================================================
> >>>> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> >>>> retrieving revision 1.135
> >>>> diff -u -p -r1.135 uipc_socket.c
> >>>> --- uipc_socket.c 11 Dec 2014 19:21:57 -0000 1.135
> >>>> +++ uipc_socket.c 22 Dec 2014 01:11:03 -0000
> >>>> @@ -493,15 +493,18 @@ restart:
> >>>> mlen = MLEN;
> >>>> }
> >>>> if (resid >= MINCLSIZE && space >= MCLBYTES) {
> >>>> - MCLGET(m, M_NOWAIT);
> >>>> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> >>>> +    lmin(space, MAXMCLBYTES)));
> >>>> if ((m->m_flags & M_EXT) == 0)
> >>>> goto nopages;
> >>>> if (atomic && top == 0) {
> >>>> - len = lmin(MCLBYTES - max_hdr,
> >>>> -    resid);
> >>>> + len = lmin(resid,
> >>>> +    m->m_ext.ext_size -
> >>>> +    max_hdr);
> >>>> m->m_data += max_hdr;
> >>>> } else
> >>>> - len = lmin(MCLBYTES, resid);
> >>>> + len = lmin(resid,
> >>>> +    m->m_ext.ext_size);
> >>>> space -= len;
> >>>> } else {
> >>>> nopages:
> >>>>
> >>
> >>
>
>
>


Re: [PATCH] let the mbufs use more then 4gb of memory

Claudio Jeker
In reply to this post by David Gwynne-5
On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:

>
> > On 1 Aug 2016, at 21:07, Simon Mages <[hidden email]> wrote:
> >
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the bigger mbuf clusters work. I got no response so far,
> > which is why I decided to post it on tech@ directly. Maybe this way I
> > will get some feedback faster :)
>
> hey simon,
>
> i was travelling when you sent your mail to me and then it fell out of my head. sorry about that.
>
> if this is working correctly then i would like to put it in the tree. from the light testing i have done, it is working correctly. would anyone object?
>
> some performance measurement would also be interesting :)
>

I would prefer we take the diff I started at n2k16. I need to dig it out
though.

--
:wq Claudio


Re: [PATCH] let the mbufs use more then 4gb of memory

Mark Kettenis
> Date: Fri, 12 Aug 2016 14:26:34 +0200
> From: Claudio Jeker <[hidden email]>
>
> On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> >
> > > On 1 Aug 2016, at 21:07, Simon Mages <[hidden email]> wrote:
> > >
> > > I sent this message to dlg@ directly to discuss my modification of his
> > > diff to make the bigger mbuf clusters work. I got no response so far,
> > > which is why I decided to post it on tech@ directly. Maybe this way I
> > > will get some feedback faster :)
> >
> > hey simon,
> >
> > i was travelling when you sent your mail to me and then it fell out of my head. sorry about that.
> >
> > if this is working correctly then i would like to put it in the tree. from the light testing i have done, it is working correctly. would anyone object?
> >
> > some performance measurement would also be interesting :)
> >
>
> I would prefer we take the diff I started at n2k16. I need to dig it out
> though.

I think the subject of the thread has become misleading.  At least, the
diff David and Simon are talking about is now about using the larger mbuf
pools for socket buffers, and no longer about using memory above 4G for
them.

David, Simon, best to start all over again, and repost the diff with a
proper subject and explanation.  You shouldn't be forcing other
developers to read through several pages of private conversations.