NSD sendto issue

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

NSD sendto issue

Joerg Jung
Hi,

I run a few busy (~800 req/s) NSD servers which I upgraded
to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
sysctl values and nsd.conf matches the default as well, just
added a few hundred zones.

Now, when I increase servers from default 1 to 2 in nsd.conf:
        server-count: 2
it starts spamming my log with:
        nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable

checking the source, server.c seems not to handle EAGAIN
after sendto() and does not recover or retry, it just increases
txerr statistic count - so answer seems really lost :(

I tried higher debug level, as well as increasing socket buffers to:
        net.inet.udp.recvspace= 65536
        net.inet.udp.sendspace=65636
but both didn’t help and netstat -s -p udp does show
        0 dropped due to full socket buffers  
anyways. So, I don’t believe this is a socket buffer issue.

The same server-count: 2 setting worked fine with 6.3.

Any hints, insights, or pointers?
Does anyone else experience the same?

Thanks,
Regards,
Joerg
Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Otto Moerbeek
On Thu, Sep 26, 2019 at 11:16:21AM +0200, Joerg Jung wrote:

> Hi,
>
> I run a few busy (~800 req/s) NSD servers which I upgraded
> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> sysctl values and nsd.conf matches the default as well, just
> added a few hundred zones.
>
> Now, when I increase servers from default 1 to 2 in nsd.conf:
> server-count: 2
> it starts spamming my log with:
> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>
> checking the source, server.c seems not to handle EAGAIN
> after sendto() and does not recover or retry, it just increases
> txerr statistic count - so answer seems really lost :(
>
> I tried higher debug level, as well as increasing socket buffers to:
> net.inet.udp.recvspace= 65536
> net.inet.udp.sendspace=65636
> but both didn’t help and netstat -s -p udp does show
> 0 dropped due to full socket buffers  
> anyways. So, I don’t believe this is a socket buffer issue.
>
> The same server-count: 2 setting worked fine with 6.3.
>
> Any hints, insights, or pointers?
> Does anyone else experience the same?
>
> Thanks,
> Regards,
> Joerg

This is likely an fd limit issue. Try:

nsd:\
        :openfiles=512:\
        :tc=daemon:

in login.conf, followed by a restart of nsd.

        -Otto

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung


> On 26. Sep 2019, at 12:40, Otto Moerbeek <[hidden email]> wrote:
>
> On Thu, Sep 26, 2019 at 11:16:21AM +0200, Joerg Jung wrote:
>
>> Hi,
>>
>> I run a few busy (~800 req/s) NSD servers which I upgraded
>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>> sysctl values and nsd.conf matches the default as well, just
>> added a few hundred zones.
>>
>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>> server-count: 2
>> it starts spamming my log with:
>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>
>> checking the source, server.c seems not to handle EAGAIN
>> after sendto() and does not recover or retry, it just increases
>> txerr statistic count - so answer seems really lost :(
>>
>> I tried higher debug level, as well as increasing socket buffers to:
>> net.inet.udp.recvspace= 65536
>> net.inet.udp.sendspace=65636
>> but both didn’t help and netstat -s -p udp does show
>> 0 dropped due to full socket buffers  
>> anyways. So, I don’t believe this is a socket buffer issue.
>>
>> The same server-count: 2 setting worked fine with 6.3.
>>
>> Any hints, insights, or pointers?
>> Does anyone else experience the same?
>
> This is likely an fd limit issue. Try:
>
> nsd:\
>        :openfiles=512:\
>        :tc=daemon:
>
> in login.conf, followed by a restart of nsd.

Thanks for reply. Tried that, but unfortunately did not help.

I checked with fstat, and each of the NSD processes has not much
more than 16 fds open, while complaining loudly in log about
sendto() EAGAIN
From what I understand TCP for (A|I)XFR transfers are also limited to
max 100 TCP connections by default (and would show up with “too
many open files” or similar).
So, should all more or less fit into the daemon class defaults.

Also this sendto() part of the server.c code seems not really changed
between 6.3 and 6.5 so, must be something different. Might be changes
in libevent or thread mutex handling, or something…

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Stuart Henderson
In reply to this post by Joerg Jung
On 2019/09/26 11:16, Joerg Jung wrote:

> Hi,
>
> I run a few busy (~800 req/s) NSD servers which I upgraded
> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> sysctl values and nsd.conf matches the default as well, just
> added a few hundred zones.
>
> Now, when I increase servers from default 1 to 2 in nsd.conf:
> server-count: 2
> it starts spamming my log with:
> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>
> checking the source, server.c seems not to handle EAGAIN
> after sendto() and does not recover or retry, it just increases
> txerr statistic count - so answer seems really lost :(
>
> I tried higher debug level, as well as increasing socket buffers to:
> net.inet.udp.recvspace= 65536
> net.inet.udp.sendspace=65636
> but both didn’t help and netstat -s -p udp does show
> 0 dropped due to full socket buffers  
> anyways. So, I don’t believe this is a socket buffer issue.
>
> The same server-count: 2 setting worked fine with 6.3.
>
> Any hints, insights, or pointers?
> Does anyone else experience the same?
>
> Thanks,
> Regards,
> Joerg

Maybe it's worth trying to track down further whether this is due to an
NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
to use "make -f Makefile.bsd-wrapper [..]" when building).

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Stuart Henderson
On 2019/09/26 13:45, Stuart Henderson wrote:

> On 2019/09/26 11:16, Joerg Jung wrote:
> > Hi,
> >
> > I run a few busy (~800 req/s) NSD servers which I upgraded
> > to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> > sysctl values and nsd.conf matches the default as well, just
> > added a few hundred zones.
> >
> > Now, when I increase servers from default 1 to 2 in nsd.conf:
> > server-count: 2
> > it starts spamming my log with:
> > nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
> >
> > checking the source, server.c seems not to handle EAGAIN
> > after sendto() and does not recover or retry, it just increases
> > txerr statistic count - so answer seems really lost :(
> >
> > I tried higher debug level, as well as increasing socket buffers to:
> > net.inet.udp.recvspace= 65536
> > net.inet.udp.sendspace=65636
> > but both didn’t help and netstat -s -p udp does show
> > 0 dropped due to full socket buffers  
> > anyways. So, I don’t believe this is a socket buffer issue.
> >
> > The same server-count: 2 setting worked fine with 6.3.
> >
> > Any hints, insights, or pointers?
> > Does anyone else experience the same?
> >
> > Thanks,
> > Regards,
> > Joerg
>
> Maybe it's worth trying to track down further whether this is due to an
> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
> to use "make -f Makefile.bsd-wrapper [..]" when building).
>

Or, following a comment from claudio@, try a kernel built with this:

Index: syscalls.master
===================================================================
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.189
diff -u -p -r1.189 syscalls.master
--- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
+++ syscalls.master 26 Sep 2019 13:01:46 -0000
@@ -261,7 +261,7 @@
 130 OBSOL oftruncate
 131 STD { int sys_flock(int fd, int how); }
 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
-133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
+133 STD { ssize_t sys_sendto(int s, const void *buf, \
     size_t len, int flags, const struct sockaddr *to, \
     socklen_t tolen); }
 134 STD { int sys_shutdown(int s, int how); }


Run "make syscalls" in sys/kern before building.

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Stuart Henderson
In reply to this post by Stuart Henderson
> > I run a few busy (~800 req/s) NSD servers which I upgraded
> > to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any

Did you jump over 6.4 when you updated? (i.e. did you update
directly from 6.3 to 6.5?)

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung
On Thu, Sep 26, 2019 at 02:51:52PM +0100, Stuart Henderson wrote:
> > > I run a few busy (~800 req/s) NSD servers which I upgraded
> > > to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>
> Did you jump over 6.4 when you updated? (i.e. did you update
> directly from 6.3 to 6.5?)

Of course not: 6.3->6.4->6.5. (I just skipped syspatch'ing 6.4).
Since I upgraded fully automated via some Ansible playbooks (strictly
following upgradeXX.html) I have not really tested/checked 6.4, however
I have looked in the logs right now and it seems like this was an issue
in 6.4 as well already, as I see a few instance of the sendto() EAGAIN
there as well (in the short timeframe while it was downloading the 6.5
tarballs).

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung
In reply to this post by Stuart Henderson

> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
> On 2019/09/26 13:45, Stuart Henderson wrote:
>> On 2019/09/26 11:16, Joerg Jung wrote:
>>> Hi,
>>>
>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>> sysctl values and nsd.conf matches the default as well, just
>>> added a few hundred zones.
>>>
>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>> server-count: 2
>>> it starts spamming my log with:
>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>
>>> checking the source, server.c seems not to handle EAGAIN
>>> after sendto() and does not recover or retry, it just increases
>>> txerr statistic count - so answer seems really lost :(
>>>
>>> I tried higher debug level, as well as increasing socket buffers to:
>>> net.inet.udp.recvspace= 65536
>>> net.inet.udp.sendspace=65636
>>> but both didn’t help and netstat -s -p udp does show
>>> 0 dropped due to full socket buffers  
>>> anyways. So, I don’t believe this is a socket buffer issue.
>>>
>>> The same server-count: 2 setting worked fine with 6.3.
>>>
>>> Any hints, insights, or pointers?
>>> Does anyone else experience the same?
>>>
>>> Thanks,
>>> Regards,
>>> Joerg
>>
>> Maybe it's worth trying to track down further whether this is due to an
>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
>> to use "make -f Makefile.bsd-wrapper [..]" when building).
>>
>
> Or, following a comment from claudio@, try a kernel built with this:

FYI, I tried that diff and a few other things but neither did help.

> Index: syscalls.master
> ===================================================================
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.189
> diff -u -p -r1.189 syscalls.master
> --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
> @@ -261,7 +261,7 @@
> 130 OBSOL oftruncate
> 131 STD { int sys_flock(int fd, int how); }
> 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
> -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
> +133 STD { ssize_t sys_sendto(int s, const void *buf, \
>    size_t len, int flags, const struct sockaddr *to, \
>    socklen_t tolen); }
> 134 STD { int sys_shutdown(int s, int how); }
>
>
> Run "make syscalls" in sys/kern before building.

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Martin Pieuchot
On 17/02/20(Mon) 14:55, Joerg Jung wrote:

>
> > On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
> > On 2019/09/26 13:45, Stuart Henderson wrote:
> >> On 2019/09/26 11:16, Joerg Jung wrote:
> >>> Hi,
> >>>
> >>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>> sysctl values and nsd.conf matches the default as well, just
> >>> added a few hundred zones.
> >>>
> >>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>> server-count: 2
> >>> it starts spamming my log with:
> >>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
> >>>
> >>> checking the source, server.c seems not to handle EAGAIN
> >>> after sendto() and does not recover or retry, it just increases
> >>> txerr statistic count - so answer seems really lost :(
> >>>
> >>> I tried higher debug level, as well as increasing socket buffers to:
> >>> net.inet.udp.recvspace= 65536
> >>> net.inet.udp.sendspace=65636
> >>> but both didn’t help and netstat -s -p udp does show
> >>> 0 dropped due to full socket buffers  
> >>> anyways. So, I don’t believe this is a socket buffer issue.
> >>>
> >>> The same server-count: 2 setting worked fine with 6.3.
> >>>
> >>> Any hints, insights, or pointers?
> >>> Does anyone else experience the same?
> >>>
> >>> Thanks,
> >>> Regards,
> >>> Joerg
> >>
> >> Maybe it's worth trying to track down further whether this is due to an
> >> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
> >> to use "make -f Makefile.bsd-wrapper [..]" when building).
> >>
> >
> > Or, following a comment from claudio@, try a kernel built with this:
>
> FYI, I tried that diff and a few other things but neither did help.

Did you ktrace(1) the problem?  How is sendto(2) called, in particular
is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  Does
that mean the kernel returns EWOULDBLOCK even if the userland said it is
fine to block?


>
> > Index: syscalls.master
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/syscalls.master,v
> > retrieving revision 1.189
> > diff -u -p -r1.189 syscalls.master
> > --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
> > +++ syscalls.master 26 Sep 2019 13:01:46 -0000
> > @@ -261,7 +261,7 @@
> > 130 OBSOL oftruncate
> > 131 STD { int sys_flock(int fd, int how); }
> > 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
> > -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
> > +133 STD { ssize_t sys_sendto(int s, const void *buf, \
> >    size_t len, int flags, const struct sockaddr *to, \
> >    socklen_t tolen); }
> > 134 STD { int sys_shutdown(int s, int how); }
> >
> >
> > Run "make syscalls" in sys/kern before building.
>

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung

> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email]> wrote:
> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>
>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>
>>>>>
>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>> added a few hundred zones.
>>>>>
>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>> server-count: 2
>>>>> it starts spamming my log with:
>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>>>
>>>>> checking the source, server.c seems not to handle EAGAIN
>>>>> after sendto() and does not recover or retry, it just increases
>>>>> txerr statistic count - so answer seems really lost :(
>>>>>
>>>>> I tried higher debug level, as well as increasing socket buffers to:
>>>>> net.inet.udp.recvspace= 65536
>>>>> net.inet.udp.sendspace=65636
>>>>> but both didn’t help and netstat -s -p udp does show
>>>>> 0 dropped due to full socket buffers  
>>>>> anyways. So, I don’t believe this is a socket buffer issue.
>>>>>
>>>>> The same server-count: 2 setting worked fine with 6.3.
>>>>>
>>>>> Any hints, insights, or pointers?
>>>>> Does anyone else experience the same?
>>>>>
>>>>
>>>> Maybe it's worth trying to track down further whether this is due to an
>>>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
>>>> to use "make -f Makefile.bsd-wrapper [..]" when building).
>>>>
>>>
>>> Or, following a comment from claudio@, try a kernel built with this:
>>
>> FYI, I tried that diff and a few other things but neither did help.

FYI, after upgrade to 6.7 the issue is still the same,
just with different format for the syslog message not
explicitly mentioning sendto() anymore, the new error
log repeating gazillion of times looks like this:
    nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily unavailable

> Did you ktrace(1) the problem?

Yes, please see below.

> How is sendto(2) called, in particular
> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  

From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
Neither, sendto() nor recvfrom() seem to have any flags set.

> Does
> that mean the kernel returns EWOULDBLOCK even if the userland said it is
> fine to block?

I don’t think so.


From my limited understanding it seems the two processes
seem to concurrently try to recvfrom() after BOTH received kevent()
notification with only one being the “winner” and the other one
logging the failing?

Shouldn’t only one of the two try to call recvfrom()?

Not sure if this is a bug in OpenBSD or NSD, but
it worked fine in earlier releases.

However, this one seems related:
https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>



 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
 52438 nsd      RET   kevent 1
 55919 nsd      RET   kevent 1
 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      STRU  struct timespec { 1224.307942259 }
 55919 nsd      STRU  struct timespec { 1224.307946954 }
 52438 nsd      RET   clock_gettime 0
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
 55919 nsd      CALL  recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
 52438 nsd      GIO   fd 5 read 46 bytes
       "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      RET   recvfrom 46/0x2e
 55919 nsd      STRU  struct timespec { 1224.307967366 }
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 2020">.287233 }
 55919 nsd      STRU  struct timespec { 1224.307977172 }
 52438 nsd      RET   gettimeofday 0
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
 55919 nsd      STRU  struct timespec { 109.487449000 }
 52438 nsd      GIO   fd 5 wrote 46 bytes
       "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
 52438 nsd      RET   sendto 46/0x2e




>>> Index: syscalls.master
>>> ===================================================================
>>> RCS file: /cvs/src/sys/kern/syscalls.master,v
>>> retrieving revision 1.189
>>> diff -u -p -r1.189 syscalls.master
>>> --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
>>> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
>>> @@ -261,7 +261,7 @@
>>> 130 OBSOL oftruncate
>>> 131 STD { int sys_flock(int fd, int how); }
>>> 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
>>> -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
>>> +133 STD { ssize_t sys_sendto(int s, const void *buf, \
>>>    size_t len, int flags, const struct sockaddr *to, \
>>>    socklen_t tolen); }
>>> 134 STD { int sys_shutdown(int s, int how); }
>>>
>>>
>>> Run "make syscalls" in sys/kern before building.

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Otto Moerbeek
On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:

>
> > On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email]> wrote:
> > On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>
> >>> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
> >>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>
> >>>>>
> >>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>>>> sysctl values and nsd.conf matches the default as well, just
> >>>>> added a few hundred zones.
> >>>>>
> >>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>>>> server-count: 2
> >>>>> it starts spamming my log with:
> >>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
> >>>>>
> >>>>> checking the source, server.c seems not to handle EAGAIN
> >>>>> after sendto() and does not recover or retry, it just increases
> >>>>> txerr statistic count - so answer seems really lost :(
> >>>>>
> >>>>> I tried higher debug level, as well as increasing socket buffers to:
> >>>>> net.inet.udp.recvspace= 65536
> >>>>> net.inet.udp.sendspace=65636
> >>>>> but both didn’t help and netstat -s -p udp does show
> >>>>> 0 dropped due to full socket buffers  
> >>>>> anyways. So, I don’t believe this is a socket buffer issue.
> >>>>>
> >>>>> The same server-count: 2 setting worked fine with 6.3.
> >>>>>
> >>>>> Any hints, insights, or pointers?
> >>>>> Does anyone else experience the same?
> >>>>>
> >>>>
> >>>> Maybe it's worth trying to track down further whether this is due to an
> >>>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
> >>>> to use "make -f Makefile.bsd-wrapper [..]" when building).
> >>>>
> >>>
> >>> Or, following a comment from claudio@, try a kernel built with this:
> >>
> >> FYI, I tried that diff and a few other things but neither did help.
>
> FYI, after upgrade to 6.7 the issue is still the same,
> just with different format for the syslog message not
> explicitly mentioning sendto() anymore, the new error
> log repeating gazillion of times looks like this:
>     nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily unavailable
>
> > Did you ktrace(1) the problem?
>
> Yes, please see below.
>
> > How is sendto(2) called, in particular
> > is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
>
> From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
> Neither, sendto() nor recvfrom() seem to have any flags set.

I think this is an error in your analysis. It's perfectly normal for
recvform to return EAGAIN.

On the sending side it indicates an issue.

        -Otto


>
> > Does
> > that mean the kernel returns EWOULDBLOCK even if the userland said it is
> > fine to block?
>
> I don’t think so.
>
>
> From my limited understanding it seems the two processes
> seem to concurrently try to recvfrom() after BOTH received kevent()
> notification with only one being the “winner” and the other one
> logging the failing?
>
> Shouldn’t only one of the two try to call recvfrom()?
>
> Not sure if this is a bug in OpenBSD or NSD, but
> it worked fine in earlier releases.
>
> However, this one seems related:
> https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>
>
>
>
>  52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
> , udata=0x30562fe4198 }
>  55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
> , udata=0x30562fe4198 }
>  52438 nsd      RET   kevent 1
>  55919 nsd      RET   kevent 1
>  52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>  55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>  52438 nsd      STRU  struct timespec { 1224.307942259 }
>  55919 nsd      STRU  struct timespec { 1224.307946954 }
>  52438 nsd      RET   clock_gettime 0
>  55919 nsd      RET   clock_gettime 0
>  52438 nsd      CALL  recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>  55919 nsd      CALL  recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>  52438 nsd      GIO   fd 5 read 46 bytes
>        "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>  55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
>  52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>  55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>  52438 nsd      RET   recvfrom 46/0x2e
>  55919 nsd      STRU  struct timespec { 1224.307967366 }
>  55919 nsd      RET   clock_gettime 0
>  52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
>  55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>  52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 2020">.287233 }
>  55919 nsd      STRU  struct timespec { 1224.307977172 }
>  52438 nsd      RET   gettimeofday 0
>  55919 nsd      RET   clock_gettime 0
>  52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>  55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>  52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>  55919 nsd      STRU  struct timespec { 109.487449000 }
>  52438 nsd      GIO   fd 5 wrote 46 bytes
>        "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>  52438 nsd      RET   sendto 46/0x2e
>
>
>
>
> >>> Index: syscalls.master
> >>> ===================================================================
> >>> RCS file: /cvs/src/sys/kern/syscalls.master,v
> >>> retrieving revision 1.189
> >>> diff -u -p -r1.189 syscalls.master
> >>> --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
> >>> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
> >>> @@ -261,7 +261,7 @@
> >>> 130 OBSOL oftruncate
> >>> 131 STD { int sys_flock(int fd, int how); }
> >>> 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
> >>> -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
> >>> +133 STD { ssize_t sys_sendto(int s, const void *buf, \
> >>>    size_t len, int flags, const struct sockaddr *to, \
> >>>    socklen_t tolen); }
> >>> 134 STD { int sys_shutdown(int s, int how); }
> >>>
> >>>
> >>> Run "make syscalls" in sys/kern before building.
>

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung

> On 18. Jun 2020, at 09:47, Otto Moerbeek <[hidden email]> wrote:
> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email]> wrote:
>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>>>
>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>>>
>>>>>>>
>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>>>> added a few hundred zones.
>>>>>>>
>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>>>> server-count: 2
>>>>>>> it starts spamming my log with:
>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>>>>>
>>
>> FYI, after upgrade to 6.7 the issue is still the same,
>> just with different format for the syslog message not
>> explicitly mentioning sendto() anymore, the new error
>> log repeating gazillion of times looks like this:
>>    nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily unavailable
>>
>>> Did you ktrace(1) the problem?
>>
>> Yes, please see below.
>>
>>> How is sendto(2) called, in particular
>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
>>
>> From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
>> Neither, sendto() nor recvfrom() seem to have any flags set.
>
> I think this is an error in your analysis. It's perfectly normal for
> recvform to return EAGAIN.

Yes, the log message coming from sendto() / sendmmsg() was
misleading here.

> On the sending side it indicates an issue.

Yes, as verified with ktrace the sendto() works fine.

However, it’s not normal to write 200 log messages per second for
recvfrom() EAGAIN.

So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
 — if verbosity is 1, which is OpenBSD default.

Besides, I believe (reading man pages) these network errors
should be only logged at a higher verbosity 2 anyways.

Unfortunately, I’m not able to open a bug report in bugzilla upstream
anymore, probably because they moved things to Github nowadays.


>> However, this one seems related:
>> https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385><https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>>
>>
>>
>>
>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
>> , udata=0x30562fe4198 }
>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
>> , udata=0x30562fe4198 }
>> 52438 nsd      RET   kevent 1
>> 55919 nsd      RET   kevent 1
>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
>> 52438 nsd      RET   clock_gettime 0
>> 55919 nsd      RET   clock_gettime 0
>> 52438 nsd      CALL recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>> 55919 nsd      CALL recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>> 52438 nsd      GIO   fd 5 read 46 bytes
>>       "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>> 52438 nsd      RET   recvfrom 46/0x2e
>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
>> 55919 nsd      RET   clock_gettime 0
>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 2020">.287233 }
>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
>> 52438 nsd      RET   gettimeofday 0
>> 55919 nsd      RET   clock_gettime 0
>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>> 55919 nsd      STRU  struct timespec { 109.487449000 }
>> 52438 nsd      GIO   fd 5 wrote 46 bytes
>>       "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>> 52438 nsd      RET   sendto 46/0x2e

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Otto Moerbeek
On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:

>
> > On 18. Jun 2020, at 09:47, Otto Moerbeek <[hidden email]> wrote:
> > On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
> >>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email]> wrote:
> >>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>>>
> >>>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
> >>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>>>>>> sysctl values and nsd.conf matches the default as well, just
> >>>>>>> added a few hundred zones.
> >>>>>>>
> >>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>>>>>> server-count: 2
> >>>>>>> it starts spamming my log with:
> >>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
> >>>>>>>
> >>
> >> FYI, after upgrade to 6.7 the issue is still the same,
> >> just with different format for the syslog message not
> >> explicitly mentioning sendto() anymore, the new error
> >> log repeating gazillion of times looks like this:
> >>    nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily unavailable
> >>
> >>> Did you ktrace(1) the problem?
> >>
> >> Yes, please see below.
> >>
> >>> How is sendto(2) called, in particular
> >>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
> >>
> >> From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
> >> Neither, sendto() nor recvfrom() seem to have any flags set.
> >
> > I think this is an error in your analysis. It's perfectly normal for
> > recvform to return EAGAIN.
>
> Yes, the log message coming from sendto() / sendmmsg() was
> misleading here.
>
> > On the sending side it indicates an issue.
>
> Yes, as verified with ktrace the sendto() works fine.
>
> However, it’s not normal to write 200 log messages per second for
> recvfrom() EAGAIN.
>
> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
>  — if verbosity is 1, which is OpenBSD default.
>
> Besides, I believe (reading man pages) these network errors
> should be only logged at a higher verbosity 2 anyways.
>
> Unfortunately, I’m not able to open a bug report in bugzilla upstream
> anymore, probably because they moved things to Github nowadays.

Well, you might consider the obvious...

        -Otto

>
>
> >> However, this one seems related:
> >> https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385><https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>>
> >>
> >>
> >>
> >> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
> >> , udata=0x30562fe4198 }
> >> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
> >> , udata=0x30562fe4198 }
> >> 52438 nsd      RET   kevent 1
> >> 55919 nsd      RET   kevent 1
> >> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >> 52438 nsd      STRU  struct timespec { 1224.307942259 }
> >> 55919 nsd      STRU  struct timespec { 1224.307946954 }
> >> 52438 nsd      RET   clock_gettime 0
> >> 55919 nsd      RET   clock_gettime 0
> >> 52438 nsd      CALL recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >> 55919 nsd      CALL recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >> 52438 nsd      GIO   fd 5 read 46 bytes
> >>       "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
> >> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >> 52438 nsd      RET   recvfrom 46/0x2e
> >> 55919 nsd      STRU  struct timespec { 1224.307967366 }
> >> 55919 nsd      RET   clock_gettime 0
> >> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
> >> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 2020">.287233 }
> >> 55919 nsd      STRU  struct timespec { 1224.307977172 }
> >> 52438 nsd      RET   gettimeofday 0
> >> 55919 nsd      RET   clock_gettime 0
> >> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
> >> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
> >> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >> 55919 nsd      STRU  struct timespec { 109.487449000 }
> >> 52438 nsd      GIO   fd 5 wrote 46 bytes
> >>       "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >> 52438 nsd      RET   sendto 46/0x2e
>

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung

> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[hidden email]> wrote:
> On 18/06/2020 11:49, Otto Moerbeek wrote:
>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[hidden email]> wrote:
>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email]> wrote:
>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>>>>>>
>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[hidden email]> wrote:
>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>>>>>>> added a few hundred zones.
>>>>>>>>>>
>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>>>>>>> server-count: 2
>>>>>>>>>> it starts spamming my log with:
>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>>>>>>>>
>>>>>
>>>>> FYI, after upgrade to 6.7 the issue is still the same,
>>>>> just with different format for the syslog message not
>>>>> explicitly mentioning sendto() anymore, the new error
>>>>> log repeating gazillion of times looks like this:
>>>>>   nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily unavailable
>>>>>
>>>>>> Did you ktrace(1) the problem?
>>>>>
>>>>> Yes, please see below.
>>>>>
>>>>>> How is sendto(2) called, in particular
>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
>>>>>
>>>>> From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
>>>>
>>>> I think this is an error in your analysis. It's perfectly normal for
>>>> recvform to return EAGAIN.
>>>
>>> Yes, the log message coming from sendto() / sendmmsg() was
>>> misleading here.
>>>
>>>> On the sending side it indicates an issue.
>>>
>>> Yes, as verified with ktrace the sendto() works fine.
>>>
>>> However, it’s not normal to write 200 log messages per second for
>>> recvfrom() EAGAIN.
>>>
>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
>>> — if verbosity is 1, which is OpenBSD default.
>>>
>>> Besides, I believe (reading man pages) these network errors
>>> should be only logged at a higher verbosity 2 anyways.
>
> Commit
> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506 <https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506>
> should remove the EAGAIN log lines so they don't bother you.

Thanks for looking into it.

> At first I thought you were talking about the return value of the
> syscall, but instead it is one of the elements of the send list.

There seems only 1 element in the send list (usually),
which is sent fully and successfully with no error from sendto()
and no EAGAIN being generated.

So, I wonder where the EAGAIN is actually coming from:
Looking at the ktrace output below and the code it seems most
likely to be propagated from recvmmsg() to sendmmsg(), no?

Wild guess: Is there an off-by-one in the while loop, which results
in sendmmsg() called again, which restores the previous errno
from recvmmsg() but does not hit sendto() again.

So, your recent commit may just “hide" the actual “issue".


>>>>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>> , udata=0x30562fe4198 }
>>>>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>> , udata=0x30562fe4198 }
>>>>> 52438 nsd      RET   kevent 1
>>>>> 55919 nsd      RET   kevent 1
>>>>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
>>>>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
>>>>> 52438 nsd      RET   clock_gettime 0
>>>>> 55919 nsd      RET   clock_gettime 0
>>>>> 52438 nsd      CALL recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>> 55919 nsd      CALL recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>> 52438 nsd      GIO   fd 5 read 46 bytes
>>>>>      "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>> 52438 nsd      RET   recvfrom 46/0x2e
>>>>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
>>>>> 55919 nsd      RET   clock_gettime 0
>>>>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 2020">.287233 }
>>>>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
>>>>> 52438 nsd      RET   gettimeofday 0
>>>>> 55919 nsd      RET   clock_gettime 0
>>>>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>>>>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>> 55919 nsd      STRU  struct timespec { 109.487449000 }
>>>>> 52438 nsd      GIO   fd 5 wrote 46 bytes
>>>>>      "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>> 52438 nsd      RET   sendto 46/0x2e

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Joerg Jung


> On 19. Jun 2020, at 15:03, Wouter Wijngaards <[hidden email]> wrote:
> On 19/06/2020 13:50, Joerg Jung wrote:
>>
>>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[hidden email] <mailto:[hidden email]>
>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>> On 18/06/2020 11:49, Otto Moerbeek wrote:
>>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
>>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[hidden email] <mailto:[hidden email]>
>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email] <mailto:[hidden email]>
>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>>>>>>>>
>>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
>>>>>>>>>> <[hidden email] <mailto:[hidden email]> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>>>>>>>>> added a few hundred zones.
>>>>>>>>>>>>
>>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>>>>>>>>> server-count: 2
>>>>>>>>>>>> it starts spamming my log with:
>>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
>>>>>>>>>>>> unavailable
>>>>>>>>>>>>
>>>>>>>
>>>>>>> FYI, after upgrade to 6.7 the issue is still the same,
>>>>>>> just with different format for the syslog message not
>>>>>>> explicitly mentioning sendto() anymore, the new error
>>>>>>> log repeating gazillion of times looks like this:
>>>>>>>   nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
>>>>>>> temporarily unavailable
>>>>>>>
>>>>>>>> Did you ktrace(1) the problem?
>>>>>>>
>>>>>>> Yes, please see below.
>>>>>>>
>>>>>>>> How is sendto(2) called, in particular
>>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
>>>>>>>
>>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
>>>>>>> recvfrom().
>>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
>>>>>>
>>>>>> I think this is an error in your analysis. It's perfectly normal for
>>>>>> recvform to return EAGAIN.
>>>>>
>>>>> Yes, the log message coming from sendto() / sendmmsg() was
>>>>> misleading here.
>>>>>
>>>>>> On the sending side it indicates an issue.
>>>>>
>>>>> Yes, as verified with ktrace the sendto() works fine.
>>>>>
>>>>> However, it’s not normal to write 200 log messages per second for
>>>>> recvfrom() EAGAIN.
>>>>>
>>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
>>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
>>>>> — if verbosity is 1, which is OpenBSD default.
>>>>>
>>>>> Besides, I believe (reading man pages) these network errors
>>>>> should be only logged at a higher verbosity 2 anyways.
>>>
>>> Commit
>>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
>>> should remove the EAGAIN log lines so they don't bother you.
>>
>> Thanks for looking into it.
>>
>>> At first I thought you were talking about the return value of the
>>> syscall, but instead it is one of the elements of the send list.
>>
>> There seems only 1 element in the send list (usually),
>> which is sent fully and successfully with no error from sendto()
>> and no EAGAIN being generated.
>>
>> So, I wonder where the EAGAIN is actually coming from:
>> Looking at the ktrace output below and the code it seems most
>> likely to be propagated from recvmmsg() to sendmmsg(), no?
>>
>> Wild guess: Is there an off-by-one in the while loop, which results
>> in sendmmsg() called again, which restores the previous errno
>> from recvmmsg() but does not hit sendto() again.
>>
>> So, your recent commit may just “hide" the actual “issue".
>
> You may be right, does this commit look better:
> https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104 <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
>
> It retries when UDP send buffers are full and sets the blocking flag so
> that the operation completes.

Hah. that’s a large hammer, which may work and seems the right thing
to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
buffers.

But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(),
since sendto() just finishes fine (see last line in ktrace below) -- so no buffer
issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and
none from sendto() (or any other syscall). As I wrote above, the EAGAIN
seems just “propagated” from recvmmsg() to sendmmsg() due to the errno
restore dance.

Why does the code postpone the error handling (and error logging) at all and
stores / restore errno for later iterations - why not instead handle the error (logging)
immediately?


>>>>>>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 52438 nsd      RET   kevent 1
>>>>>>> 55919 nsd      RET   kevent 1
>>>>>>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
>>>>>>> 52438 nsd      RET   clock_gettime 0
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL
>>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 55919 nsd      CALL
>>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 52438 nsd      GIO   fd 5 read 46 bytes
>>>>>>>    
>>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily
>>>>>>> unavailable
>>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      RET   recvfrom 46/0x2e
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22
>>>>>>> 2020">.287233 }
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
>>>>>>> 52438 nsd      RET   gettimeofday 0
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>>>>>>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd      STRU  struct timespec { 109.487449000 }
>>>>>>> 52438 nsd      GIO   fd 5 wrote 46 bytes
>>>>>>>    
>>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 52438 nsd      RET   sendto 46/0x2e

Reply | Threaded
Open this post in threaded view
|

Re: NSD sendto issue

Otto Moerbeek
On Sun, Jun 21, 2020 at 10:55:55PM +0200, Joerg Jung wrote:

>
>
> > On 19. Jun 2020, at 15:03, Wouter Wijngaards <[hidden email]> wrote:
> > On 19/06/2020 13:50, Joerg Jung wrote:
> >>
> >>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[hidden email] <mailto:[hidden email]>
> >>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
> >>> On 18/06/2020 11:49, Otto Moerbeek wrote:
> >>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
> >>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[hidden email] <mailto:[hidden email]>
> >>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
> >>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
> >>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[hidden email] <mailto:[hidden email]>
> >>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
> >>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>>>>>>>>
> >>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
> >>>>>>>>>> <[hidden email] <mailto:[hidden email]> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
> >>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just
> >>>>>>>>>>>> added a few hundred zones.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>>>>>>>>>>> server-count: 2
> >>>>>>>>>>>> it starts spamming my log with:
> >>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
> >>>>>>>>>>>> unavailable
> >>>>>>>>>>>>
> >>>>>>>
> >>>>>>> FYI, after upgrade to 6.7 the issue is still the same,
> >>>>>>> just with different format for the syslog message not
> >>>>>>> explicitly mentioning sendto() anymore, the new error
> >>>>>>> log repeating gazillion of times looks like this:
> >>>>>>>   nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
> >>>>>>> temporarily unavailable
> >>>>>>>
> >>>>>>>> Did you ktrace(1) the problem?
> >>>>>>>
> >>>>>>> Yes, please see below.
> >>>>>>>
> >>>>>>>> How is sendto(2) called, in particular
> >>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
> >>>>>>>
> >>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
> >>>>>>> recvfrom().
> >>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
> >>>>>>
> >>>>>> I think this is an error in your analysis. It's perfectly normal for
> >>>>>> recvform to return EAGAIN.
> >>>>>
> >>>>> Yes, the log message coming from sendto() / sendmmsg() was
> >>>>> misleading here.
> >>>>>
> >>>>>> On the sending side it indicates an issue.
> >>>>>
> >>>>> Yes, as verified with ktrace the sendto() works fine.
> >>>>>
> >>>>> However, it’s not normal to write 200 log messages per second for
> >>>>> recvfrom() EAGAIN.
> >>>>>
> >>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
> >>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
> >>>>> — if verbosity is 1, which is OpenBSD default.
> >>>>>
> >>>>> Besides, I believe (reading man pages) these network errors
> >>>>> should be only logged at a higher verbosity 2 anyways.
> >>>
> >>> Commit
> >>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
> >>> should remove the EAGAIN log lines so they don't bother you.
> >>
> >> Thanks for looking into it.
> >>
> >>> At first I thought you were talking about the return value of the
> >>> syscall, but instead it is one of the elements of the send list.
> >>
> >> There seems only 1 element in the send list (usually),
> >> which is sent fully and successfully with no error from sendto()
> >> and no EAGAIN being generated.
> >>
> >> So, I wonder where the EAGAIN is actually coming from:
> >> Looking at the ktrace output below and the code it seems most
> >> likely to be propagated from recvmmsg() to sendmmsg(), no?
> >>
> >> Wild guess: Is there an off-by-one in the while loop, which results
> >> in sendmmsg() called again, which restores the previous errno
> >> from recvmmsg() but does not hit sendto() again.
> >>
> >> So, your recent commit may just “hide" the actual “issue".
> >
> > You may be right, does this commit look better:
> > https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104 <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
> >
> > It retries when UDP send buffers are full and sets the blocking flag so
> > that the operation completes.
>
> Hah. that’s a large hammer, which may work and seems the right thing
> to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
> buffers.
>
> But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(),
> since sendto() just finishes fine (see last line in ktrace below) -- so no buffer
> issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and
> none from sendto() (or any other syscall). As I wrote above, the EAGAIN
> seems just “propagated” from recvmmsg() to sendmmsg() due to the errno
> restore dance.
>
> Why does the code postpone the error handling (and error logging) at all and
> stores / restore errno for later iterations - why not instead handle the error (logging)
> immediately?
>
>
> >>>>>>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 52438 nsd      RET   kevent 1
> >>>>>>> 55919 nsd      RET   kevent 1
> >>>>>>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
> >>>>>>> 52438 nsd      RET   clock_gettime 0
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL
> >>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 55919 nsd      CALL
> >>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 52438 nsd      GIO   fd 5 read 46 bytes
> >>>>>>>    
> >>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily
> >>>>>>> unavailable
> >>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      RET   recvfrom 46/0x2e
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22
> >>>>>>> 2020">.287233 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
> >>>>>>> 52438 nsd      RET   gettimeofday 0
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
> >>>>>>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
> >>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 109.487449000 }
> >>>>>>> 52438 nsd      GIO   fd 5 wrote 46 bytes
> >>>>>>>    
> >>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 52438 nsd      RET   sendto 46/0x2e
>

I think we have different issues. I originally was thinking this was
about the occasional sendo EAGAIN that some see while the send buffer
does not appear to be full. That is different thing than recfrom
failing.

        -Otto