relayd check script memory explosion

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

relayd check script memory explosion

Jonathan Matthew-4
It's fairly easy to accidentally configure relayd to try to run check scripts
faster than they finish, for example if you have a check interval of one
second and the check script makes a tcp connection to a host that doesn't
exist any more.

In this situation, the hce process will keep writing messages to its imsg
buffer to the parent process asking it to run checks, which causes its memory
usage to grow without bounds.  If the check script starts working again
(or if you change it to just 'exit 0') the parent works its way through the
backlog and memory usage goes back to normal, but ideally relayd would avoid
doing this to itself.

If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in
hce_launch_checks(), check_script() can use them to figure out if the
last check request it sent for the host has finished yet, so it can avoid
building up a backlog of work for the parent.  The ICMP and script check
implementations clear these flags as they start checks, and the TCP check
code doesn't use them at all, so this shouldn't affect anything else.

ok?


Index: check_script.c
===================================================================
RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v
retrieving revision 1.21
diff -u -p -u -p -r1.21 check_script.c
--- check_script.c 28 May 2017 10:39:15 -0000 1.21
+++ check_script.c 15 Feb 2021 01:28:54 -0000
@@ -38,6 +38,9 @@ check_script(struct relayd *env, struct
  struct ctl_script scr;
  struct table *table;
 
+ if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT)
+ return;
+
  if ((table = table_find(env, host->conf.tableid)) == NULL)
  fatalx("%s: invalid table id", __func__);
 
@@ -52,7 +55,9 @@ check_script(struct relayd *env, struct
  fatalx("invalid script path");
  memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout));
 
- proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr));
+ if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr,
+    sizeof(scr)) == 0)
+ host->flags |= F_CHECK_SENT;
 }
 
 void
Index: hce.c
===================================================================
RCS file: /cvs/src/usr.sbin/relayd/hce.c,v
retrieving revision 1.79
diff -u -p -u -p -r1.79 hce.c
--- hce.c 6 Aug 2018 17:31:31 -0000 1.79
+++ hce.c 15 Feb 2021 01:28:54 -0000
@@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v
  TAILQ_FOREACH(host, &table->hosts, entry) {
  if ((host->flags & F_CHECK_DONE) == 0)
  host->he = HCE_INTERVAL_TIMEOUT;
- host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE);
  if (event_initialized(&host->cte.ev)) {
  event_del(&host->cte.ev);
  close(host->cte.s);

Reply | Threaded
Open this post in threaded view
|

Re: relayd check script memory explosion

Theo Buehler-3
On Mon, Feb 15, 2021 at 12:03:42PM +1000, Jonathan Matthew wrote:

> It's fairly easy to accidentally configure relayd to try to run check scripts
> faster than they finish, for example if you have a check interval of one
> second and the check script makes a tcp connection to a host that doesn't
> exist any more.
>
> In this situation, the hce process will keep writing messages to its imsg
> buffer to the parent process asking it to run checks, which causes its memory
> usage to grow without bounds.  If the check script starts working again
> (or if you change it to just 'exit 0') the parent works its way through the
> backlog and memory usage goes back to normal, but ideally relayd would avoid
> doing this to itself.
>
> If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in
> hce_launch_checks(), check_script() can use them to figure out if the
> last check request it sent for the host has finished yet, so it can avoid
> building up a backlog of work for the parent.  The ICMP and script check
> implementations clear these flags as they start checks, and the TCP check
> code doesn't use them at all, so this shouldn't affect anything else.
>
> ok?

ok tb

>
>
> Index: check_script.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v
> retrieving revision 1.21
> diff -u -p -u -p -r1.21 check_script.c
> --- check_script.c 28 May 2017 10:39:15 -0000 1.21
> +++ check_script.c 15 Feb 2021 01:28:54 -0000
> @@ -38,6 +38,9 @@ check_script(struct relayd *env, struct
>   struct ctl_script scr;
>   struct table *table;
>  
> + if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT)
> + return;
> +
>   if ((table = table_find(env, host->conf.tableid)) == NULL)
>   fatalx("%s: invalid table id", __func__);
>  
> @@ -52,7 +55,9 @@ check_script(struct relayd *env, struct
>   fatalx("invalid script path");
>   memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout));
>  
> - proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr));
> + if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr,
> +    sizeof(scr)) == 0)
> + host->flags |= F_CHECK_SENT;
>  }
>  
>  void
> Index: hce.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/relayd/hce.c,v
> retrieving revision 1.79
> diff -u -p -u -p -r1.79 hce.c
> --- hce.c 6 Aug 2018 17:31:31 -0000 1.79
> +++ hce.c 15 Feb 2021 01:28:54 -0000
> @@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v
>   TAILQ_FOREACH(host, &table->hosts, entry) {
>   if ((host->flags & F_CHECK_DONE) == 0)
>   host->he = HCE_INTERVAL_TIMEOUT;
> - host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE);
>   if (event_initialized(&host->cte.ev)) {
>   event_del(&host->cte.ev);
>   close(host->cte.s);
>

Reply | Threaded
Open this post in threaded view
|

Re: relayd check script memory explosion

Giovanni Bechis-7
In reply to this post by Jonathan Matthew-4
On Mon, Feb 15, 2021 at 12:03:42PM +1000, Jonathan Matthew wrote:

> It's fairly easy to accidentally configure relayd to try to run check scripts
> faster than they finish, for example if you have a check interval of one
> second and the check script makes a tcp connection to a host that doesn't
> exist any more.
>
> In this situation, the hce process will keep writing messages to its imsg
> buffer to the parent process asking it to run checks, which causes its memory
> usage to grow without bounds.  If the check script starts working again
> (or if you change it to just 'exit 0') the parent works its way through the
> backlog and memory usage goes back to normal, but ideally relayd would avoid
> doing this to itself.
>
> If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in
> hce_launch_checks(), check_script() can use them to figure out if the
> last check request it sent for the host has finished yet, so it can avoid
> building up a backlog of work for the parent.  The ICMP and script check
> implementations clear these flags as they start checks, and the TCP check
> code doesn't use them at all, so this shouldn't affect anything else.
>
> ok?
>
ok giovanni@
 Thanks
  Giovanni

>
> Index: check_script.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v
> retrieving revision 1.21
> diff -u -p -u -p -r1.21 check_script.c
> --- check_script.c 28 May 2017 10:39:15 -0000 1.21
> +++ check_script.c 15 Feb 2021 01:28:54 -0000
> @@ -38,6 +38,9 @@ check_script(struct relayd *env, struct
>   struct ctl_script scr;
>   struct table *table;
>  
> + if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT)
> + return;
> +
>   if ((table = table_find(env, host->conf.tableid)) == NULL)
>   fatalx("%s: invalid table id", __func__);
>  
> @@ -52,7 +55,9 @@ check_script(struct relayd *env, struct
>   fatalx("invalid script path");
>   memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout));
>  
> - proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr));
> + if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr,
> +    sizeof(scr)) == 0)
> + host->flags |= F_CHECK_SENT;
>  }
>  
>  void
> Index: hce.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/relayd/hce.c,v
> retrieving revision 1.79
> diff -u -p -u -p -r1.79 hce.c
> --- hce.c 6 Aug 2018 17:31:31 -0000 1.79
> +++ hce.c 15 Feb 2021 01:28:54 -0000
> @@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v
>   TAILQ_FOREACH(host, &table->hosts, entry) {
>   if ((host->flags & F_CHECK_DONE) == 0)
>   host->he = HCE_INTERVAL_TIMEOUT;
> - host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE);
>   if (event_initialized(&host->cte.ev)) {
>   event_del(&host->cte.ev);
>   close(host->cte.s);
>

signature.asc (849 bytes) Download Attachment