It's fairly easy to accidentally configure relayd to try to run check scripts
faster than they finish, for example if you have a check interval of one second and the check script makes a tcp connection to a host that doesn't exist any more. In this situation, the hce process will keep writing messages to its imsg buffer to the parent process asking it to run checks, which causes its memory usage to grow without bounds. If the check script starts working again (or if you change it to just 'exit 0') the parent works its way through the backlog and memory usage goes back to normal, but ideally relayd would avoid doing this to itself. If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in hce_launch_checks(), check_script() can use them to figure out if the last check request it sent for the host has finished yet, so it can avoid building up a backlog of work for the parent. The ICMP and script check implementations clear these flags as they start checks, and the TCP check code doesn't use them at all, so this shouldn't affect anything else. ok? Index: check_script.c =================================================================== RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v retrieving revision 1.21 diff -u -p -u -p -r1.21 check_script.c --- check_script.c 28 May 2017 10:39:15 -0000 1.21 +++ check_script.c 15 Feb 2021 01:28:54 -0000 @@ -38,6 +38,9 @@ check_script(struct relayd *env, struct struct ctl_script scr; struct table *table; + if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT) + return; + if ((table = table_find(env, host->conf.tableid)) == NULL) fatalx("%s: invalid table id", __func__); @@ -52,7 +55,9 @@ check_script(struct relayd *env, struct fatalx("invalid script path"); memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout)); - proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr)); + if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, + sizeof(scr)) == 0) + host->flags |= F_CHECK_SENT; } void Index: hce.c =================================================================== RCS file: /cvs/src/usr.sbin/relayd/hce.c,v retrieving revision 1.79 diff -u -p -u -p -r1.79 hce.c --- hce.c 6 Aug 2018 17:31:31 -0000 1.79 +++ hce.c 15 Feb 2021 01:28:54 -0000 @@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v TAILQ_FOREACH(host, &table->hosts, entry) { if ((host->flags & F_CHECK_DONE) == 0) host->he = HCE_INTERVAL_TIMEOUT; - host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE); if (event_initialized(&host->cte.ev)) { event_del(&host->cte.ev); close(host->cte.s); |
On Mon, Feb 15, 2021 at 12:03:42PM +1000, Jonathan Matthew wrote:
> It's fairly easy to accidentally configure relayd to try to run check scripts > faster than they finish, for example if you have a check interval of one > second and the check script makes a tcp connection to a host that doesn't > exist any more. > > In this situation, the hce process will keep writing messages to its imsg > buffer to the parent process asking it to run checks, which causes its memory > usage to grow without bounds. If the check script starts working again > (or if you change it to just 'exit 0') the parent works its way through the > backlog and memory usage goes back to normal, but ideally relayd would avoid > doing this to itself. > > If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in > hce_launch_checks(), check_script() can use them to figure out if the > last check request it sent for the host has finished yet, so it can avoid > building up a backlog of work for the parent. The ICMP and script check > implementations clear these flags as they start checks, and the TCP check > code doesn't use them at all, so this shouldn't affect anything else. > > ok? ok tb > > > Index: check_script.c > =================================================================== > RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v > retrieving revision 1.21 > diff -u -p -u -p -r1.21 check_script.c > --- check_script.c 28 May 2017 10:39:15 -0000 1.21 > +++ check_script.c 15 Feb 2021 01:28:54 -0000 > @@ -38,6 +38,9 @@ check_script(struct relayd *env, struct > struct ctl_script scr; > struct table *table; > > + if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT) > + return; > + > if ((table = table_find(env, host->conf.tableid)) == NULL) > fatalx("%s: invalid table id", __func__); > > @@ -52,7 +55,9 @@ check_script(struct relayd *env, struct > fatalx("invalid script path"); > memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout)); > > - proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr)); > + if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, > + sizeof(scr)) == 0) > + host->flags |= F_CHECK_SENT; > } > > void > Index: hce.c > =================================================================== > RCS file: /cvs/src/usr.sbin/relayd/hce.c,v > retrieving revision 1.79 > diff -u -p -u -p -r1.79 hce.c > --- hce.c 6 Aug 2018 17:31:31 -0000 1.79 > +++ hce.c 15 Feb 2021 01:28:54 -0000 > @@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v > TAILQ_FOREACH(host, &table->hosts, entry) { > if ((host->flags & F_CHECK_DONE) == 0) > host->he = HCE_INTERVAL_TIMEOUT; > - host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE); > if (event_initialized(&host->cte.ev)) { > event_del(&host->cte.ev); > close(host->cte.s); > |
In reply to this post by Jonathan Matthew-4
On Mon, Feb 15, 2021 at 12:03:42PM +1000, Jonathan Matthew wrote:
> It's fairly easy to accidentally configure relayd to try to run check scripts > faster than they finish, for example if you have a check interval of one > second and the check script makes a tcp connection to a host that doesn't > exist any more. > > In this situation, the hce process will keep writing messages to its imsg > buffer to the parent process asking it to run checks, which causes its memory > usage to grow without bounds. If the check script starts working again > (or if you change it to just 'exit 0') the parent works its way through the > backlog and memory usage goes back to normal, but ideally relayd would avoid > doing this to itself. > > If we don't clear the F_CHECK_SENT and F_CHECK_DONE flags in > hce_launch_checks(), check_script() can use them to figure out if the > last check request it sent for the host has finished yet, so it can avoid > building up a backlog of work for the parent. The ICMP and script check > implementations clear these flags as they start checks, and the TCP check > code doesn't use them at all, so this shouldn't affect anything else. > > ok? > Giovanni > > Index: check_script.c > =================================================================== > RCS file: /cvs/src/usr.sbin/relayd/check_script.c,v > retrieving revision 1.21 > diff -u -p -u -p -r1.21 check_script.c > --- check_script.c 28 May 2017 10:39:15 -0000 1.21 > +++ check_script.c 15 Feb 2021 01:28:54 -0000 > @@ -38,6 +38,9 @@ check_script(struct relayd *env, struct > struct ctl_script scr; > struct table *table; > > + if ((host->flags & (F_CHECK_SENT|F_CHECK_DONE)) == F_CHECK_SENT) > + return; > + > if ((table = table_find(env, host->conf.tableid)) == NULL) > fatalx("%s: invalid table id", __func__); > > @@ -52,7 +55,9 @@ check_script(struct relayd *env, struct > fatalx("invalid script path"); > memcpy(&scr.timeout, &table->conf.timeout, sizeof(scr.timeout)); > > - proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, sizeof(scr)); > + if (proc_compose(env->sc_ps, PROC_PARENT, IMSG_SCRIPT, &scr, > + sizeof(scr)) == 0) > + host->flags |= F_CHECK_SENT; > } > > void > Index: hce.c > =================================================================== > RCS file: /cvs/src/usr.sbin/relayd/hce.c,v > retrieving revision 1.79 > diff -u -p -u -p -r1.79 hce.c > --- hce.c 6 Aug 2018 17:31:31 -0000 1.79 > +++ hce.c 15 Feb 2021 01:28:54 -0000 > @@ -139,7 +139,6 @@ hce_launch_checks(int fd, short event, v > TAILQ_FOREACH(host, &table->hosts, entry) { > if ((host->flags & F_CHECK_DONE) == 0) > host->he = HCE_INTERVAL_TIMEOUT; > - host->flags &= ~(F_CHECK_SENT|F_CHECK_DONE); > if (event_initialized(&host->cte.ev)) { > event_del(&host->cte.ev); > close(host->cte.s); > |
Free forum by Nabble | Edit this page |