More bgpd problems

27 messages
More bgpd problems

Matt Hamilton
Hi all,

More bgpd problems last night :( This happened on two of our
routers, one running an old version of OpenBSD (4.3) and one running
5.1. Is there anyone out there actually using bgpd in production? How
do you deal with it quitting every time something unexpected happens on
the network?

The first message below seems to indicate that bgpd was unable to allocate
memory. I'm running these boxes pretty much stock, having not tuned any
parameters at all. Both are just running routing daemons (bgpd, ospfd)
and the 4.3 box is running OpenVPN. There are no applications running
and both boxes have plenty of RAM (4GB) and are not using any swap or
anything.

Is there something I should look at tuning in terms
of memory allocation in order to stop this happening?

OpenBSD 4.3/amd64:

May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot allocate memory
May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose error: Cannot allocate memory
May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision engine exited
May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error: Broken pipe

OpenBSD 5.1/amd64:

May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine terminated; signal 11
May 29 05:55:09 fw1 bgpd[21459]: fatal in SE: pipe write error: Broken pipe


Thanks
-Matt

Re: More bgpd problems

Otto Moerbeek
On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:

> Hi all,
>
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bpgd in production? How
> do you deal with it quitting everytime something unexpected happens on
> the network?

Yes, lots of people run it in production.

>
> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
>
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?
>
> OpenBSD 4.3/amd64:
>
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory
> May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> error: Cannot allocate memory
> May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> engine exited
> May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> Broken pipe

Only solution: upgrading. You are running unsupported software, a
foolish thing to do.

>
> OpenBSD 5.1/amd64:
>
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11
> May 29 05:55:09 fw1 bgpd[21459]: fatal in SE: pipe write error: Broken
> pipe

This is a real issue. I'll leave this one to people with more
experience running bgpd.
       
        -Otto

Re: More bgpd problems

Stuart Henderson
In reply to this post by Matt Hamilton
On 2012-05-29, Matt Hamilton <[hidden email]> wrote:
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bpgd in production?

Yes.

> How
> do you deal with it quitting everytime something unexpected happens on
> the network?

cron job to restart it, with a random delay to avoid two machines
coming back up at the same time when all the routers at a site
fail together...
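
For illustration, a minimal sketch of such a watchdog. Everything specific
in it is an assumption rather than anything from this thread: the start
command (/etc/rc.d/bgpd start), the 120 second upper bound, and the use of
/dev/urandom so that routers with synchronised clocks do not all pick the
same delay.

```shell
#!/bin/sh
# Hypothetical bgpd watchdog, meant to be run from cron.
# Assumptions: bgpd is started with /etc/rc.d/bgpd, and a pause of up to
# MAX_DELAY seconds is enough to keep routers from restarting in lockstep.

MAX_DELAY=${MAX_DELAY:-120}
START_CMD=${START_CMD:-"/etc/rc.d/bgpd start"}

# Succeeds if a process named exactly "bgpd" exists.
is_running() {
	pgrep -x bgpd >/dev/null 2>&1
}

# Print a random integer in [0, $1], drawn from /dev/urandom.
random_delay() {
	n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
	echo $(( n % ($1 + 1) ))
}

# Wait a random time, then start bgpd only if it is still not running.
restart_if_dead() {
	is_running && return 0
	delay=$(random_delay "$MAX_DELAY")
	logger -t bgpd-watchdog "bgpd not running, restarting in ${delay}s"
	sleep "$delay"
	is_running || $START_CMD
}

# Only act when invoked as "<script> run", e.g. from a crontab entry.
if [ "${1:-}" = run ]; then
	restart_if_dead
fi
```

Saved under a name such as /usr/local/sbin/bgpd-watchdog (a made-up path),
it could be called once a minute from root's crontab with the argument
"run". The second is_running check avoids starting a second copy if someone
restarted bgpd by hand during the pause.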

> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
>
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?

Make sure login.conf memory limits for the daemon class (or the
_bgpd class on a newer OS version using /etc/rc.d) are high enough.
If your limits are insufficient for the size of routing table then
obviously you will have a problem. But also there is a bug
somewhere, possibly to do with nexthop changes, which can result
in very rapidly increasing memory use.
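
As a rough illustration of the first point, a login.conf entry that raises
the data-size limit might look like the fragment below. The class name
simply follows the message above and the sizes are made up; which class
bgpd actually runs under depends on the OpenBSD release and on how it is
started, so check login.conf(5) and the rc scripts first.

```
# /etc/login.conf (fragment; class name and values are assumptions)
_bgpd:\
	:datasize-cur=1024M:\
	:datasize-max=infinity:\
	:tc=daemon:
```

If /etc/login.conf.db exists, it needs rebuilding with
"cap_mkdb /etc/login.conf", and bgpd has to be restarted before new limits
take effect.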

Re: More bgpd problems

Matt Hamilton
Stuart Henderson <stu <at> spacehopper.org> writes:

> cron job to restart it, with a random delay to avoid two machines
> coming back up at the same time when all the routers at a site
> fail together...

So you just check it every minute to see if it is alive?

It seems to me to be a pretty fundamental design flaw in the software given
its role. I would expect it to retry sending a packet or something, not
just exit.
 

> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> >
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
>
> Make sure login.conf memory limits for the daemon class (or the
> _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> If your limits are insufficient for the size of routing table then
> obviously you will have a problem. But also there is a bug
> somewhere, possibly to do with nexthop changes, which can result
> in very rapidly increasing memory use.

Currently my routing table is pretty small. Only something like 150
routes. This will increase once we start taking full feeds. At the moment
we only have a few partial feeds from networks we peer with and everything
else goes out a default route.

I don't think it is a memory issue with the process itself; the error
message seems to be more related to memory available to send the packet.
This is why I'm wondering if there is some sysctl or similar somewhere
I should be tweaking.

-Matt

Re: More bgpd problems

Matt Hamilton
In reply to this post by Otto Moerbeek
Otto Moerbeek <otto <at> drijf.net> writes:

>
> On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
>
> > Hi all,
> >
> > More bgpd problems last night :( This happened last night on two of our
> > routers. One running an old version of OpenBSD (4.3) and one running
> > 5.1. Is there anyone out there actually using bpgd in production? How
> > do you deal with it quitting everytime something unexpected happens on
> > the network?
>
> Yes, lots of people run it in production.

That is what I'd expect. I just don't understand how, given that it keeps
dropping out whenever it has some transient problem.

> >
> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> >
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
> >
> > OpenBSD 4.3/amd64:
> >
> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> > allocate memory
> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> > error: Cannot allocate memory
> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> > engine exited
> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> > Broken pipe
>
> Only solution: upgrading. You are runing unsupported software, a
> foolish thing to do.

Alas we don't all live in Utopia ;) This box is due to be upgraded soon,
but that upgrade is predicated on getting a stable routing environment
so that I can do so. At the moment we are mid-way through migrating
away from Cisco kit to OpenBSD routers. Until I can be confident that it
won't all just fall over I can't continue with the migration.

So any insight on why I would be getting the same symptoms on the 5.1
box? And why bgpd was dying before under 5.0? I'm finding it hard
to believe that this behaviour would have been tolerated by people
running bgpd in production all the way from the time of 4.3 to now.
Which leads to the only conclusion... I'm doing something stupid.
The question is what. I have ospfd and bgpd running. On the 5.1 box
there is also a CARP interface (not an interface we are using ospfd on).

-Matt

Re: More bgpd problems

Otto Moerbeek
In reply to this post by Matt Hamilton
On Tue, May 29, 2012 at 10:00:53AM +0000, Matt Hamilton wrote:

> Stuart Henderson <stu <at> spacehopper.org> writes:
>
> > cron job to restart it, with a random delay to avoid two machines
> > coming back up at the same time when all the routers at a site
> > fail together...
>
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.
>  
> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> >
> > Make sure login.conf memory limits for the daemon class (or the
> > _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> > If your limits are insufficient for the size of routing table then
> > obviously you will have a problem. But also there is a bug
> > somewhere, possibly to do with nexthop changes, which can result
> > in very rapidly increasing memory use.
>
> Currently my routing table is pretty small. Only something like 150
> routes. This will increase once we start taking full feeds. At the moment
> we only have a few partial feeds from networks we peer with and everything
> else goes out a default route.
>
> I don't think it is a memory issue with the process itself, but the error
> message seems to be more related to memory available to send the packet.
> This is why I'm wondering if there is some sysctl or similar somewhere
> I should be tweaking.
>
> -Matt

The 4.x error and the 5.1 error are unrelated. Your first task should
be to upgrade the 4.x machine.

        -Otto

Re: More bgpd problems

Henning Brauer
In reply to this post by Matt Hamilton
* Matt Hamilton <[hidden email]> [2012-05-29 12:02]:
> Stuart Henderson <stu <at> spacehopper.org> writes:
> > cron job to restart it, with a random delay to avoid two machines
> > coming back up at the same time when all the routers at a site
> > fail together...
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.

it doesn't exit under normal circumstances.

bgpd is used in a lot of places, some extremely large ones too. you'd
be surprised. and no, they don't deal with "bgpd exiting constantly" or
however you called it, not at all.

> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> >
> > Make sure login.conf memory limits for the daemon class (or the
> > _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> > If your limits are insufficient for the size of routing table then
> > obviously you will have a problem. But also there is a bug
> > somewhere, possibly to do with nexthop changes, which can result
> > in very rapidly increasing memory use.

this bug is hard to trigger and we have not been able to identify a
pattern here, except that it involves iBGP.

--
Henning Brauer, [hidden email], [hidden email]
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/

Re: More bgpd problems

Henning Brauer
In reply to this post by Matt Hamilton
* Matt Hamilton <[hidden email]> [2012-05-29 10:59]:
> OpenBSD 4.3/amd64:
>
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory

out of memory.

others have said enuff about running 4.3.

> OpenBSD 5.1/amd64:
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11

now that is bad. sig11 = segfault, Must Not Happen (tm).
can you get us a backtrace? stuart, can we document the steps to do so
somewhere we can point people to?


Re: More bgpd problems

Otto Moerbeek
In reply to this post by Matt Hamilton
On Tue, May 29, 2012 at 10:06:37AM +0000, Matt Hamilton wrote:

> Otto Moerbeek <otto <at> drijf.net> writes:
>
> >
> > On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
> >
> > > Hi all,
> > >
> > > More bgpd problems last night :( This happened last night on two of our
> > > routers. One running an old version of OpenBSD (4.3) and one running
> > > 5.1. Is there anyone out there actually using bpgd in production? How
> > > do you deal with it quitting everytime something unexpected happens on
> > > the network?
> >
> > Yes, lots of people run it in production.
>
> That is what I'd expect. I just don't understand how with it keep dropping
> out when it has some transient problem.
>
> > >
> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> > >
> > > OpenBSD 4.3/amd64:
> > >
> > > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> > > allocate memory
> > > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> > > error: Cannot allocate memory
> > > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> > > engine exited
> > > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> > > Broken pipe
> >
> > Only solution: upgrading. You are runing unsupported software, a
> > foolish thing to do.
>
> Alas we don't all live in Utopia ;) This box is due to be upgraded soon,
> but that upgrade is predicated on getting a stable routing environment
> so that I can do so. At the moment we are mid-way through migrating
> away from Cisco kit to OpenBSD routers. Until I can be confident that it
> won't all just fall over I can't continue with the migration.
>
> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard

According to your previous message, you are getting a different
behaviour on the 5.1 box. A segfault is not the same as running out of mem.

As for the quitting problem: if a fatal error occurs, you don't have
any other choice than to quit. A fatal error means the process cannot
be trusted any more. This is unsatisfactory, but the only way.


> to believe that this behaviour would have been tolerated by people
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt

There have been earlier reports of bgpd running out of mem or getting
segfaults. In some cases that led to fixing bugs. There might remain
unsolved cases.

Working with the developers is one way of getting problems resolved.
Ranting about "I cannot believe this is happening" is not a
constructive way to get closer to the solution.

        -Otto

Re: More bgpd problems

Garry Dolley-2
In reply to this post by Matt Hamilton
On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
> Hi all,
>
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bpgd in production? How

Yes.  For the record I run it on OpenBSD 4.4; IPv6 traffic only.
While there have been some quirks over the years, I've never seen it
quit.

--
Garry Dolley
ARP Networks, Inc. | http://www.arpnetworks.com | (818) 206-0181
Data center, VPS, and IP Transit solutions
Member Los Angeles County REACT, Unit 336 | WQGK336
Blog http://scie.nti.st

Re: More bgpd problems

Patrick Coleman-5
In reply to this post by Matt Hamilton
On 29/05/2012, at 6:08 PM, Matt Hamilton <[hidden email]> wrote:

> Stuart Henderson <stu <at> spacehopper.org> writes:
>
>> cron job to restart it, with a random delay to avoid two machines
>> coming back up at the same time when all the routers at a site
>> fail together...
>
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.

I run it on five routers in production, balancing a couple of Internet
links and a connection to a peering point. ospfd and ospf6d handle the
internal routing. I don't have a cron job to restart it because I
wasn't aware this was necessary; it's been running for a year now with
no issues. There are however a few redundant paths, so if we did lose
a router it wouldn't cause too many problems.

Installations are a mix of 5.0 and 4.7, IIRC. Hardware is Dell R610s
and R415s, plus an embedded Soekris board (at the peering point).

Cheers,

Patrick

Re: More bgpd problems

Matt Hamilton
In reply to this post by Otto Moerbeek
Otto Moerbeek <otto <at> drijf.net> writes:

> According to you previous message, you are getting a different
> behaviour on the 5.1 box. A segfault is not the same as running out of mem.

I agree. It seems strangely coincidental though that bgpd on both versions
of OpenBSD died within minutes of each other.
 
> As for the quitting problem: if a fatal error occurs, you don't have
> any other choice than to quit. A fatal error means the process cannnot
> be trusted any more. This is unsatisfactory, but the only way.

true.

> > to believe that this behaviour would have been tolerated by people
> > running bgpd in production all the way from the time of 4.3 to now.
> > Which leads to the only conclusion... I'm doing something stupid.
> > The question is what. I have ospfd and bgpd running. On the 5.1 box
> > there is also a CARP interface too (not an interface we are using ospfd on).
> >
> > -Matt
>
> There have been earlier reports of bgpd running out of mem or getting
> segfaults. In some cases that lead to fixing bugs. There might remain
> unsolved cases.
>
> Working with the developers is one way of getting problems resolved.
> Ranting about "I cannot believe this is happening" is not a
> constructive way to get closer to the solution.

Sorry if you misunderstood what I wrote. I was not ranting, I was pointing
out that, as I can't believe it would be tolerated, it means I must be
doing something stupid, or different, or wrong.

-Matt

Re: More bgpd problems

Matt Hamilton
In reply to this post by Henning Brauer
Henning Brauer <lists-openbsd <at> bsws.de> writes:

> > OpenBSD 5.1/amd64:
> > May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> > terminated; signal 11
>
> now that is bad. sig11 = segfault, Must Not Happen (tm).
> can you get us a backtrace? stuart, can we document the steps to do so
> somewhere we can point people to?

I will happily supply what I can. Just let me know how.
Although, as you said in another post,
it is hard to replicate. All I seem to be able to see is that this happens
during some period of network instability. It seems that there is a
ripple effect: something happens that causes a bgpd
process to die, which then propagates more changes to iBGP peers,
and they then sometimes die as well.

-Matt

Re: More bgpd problems

Peter J. Philipp-3
On Tue, May 29, 2012 at 04:21:12PM +0000, Matt Hamilton wrote:
> I will happily supply what I can. Just let me know how.

Hello, I've never used bgpd personally but perhaps I can help you get a
backtrace.  There are quite possibly two ways to get a backtrace.

1. Make BGPD dump core

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that.

Check the home directory of the _bgpd user and make it writable by
the _bgpd user.  If after another crash a bgpd.core file pops up you got it.

You can test this by sending bgpd a SIGABRT; if it doesn't dump core,
something is wrong, see #2.

You then type 'gdb /usr/sbin/bgpd bgpd.core' and type backtrace within gdb.
Type quit to exit gdb.  Keep the bgpd.core file around by saving it to another
location, as it will be overwritten by each subsequent segfault.
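
Once a core file exists, the copy-then-backtrace step could be scripted
roughly as follows. This is only a sketch: the archive directory is a
made-up location, and it assumes gdb is in PATH and accepts its commands
from a file via -batch -x.

```shell
#!/bin/sh
# Sketch: keep a timestamped copy of a core file and print a backtrace.
# Assumptions: gdb is in PATH and bgpd was built with debugging symbols;
# the binary path and the archive directory are placeholders.

BGPD_BIN=${BGPD_BIN:-/usr/sbin/bgpd}
ARCHIVE_DIR=${ARCHIVE_DIR:-/var/tmp/bgpd-cores}

# Usage: backtrace_core /path/to/bgpd.core
backtrace_core() {
	core=$1
	[ -f "$core" ] || { echo "no core file at $core" >&2; return 1; }

	# Copy first, so that the next crash cannot overwrite the evidence.
	mkdir -p "$ARCHIVE_DIR" || return 1
	saved="$ARCHIVE_DIR/bgpd.core.$(date +%Y%m%d-%H%M%S)"
	cp "$core" "$saved" || return 1

	# Feed gdb its commands from a file and run it non-interactively.
	cmds=$(mktemp) || return 1
	echo "backtrace" > "$cmds"
	gdb -batch -x "$cmds" "$BGPD_BIN" "$saved"
	status=$?
	rm -f "$cmds"
	return $status
}

# Run only when a core file is given on the command line.
[ $# -eq 0 ] || backtrace_core "$1"
```

The saved copy can then be opened interactively with the same binary to
move up and down the stack.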

2. Attach gdb to the process and wait

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that.

su to root, start a tmux session and from within tmux attach to the bgpd
process with "gdb /usr/sbin/bgpd <pid of bgpd>". Once you're attached, bgpd
will cease running temporarily; just type "continue" (make sure you don't set
any breakpoints).

You can now wait until bgpd crashes on signal 11.  gdb will break back to
the debugger command line and you can type backtrace within gdb.
Type quit to exit gdb.

When you get to it after a crash, you can reattach to the tmux session with
"tmux att -d" and have the gdb command line before you.  Even better than
just a backtrace is going up and down the stack to see where the program
crashed.  Google for gdb commands.

3. Ask someone else who may have better ideas.

> Although as you said in another post
> it is hard to replicate. All I seem to be able to see is that this happens
> during some period of network instability. It seems that there is a
> ripple affect that something happens and that then causes a bgpd
> process to die which then propagates more changes to iBGP peers
> and they then sometimes die as well.
>
> -Matt

Cheers,
-peter

Re: More bgpd problems

Henning Brauer
* Peter J. Philipp <[hidden email]> [2012-05-29 21:26]:
> 1. Make BGPD dump core

it doesn't work that way due to bgpd dropping privs and chrooting.
the way involves setting kern.nosuidcoredump to 2, but since we have
all that already written down in an email to a non-public list, it'll
be easiest to make that available.


Re: More bgpd problems

Philip Guenther-2
On Tue, May 29, 2012 at 12:30 PM, Henning Brauer <[hidden email]> wrote:
> * Peter J. Philipp <[hidden email]> [2012-05-29 21:26]:
>> 1. Make BGPD dump core
>
> it doesn't work that way due to bgpd dropping privs and chrooting.
> the way involves setting kern.nosuidcoredump to 2, but since we have
> all that already written down in an email to a non-public list, it'll
> be easiest to make that available.

Roger.  To paraphrase: in order for such a process to be able to dump
core, do the following:
----
Create /var/empty/var/crash/ and chown it to the user that the
[chroot'ed priv-sep'ed process] runs as, then set the
kern.nosuidcoredump sysctl to 2.
----
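
Spelled out as commands, that recipe might look like the sketch below. It
assumes the unprivileged bgpd processes run as _bgpd and chroot to
/var/empty, which matches the path given above but is worth verifying
locally. Because the commands need root and change system state, the sketch
only prints them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the recipe above. Assumptions: bgpd's unprivileged processes run
# as _bgpd and chroot to /var/empty. Commands are only printed unless
# DRY_RUN=0 is set, since they need root and change system state.

DRY_RUN=${DRY_RUN:-1}
CHROOT_DIR=${CHROOT_DIR:-/var/empty}
DAEMON_USER=${DAEMON_USER:-_bgpd}

# Print the command in dry-run mode, otherwise execute it.
run() {
	if [ "$DRY_RUN" = 0 ]; then "$@"; else echo "$*"; fi
}

enable_core_dumps() {
	run mkdir -p "$CHROOT_DIR/var/crash" &&
	run chown "$DAEMON_USER" "$CHROOT_DIR/var/crash" &&
	run sysctl kern.nosuidcoredump=2
}

enable_core_dumps
```

A sysctl set this way does not survive a reboot; adding
kern.nosuidcoredump=2 to /etc/sysctl.conf would make it persistent.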

Philip Guenther

Re: More bgpd problems

James Shupe-4
In reply to this post by Garry Dolley-2
On 05/29/2012 05:41 AM, Garry Dolley wrote:

> On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
>> Hi all,
>>
>> More bgpd problems last night :( This happened last night on two of our
>> routers. One running an old version of OpenBSD (4.3) and one running
>> 5.1. Is there anyone out there actually using bpgd in production? How
>
> Yes.  For the record I run it on OpenBSD 4.4; IPv6 traffic only.
> While there have been some quirks over the years, I've never seen it
> quit.
>

I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
views) and another partial IPv4 view with 12k routes (actually: varying
numbers of peers over the years, but that's the current setup) since 4.5
without needing any cron jobs to watch over it.

nrpe and ifstated run to verify the peers are up and react accordingly,
but they never trigger unless there is a physical or provider issue.
OpenBGPD has been rock solid for us.

--
James Shupe


Re: More bgpd problems

Jiri B-2
In reply to this post by Peter J. Philipp-3
On Tue, May 29, 2012 at 09:25:16PM +0200, Peter J. Philipp wrote:
> Recompile the bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g).  And
> install that.

I thought -current is compiled with debug symbols, isn't it?

jirib

Re: More bgpd problems

Matt Hamilton
In reply to this post by Philip Guenther-2
Philip Guenther <guenther <at> gmail.com> writes:

> Roger.  To paraphrase: in order for such a process to be able to dump
> core, do the following:
> ----
> Create /var/empty/var/crash/ and chown it to the user that the
> [chroot'ed priv-sep'ed process] runs
> as, then set the kern.nosuidcoredump sysctl to 2.

OK, great. I've done that on all 7 boxes:

4 x OpenBSD 5.1/amd64
2 x OpenBSD 5.0/i386
1 x OpenBSD 4.3/amd64

and tested it with SIGABRT and I get a core file. So now just to sit and
wait until it happens again.

Thanks!

-Matt

Re: More bgpd problems

Stuart Henderson
In reply to this post by Matt Hamilton
On 2012-05-29, Matt Hamilton <[hidden email]> wrote:

> Otto Moerbeek <otto <at> drijf.net> writes:
>
>>
>> On Tue, May 29, 2012 at 08:57:54AM +0000, Matt Hamilton wrote:
>>
>> > Hi all,
>> >
>> > More bgpd problems last night :( This happened last night on two of our
>> > routers. One running an old version of OpenBSD (4.3) and one running
>> > 5.1. Is there anyone out there actually using bpgd in production? How
>> > do you deal with it quitting everytime something unexpected happens on
>> > the network?
>>
>> Yes, lots of people run it in production.
>
> That is what I'd expect. I just don't understand how with it keep dropping
> out when it has some transient problem.
>
>> >
>> > The first message below seems to indicate unable to allocate
>> > memory. I'm running these boxes pretty much stock having not tuned any
>> > parameters at all. Both are just running routing daemons (bgpd, ospf)
>> > and the 4.3 box is running OpenVPN. There are no applications running
>> > and both boxes have plenty of RAM (4GB) and not using any swap or
>> > anything.
>> >
>> > Is there something I should look at tuning in terms
>> > of memory allocation in order to stop this happening?
>> >
>> > OpenBSD 4.3/amd64:
>> >
>> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
>> > allocate memory
>> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
>> > error: Cannot allocate memory
>> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
>> > engine exited
>> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
>> > Broken pipe
>>
>> Only solution: upgrading. You are runing unsupported software, a
>> foolish thing to do.
>
> Alas we don't all live in Utopia ;) This box is due to be upgraded soon,
> but that upgrade is predicated on getting a stable routing environment
> so that I can do so. At the moment we are mid-way through migrating
> away from Cisco kit to OpenBSD routers. Until I can be confident that it
> won't all just fall over I can't continue with the migration.

I would *not* want to be running ospfd from before 5.1 on a DFZ
router. First, RTM_DESYNC (route socket overflows) were not dealt with
at all in ospfd until 4.8, and from then until 5.1 they tended to
result in lots of kernel route table dumps in quick succession to
get back into sync, which is pretty hard on the machine. In 5.1
a holdoff timer was introduced for these resyncs. bgpd-wise, since
4.3 there have been fixes for crashes triggered by bad updates (these
affected most BGP implementations, not just OpenBSD) and numerous
other fixes. If you are upgrading from that version, then use bsd.rd
to upgrade rather than untarring sets on the live system, and read
the upgrade notes for the intermediate versions; I think that time
period includes slightly incompatible changes to bgpd.conf.

> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard
> to believe that this behaviour would have been tolerated by people
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt
>
>

Not sure when I started seeing it as I had various other problems
on the network and with hardware back in the 4.3 days (what's that,
4 years ago or so?)

Some people don't seem to hit it at all. One of the most common
uses of OpenBGP is running as route server with mostly LAN-based
connections and I suspect this type of setup is less likely to hit
this problem. I usually only hit it on routers connected via WAN
links (redundant paths with ospf which flap on occasion). I usually
hit the memory problem a few times in fairly quick succession,
then not again for sometimes as much as a couple of months or
even longer.

Without having had a way to trigger it in the lab, and in my case not
much storage on the routers to save dumps, getting more information to
help track it down is challenging. And of course I am reliant on
out-of-band access and need to get the network back up at that
point, and am often not fully awake having been woken by a text from
icinga, so debug opportunities are very limited.

If you're better able to try and get some debug information, from what
we've worked out more recently I would suggest flapping the ospf links
as possibly triggering it.
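
For a lab box, one crude way to flap a link is to bounce the interface that
carries the OSPF adjacency. The sketch below is only that: the interface
name em1 and the timings are placeholders, not values from this thread, and
it prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Lab-only sketch: bounce one interface a few times to make the OSPF
# adjacency over it flap. The interface name and timings are placeholders.
# Commands are only printed unless DRY_RUN=0 is set.

DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
	if [ "$DRY_RUN" = 0 ]; then "$@"; else echo "$*"; fi
}

# Usage: flap_link <interface> <count> <seconds-down> <seconds-up>
flap_link() {
	ifname=$1 count=$2 down=$3 up=$4
	i=0
	while [ "$i" -lt "$count" ]; do
		run ifconfig "$ifname" down
		run sleep "$down"
		run ifconfig "$ifname" up
		run sleep "$up"
		i=$((i + 1))
	done
}

flap_link "${1:-em1}" "${2:-3}" "${3:-10}" "${4:-60}"
```

Running this with core dumps already enabled on the neighbouring routers
would give the best chance of catching something useful if bgpd does fall
over.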
