20% package loss on CARP after upgrade to 6.3

Selveste1
Hey everybody,

I'm experiencing problems with CARP after upgrading to 6.3. It was working
fine between my two servers on 6.2, but after upgrading (first the backup,
then the master) I see about 20% packet loss on ping.

It seems like the backup server tries to take over as master, because it's
the only one changing state. When it changes state, the symptom is:
one packet is dropped (ping), and it switches back to backup. I haven't
changed anything, CARP config or PF, except the upgrade to 6.3.

It works if I shut down the master: the backup takes over fine and hands
back to the master when it comes up again. But when it's just running as
backup, it switches back and forth.

I have tried tcpdumping and looking at my pfsync0, but I can't find the
problem. I have rewritten my CARP settings in hostname.carp* on both
servers, checked that pfsync0 is on the same interface and IP range on
both servers, and checked my PF and everything, but can't find the problem...

It happens across all 6 CARPs, so it looks like it's missing a heartbeat
or something once in a while.

I also tried switching from multicast to unicast, in case my ISP (running
Juniper equipment) has activated something on the WAN side, but it didn't
change anything. Since it also happens on my LAN, I didn't really
expect this to be the problem.

# Server 1
My /etc/hostname.* for CARPs and pfsync + host adapter:
https://pastebin.com/vrtuPqnQ
My /etc/pf.conf: https://pastebin.com/yhVkG4x4

# Server 2
My /etc/hostname.* for CARPs and pfsync + host adapter:
https://pastebin.com/a7fuM923
My /etc/pf.conf: https://pastebin.com/xNr1TtZ7

Any help or pointers would be fantastic.
I have struggled with this for a week now and I'm running out of ideas -
the only solution I have right now is turning off the backup server.

$ uname -a
OpenBSD BSD-firewall01.static.semarkit.net 6.3 GENERIC.MP#107 amd64

Both servers are running on a KVM host running Debian Stretch with
ZFS-for-Linux, and they haven't been touched since installation, neither
before, during, nor after the problems started.

em0 is passed through the host and running all the VLAN and CARP things,
while em1 (pfsync0) is a crossed connection between the two host servers
not connected to the outside world or switch.

If you need any other information on anything in the setup, please feel
free to ask. I'm really annoyed by this, since it has worked and now it
doesn't, and I can't figure out why or what I have missed.

The only thing I haven't tried yet is to install a couple of new servers
and reproduce the problem.

Sorry for a really long post!
And to the people receiving this message for the second time, I'm really sorry too, but I had some problems with my DMARC settings.

-- Med Venlig Hilsen / Best Regards Henrik Dige Semark


Re: 20% package loss on CARP after upgrade to 6.3

Janne Johansson-3
On Wed, 20 June 2018 at 19:59, Henrik Dige Semark <[hidden email]> wrote:

> Hey everybody,
>
> # Server 1
> My /etc/hostname.* for CARPs and pfsync + host adapter:
> https://pastebin.com/vrtuPqnQ
> My /etc/pf.conf: https://pastebin.com/yhVkG4x4
>
> # Server 2
> My /etc/hostname.* for CARPs and pfsync + host adapter:
> https://pastebin.com/a7fuM923
> My /etc/pf.conf: https://pastebin.com/xNr1TtZ7
>
> Any help or pointers would be fantastic.
> I have struggled with this for a week now and I'm running out of ideas -
> the only solution I have right now is turning off the backup server.
>

You should have different advskew on expected master and slave carps, no?

Also, we used to have something like 20 for master and 80 on slave so one
can place slaves before master, or master after slave if you want to signal
"I am still running but would like to hand over to the other if we can".


--
May the most significant bit of your life be positive.

Re: 20% package loss on CARP after upgrade to 6.3

Robert Blacquiere-7
In reply to this post by Selveste1

Just a quick thought: as em devices are emulated on KVM, did you try
disabling hw offloading on the interfaces? I had a similar issue
with a VPS: pings seemed to work but other traffic had drops.

Regards

Robert


Re: 20% package loss on CARP after upgrade to 6.3

Stefan Sperling-5
In reply to this post by Janne Johansson-3
On Thu, Jun 21, 2018 at 10:07:06AM +0200, Janne Johansson wrote:

> You should have different advskew on  expected master and slave carps, no?

Looks to me like that is already the case (Server 1 has advskew 0,
Server 2 has advskew 100).

> Also, we used to have something like 20 for master and 80 on slave so one
> can place slaves before master, or master after slave if you want to signal
> "I am still running but would like to hand over to the other if we can".

The carp demote counter is also relevant to failover and is sometimes
raised at run-time when interface output errors occur. The advskew value
only matters as long as the demote counter is equal on both sides.
See 'ifconfig -g carp' and the 'carpdemote' directives documented in
the INTERFACE GROUPS section of the ifconfig man page.
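
As an illustration of the ordering described above, here is a minimal Python sketch. It is a simplified model, not the kernel's actual code: the class `CarpHost` and the function `preferred_master` are hypothetical names, and the model assumes that the host advertising the lowest (demote counter, advbase, advskew) combination is the one that should hold MASTER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarpHost:
    """Hypothetical model of one CARP peer (names are illustrative only)."""
    name: str
    demote: int   # group demote counter, as reported by 'ifconfig -g carp'
    advbase: int  # base advertisement interval in seconds
    advskew: int  # skew added to the interval, in 1/256 s units

    def priority_key(self):
        # Lower tuples win. The demote counter is compared first, so
        # advskew only decides the outcome when demote counters are equal.
        return (self.demote, self.advbase, self.advskew)


def preferred_master(a, b):
    """Return the peer that should hold MASTER under this simplified model."""
    return a if a.priority_key() <= b.priority_key() else b


server1 = CarpHost("server1", demote=0, advbase=1, advskew=0)
server2 = CarpHost("server2", demote=0, advbase=1, advskew=100)
print(preferred_master(server1, server2).name)

# If server1's demote counter is raised, its lower advskew no longer helps.
demoted = CarpHost("server1", demote=1, advbase=1, advskew=0)
print(preferred_master(demoted, server2).name)
```

With equal demote counters the lower advskew wins; once one side's demote counter is higher, the other side is preferred regardless of advskew.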

To avoid potential routing issues, I would recommend setting netmasks
to /32 on all carp interfaces if they share a subnet with an Ethernet
interface.

I have no idea about a possible specific reason for packet loss, though.


Re: 20% package loss on CARP after upgrade to 6.3

Selveste1
On 21-06-2018 10:30, Stefan Sperling wrote:

> On Thu, Jun 21, 2018 at 10:07:06AM +0200, Janne Johansson wrote:
>> You should have different advskew on  expected master and slave carps, no?
> Looks to me like that is already the case (Server 1 has advskew 0,
> Server 2 has advskew 100).
To be fair, I had only just changed it to see if it made a difference, but
I still have the problem with packet loss. I'll try changing it to
20/80 later; it's a good idea if I want to swap roles easily between
the servers.
>> Also, we used to have something like 20 for master and 80 on slave so one
>> can place slaves before master, or master after slave if you want to signal
>> "I am still running but would like to hand over to the other if we can".
> The carp demote counter is also relevant to failover and is sometimes
> raised at run-time when interface output errors occur. The advskew value
> only matters as long as the demote counter is equal on both sides.
> See 'ifconfig -g carp' and the 'carpdemote' directives documented in
> the INTERFACE GROUPS section of the ifconfig man page.
Both servers have
# ifconfig -g carp
carp: carp demote count 0
> To avoid potential routing issues, I would recommend setting netmasks
> to /32 on all carp interfaces if they share a subnet with an Ethernet
> interface.
The only carp interface that shares a subnet with a host interface is
carp1 (with em0), so that I can connect to each server directly. I have
solved the routing by creating a separate routing table, but it would be
a good idea to change it to /32 so that only the default gw is on the
CARP and nothing else.
> I have no idea about a possible specific reason for packet loss, though.
>
Snippet from: Robert Blacquiere <[hidden email]>
> Just a quick thought: as em devices are emulated on KVM, did you try
> disabling hw offloading on the interfaces? I had a similar issue
> with a VPS: pings seemed to work but other traffic had drops.
I haven't tried to disable HW offload, but do you think it could be a
problem, when it worked fine under older versions of OpenBSD?

Med Venlig Hilsen / Best Regards
Henrik Dige Semark




Re: 20% package loss on CARP after upgrade to 6.3

Robert Blacquiere-7

I had some issues with VPSes with em interfaces and pseudo hw offloading.
Now I never use offloading on a VPS and have not encountered these strange
things again, like packet drops, ICMP working while TCP/UDP fails, or
strange CARP hiccups. I also encountered an issue with multicast on Juniper
in combination with a numbered management VLAN on the default VLAN.
Somewhere in the Juniper the packets got silenced.

Regards

Robert


Re: 20% package loss on CARP after upgrade to 6.3

Janne Johansson-3
In reply to this post by Stefan Sperling-5
On Thu, 21 June 2018 at 10:31, Stefan Sperling <[hidden email]> wrote:

> On Thu, Jun 21, 2018 at 10:07:06AM +0200, Janne Johansson wrote:
> > You should have different advskew on expected master and slave carps, no?
>
> Looks to me like that is already the case (Server 1 has advskew 0,
> Server 2 has advskew 100).
>

Oh damned, I might have looked at the same url twice. My bad.

--
May the most significant bit of your life be positive.

Re: 20% package loss on CARP after upgrade to 6.3

Selveste1
In reply to this post by Robert Blacquiere-7
I have now tried changing the carp netmask to /32 on the one interface
that shares a subnet (management), and changed advskew to 20 on the master
and 80 on the slave. But the ping loss still persists (see bottom of this mail).

I could try to set the carpdemote higher on the slave, but what then
when/if the master actually goes down?
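
For reference, the advskew values mentioned above can be turned into timings with a small Python sketch. It relies on the documented rule that advskew is measured in 1/256 of a second and added to advbase. The master-down timeout formula (three base intervals plus the skew) is an assumption about how a backup decides the master is gone, and both function names are made up for this illustration.

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements: advbase plus advskew/256."""
    return advbase + advskew / 256


def master_down_timeout(advbase, advskew):
    """Approximate seconds a backup waits without hearing the master.

    Assumption for this sketch: three base intervals plus the skew.
    """
    return 3 * advbase + advskew / 256


# advbase 1 with the 20/80 advskew split described in this message
for role, skew in (("master", 20), ("backup", 80)):
    print(role, advertisement_interval(1, skew), master_down_timeout(1, skew))
```

Under these assumptions the master advertises roughly every 1.08 s and the backup would wait a little over 3.3 s before taking over, so a single late or lost advertisement should not by itself trigger a failover.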

> I had some issues with VPSes with em interfaces and pseudo hw offloading.
> Now I never use offloading on a VPS and have not encountered these strange
> things again, like packet drops, ICMP working while TCP/UDP fails, or
> strange CARP hiccups. I also encountered an issue with multicast on Juniper
> in combination with a numbered management VLAN on the default VLAN.
> Somewhere in the Juniper the packets got silenced.
>
> Regards
>
> Robert
>
@Robert: What exactly do you turn off, and how?

Information:

# ifconfig em0
em0: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        lladdr a8:8d:35:55:7d:5f
        description: Management
        index 1 priority 0 llprio 3
        media: Ethernet autoselect (1000baseT full-duplex)
        status: active
        inet 192.168.245.2 netmask 0xffffff00 broadcast 192.168.245.255
        inet6 fe80::24e8:4c63:629c:3d53%em0 prefixlen 64 scopeid 0x1
        inet6 2001:470:1b6a:45::2 prefixlen 64

# ifconfig carp1
carp1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:00:5e:00:01:01
        description: Management
        index 5 priority 15 llprio 3
        carp: MASTER carpdev em0 vhid 1 advbase 1 advskew 20
        groups: carp lan
        status: master
        inet 192.168.245.1 netmask 0xffffffff
        inet6 fe80::3c71:a9ea:18d8:872%carp1 prefixlen 64 scopeid 0x5
        inet6 2001:470:1b6a:45::1 prefixlen 128



# ping -c 50 8.8.8.8 (From my laptop to Google DNS)
--- 8.8.8.8 ping statistics ---
50 packets transmitted, 40 received, +10 errors, 20% packet loss, time 49158ms
rtt min/avg/max/mdev = 7.046/7.370/10.165/0.517 ms

# ping -c 50 192.168.245.2 (From my laptop to Server 1 em0)
--- 192.168.245.2 ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 49350ms
rtt min/avg/max/mdev = 0.658/1.169/4.643/0.682 ms

# ping -c 50 192.168.245.1 (From my laptop to carp1 (default gw))
PING 192.168.245.1 (192.168.245.1) 56(84) bytes of data.
64 bytes from 192.168.245.1: icmp_seq=1 ttl=255 time=0.766 ms
64 bytes from 192.168.245.1: icmp_seq=2 ttl=255 time=0.972 ms
64 bytes from 192.168.245.1: icmp_seq=3 ttl=255 time=1.18 ms
64 bytes from 192.168.245.1: icmp_seq=4 ttl=255 time=0.718 ms
64 bytes from 192.168.245.1: icmp_seq=5 ttl=255 time=0.816 ms
64 bytes from 192.168.245.1: icmp_seq=6 ttl=255 time=0.818 ms
64 bytes from 192.168.245.1: icmp_seq=7 ttl=255 time=0.964 ms
64 bytes from 192.168.245.1: icmp_seq=8 ttl=255 time=0.833 ms
64 bytes from 192.168.245.1: icmp_seq=9 ttl=255 time=0.839 ms
64 bytes from 192.168.245.1: icmp_seq=10 ttl=255 time=0.955 ms
64 bytes from 192.168.245.1: icmp_seq=11 ttl=255 time=1.62 ms
64 bytes from 192.168.245.1: icmp_seq=12 ttl=255 time=0.916 ms
64 bytes from 192.168.245.1: icmp_seq=13 ttl=255 time=0.785 ms
64 bytes from 192.168.245.1: icmp_seq=14 ttl=255 time=0.734 ms
64 bytes from 192.168.245.1: icmp_seq=15 ttl=255 time=1.99 ms
*64 bytes from 192.168.245.1: icmp_seq=16 ttl=255 time=36.8 ms*
64 bytes from 192.168.245.1: icmp_seq=17 ttl=255 time=0.853 ms
64 bytes from 192.168.245.1: icmp_seq=18 ttl=255 time=1.19 ms
64 bytes from 192.168.245.1: icmp_seq=19 ttl=255 time=0.744 ms
64 bytes from 192.168.245.1: icmp_seq=20 ttl=255 time=1.89 ms
64 bytes from 192.168.245.1: icmp_seq=21 ttl=255 time=0.853 ms
64 bytes from 192.168.245.1: icmp_seq=22 ttl=255 time=1.78 ms
64 bytes from 192.168.245.1: icmp_seq=23 ttl=255 time=0.861 ms
64 bytes from 192.168.245.1: icmp_seq=24 ttl=255 time=1.15 ms
64 bytes from 192.168.245.1: icmp_seq=25 ttl=255 time=0.731 ms
64 bytes from 192.168.245.1: icmp_seq=26 ttl=255 time=0.701 ms
64 bytes from 192.168.245.1: icmp_seq=27 ttl=255 time=2.07 ms
64 bytes from 192.168.245.1: icmp_seq=28 ttl=255 time=1.07 ms
*64 bytes from 192.168.245.1: icmp_seq=29 ttl=255 time=41.5 ms*
64 bytes from 192.168.245.1: icmp_seq=30 ttl=255 time=0.798 ms
64 bytes from 192.168.245.1: icmp_seq=31 ttl=255 time=1.65 ms
64 bytes from 192.168.245.1: icmp_seq=32 ttl=255 time=0.846 ms
64 bytes from 192.168.245.1: icmp_seq=33 ttl=255 time=0.782 ms
64 bytes from 192.168.245.1: icmp_seq=34 ttl=255 time=1.94 ms
64 bytes from 192.168.245.1: icmp_seq=35 ttl=255 time=0.841 ms
64 bytes from 192.168.245.1: icmp_seq=36 ttl=255 time=0.874 ms
64 bytes from 192.168.245.1: icmp_seq=37 ttl=255 time=0.819 ms
64 bytes from 192.168.245.1: icmp_seq=38 ttl=255 time=1.57 ms
64 bytes from 192.168.245.1: icmp_seq=39 ttl=255 time=1.79 ms
64 bytes from 192.168.245.1: icmp_seq=40 ttl=255 time=0.710 ms
64 bytes from 192.168.245.1: icmp_seq=41 ttl=255 time=0.762 ms
*64 bytes from 192.168.245.1: icmp_seq=42 ttl=255 time=16.4 ms*
64 bytes from 192.168.245.1: icmp_seq=43 ttl=255 time=0.698 ms
64 bytes from 192.168.245.1: icmp_seq=44 ttl=255 time=0.776 ms
64 bytes from 192.168.245.1: icmp_seq=45 ttl=255 time=0.778 ms
64 bytes from 192.168.245.1: icmp_seq=46 ttl=255 time=1.43 ms
64 bytes from 192.168.245.1: icmp_seq=47 ttl=255 time=0.742 ms
64 bytes from 192.168.245.1: icmp_seq=48 ttl=255 time=0.773 ms
64 bytes from 192.168.245.1: icmp_seq=49 ttl=255 time=0.888 ms
64 bytes from 192.168.245.1: icmp_seq=50 ttl=255 time=0.699 ms

--- 192.168.245.1 ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 49458ms
rtt min/avg/max/mdev = 0.698/2.877/41.590/7.745 ms

My guess is that it switches servers at the three places marked with
*, where higher ping times are seen.
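
Spikes like the ones marked above can also be picked out automatically. The following Python sketch parses Linux-style ping output and flags replies whose round-trip time is far above the median. The function names and the factor of 10 are arbitrary choices for this illustration, and the sample is a short excerpt of the output above.

```python
import re
from statistics import median

# Matches the sequence number and round-trip time in Linux-style ping replies.
PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def parse_rtts(ping_output):
    """Return (sequence number, RTT in ms) pairs found in the ping output."""
    return [(int(seq), float(rtt)) for seq, rtt in PING_LINE.findall(ping_output)]


def find_spikes(samples, factor=10.0):
    """Return the samples whose RTT exceeds `factor` times the median RTT."""
    if not samples:
        return []
    threshold = factor * median(rtt for _, rtt in samples)
    return [(seq, rtt) for seq, rtt in samples if rtt > threshold]


sample = """\
64 bytes from 192.168.245.1: icmp_seq=15 ttl=255 time=1.99 ms
*64 bytes from 192.168.245.1: icmp_seq=16 ttl=255 time=36.8 ms*
64 bytes from 192.168.245.1: icmp_seq=17 ttl=255 time=0.853 ms
64 bytes from 192.168.245.1: icmp_seq=18 ttl=255 time=1.19 ms
64 bytes from 192.168.245.1: icmp_seq=19 ttl=255 time=0.744 ms
"""

print(find_spikes(parse_rtts(sample)))  # prints [(16, 36.8)]
```

Comparing the timestamps of flagged replies with the CARP state changes logged on the backup would show whether the spikes really coincide with the switches.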

--
Med Venlig Hilsen / Best Regards
Henrik Dige Semark
Mobil: +45 2633 1701