IKEDv2 lost tunnel. How to reproduce at will, effects and work around.

Daniel Ouellet
I sure hope this will help.

***Setup***
Two servers running OpenBSD 5.8, with a VPN established between them by
iked (IKEv2). One side is active, the other passive. Use RSA keys, or a
passphrase if you like.

Active side:
# cat /etc/iked.conf
ikev2 Ouellet active from re0 to 66.63.5.250 from 66.63.50.16/28 to \
    0.0.0.0/0 peer 66.63.5.250

Passive side:
# cat /etc/iked.conf
ikev2 Ouellet passive from em0 to 108.56.142.37 from 0.0.0.0/0 to \
    66.63.50.16/28 peer 108.56.142.37

***Issues***
1. Under heavy traffic, many SAs pile up in the SAD. They only get
cleaned up when the time lifetime expires, even if the byte lifetime
has been exceeded several times over. In other words, cleanup is only
done on the timer, not when the data limit is reached.

2. On a heavy download from the destination (passive side), on some of
the occasions when the data limit is reached, the passive side will try
to switch the tunnel to NAT-T even though there is no NAT in the path.
The only solution then is to stop and start iked on the active side to
establish the tunnel again.

***How to trigger and reproduce at will***
To trigger the issue easily and often, just override the defaults by
adding a much shorter lifetime on both sides:

lifetime 1m bytes 100k

like this:

ikev2 Ouellet active from re0 to 66.63.5.250 from 66.63.50.16/28 to \
    0.0.0.0/0 peer 66.63.5.250 lifetime 1m bytes 100k

Then watch the logs live on the passive side with

tail -f /var/log/daemon | grep iked

and you will very quickly see this:

------------------------------------------------------------------------
Dec 11 20:01:32 tunnel iked[1801]: pfkey_reply: message: No such process
Dec 11 20:01:32 tunnel iked[1801]: ikev2_pld_delete: deleted 1 spis
Dec 11 20:01:32 tunnel iked[1801]: ikev2_msg_send: INFORMATIONAL
response from 66.63.5.250:500 to 108.56.142.37:500 msgid 3, 80 bytes, NAT-T
------------------------------------------------------------------------

Then you will lose access to the tunnel completely, and it will not
recover until you manually reset the active side with /etc/rc.d/iked
stop and start.
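
A small filter can make those lines easier to spot in a busy log. This
is only a sketch: natt_lines is a hypothetical helper name, and it
assumes the iked lines of interest end in "NAT-T", as in the log
excerpt in this report.

```shell
# Print only the iked log lines that carry the NAT-T flag.
# Reads syslog lines on stdin; natt_lines is a made-up helper name.
# Assumption: affected lines end in "NAT-T", as in the excerpt above.
natt_lines() {
    grep -E 'iked\[[0-9]+\]: .*NAT-T$'
}
```

It can be fed from the live log, for example with
tail -f /var/log/daemon | natt_lines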

The data limit is small, so you can trigger it with just

ping -s 1500 66.63.5.250

from the active side of the network, or whatever other way you want to
generate traffic. Before you know it, you could see this:

# ipsecctl -sa | wc -l
     493

and the number of SAs in the SAD will ONLY go down when the time limit
is reached, even though they are no longer valid and were replaced
because the data limit was hit.
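
The raw line count from ipsecctl also includes flows and section
headers. To count only the SAs, a filter along these lines may help. It
is a sketch under one assumption: that each SA is printed on a single
line starting with "esp ". count_sas is a hypothetical helper name.

```shell
# Count SA entries in ipsecctl output read from stdin.
# Assumption: every SA is printed on one line that starts with "esp ";
# flow lines start with "flow " and are therefore not counted.
count_sas() {
    grep -c '^esp '
}
```

For example, running ipsecctl -s sa | count_sas every few seconds while
pushing traffic should show whether the count only drops when the time
lifetime expires.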

Maybe the cleanup should happen on both the time and the data limits.
Just a thought.

***Work Around***
To work around the problem for now, simply change the lifetime on the
PASSIVE side. I just picked 2x the active side's values for both time
and data, so that it NEVER triggers the NAT-T issue. Not an ideal
solution, but for now it fixes the loss of the VPN at random times.

You can test it the same way as above: keep the active side at the same

lifetime 1m bytes 100k

and set the passive side to

lifetime 2m bytes 200k

Then just push traffic through the tunnel.

You will still see the huge increase in SAs on the active side as the
data limit gets reached and new child SAs get created, since the old
ones are not cleaned up then, only when the time limit is reached.

But this way, at a minimum, you will NOT lose your VPN.

The same issue shows up even if both sides are active, so I guess it
may be more of a timing issue. But really, if a VPN works without NAT,
I think it should never try to establish NAT-T anyway, especially when
it has been passing traffic constantly all the way up to 500Mb, that
being the default.

And when the VPN carries huge traffic, maybe it should clean up the old
child SAs in the SAD when a data limit is reached and a new child is
created, instead of doing it only when the time limit is reached. That
way, if you decide to set no limit on time, your box doesn't explode
from lack of resources or whatnot because old child SAs are never
released.

Hopefully this will be useful to someone, as it took me a week to
isolate why in hell I lose the VPN at random times on an otherwise
perfectly working setup.

Best,

Daniel


Re: IKEDv2 lost tunnel. How to reproduce at will, effects and work around.

Daniel Ouellet
OK,

Here are more updates on this, after 3 weeks now of testing every
possible variation and configuration.

I finally found a way to make it stable. I don't like it, but it works.
72 hours so far, or close to it.

Maybe it wasn't noticed before because, I get the feeling, iked is
mostly used in NAT setups, so the issue doesn't show up for most users.

One thing I have to say: I would love a way to DISABLE the NAT-T
capability, as opposed to having the system try it by default. I
understand that a switch like that is NOT something well loved, and I
agree. I'm not sure what the RFC says about it, but if you know you do
not have NAT, and never will, it would be wise to be able to make sure
it never tries NAT-T for any reason.

Now what works, and ONLY that combination actually works, is this one:

Remote site:
ikev2 Ouellet passive from em0 to 108.56.142.37 from 0.0.0.0/0 to \
    66.63.50.16/28 peer 108.56.142.37 srcid tunnel.realconnect.com \
    dstid gateway.ouellet.us lifetime 0 bytes 0

Local site:
ikev2 Ouellet active from re0 to 66.63.5.250 from 66.63.50.16/28 to \
    0.0.0.0/0 peer 66.63.5.250 srcid gateway.ouellet.us \
    dstid tunnel.realconnect.com lifetime 0 bytes 0

If you remove both, or even only one, of

srcid gateway.ouellet.us dstid tunnel.realconnect.com

it will not work. If you do not include

lifetime 0 bytes 0

it will not work either. So ALL 4 elements need to be there for the
remote site NOT to try to switch to NAT-T. Otherwise it kills the
flows, and the only way to restore them is to restart iked on the
source side.

I know this may not make sense, but I see the

Dec 11 19:19:01 tunnel iked[9794]: ikev2_msg_send: INFORMATIONAL request
from 66.63.5.250:4500 to 108.56.142.37:4500 msgid 1, 80 bytes, NAT-T

messages with ANY other configuration, and I tried ALL possible
variations to find out what works.
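
When comparing configurations, it can help to summarize which UDP ports
the NAT-T flagged messages used (500 or 4500). The following is a
sketch only: natt_port_summary is a hypothetical helper name, and it
assumes the ikev2_msg_send log format shown in the line quoted above,
with each message on a single line.

```shell
# Summarize NAT-T flagged ikev2_msg_send lines by source->destination port.
# Reads syslog lines on stdin; natt_port_summary is a made-up helper name.
# Assumption: lines look like "... from IP:PORT to IP:PORT ..., NAT-T".
natt_port_summary() {
    awk '/ikev2_msg_send:/ && /NAT-T$/ {
        src = ""; dst = ""
        for (i = 1; i < NF; i++) {
            if ($i == "from") src = $(i + 1)   # address:port after "from"
            if ($i == "to")   dst = $(i + 1)   # address:port after "to"
        }
        sub(/.*:/, "", src)                    # keep only the port
        sub(/.*:/, "", dst)
        count[src "->" dst]++
    }
    END { for (k in count) print k, count[k] }' | sort
}
```

Something like grep iked /var/log/daemon | natt_port_summary prints one
line per port pair with the number of matching messages.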

I would have assumed that setting the lifetime to never expire would
have done it on its own (even though you shouldn't do that, as it
weakens the IPsec configuration, which relies on rekeying to stay
strong). But it will still try NAT-T even when set to never expire.

Something has to be wrong in the logic here. I started to look at the
code; nothing yet, and I am not sure I will find why. Maybe Reyk has an
idea, as he wrote it, but I am also sure he is very busy with other
things.

But that's the final result of 3 weeks of research and frustration on
this. Now I can get it stable. Well, 72 hours so far anyway, so we will
see. That's what all the tests show, and I hope it is somehow useful to
someone and maybe helps find out why, when NAT is not in the path, it
will try to do NAT-T regardless. That part doesn't make sense to me.

What I described in my original report of 12/11/15 still holds, but
with the workaround from that report the same NAT-T scenario still
shows up, just somewhat less frequently. With the configuration above,
none so far.

Best,

Daniel



Re: IKEDv2 lost tunnel. How to reproduce at will, effects and work around.

Christian Weisgerber
In reply to this post by Daniel Ouellet
There has been zero reaction to this, but I certainly see what looks
to be the same problem: After passing a significant amount of traffic
(hundreds of MBs, I guess), the iked daemons lose sync, flows and SAs
are in disarray, and it takes a number of minutes before they manage
to sync up again.

(Yes, that's vague.  Start with Daniel's report for details.  I
haven't gotten around to really looking at what happens.)

--
Christian "naddy" Weisgerber