NAT reliability in light of recent checksum changes

NAT reliability in light of recent checksum changes

Richard Procter
Hi all,

I'm using OpenBSD 5.3 to provide an Alix-based home firewall. Thank
you all for the commitment to elegant, well-documented software which
isn't pernicious to the mental health of its users.

I've a question about the new checksum changes[0], being interested
in such things and having listened to Henning's presentation and
poked around in the archives a little. My understanding is that
checksums are now always recalculated when a header is altered,
never updated.[1]

Is that right and if so has this affected NAT reliability?
Recalculation here would compromise reliable end-to-end transport
as the payload checksum no longer covers the entire network path,
and so break a basic transport layer design principle.[2][3]
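To make the distinction concrete, the two strategies can be sketched in a few
lines of standalone arithmetic (a toy illustration in the style of RFC 1071 /
RFC 1624, not pf's actual code; the function names and the sample packet are
made up):

```python
def internet_checksum(data: bytes) -> int:
    """Full recalculation: one's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def incremental_update(old_cksum: int, old_word: int, new_word: int) -> int:
    """Incremental update for one changed 16-bit word (RFC 1624, Eqn. 3):
    HC' = ~(~HC + ~m + m'). Touches only the changed header word,
    never the payload."""
    total = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A toy "packet": one 16-bit address word followed by two payload words.
packet = b"\x0a\x01" + b"\x13\x37\x2a\x2a"
cksum = internet_checksum(packet)

# NAT rewrites the address word 0x0a01 -> 0xc0a8. When nothing else has
# changed, both strategies agree; the difference is that the incremental
# form never reads the payload.
rewritten = b"\xc0\xa8" + packet[2:]
assert internet_checksum(rewritten) == incremental_update(cksum, 0x0A01, 0xC0A8)
```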

best,
Richard.

[0] http://www.openbsd.org/54.html "Reworked checksum handling for
network protocols."

[1] e.g.
   26:45 slide 27, 'use protocol checksum offloading better'
   http://quigon.bsws.de/papers/2013/EuroBSDcon/mgp00027.html 
   30:51 slide 30, 'consequences in pf'
   http://quigon.bsws.de/papers/2013/EuroBSDcon/mgp00030.html
   https://www.youtube.com/watch?v=AymV11igbLY 
   'The surprising complexity of checksums in TCP/IP'

[2] V. Cerf and R. Kahn, "A Protocol for Packet Network Intercommunication",
IEEE Trans on Comms, Vol COM-22, No 5, May 1974, page 3; emphasis in the original.

> The remainder of the packet consists of text for delivery to the
> destination and a trailing check sum used for end-to-end software
> verification. The GATEWAY does /not/ modify the text and merely
> forwards the check sum along without computing or recomputing it.

[3] RFC 793, "Transmission Control Protocol", page 3. http://www.ietf.org/rfc/rfc793.txt

> The TCP must recover from data that is damaged, lost, duplicated, or
> delivered out of order by the internet communication system. [...]
> Damage is handled by adding a checksum to each segment transmitted,
> checking it at the receiver, and discarding damaged segments.


Re: NAT reliability in light of recent checksum changes

Stuart Henderson
On 2014-01-14, Richard Procter <[hidden email]> wrote:

> Hi all,
>
> I'm using OpenBSD 5.3 to provide an Alix-based home firewall. Thank
> you all for the commitment to elegant, well-documented software which
> isn't pernicious to the mental health of its users.
>
> I've a question about the new checksum changes[0], being interested
> in such things and having listened to Henning's presentation and
> poked around in the archives a little. My understanding is that
> checksums are now always recalculated when a header is altered,
> never updated.[1]
>
> Is that right and if so has this affected NAT reliability?
>
> Recalculation here would compromise reliable end-to-end transport
> as the payload checksum no longer covers the entire network path,
> and so break a basic transport layer design principle.[2][3]

That is exactly what slides 30-33 talk about. PF now checks
the incoming packets before it rewrites the checksum, so it can
reject them if they are broken.

> [1] e.g.
>    26:45 slide 27, 'use protocol checksum offloading better'
>    http://quigon.bsws.de/papers/2013/EuroBSDcon/mgp00027.html 
>    30:51 slide 30, 'consequences in pf'
>    http://quigon.bsws.de/papers/2013/EuroBSDcon/mgp00030.html
>    https://www.youtube.com/watch?v=AymV11igbLY 
>    'The surprising complexity of checksums in TCP/IP'
>
> [2] V. Cerf, R. Kahn, IEEE Trans on Comms, Vol Com-22, No 5 May 1974
> Page 3 in original emphasis.
>
>> The remainder of the packet consists of text for delivery to the
>> destination and a trailing check sum used for end-to-end software
>> verification. The GATEWAY does /not/ modify the text and merely
>> forwards the check sum along without computing or recomputing it.
>
> [3] Page 3. http://www.ietf.org/rfc/rfc793.txt
>
>> The TCP must recover from data that is damaged, lost, duplicated, or
>> delivered out of order by the internet communication system. [...]
>> Damage is handled by adding a checksum to each segment transmitted,
>> checking it at the receiver, and discarding damaged segments.


Re: NAT reliability in light of recent checksum changes

Richard Procter
In reply to this post by Richard Procter
On 2014-01-15, Stuart Henderson <[hidden email]> wrote:

> On 2014-01-14, Richard Procter <[hidden email]> wrote:
>>
>> I've a question about the new checksum changes. [...]
>> My understanding is that checksums are now always recalculated when
>> a header is altered, never updated.
>>
>> Is that right and if so has this affected NAT reliability?
>>
>> Recalculation here would compromise reliable end-to-end transport
>> as the payload checksum no longer covers the entire network path,
>> and so break a basic transport layer design principle.
>
> That is exactly what slides 30-33 talk about. PF now checks
> the incoming packets before it rewrites the checksum, so it can
> reject them if they are broken.

Right -- so NAT now replaces the existing transport checksum
with one newly computed from the payload [0].

This fundamentally weakens its usefulness, though: a correct
checksum now implies only that the payload likely matches
what the last NAT router happened to have in its memory,
whereas the receiver wants to know whether what it got is
what was originally transmitted. In the worst case of NAT on
every intermediate node the transport checksum is
effectively reduced to an adjunct of the link layer
checksum.
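To make this concrete with a toy example (my own illustration; `internet_checksum`
below is plain RFC 1071 arithmetic over made-up data, not pf's code):

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

sent = b"pay the vendor 10 dollars"
sent_cksum = internet_checksum(sent)         # computed by the sender

# A bit flips in the NAT box's memory *after* it verified the inbound
# checksum but before it computes the outbound one:
corrupted = b"pay the vendor 90 dollars"
assert internet_checksum(corrupted) != sent_cksum   # still detectable here

# Regeneration replaces the sender's checksum with one computed over the
# corrupted copy; the receiver's verification now passes:
regenerated = internet_checksum(corrupted)
assert internet_checksum(corrupted) == regenerated  # corruption hidden
```

Had the sender's checksum been forwarded (or incrementally updated), the
receiver's recomputation would still have disagreed with it and the damaged
segment would have been discarded.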

This means transport layer payload integrity is no longer
reliant on the quality of the checksum algorithm alone but
now depends too on the reliability of the path the packet
took through the network.

I think it's great to see someone working hard to simplify
crucial code but in light of the above I believe pf should
always update the checksum, as it did in versions prior to
5.4, as the alternative fundamentally undermines TCP by
making the undetected error rate of its streams unknown and
unbounded. One might argue networks these days are reliable;
I think it better to avoid the need to make the argument.
In any case the work I've found on that question is not
reassuring [1].

best,
Richard.

[0] pf.c 1.863

On initial rule match:
pf_test_rule()
  3445: pf_translate()
     3707: pf_change_ap()
        1677: PF_ACPY [= pf_addrcpy()]
  3461: pf_cksum()
     6775: pd->hdr.tcp->th_sum = 0;
           m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT
           (if orig checksum good)

On subsequent state matching:
pf_test_state()
   ~4445: pf_change_ap() etc
   4471: pf_cksum() etc

[1] "Probably the strongest message of this study is that the
networking hardware is often trashing the packets which are
entrusted to it"

http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf

Jonathan Stone and Craig Partridge. 2000. When the CRC and TCP checksum disagree.
In Proceedings of the conference on Applications, Technologies, Architectures, and
Protocols for Computer Communication (SIGCOMM '00). ACM, New York, NY, USA, 309-319.
DOI=10.1145/347059.347561 http://doi.acm.org/10.1145/347059.347561


Re: NAT reliability in light of recent checksum changes

Henning Brauer
* Richard Procter <[hidden email]> [2014-01-22 06:44]:
> > That is exactly what slides 30-33 talk about. PF now checks
> > the incoming packets before it rewrites the checksum, so it can
> > reject them if they are broken.
> Right -- so NAT now replaces the existing transport checksum
> with one newly computed from the payload [0].

correct - IF the original cksum was right.

> This fundamentally weakens its usefulness, though: a correct
> checksum now implies only that the payload likely matches
> what the last NAT router happened to have in its memory,
> whereas the receiver wants to know whether what it got is
> what was originally transmitted. In the worst case of NAT on
> every intermediate node the transport checksum is
> effectively reduced to an adjunct of the link layer
> checksum.

huh?
we receive a packet with correct cksum -> NAT -> packet goes out with
correct cksum.
we receive a packet with broken cksum -> NAT -> we leave the cksum
alone, i. e. leave it broken.

> I think it's great to see someone working hard to simplify
> crucial code but in light of the above I believe pf should
> always update the checksum, as it did in versions prior to
> 5.4, as the alternative fundamentally undermines TCP by
> making the undetected error rate of its streams unknown and
> unbounded. One might argue networks these days are reliable;
> I think it better to avoid the need to make the argument.
> In any case the work I've found on that question is not
> reassuring [1].

It doesn't seem you know what you are talking about. the cksum is dead
simple, if we had bugs in calculating or verifying it, we really had a
LOT of other problems. There is no "undetected error rate", nothing
really changes there.

--
Henning Brauer, [hidden email], [hidden email]
BS Web Services GmbH, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/


Re: NAT reliability in light of recent checksum changes

Christian Weisgerber
Henning Brauer <[hidden email]> wrote:

> > This fundamentally weakens its usefulness, though: a correct
> > checksum now implies only that the payload likely matches
> > what the last NAT router happened to have in its memory,
> > whereas the receiver wants to know whether what it got is
> > what was originally transmitted.
>
> we receive a packet with correct cksum -> NAT -> packet goes out with
> correct cksum.
> we receive a packet with broken cksum -> NAT -> we leave the cksum
> alone, i. e. leave it broken.

The point Richard may be trying to make is that a packet may be
corrupted in memory on the NAT gateway (e.g. RAM error, buggy code
writing into random location), and that regenerating the checksum
hides such corruption.

--
Christian "naddy" Weisgerber                          [hidden email]


Re: NAT reliability in light of recent checksum changes

Richard Procter
In reply to this post by Henning Brauer
On 22/01/2014, at 7:19 PM, Henning Brauer wrote:

> * Richard Procter <[hidden email]> [2014-01-22 06:44]:
>> This fundamentally weakens its usefulness, though: a correct
>> checksum now implies only that the payload likely matches
>> what the last NAT router happened to have in its memory
>
> huh?
> we receive a packet with correct cksum -> NAT -> packet goes out with
> correct cksum.
> we receive a packet with broken cksum -> NAT -> we leave the cksum
> alone, i. e. leave it broken.

Christian said it better than me: routers may corrupt data
and regenerating the checksum will hide it.

That's more than a theoretical concern. The article I
referenced is a detailed study of real-world traces
co-authored by a member of the Stanford distributed systems
group that concludes "Probably the strongest message of this
study is that the networking hardware is often trashing the
packets which are entrusted to it"[0].

More generally, TCP checksums provide for an acceptable
error rate that is independent of the reliability of the
underlying network[*] by allowing us to verify its workings.
But it's no longer possible to verify network operation if
it may be regenerating TCP checksums, as these may hide
network faults. That's a fundamental change from the scheme
Cerf and Khan emphasized in their design notes for what
became known as TCP:

"The remainder of the packet consists of text for delivery
to the destination and a trailing check sum used for
end-to-end software verification. The GATEWAY does /not/
modify the text and merely forwards the check sum along
without computing or recomputing it."[1]

> It doesn't seem you know what you are talking about. the
> cksum is dead simple, if we had bugs in calculating or
> verifying it, we really had a LOT of other problems.

I'm not saying the calculation is bad. I'm saying it's being
calculated from the wrong copy of the data and by the wrong
device. And it's not just me saying it: I'm quoting the guys
who designed TCP.

> There is no "undetected error rate", nothing really changes
> there.

I disagree. Every TCP stream containing arbitrary data may
have undetected errors as checksums cannot detect all the
errors networks may make (being shorter than the data they
cover). The engineer's task is to make network errors
reliably negligible in practice.

As network regenerated checksums may hide any amount of
arbitrary data corruption I believe it's correct to say the
network error rate undetected by TCP is then "unknown and
unbounded".

best,
Richard.

[*] Under reasonable assumptions of the error modes most likely
in practice. And some applications require lower error rates
than TCP checksums can provide.

[0]
http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf

Jonathan Stone and Craig Partridge. 2000. When the CRC and
TCP checksum disagree.  In Proceedings of the conference on
Applications, Technologies, Architectures, and Protocols for
Computer Communication (SIGCOMM '00). ACM, New York, NY,
USA, 309-319.  DOI=10.1145/347059.347561
http://doi.acm.org/10.1145/347059.347561

[1] "A Protocol for Packet Network Intercommunication"
V. Cerf and R. Kahn, IEEE Trans on Comms, Vol COM-22, No 5,
May 1974, page 3; emphasis in the original.


Re: NAT reliability in light of recent checksum changes

Theo de Raadt
In reply to this post by Richard Procter


Re: NAT reliability in light of recent checksum changes

Simon Perreault-3
In reply to this post by Richard Procter
On 2014-01-25 14:40, Richard Procter wrote:
> I'm not saying the calculation is bad. I'm saying it's being
> calculated from the wrong copy of the data and by the wrong
> device. And it's not just me saying it: I'm quoting the guys
> who designed TCP.

Those guys didn't envision NAT.

If you want end-to-end checksum purity, don't do NAT.

Simon


Re: NAT reliability in light of recent checksum changes

Nick Bender
On Mon, Jan 27, 2014 at 8:19 AM, Simon Perreault <[hidden email]> wrote:

> On 2014-01-25 14:40, Richard Procter wrote:
>
>> I'm not saying the calculation is bad. I'm saying it's being
>> calculated from the wrong copy of the data and by the wrong
>> device. And it's not just me saying it: I'm quoting the guys
>> who designed TCP.
>
> Those guys didn't envision NAT.
>
> If you want end-to-end checksum purity, don't do NAT.
>
> Simon

>
Relying on TCP checksums is risky - they are too weak.

I live at the end of a wireless link that starts at around 7K feet
elevation, goes over a 12K foot ridge, lands on my neighbor's roof at 10K
feet, and then bounces across the street to my house. At one point I was
having lots of issues with data corruption - updates failing, even images
on web pages going technicolor half way through the download. The ISP
ultimately determined there was a bad transmitter and replaced it. The
corruption was so severe that it was overwhelming the TCP checksums, to the
point that, as far as TCP was concerned, it was delivering good data (just
not the same data twice :-). Until they fixed the issue I was able to run a
proxy over ssh which gave me slower but reliable network service.

-N


Re: NAT reliability in light of recent checksum changes

Giancarlo Razzolini-3
On 27-01-2014 14:30, Nick Bender wrote:

> On Mon, Jan 27, 2014 at 8:19 AM, Simon Perreault <[hidden email]> wrote:
>
>> On 2014-01-25 14:40, Richard Procter wrote:
>>
>>> I'm not saying the calculation is bad. I'm saying it's being
>>> calculated from the wrong copy of the data and by the wrong
>>> device. And it's not just me saying it: I'm quoting the guys
>>> who designed TCP.
>>
>> Those guys didn't envision NAT.
>>
>> If you want end-to-end checksum purity, don't do NAT.
>>
>> Simon
>
> Relying on TCP checksums is risky - they are too weak.
>
> I live at the end of a wireless link that starts at around 7K feet
> elevation, goes over a 12K foot ridge, lands at my neighbors roof at 10k
> feet and then bounces across the street to my house. At one point I was
> having lots of issues with data corruption - updates failing, even images
> on web pages going technicolor half way through the download. The ISP
> ultimately determined there was a bad transmitter and replaced it. The
> corruption was so severe that it was overwhelming the TCP checksums to the
> point that as far as TCP was concerned it was delivering good data (just
> not the same data twice :-). Until they fixed the issue I was able to run a
> proxy over ssh which gave me slower but reliable network service.
>
> -N
>
I had the same issue in a different scenario. I traveled to a place
where the internet connection was so slow and so unreliable that almost
no https handshake would ever complete. And yet almost 60% of the
checksums still verified ok. That's why I always have a VPN server
lying around to route my traffic to. In my experience, on very
unreliable connections, a UDP vpn such as openvpn saves the day. NAT
should (and will) have a very slow and painful death. But, then again,
IPv4 has been about to die for more than a decade, and it's still here. I
guess the death will be very, very, very slow.

Cheers,

--
Giancarlo Razzolini
GPG: 4096R/77B981BC


Re: NAT reliability in light of recent checksum changes

Why 42? The lists account.
FWIW, you don't have to be out in the sticks (the backwoods?) to have
a network problem:

    http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html

However, as I understand it, in this case the TCP checksumming worked
and protected the application from the corrupted data.

Cheers,
Robb.


Re: NAT reliability in light of recent checksum changes

Giancarlo Razzolini-3
On 27-01-2014 19:05, Why 42? The lists account. wrote:

> FWIW, you don't have to be out in the sticks (the backwoods?) to have
> a network problem:
>
>     http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html
>
> However, as I understand it, in this case the TCP checksumming worked
> and protected the application from the corrupted data.
>
> Cheers,
> Robb.
>
    I wasn't exactly in the woods, but I had an unreliable 600Kbps ADSL
connection that would get the packets through. But the latency and corruption
were so severe that TLS handshakes would take too long, and even when
complete, the connection wouldn't sustain itself. Anyway, the UDP vpn
improved things quite a bit. This was due, well, to UDP of course, and to
the dynamic compression, reducing the amount of data sent over the wire.

    In the case you pointed to, the TCP checksumming was doing its job,
successfully protecting the application. This kind of thing, where bits
"randomly" flip, shows that computer science can be anything but an
EXACT science. That's one of the reasons why the machines will
(hopefully) always need humans.

Cheers,

--
Giancarlo Razzolini
GPG: 4096R/77B981BC


Re: NAT reliability in light of recent checksum changes

gwes-2
On 01/27/2014 08:07 PM, Giancarlo Razzolini wrote:

> On 27-01-2014 19:05, Why 42? The lists account. wrote:
>> FWIW, you don't have to be out in the sticks (the backwoods?) to have
>> a network problem:
>>
>>      http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html
>>
>> However, as I understand it, in this case the TCP checksumming worked
>> and protected the application from the corrupted data.
>>
>> Cheers,
>> Robb.
>>
>      I wasn't exactly in the woods, but I had an unreliable 600Kbps ADSL
> connection that would get the packets through. But the latency and corruption
> were so severe that TLS handshakes would take too long, and even when
> complete, the connection wouldn't sustain itself. Anyway, the UDP vpn
> improved things quite a bit. This was due, well, to UDP of course, and to
> the dynamic compression, reducing the amount of data sent over the wire.
>
>      In the case you pointed to, the TCP checksumming was doing its job,
> successfully protecting the application. This kind of thing, where bits
> "randomly" flip, shows that computer science can be anything but an
> EXACT science. That's one of the reasons why the machines will
> (hopefully) always need humans.
>
> Cheers,
>
To add to the preceding...
    One client of mine used a CVS repository via coast-to-coast NFS.

    Somewhere in the deeps, the UDP checksum was set to 0 (no checksum).
Somewhere else, one bit in each packet was corrupted.

   If the UDP checksum had been present we would have seen the bad
data a lot sooner. We had to go back at least a month, sometimes more,
to find good data, and then recreate all the edits.

   This scenario shows a danger of silently passing corrupt packets.

   It would be good if, when data protected by a checksum is modified,
the current checksum were validated and some appropriate action taken
(drop? produce an invalid new checksum?) before proceeding.
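That validate-before-modify policy can be sketched schematically (a toy
illustration using a plain RFC 1071 sum; the function names, the rewrite
step, and the sample packet are all made up, and this is not pf's
implementation):

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def nat_forward(packet: bytes, cksum: int, rewrite):
    """Validate first; regenerate the checksum only if it arrived good.
    A packet that arrived broken goes out with its stale checksum, so the
    corruption stays visible to the endpoint (it could equally be dropped
    here)."""
    was_good = internet_checksum(packet) == cksum
    new_packet = rewrite(packet)
    new_cksum = internet_checksum(new_packet) if was_good else cksum
    return new_packet, new_cksum

# Rewrite the leading 16-bit "address" word, as NAT would:
rewrite = lambda p: b"\xc0\xa8" + p[2:]
packet = b"\x0a\x01\x13\x37\x2a\x2a"
good = internet_checksum(packet)

out, out_cksum = nat_forward(packet, good, rewrite)
assert out_cksum == internet_checksum(out)        # good in, good out

out, out_cksum = nat_forward(packet, good ^ 1, rewrite)
assert out_cksum != internet_checksum(out)        # broken in, still broken
```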

Geoff Steckel


Re: NAT reliability in light of recent checksum changes

Richard Procter
In reply to this post by Simon Perreault-3
On 28/01/2014, at 4:19 AM, Simon Perreault wrote:

> On 2014-01-25 14:40, Richard Procter wrote:
>> I'm not saying the calculation is bad. I'm saying it's being
>> calculated from the wrong copy of the data and by the wrong
>> device. And it's not just me saying it: I'm quoting the guys
>> who designed TCP.
>
> Those guys didn't envision NAT.
>
> If you want end-to-end checksum purity, don't do NAT.

Let's look at the options.

The world needs more addresses than IPv4 provides and NAT
gives them to us. There's IPv6, which has about a hundred
million addresses for every bacterium estimated to live on
the planet[0], but it's not looking to replace IPv4 any time
soon. So NAT is here to stay for a good while longer.

Perhaps I can at least stop using NAT on my own network. In
my case I can't but let's assume I do. This eliminates one
source of error. But my TCP streams may still have
now-undetected one-bit errors (at least) if there may be
routers out there regenerating checksums. As long as there
are, good checksums no longer mean as much by themselves and
if I want at least some assurance the network did its job, I
still need some other way (e.g, checking the network path
contains no such routers, either by inspection or
statistically, or by reimplementing an end-to-end checksum
at a higher layer, etc). Regenerated checksums affect me
whether or not I use NAT myself.

Another option is to always update the checksum, as versions
prior to 5.4 did. It's reasonable to ask: well, is that any
more reliable than recomputing it, as 5.4 does? That is, can
the old update code hide payload corruption, too?

In order to hide payload corruption the update code would
have to modify the checksum to exactly account for it. But
that would have to happen by accident, as it never considers
the payload. It's not impossible, but, on the other hand,
checksum regeneration guarantees to hide any bad data.
So updates are more reliable.

A lot more reliable, in fact, as you'd require precisely
those memory errors necessary to in effect compute the
correct update, or some freak fault in the ALU that did the
same thing, or some combination of both. And as that has
nothing to do with the update code it is in principle
possible for non-NAT connections, too. For the hardware,
updates are just an extra load/modify/store and so the
chances of a checksum update hiding a corrupted payload are
in practical terms equivalent to those of normal forwarding.
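To sketch the comparison with toy arithmetic (again RFC 1071 / RFC 1624
style sums over made-up data, not pf code): corrupt the payload on the
router, rewrite the header, and see which strategy lets the receiver notice.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def incremental_update(old_cksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 Eqn. 3: HC' = ~(~HC + ~m + m'). Header-only arithmetic."""
    total = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

header, payload = b"\x0a\x01", b"\x13\x37\x2a\x2a"
sent_cksum = internet_checksum(header + payload)

# The router corrupts one payload bit, then NAT rewrites the header word:
out = b"\xc0\xa8" + b"\x13\x37\x2a\x2b"

# Update: only the header delta is folded in, so the receiver's
# recomputation still disagrees and the damage is detected downstream.
updated = incremental_update(sent_cksum, 0x0A01, 0xC0A8)
assert internet_checksum(out) != updated

# Regeneration: the checksum is recomputed over the corrupted copy, so
# the receiver's verification passes and the damage is hidden.
regenerated = internet_checksum(out)
assert internet_checksum(out) == regenerated
```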

So your statement holds only if checksums are being
regenerated. In general, NAT needn't compromise end-to-end
TCP payload checksum integrity, and in versions prior to
5.4, it didn't.

best,
Richard.


[0] "Prokaryotes: The unseen majority"
    Proc Natl Acad Sci U S A. 1998 June 9; 95(12): 6578–6583.
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC33863/

     2^128 IPv6 addresses = ~ 10^38

     ~ 10^38 IPv6 addresses / ~ 10^30 bacteria cells
     = ~ 10^8 addresses per cell.

[1] RFC 1071 "Computing the Internet Checksum" p21:
    "If anything, [this end-to-end property] is the most powerful
     feature of the TCP checksum!" Page 15 also touches on
     the end-to-end preserving properties of checksum update.


Re: NAT reliability in light of recent checksum changes

Simon Perreault-2
In reply to this post by gwes-2
On 2014-01-27 21:21, Geoff Steckel wrote:
>     It would be good if, when data protected by a checksum is modified,
> the current checksum were validated and some appropriate action taken
> (drop? produce an invalid new checksum?) before proceeding.

This is exactly what's being done. Don't you listen when Henning speaks?

Simon


Re: NAT reliability in light of recent checksum changes

Simon Perreault-2
In reply to this post by Richard Procter
On 2014-01-28 03:39, Richard Procter wrote:
> In order to hide payload corruption the update code would
> have to modify the checksum to exactly account for it. But
> that would have to happen by accident, as it never considers
> the payload. It's not impossible, but, on the other hand,
> checksum regeneration guarantees to hide any bad data.
> So updates are more reliable.

This analysis is bullshit. You need to take into account the fact that
checksums are verified before regenerating them. That is, you need to
compare a) verifying + regenerating vs b) updating. If there's an
undetectable error, you're going to propagate it no matter whether you
do a) or b).

Simon


Re: NAT reliability in light of recent checksum changes

Stuart Henderson
On 2014-01-28, Simon Perreault <[hidden email]> wrote:

> On 2014-01-28 03:39, Richard Procter wrote:
>> In order to hide payload corruption the update code would
>> have to modify the checksum to exactly account for it. But
>> that would have to happen by accident, as it never considers
>> the payload. It's not impossible, but, on the other hand,
>> checksum regeneration guarantees to hide any bad data.
>> So updates are more reliable.
>
> This analysis is bullshit. You need to take into account the fact that
> checksums are verified before regenerating them. That is, you need to
> compare a) verifying + regenerating vs b) updating. If there's an
> undetectable error, you're going to propagate it no matter whether you
> do a) or b).
>
> Simon
>
>

Checksums are, in many cases, only verified *on the NIC*.

Consider this scenario, which has happened in real life.

- NIC supports checksum offloading, verified checksum is OK.

- PCI transfers are broken (in my case this affected multiple machines
of a certain type, so most likely a motherboard bug), causing some
corruption in the payload, but the machine won't detect it because
it doesn't look at checksums itself, it just trusts the NIC's "rx csum
good" flag.

In this situation, corrupt packets which have been NATted now get a
new, valid checksum, so the final endpoint cannot detect the breakage.

I'm not sure if this is common enough to be worth worrying about
here, but the analysis is not bullshit.
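Stuart's scenario can be sketched in a few lines (Python; a toy model with hypothetical helper names, not pf code). A checksum regenerated over corrupted bytes is valid by construction, so the receiver's end-to-end check passes; an incrementally-updated checksum still covers the sender's original bytes, so the same check fails and the damage is caught:

```python
def csum16(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def endpoint_accepts(payload: bytes, cksum: int) -> bool:
    """Receiver-side verification: recompute over the received bytes and compare."""
    return csum16(payload) == cksum

payload = b"important transaction data"
sender_cksum = csum16(payload)

# Corruption on the PCI bus, after the NIC has already reported "rx csum good":
corrupted = bytearray(payload)
corrupted[3] ^= 0xFF
corrupted = bytes(corrupted)

# A middlebox that only *updates* the checksum leaves the sender's value
# covering the payload, so the receiver still catches the damage:
assert not endpoint_accepts(corrupted, sender_cksum)

# A middlebox that *regenerates* computes a fresh checksum over whatever
# bytes it holds; the receiver's check passes and the corruption is hidden:
assert endpoint_accepts(corrupted, csum16(corrupted))
```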


Re: NAT reliability in light of recent checksum changes

Giancarlo Razzolini-3
Em 28-01-2014 15:45, Stuart Henderson escreveu:

> On 2014-01-28, Simon Perreault <[hidden email]> wrote:
>> Le 2014-01-28 03:39, Richard Procter a écrit :
>>> In order to hide payload corruption the update code would
>>> have to modify the checksum to exactly account for it. But
>>> that would have to happen by accident, as it never considers
>>> the payload. It's not impossible, but, on the other hand,
>>> checksum regeneration guarantees to hide any bad data.
>>> So updates are more reliable.
>> This analysis is bullshit. You need to take into account the fact that
>> checksums are verified before regenerating them. That is, you need to
>> compare a) verifying + regenerating vs b) updating. If there's an
>> undetectable error, you're going to propagate it no matter whether you
>> do a) or b).
>>
>> Simon
>>
>>
> Checksums are, in many cases, only verified *on the NIC*.
>
> Consider this scenario, which has happened in real life.
>
> - NIC supports checksum offloading, verified checksum is OK.
>
> - PCI transfers are broken (in my case it affected multiple machines
> of a certain type, so most likely a motherboard bug), causing some
> corruption in the payload, but the machine won't detect them because
> it doesn't look at checksums itself, just trusts the NIC's "rx csum
> good" flag.
>
> In this situation, packets which have been NATted that are corrupt
> now get a new checksum that is valid; so the final endpoint can not
> detect the breakage.
>
> I'm not sure if this is common enough to be worth worrying about
> here, but the analysis is not bullshit.
>
Stuart,

    It is more common than you might think. I had some gigabit
motherboards, certain models of which would always corrupt packets when
using the onboard NIC. I believe there isn't much the OS can do in these
cases. Unfortunately, it's always the application's job to detect
whether it is receiving good or bad data.

Cheers,

--
Giancarlo Razzolini
GPG: 4096R/77B981BC


Re: NAT reliability in light of recent checksum changes

Simon Perreault-2
In reply to this post by Stuart Henderson
Le 2014-01-28 12:45, Stuart Henderson a écrit :

>> This analysis is bullshit. You need to take into account the fact that
>> checksums are verified before regenerating them. That is, you need to
>> compare a) verifying + regenerating vs b) updating. If there's an
>> undetectable error, you're going to propagate it no matter whether you
>> do a) or b).
>
> Checksums are, in many cases, only verified *on the NIC*.
>
> Consider this scenario, which has happened in real life.
>
> - NIC supports checksum offloading, verified checksum is OK.
>
> - PCI transfers are broken (in my case it affected multiple machines
> of a certain type, so most likely a motherboard bug), causing some
> corruption in the payload, but the machine won't detect them because
> it doesn't look at checksums itself, just trusts the NIC's "rx csum
> good" flag.
>
> In this situation, packets which have been NATted that are corrupt
> now get a new checksum that is valid; so the final endpoint can not
> detect the breakage.
>
> I'm not sure if this is common enough to be worth worrying about
> here, but the analysis is not bullshit.

You're right. I was in the rough, sorry, and thanks for the explanation.
I don't think this scenario is worth worrying about though.

Simon


Re: NAT reliability in light of recent checksum changes

Henning Brauer
In reply to this post by Richard Procter
* Richard Procter <[hidden email]> [2014-01-25 20:41]:

> On 22/01/2014, at 7:19 PM, Henning Brauer wrote:
> > * Richard Procter <[hidden email]> [2014-01-22 06:44]:
> >> This fundamentally weakens its usefulness, though: a correct
> >> checksum now implies only that the payload likely matches
> >> what the last NAT router happened to have in its memory
> > huh?
> > we receive a packet with correct cksum -> NAT -> packet goes out with
> > correct cksum.
> > we receive a packet with broken cksum -> NAT -> we leave the cksum
> > alone, i. e. leave it broken.
> Christian said it better than me: routers may corrupt data
> and regenerating the checksum will hide it.

if that happened, we'd have much bigger problems than NAT.

--
Henning Brauer, [hidden email], [hidden email]
BS Web Services GmbH, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/
