Low latency High Frequency Trading


Dan Shechter-2
Hi All,

<current situation>
A Windows 2008 server is receiving TCP traffic from a stock exchange
and sends it, almost as is, using UDP multicast to automated high
frequency traders.

StockExchange --TCP---> windows2008 ---MCAST-UDP---->

On average, the time it takes to do the TCP to UDP translation, using
Winsock, is 240 microseconds. It can even be as high as 60,000
microseconds.
</current situation>

<my idea>
1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
box with two NICs. One for the TCP, the other for the multicast UDP.
2. Put the TCP port in promiscuous mode.
3. Write my TCP->UDP logic directly into ether_input.c
</my idea>

Now for the questions:
1. Am I on the right track? or in other words how crazy is my idea?
2. What would be the latency? Can I achieve 50 microseconds between
getting the interrupt and sending the new packet through the
NIC?
3. Which NIC/CPU/Memory should I use? Money is not a problem.

Thanks,
Dan


Re: Low latency High Frequency Trading

Johan Beisser
On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter <[hidden email]> wrote:

> Hi All,
>
> <current situation>
> A Windows 2008 server is receiving TCP traffic from a stock exchange
> and sends it, almost as is, using UDP multicast to automated high
> frequency traders.
>
> StockExchange --TCP---> windows2008 ---MCAST-UDP---->
>
> On average, the time it takes to do the TCP to UDP translation, using
> Winsock, is 240 microseconds. It can even be as high as 60,000
> microseconds.
> </current situation>
>
> <my idea>
> 1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
> box with two NICs. One for the TCP, the other for the multicast UDP.

You'll incur an extra penalty offloading to the kernel. Winsock is
already doing that, though.

> 2. Put the TCP port in promiscuous mode.

Why? You can just set up the right bits to listen to on the network,
and pull raw frames to be processed. Or, just let the network stack
behave as it should.

> 3. Write my TCP->UDP logic directly into ether_input.c

Any reason to not use pf for this translation?

> </my idea>
>
> Now for the questions:
> 1. Am I on the right track? or in other words how crazy is my idea?

Pretty crazy. You may want to see if there are hardware-accelerated or
on-NIC TCP offload options instead.

> 2. What would be the latency? Can I achieve 50 microseconds between
> getting the interrupt and sending the new packet through the
> NIC?

See above. You'll end up having to do some tuning.

> 3. Which NIC/CPU/Memory should I use? Money is not a problem.

Custom order a few NICs, hire a developer to write a driver to offload
TCP/UDP on the NIC, and enable as little kernel interference as
possible.

Money's not a problem, right?


Re: Low latency High Frequency Trading

Ariel Burbaickij
If money is not a problem -- go buy high-frequency-trading-on-a-chip
solutions and get sub-microsecond resolution.

http://lmgtfy.com/?q=high+frequency+trading+FPGA


Re: Low latency High Frequency Trading

Johan Beisser
On Thu, Nov 8, 2012 at 9:58 AM, Ariel Burbaickij
<[hidden email]> wrote:
> If money is not a problem -- go buy high-trading on the chip solutions and
> have sub-microsecond resolution.
>
> http://lmgtfy.com/?q=high+frequency+trading+FPGA

I'd love to see PF offloading on to something like that. Not that I
can justify the expense for my work, but it'd be useful.


Re: Low latency High Frequency Trading

Ariel Burbaickij
I know you may have the impression that I am getting caustic :-) but these
ideas are pretty obvious once money is not a problem, so:

http://en.wikipedia.org/wiki/Netronome

IXPs on steroids.



Re: Low latency High Frequency Trading

Dan Shechter-2
In reply to this post by Johan Beisser
For unrelated reasons, I can't directly receive the TCP stream.

I must copy the TCP data from a running stream to another server. I
can use a tap or just port mirroring on the switch. So I can't use any
network stack or leverage any offloading.

I also need to modify the received data, and add a few application
headers before sending it as a multicast UDP stream.
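
A minimal userland sketch of that last step (prefix each payload with an
application header, then send it to a multicast group). The header layout,
the MAGIC value and the function names are made up for illustration; nothing
here comes from the real feed format, and this is plain socket code, not the
in-kernel path discussed in this thread.

```python
import socket
import struct

# Hypothetical application header: 4-byte magic, 8-byte sequence number,
# 2-byte payload length, all in network byte order (14 bytes total).
HEADER = struct.Struct("!4sQH")
MAGIC = b"FEED"  # placeholder value, not from the thread


def wrap(seq: int, payload: bytes) -> bytes:
    """Prefix payload with the illustrative application header."""
    return HEADER.pack(MAGIC, seq, len(payload)) + payload


def open_multicast_sender(ttl: int = 1) -> socket.socket:
    """Create a UDP socket configured for sending to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL is passed as a single unsigned byte, which BSD-derived stacks expect.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    return sock


def publish(sock: socket.socket, group: str, port: int,
            seq: int, payload: bytes) -> None:
    """Send one wrapped payload as a single UDP datagram."""
    sock.sendto(wrap(seq, payload), (group, port))
```

Usage would look like publish(open_multicast_sender(), group, port, seq, data),
with the group address and port chosen by whoever runs the feed.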

Winsock is userland. What I want to do is in the kernel, even before
ip_input. I guess it should be faster.

I am looking at NetFPGA too, but prefer to do this in software.





Best regards,
Dan



Re: Low latency High Frequency Trading

Dan Shechter-2
In reply to this post by Ariel Burbaickij
When I was saying money is not a problem, it was related to server
component costs... :)
Best regards,
Dan



Re: Low latency High Frequency Trading

Ariel Burbaickij
They are all available with a PCI Express interface, no worries, so you will
be able to plug them straight into your server.
Alternatively, how about going for the second option of making a living in
this business :-) ?


Re: Low latency High Frequency Trading

Diana Eichert
In reply to this post by Dan Shechter-2
take a look at Tilera TileGX boards
(you better hire a s/w developer.)


Re: Low latency High Frequency Trading

William Ahern-2
In reply to this post by Dan Shechter-2
On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:

> For unrelated reasons, I can't directly receive the TCP stream.
>
> I must copy the TCP data from a running stream to another server. I
> can use tap or just port-mirroring on the switch. So I can't use any
> network stack or leverage any offloading.
>
> I also need to modify the received data, and add few application
> headers before sending it as a multicast udp stream.
>
> Winsock is userland. What I want to do is in the kernel, even before
> ip_input. I guess it should be faster.
>
> I am looking at netFPGA too, but prefer to do this in software.
>

You might want to try this:

        http://info.iet.unipi.it/~luigi/netmap/

It's FreeBSD and Linux only, though.

The emerging solution for high performance traffic routers like this is to
have one or more threads loop in userspace over a memory mapped NIC buffer.
Most of these interfaces are highly proprietary. Netmap provides the
relative programmatic simplicity of a TAP-type interface with the zero-copy
performance of the mapped buffering.
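
To make the "thread looping over a mapped buffer" idea concrete, here is a
toy model in Python. It is not netmap's API: RxRing, produce() and poll() are
names invented for this sketch, and the "NIC" is just another method filling
slots. What it shows is the shape of the technique: one preallocated buffer,
head/tail indices, and a consumer that hands out views instead of copies.

```python
class RxRing:
    """Toy model of a NIC receive ring shared by a producer and a consumer."""

    def __init__(self, num_slots: int, slot_size: int = 2048):
        self.num_slots = num_slots
        self.slot_size = slot_size
        # One contiguous preallocated buffer, as a mapped ring would be.
        self.buf = memoryview(bytearray(num_slots * slot_size))
        self.lengths = [0] * num_slots
        self.head = 0  # next slot the consumer will read
        self.tail = 0  # next slot the producer (the "NIC") will fill

    def produce(self, frame: bytes) -> bool:
        """Place a frame in the ring; return False if it had to be dropped."""
        nxt = (self.tail + 1) % self.num_slots
        if nxt == self.head or len(frame) > self.slot_size:
            return False  # ring full (one slot is kept empty) or oversized
        off = self.tail * self.slot_size
        self.buf[off:off + len(frame)] = frame
        self.lengths[self.tail] = len(frame)
        self.tail = nxt
        return True

    def poll(self, handler) -> int:
        """Drain every ready slot, passing a zero-copy view to handler."""
        handled = 0
        while self.head != self.tail:
            off = self.head * self.slot_size
            handler(self.buf[off:off + self.lengths[self.head]])
            self.head = (self.head + 1) % self.num_slots
            handled += 1
        return handled
```

A real consumer would call poll() in a tight loop on a dedicated core; the
handler must finish with the view before returning, since the slot is reused.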


Re: Low latency High Frequency Trading

Tomas Bodzar-4
In reply to this post by Diana Eichert
On Thu, Nov 8, 2012 at 8:55 PM, Diana Eichert <[hidden email]> wrote:
> take a look at Tilera TileGX boards
> (you better hire a s/w developer.)
>

Some company is already working on that
http://mail-index.netbsd.org/netbsd-users/2012/10/31/msg011803.html


Re: Low latency High Frequency Trading

Diana Eichert
On Fri, 9 Nov 2012, Tomas Bodzar wrote:

> On Thu, Nov 8, 2012 at 8:55 PM, Diana Eichert <[hidden email]> wrote:
>> take a look at Tilera TileGX boards
>> (you better hire a s/w developer.)
>>
>
> Some company is already working on that
> http://mail-index.netbsd.org/netbsd-users/2012/10/31/msg011803.html

Porting an O/S is not software development.  The Tilera stuff is
different.

FWIW, I already knew about the NBSD stuff, but chose not to post
about other O/S on an OpenBSD list.

diana

Past hissy-fits are not a predictor of future hissy-fits.
Nick Holland(06 Dec 2005)


Re: Low latency High Frequency Trading

Ryan McBride-3
In reply to this post by Dan Shechter-2
My immediate reaction is "don't do it", but on the other hand I've never
known people for whom 'money is not a problem' to shy away from
something because of boring concerns like security. So...


Software:

Basically, to do this "correctly" you need to parse all the packets
running in both directions between the two endpoints, tracking the acks
and correctly emulating the behaviour of the TCP stacks on both sides to
determine what is valid data to convert to UDP.

Things to think about:

- IP fragment reassembly
- duplicate packets
- out of order packets
- lost packets
- TCP resends
- TCP checksums
- IP checksums
- TCP sequence number validation
- etc, etc.

Look at pf_normalise_state_tcp() in pf_norm.c and pf_test_state_tcp() in
pf.c for a small taste of the scope of what you're considering if you
want to write this in the kernel.  Further examples for TCP reassembly
could be found in the source code for ports/net/snort or
ports/net/tcpflow.
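
For a feel of the smallest piece of that list, here is a sketch of
one-directional stream reassembly that copes with duplicates, resends,
overlaps, out-of-order segments and 32-bit sequence wraparound. It is an
illustration only: StreamReassembler and seq_diff are hypothetical names, it
ignores ACK tracking, SYN/FIN/RST and window checks, and it waits forever for
a lost segment (a passive tap cannot ask for a retransmit).

```python
MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def seq_diff(a: int, b: int) -> int:
    """Signed distance a - b in 32-bit sequence space."""
    d = (a - b) % MOD
    return d - MOD if d >= (1 << 31) else d


class StreamReassembler:
    """Rebuild one direction of a TCP byte stream from observed segments."""

    def __init__(self, initial_seq: int):
        self.next_seq = initial_seq % MOD  # sequence number of next wanted byte
        self.pending = {}                  # seq -> payload held for later

    def feed(self, seq: int, payload: bytes) -> bytes:
        """Accept one segment; return the bytes that are now in order."""
        if not payload:
            return b""
        offset = seq_diff(self.next_seq, seq)
        if offset >= len(payload):
            return b""  # entirely old data: duplicate or resend
        if offset < 0:
            # Segment is ahead of the stream: keep the longest copy seen.
            held = self.pending.get(seq)
            if held is None or len(payload) > len(held):
                self.pending[seq] = payload
            return b""
        out = bytearray(payload[offset:])  # skip any already-delivered prefix
        self.next_seq = (self.next_seq + len(out)) % MOD
        progressed = True
        while progressed:  # drain held segments that have become contiguous
            progressed = False
            for s in list(self.pending):
                off = seq_diff(self.next_seq, s)
                if off < 0:
                    continue  # still ahead of the stream
                data = self.pending.pop(s)
                if off < len(data):
                    out += data[off:]
                    self.next_seq = (self.next_seq + len(data) - off) % MOD
                    progressed = True
        return bytes(out)
```

Even this toy needs a policy for how long to hold pending data and what to do
when a gap never fills, which is where most of the real complexity lives.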

Of course you can take some shortcuts if you assume that the data you're
getting is clean, and even more if you don't have to parse the TCP
stream but can handle each individual TCP packet as an individual
payload. Perhaps your current problematic implementation already does
this? If so, it's also probably trivial to inject bogus data into the
stream and have it accepted. Maybe that's a feature.
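
The per-packet shortcut could look roughly like the sketch below: take a raw
frame, check the IPv4 header checksum, skip fragments, and pull out the TCP
sequence number and payload. Assumptions: untagged Ethernet II frames, IPv4
only, and no TCP checksum or sequence validation at all, which is exactly why
injected data would be accepted. The function names are invented for this
example.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones' complement of the ones' complement sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def tcp_payload(frame: bytes):
    """Return (seq, payload) for an Ethernet/IPv4/TCP frame, else None."""
    if len(frame) < 14 + 20 or frame[12:14] != b"\x08\x00":
        return None  # too short, or EtherType is not IPv4
    ihl = (frame[14] & 0x0F) * 4
    if frame[14] >> 4 != 4 or ihl < 20 or len(frame) < 14 + ihl:
        return None
    ip = frame[14:14 + ihl]
    if internet_checksum(ip) != 0:
        return None  # a valid header sums to zero with its checksum included
    total_len = struct.unpack("!H", ip[2:4])[0]
    frag = struct.unpack("!H", ip[6:8])[0]
    if ip[9] != 6 or frag & 0x3FFF:
        return None  # not TCP, or a fragment (MF set or nonzero offset)
    if total_len < ihl + 20 or 14 + total_len > len(frame):
        return None
    tcp = frame[14 + ihl:14 + total_len]  # total_len also trims Ethernet padding
    seq = struct.unpack("!I", tcp[4:8])[0]
    data_off = (tcp[12] >> 4) * 4
    if data_off < 20 or data_off > len(tcp):
        return None
    return seq, tcp[data_off:]
```

Anything this returns is only as trustworthy as the link it was captured on.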

Remember: Lots of attacks can be performed against this hacked up
monstrosity unless everything is exactly perfect. Good luck with the
frankenstein code, it's not supported.


Hardware:

- NIC: something that allows you to adjust the interrupt rate, e.g. em,
  bnx. On the other hand if the packet rate is not too high a cheaper
  network card without any bells and whistles might give you better
  performance (less overhead in the interrupt handler). I'd say you'd be
  best off buying a bunch and testing them.

- CPU: maximum SINGLE CORE "turbo" speed. Disable the other cores,
  they're not helping you at all; in theory you want the biggest,
  fastest cache possible, but perhaps not necessary depending on how much
  software you're running.

- Fast RAM might help, but you don't need much. Probably the minimum you
  can get in a board with the above CPU.

Also, remember to use the shortest patch cables possible, to reduce
signal propagation latency.
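
For scale, a back-of-the-envelope calculation of what cable length costs. The
velocity factor of 0.65 is an assumed round figure for twisted-pair cable
(real cables vary), and the 50 microsecond budget is the target from the
original question.

```python
C = 299_792_458  # speed of light in vacuum, in metres per second


def propagation_ns(length_m: float, velocity_factor: float = 0.65) -> float:
    """Signal propagation delay over a cable, in nanoseconds.

    velocity_factor is an assumption: the fraction of c at which the
    signal travels in the cable.
    """
    return length_m / (C * velocity_factor) * 1e9


def share_of_budget(length_m: float, budget_us: float = 50.0) -> float:
    """Fraction of a latency budget (microseconds) spent in the cable."""
    return propagation_ns(length_m) / (budget_us * 1000.0)
```

Under that assumption a metre of cable costs roughly 5 ns, so a few metres
stay well below a tenth of a percent of a 50 microsecond budget.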




Re: Low latency High Frequency Trading

Ariel Burbaickij
What is the rationale behind this statement:


"...
- CPU: maximum SINGLE CORE "turbo" speed. Disable the other cores,
  they're not helping you at all..."?

/wbr
Ariel Burbaickij


Re: Low latency High Frequency Trading

Ryan McBride-3
On Fri, Nov 09, 2012 at 04:14:28PM +0100, Ariel Burbaickij wrote:
> What is the rationale behind this statement:
>
>
> "...
> - CPU: maximum SINGLE CORE "turbo" speed. Disable the other cores,
>   they're not helping you at all..."?

OpenBSD doesn't run multiprocessor inside the kernel, so SMP provides no
benefit. Even if it did, the overhead of SMP would probably be a net
loss for this particular workload.


Re: Low latency High Frequency Trading

Ariel Burbaickij
Ah OK, as several other architectures/OSes were thrown around in this
thread I did not immediately understand that you were talking
specifically about the OpenBSD context. Thank you for the clarification.


Re: Low latency High Frequency Trading

Christian Weisgerber
In reply to this post by Ryan McBride-3
Ryan McBride <[hidden email]> wrote:

> Also, remember to use the shortest patch cables possible, to reduce
> signal propagation latency.

More seriously, is there an appreciable latency difference between
copper and fiber PHYs?

--
Christian "naddy" Weisgerber                          [hidden email]


Re: Low latency High Frequency Trading

Dan Shechter-2
In reply to this post by Ryan McBride-3
Hi Ryan,

Thanks for the detailed answer.

I can make some assumptions regarding the TCP flow and its origins. It's
coming from the stock exchange over IPsec gateways over leased lines.
I think I can trust the origin of the flow. At least I can trust it as
much as the off-the-shelf software does.

When I was saying money is not a problem, I was referring to the cost
of the server to run this.

I know that I need to implement a state machine for the TCP session and
keep some buffers for out-of-order packets.

Do you think the right place to put the code is in ether_input.c?

It's about 1k packets per second max.

I plan to coil the patch cable to make an electrical field surrounding my
device to protect it from evil.
Best regards,
Dan


On Fri, Nov 9, 2012 at 4:47 PM, Ryan McBride <[hidden email]> wrote:

> My immediate reaction is "don't do it", but on the other hand I've never
> known people for whom 'money is not a problem' to shy away from
> something because of boring concerns like security. So...
>
>
> Software:
>
> Basically, to do this "correctly" you need to parse all the packets
> running in both directions between the two endpoints, tracking the acks
> and correctly emulating the behaviour of the TCP stacks on both sides to
> determine what is valid data to convert to UDP.
>
> Things to think about:
>
> - IP fragment reassembly
> - duplicate packets
> - out of order packets
> - lost packets
> - TCP resends
> - TCP checksums
> - IP checksums
> - TCP sequence number validation
> - etc, etc.
>
> Look at pf_normalise_state_tcp() in pf_norm.c and pf_test_state_tcp() in
> pf.c for a small taste of the scope of what you're considering if you
> want to write this in the kernel.  Further examples for TCP reassembly
> could be found in the source code for ports/net/snort or
> ports/net/tcpflow.
>
> Of course you can take some shortcuts if you assume that the data you're
> getting is clean, and even more if you don't have to parse the TCP
> stream but can handle each individual TCP packet as an individual
> payload. Perhaps your current problematic implementation already does
> this? If so, it's also probably trivial to inject bogus data into the
> stream and have it accepted. Maybe that's a feature.
>
> Remember: Lots of attacks can be performed against this hacked up
> monstrosity unless everything is exactly perfect. Good luck with the
> frankenstein code, it's not supported.
>
>
> Hardware:
>
> - NIC: something that allows you to adjust the interrupt rate, e.g. em,
>   bnx. On the other hand if the packet rate is not too high a cheaper
>   network card without any bells and whistles might give you better
>   performance (less overhead in the interrupt handler). I'd say you'd be
>   best off buying a bunch and testing them.
>
> - CPU: maximum SINGLE CORE "turbo" speed. Disable the other cores,
>   they're not helping you at all; in theory you want the biggest,
>   fastest cache possible, but perhaps not necessary depending on how much
>   software you're running.
>
> - Fast RAM might help, but you don't need much. probably the minimum you
>   can get in a board with the above CPU.
>
> Also, remember to use the shortest patch cables possible, to reduce
> signal propagation latency.
>
>
>
> On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:
>> For unrelated reasons, I can't directly receive the TCP stream.
>>
>> I must copy the TCP data from a running stream to another server. I
>> can use tap or just port-mirroring on the switch. So I can't use any
>> network stack or leverage any offloading.
>>
>> I also need to modify the received data, and add few application
>> headers before sending it as a multicast udp stream.
>>
>> Winsock is userland. What I want to do is in the kernel, even before
>> ip_input. I guess it should be faster.
>>
>>
>> On Thu, Nov 8, 2012 at 7:36 PM, Johan Beisser <[hidden email]> wrote:
>> > On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter <[hidden email]> wrote:
>> >> Hi All,
>> >>
>> >> <current situation>
>> >> A windows 2008 server is receiving TCP traffic from a stock exchange
>> >> and sends it, almost as is, using UDP multicast to automated high
>> >> frequancy traders.
>> >>
>> >> StockExchange --TCP---> windows2008 ---MCAST-UDP---->
>> >>
>> >> On average, the time it take to do the TCP to UDP translation, using
>> >> winsock, is 240 micro seconds. It can even be as high as 60,000 micro
>> >> seconds.
>> >> </current situation>
>> >>
>> >> <my idea>
>> >> 1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
>> >> box with two NICs. One for the TCP, the other for the multicast UDP.
>> >
>> > You'll incur an extra penalty offloading to the kernel. Winsock is
>> > already doing that, though.
>> >
>> >> 2. Put the TCP port in a promiscuous mode.
>> >
>> > Why? You can just set up the right bits to listen to on the network,
>> > and pull raw frames to be processed. Or, just let the network stack
>> > behave as it should.
>> >
>> >> 3. Write my TCP->UDP logic directly into ether_input.c
>> >
>> > Any reason to not use pf for this translation?
>> >
>> >> </my idea>
>> >>
>> >> Now for the questions:
>> >> 1. Am I on the right track? or in other words how crazy is my idea?
>> >
>> > Pretty crazy. You may want to see if there's hardware accelerated or
>> > on NIC TCP off-load options instead.
>> >
>> >> 2. What would be the latency? Can I achieve 50 microseconds between
>> >> getting the interrupt and until sending the new packet through the
>> >> NIC?
>> >
>> > See above. You'll end up having to do some tuning.
>> >
>> >> 3. Which NIC/CPU/Memory should I use? Money is not a problem.
>> >
>> > Custom order a few NICs, hire a developer to write a driver to offload
>> > TCP/UDP on the NIC, and enable as little kernel interference as
>> > possible.
>> >
>> > Money's not a problem, right?


Re: Low latency High Frequency Trading

Ryan McBride-3
On Fri, Nov 09, 2012 at 06:27:06PM +0200, Dan Shechter wrote:
> I can make some assumptions regarding the TCP flow and its origins. It's
> coming from the stock exchange over IPsec gateways over leased lines.
> I think I can trust the origin of the flow. At least I can trust it as
> much as the off the shelf software does.

If something goes wrong with the off-the-shelf software, you can blame
the vendor. Your own hand-rolled solution, not so much...


> When I was saying money is not a problem, I was referring to the cost
> of the server to run this.

I know, you said that already. But I've worked in this industry also and
I am well aware of the reality distortion that occurs when gambling with
billions of dollars of other people's imaginary money.


> I know that I need to implement state machine for the TCP session and
> keep some buffers for out of order packets.
>
> Do you think the right place to place the code is in ether_input.c?

I guess you mean either sys/net/if_ethersubr.c, or the
ether_input() function inside that file, but either way the bulk of your
code should go in a separate file if you don't want a maintenance
nightmare.


> It's about 1k packets per second max.

This should be doable on all but the most ancient hardware. But you will
need to consider how you want to handle bursts or anomalies: is it more
important to never lose a packet, or is it acceptable to lose some
number of packets in order to keep latency low?


In the former case you need to rely on buffers; you will want to move
packets off the interface receive ring and into another buffer like the
IP input queue as quickly as possible, then do your magic in software
interrupt context. You'll also want to look at disabling the MCLGETI
functionality in the network card driver, and possibly increasing the
receive ring size on the card.

Here you probably want to hook your code at the very beginning
of ipv4_input().
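
To illustrate the "move it off the ring quickly, process it later" idea,
here is a minimal user-space sketch in C. It is not OpenBSD kernel code
and does not use the kernel's own queue structures; the names
(pkt_queue, pktq_put, pktq_get) and the sizes are hypothetical. The
producer side stands in for the receive interrupt and the consumer side
for the deferred (software interrupt) work. It is single-threaded as
written; a real producer/consumer split would need proper
synchronization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PKT_MAX   1600  /* hypothetical maximum frame size in bytes */
#define QUEUE_LEN 256   /* hypothetical depth; must be a power of two */

struct pkt {
    uint16_t len;
    uint8_t  data[PKT_MAX];
};

struct pkt_queue {
    struct pkt slots[QUEUE_LEN];
    uint32_t head;   /* total packets written (producer side) */
    uint32_t tail;   /* total packets read (consumer side) */
    uint64_t drops;  /* packets refused because the queue was full */
};

void pktq_init(struct pkt_queue *q)
{
    /* Index masking below only works for power-of-two depths. */
    assert((QUEUE_LEN & (QUEUE_LEN - 1)) == 0);
    memset(q, 0, sizeof(*q));
}

/* "Interrupt" side: copy the frame out and return immediately.
 * Returns 0 on success, -1 if the frame is oversized or the queue is full. */
int pktq_put(struct pkt_queue *q, const uint8_t *buf, size_t len)
{
    if (len > PKT_MAX || q->head - q->tail == QUEUE_LEN) {
        q->drops++;
        return -1;
    }
    struct pkt *p = &q->slots[q->head & (QUEUE_LEN - 1)];
    memcpy(p->data, buf, len);
    p->len = (uint16_t)len;
    q->head++;
    return 0;
}

/* Deferred side: fetch the oldest frame.
 * Returns its length, or -1 if the queue is empty or out is too small. */
int pktq_get(struct pkt_queue *q, uint8_t *out, size_t outcap)
{
    if (q->head == q->tail)
        return -1;
    const struct pkt *p = &q->slots[q->tail & (QUEUE_LEN - 1)];
    if (p->len > outcap)
        return -1;
    memcpy(out, p->data, p->len);
    q->tail++;
    return p->len;
}
```

The drops counter is the thing to watch: if it ever moves, the buffer
is too small for your bursts and you are no longer in the "never lose a
packet" case.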


In the latter case, you'll want to reduce the receive ring size on the
interface and handle the bulk of your processing in the hardware
interrupt. Fragment reassembly, handling TCP resends and out-of-order
packets, etc. may no longer be useful (since they require buffering
packets), and you may opt to simply drop data that doesn't arrive in
correct sequence.

This is when ether_input() would be the right place to hook your code.
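
As a rough sketch of what "drop data that doesn't arrive in correct
sequence" could look like, the C fragment below tracks the next expected
TCP sequence number and keeps no packet buffer at all. The policy is an
assumption on my part, not something taken from any existing code:
segments that start before the expected sequence number (resends, late
arrivals, partial overlaps) are dropped, and a segment that starts ahead
of it is forwarded and counted as a gap so the tracker resynchronizes.
All names (seq_tracker, seq_check) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum seq_verdict { SEQ_FORWARD, SEQ_DROP_LATE };

struct seq_tracker {
    uint32_t next_seq;  /* sequence number of the next byte we expect */
    uint64_t gaps;      /* times we skipped ahead over missing data */
    uint64_t late;      /* segments dropped as old or duplicate */
};

void seq_init(struct seq_tracker *t, uint32_t first_data_seq)
{
    assert(t != NULL);
    t->next_seq = first_data_seq;
    t->gaps = 0;
    t->late = 0;
}

/* Decide what to do with a segment carrying payload_len bytes at seq.
 * Unsigned subtraction keeps the comparison valid across 32-bit wrap. */
enum seq_verdict seq_check(struct seq_tracker *t, uint32_t seq,
                           uint32_t payload_len)
{
    uint32_t delta = seq - t->next_seq;

    if (delta >= 0x80000000u) {
        /* seq is behind next_seq: resend or late arrival, drop it. */
        t->late++;
        return SEQ_DROP_LATE;
    }
    if (delta > 0)
        t->gaps++;  /* data in between was never seen; move on */

    t->next_seq = seq + payload_len;  /* wraps modulo 2^32 */
    return SEQ_FORWARD;
}
```

Whether skipping ahead over a gap is acceptable depends entirely on what
the downstream consumers can tolerate; the gaps counter at least makes
the loss visible.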



> I plan to coil the patch cable to make an electrical field surrounding
> my device to protect it from evil.

Think about how you might be able to use ACLs on a high-end switch to
guarantee that the packets you receive fit a certain profile (for
example, ensure that all packets are IPv4 TCP port 12345 between hostA and
hostB), to help shrink your code path.
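
To make "fit a certain profile" concrete, this is roughly the set of
checks the switch ACL would be doing on your behalf, written as a
user-space C sketch. If the switch guarantees them, your fast path can
leave them out. The struct and function names are hypothetical, the
frame is assumed to be untagged Ethernet II (no VLAN tag), and the
addresses and port are whatever hostA, hostB and 12345 stand for in
your setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct flow_profile {
    uint32_t host_a;  /* IPv4 address in host byte order */
    uint32_t host_b;  /* IPv4 address in host byte order */
    uint16_t port;    /* TCP port on either end, e.g. 12345 */
};

/* Read big-endian (network order) fields from raw bytes. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* True if the frame is an unfragmented IPv4 TCP segment between
 * host_a and host_b (either direction) using the profile's port. */
bool frame_matches(const struct flow_profile *fp, const uint8_t *f,
                   size_t len)
{
    assert(fp != NULL && f != NULL);

    if (len < 14 + 20)
        return false;                 /* too short for Ethernet + IPv4 */
    if (rd16(f + 12) != 0x0800)
        return false;                 /* EtherType is not IPv4 */

    const uint8_t *ip = f + 14;
    if ((ip[0] >> 4) != 4)
        return false;                 /* IP version is not 4 */
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ihl < 20 || len < 14 + ihl + 20)
        return false;                 /* bad header length or no TCP header */
    if (ip[9] != 6)
        return false;                 /* protocol is not TCP */
    if (rd16(ip + 6) & 0x1fff)
        return false;                 /* non-first fragment, no TCP header */

    uint32_t src = rd32(ip + 12);
    uint32_t dst = rd32(ip + 16);
    bool a_to_b = (src == fp->host_a && dst == fp->host_b);
    bool b_to_a = (src == fp->host_b && dst == fp->host_a);
    if (!a_to_b && !b_to_a)
        return false;

    const uint8_t *tcp = ip + ihl;
    return rd16(tcp) == fp->port || rd16(tcp + 2) == fp->port;
}
```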

Similarly, it may be possible to configure the device that's handling
the IPsec tunnel to do IP and TCP reassembly for you (if not, can you
replace it with one that does?), in which case your code could be made
MUCH simpler.


You didn't mention the protocol you're handling, but solutions like the
following may be helpful (or you might be able to implement the whole
thing there, and avoid supporting a frankenkernel).

http://www.brocade.com/solutions-technology/enterprise/application-delivery/fix-financial-applications/index.page

It's optimized for cloud service delivery, so it's at least 9000x as
awesome as OpenBSD.


Re: Low latency High Frequency Trading

Florenz Kley
On 10 Nov 2012, at 00:56, Ryan McBride <[hidden email]> wrote:
> http://www.brocade.com/solutions-technology/enterprise/application-delivery/fix-financial-applications/index.page

From the product info: "Client identity may be based on a choice of Layer 3 (IP), Layer 4 (TCP Port) and Layer 7 (FIX header SenderCompID field) information."

ohmigod. Sounds like people who earn my trust based on their uncompromising attention to detail with which they design highly secure systems. Important for stuff like moving money around (even if imaginary).

fl
