Stupid Ideas - softraid and ExpEther


Stupid Ideas - softraid and ExpEther

J.C. Roberts-3
On Fri, 3 Apr 2009 13:52:28 -0500 Marco Peereboom <[hidden email]>
wrote:

> That said I can guarantee that the OpenBSD project pays more attention
> to its users then other OS'.  This does not mean that the users get to
> set the road-map.  When an idea is not good the author is told so,
> usually, in strong language.  The opposite is Linux and other unnamed
> BSDs where everyone agrees with each other paralyzing proper
> development.  A stupid idea is still a stupid idea and it isn't
> magically going to mature like a good wine.


Over the last handful of days, I've been trying to figure out whether
or not the way of doing things proposed by my work is actually a
stupid idea. Though I've been sent out to investigate status on all UNIX
variants, both open and closed source, I'm pushing for using OpenBSD in
one of their new designs.

The design involves a technology called "Express Ether" though it is
typically written as "ExpEther," and it is basically a way to run a
PCIe bus over ethernet. Though this might be the first you've heard of
it, ExpEther has been in development at NEC for the last five years,
and yes, I'm currently working on getting the documentation released for
the existing silicon.

http://www.nec.co.jp/press/en/0702/0801.html
http://www.expether.org/

In short, you can think of ExpEther as something between a bus extender
and a bridge (PCIe<->ethernet), so basically anything you can plug into
a PCIe slot can be made available to a remote machine. Yep, you can
even partition attached devices into VLANs and basically "build" a
computer on the fly out of available parts attached to the network. For
example if your VPN or secure website is running a little slow, you
would usually halt the machine and add a crypto accelerator, but with
ExpEther, you just export a crypto accelerator device on another system
to the system that needs it and the recipient system assumes the device
is attached to its local PCIe bus.

One of the first applications I'm working on is exporting a softraid
volume over ExpEther. I was asked if it was possible to build a shim
that makes a block device like a softraid sd0a look like an ATA device
sitting on a (fictitious) ATA controller on the PCIe bus?

Though it's certainly an uncommon thing to try to do, there's just
something about this approach that makes me wonder if it's a
crazy/stupid idea, or absolutely brilliant?

To *me* (complete idiot), I'm wondering if this is being approached at
the wrong level, namely shimming a block device like sd0a to be seen as
an ata/scsi device on a fictitious controller, versus shimming something
below it, i.e.
        scsibus0 at softraid0
        sd0 at scsibus0

The *consumer* of the resource is expecting to see a disk attached to
a (fictitious) scsi/ata controller on its local PCIe bus (which is
imported via ExpEther).

The *provider* of the resource needs to take a softraid volume and make
it look like just a (fictitious) disk attached to a (fictitious)
scsi/ata controller on a (fictitious) PCIe bus (which is exported via
ExpEther).

Whether or not the shimming is done below partitioning on the provider
side is yet to be determined. If it is done above partitioning on the
provider side (i.e. block devices like sd0a), the result will be two
layers of partitioning (both provider and consumer) since sd0a on the
provider-system would become the (fictitious) sd0 on the
consuming-system.
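
To make the two-layer case concrete, here is a minimal sketch (Python,
purely illustrative) of how a sector number seen by the consumer would
be translated if the provider exports sd0a rather than the whole
volume. The Partition class, the function name, and the offsets are
all made up for the example; none of them come from a real disklabel
or from the ExpEther code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """A contiguous slice of a parent device, measured in sectors."""
    start: int   # first sector of the slice on the parent device
    length: int  # number of sectors in the slice

    def to_parent(self, sector: int) -> int:
        """Translate a sector inside this slice to a sector on the parent."""
        if not 0 <= sector < self.length:
            raise ValueError("sector outside partition")
        return self.start + sector


# Hypothetical layout: the provider exports partition 'a' of its softraid
# volume; the consumer sees that as a whole disk (its sd0) and puts its
# own partition 'a' on top of it.
provider_sd0a = Partition(start=64, length=1_000_000)
consumer_sd0a = Partition(start=128, length=500_000)


def consumer_to_provider_volume(sector: int) -> int:
    """Sector in the consumer's sd0a -> sector on the provider's volume."""
    on_exported_disk = consumer_sd0a.to_parent(sector)  # consumer's layer
    return provider_sd0a.to_parent(on_exported_disk)    # provider's layer
```

With these made-up numbers, sector 0 of the consumer's sd0a lands on
sector 192 of the provider's volume (64 + 128). Every I/O pays for both
translations, and both sides have to agree on where each label lives.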

The thing to remember is we're talking *below* the file system, so well
intended suggestions of NFS, ZFS, or file-system-du-jour are not at all
relevant.

As for the vast number of different types of potential failure modes,
the PCIe spec includes hot-plug requirements (yet who knows if your
$VENDOR implemented them properly), but on top of that, the ExpEther
spec also has its own hot-plug requirements, and they've been
implemented. Even with all this, getting the potential failure modes
correctly handled at the various levels will take a lot of effort.

Yes Nick, you can show up with your nail-gun (ramset). (;

Though this is for my work, I'm quietly doing my best to make it
benefit the project in various ways (including docs, code, ...). If any
of you would be kind enough to drop kick me in the direction of finding
a clue, or even want to voice an opinion about the bleeding edge (pink
elephants, vaporware, etc.) stuff I'm working on, it would be much
appreciated. I'm *way* over my head on a lot of this stuff, but I'm
learning it as fast as I can.

Thanks,
jcr

--
J.C. Roberts


Re: Stupid Ideas - softraid and ExpEther

SJP Lists
2009/4/7 J.C. Roberts <[hidden email]>:

> The design involves a technology called "Express Ether" though it is
> typically written as "ExpEther," and it is basically a way to run a
> PCIe bus over ethernet. Though this might be the first you've heard of
> it, ExpEther has been in development at NEC for the last five years,
> and yes, I'm currently working on getting the documentation released for
> the existing silicon.

DMA to host memory via Ethernet?

O_o


Re: Stupid Ideas - softraid and ExpEther

Steven Shockley
In reply to this post by J.C. Roberts-3
On 4/6/2009 10:23 PM, J.C. Roberts wrote:
>For
> example if your VPN or secure website is running a little slow, you
> would usually halt the machine and add a crypto accelerator, but with
> ExpEther, you just export a crypto accelerator device on another system
> to the system that needs it and the recipient system assumes the device
> is attached to it's local PCIe bus.

How does that help if you're encrypting the connection to the ExpEther
server/device?  I mostly trust that nobody is sniffing my PCI bus, I'm
less trusting when data goes over the network.


Re: Stupid Ideas - softraid and ExpEther

Declan Ingram
In reply to this post by J.C. Roberts-3
On Tue 07/04/09 9:28 PM, Steve Shockley wrote:

> On 4/6/2009 10:23 PM, J.C. Roberts wrote:
> > For
> > example if your VPN or secure website is running a little slow, you
> > would usually halt the machine and add a crypto accelerator, but with
> > ExpEther, you just export a crypto accelerator device on another system
> > to the system that needs it and the recipient system assumes the device
> > is attached to it's local PCIe bus.
>
> How does that help if you're encrypting the connection to the ExpEther
> server/device? I mostly trust that nobody is sniffing my PCI bus, I'm
> less trusting when data goes over the network.

Just tunnel it over SSH


Re: Stupid Ideas - softraid and ExpEther

Steven Shockley
On 4/7/2009 9:08 AM, Declan Ingram wrote:
>>   How does that help if you're encrypting the connection to the ExpEther
>>   server/device? I mostly trust that nobody is sniffing my PCI bus, I'm
>>   less trusting when data goes over the network.
>
>   Just tunnel it over SSH

That's fine, but then how do I offload the load from the ssh tunnel?
That's probably going to be the same load as the original ssl I'm
offloading.


Re: Stupid Ideas - softraid and ExpEther

Jussi Peltola
On Tue, Apr 07, 2009 at 11:23:59AM -0400, Steve Shockley wrote:

> On 4/7/2009 9:08 AM, Declan Ingram wrote:
>>>   How does that help if you're encrypting the connection to the ExpEther
>>>   server/device? I mostly trust that nobody is sniffing my PCI bus, I'm
>>>   less trusting when data goes over the network.
>>
>>   Just tunnel it over SSH
>
> That's fine, but then how do I offload the load from the ssh tunnel?  
> That's probably going to be the same load as the original ssl I'm  
> offloading.

not necessarily, ssh is one session, https is a stream of tiny ones.
still, the point stands, encrypting bus data sounds pretty slow
especially since it's latency sensitive

--
Jussi Peltola


Re: Stupid Ideas - softraid and ExpEther

Marco Peereboom
In reply to this post by J.C. Roberts-3
> The design involves a technology called "Express Ether" though it is
> typically written as "ExpEther," and it is basically a way to run a
> PCIe bus over ethernet. Though this might be the first you've heard of
> it, ExpEther has been in development at NEC for the last five years,
> and yes, I'm currently working on getting the documentation released for
> the existing silicon.

Getting these docs would be kick ass.  I was vaguely aware that this was
happening, and what I don't know is how the silicon works or what it
looks like.  Got any free docs on that?

>
> http://www.nec.co.jp/press/en/0702/0801.html
> http://www.expether.org/
>
> In short, you can think of ExpEther as something between a bus extender
> and a bridge (PCIe<->ethernet), so basically anything you can plug into
> a PCIe slot can be made available to a remote machine. Yep, you can
> even partition attached devices into VLANs and basically "build" a
> computer on the fly out of available parts attached to the network. For
> example if your VPN or secure website is running a little slow, you
> would usually halt the machine and add a crypto accelerator, but with
> ExpEther, you just export a crypto accelerator device on another system
> to the system that needs it and the recipient system assumes the device
> is attached to it's local PCIe bus.

So this is where all the work comes in.  We need a new pci bridge (or
bus) device that does all the magic.  Once this is in place one could
trivially hook hardware up and make it work regardless of distance
(latency would have to be considered obviously).

I am a little confused here though; if this is done right it should be
transparent to the OS and no code would have to be written at all (minus
management obviously).  Why do we need code?

> One of the first applications I'm working on is exporting a softraid
> volume over ExpEther. I was asked if it was possible to build a shim
> that makes a block device like a softraid sd0a look like an ATA device
> sitting on a (fictitious) ATA controller on the PCIe bus?

Sure, it could easily be used for that; however, if you want to make
this much more usable, see my previous paragraph.  You really want to
solve the problem only once and not multiple times.

> Though it's certainly an uncommon thing to try to do, there's just
> something about this approach that makes me wonder if it's a
> crazy/stupid idea, or absolutely brilliant?

Fine hack to prove a concept; however, a pci bridge (or bus) is the
device you really need and should write.

> To *me* (complete idiot), I'm wondering if this is being approached at
> the wrong level, namely shimming a block device like sd0a to be seen as
> a ata/scsi device on a fictitious controller, versus shimming something
> below it, i.e.
> scsibus0 at softraid0
> sd0 at scsibus0

Softraid is nothing but a virtual HBA.  Or a shim or a
$insert_fancy_name_here.
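
A toy model of the "virtual HBA" idea, for anyone who wants it spelled
out. This is a sketch in Python and not softraid's actual code: MemDisk
and MirrorHBA are invented names, and the mirror discipline is just the
simplest thing to show. The point is only that the layer above sees one
disk while the I/O fans out to the chunks underneath.

```python
class MemDisk:
    """Stand-in for a physical chunk: fixed-size sectors held in memory."""
    SECTOR = 512

    def __init__(self, sectors: int) -> None:
        self.sectors = sectors
        self.data = bytearray(sectors * self.SECTOR)

    def read(self, lba: int) -> bytes:
        off = lba * self.SECTOR
        return bytes(self.data[off:off + self.SECTOR])

    def write(self, lba: int, buf: bytes) -> None:
        if len(buf) != self.SECTOR:
            raise ValueError("write must be exactly one sector")
        off = lba * self.SECTOR
        self.data[off:off + self.SECTOR] = buf


class MirrorHBA:
    """Toy virtual HBA: presents one disk, mirrors writes to every chunk."""

    def __init__(self, chunks: list[MemDisk]) -> None:
        if not chunks:
            raise ValueError("need at least one chunk")
        self.chunks = chunks
        # The exported disk can be no larger than the smallest chunk.
        self.sectors = min(c.sectors for c in chunks)

    def _check(self, lba: int) -> None:
        if not 0 <= lba < self.sectors:
            raise ValueError("LBA out of range")

    def read(self, lba: int) -> bytes:
        self._check(lba)
        return self.chunks[0].read(lba)   # any healthy chunk would do

    def write(self, lba: int, buf: bytes) -> None:
        self._check(lba)
        for chunk in self.chunks:         # every mirror gets the write
            chunk.write(lba, buf)
```

Whatever sits above MirrorHBA only ever calls read() and write() on
what looks like a single disk, which is why stacking another fake
controller on top of it buys nothing.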

>
> The *consumer* of the resource is expecting to see a disk attached to
> a (fictitious) scsi/ata controller on it's local PCIe bus (which is
> imported via ExpEther).
>
> The *provider* of the resource needs to take a softraid volume and make
> it look like just a (fictitious) disk attached to a (fictitious)
> scsi/ata controller on a (fictitious) PCIe bus (which is exported via
> ExpEther).

Sure, all this is done in softraid today.  See the disabled AOE code as
an example.

> Whether or not the shimming is done below partitioning on the provider
> side is yet to be determined. If it is done above partitioning on the
> provider side (i.e. block devices like sd0a), the result will be two
> layers of partitioning (both provider and consumer) since sd0a on the
> provider-system would become the (fictitious) sd0 on the
> consuming-system.
>
> The thing to remember is we're talking *below* the file system, so well
> intended suggestions of NFS, ZFS, or file-system-de-jour are not at all
> relevant.

Correct.


Re: Stupid Ideas - softraid and ExpEther

J.C. Roberts-3
In reply to this post by Jussi Peltola
On Tue, 7 Apr 2009 19:04:00 +0300 Jussi Peltola <[hidden email]> wrote:

> On Tue, Apr 07, 2009 at 11:23:59AM -0400, Steve Shockley wrote:
> > On 4/7/2009 9:08 AM, Declan Ingram wrote:
> >>>   How does that help if you're encrypting the connection to the
> >>> ExpEther server/device? I mostly trust that nobody is sniffing my
> >>> PCI bus, I'm less trusting when data goes over the network.
> >>
> >>   Just tunnel it over SSH
> >
> > That's fine, but then how do I offload the load from the ssh
> > tunnel? That's probably going to be the same load as the original
> > ssl I'm offloading.
>
> not necessarily, ssh is one session, https is a stream of tiny ones.
> still, the point stands, encrypting bus data sounds pretty slow
> especially since it's latency sensitive
>

It seems the three of you, Jussi, Declan, and Steve, are thinking on
the wrong OSI level. ExpEther runs at Layer 2, raw ethernet frames,
and is used with a Layer 2 mesh switch. Though it is theoretically
possible to put a device on the other side of the globe and use a VLAN
(IEEE 802.1Q) to make it appear "local" to the switch, doing so would
obviously increase your latency considerably.

The typical mesh network configuration (in this sense) is limited to
a 4096-node topology, but it is possible to extend past this limitation by
combining/bridging them together. The partitioning within the network,
or better said assignment of PCIe devices from producers to consumers,
is done through VLANs.
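
For reference, this is roughly what an IEEE 802.1Q tagged frame looks
like on the wire, sketched in Python. The VLAN ID field is 12 bits, so
it has 4096 possible values (0 and 4095 are reserved); whether that is
where the 4096-node figure comes from is my assumption. The function
names are invented, the payload is an opaque placeholder, and 0x88B5 is
one of the IEEE "local experimental" EtherTypes, used here only because
I can't give you the real ExpEther frame format.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an IEEE 802.1Q tag


def vlan_tag(vid: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID, then PCP(3) | DEI(1) | VID(12)."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7 or dei not in (0, 1):
        raise ValueError("bad PCP or DEI")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def tagged_frame(dst: bytes, src: bytes, vid: int,
                 ethertype: int, payload: bytes) -> bytes:
    """dst MAC + src MAC + 802.1Q tag + inner EtherType + payload (no FCS)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses are 6 bytes")
    return dst + src + vlan_tag(vid) + struct.pack("!H", ethertype) + payload


# Placeholder values only: broadcast destination, a locally administered
# source address, VLAN 100, and an experimental EtherType.
frame = tagged_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01",
                     vid=100, ethertype=0x88B5, payload=b"opaque bus data")
```

Assigning a device to a consumer then amounts to putting both ends in
the same VLAN, so the switch only delivers those frames between them.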

If you've ever worked with low-latency, high-speed shared *memory*
interconnects in the HPC space ("High Performance Computing" - i.e.
Super Computing Clusters), such as Myrinet, you'd know maintaining
low latency within the cluster (i.e. datacenter) is very important, but
this problem was solved a long time ago.

Unlike Myrinet, which is only a shared memory interconnect, ExpEther
supposedly gives you the ability to share any PCIe device. Even though I
just started working with this stuff, I'm currently unconvinced of the
claim of "any" PCIe device, but then again, I tend to be very skeptical
until I've got actual proof.

As for the mentioned issue of encrypting the bus data, since you've got
the VLAN it is feasible, but if you've got an attacker inside the
switches of your datacenter, then you obviously have more important
problems. Also, there are a number of applications where the "switch"
is actually an isolated back-plane of sorts built into the device
housing (think blade server), so it is completely cut off from what you
think of as normal network traffic.

--
J.C. Roberts


Re: Stupid Ideas - softraid and ExpEther

J.C. Roberts-3
In reply to this post by Marco Peereboom
On Tue, 7 Apr 2009 11:48:52 -0500 Marco Peereboom <[hidden email]>
wrote:

> > The design involves a technology called "Express Ether" though it is
> > typically written as "ExpEther," and it is basically a way to run a
> > PCIe bus over ethernet. Though this might be the first you've heard
> > of it, ExpEther has been in development at NEC for the last five
> > years, and yes, I'm currently working on getting the documentation
> > released for the existing silicon.
>
> Getting these docs would be kick ass.  I was vaguely aware that this
> was happening and what I don't know is how the silicon works or looks
> like. Got any free docs on that?
>

I'm working simultaneously on five (5) different fronts:

1.) Getting Documentation Open
2.) Getting Existing Code Open
3.) Prevent NDA Nonsense
4.) Providing Hardware
5.) Reciprocation

To put it as bluntly as possible, *I* *WANT* the code quality and
reliability of OpenBSD in the system I'm working on, so I'm doing
everything I can to make it happen. This is the first time I've had the
opportunity to strongly influence, if not decide, practice, policy and
procedure, so when it comes to getting OpenBSD involved, I *am* making
things up as I go along. Needless to say, changing the thinking and
standard operating procedures at multiple bureaucratic corporations
may prove to be nothing more than wishful thinking on my part, but the
only way to find out is to try.

Think for a moment of the recent announcement about commercial use of
OpenSSH versus the support given to the OpenSSH authors, and all
similar situations. Since reciprocation from corporate entities to open
source software authors is typically lacking, reciprocation is an
important issue to mention publicly. Nonetheless, the details of
reciprocation are a private matter.

As for getting docs, hardware, code and whatnot open and into the hands
of interested developers, I'm doing it as fast as possible. Since I'm
an NDA-slave, I have some of it here, so it's mostly a matter of getting
the required permissions to release it. The (stupidly) super secret
sauce documentation I have here is from us forcing NEC to provide
accurate translations to English of their internal Japanese docs. They
are dated "March 30, 2009," so this is all brutally new.

ExpEther has been one of the secret research projects developed at NEC
labs, and has been in development for over five years. It works, but just
getting example and/or reference hardware for partner corporations has,
thus far, been a royal pain (limited supply issues). I'll be getting
this issue fixed shortly.

> >
> > http://www.nec.co.jp/press/en/0702/0801.html
> > http://www.expether.org/
> >
> > In short, you can think of ExpEther as something between a bus
> > extender and a bridge (PCIe<->ethernet), so basically anything you
> > can plug into a PCIe slot can be made available to a remote
> > machine. Yep, you can even partition attached devices into VLANs
> > and basically "build" a computer on the fly out of available parts
> > attached to the network. For example if your VPN or secure website
> > is running a little slow, you would usually halt the machine and
> > add a crypto accelerator, but with ExpEther, you just export a
> > crypto accelerator device on another system to the system that
> > needs it and the recipient system assumes the device is attached to
> > it's local PCIe bus.
>
> So this is where all the work comes in.  We need a new pci bridge (or
> bus) device that does all the magic.  Once this is in place one could
> trivially hook hardware up and make it work regardless of distance
> (latency would have to be considered obviously).
>
> I am a little confused here though; if this is done right it should be
> transparent to the OS and no code would have to be written at all
> (minus management obviously).  Why do we need code?
>

Outside of getting the ExpEther driver ported/written for OpenBSD,
there's only one place where it has been suggested that new code is
needed: getting the pseudo-device created by softraid, and the
pseudo-device pretending to be a scsi controller (where the softraid
device is attached), to appear to be attached to a PCIe bus. --As far
as I can tell, you've *already* written most of the code, and have
recently been attempting to do something (roughly) similar to ExpEther
with your softraid-AoE code.

> > One of the first applications I'm working on is exporting a softraid
> > volume over ExpEther. I was asked if it was possible to build a shim
> > that makes a block device like a softraid sd0a look like an ATA
> > device sitting on a (fictitious) ATA controller on the PCIe bus?
>
> Sure it could easily be used for that however if you want to make this
> much more usable see my previous paragraph.  You really want to solve
> the problem only once and not multiple times.
>
> > Though it's certainly an uncommon thing to try to do, there's just
> > something about this approach that makes me wonder if it's a
> > crazy/stupid idea, or absolutely brilliant?
>
> Fine hack to prove a concept however a pci bridge (or bus) is the
> device you really need and should write.
>

I agree.

> > To *me* (complete idiot), I'm wondering if this is being approached
> > at the wrong level, namely shimming a block device like sd0a to be
> > seen as a ata/scsi device on a fictitious controller, versus
> > shimming something below it, i.e.
> > scsibus0 at softraid0
> > sd0 at scsibus0
>
> Softraid is nothing but a virtual HBA.  Or a shim or a
> $insert_fancy_name_here.
>

Yep, the "shim" they asked about would take something that is already
virtualized (a softraid block device sd0a), and virtualize it again as
an HBA. Personally, I think this double-virtualization is a stupid idea,
and it does *not* qualify as a "fine hack to prove a concept" because I
can see nothing (but overhead and wasted time) being gained from it.

To *me* it makes far more sense to just use softraid as intended.

I still need to get through the existing ExpEther driver source, but
I suspect it must supply a virtual PCIe bus, so it's really just a
matter of getting the pseudo-HBA of softraid attached to it. I also
suspect the existing driver must provide a pci bridge so it can talk
to physically connected PCIe devices.

> >
> > The *consumer* of the resource is expecting to see a disk attached
> > to a (fictitious) scsi/ata controller on it's local PCIe bus (which
> > is imported via ExpEther).
> >
> > The *provider* of the resource needs to take a softraid volume and
> > make it look like just a (fictitious) disk attached to a
> > (fictitious) scsi/ata controller on a (fictitious) PCIe bus (which
> > is exported via ExpEther).
>
> Sure all this is done in softraid today.  See the disabled AOE code as
> an example.
>

I read your AoE code once briefly, and drooled on myself, but once I
get through the other docs (and finish beating up the required people
to get them released), I'll give the AoE code another read.

--
J.C. Roberts


Re: Stupid Ideas - softraid and ExpEther

Matthew Dempsky-3
In reply to this post by Steven Shockley
On Tue, Apr 7, 2009 at 4:28 AM, Steve Shockley
<[hidden email]> wrote:
> I mostly trust that nobody is sniffing my PCI bus, I'm less
> trusting when data goes over the network.

You can use a dedicated network.


Re: Stupid Ideas - softraid and ExpEther

Steven Shockley
In reply to this post by J.C. Roberts-3
On 4/7/2009 9:43 PM, J.C. Roberts wrote:
> As for the mentioned issue of encrypting the bus data, since you've got
> the VLAN it is feasible, but if you've got an attacker inside the
> switches of your datacenter, then you obviously have more important
> problems. Also, there are a number of applications where the "switch"
> is actually an isolated back-plane of sorts built into the device
> housing (think blade server), so it is completely cut off from what you
> think of as normal network traffic.

Okay, so this isn't something that would be offered as a network
service, such as iSCSI or NFS.  Thanks for the detailed explanation.


Re: Stupid Ideas - softraid and ExpEther

Joseph C. Bender
In reply to this post by J.C. Roberts-3
J.C. Roberts wrote:
>
> As for the mentioned issue of encrypting the bus data, since you've got
> the VLAN it is feasible, but if you've got an attacker inside the
> switches of your datacenter, then you obviously have more important
> problems.
>

Another scenario is that you get a compromised machine that has access
to this pool of resources.  I don't have to compromise your switching, I
just have to compromise a host that uses this network.  Given that
Windows hosts get to participate with this sort of thing, that's just a
matter of time.

Given that the security model relies on *VLANS* of all things to segment
network resources (from what little information is out there), one
compromised host could ruin your whole day, especially if the switch has
VLAN tagging vulnerabilities as well (which has happened more times than
I'd like to think about.)


-JCB


Re: Stupid Ideas - softraid and ExpEther

Felipe Scarel
Forgot to CC the list, my bad.

On Wed, Apr 8, 2009 at 12:25 PM, Joseph C. Bender
<[hidden email]> wrote:

> J.C. Roberts wrote:
>>
>> As for the mentioned issue of encrypting the bus data, since you've got
>> the VLAN it is feasible, but if you've got an attacker inside the
>> switches of your datacenter, then you obviously have more important
>> problems.
>
> Another scenario is that you get a compromised machine that has access to
> this pool of resources.  I don't have to compromise your switching, I just
> have to compromise a host that uses this network.  Given that Windows hosts
> get to participate with this sort of thing, that's just a matter of time.
>
> Given that the security model relies on *VLANS* of all things to segment
> network resources (from what little information is out there), one
> compromised host could ruin your whole day, especially if the switch has
> VLAN tagging vulnerabilities as well (which has happened more times than
> I'd like to think about.)
>

Since J.C. is talking about HPC, I don't think that'd be such a
concern. Like Matthew said, the "dedicated network" scenario is much
more likely, and thus the probability of a compromised host decreases
dramatically (since you control every single host in the network).

I'm currently working with bioinformatics algorithms in cluster
environments, so (as always), your extremely detailed emails have been
great reading material, J.C. Thanks, and keep up the great work!

>
> -JCB