tpmr(4): 802.1Q Two-Port MAC Relay


David Gwynne-5
a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
two ports, and unconditionally relays packets between those ports
instead of doing learning or anything like that.

i've been trying to get a redundant pair of bridges set up between two
datacenters here to help me while i migrate between them. so far all my
efforts to make it redundant have mostly worked, until they introduced
loops in the layer 2 topology, which generates a broadcast storm, which
basically takes the net down for a few minutes at a time. it feels
very betraying.

my frustration is that switches plugged together have mechanisms to
prevent loops like that, more specifically they use spanning tree or
lacp to make appropriate use of redundant links. i got to a point where
i just wanted the switches to talk to each other and do their own thing
to negotiate use of the redundant links.

unfortunately the only way to get ethernet packets off a physical
wire and onto a tunnel over an ip network is bridge(4), and bridge(4)
tries to be a compliant switch from a standards point of view. this
means it intercepts packets that are meant to be processed by bridges,
because it is a bridge. these types of packets include spanning tree and
lacp, which means i couldnt get the physical switches at each site to
talk to each other. sadface.

so to solve my problem i hacked up a small driver that did less than
bridge(4). however, it turns out that what i hacked up is an actual
thing that already exists as something done in the real world. IEEE
802.1Q describes TPMR, which is defined as intercepting far less
than a real bridge does. one of the appendices specifically describes
lacp going through one, which is exactly what i wanted. cisco does
something like this with their layer 2 cross-connects (search for cisco
xconnect for examples), juniper has l2circuits, and so on.

the way i'm using this is shown below. i have a pair of bridges in each
datacenter, so 4 boxes in total. they peer directly with the ip network
that sits between the datacenters. each box has 4 physical network
ports. 2 of those ports are configured with aggr(4) and talk IP into the
core network. the other two ports are connected to the switches at
each site for use with tpmr. there are 2 etherip interfaces configured on
each physical box, each of which is connected to its own tpmr.

all that together looks a bit like the following:

 +-+ +--------------------------+      +---------------------------+ +-+
 |d|-|ix2 <-> tpmr0 <-> etherip0|------|etherip0 <-> tpmr0 <-> ixl0|-|d|
 |c| |                          |      |                           | |c|
 |0|-|ix3 <-> tpmr1 <-> etherip1|-    -|etherip1 <-> tpmr1 <-> ixl1|-|1|
 ||| +--------------------------+ \  / +---------------------------+ |||
 |s|         dc0-bridge0           \/          dc1-bridge0           |s|
 |w|                               /\                                |w|
 |i| +--------------------------+ /  \ +---------------------------+ |i|
 |t|-|ix2 <-> tpmr0 <-> etherip0|-    -|etherip0 <-> tpmr0 <-> ixl0|-|t|
 |c| |                          |      |                           | |c|
 |h|-|ix3 <-> tpmr1 <-> etherip1|------|etherip1 <-> tpmr1 <-> ixl1|-|h|
 +-+ +--------------------------+      +---------------------------+ +-+
             dc0-bridge1                       dc1-bridge1

each switch has a 4 port port-channel (lacp aggregation) set up. because
each physical interface on the bridges is tied to a single tunnel, the
packets effectively traverse a point-to-point link, ie, a really
complicated wire. because lacp makes it from each point to the other
point, the switches make sure only active lacp ports are used, which
avoids layer 2 loops. lacp also means i get to use all the links when
theyre available.

with the topology above i can lose a bridge at each site and should
still have a working link to the other side, so i get my redundancy. the
use of the extra links with lacp is a bonus. at this point i would have
been happy for spanning tree to shut links down.

anyway, here's the code.

it was originally called xcon(4) since it provides a software
cross-connect, but i changed my mind after looking at 802.1Q. it might
be unfair to refer to 802.1Q because tpmr(4) does none of the filtering
that the spec says it should. i just needed it to work though.

the guts of it is tpmr_input(). it basically gets the rxed packet from
one port and enqueues it for transmission immediately on the other port.
it does run bpf though, and supports filtering on bpf, which has been
handy for us when we needed to test taking bpdus off the wire for a bit.

because it does such a small amount of work, it is relatively fast.
hrvoje popovski has given it a quick spin and seen the following
results on a fast box with a pair of ix(4) interfaces:

plain ip forwarding: 1.5Mpps
bridge(4) under load from 14Mpps: 500Kpps
bridge(4) under load from 1Mpps: 800Kpps
tpmr(4): 1.75Mpps

1.75Mpps was lower than I was expecting, but it turns out he was hitting
limits in other parts of the system. with some tuning we got it up to
2.25Mpps. the softnet taskq was only at about 66% cpu time, but we
couldnt see any other obvious places that we were dropping load.

on a slower box that can do IP forwarding at 1Mpps, tpmr(4) can do
1.6Mpps. it's worth noting that the boxes were extremely responsive (ie,
ssh feels fine) when tpmr is under load, which is not the case when ip
forwarding or bridge are being hammered.

my point is that it might be useful having tpmr(4) just to be able to
test network driver performance improvements independently of the stack.
im probably going to be using it to monitor links as a "bump in the
wire" too.

lastly regarding the code. i made this use the trunk(4) ioctls instead
of the bridge ones, mostly because i had to fake less stuff to make
ifconfig output look ok.

ifconfig output looks like this:

xdlg@dc3-bridge1:~$ ifconfig tpmr
     
tpmr0: flags=51<UP,POINTOPOINT,RUNNING>
        description: xconnect
        index 15 priority 0 llprio 7
        trunk: trunkproto none
                ix2 port active,collecting,distributing
                etherip10 port active,collecting,distributing
        groups: tpmr
        status: active

anyway. thoughts? ok?

Index: net/if_tpmr.c
===================================================================
RCS file: net/if_tpmr.c
diff -N net/if_tpmr.c
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ net/if_tpmr.c 29 Jul 2019 09:44:26 -0000
@@ -0,0 +1,717 @@
+/* $OpenBSD$ */
+
+/*
+ * Copyright (c) 2019 The University of Queensland
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+/*
+ * This code was written by David Gwynne <[hidden email]> as part
+ * of the Information Technology Infrastructure Group (ITIG) in the
+ * Faculty of Engineering, Architecture and Information Technology
+ * (EAIT).
+ */
+
+#include "bpfilter.h"
+#include "vlan.h"
+
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <sys/sockio.h>
+#include <sys/systm.h>
+#include <sys/syslog.h>
+#include <sys/rwlock.h>
+#include <sys/percpu.h>
+#include <sys/smr.h>
+#include <sys/task.h>
+
+#include <net/if.h>
+#include <net/if_dl.h>
+#include <net/if_types.h>
+
+#include <netinet/in.h>
+#include <netinet/if_ether.h>
+
+#include <net/if_media.h> /* if_trunk.h uses ifmedia bits */
+#include <crypto/siphash.h> /* if_trunk.h uses siphash bits */
+#include <net/if_trunk.h>
+
+#if NBPFILTER > 0
+#include <net/bpf.h>
+#endif
+
+#if NVLAN > 0
+#include <net/if_vlan_var.h>
+#endif
+
+/*
+ * tpmr interface
+ */
+
+#define TPMR_NUM_PORTS 2
+#define TPMR_TRUNK_PROTO TRUNK_PROTO_NONE
+
+struct tpmr_softc;
+
+struct tpmr_port {
+ struct ifnet *p_ifp0;
+
+ int (*p_ioctl)(struct ifnet *, u_long, caddr_t);
+ int (*p_output)(struct ifnet *, struct mbuf *, struct sockaddr *,
+    struct rtentry *);
+
+ void *p_lcookie;
+ void *p_dcookie;
+
+ struct tpmr_softc *p_tpmr;
+ unsigned int p_slot;
+};
+
+struct tpmr_softc {
+ struct ifnet sc_if;
+ unsigned int sc_dead;
+
+ struct tpmr_port *sc_ports[TPMR_NUM_PORTS];
+ unsigned int sc_nports;
+};
+
+#define DPRINTF(_sc, fmt...) do { \
+ if (ISSET((_sc)->sc_if.if_flags, IFF_DEBUG)) \
+ printf(fmt); \
+} while (0)
+
+static int tpmr_clone_create(struct if_clone *, int);
+static int tpmr_clone_destroy(struct ifnet *);
+
+static int tpmr_ioctl(struct ifnet *, u_long, caddr_t);
+static int tpmr_enqueue(struct ifnet *, struct mbuf *);
+static int tpmr_output(struct ifnet *, struct mbuf *, struct sockaddr *,
+    struct rtentry *);
+static void tpmr_start(struct ifqueue *);
+
+static int tpmr_up(struct tpmr_softc *);
+static int tpmr_down(struct tpmr_softc *);
+static int tpmr_iff(struct tpmr_softc *);
+
+static void tpmr_p_linkch(void *);
+static void tpmr_p_detach(void *);
+static int tpmr_p_ioctl(struct ifnet *, u_long, caddr_t);
+static int tpmr_p_output(struct ifnet *, struct mbuf *,
+    struct sockaddr *, struct rtentry *);
+
+static int tpmr_get_trunk(struct tpmr_softc *, struct trunk_reqall *);
+static void tpmr_p_dtor(struct tpmr_softc *, struct tpmr_port *,
+    const char *);
+static int tpmr_add_port(struct tpmr_softc *,
+    const struct trunk_reqport *);
+static int tpmr_get_port(struct tpmr_softc *, struct trunk_reqport *);
+static int tpmr_del_port(struct tpmr_softc *,
+    const struct trunk_reqport *);
+
+static struct if_clone tpmr_cloner =
+    IF_CLONE_INITIALIZER("tpmr", tpmr_clone_create, tpmr_clone_destroy);
+
+void
+tpmrattach(int count)
+{
+ if_clone_attach(&tpmr_cloner);
+}
+
+static int
+tpmr_clone_create(struct if_clone *ifc, int unit)
+{
+ struct tpmr_softc *sc;
+ struct ifnet *ifp;
+
+ sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK|M_ZERO|M_CANFAIL);
+ if (sc == NULL)
+ return (ENOMEM);
+
+ ifp = &sc->sc_if;
+
+ snprintf(ifp->if_xname, sizeof(ifp->if_xname), "%s%d",
+    ifc->ifc_name, unit);
+
+ ifp->if_softc = sc;
+ ifp->if_type = IFT_BRIDGE;
+ ifp->if_hardmtu = ETHER_MAX_HARDMTU_LEN;
+ ifp->if_mtu = 0;
+ ifp->if_addrlen = ETHER_ADDR_LEN;
+ ifp->if_hdrlen = ETHER_HDR_LEN;
+ ifp->if_ioctl = tpmr_ioctl;
+ ifp->if_output = tpmr_output;
+ ifp->if_enqueue = tpmr_enqueue;
+ ifp->if_qstart = tpmr_start;
+ ifp->if_flags = IFF_POINTOPOINT;
+ ifp->if_xflags = IFXF_CLONED | IFXF_MPSAFE;
+ ifp->if_link_state = LINK_STATE_DOWN;
+ IFQ_SET_MAXLEN(&ifp->if_snd, IFQ_MAXLEN);
+
+ if_counters_alloc(ifp);
+ if_attach(ifp);
+        if_alloc_sadl(ifp);
+
+#if NBPFILTER > 0
+ bpfattach(&ifp->if_bpf, ifp, DLT_EN10MB, ETHER_HDR_LEN);
+#endif
+
+ ifp->if_llprio = IFQ_MAXPRIO;
+
+ return (0);
+}
+
+static int
+tpmr_clone_destroy(struct ifnet *ifp)
+{
+ struct tpmr_softc *sc = ifp->if_softc;
+ unsigned int i;
+
+ NET_LOCK();
+ sc->sc_dead = 1;
+
+ if (ISSET(ifp->if_flags, IFF_RUNNING))
+ tpmr_down(sc);
+ NET_UNLOCK();
+
+ if_detach(ifp);
+
+ for (i = 0; i < nitems(sc->sc_ports); i++) {
+ struct tpmr_port *p = SMR_PTR_GET_LOCKED(&sc->sc_ports[i]);
+ if (p == NULL)
+ continue;
+ tpmr_p_dtor(sc, p, "destroy");
+ }
+
+ free(sc, M_DEVBUF, sizeof(*sc));
+
+ return (0);
+}
+
+static int
+tpmr_input(struct ifnet *ifp0, struct mbuf *m, void *cookie)
+{
+ struct tpmr_port *p = cookie;
+ struct tpmr_softc *sc = p->p_tpmr;
+ struct ifnet *ifp = &sc->sc_if;
+ struct tpmr_port *pn;
+ int len;
+#if NBPFILTER > 0
+ caddr_t if_bpf;
+#endif
+
+ if (!ISSET(ifp->if_flags, IFF_RUNNING))
+ goto drop;
+
+#if NVLAN > 0
+ /*
+ * If the underlying interface removed the VLAN header itself,
+ * add it back.
+ */
+ if (ISSET(m->m_flags, M_VLANTAG)) {
+ m = vlan_inject(m, ETHERTYPE_VLAN, m->m_pkthdr.ether_vtag);
+ if (m == NULL) {
+ counters_inc(ifp->if_counters, ifc_ierrors);
+ goto drop;
+ }
+ }
+#endif
+
+ len = m->m_pkthdr.len;
+ counters_pkt(ifp->if_counters, ifc_ipackets, ifc_ibytes, len);
+
+#if NBPFILTER > 0
+        if_bpf = ifp->if_bpf;
+        if (if_bpf) {
+                if (bpf_mtap(if_bpf, m, 0))
+ goto drop;
+ }
+#endif
+
+ smr_read_enter();
+ pn = SMR_PTR_GET(&sc->sc_ports[!p->p_slot]);
+ if (pn == NULL)
+ m_freem(m);
+ else {
+ struct ifnet *ifpn = pn->p_ifp0;
+ if ((*ifpn->if_enqueue)(ifpn, m))
+ counters_inc(ifp->if_counters, ifc_oerrors);
+ else {
+ counters_pkt(ifp->if_counters,
+    ifc_opackets, ifc_obytes, len);
+ }
+ }
+ smr_read_leave();
+
+ return (1);
+
+drop:
+ m_freem(m);
+ return (1);
+}
+
+static int
+tpmr_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *dst,
+    struct rtentry *rt)
+{
+ m_freem(m);
+ return (ENODEV);
+}
+
+static int
+tpmr_enqueue(struct ifnet *ifp, struct mbuf *m)
+{
+ m_freem(m);
+ return (ENODEV);
+}
+
+static void
+tpmr_start(struct ifqueue *ifq)
+{
+ ifq_purge(ifq);
+}
+
+static int
+tpmr_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
+{
+ struct tpmr_softc *sc = ifp->if_softc;
+ int error = 0;
+
+ if (sc->sc_dead)
+ return (ENXIO);
+
+ switch (cmd) {
+ case SIOCSIFADDR:
+ error = EAFNOSUPPORT;
+ break;
+
+ case SIOCSIFFLAGS:
+ if (ISSET(ifp->if_flags, IFF_UP)) {
+ if (!ISSET(ifp->if_flags, IFF_RUNNING))
+ error = tpmr_up(sc);
+ } else {
+ if (ISSET(ifp->if_flags, IFF_RUNNING))
+ error = tpmr_down(sc);
+ }
+ break;
+
+ case SIOCSTRUNK:
+ error = suser(curproc);
+ if (error != 0)
+ break;
+
+ if (((struct trunk_reqall *)data)->ra_proto !=
+    TRUNK_PROTO_LACP) {
+ error = EPROTONOSUPPORT;
+ break;
+ }
+
+ /* nop */
+ break;
+ case SIOCGTRUNK:
+ error = tpmr_get_trunk(sc, (struct trunk_reqall *)data);
+ break;
+
+ case SIOCSTRUNKOPTS:
+ error = suser(curproc);
+ if (error != 0)
+ break;
+
+ error = EPROTONOSUPPORT;
+ break;
+
+ case SIOCGTRUNKOPTS:
+ break;
+
+ case SIOCGTRUNKPORT:
+ error = tpmr_get_port(sc, (struct trunk_reqport *)data);
+ break;
+ case SIOCSTRUNKPORT:
+ error = suser(curproc);
+ if (error != 0)
+ break;
+
+ error = tpmr_add_port(sc, (struct trunk_reqport *)data);
+ break;
+ case SIOCSTRUNKDELPORT:
+ error = suser(curproc);
+ if (error != 0)
+ break;
+
+ error = tpmr_del_port(sc, (struct trunk_reqport *)data);
+ break;
+
+ default:
+ error = ENOTTY;
+ break;
+ }
+
+ if (error == ENETRESET)
+ error = tpmr_iff(sc);
+
+ return (error);
+}
+
+static int
+tpmr_get_trunk(struct tpmr_softc *sc, struct trunk_reqall *ra)
+{
+ struct ifnet *ifp = &sc->sc_if;
+ size_t size = ra->ra_size;
+ caddr_t ubuf = (caddr_t)ra->ra_port;
+ int error = 0;
+ int i;
+
+ ra->ra_proto = TPMR_TRUNK_PROTO;
+ memset(&ra->ra_psc, 0, sizeof(ra->ra_psc));
+
+ ra->ra_ports = sc->sc_nports;
+ for (i = 0; i < nitems(sc->sc_ports); i++) {
+ struct trunk_reqport rp;
+ struct ifnet *ifp0;
+ struct tpmr_port *p = SMR_PTR_GET_LOCKED(&sc->sc_ports[i]);
+ if (p == NULL)
+ continue;
+
+ if (size < sizeof(rp))
+ break;
+
+ ifp0 = p->p_ifp0;
+
+ CTASSERT(sizeof(rp.rp_ifname) == sizeof(ifp->if_xname));
+ CTASSERT(sizeof(rp.rp_portname) == sizeof(ifp0->if_xname));
+
+ memset(&rp, 0, sizeof(rp));
+ memcpy(rp.rp_ifname, ifp->if_xname, sizeof(rp.rp_ifname));
+ memcpy(rp.rp_portname, ifp0->if_xname, sizeof(rp.rp_portname));
+
+ if (!ISSET(ifp0->if_flags, IFF_RUNNING))
+ SET(rp.rp_flags, TRUNK_PORT_DISABLED);
+ else {
+ SET(rp.rp_flags, TRUNK_PORT_ACTIVE);
+ if (LINK_STATE_IS_UP(ifp0->if_link_state)) {
+ SET(rp.rp_flags, TRUNK_PORT_COLLECTING |
+    TRUNK_PORT_DISTRIBUTING);
+ }
+ }
+
+ error = copyout(&rp, ubuf, sizeof(rp));
+ if (error != 0)
+ break;
+
+ ubuf += sizeof(rp);
+ size -= sizeof(rp);
+ }
+
+ return (error);
+}
+
+static int
+tpmr_add_port(struct tpmr_softc *sc, const struct trunk_reqport *rp)
+{
+ struct ifnet *ifp = &sc->sc_if;
+ struct ifnet *ifp0;
+ struct arpcom *ac0;
+ struct tpmr_port **pp;
+ struct tpmr_port *p;
+ int i;
+ int error;
+
+ NET_ASSERT_LOCKED();
+ if (sc->sc_nports >= nitems(sc->sc_ports))
+ return (ENOSPC);
+
+ ifp0 = ifunit(rp->rp_portname);
+ if (ifp0 == NULL)
+ return (EINVAL);
+
+ if (ifp0->if_type != IFT_ETHER)
+ return (EPROTONOSUPPORT);
+
+ ac0 = (struct arpcom *)ifp0;
+ if (ac0->ac_trunkport != NULL)
+ return (EBUSY);
+
+ /* let's try */
+
+ ifp0 = if_get(ifp0->if_index); /* get an actual reference */
+ if (ifp0 == NULL) {
+ /* XXX this should never happen */
+ return (EINVAL);
+ }
+
+ p = malloc(sizeof(*p), M_DEVBUF, M_WAITOK|M_ZERO|M_CANFAIL);
+ if (p == NULL) {
+ error = ENOMEM;
+ goto put;
+ }
+
+ p->p_ifp0 = ifp0;
+ p->p_tpmr = sc;
+
+ p->p_ioctl = ifp0->if_ioctl;
+ p->p_output = ifp0->if_output;
+
+ error = ifpromisc(ifp0, 1);
+ if (error != 0)
+ goto free;
+
+ p->p_lcookie = hook_establish(ifp0->if_linkstatehooks, 1,
+    tpmr_p_linkch, p);
+ p->p_dcookie = hook_establish(ifp0->if_detachhooks, 0,
+    tpmr_p_detach, p);
+
+ /* commit */
+ DPRINTF(sc, "%s %s trunkport: creating port\n",
+    ifp->if_xname, ifp0->if_xname);
+
+ for (i = 0; i < nitems(sc->sc_ports); i++) {
+ pp = &sc->sc_ports[i];
+ if (SMR_PTR_GET_LOCKED(pp) == NULL)
+ break;
+ }
+ sc->sc_nports++;
+
+ p->p_slot = i;
+
+ ac0->ac_trunkport = p;
+ /* make sure p is visible before handlers can run */
+ membar_producer();
+ ifp0->if_ioctl = tpmr_p_ioctl;
+ ifp0->if_output = tpmr_p_output;
+ if_ih_insert(ifp0, tpmr_input, p);
+
+ SMR_PTR_SET_LOCKED(pp, p);
+
+ tpmr_p_linkch(p);
+
+ return (0);
+
+free:
+ free(p, M_DEVBUF, sizeof(*p));
+put:
+ if_put(ifp0);
+ return (error);
+}
+
+static struct tpmr_port *
+tpmr_trunkport(struct tpmr_softc *sc, const char *name)
+{
+ unsigned int i;
+
+ for (i = 0; i < nitems(sc->sc_ports); i++) {
+ struct tpmr_port *p = SMR_PTR_GET_LOCKED(&sc->sc_ports[i]);
+ if (p == NULL)
+ continue;
+
+ if (strcmp(p->p_ifp0->if_xname, name) == 0)
+ return (p);
+ }
+
+ return (NULL);
+}
+
+static int
+tpmr_get_port(struct tpmr_softc *sc, struct trunk_reqport *rp)
+{
+ struct tpmr_port *p;
+
+ NET_ASSERT_LOCKED();
+ p = tpmr_trunkport(sc, rp->rp_portname);
+ if (p == NULL)
+ return (EINVAL);
+
+ /* XXX */
+
+ return (0);
+}
+
+static int
+tpmr_del_port(struct tpmr_softc *sc, const struct trunk_reqport *rp)
+{
+ struct tpmr_port *p;
+
+ NET_ASSERT_LOCKED();
+ p = tpmr_trunkport(sc, rp->rp_portname);
+ if (p == NULL)
+ return (EINVAL);
+
+ tpmr_p_dtor(sc, p, "del");
+
+ return (0);
+}
+
+static int
+tpmr_p_ioctl(struct ifnet *ifp0, u_long cmd, caddr_t data)
+{
+ struct arpcom *ac0 = (struct arpcom *)ifp0;
+ struct tpmr_port *p = ac0->ac_trunkport;
+ int error = 0;
+
+ switch (cmd) {
+ case SIOCSIFADDR:
+ error = EBUSY;
+ break;
+
+ case SIOCGTRUNKPORT: {
+ struct trunk_reqport *rp = (struct trunk_reqport *)data;
+ struct tpmr_softc *sc = p->p_tpmr;
+ struct ifnet *ifp = &sc->sc_if;
+
+ if (strncmp(rp->rp_ifname, rp->rp_portname,
+    sizeof(rp->rp_ifname)) != 0)
+ return (EINVAL);
+
+ CTASSERT(sizeof(rp->rp_ifname) == sizeof(ifp->if_xname));
+ memcpy(rp->rp_ifname, ifp->if_xname, sizeof(rp->rp_ifname));
+ break;
+ }
+
+ default:
+ error = (*p->p_ioctl)(ifp0, cmd, data);
+ break;
+ }
+
+ return (error);
+}
+
+static int
+tpmr_p_output(struct ifnet *ifp0, struct mbuf *m, struct sockaddr *dst,
+    struct rtentry *rt)
+{
+ struct arpcom *ac0 = (struct arpcom *)ifp0;
+ struct tpmr_port *p = ac0->ac_trunkport;
+
+ /* restrict transmission to bpf only */
+ if ((m_tag_find(m, PACKET_TAG_DLT, NULL) == NULL)) {
+ m_freem(m);
+ return (EBUSY);
+ }
+
+ return ((*p->p_output)(ifp0, m, dst, rt));
+}
+
+static void
+tpmr_p_dtor(struct tpmr_softc *sc, struct tpmr_port *p, const char *op)
+{
+ struct ifnet *ifp = &sc->sc_if;
+ struct ifnet *ifp0 = p->p_ifp0;
+ struct arpcom *ac0 = (struct arpcom *)ifp0;
+
+ DPRINTF(sc, "%s %s: destroying port\n",
+    ifp->if_xname, ifp0->if_xname);
+
+ if_ih_remove(ifp0, tpmr_input, p);
+
+ ifp0->if_ioctl = p->p_ioctl;
+ ifp0->if_output = p->p_output;
+ membar_producer();
+
+ ac0->ac_trunkport = NULL;
+
+ sc->sc_nports--;
+ SMR_PTR_SET_LOCKED(&sc->sc_ports[p->p_slot], NULL);
+
+ if (ifpromisc(ifp0, 0) != 0) {
+ log(LOG_WARNING, "%s %s: unable to disable promisc",
+    ifp->if_xname, ifp0->if_xname);
+ }
+
+ hook_disestablish(ifp0->if_detachhooks, p->p_dcookie);
+ hook_disestablish(ifp0->if_linkstatehooks, p->p_lcookie);
+
+ smr_barrier();
+
+ if_put(ifp0);
+ free(p, M_DEVBUF, sizeof(*p));
+
+ if (ifp->if_link_state != LINK_STATE_DOWN) {
+ ifp->if_link_state = LINK_STATE_DOWN;
+ if_link_state_change(ifp);
+ }
+}
+
+static void
+tpmr_p_detach(void *arg)
+{
+ struct tpmr_port *p = arg;
+ struct tpmr_softc *sc = p->p_tpmr;
+
+ tpmr_p_dtor(sc, p, "detach");
+
+ NET_ASSERT_LOCKED();
+}
+
+static int
+tpmr_p_active(struct tpmr_port *p)
+{
+ struct ifnet *ifp0 = p->p_ifp0;
+
+ return (ISSET(ifp0->if_flags, IFF_RUNNING) &&
+    LINK_STATE_IS_UP(ifp0->if_link_state));
+}
+
+static void
+tpmr_p_linkch(void *arg)
+{
+ struct tpmr_port *p = arg;
+ struct tpmr_softc *sc = p->p_tpmr;
+ struct ifnet *ifp = &sc->sc_if;
+ struct tpmr_port *np;
+ u_char link_state = LINK_STATE_FULL_DUPLEX;
+
+ NET_ASSERT_LOCKED();
+
+ if (!tpmr_p_active(p))
+ link_state = LINK_STATE_DOWN;
+
+ np = SMR_PTR_GET_LOCKED(&sc->sc_ports[!p->p_slot]);
+ if (np == NULL || !tpmr_p_active(np))
+ link_state = LINK_STATE_DOWN;
+
+ if (ifp->if_link_state != link_state) {
+ ifp->if_link_state = link_state;
+ if_link_state_change(ifp);
+ }
+}
+
+static int
+tpmr_up(struct tpmr_softc *sc)
+{
+ struct ifnet *ifp = &sc->sc_if;
+
+ NET_ASSERT_LOCKED();
+ SET(ifp->if_flags, IFF_RUNNING);
+
+ return (0);
+}
+
+static int
+tpmr_iff(struct tpmr_softc *sc)
+{
+ return (0);
+}
+
+static int
+tpmr_down(struct tpmr_softc *sc)
+{
+ struct ifnet *ifp = &sc->sc_if;
+
+ NET_ASSERT_LOCKED();
+ CLR(ifp->if_flags, IFF_RUNNING);
+
+ return (0);
+}
Index: conf/GENERIC
===================================================================
RCS file: /cvs/src/sys/conf/GENERIC,v
retrieving revision 1.263
diff -u -p -r1.263 GENERIC
--- conf/GENERIC 8 Jul 2019 01:16:02 -0000 1.263
+++ conf/GENERIC 29 Jul 2019 09:44:26 -0000
@@ -102,6 +102,7 @@ pseudo-device pppx # PPP multiplexer
 pseudo-device sppp 1 # Sync PPP/HDLC
 pseudo-device trunk # Trunking support
 pseudo-device aggr # 802.1AX Link Aggregation
+pseudo-device tpmr # 802.1Q Two-Port MAC Relay (TPMR)
 pseudo-device tun # network tunneling over tty (tun & tap)
 pseudo-device vether # Virtual ethernet
 pseudo-device vxlan # Virtual extensible LAN
Index: conf/files
===================================================================
RCS file: /cvs/src/sys/conf/files,v
retrieving revision 1.672
diff -u -p -r1.672 files
--- conf/files 5 Jul 2019 01:37:13 -0000 1.672
+++ conf/files 29 Jul 2019 09:44:26 -0000
@@ -563,6 +563,7 @@ pseudo-device mobileip: ifnet
 pseudo-device crypto: ifnet
 pseudo-device trunk: ifnet, ether, ifmedia
 pseudo-device aggr: ifnet, ether, ifmedia
+pseudo-device tpmr: ifnet, ether, ifmedia
 pseudo-device mpe: ifnet, mpls
 pseudo-device mpw: ifnet, mpls, ether
 pseudo-device mpip: ifnet, mpls
@@ -818,6 +819,7 @@ file net/if_mobileip.c mobileip needs
 file net/if_trunk.c trunk needs-count
 file net/trunklacp.c trunk
 file net/if_aggr.c aggr
+file net/if_tpmr.c tpmr
 file net/if_mpe.c mpe needs-count
 file net/if_mpw.c mpw needs-count
 file net/if_mpip.c mpip



Re: tpmr(4): 802.1Q Two-Port MAC Relay

David Gwynne-5
On Tue, Jul 30, 2019 at 01:36:59PM +1000, David Gwynne wrote:
> a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
> two ports, and unconditionally relays packets between those ports
> instead of doing learning or anything like that.

i had written a manpage too:


TPMR(4)                      Device Drivers Manual                     TPMR(4)

NAME
     tpmr - IEEE 802.1Q Two-Port MAC Relay interface

SYNOPSIS
     pseudo-device tpmr

DESCRIPTION
     The tpmr driver implements an 802.1Q (originally 802.1aj) Two-Port MAC
     Relay (TPMR), otherwise known as an Ethernet cross-connect, or layer 2
     circuit.

     A TPMR is a simplified Ethernet bridge that provides a subset of the
     functionality found in bridge(4).  A TPMR has exactly two ports, and
     unconditionally relays Ethernet packets between the two ports.

     tpmr interfaces can be created at runtime using the ifconfig tpmrN create
     command or by setting up a hostname.if(5) configuration file for
     netstart(8).  The interface itself can be configured with ifconfig(8);
     see its manual page for more information.

     tpmr interfaces may be configured with ifconfig(8) and netstart(8) using
     the following options:

     trunkport child-iface
             Add child-iface as a port.

     -trunkport child-iface
             Remove the port child-iface.

     Other forms of Ethernet bridging are available using the bridge(4)
     driver.  Other forms of aggregation of Ethernet interfaces are available
     using the aggr(4) and trunk(4) drivers.

EXAMPLES
     tpmr can be used to cross-connect Ethernet devices that support different
     physical media.  For example, a device that supports a 100baseTX half-
     duplex connection can be connected to a switch with 1000baseSX optical
     ports by using tpmr with a pair of physical network interfaces, each of
     which supports the required media types.  If fxp(4) is used to connect to
     the 100baseTX device, and em(4) is used to connect to the 1000baseSX
     switch, the following configuration can be used:

     # ifconfig tpmr0 create
     # ifconfig tpmr0 trunkport fxp0 trunkport em0
     # ifconfig fxp0 up
     # ifconfig em0 up
     # ifconfig tpmr0 up

     Multiple TPMRs can be chained to transport Ethernet traffic for a pair of
     devices over another network.  Given two physically separate Ethernet
     switches, TPMRs can be used as follows to provide a point-to-point
     Ethernet link between them.  TPMRs allow the use of the Link Aggregation
     Control Protocol (LACP) or Spanning Tree Protocol (STP) by the switches
     to detect communication failures or connectivity loops respectively,
     which is not possible using bridge(4) as it filters those protocols.

     If Host A connected to Router B has the external IP address 192.0.2.10 on
     em0, Host D connected to Router C has the external IP address
     198.51.100.14 on em0, and both hosts have em1 connected to the switches,
     the following configuration can be used to connect the switches together.
     etherip(4) is used to transport the Ethernet packets over the IP network.

     Switch X ---- Host A ---------- tunnel ----------- Host D ---- Switch E
                    \                                    /
                     \                                  /
                      +---- Router B ---- Router C ----+

     Create the tpmr and etherip(4) interfaces:

           # ifconfig etherip0 create
           # ifconfig tpmr0 create

     Configure the etherip interface:

           (on Host A) # ifconfig etherip0 tunnel 192.0.2.10 198.51.100.14 up
           (on Host D) # ifconfig etherip0 tunnel 198.51.100.14 192.0.2.10 up

     Add the etherip interface and physical interface to the TPMR:

           # ifconfig tpmr0 trunkport em1 trunkport etherip0 up

     An equivalent setup using MPLS pseudowires instead of IP as the transport
     can be built using mpw(4) interfaces.

SEE ALSO
     aggr(4), bridge(4), trunk(4), hostname.if(5), ifconfig(8), netstart(8)

HISTORY
     The tpmr driver first appeared in OpenBSD 6.6.

OpenBSD 6.5                      July 5, 2019                      OpenBSD 6.5


Index: Makefile
===================================================================
RCS file: /cvs/src/share/man/man4/Makefile,v
retrieving revision 1.716
diff -u -p -r1.716 Makefile
--- Makefile 5 Jul 2019 01:41:14 -0000 1.716
+++ Makefile 30 Jul 2019 04:10:34 -0000
@@ -70,8 +70,8 @@ MAN= aac.4 abcrtc.4 ac97.4 acphy.4 acrtc
  st.4 ste.4 stge.4 sti.4 stp.4 sv.4 switch.4 sxiccmu.4 sximmc.4 \
  sxipio.4 sxirsb.4 sxirtc.4 sxitemp.4 sxitwi.4 sym.4 sypwr.4 syscon.4 \
  tcic.4 tcp.4 termios.4 tht.4 ti.4 tipmic.4 tl.4 \
- tlphy.4 thmc.4 tpm.4 tqphy.4 trm.4 trunk.4 tsl.4 tty.4 tun.4 tap.4 \
- twe.4 \
+ tlphy.4 thmc.4 tpm.4 tpmr.4 tqphy.4 trm.4 trunk.4 tsl.4 tty.4 \
+ tun.4 tap.4 twe.4 \
  txp.4 txphy.4 uaudio.4 uark.4 uath.4 ubcmtp.4 uberry.4 ubsa.4 \
  ubsec.4 ucom.4 uchcom.4 ucrcom.4 ucycom.4 ukspan.4 uslhcom.4 \
  udav.4 udcf.4 udl.4 udp.4 udsbr.4 \
Index: tpmr.4
===================================================================
RCS file: tpmr.4
diff -N tpmr.4
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tpmr.4 30 Jul 2019 04:10:34 -0000
@@ -0,0 +1,165 @@
+.\" $OpenBSD$
+.\"
+.\" Copyright (c) 2019 David Gwynne <[hidden email]>
+.\"
+.\" Permission to use, copy, modify, and distribute this software for any
+.\" purpose with or without fee is hereby granted, provided that the above
+.\" copyright notice and this permission notice appear in all copies.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+.\"
+.Dd $Mdocdate: July 5 2019 $
+.Dt TPMR 4
+.Os
+.Sh NAME
+.Nm tpmr
+.Nd IEEE 802.1Q Two-Port MAC Relay interface
+.Sh SYNOPSIS
+.Cd "pseudo-device tpmr"
+.Sh DESCRIPTION
+The
+.Nm
+driver implements an 802.1Q (originally 802.1aj) Two-Port MAC Relay
+(TPMR), otherwise known as an Ethernet cross-connect, or layer 2
+circuit.
+.Pp
+A TPMR is a simplified Ethernet bridge that provides a subset of the functionality found in
+.Xr bridge 4 .
+A TPMR has exactly two ports, and unconditionally relays Ethernet
+packets between the two ports.
+.Pp
+.Nm
+interfaces can be created at runtime using the
+.Ic ifconfig tpmr Ns Ar N Ic create
+command or by setting up a
+.Xr hostname.if 5
+configuration file for
+.Xr netstart 8 .
+The interface itself can be configured with
+.Xr ifconfig 8 ;
+see its manual page for more information.
+.Pp
+.Nm
+interfaces may be configured with
+.Xr ifconfig 8
+and
+.Xr netstart 8
+using the following options:
+.Bl -tag -width Ds
+.It Cm trunkport Ar child-iface
+Add
+.Ar child-iface
+as a port.
+.It Cm -trunkport Ar child-iface
+Remove the port
+.Ar child-iface .
+.El
+.\" document the ioctls?
+.Pp
+Other forms of Ethernet bridging are available using the
+.Xr bridge 4
+driver.
+Other forms of aggregation of Ethernet interfaces are available
+using the
+.Xr aggr 4
+and
+.Xr trunk 4
+drivers.
+.Sh EXAMPLES
+.Nm
+can be used to cross-connect Ethernet devices that support different
+physical media.
+For example, a device that supports a 100baseTX half-duplex connection
+can be connected to a switch with 1000baseSX optical ports by using
+.Nm
+with a pair of physical network interfaces, each of which supports
+the required media types.
+If
+.Xr fxp 4
+is used to connect to the 100baseTX device, and
+.Xr em 4
+is used to connect to the 1000baseSX switch, the following configuration
+can be used:
+.Bd -literal
+# ifconfig tpmr0 create
+# ifconfig tpmr0 trunkport fxp0 trunkport em0
+# ifconfig fxp0 up
+# ifconfig em0 up
+# ifconfig tpmr0 up
+.Ed
+.Pp
+Multiple TPMRs can be chained to transport Ethernet traffic for a
+pair of devices over another network.
+Given two physically separate Ethernet switches, TPMRs can be used
+as follows to provide a point-to-point Ethernet link between them.
+TPMRs allow the use of the Link Aggregation Control Protocol (LACP)
+or Spanning Tree Protocol (STP) by the switches to detect communication
+failures or connectivity loops respectively, which is not possible
+using
+.Xr bridge 4
+as it filters those protocols.
+.Pp
+If Host A connected to Router B has the external IP address 192.0.2.10
+on em0, Host D connected to Router C has the external IP address
+198.51.100.14 on em0, and both hosts have em1 connected to the
+switches, the following configuration can be used to connect the
+switches together.
+.Xr etherip 4
+is used to transport the Ethernet packets over the IP network.
+.Bd -literal
+Switch X ---- Host A ---------- tunnel ----------- Host D ---- Switch E
+               \e                                    /
+                \e                                  /
+                 +---- Router B ---- Router C ----+
+.Ed
+.Pp
+Create the
+.Nm
+and
+.Xr etherip 4
+interfaces:
+.Bd -literal -offset indent
+# ifconfig etherip0 create
+# ifconfig tpmr0 create
+.Ed
+.Pp
+Configure the etherip interface:
+.Bd -literal -offset indent
+(on Host A) # ifconfig etherip0 tunnel 192.0.2.10 198.51.100.14 up
+(on Host D) # ifconfig etherip0 tunnel 198.51.100.14 192.0.2.10 up
+.Ed
+.Pp
+Add the etherip interface and physical interface to the TPMR:
+.Bd -literal -offset indent
+# ifconfig tpmr0 trunkport em1 trunkport etherip0 up
+.Ed
+.Pp
+An equivalent setup using MPLS pseudowires instead of IP as the
+transport can be built using
+.Xr mpw 4
+interfaces.
+.Sh SEE ALSO
+.Xr aggr 4 ,
+.Xr bridge 4 ,
+.Xr trunk 4 ,
+.Xr hostname.if 5 ,
+.Xr ifconfig 8 ,
+.Xr netstart 8
+.\" .Sh STANDARDS
+.\" .Rs
+.\" .%T IEEE 802.1Q
+.\" .Re
+.\" .Rs
+.\" .%T IEEE 802.1aj
+.\" .Re
+.Sh HISTORY
+The
+.Nm
+driver first appeared in
+.Ox 6.6 .
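
As a dry-run sketch of the etherip example in the manual page above (not part of the patch): the two ends of the cross-connect are mirror images and differ only in the order of the tunnel addresses. The helper name xconnect_cmds is made up for illustration. It only prints the ifconfig commands shown in the manual page and does not run them.

```sh
#!/bin/sh
# Dry-run sketch: print (do not run) the ifconfig commands for one end of
# the etherip cross-connect from the tpmr(4) EXAMPLES section.
# xconnect_cmds is a hypothetical helper name, not part of OpenBSD.

xconnect_cmds() {
	if [ $# -ne 3 ]; then
		echo "usage: xconnect_cmds local_ip remote_ip phys" >&2
		return 1
	fi

	local_ip=$1	# tunnel source: this host's external address
	remote_ip=$2	# tunnel destination: the peer's external address
	phys=$3		# physical port facing the local switch

	echo "ifconfig etherip0 create"
	echo "ifconfig tpmr0 create"
	echo "ifconfig etherip0 tunnel $local_ip $remote_ip up"
	echo "ifconfig tpmr0 trunkport $phys trunkport etherip0 up"
}

# The addresses and em1 are the ones used in the manual page example.
echo "# Host A"
xconnect_cmds 192.0.2.10 198.51.100.14 em1
echo "# Host D"
xconnect_cmds 198.51.100.14 192.0.2.10 em1
```

Piping the output for one host into sh as root on that host would apply it, assuming the interface names match the machine.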


Re: tpmr(4): 802.1Q Two-Port MAC Relay

Remi Locherer
In reply to this post by David Gwynne-5
On Tue, Jul 30, 2019 at 01:36:59PM +1000, David Gwynne wrote:

> a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
> two ports, and unconditionally relays packets between those ports
> instead of doing learning or anything like that.
>
> i've been trying to get a redundant pair of bridges set up between two
> datacenters here to help me while i migrate between them. so far all my
> efforts to make it redundant have mostly worked, until they introduced
> loops in the layer 2 topology, which generates a broadcast storm, which
> basically takes the net down for a few minutes at a time. it feels
> very betraying.
>
> my frustration is that switches plugged together have mechanisms to
> prevent loops like that, more specifically they use spanning tree or
> lacp to make appropriate use of redundant links. i got to a point where
> i just wanted the switches to talk to each other and do their own thing
> to negotiate use of the redundant links.
>
> unfortunately the only way to get ethernet packets off a physical
> wire and onto a tunnel over an ip network is bridge(4), and bridge(4)
> tries to be a compliant switch from a standards point of view. this
> means it intercepts packets that are meant to be processed by bridges,
> because it is a bridge. these types of packets include spanning tree and
> lacp, which means i couldn't get the physical switches at each site to
> talk to each other. sadface.
>
> so to solve my problem i hacked up a small driver that did less than
> bridge(4). however, it turns out that what i hacked up is an actual
> thing that already exists as something done in the real world. IEEE
> 802.1Q describes TPMR, which is defined as intercepting far less
> than a real bridge does. one of the appendices specifically describes
> lacp going through one, which is exactly what i wanted. cisco does
> something like this with their layer 2 cross-connects (search for cisco
> xconnect for examples), juniper has l2circuits, and so on.
>
> the way i'm using this is like below. i have a pair of bridges in each
> datacenter, so 4 boxes in total. they peer directly with the ip network
> that sits between the datacenters. each box has 4 physical network
> ports. 2 of those ports are configured with aggr(4) and talk IP into the
> core network. the other two ports are connected to the switches at
> each site for use with tpmr. there's 2 etherip interfaces configured on
> each physical box, each of which is connected to the tpmr.
>
> all that together looks a bit like the following:
>
>  +-+ +--------------------------+      +---------------------------+ +-+
>  |d|-|ix2 <-> tpmr0 <-> etherip0|------|etherip0 <-> tpmr0 <-> ixl0|-|d|
>  |c| |                          |      |                           | |c|
>  |0|-|ix3 <-> tpmr1 <-> etherip1|-    -|etherip1 <-> tpmr1 <-> ixl1|-|1|
>  ||| +--------------------------+ \  / +---------------------------+ |||
>  |s|         dc0-bridge0           \/          dc1-bridge0           |s|
>  |w|                               /\                                |w|
>  |i| +--------------------------+ /  \ +---------------------------+ |i|
>  |t|-|ix2 <-> tpmr0 <-> etherip0|-    -|etherip0 <-> tpmr0 <-> ixl0|-|t|
>  |c| |                          |      |                           | |c|
>  |h|-|ix3 <-> tpmr1 <-> etherip1|------|etherip1 <-> tpmr1 <-> ixl1|-|h|
>  +-+ +--------------------------+      +---------------------------+ +-+
>              dc0-bridge1                       dc1-bridge1
>
> each switch has a 4 port port-channel (lacp aggregation) set up. because
> each physical interface on the bridges is tied to a single tunnel, the
> packets effectively traverse a point-to-point link, ie, a really
> complicated wire. because lacp makes it from each point to the other
> point, the switches make sure only active lacp ports are used, which
> avoids layer 2 loops. lacp also means i get to use all the links when
> they're available.
>
> with the topology above i can lose a bridge at each site and should
> still have a working link to the other side, so i get my redundancy. the
> use of the extra links with lacp is a bonus. at this point i would have
> been happy for spanning tree to shut links down.
>
> anyway, here's the code.
>
> it was originally called xcon(4) since it provides a software
> cross-connect, but i changed my mind after looking at 802.1Q. it might
> be unfair to refer to 802.1Q because tpmr(4) does none of the filtering
> that the spec says it should. i just needed it to work though.
>
> the guts of it is tpmr_input(). it basically gets the rxed packet from
> one port and enqueues it for transmission immediately on the other port.
> it does run bpf though, and supports filtering on bpf, which has been
> handy for us when we needed to test taking bpdus off the wire for a bit.
>
> because it does such a small amount of work, it is relatively fast.
> hrvoje popovski has given it a quick spin and seen the following
> results on a fast box with a pair of ix(4) interfaces:
>
> plain ip forwarding: 1.5Mpps
> bridge(4) under load from 14Mpps: 500Kpps
> bridge(4) under load from 1Mpps: 800Kpps
> tpmr(4): 1.75Mpps
>
> 1.75Mpps was lower than I was expecting, but it turns out he was hitting
> limits in other parts of the system. with some tuning we got it up to
> 2.25Mpps. the softnet taskq was only at about 66% cpu time, but we
> couldn't see any other obvious places that we were dropping load.
>
> on a slower box that can do IP forwarding at 1Mpps, tpmr(4) can do
> 1.6Mpps. it's worth noting that the boxes were extremely responsive (ie,
> ssh feels fine) when tpmr is under load, which is not the case when ip
> forwarding or bridge are being hammered.
>
> my point is that it might be useful having tpmr(4) just to be able to
> test network driver performance improvements independently of the stack.
> i'm probably going to be using it to monitor links as a "bump in the
> wire" too.
>
> lastly regarding the code. i made this use the trunk(4) ioctls instead
> of the bridge ones, mostly because i had to fake less stuff to make
> ifconfig output look ok.
>
> ifconfig output looks like this:
>
> xdlg@dc3-bridge1:~$ ifconfig tpmr
>
> tpmr0: flags=51<UP,POINTOPOINT,RUNNING>
>         description: xconnect
>         index 15 priority 0 llprio 7
>         trunk: trunkproto none
>                 ix2 port active,collecting,distributing
>                 etherip10 port active,collecting,distributing
>         groups: tpmr
>         status: active
>
> anyway. thoughts? ok?

Have you tried to use bridge with STP enabled in your setup? Just curious.
I understand that with STP on the OpenBSD box you could not use all links
and forwarding performance would not be as good.

Anyway, I think tpmr would be a nice addition!

Remi


Re: tpmr(4): 802.1Q Two-Port MAC Relay

David Gwynne-5


> On 30 Jul 2019, at 6:28 pm, Remi Locherer <[hidden email]> wrote:
>
> Have you tried to use bridge with STP enabled in your setup? Just curious.
> I understand that with STP on the OpenBSD box you could not use all links
> and forwarding performance would not be as good.

The ports I plug into in one of the datacenters are on fabric extenders on a cisco nexus setup, and they don't like to do spanning tree. I ended up having to do LACP.
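
For background on why bridge(4) gets in the way here: 802.1D reserves the group addresses 01:80:c2:00:00:00 through 01:80:c2:00:00:0f, and a standards-compliant bridge does not forward frames sent to them. STP BPDUs go to 01:80:c2:00:00:00 and LACP goes to 01:80:c2:00:00:02, so both fall in that range. A minimal sh sketch of that check follows; is_bridge_reserved is a hypothetical helper for illustration, not anything in the tree.

```sh
#!/bin/sh
# Illustration only: classify a destination MAC address as inside or
# outside the 802.1D reserved range 01:80:c2:00:00:00 - 01:80:c2:00:00:0f.
# is_bridge_reserved is a hypothetical helper name.

is_bridge_reserved() {
	# Normalise hex digits to lower case before matching.
	addr=$(printf '%s' "$1" | tr 'A-F' 'a-f')
	case $addr in
	01:80:c2:00:00:0[0-9a-f]) return 0 ;;
	*) return 1 ;;
	esac
}

# STP BPDU address, LACP (slow protocols) address, and plain broadcast.
for dst in 01:80:c2:00:00:00 01:80:c2:00:00:02 ff:ff:ff:ff:ff:ff; do
	if is_bridge_reserved "$dst"; then
		echo "$dst: reserved, a compliant bridge does not forward it"
	else
		echo "$dst: forwarded normally"
	fi
done
```

tpmr(4) as posted does no such check, so both kinds of frame are relayed to the other port and the switches can negotiate with each other directly.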

dlg

>
> Anyway, I think tpmr would be a nice addition!
>
> Remi