=================
 Dual Uplink BCP
=================
:Author: Trent W. Buck
:Date: July 2012
:Audience: Cyber sysadmins

NOTE: if you are already familiar with the lartc documentation, the
key difference is I use "throw" in the main table to avoid duplicating
its non-default rules in both uplink-specific tables.

This article describes the configuration of alpha.cyber.com.au, a
router with two ADSL2+ uplinks (to different ISPs), with an eye to
business continuity (i.e. failover), *not* load balancing.

Alpha runs Ubuntu 10.04 LTS. It has four downstream LANs, each a /26
within the public 203.7.155/24. It has two uplinks: Internode and
Exetel. Both uplinks terminate at 6P2C (phone) ports on a two-port
Traverse Solos ADSL2+ PCI card. Alpha runs PPPoA on both uplinks.

For failover, it is *critical* that uplink down/up events can trigger
routing changes. On alpha, this is done in /etc/ppp/ip-up.d and
/etc/ppp/ip-down.d. This would work equally well if alpha were doing
PPPoE to dedicated ethernet ports connected to modems in bridged mode.
It remains an unsolved issue when the uplinks terminate on CPEs
managed *by the ISP*, each of which then speaks a static /30 to the
router. In that latter case, using SNMP traps or SNMP polling of the
ISP-managed CPE is recommended; failing that, ping both the CPE and a
highly-available host on the internet (e.g. 8.8.8.8).

Cyber owns a public /24, but not an AS number. Therefore we have
Internode advertise our /24 and designate this the "primary" link, and
do not NAT it. Exetel behaves more conventionally, with a single
public IP to which alpha SNATs (MASQUERADEs) the LAN's (public) /24.
The approach described in this article should work equally well
without a public /24, i.e. NATting both uplinks.

First I will describe the state of the running system. Then I will
describe the config files that bring this about.
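
For the ISP-managed CPE case mentioned above, the "ping both the CPE and a far host" fallback could be sketched roughly as follows. This is not something alpha runs. The function names, the interface name eth1 and the CPE address 192.0.2.1 are made up for illustration, and the ping flags assume the iputils version of ping. PING can be overridden, so the logic can be tried without touching the network.

```shell
#!/bin/sh
# Sketch only: classify an uplink that sits behind an ISP-managed CPE.
# PING is overridable so the logic can be exercised without a network.
PING=${PING:-ping}

# probe IFACE HOST: succeed if HOST answers three pings sent via IFACE.
probe() {
    $PING -c 3 -W 2 -I "$1" "$2" >/dev/null 2>&1
}

# link_state IFACE CPE_ADDR FAR_ADDR
# Prints one of: up, upstream-down, cpe-down.
link_state() {
    if ! probe "$1" "$2"; then
        echo cpe-down           # cannot even reach the CPE
    elif ! probe "$1" "$3"; then
        echo upstream-down      # CPE answers, the internet does not
    else
        echo up
    fi
}
```

A cron job or small daemon might call ``link_state eth1 192.0.2.1 8.8.8.8`` and, when the result changes, run the same commands that the ip-up.d and ip-down.d hooks run for a PPP link.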

The Running System
------------------

First, observe the main route table::

    # ip route
    58.96.2.202 dev ppp1 proto kernel scope link src 58.96.67.67
    150.101.212.44 dev ppp0 proto kernel scope link src 150.101.159.241
    203.7.155.192/26 dev scratch proto kernel scope link src 203.7.155.193
    203.7.155.128/26 dev managed proto kernel scope link src 203.7.155.129
    203.7.155.64/26 dev unmanaged proto kernel scope link src 203.7.155.65
    203.7.155.0/26 dev dmz proto kernel scope link src 203.7.155.1
    blackhole 203.7.155.0/24
    blackhole 203.27.58.0/24
    blackhole 169.254.0.0/16
    blackhole 192.168.0.0/16
    blackhole 172.16.0.0/12
    blackhole 10.0.0.0/8
    throw default

The blackhole routes are a safety net; you can ignore them for now.
The uplinks are ppp1 (exetel) and ppp0 (internode). Unfortunately
they cannot be named logically (http://bugs.debian.org/458646).

The last line is the most important. Any lookup that reaches it and
matches "default" (i.e. 0/0) is "thrown", i.e. the decision returns to
the policy routing rules (ip rule), which carry on with the next rule.
Let's look at those now::

    # ip rule
    0: from all lookup local
    128: from all lookup main
    256: from all fwmark 0xa lookup internode
    512: from all fwmark 0xb lookup exetel
    1024: from all lookup internode
    2048: from all lookup exetel
    32766: from all lookup main
    32767: from all lookup default

If you have never used ip rule before, just think of this as the
routing decisions that take place before the "ip route" routing
decisions.

First, note there are two jumps to the "main" table. The one at
priority 32766 is the one you start with. The one at priority 128 is
one we have added explicitly. You can ignore the one at 32766: it
consults the same table a second time, so it cannot find anything the
lookup at priority 128 did not. (We could delete it in our scripts,
but it is perhaps a little safer to leave it there in case our scripts
break.)
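
As a quick sanity check of the layout just described, something like the following could be run on the router. It is only a sketch: the function names are made up, and the uplink table names (internode, exetel) are hard-coded from this article. Both functions read command output on stdin, so they can also be tried against saved output.

```shell
#!/bin/sh
# Sketch only: sanity checks for the "main first, then throw" layout.

# main_before_uplinks: reads `ip rule` output on stdin.
# Succeeds if a lookup of "main" appears before the first lookup of an
# uplink table (vacuously true if no uplink rule is present).
main_before_uplinks() {
    awk '
        $NF == "main" && !seen_uplink      { ok = 1 }
        $NF == "internode" || $NF == "exetel" { seen_uplink = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# has_throw_default: reads `ip route` (main table) output on stdin.
# Succeeds if the table contains a "throw default" route.
has_throw_default() {
    grep -Eq '^throw default( |$)'
}
```

Usage would be along the lines of ``ip rule | main_before_uplinks && ip route | has_throw_default || echo "policy routing is not set up"``.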

The problem with the multiple uplink approach described by LARTC.org
is that the main table remains at the end, meaning that all your
"normal" routing decisions must be tediously reproduced in both the
internode and exetel tables. By moving the main table to the top of
the policy routing, and throwing on 0/0, we can leave all normal
routing in the main table, and only make policy routing decisions when
EITHER uplink is a sensible choice.

The priority numbers in the policy routing table above are arbitrary.
I chose round (powers of two) numbers, but I could just as easily have
used 1 through 5 instead of 128 through 2048.

Before I explain those entries, let's just glance at the non-main
route tables::

    # ip route show table default
    # ip route show table local
    [...]
    # ip route show table internode
    default dev ppp0 scope link
    # ip route show table exetel
    default dev ppp1 scope link

The default and local tables are builtins managed by the system; we
don't care about them. The internode and exetel tables do nothing but
send all traffic (0/0, a.k.a. default) to the appropriate line. This
is sensible because, as you recall, we only hit the internode/exetel
tables AFTER being thrown from the final rule of the main table.

Look back at "ip rule". Notice that we have TWO sets of lookup rules.
The first two (the fwmarks) are to route return packets. Without
them, an incoming connection might arrive on exetel and have return
packets routed back out via internode. Since we have no AS, that
would result in the other guy going "WTF? I'm talking to one guy and
another guy is responding!" and throwing the connection away. This is
similar/identical to triangle routing seen in broken NAT setups.

Let's look at where these marks are handled in netfilter::

    # iptables-save -t mangle
    # I have added comments in manually.

    ## Associate packets with the internet interface (ppp[01]) on which
    ## the connection originated.
    ##
    ## Set the ctmark ONCE per conn, then restore that
    ## to the packet mark of each packet.
    ##
    ## PS: I have *absolutely no idea* why the OUTPUT rule is needed.
    *mangle
    :PREROUTING ACCEPT
    :INPUT ACCEPT
    :FORWARD ACCEPT
    :OUTPUT ACCEPT
    :POSTROUTING ACCEPT
    -A PREROUTING -m conntrack --ctstate NEW -i ppp0 -j CONNMARK --set-mark 0xA
    -A PREROUTING -m conntrack --ctstate NEW -i ppp1 -j CONNMARK --set-mark 0xB
    -A PREROUTING -j CONNMARK --restore-mark
    -A OUTPUT -j CONNMARK --restore-mark
    COMMIT

As you can see, fwmarks are not added for connections that originate
within the LAN. Those fall through to the second pair of ip rule
uplink lines (priorities 1024 and 2048). (The OUTPUT rule restores
the mark on packets generated by alpha itself, e.g. replies from a
service running on the router, since those never traverse PREROUTING.)

It should be obvious by now that since there are no constraints on
priority 1024, it will always match while internode is up. Thus, this
configuration amounts to this logic::

    if a packet is viable on only one link, use it;
    otherwise, if both links are up, use internode;
    otherwise, use the link that's up.

It is possible to add additional policy routing; such rules would go
between priorities 512 and 1024. Policy routing by source address
(e.g. "traffic from the backup machine should route via exetel") can
be done directly in the ip rule table. Policy routing by destination
port (e.g. "browsing goes via exetel") requires entries in the mangle
netfilter table. They would be similar to those above, with -i
replaced by -p tcp --dport or similar.

Great caution should be taken when configuring "fancy" policy routing,
as it can make debugging VERY difficult and confusing. For example,
if you tried to load balance by using --ctstate NEW -m statistic
--mode nth --every 2 to send every second connection via exetel,
someone debugging an unrelated issue from within your LAN would wonder
WTF was wrong with the remote server, that it was responding to only
every second(ish) connection.
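
The three-line logic above can be modelled as a small shell function, which may help when reasoning about what a change to the rules would do. This is only a toy model of the four uplink rules (priorities 256, 512, 1024 and 2048), not something that runs on alpha; the function name and its arguments are made up for illustration. A link that is "down" is treated as having had its rules removed, as the hook scripts do.

```shell
#!/bin/sh
# Sketch only: a toy model of the uplink choice made by ip rule.
# choose_uplink MARK INTERNODE_STATE EXETEL_STATE
#   MARK is 0xa, 0xb or none; each state is "up" or "down".
choose_uplink() {
    mark=$1 internode=$2 exetel=$3
    if [ "$mark" = 0xa ] && [ "$internode" = up ]; then
        echo internode      # pri 256: connection arrived via ppp0
    elif [ "$mark" = 0xb ] && [ "$exetel" = up ]; then
        echo exetel         # pri 512: connection arrived via ppp1
    elif [ "$internode" = up ]; then
        echo internode      # pri 1024: preferred uplink
    elif [ "$exetel" = up ]; then
        echo exetel         # pri 2048: fallback uplink
    else
        echo unreachable    # no uplink rule is left to match
    fi
}
```

For example, ``choose_uplink 0xb up up`` prints exetel (a reply on a connection that came in via exetel), while ``choose_uplink none up up`` prints internode.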

To be clear (as shall be shown in the next section), when a link comes
up, its peer-to-peer route is added to the main route table, its
default route is added to its own table, and its two lookup lines are
added to the ip rule table. When the link goes down, all of these are
removed. Other than that, the configuration is completely static.

The Configuration Files
-----------------------

I'm just going to dump the whole lot on you with little/no further
explanation. You should be smart enough to follow it based on your
background experience and the discussion in the previous section.

/etc/iproute2/rt_tables::

    #
    # reserved values
    #
    255 local
    254 main
    253 default
    0 unspec
    #
    # local
    #
    #1 inr.ruhep
    10 internode
    11 exetel

/etc/network/interfaces::

    # This -*-conf-*- file describes the network interfaces on your system
    # and how to activate them. For more information, see interfaces(5).
    allow-auto lo unmanaged managed scratch dmz
    allow-hotplug lo unmanaged managed scratch dmz

    iface lo inet loopback
        # Until I think of a better place, add blackhole routes when the
        # loopback interface comes up. These ensure (unused) private IP
        # ranges aren't accidentally sent to the internet. Note that even
        # without this, Internode filters outbound packets to private IPs.
        up ip route add blackhole 10/8
        up ip route add blackhole 172.16/12
        up ip route add blackhole 192.168/16
        up ip route add blackhole 169.254/16
        up ip route add blackhole 203.7.155/24
        up ip route add blackhole 203.27.58/24
        down ip route flush type blackhole
        # Also prepare some policy routing for two upstreams.
        up ip rule add pri 128 lookup main
        up ip route add throw default
        down ip route del throw default
        down ip rule del pri 128 lookup main

    iface dmz inet manual
        up ip link set dev $IFACE up
        up ip address add dev $IFACE brd + 203.7.155.1/26
        down ip address flush dev $IFACE
        down ip link set dev $IFACE down

    iface managed inet manual
        up ip link set dev $IFACE up
        up ip address add dev $IFACE brd + 203.7.155.129/26
        down ip address flush dev $IFACE
        down ip link set dev $IFACE down

    iface unmanaged inet manual
        up ip link set dev $IFACE up
        up ip address add dev $IFACE brd + 203.7.155.65/26
        down ip address flush dev $IFACE
        down ip link set dev $IFACE down

    iface scratch inet manual
        up ip link set dev $IFACE up
        up ip address add dev $IFACE brd + 203.7.155.193/26
        down ip address flush dev $IFACE
        down ip link set dev $IFACE down

    # The "unit N" below forces the interface to be named pppN.
    # It has NO RELATION to the port number on the ATM card.
    # This is done via upstart now, because ifupdown was doing the Wrong Thing.
    #iface internode inet ppp
    #    provider internode
    #    unit 0
    #iface exetel inet ppp
    #    provider exetel
    #    unit 1

/etc/init/pppd-internode.conf::

    start on runlevel [2345]
    stop on runlevel [^2345]
    respawn
    exec pppd call internode unit 0

/etc/init/pppd-exetel.conf::

    start on runlevel [2345]
    stop on runlevel [^2345]
    respawn
    exec pppd call exetel unit 1

/etc/ppp/peers/internode::

    user "UNPRINTABLE"
    plugin pppoatm.so 0.8.35
    noipdefault
    persist
    maxfail 0
    nodeflate
    noauth
    linkname internode
    nodetach

/etc/ppp/peers/exetel::

    user "UNPRINTABLE"
    plugin pppoatm.so 1.8.35
    noipdefault
    persist
    maxfail 0
    noauth
    nodeflate
    linkname exetel
    nodetach

/etc/ppp/ip-down.d/route::

    #!/bin/sh -x
    case $PPP_IFACE in
        ppp0) # internode
            ip rule del pri 256 fwmark 0xA lookup internode
            ip rule del pri 1024 lookup internode
            ip route del default dev ppp0 table internode
            ;;
        ppp1) # exetel
            ip rule del pri 512 fwmark 0xB lookup exetel
            ip rule del pri 2048 lookup exetel
            ip route del default dev ppp1 table exetel
            ;;
        *)
            echo >&2 "$0: This can't happen!"
            exit 1
            ;;
    esac

/etc/ppp/ip-up.d/route::

    #!/bin/sh -x
    case $PPP_IFACE in
        ppp0) # internode
            ip rule add pri 256 fwmark 0xA lookup internode
            ip rule add pri 1024 lookup internode
            ip route add default dev ppp0 table internode
            # The *highest* priority upstream iface must lack RPF, else
            # traffic to all OTHER upstream iface(s).
            echo 0 > /proc/sys/net/ipv4/conf/ppp0/rp_filter
            # It must also be turned off for all, because if "all" is 1,
            # the individual settings are ignored.
            echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
            ;;
        ppp1) # exetel
            ip rule add pri 512 fwmark 0xB lookup exetel
            ip rule add pri 2048 lookup exetel
            ip route add default dev ppp1 table exetel
            ;;
        *)
            echo >&2 "$0: This can't happen!"
            exit 1
            ;;
    esac

    ## UPDATE: apparently the above rp_filter line is not sufficient.
    ## Brute force it for now.
    for i in /proc/sys/net/ipv4/conf/*/rp_filter; do echo 0 > "$i"; done
    # Ref. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/ip-sysctl.txt;hb=HEAD
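
The two hook scripts above mirror each other, differing only in add versus del. One possible refactoring, which is not what alpha actually runs, is a single shared helper. The sketch below is a dry run by default: the RUN variable and the function name are made up for illustration, and with RUN=echo the commands are only printed. It deliberately leaves out the rp_filter handling.

```shell
#!/bin/sh
# Sketch only: one helper for both ip-up.d and ip-down.d.
# RUN=echo (the default here) prints the commands instead of running
# them; set RUN= (empty) to execute them for real, as root.
RUN=${RUN-echo}

# uplink_rules ACTION IFACE, where ACTION is "add" or "del".
uplink_rules() {
    action=$1 iface=$2
    case $iface in
        ppp0) table=internode mark=0xA mark_pri=256 plain_pri=1024 ;;
        ppp1) table=exetel    mark=0xB mark_pri=512 plain_pri=2048 ;;
        *)    echo >&2 "uplink_rules: unknown uplink: $iface"
              return 1 ;;
    esac
    $RUN ip rule "$action" pri "$mark_pri" fwmark "$mark" lookup "$table"
    $RUN ip rule "$action" pri "$plain_pri" lookup "$table"
    $RUN ip route "$action" default dev "$iface" table "$table"
}
```

The ip-up.d hook would then reduce to ``uplink_rules add "$PPP_IFACE"`` and the ip-down.d hook to ``uplink_rules del "$PPP_IFACE"``, so priorities and marks live in one place.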