Kernel Configuration


The kernels of both Mac OS X and Linux can be configured using sysctl: either call the sysctl program, or edit /etc/sysctl.conf.

For example:

sysctl net.ipv4.tcp_syncookies=1

Or in /etc/sysctl.conf:

net.ipv4.tcp_syncookies=1
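
On Linux, settings from /etc/sysctl.conf can be reloaded without a reboot (on Mac OS X, the file is read at boot time):

sysctl -p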

The following guide was written for Linux kernel 2.6 and Mac OS X 10.5. Defaults which are already set properly are not mentioned.

For a thorough overview, see

TODO: verify statements with http://www.netadmintools.com/html/7tcp.man.html

TCP/IP Security

SYN Attacks

An attacker may perform a denial-of-service attack by opening a lot of connections at the same time, with spoofed IP addresses. The connection queue of the computer easily runs out. This is a SYN attack: it leaves a lot of connections in the SYN-RECEIVED state, because the attacker never answers the server's SYN/ACK, causing a lot of time-outs.
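
On Linux, you can watch such half-open connections pile up with netstat (SYN_RECV is the state name Linux displays):

netstat -tan | grep SYN_RECV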

This attack has been known since 1994, but its prevention is still open for debate. RFC 4987 gives a good overview. The RFC lists the following mitigation strategies:

  • Filtering (of bogon IP addresses)
  • Increasing Backlog (the queue length)
  • Reducing SYN-RECEIVED Timer (reduce the amount of time an opening TCP connection may stay half-open)
  • Recycling the oldest Half-Open connection
  • SYN Cache (keep as little state as possible till the ACK)
  • SYN Cookies (keep no state at all until the ACK, but use a cryptographic hash instead)
  • Hybrid Approaches
  • Firewalls and Proxies

Furthermore, you can:

  • Buy more memory, thus allowing more incoming connections
  • Reduce the amount of time a closing TCP connection can stay in the TIME_WAIT state (close faulty connections earlier, freeing memory for new connections).

Of course, whatever the solution, keep the goal in mind: a SYN attack aims at disrupting regular network connections, so your countermeasure must never unnecessarily close valid connections. Otherwise, you are only helping the attacker.

SYN Cache is considered the best strategy to mitigate SYN attacks, besides obvious improvements such as the filtering of bogon IP addresses. SYN Cookies are the next best solution, and even perform slightly better than SYN Cache, at the expense that they cannot be used with TCP options (such as selective ACKs), and thus perform slightly worse when not under attack. When SYN Cookies are only enabled when there are many incoming connections (during a SYN attack), they are just as good as SYN Cache.

Given that SYN Cookies and SYN Cache perform equally well, but that SYN Cookies have some major disadvantages during regular operation (no or bad support for TCP options), I recommend using SYN Cache on machines where it is available (FreeBSD), and SYN Cookies on other machines (such as Linux and Windows). On machines which implement neither (Mac OS X), buy more RAM and cross your fingers.
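
To check what your machine supports, query the relevant keys (assuming these key names, which differ per operating system):

sysctl net.ipv4.tcp_syncookies   # Linux
sysctl net.inet.tcp.syncookies   # FreeBSD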

SYN Cache

A server that implements SYN Cache allocates only minimal memory for each incomplete connection (after receiving a SYN). The full memory structure (which is 1300 bytes on Linux, excluding the send and receive buffers) is only allocated after the ACK is received.

Unlike SYN Cookies, SYN Cache has no drawbacks, while it performs just as well.

FreeBSD implements SYN Cache by default, and it does not need to be tuned.

Linux, Windows and Mac OS X do not seem to have a SYN cache implementation.

SYN Cookies

SYN Cookies were first proposed by Daniel J. Bernstein. A server using SYN Cookies does not keep any state for incoming connections until they are ACK'ed by the client. Rather than choosing a random initial sequence number for each connection, it sends a cryptographic hash over the client's IP address and a secret number (which changes over time). The disadvantage of SYN Cookies is that they disable the use of TCP options, such as SACK (Selective Acknowledgement) and window scaling, and they interfere with the experimental T/TCP (Transaction TCP, RFC 1644). Linux 2.6.26 and up contain an ugly hack to make SYN Cookies and SACK work together.

There is some discussion on whether SYN Cookies are a good idea at all. They consume some CPU power, and most experts think they are more trouble than they're worth, considering the amount of memory available to most computers. However, measurements show that SYN Cookies help in serving legitimate connections during an attack.

SYN Cookies are available in Linux, but disabled by default. As Linux unfortunately does not implement SYN Cache, I recommend enabling SYN Cookies:

net.ipv4.tcp_syncookies = 1

FreeBSD implements both SYN Cache and SYN Cookies, and has both enabled by default (presumably, the cookie part is only used during high load).

Windows has a SYN cookie-like mechanism which only kicks in during high network traffic. This is a good design.

Faster Closing of Connections

These settings ensure that TIME_WAIT ports either get reused or closed quickly.

For Linux:

net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
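
To see whether TIME_WAIT sockets are actually piling up, you can count them (a quick check using netstat):

netstat -tan | grep TIME_WAIT | wc -l
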
This article is unfinished.

TODO: examine first, as the rest of the advice was rather bad.


IP Source Routing

Turn off source routing.

Linux:

net.ipv4.conf.default.accept_source_route = 0  # refuse source-routed packets
net.ipv4.conf.all.mc_forwarding = 0            # no multicast forwarding
net.ipv4.conf.all.forwarding = 0               # no IP forwarding

Mac OS X is secure by default:

net.inet.ip.accept_sourceroute = 0
net.inet.ip.sourceroute = 0

ICMP Routing Redirects

Linux:

net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv6.conf.all.send_redirects = 0

Mac OS X:

net.inet.ip.redirect = 0
net.inet.ip6.redirect = 0

TODO: Also turn off packet forwarding, unless you really want to create a firewall or router:
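
Presumably the intended settings are these (the Linux key is the global forwarding switch, already covered by the setting above; the Mac OS X key follows the usual net.inet naming):

Linux:

net.ipv4.ip_forward = 0

Mac OS X:

net.inet.ip.forwarding = 0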

Reverse Path Filter

The reverse path filter drops all incoming packets whose source IP address does not match the outgoing routing table. If you have a default route, this means that no packets are filtered. If you do not have a default route (e.g. you are an ISP with multiple peers), you cannot assume that traffic routes are symmetric: in practice many ISPs do hot-potato routing, and traffic may come in via a different route than the one you send out on. In short, this feature is useless, and you should turn it off:

Linux:

net.ipv4.conf.default.rp_filter = 0

Of course, if you are curious about asymmetric paths, you can log them (again, this only works if your host has no default route):

net.ipv4.conf.default.log_martians = 1

That said, filtering bogon traffic is mighty useful, both for home networks and for ISPs. However, this is a feature that you should really implement in your firewall, not in individual hosts, let alone by some kernel option.

First of all, you should also filter outgoing traffic, to make sure that the source addresses of all packets really are IP addresses that your ISP, RIPE or ARIN assigned to you. (If you are an ISP, you will be admired by the CERT members of fellow ISPs who want to track DDoS attacks.)

Furthermore, it is a good idea to filter incoming traffic with forged source addresses. Such source IP addresses are called bogons. Team Cymru's Bogon Reference gives an excellent technical reference (geared towards ISPs). If you decide to filter bogons, do make sure to automatically update your firewall filter, as RIPE, ARIN and the other RIRs regularly assign new address blocks, and your filter should reflect those changes as well.
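
As a sketch of such an egress filter on a Linux router (the address block 203.0.113.0/24 and the interface name wan0 are hypothetical placeholders for your own assigned block and uplink interface):

# drop forwarded packets leaving via the uplink whose source is not in our block
iptables -A FORWARD -o wan0 ! -s 203.0.113.0/24 -j DROP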

Broadcast Echo Response

Linux (correctly configured by default):

net.ipv4.icmp_echo_ignore_broadcasts = 1

Mac OS X:

net.inet.icmp.bmcastecho = 0

Other Broadcast Probes

While not really harmful, you may want to turn off responses to ICMP broadcast probes altogether. This is not tunable on Linux, and properly configured by default on Mac OS X.

TCP Performance Tuning

Window Size

The most important tuning parameter for all reliable transport protocols is the window size. The window is a buffer that contains a copy of all packets that have been sent out. If a packet is lost in transit, it is taken from the buffer and resent. So the buffer should be large enough to contain all unacknowledged packets: all the packets that are still in transit. If the window size is too small, it can severely throttle throughput. However, a large window size can consume very large quantities of memory.

The optimal window size can be calculated as window size = bandwidth * RTT, where RTT is the round-trip time of a connection. As bandwidth is usually given in bits per second, while the window size is given in bytes, you need to divide the outcome by 8 bits/byte: window size (in bytes) = bandwidth (in bit/s) * RTT (in s) / 8. Here are some examples:

Bandwidth                           Round Trip Time                   Window Size
100 Mbit/s (regular home network)   0.5 ms (same building)            6250 byte (6 kiByte)
1 Gbit/s (fast home network)        0.5 ms (same building)            62500 byte (60 kiByte)
1 Mbit/s (DSL upload)               10 ms (same country or state)     1250 byte (1.2 kByte)
1 Mbit/s (DSL upload)               250 ms (other side of the world)  31250 byte (30 kiByte)
10 Mbit/s (ADSL download)           10 ms (same country or state)     12500 byte (12 kByte)
10 Mbit/s (ADSL download)           250 ms (other side of the world)  312500 byte (300 kiByte)
100 Mbit/s (fiber to the home)      10 ms (same country or state)     125000 byte (122 kiByte)
100 Mbit/s (fiber to the home)      250 ms (other side of the world)  3125000 byte (3.0 MiByte)
1 Gbit/s (data center)              10 ms (same country or state)     1250000 byte (1.2 MiByte)
1 Gbit/s (data center)              250 ms (other side of the world)  31250000 byte (30 MiByte)
10 Gbit/s (backbone connection)     250 ms (other side of the world)  312500000 byte (300 MiByte)

For a 20 Mbit/s downstream link and a 1 Mbit/s upstream link (and a maximum RTT of 250 ms), the maximum receive window size is about 625 kByte, and the maximum send window size is about 32 kByte.
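
These numbers are easy to reproduce in the shell, for example with bc:

echo "20000000 * 0.250 / 8" | bc   # 625000 byte receive window (20 Mbit/s at 250 ms)
echo "1000000 * 0.250 / 8" | bc    # 31250 byte send window (1 Mbit/s at 250 ms)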

As you can see, the required memory differs immensely. Unfortunately, TCP does not know beforehand how much buffer space to reserve. If it reserved 300 MiByte for each and every network connection, your computer would easily run out of memory (if you browse, mail and run a software update at the same time, you quickly consume 10 or 20 network connections). Fortunately, most operating systems are rather smart and have an auto-tuning feature that dynamically adjusts the window size.

In addition to the sender's TCP window (the congestion window, cwnd), the receiver must maintain a buffer as well (the receive window, rwnd). The receiver reports this buffer size to the sender for flow-control purposes (this is essential if the receiving host or application cannot keep up with the data flood coming from the network). The receive buffer is also used to reorder packets which did not arrive in the correct order.

So, you have to:

  • Enable auto-tuning
  • Set the default sender buffer (TCP window) size
  • Set the maximum sender buffer (TCP window) size
  • Set the default receiver buffer (TCP window) size
  • Set the maximum receiver buffer (TCP window) size

For Linux, you only need to set the minimum, default and maximum window size (these settings are for computers with a Gigabit Ethernet uplink):

net.ipv4.tcp_window_scaling  = 1        # Enable TCP window scaling
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 # minimum, default and maximum window size
net.ipv4.tcp_wmem = 4096 65536 16777216 # minimum, default and maximum window size

Do not set tcp_mem yourself, and restrict yourself to the above values.

For Mac OS X and FreeBSD:

net.inet.tcp.rfc1323   = 1          # Enable TCP window scaling
kern.ipc.maxsockbuf    = 16777216   # Maximum TCP Window size
net.inet.tcp.sendspace = 131072     # Default send buffer
net.inet.tcp.recvspace = 358400     # Default receive buffer

For FreeBSD, you can also set the following values for TCP window auto tuning:

net.inet.tcp.sendbuf_auto = 1        # enable autotuning
net.inet.tcp.sendbuf_inc = 8192      # step size
net.inet.tcp.sendbuf_max = 16777216  # maximum send buffer
net.inet.tcp.recvbuf_auto = 1        # enable autotuning
net.inet.tcp.recvbuf_inc = 16384     # step size
net.inet.tcp.recvbuf_max = 16777216  # maximum receive buffer

The maxsockbuf is high enough for Gigabit Ethernet upstream connections, while the send and receive space are a good compromise between speed and memory usage. These values (taken from the Broadband Tuner) are certainly good enough for home use with all DSL links (up to 100 Mbit/s).

Unless you are still using a modem, you should disable FreeBSD's inflight limiter, which caps the window based on a bandwidth-delay-product estimate (it's really crappy for today's networks):

net.inet.tcp.inflight.enable = 0

Other Buffer Sizes

In addition to the TCP window size, you may increase the default buffer sizes for the loopback interface (a software socket for localhost connections), as well as for UDP.

For Mac OS X:

net.local.stream.sendspace = 32768
net.local.stream.recvspace = 32768

And for UDP, also on Mac OS X:

net.inet.udp.recvspace = 74848

Selective Acknowledgements (SACK)

Selective acknowledgements make sure that if a single packet is lost, only that one packet is resent (without SACK, the receiver cannot acknowledge the receipt of packets arriving after the lost one, so those have to be resent too).

SACK is enabled by default on all operating systems, so there is no need to set any kernel parameters. For Mac OS X and FreeBSD, the corresponding setting is:

net.inet.tcp.sack = 1

SACK may not work if the server has SYN Cookies enabled, because SYN Cookies ignore all TCP options, such as window scaling and SACK.

Furthermore, there may be an esoteric bug in Linux when you use SACK for connections with a window size of 12 MByte and up. For such large window sizes, it may take so long to find a specific packet that TCP times out.

I recommend leaving SACK enabled in all situations.
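
To verify on Linux that SACK is indeed on:

sysctl net.ipv4.tcp_sack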

Limit Number of Connections

The queue size limits the number of sockets. If it is set too low, it makes denial-of-service attacks easier. Don't set it too high either: your server has limited memory, and a large queue does not help in giving faster answers. The common consensus is that the default value is too low and has not kept up with other hard limits (such as kern.maxproc and kern.maxfiles).

The following numbers are fine for a small server (e.g. a file server). For a home computer, they should be lower. For a high-traffic webserver, the values should be a bit higher still.

Linux:

ifconfig eth0 txqueuelen 2048   # transmit queue length of the interface, in packets
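
If you prefer the iproute2 tools over ifconfig, the equivalent command should be:

ip link set eth0 txqueuelen 2048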

Linux:

net.ipv4.tcp_max_syn_backlog = 1024  # limit number of new connections
net.core.netdev_max_backlog  = 1024
net.core.somaxconn = 1024        # limit number of new connections

Mac OS X and FreeBSD:

kern.ipc.somaxconn  = 1024   # limit number of new connections
kern.ipc.maxsockets = 1024   # Initial number of sockets in memory

For Mac OS X, and presumably FreeBSD, there is no hard limit on the number of sockets: kern.ipc.maxsockets will simply grow as sockets are required and memory allows. There is no need to set its initial value very large, as many sources suggest.

API Buffer

This article is unfinished.

(The default is 8 kByte; this should be increased.)

Delayed Acknowledgements

Delayed acknowledgements send fewer acknowledgements in an effort to reduce ACK traffic. This is particularly useful for connections with lots of small packets, such as SSH connections.

Unfortunately, there is a pretty bad interaction with Nagle's algorithm, which provides a similar protection on the sender side (it waits with sending more data until the previous data is acknowledged). Delayed acknowledgements and Nagle's algorithm can interact, causing the sender and receiver to wait on each other. This is why I recommend turning off delayed acknowledgements. (Alternatively, applications can disable Nagle's algorithm per socket with the TCP_NODELAY option.)

Mac OS X:

net.inet.tcp.delayed_ack = 0

TCP Congestion Control Algorithms

Mac OS X lets you enable NewReno congestion control:

net.inet.tcp.newreno = 1

Since Linux 2.6.23, you can choose between a slew of congestion control algorithms: Vegas (older), Westwood (for lossy networks), Reno (common), BIC, CUBIC (new), and H-TCP (Hamilton TCP). To see the available options:

sysctl net.ipv4.tcp_available_congestion_control

To set the algorithm:

sysctl -w net.ipv4.tcp_congestion_control=bic
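
To make the choice persistent across reboots, put it in /etc/sysctl.conf (cubic as an example):

net.ipv4.tcp_congestion_control = cubic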

Packet Size

Mac OS X has a very small default TCP packet size of 512 bytes. I recommend increasing it to the Ethernet default of 1500 bytes, which corresponds to a maximum segment size of 1460 bytes:

Mac OS X:

net.inet.tcp.mssdflt=1460

The packet sizes for UDP (net.inet.udp.maxdgram and net.inet.udp.recvspace) are correct.

Further Tuning

Linux 2.4 and 2.6 kernels have a weird feature that reuses the slow-start threshold of previous connections. This is OK if there is a lot of continuous congestion, but for short periods of congestion it makes little sense. I strongly recommend turning it off.

Linux:

net.ipv4.tcp_no_metrics_save = 1  # do not reuse metrics (like ssthresh) from earlier connections
net.ipv4.tcp_moderate_rcvbuf = 1  # auto-tune the receive buffer within the tcp_rmem bounds