Response to "L" Root Server Scaling Report released by ICANN 17 September 2009

  1. Summary
  2. Background
  3. Research
  4. The Problem
  5. Why this Only Affects the Root
  6. Fix

Summary

OARC researchers Geoffrey Sisson and Duane Wessels discovered that BIND 9 suffered poor performance when serving a simulated DNSSEC-signed root zone containing about 100,000 or more delegations.

ISC investigated and discovered that BIND can be much slower at answering queries when serving a DNSSEC-signed zone that contains long runs of glue records (see http://tinyurl.com/DNS-glue) between non-glue records. This property is unique to the root zone.


THIS ISSUE ONLY AFFECTS ROOT SERVER OPERATORS SERVING A
VERY LARGE SIGNED ROOT ZONE.


Resolution:

ISC will include the fix for this issue in our beta release of BIND 9.7.0, available 2009-10-01 (October 1, 2009). The final release of BIND 9.7.0 is due out on 2009-12-08 (December 8, 2009), and will also include the fix.

Background

OARC has conducted a study of the impact of large numbers of delegations in the root zone on the operation of a root name server. Part of this study used zone files to simulate a large root zone, in both DNSSEC-signed and unsigned variants. The zone files were scaled up in steps of a factor of 10, so there were files with 10 thousand delegations, 100 thousand delegations, 1 million delegations, and so on.
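
The report does not describe the generation tooling, so the following is only a sketch of how a zone file of this shape might be produced; the label scheme, addresses, and record contents are invented for illustration:

    /*
     * Sketch of a generator for a simulated large root zone.
     * Everything below is invented for illustration; the RZAIA
     * test zones were not necessarily built this way.
     */
    #include <stdio.h>

    int main(void) {
        const long ndelegations = 100000;   /* scale by 10 for each test zone */
        long i;

        printf(". 86400 IN SOA ns.root.test. hostmaster.root.test. "
               "1 1800 900 604800 86400\n");
        printf(". 86400 IN NS ns.root.test.\n");
        printf("ns.root.test. 86400 IN A 192.0.2.1\n");

        for (i = 0; i < ndelegations; i++) {
            /* one delegation plus one in-bailiwick glue record each */
            printf("tld%06ld. 172800 IN NS ns1.tld%06ld.\n", i, i);
            printf("ns1.tld%06ld. 172800 IN A 192.0.2.%ld\n", i, i % 254 + 1);
        }
        return 0;
    }

A file like this would then be signed (for example with BIND's dnssec-signzone tool) to produce the DNSSEC variant of each test zone.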

Benchmarks conducted on the simulated root zone files with 100 thousand delegations and more showed a decrease in performance, with a large percentage of queries being dropped by the server. The researchers confirmed this result by repeating the same test with the BIND server running on a different operating system.

A pre-release of the study was shown to ISC staff, who asked for permission to investigate the problem. OARC kindly allowed someone from ISC to log on to the test machines and do some measurement and testing.

Research

The first step was to duplicate the results. This was straightforward, using the zone and query set from the draft report, on the same computers.

The second step was to look for strange system behavior. This would include heavy disk access, excessive memory use, odd network patterns, and so on. When performance was analyzed using basic tools available on the Linux systems being tested (top, vmstat, netstat), the only thing unusual was that BIND was using almost 100% of the CPU. Since BIND appeared to be answering properly, just very slowly, this indicated that some code was performing sub-optimally with this particular data set.

The third step was to figure out which code this was. BIND was compiled using gcc's "-pg" option, which causes profile information to be output when the program exits. It was run in a mode which allowed this information to be collected for the problem input. This revealed that the majority of the time was spent in a single function: find_closest_nsec().

The fourth step was to look at the function and figure out why it was using all of the time. The code was instrumented to see which portions of the function were taking the longest. This instrumentation was done by recording the time from the gettimeofday() function before and after different points in the function, and then adding the difference to a simple timer. This revealed that there was no single point in this function consuming all of the time, but rather that the large loop that covers most of the function had the execution time spread more-or-less evenly over it.
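
The instrumentation is not reproduced in this document; a minimal sketch of the technique, with an invented accumulator and an invented section boundary, looks like this:

    /*
     * Sketch of the gettimeofday()-based instrumentation described
     * above.  "section_usec" and "measured_section" are invented for
     * illustration; the real changes were ad hoc additions inside
     * find_closest_nsec().
     */
    #include <sys/time.h>

    static unsigned long long section_usec;    /* accumulated time in microseconds */

    static unsigned long long now_usec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return ((unsigned long long)tv.tv_sec * 1000000ULL + tv.tv_usec);
    }

    static void measured_section(void) {
        unsigned long long start = now_usec();
        /* ... the portion of the function being measured ... */
        section_usec += now_usec() - start;
    }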

The fifth step was to investigate this looping behavior. Counters were added at various points in the loop to see what was actually running. This showed that the loop itself was running many times for each invocation, over 700 times on average. The reason for this looping was tracked down to a few lines of code:

        } else if (found == NULL && foundsig == NULL) {
                /*
                 * This node is active, but has no NSEC or
                 * RRSIG NSEC.  That means it's glue or
                 * other obscured zone data that isn't
                 * relevant for our search.  Treat the
                 * node as if it were empty and keep looking.
                 */
                empty_node = ISC_TRUE;
                result = dns_rbtnodechain_prev(&search->chain,
                                               NULL, NULL);

At this point, the results of the investigation were handed over to the larger ISC BIND development team for analysis. The discussion indicated that this loop should only run a few times at most: it walks a tree which contains a node for every name in the zone, and the only entries it should encounter are names owning NSEC records or glue.

The Problem

One of the engineers was given a copy of the data and was able to reproduce the problem outside of the test environment. He discovered that the problem is caused by long runs of glue records in this zone.

The code works by using the tree to find a record matching a name. This is efficient, taking O(lg N) time on average. However, since glue records are unsigned, they cannot be returned even though they match the name. So the code walks backwards one name at a time until it finds a non-glue record. This is not an efficient operation, taking O(N) time.
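
In outline, the behavior amounts to the following sketch; the node type and helper function are invented for illustration and are not BIND's actual data structures:

    /*
     * Schematic of the lookup described above.  The types are invented;
     * BIND's real code walks a red-black tree chain with
     * dns_rbtnodechain_prev().
     */
    struct node {
        struct node *prev;      /* previous name in the zone's sort order */
        int has_nsec;           /* 0 for glue or other obscured data */
    };

    static struct node *closest_nsec_sketch(struct node *n) {
        /* n was located by a tree search: O(lg N) on average */
        while (n != NULL && !n->has_nsec) {
            /*
             * Glue is unsigned, so it cannot be used in the answer;
             * step back one name and try again.  A long run of glue
             * makes this loop O(N).
             */
            n = n->prev;
        }
        return n;
    }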

The reason the code is written this way is that in the usual case very little glue was expected: perhaps 2 or 3 records per NSEC, in which case the extra work would not matter much. However, in the data for the RZAIA study there was one case with over 10 thousand glue records before the NSEC record!

Why this Only Affects the Root

The current root zone has a portion that looks like this:
net.                    172800  IN      NS      l.gtld-servers.net.
net.                    172800  IN      NS      m.gtld-servers.net.
ns1.aalnet.net.         172800  IN      A       194.112.0.1
ns2.aalnet.net.         172800  IN      A       194.112.0.5
ns3.aalnet.net.         172800  IN      A       82.199.186.130
ns1.admin.net.          172800  IN      A       198.73.186.1
ns2.admin.net.          172800  IN      A       216.113.38.83
dns2.gt.amnetdatos.net. 172800  IN      A       200.30.145.4
ns.amnic.net.           172800  IN      A       195.250.64.90
ns.amnic.net.           172800  IN      AAAA    2001:4d00::90
sec1.apnic.net.         172800  IN      AAAA    2001:dc0:2001:a:4608::59
sec1.apnic.net.         172800  IN      A       202.12.29.59
   ... 272 more glue ownernames ...
auth210.ns.uu.net.      172800  IN      A       195.129.12.74
auth51.ns.uu.net.       172800  IN      A       198.6.1.162
auth61.ns.uu.net.       172800  IN      A       198.6.1.182
avala.yubc.net.         172800  IN      A       212.124.160.1
nf.                     172800  IN      NS      nf1.dyntld.net.
nf.                     172800  IN      NS      nf2.dyntld.net.

If the root is signed, then any lookup between net and nf (such as "network" or "new") will have to scan backwards through the 284 glue records.

This pattern occurs because:
  1. The root zone requires glue for every name server.
    In other zones, it is possible to have a server under a different domain. For example, the org domain might contain:
    example.org.            ns      ns1.example.org.
                            ns      ns2.example.com.
    ns1.example.org.        a       192.0.2.1
    Glue for ns2.example.com is not included because it is in the com domain. The root zone must include glue for all name servers.

  2. All glue in a given TLD occurs in a run.
    In a typical delegation-only domain, we expect a few glue records per domain at most, like this:
    alpha.example.          ns      ns1.alpha.example.
                            ns      ns2.alpha.example.
    ns1.alpha.example.      a       192.0.2.3
    ns2.alpha.example.      a       192.0.2.130
    ns-beta.alpha.example.  a       192.0.2.131
    beta.example.           ns      ns1.beta.example.
                            ns      ns-beta.alpha.example.
    ns1.beta.example.       a       192.0.2.4
    This is because we are putting in glue for names at the next level down. In the root, the glue is for names two levels down, which all sort immediately after the delegation they fall under. The excerpt of the root zone above shows this.

This combination of requiring all glue and having all of the glue for a given TLD in a row means that the root zone will contain long runs of glue records. This is exactly the pattern that causes BIND to perform poorly with an NSEC-signed zone.

This problem only affects "traditional" DNSSEC. For NSEC3 zones, the NSEC3 records are stored in a separate tree, so this walking never occurs.

The current root has 284 of 1135 servers with glue in the net domain, or roughly 25%. While we cannot predict the exact pattern of name servers for a larger root zone, it is reasonable to assume a similar ratio will continue, leading to problems with an unfixed BIND.

It is possible to construct a zone with a lot of glue records, sign it using NSEC, and subject it to NXDOMAIN queries to cause BIND to perform poorly. However, this is not something that will affect normal zones, even very large ones.

Fix

The solution to the problem is to store NSEC records in a separate tree, just as NSEC3 records are stored.
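
As a rough illustration of why this helps, here is a sketch using a sorted array and strcmp() as stand-ins for the auxiliary tree and for DNSSEC name ordering; none of this is BIND's actual implementation:

    #include <string.h>

    /* names that own NSEC records, in (stand-in) sorted order */
    static const char *nsec_names[] = { "net.", "nf.", "ng." };

    /* return the closest NSEC-owning name at or before qname: O(lg N) */
    static const char *closest_nsec(const char *qname) {
        int lo = 0;
        int hi = (int)(sizeof(nsec_names) / sizeof(nsec_names[0])) - 1;
        const char *best = NULL;

        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (strcmp(nsec_names[mid], qname) <= 0) {
                best = nsec_names[mid];     /* candidate predecessor */
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best;    /* e.g. "net." for a query name of "new." */
    }

Because glue never appears in this structure, the name found is always an NSEC owner, and the O(N) backwards walk over glue disappears.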

This is a non-trivial change. As such, it will likely not be included in earlier versions of BIND 9 as a bug fix. Administrators running large zones that have a lot of glue will need either to use NSEC3 to secure their zones or to install BIND 9.7 or newer.