Response to "L" Root Server Scaling Report released by ICANN 17
September 2009
- Summary
- Background
- Research
- The Problem
- Why this Only Affects the Root
- Fix
Summary
OARC researchers Geoffrey Sisson and Duane Wessels discovered that
BIND 9 suffered poor performance when serving a simulated DNSSEC-signed
root zone containing roughly 100,000 or more delegations.
ISC investigated and discovered that BIND can be much slower at
answering queries when serving a DNSSEC-signed zone that contains long
runs of glue records (see http://tinyurl.com/DNS-glue) between non-glue
records. This zone layout is effectively unique to the root zone.
THIS ISSUE ONLY AFFECTS ROOT SERVER OPERATORS SERVING A VERY LARGE SIGNED ROOT ZONE.
Resolution:
ISC will include the fix for this issue in our beta release of BIND
9.7.0, available 2009-10-01 (October 1, 2009). The final release of
BIND 9.7.0 is due out on 2009-12-08 (December 8, 2009), and will also
include the fix.
Background
OARC has conducted a study of the impact of large numbers of
delegations in the root zone on the operation of a root name server.
Part of this study used zone files to simulate a large root zone, both
secured with DNSSEC and not. The sizes of the zone files increased by
factors of ten: 10 thousand delegations, 100 thousand delegations,
1 million delegations, and so on.
Benchmarks conducted on the simulated root zone files with 100 thousand
delegations and more showed a decrease in performance, with a large
percentage of queries being dropped by the server. The researchers
confirmed this result by repeating the same test with the BIND server
running on a different operating system.
A pre-release draft of the study was shown to ISC staff, who asked for
permission to investigate the problem. OARC kindly allowed someone from
ISC to log on to the test machines and do some measurement and testing.
Research
The first step was to duplicate the results. This was straightforward,
using the zone and query set from the draft report, on the same
computers.
The second step was to look for strange system behavior. This would
include heavy disk access, excessive memory use, odd network patterns,
and so on. When performance was analyzed using basic tools available on
the Linux systems being tested (top, vmstat, netstat), the only thing
unusual was that BIND was using almost 100% of the CPU. Since
BIND appeared to be answering properly, just very slowly, this
indicated that some code was performing sub-optimally with this
particular data set.
The third step was to figure out which code this was. BIND was compiled
using gcc's "-pg" option, which causes profiling information to be
written out when the program exits. BIND was then run against the
problem input in a way that allowed this profile to be collected and
examined. The profile revealed that the majority of the time was spent
in a single function: find_closest_nsec().
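As a rough illustration of this workflow (not the exact commands used in
the investigation), the gprof cycle looks like the following;
profile-demo.c and hotspot() are stand-ins for the real build and the
real hot function:

    /*
     * A minimal sketch of gcc/gprof profiling. Compile with -pg, run
     * the program to completion so gmon.out is written, then inspect
     * the flat profile:
     *
     *   cc -pg -O0 profile-demo.c -o profile-demo
     *   ./profile-demo            # writes gmon.out on exit
     *   gprof profile-demo gmon.out | head -20
     */
    #include <stdio.h>

    /* Stand-in for a hot function such as find_closest_nsec(). */
    static unsigned long hotspot(unsigned long n) {
        unsigned long sum = 0;
        for (unsigned long i = 0; i < n; i++)
            sum += i * i;
        return sum;
    }

    int main(void) {
        printf("%lu\n", hotspot(200000000UL));
        return 0;
    }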
The fourth step was to look at the function and figure out why it was
using all of the time. The code was instrumented to see which portions
of the function were taking the longest: the time from gettimeofday()
was recorded before and after different sections of the function, and
the difference added to a simple timer for each section. This revealed
that there was no single point in the function consuming all of the
time; rather, the large loop that covers most of the function had the
execution time spread more-or-less evenly over it.
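The instrumentation pattern, in a minimal self-contained sketch
(do_work() stands in for one section of find_closest_nsec(); the real
change was made inside BIND):

    #include <stdio.h>
    #include <sys/time.h>

    /* Accumulated microseconds spent in the instrumented section. */
    static long long section_usec = 0;

    /* Current time in microseconds since the epoch. */
    static long long now_usec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
    }

    /* Stand-in for one section of the function under study. */
    static void do_work(void) {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 1000000; i++)
            x += i;
    }

    int main(void) {
        for (int i = 0; i < 100; i++) {
            long long start = now_usec();
            do_work();
            section_usec += now_usec() - start;  /* add difference to timer */
        }
        printf("section total: %lld usec\n", section_usec);
        return 0;
    }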
The fifth step was to investigate this looping behavior. Counters were
added at various points in the loop to see what was actually running.
This showed that the loop itself was executing many, many times per
invocation: over 700 on average. The reason for this looping was
tracked down to a few lines of code:
    } else if (found == NULL && foundsig == NULL) {
            /*
             * This node is active, but has no NSEC or
             * RRSIG NSEC.  That means it's glue or
             * other obscured zone data that isn't
             * relevant for our search.  Treat the
             * node as if it were empty and keep looking.
             */
            empty_node = ISC_TRUE;
            result = dns_rbtnodechain_prev(&search->chain,
                                           NULL, NULL);
At this point, the results of the investigation were handed over to the
larger ISC BIND development team for analysis. The discussion indicated
that this loop should only run a few times at most. The loop walks a
tree that contains a node for every name in the zone, and each node it
visits should be either a name owning an NSEC record, which ends the
search, or glue, which does not.
The Problem
One of the engineers was given a copy of the data and was able to
reproduce the problem outside of the test environment. He discovered
that the cause was long runs of glue records in the zone.
The code works by using the tree to find a record matching a name. This
is efficient, taking O(log N) time on average. However, since glue
records are unsigned, they cannot be returned even though they match
the name. So the code walks backwards one record at a time until it
finds a non-glue record. This is not an efficient operation, taking
O(N) time.
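The two costs can be seen in a toy model of the search (a sketch only: a
sorted array stands in for BIND's red-black tree, strcmp() for DNS
canonical ordering, and the names are made up):

    #include <stdio.h>
    #include <string.h>

    #define NGLUE 300

    /* Toy zone: names in sorted order; only delegations own NSEC
     * records, glue names do not. */
    struct node { char name[32]; int has_nsec; };
    static struct node zone[NGLUE + 2];

    /* Find the last node whose name sorts at or before qname (assumes
     * qname sorts at or after the first entry). Binary search over the
     * sorted array: O(log N), like the tree lookup. */
    static int find_closest(const char *qname, int n) {
        int lo = 0, hi = n - 1, best = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (strcmp(zone[mid].name, qname) <= 0) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best;
    }

    int main(void) {
        int n = 0;

        /* One delegation, then a long run of glue, then the next name. */
        strcpy(zone[n].name, "net");
        zone[n++].has_nsec = 1;
        for (int i = 0; i < NGLUE; i++) {
            snprintf(zone[n].name, sizeof(zone[n].name), "net.glue-%04d", i);
            zone[n++].has_nsec = 0;
        }
        strcpy(zone[n].name, "nf");
        zone[n++].has_nsec = 1;

        /* A query such as "network" sorts into the glue run. */
        int i = find_closest("network", n);

        /* The NSEC search must now step backwards one node at a time
         * past every glue record: O(N) in the length of the run. */
        int steps = 0;
        while (!zone[i].has_nsec) {
            i--;
            steps++;
        }
        printf("closest NSEC owner: %s after %d backward steps\n",
               zone[i].name, steps);
        return 0;
    }

With NGLUE set to 300, the backward walk takes 300 steps; with runs like
the 10-thousand-record one seen in the study data, the same walk takes
10 thousand steps for a single query.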
The reason the code is written this way is that in the usual case very
little glue was expected: perhaps 2 or 3 records per NSEC. In that case
the extra work would not matter much. However, in the data for the
RZAIA study, there was one case with over 10 thousand glue records
before the NSEC record!
Why this Only Affects the Root
The current root zone has a portion that looks like this:
net. 172800 IN NS l.gtld-servers.net.
net. 172800 IN NS m.gtld-servers.net.
ns1.aalnet.net. 172800 IN A 194.112.0.1
ns2.aalnet.net. 172800 IN A 194.112.0.5
ns3.aalnet.net. 172800 IN A 82.199.186.130
ns1.admin.net. 172800 IN A 198.73.186.1
ns2.admin.net. 172800 IN A 216.113.38.83
dns2.gt.amnetdatos.net. 172800 IN A 200.30.145.4
ns.amnic.net. 172800 IN A 195.250.64.90
ns.amnic.net. 172800 IN AAAA 2001:4d00::90
sec1.apnic.net. 172800 IN AAAA 2001:dc0:2001:a:4608::59
sec1.apnic.net. 172800 IN A 202.12.29.59
... 272 more glue owner names ...
auth210.ns.uu.net. 172800 IN A 195.129.12.74
auth51.ns.uu.net. 172800 IN A 198.6.1.162
auth61.ns.uu.net. 172800 IN A 198.6.1.182
avala.yubc.net. 172800 IN A 212.124.160.1
nf. 172800 IN NS nf1.dyntld.net.
nf. 172800 IN NS nf2.dyntld.net.
If the root is signed, then any lookup for a name that sorts between
net and nf (such as "network" or "new") will have to scan backwards
through the 284 glue records.
This pattern occurs because:
- The root zone requires glue for every name server.
In other zones, it is possible to have a server under a different
domain. For example, the org
domain might contain:
example.org. ns ns1.example.org.
ns ns2.example.com.
ns1.example.org. a 192.0.2.1
Glue for ns2.example.com is not included because it is in the com
domain. The root zone, by contrast, must include glue for all name
servers.
- All glue in a given TLD occurs in a run.
In a typical delegation-only domain, we expect a few glue records per
domain at most, like this:
alpha.example. ns ns1.alpha.example.
ns ns2.alpha.example.
ns1.alpha.example. a 192.0.2.3
ns2.alpha.example. a 192.0.2.130
ns-beta.alpha.example. a 192.0.2.131
beta.example. ns ns1.beta.example.
ns ns-beta.alpha.example.
ns1.beta.example. a 192.0.2.4
This is because glue in a typical zone is for names one level down:
each delegation's few glue records sort immediately after it. In the
root, glue names are two levels down, so all of the glue for a given
TLD sorts into one long run right after that TLD's delegation. The
excerpt from the root zone above shows this.
This combination of requiring all glue and having all of the glue for a
given TLD in a row means that the root zone contains long runs of glue
records. This is exactly the pattern that causes BIND to perform poorly
with an NSEC-signed zone.
This problem only affects "traditional" DNSSEC. For NSEC3 zones, the
NSEC3 records are stored in a separate tree, so this walking never
occurs.
Of the 1135 name servers in the current root zone that have glue, 284
(roughly 25%) are in the net domain. While we cannot exactly predict
the pattern of name servers for a larger root zone, it is reasonable to
assume a similar ratio will continue, leading to problems for unfixed
versions of BIND.
It is possible to construct a zone with a long run of glue records,
sign it using NSEC, and subject it to NXDOMAIN queries to make BIND
perform poorly. However, this is not something that will affect normal
zones, even very large ones.
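As a rough illustration of such a construction (the names, addresses,
and counts below are made up; the generated zone would then be signed
with a tool such as BIND's dnssec-signzone and queried for nonexistent
names):

    #include <stdio.h>

    /* Emit a toy zone with one delegation followed by a long run of
     * below-the-cut A records (glue and occluded data), the pattern
     * that triggers the backward walk. Everything here is illustrative. */
    int main(void) {
        printf("$TTL 172800\n");
        printf("example. IN SOA ns.example. hostmaster.example. "
               "1 7200 3600 1209600 3600\n");
        printf("example. IN NS ns.example.\n");
        printf("ns.example. IN A 192.0.2.1\n");
        printf("sub.example. IN NS ns0.host.sub.example.\n");
        for (int i = 0; i < 10000; i++)
            printf("ns%d.host.sub.example. IN A 192.0.2.2\n", i);
        return 0;
    }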
Fix
The solution to the problem is to store NSEC records in a separate
tree, just as NSEC3 records are stored.
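The effect can be sketched with the toy model from above: if the search
structure contains only the names that own NSEC records, the
predecessor lookup alone is sufficient and no backward walk occurs
(again a sketch, with made-up names and a sorted array standing in for
the tree):

    #include <stdio.h>
    #include <string.h>

    /* Auxiliary index holding only NSEC-owning names, in sorted order. */
    static const char *nsec_owners[] = { "net", "nf", "ng" };

    /* Predecessor lookup over the NSEC-only index: O(log N), with no
     * glue to step over (assumes qname sorts at or after the first
     * entry). */
    static const char *closest_nsec(const char *qname, int n) {
        int lo = 0, hi = n - 1, best = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (strcmp(nsec_owners[mid], qname) <= 0) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return nsec_owners[best];
    }

    int main(void) {
        /* The query that previously triggered a 300-step walk now
         * resolves with a single lookup. */
        printf("%s\n", closest_nsec("network", 3));  /* prints "net" */
        return 0;
    }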
This is a non-trivial change. As such, it will likely not be backported
to earlier versions of BIND 9 as a bug fix. Administrators running
large zones that contain long runs of glue will either need to use
NSEC3 to secure their zones or install BIND 9.7 or newer.