Fun with IP address parsing
In my quest to write a fast IPv4+6 parser, I wrote a slow-but-I-think-correct parser, to use as a base of comparison. In doing so, I discovered more cursed IP address representations that I was previously unaware of. Let’s explore together!
We start out simple, with IPv4 and IPv6 in what I’ll call their “canonical form”: 192.168.0.1 and 1:2:3:4:5:6:7:8. Various specs call these “dotted quad” (more specifically, “dotted decimal”), dot-separated fields each representing 1 byte; and “colon-hex”, colon-separated fields each representing 2 bytes.
The first bits of complexity come from IPv6. In canonical form, common addresses would end up with long runs of zeros in the middle. So, :: allows you to elide 1 or more 16-bit blocks of zeros: 1:2::3:4 means 1:2:0:0:0:0:3:4.
Next up, for cursed historical reasons, IPv6 permits you to write the final 32 bits of the address in dotted quad form. Effectively, you can splat an IPv4 address onto the end of IPv6 addresses! 1:2:3:4:5:6:77.77.88.88 means 1:2:3:4:5:6:4d4d:5858.
And of course, you can combine the two! fe80::1.2.3.4 means fe80:0:0:0:0:0:102:304.
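To make the arithmetic concrete, here is a minimal Python sketch of how a trailing dotted quad maps onto two colon-hex fields. The function name dotted_quad_to_hex_fields is made up for illustration, and the sketch is lenient about what int() accepts, so treat it as a demonstration of the byte packing rather than a validating parser.

```python
def dotted_quad_to_hex_fields(quad: str) -> str:
    """Turn a dotted quad such as '77.77.88.88' into two colon-hex fields."""
    octets = [int(part) for part in quad.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted quad: {quad!r}")
    high = (octets[0] << 8) | octets[1]  # first two bytes form one 16-bit field
    low = (octets[2] << 8) | octets[3]   # last two bytes form the other
    return f"{high:x}:{low:x}"


print(dotted_quad_to_hex_fields("77.77.88.88"))  # 4d4d:5858
print(dotted_quad_to_hex_fields("1.2.3.4"))      # 102:304
```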
The existence of :: introduces an annoying edge case in parsing: the :: can be at the start or end of the address, and the “empty” side of the address is not one of the 16-bit fields. ::1 means 0:0:0:0:0:0:0:1, 1:: means 1:0:0:0:0:0:0:0, and :: means 0:0:0:0:0:0:0:0. This is a natural consequence of the :: rule, but it makes the parser slightly more annoying to write.
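Here is one way the :: expansion could look in Python, including the empty-side edge case. This is a sketch under assumptions: expand_double_colon is a hypothetical helper, it only handles pure colon-hex input (it rejects anything containing a dot), and it does not check that each field is valid hex.

```python
def expand_double_colon(addr: str) -> str:
    """Expand a single '::' in a colon-hex address into explicit zero fields."""
    if "." in addr:
        raise ValueError("trailing dotted quad is out of scope for this sketch")
    if "::" not in addr:
        return addr
    left, right = addr.split("::", 1)
    if "::" in right:
        raise ValueError("'::' may appear only once")
    # An empty side contributes no fields at all. This is the edge case
    # behind '::1', '1::' and '::'.
    head = left.split(":") if left else []
    tail = right.split(":") if right else []
    if any(field == "" for field in head + tail):
        raise ValueError("empty field next to a single ':'")
    missing = 8 - len(head) - len(tail)
    if missing < 1:
        raise ValueError("'::' must stand for at least one field")
    return ":".join(head + ["0"] * missing + tail)


for text in ("1:2::3:4", "::1", "1::", "::"):
    print(text, "->", expand_double_colon(text))
```

The only special handling is the `if left else []` test: splitting an empty string on ":" would otherwise produce one bogus empty field.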
One final rule for IPv6: technically, each colon-hex field is 4 hex digits, but you can elide leading zeros, as I’ve been doing so far. Fully canonically, :: is 0000:0000:0000:0000:0000:0000:0000:0000. My apologies to trypophobic readers.
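For the curious, the zero padding is easy to reproduce. The sketch below uses a made-up helper, pad_fields, which assumes :: has already been expanded to 8 fields, and cross-checks the result against the exploded attribute of Python's standard ipaddress module.

```python
import ipaddress


def pad_fields(expanded: str) -> str:
    """Zero-pad each of the 8 colon-hex fields to 4 hex digits."""
    fields = expanded.split(":")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, with '::' already expanded")
    return ":".join(f"{int(field, 16):04x}" for field in fields)


print(pad_fields("0:0:0:0:0:0:0:0"))
print(pad_fields("1:2:0:0:0:0:3:4"))
# The standard library agrees on the fully padded form.
print(ipaddress.IPv6Address("::").exploded)
```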
That’s it for IPv6, mostly. Now, on to IPv4!
Fun fact, the textual representation of IPv4 was never standardized in any document before IPv6 needed a grammar for its weirdo “trailing dotted quad” notation. So, it’s a de-facto standard that boils down to mostly “what did 4.2BSD understand?”, and “what did other OSes keep when they copied 4.2BSD?”
And hoo boy, strap yourselves in, because 4.2BSD sure had some whacky opinions! Let’s use 192.168.140.255 as an example. That’s an IPv4 address that people would look at and go “yes, that sure is an IPv4 address.” How else can we write that exact same address?
This is the same IP address: 3232271615. You get that by interpreting the 4 bytes of the IP address as a big-endian unsigned 32-bit integer, and printing that. This leads to a classic parlor trick: if you try to visit http://3232271615, Chrome will load http://192.168.140.255.
Okay, but that’s sort-of sensible, right? An IPv4 address is 4 bytes, so printing it as a single number is a bit human-unfriendly, but broadly plausible, right?
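If you want to check the number yourself, the conversion is a couple of lines of Python. The helper names quad_to_uint32 and uint32_to_quad are invented for this sketch; int.from_bytes, int.to_bytes and ipaddress.IPv4Address are standard library.

```python
import ipaddress


def quad_to_uint32(quad: str) -> int:
    """Read the 4 bytes of a dotted quad as one big-endian unsigned integer."""
    octets = bytes(int(part) for part in quad.split("."))
    if len(octets) != 4:
        raise ValueError(f"not a dotted quad: {quad!r}")
    return int.from_bytes(octets, "big")


def uint32_to_quad(value: int) -> str:
    """Split a 32-bit integer back into 4 dot-separated bytes."""
    return ".".join(str(b) for b in value.to_bytes(4, "big"))


print(quad_to_uint32("192.168.140.255"))   # 3232271615
print(uint32_to_quad(3232271615))          # 192.168.140.255
print(ipaddress.IPv4Address(3232271615))   # 192.168.140.255
```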
How about 0300.0250.0214.0377? That’s still the same address. Dotted quad, except each field is written out in octal.
And if octal is supported, you might be wondering about hex. And you’d be right! 192.168.140.255 is also 0xc0.0xa8.0x8c.0xff, according to 4.2BSD.
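A rough Python sketch of that per-field base detection follows. It is my reading of the rule as described here (a 0x prefix means hex, a leading 0 means octal, anything else is decimal), not a port of the 4.2BSD code, and parse_bsd_field and parse_bsd_quad are hypothetical names.

```python
def parse_bsd_field(text: str) -> int:
    """Parse one field, picking the base from its prefix."""
    if text.lower().startswith("0x"):
        return int(text[2:], 16)   # 0x prefix: hexadecimal
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)        # leading zero: octal
    return int(text, 10)           # otherwise: decimal


def parse_bsd_quad(addr: str) -> tuple:
    """Parse 4 dot-separated fields, each in its own base."""
    fields = tuple(parse_bsd_field(f) for f in addr.split("."))
    if len(fields) != 4 or any(not 0 <= f <= 255 for f in fields):
        raise ValueError(f"not a 4-field address: {addr!r}")
    return fields


print(parse_bsd_quad("0300.0250.0214.0377"))  # (192, 168, 140, 255)
print(parse_bsd_quad("0xc0.0xa8.0x8c.0xff"))  # (192, 168, 140, 255)
```

Because the base is chosen per field, this sketch also accepts mixtures such as 192.0250.0x8c.255.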
Now, remember the time before we had CIDR (Classless Inter-Domain Routing)? IPv4 addresses were Class A, Class B or Class C. It was a weird time.
And that weird time made it into IP addresses! The familiar 192.168.140.255 notation is technically the “Class C” notation. You can also write that address in “Class B” notation as 192.168.36095, or in “Class A” notation as 192.11046143. What we’re doing is coalescing the final bytes of the address into either a 16-bit or a 24-bit integer field.
This, by the way, is why utilities like ping will accept weird-looking addresses like 127.1 for 127.0.0.1. Unlike IPv6, it’s not doing some kind of “missing fields are zero” expansion. 127.1 is the Class A notation for “host 1 of network 127”, where the 1 is a 24-bit number.
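The coalescing rule can be sketched in Python as "every part but the last is one byte, and the last part fills whatever bytes remain". The function name parse_classful is made up, and for brevity this sketch reads every part as decimal only. With a single part, the same rule degenerates into the bare 32-bit integer form from earlier.

```python
def parse_classful(addr: str) -> str:
    """Parse 1 to 4 decimal parts; the last part fills the remaining bytes."""
    parts = [int(p) for p in addr.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"too many parts: {addr!r}")
    *leading, last = parts
    if any(not 0 <= p <= 255 for p in leading):
        raise ValueError("leading parts must each fit in one byte")
    remaining_bytes = 4 - len(leading)
    if not 0 <= last < 256 ** remaining_bytes:
        raise ValueError("last part does not fit in the remaining bytes")
    packed = bytes(leading) + last.to_bytes(remaining_bytes, "big")
    return ".".join(str(b) for b in packed)


print(parse_classful("192.168.36095"))  # 192.168.140.255
print(parse_classful("192.11046143"))   # 192.168.140.255
print(parse_classful("127.1"))          # 127.0.0.1
```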
And finally, we come to one last bit of unspecified behavior: do IPv4 addresses permit an unlimited number of leading zeros in each quad? Or is there a maximum of 3 digits? 001.002.003.004 is universally recognized as valid. What about 0000000001.0000000002.0000000003.000000004?
You might also be wondering if either of these numbers should be read in as octal, since we said earlier that a leading zero might be interpreted as octal. It depends! There are implementations that do both, but most modern implementations have abandoned the octal and hex notation, and treat leading 0s as decimal.
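To see why this matters, here is a small Python sketch that reads the same text under both policies. The function read_quad and its leading_zero_is_octal flag are invented for illustration and do not correspond to any particular implementation.

```python
def read_quad(addr: str, leading_zero_is_octal: bool) -> str:
    """Read a dotted quad, treating leading zeros as octal or as decimal."""
    values = []
    for field in addr.split("."):
        if leading_zero_is_octal and len(field) > 1 and field.startswith("0"):
            values.append(int(field, 8))
        else:
            values.append(int(field, 10))
    return ".".join(str(v) for v in values)


print(read_quad("010.010.010.010", leading_zero_is_octal=False))  # 10.10.10.10
print(read_quad("010.010.010.010", leading_zero_is_octal=True))   # 8.8.8.8
```

Same input, two different hosts. An address like 001.002.003.004 happens to read the same either way, which is probably why it rarely causes trouble.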
The leading zero debate also infects IPv6, to some extent. Is 000001::00001.00002.00003.00004 a valid IPv6 address (“common” form 1::1.2.3.4, or 1::102:304)? Most modern parsers seem to allow an unlimited number of leading zeros in their representations, probably because they’re leaning on some “parse integer” library that implements that behavior.
And so, we reach the bitter end. If you want to truly parse IP addresses, this is the bullshit you have to put up with.
Currently, my slow reference parser jettisons a lot of old baggage, and sticks to what I think is a sensible subset of these possibilities. It understands:
- Classic v4 dotted decimal, with any number of leading zeros.
- It does not process Class A/B notation, or hex or octal notation.
- It does not process the “uint32 to the knee” representation.
- For IPv6, it understands canonical colon-hex form, as well as :: and trailing-IPv4 style (where the trailing IPv4 follows the same rules as the dotted decimal form above). Each field is allowed any number of leading zeros.
I’m on the fence about that last one, the “IPv6 with an embedded dotted decimal” form. My reference parser (Go’s net.ParseIP) understands it, but it’s not really that useful any more in the real world. At the dawn of IPv6, the idea was that you could upgrade an address to IPv6 by prepending a pair of colons, as in ::1.2.3.4, but modern transition mechanisms no longer offer anything as clear-cut as this, so the notation doesn’t really show up in the wild.
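For illustration, the IPv4 half of the subset listed above might look something like this in Python. This is not my actual parser, just a sketch of the rules: exactly 4 fields, ASCII decimal digits only, any number of leading zeros, each value at most 255. The names parse_v4_subset and accepts are hypothetical.

```python
def parse_v4_subset(addr: str) -> bytes:
    """Parse dotted decimal only; leading zeros allowed and read as decimal."""
    fields = addr.split(".")
    if len(fields) != 4:
        # Rejects Class A/B notation and the bare 32-bit integer form.
        raise ValueError("need exactly 4 fields")
    out = []
    for field in fields:
        # ASCII decimal digits only: no 0x prefix, no sign, no whitespace.
        if not field or any(c not in "0123456789" for c in field):
            raise ValueError(f"bad field: {field!r}")
        value = int(field, 10)  # leading zeros never switch to octal
        if value > 255:
            raise ValueError(f"field out of range: {field!r}")
        out.append(value)
    return bytes(out)


def accepts(addr: str) -> bool:
    """Report whether parse_v4_subset accepts the given text."""
    try:
        parse_v4_subset(addr)
    except ValueError:
        return False
    return True


for text in ("192.168.140.255", "001.002.003.004", "127.1",
             "3232271615", "0xc0.0xa8.0x8c.0xff", "192.168.36095"):
    print(text, accepts(text))
```

One consequence of reading leading zeros as decimal: 0300.0250.0214.0377 is rejected here (300 is out of range), while 010.010.010.010 is accepted as 10.10.10.10.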