Fun with IP address parsing

In my quest to write a fast IPv4+6 parser, I wrote a slow-but-I-think-correct parser, to use as a base of comparison. In doing so, I discovered more cursed IP address representations that I was previously unaware of. Let’s explore together!

We start out simple, with IPv4 and IPv6 in what I’ll call their “canonical form”: 192.168.0.1 and 1:2:3:4:5:6:7:8. Various specs call these “dotted quad” (more specifically, “dotted decimal”), dot-separated fields each representing 1 byte; and “colon-hex”, colon-separated fields each representing 2 bytes.

The first bits of complexity come from IPv6. In canonical form, common addresses would end up with long runs of zeros in the middle. So, :: allows you to elide 1 or more 16-bit blocks of zeros: 1:2::3:4 means 1:2:0:0:0:0:3:4. It can appear at most once per address, otherwise the expansion would be ambiguous.

Next up, for cursed historical reasons, IPv6 permits you to write the final 32 bits of the address in dotted quad form. Effectively, you can splat an IPv4 address onto the end of IPv6 addresses! 1:2:3:4:5:6:77.77.88.88 means 1:2:3:4:5:6:4d4d:5858.

And of course, you can combine the two! fe80::1.2.3.4 means fe80:0:0:0:0:0:102:304.
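To make the “splat an IPv4 address onto the end” arithmetic concrete, here is a minimal Go sketch. The helper name dottedQuadToHexFields is made up for illustration, and it only accepts plain decimal fields; it is not how any particular parser does it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dottedQuadToHexFields (hypothetical helper) converts a dotted-decimal
// IPv4 address into the two 16-bit colon-hex fields it occupies at the
// end of an IPv6 address. Only plain decimal fields are accepted.
func dottedQuadToHexFields(quad string) (string, error) {
	parts := strings.Split(quad, ".")
	if len(parts) != 4 {
		return "", fmt.Errorf("want 4 fields, got %d", len(parts))
	}
	var b [4]byte
	for i, p := range parts {
		// bitSize 8 makes ParseUint reject anything above 255.
		n, err := strconv.ParseUint(p, 10, 8)
		if err != nil {
			return "", fmt.Errorf("bad field %q: %w", p, err)
		}
		b[i] = byte(n)
	}
	// Bytes 0-1 form the first 16-bit field, bytes 2-3 the second.
	hi := uint16(b[0])<<8 | uint16(b[1])
	lo := uint16(b[2])<<8 | uint16(b[3])
	// %x drops leading zeros, matching the short form used in this post.
	return fmt.Sprintf("%x:%x", hi, lo), nil
}
```

Calling dottedQuadToHexFields("77.77.88.88") gives "4d4d:5858", and "1.2.3.4" gives "102:304", matching the two examples above.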

The existence of :: introduces an annoying edge case in parsing: the :: can be at the start or end of the address, and the “empty” side of the address is not one of the 16-bit fields. ::1 means 0:0:0:0:0:0:0:1, 1:: means 1:0:0:0:0:0:0:0, and :: means 0:0:0:0:0:0:0:0. This is a natural consequence of the :: rule, but it makes the parser slightly more annoying to write.
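Here is a rough Go sketch of the :: expansion, including the edge cases at either end. The names expandDoubleColon and splitFields are invented for this example, and the sketch assumes pure colon-hex input: it does not check that fields are valid hex, and it does not handle a trailing dotted quad.

```go
package main

import (
	"errors"
	"strings"
)

// splitFields splits on ":" but maps the empty string to zero fields.
// That special case is what makes "::1", "1::" and "::" work.
func splitFields(s string) []string {
	if s == "" {
		return nil
	}
	return strings.Split(s, ":")
}

// expandDoubleColon (hypothetical helper) rewrites an address containing
// at most one "::" into eight explicit colon-separated fields.
func expandDoubleColon(addr string) (string, error) {
	var fields []string
	left, right, found := strings.Cut(addr, "::")
	if found {
		if strings.Contains(right, "::") {
			return "", errors.New(`more than one "::"`)
		}
		l, r := splitFields(left), splitFields(right)
		missing := 8 - len(l) - len(r)
		if missing < 1 {
			return "", errors.New(`"::" must stand for at least one field`)
		}
		fields = append(fields, l...)
		for i := 0; i < missing; i++ {
			fields = append(fields, "0")
		}
		fields = append(fields, r...)
	} else {
		fields = splitFields(addr)
	}
	if len(fields) != 8 {
		return "", errors.New("want 8 fields")
	}
	for _, f := range fields {
		if f == "" {
			return "", errors.New("empty field")
		}
	}
	return strings.Join(fields, ":"), nil
}
```

With this, "1:2::3:4" expands to "1:2:0:0:0:0:3:4", "::1" to "0:0:0:0:0:0:0:1", and "::" to eight zero fields, while ":::" and "1::2::3" are rejected.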

One final rule for IPv6: technically, each colon-hex field is 4 hex digits, but you can elide leading zeros, as I’ve been doing so far. Fully canonically, :: is 0000:0000:0000:0000:0000:0000:0000:0000. My apologies to trypophobic readers.

That’s it for IPv6, mostly. Now, on to IPv4!

Fun fact, the textual representation of IPv4 was never standardized in any document before IPv6 needed a grammar for its weirdo “trailing dotted quad” notation. So, it’s a de-facto standard that boils down to mostly “what did 4.2BSD understand?”, and “what did other OSes keep when they copied 4.2BSD?”

And hoo boy, strap yourselves in, because 4.2BSD sure had some wacky opinions! Let’s use 192.168.140.255 as an example. That’s an IPv4 address that people would look at and go “yes, that sure is an IPv4 address.” How else can we write that exact same address?

This is the same IP address: 3232271615. You get that by interpreting the 4 bytes of the IP address as a big-endian unsigned 32-bit integer, and printing it in decimal. This leads to a classic parlor trick: if you try to visit http://3232271615 , Chrome will load http://192.168.140.255.
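If you want to check the arithmetic, a small Go sketch of the round trip looks like this. The function names toUint32 and fromUint32 are just illustrative; encoding/binary does the actual work.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// toUint32 reads the 4 address bytes as a big-endian unsigned 32-bit integer.
func toUint32(ip [4]byte) uint32 {
	return binary.BigEndian.Uint32(ip[:])
}

// fromUint32 is the inverse: it splits the integer back into 4 bytes
// and prints them as a dotted quad.
func fromUint32(n uint32) string {
	var ip [4]byte
	binary.BigEndian.PutUint32(ip[:], n)
	return fmt.Sprintf("%d.%d.%d.%d", ip[0], ip[1], ip[2], ip[3])
}
```

toUint32([4]byte{192, 168, 140, 255}) returns 3232271615, and fromUint32(3232271615) returns "192.168.140.255".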

Okay, but that’s sort-of sensible, right? An IPv4 address is 4 bytes, so printing it as a single number is a bit human-unfriendly, but broadly plausible, right?

How about 0300.0250.0214.0377? That’s still the same address. Dotted quad, except each field is written out in octal.

And if octal is supported, you might be wondering about hex. And you’d be right! 192.168.140.255 is also 0xc0.0xa8.0x8c.0xff, according to 4.2BSD.
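As a sketch of the radix rule described here (a “0x” prefix means hex, any other leading zero means octal, everything else is decimal), something like the following Go would do. The names parseBSDField and parseBSDQuad are invented, and accepting an uppercase “0X” is an assumption on my part; this illustrates the rule, it is not a port of the BSD code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBSDField (hypothetical helper) parses one dotted-quad field,
// choosing the radix from its prefix.
func parseBSDField(s string) (uint8, error) {
	base, digits := 10, s
	switch {
	case strings.HasPrefix(s, "0x") || strings.HasPrefix(s, "0X"):
		base, digits = 16, s[2:] // assumption: both cases of "x" accepted
	case len(s) > 1 && s[0] == '0':
		base, digits = 8, s[1:]
	}
	// An explicit base means ParseUint accepts no prefixes of its own.
	n, err := strconv.ParseUint(digits, base, 8)
	if err != nil {
		return 0, err
	}
	return uint8(n), nil
}

// parseBSDQuad applies parseBSDField to each of exactly four fields.
func parseBSDQuad(s string) ([4]byte, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 4 {
		return [4]byte{}, fmt.Errorf("want 4 fields, got %d", len(parts))
	}
	var ip [4]byte
	for i, p := range parts {
		b, err := parseBSDField(p)
		if err != nil {
			return [4]byte{}, fmt.Errorf("bad field %q: %w", p, err)
		}
		ip[i] = b
	}
	return ip, nil
}
```

Both "0300.0250.0214.0377" and "0xc0.0xa8.0x8c.0xff" come back as the bytes 192, 168, 140, 255. A field like "08" is an error under this rule, because 8 is not an octal digit.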

Now, remember the days before we had CIDR (Classless Inter-Domain Routing)? IPv4 addresses were Class A, Class B or Class C. It was a weird time.

And that weird time made it into IP addresses! The familiar 192.168.140.255 notation is technically the “Class C” notation. You can also write that address in “Class B” notation as 192.168.36095, or in “Class A” notation as 192.11046143. What we’re doing is coalescing the final bytes of the address into either a 16-bit or a 24-bit integer field.

This, by the way, is why utilities like ping will accept weird looking addresses like 127.1 for 127.0.0.1. Unlike IPv6, it’s not doing some kind of “missing fields are zero” expansion. 127.1 is the Class A notation for “host 1 of network 127”, where the 1 is a 24-bit number.
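The coalescing rule is easier to see in code. Here is a minimal Go sketch, with the made-up name parseClassful, that accepts 1 to 4 decimal fields: every field but the last is one byte, and the last field fills however many bytes remain. It deliberately ignores the octal and hex forms to stay short.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// parseClassful (hypothetical helper) parses the 1-, 2-, 3- and 4-field
// decimal notations. The last field is 32, 24, 16 or 8 bits wide.
func parseClassful(s string) ([4]byte, error) {
	parts := strings.Split(s, ".")
	if len(parts) > 4 {
		return [4]byte{}, errors.New("too many fields")
	}
	var ip [4]byte
	last := len(parts) - 1
	// Leading fields are single bytes.
	for i, p := range parts[:last] {
		n, err := strconv.ParseUint(p, 10, 8)
		if err != nil {
			return [4]byte{}, err
		}
		ip[i] = byte(n)
	}
	// The final field covers every byte not yet filled.
	bits := uint(8 * (4 - last))
	n, err := strconv.ParseUint(parts[last], 10, 32)
	if err != nil {
		return [4]byte{}, err
	}
	if n>>bits != 0 {
		return [4]byte{}, errors.New("last field too large")
	}
	// Write it out big-endian, lowest byte into the last position.
	for i := 3; i >= last; i-- {
		ip[i] = byte(n)
		n >>= 8
	}
	return ip, nil
}
```

Under this sketch, "192.168.36095", "192.11046143" and "3232271615" all parse to 192.168.140.255, and "127.1" parses to 127.0.0.1 because the 1 lands in the low byte of a 24-bit field.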

And finally, we come to one last bit of unspecified behavior: do IPv4 addresses permit an unlimited number of leading zeros in each quad? Or is there a maximum of 3 digits? 001.002.003.004 is universally recognized as valid. What about 0000000001.0000000002.0000000003.000000004?

You might also be wondering if either of these numbers should be read in as octal, since we said earlier that a leading zero might be interpreted as octal. It depends! There are implementations that do both, but most modern implementations have abandoned the octal and hex notation, and treat leading 0s as decimal.

The leading zero debate also infects IPv6, to some extent. Is 000001::00001.00002.00003.00004 a valid IPv6 address (“common” form 1::1.2.3.4, or 1::102:304)? Most modern parsers seem to allow an unlimited number of leading zeros in their representations, probably because they’re leaning on some “parse integer” library that implements that behavior.
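To illustrate how that leniency sneaks in, here is a small Go sketch. Go’s strconv.ParseUint happily accepts any number of leading zeros in base 10, so a parser built on it is lenient unless it adds its own length check. Both function names are invented, and the 3-digit cap in the strict variant is just one possible policy, not a rule from any spec.

```go
package main

import (
	"errors"
	"strconv"
)

// parseFieldLenient leans entirely on strconv, so "0000000001" parses as 1.
func parseFieldLenient(s string) (uint8, error) {
	n, err := strconv.ParseUint(s, 10, 8)
	if err != nil {
		return 0, err
	}
	return uint8(n), nil
}

// parseFieldStrict adds an explicit cap of 3 digits per field
// (an assumed policy, chosen for illustration) before delegating.
func parseFieldStrict(s string) (uint8, error) {
	if len(s) > 3 {
		return 0, errors.New("more than 3 digits")
	}
	return parseFieldLenient(s)
}
```

parseFieldLenient("0000000001") returns 1, while parseFieldStrict rejects the same input and still accepts "001".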

And so, we reach the bitter end. If you want to truly parse IP addresses, this is the bullshit you have to put up with.

Currently, my slow reference parser jettisons a lot of old baggage, and sticks to what I think is a sensible subset of these possibilities. It understands:

I’m on the fence about that last one, the “IPv6 with an embedded dotted decimal” form. My reference parser (Go’s net.ParseIP) understands it, but it’s not really that useful any more in the real world. At the dawn of IPv6, the idea was that you could upgrade an address to IPv6 by prepending a pair of colons, as in ::1.2.3.4, but modern transition mechanisms no longer offer anything as clear-cut as this, so the notation doesn’t really show up in the wild.