International Domain Names

International Domain Names (IDN) contain characters beyond those allowed in traditional LDH labels, which permit only letters, digits and hyphens (hence the name LDH). Both top-level and lower-level labels may be international labels.

To translate a domain name, each non-LDH label should individually be lowercased, normalized to NFKC, encoded in Punycode and prefixed with the string "xn--". The result is the ASCII-compatible encoding (ACE) form, also called the A-label.

For example, the IDN bücher.ch has two labels, bücher and ch. The punycode for bücher is bcher-kva, so the A-label for bücher is xn--bcher-kva. ch is already a valid LDH-label, and thus does not need to be translated.
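The per-label translation described above can be sketched in Python 3. This is a minimal sketch rather than a full IDNA implementation: the names `LDH_LABEL` and `to_ace` are made up for illustration, and no validity checks (allowed code points, script mixing, length limits) are performed.

```python
import re
import unicodedata

# Labels made only of letters, digits and hyphens are passed through untouched.
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")

def to_ace(domain):
    """Convert every non-LDH label of a domain name to its xn-- form (sketch)."""
    ace_labels = []
    for label in domain.split("."):
        if not label or LDH_LABEL.fullmatch(label):
            ace_labels.append(label)
            continue
        # Lowercase, normalize to NFKC, Punycode-encode, then add the prefix.
        normalized = unicodedata.normalize("NFKC", label.lower())
        ace_labels.append("xn--" + normalized.encode("punycode").decode("ascii"))
    return ".".join(ace_labels)

print(to_ace("b\u00fccher.ch"))  # xn--bcher-kva.ch
```

Only the bücher label is rewritten; ch matches the LDH pattern and is copied as-is.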

Many registries carry an IDN top-level domain. For example, the Russian registry can be found at кц.рф (xn--j1ay.xn--p1ai).

Valid Labels

Note that not all Unicode strings are valid IDN labels. They are subject to the regulations of the Unicode Consortium, IETF, ICANN and individual registries.

  • The label must be normalized to NFKC (Compatibility Decomposition, followed by Canonical Composition)
  • Only code points that are "PVALID" in RFC 5892 are allowed
  • All code points in a single label must be taken from the same script
  • Individual registries further limit the allowed code points
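Of the rules above, the normalization requirement is the easiest to check with the Python 3 standard library. The sketch below only covers that first bullet; the helper name `is_nfkc` is an illustrative assumption, and the PVALID and per-registry checks need code point tables that the standard library does not provide.

```python
import unicodedata

def is_nfkc(label):
    """Return True if NFKC normalization leaves the label unchanged."""
    return unicodedata.normalize("NFKC", label) == label

print(is_nfkc("b\u00fccher"))    # precomposed ü (U+00FC): True
print(is_nfkc("bu\u0308cher"))   # u + combining diaeresis (U+0308): False
print(is_nfkc("\ufb01sh"))       # ligature ﬁ (U+FB01) folds to "fi": False
```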

Previous Standards

The current standard for International Domain Names is IDNA2008, which was published by the IETF in 2010. Two other specifications exist: IDNA2003, published by the IETF (RFC 3490), and UTS #46, published by the Unicode Consortium.

To understand the differences, read Unicode Technical Standard #46: Unicode IDNA Compatibility Processing.

Non-Compliant TLDs

One of the consequences of changes in allowed codepoints is that registries have allowed registrations of domain names that are no longer valid. A domain like ¬.com is still registered, while code point U+00AC (¬) is disallowed according to RFC 5892, and .com has a minimum domain name length of 3 characters. Clearly, the checks that are now in place have not always been in place.
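Why a symbol such as U+00AC is rejected can be illustrated with the general-category test that RFC 5892 uses as one of its building blocks (its "LetterDigits" category: Ll, Lu, Lo, Nd, Lm, Mn, Mc). The Python 3 sketch below is only a rough approximation of PVALID: the real derivation applies further rules and exceptions, the function name is made up for illustration, and the categories reported depend on the Unicode version bundled with Python.

```python
import unicodedata

# General categories that RFC 5892 groups under "LetterDigits".
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def roughly_letter_or_digit(char):
    """Approximate test only; not the full PVALID derivation."""
    return unicodedata.category(char) in LETTER_DIGIT_CATEGORIES

# NOT SIGN, u with diaeresis, CIRCLED WHITE STAR, PILE OF POO
for char in ("\u00ac", "\u00fc", "\u272a", "\U0001F4A9"):
    print("U+%04X" % ord(char), unicodedata.category(char),
          roughly_letter_or_digit(char))
```

The symbols and emoji fall into symbol categories and fail the test, while ü is a lowercase letter and passes.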

Other notable examples are the registration of ✪df.ws by Daring Fireball, and the registrations of 💩.la (yes, that's poopla), 🚚.la, and 🍃.la by Panic.

Many examples above involve .la (Laos), .ws (Western Samoa) and .tk (Tokelau). While some of these no longer accept the emoji examples above, others may still do so. You may want to look for TLDs that allow IDNs but have not published a list of allowed characters.

It should come as no surprise that non-compliant domains are not generally supported by software, especially emoji, which reside above Unicode code point U+10000 and cannot be represented in UCS-2. In fact, the emoji above probably don't show up if you use Google Chrome or Firefox. You will also find that many registrars do not properly support IDNs. One registrar let us pay for the registration of appleπ.com, but subsequently had to refund the money because Verisign rejected the domain. You have been warned.
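The UCS-2 limitation can be made concrete with a few lines of Python 3. The sketch below (the helper name `utf16_code_units` is illustrative) counts the 16-bit units that UTF-16 needs for a character; anything above U+FFFF needs a surrogate pair, which UCS-2 does not have.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2

poo = "\U0001F4A9"  # PILE OF POO, used in one of the .la registrations

print(hex(ord(poo)))                   # 0x1f4a9, above 0xffff
print(utf16_code_units(poo))           # 2
print(poo.encode("utf-16-be").hex())   # d83ddca9, a surrogate pair
print(utf16_code_units("\u00fc"))      # 1, fits in UCS-2
```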

Security Concerns

One of the main security concerns for international domain names is visual spoofing of domain names.

UTR#36 (Unicode Security Considerations) gives the example of "IBМ" versus "IBM". The М in the first example is a Cyrillic character (U+041C), while the M in the second example is a Latin character.

For this reason, international domain names that mix different scripts in the same label are not allowed, even if there is no chance of confusion (e.g. appleπ.com is not allowed).

This does not mean that spoofing is impossible. An attacker may still register xn--bnk-sgz.com. It is unlikely that many users will spot the difference between bạnk.com and bank.com.
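A crude version of the mixed-script check can be sketched in Python 3. The standard library does not expose the Unicode Script property, so this sketch assumes that the first word of a character's Unicode name (LATIN, CYRILLIC, GREEK, ...) is a usable stand-in; that is a heuristic, not what registries actually implement, and the function name is made up for illustration.

```python
import unicodedata

def guess_scripts(label):
    """Guess the scripts in a label from the first word of each character name."""
    scripts = set()
    for char in label:
        if char in "0123456789-":
            continue  # digits and hyphen are shared by all scripts
        scripts.add(unicodedata.name(char, "UNKNOWN").split()[0])
    return scripts

print(sorted(guess_scripts("IB\u041c")))     # ['CYRILLIC', 'LATIN']
print(sorted(guess_scripts("apple\u03c0")))  # ['GREEK', 'LATIN']
print(sorted(guess_scripts("b\u1ea1nk")))    # ['LATIN']
```

The first two labels mix scripts and would be rejected. The last one is pure Latin, which is why a single-script rule alone does not stop look-alike names.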

.com limitations

Verisign, the registry of .com, has more elaborate rules than other registries. Whereas most registries specify a list of allowed code points per script, Verisign requires users to first specify a language. Depending on the language, the user can then choose between one or more scripts, each having its own set of allowed code points.

Thus instead of simply specifying "Cyrillic" as the script, one should specify "Russian" as the language, which has a specified set of Cyrillic characters and some Latin (for numbers). A domain that specifies "German" as its language does not get a fixed set of allowed characters; instead a script must subsequently be chosen, which can be Cuneiform just as well as Latin. In addition, labels with these languages may also include characters from the special "Common" or "Inherited" scripts defined by Unicode. Thus it is possible to mix mathematical symbols with Latin script.

The precise guidelines are published at IDN Registration Rules.

My personal opinion is that Verisign is overcomplicating things, whereas in the past they did not have enough checks (e.g. by registering domains like ¶.com and ¬.com). Verisign seems to refer to an older version of the ICANN guidelines (2.2). If they move to version 3.0, they will hopefully simplify the rules and simply publish a per-script set of allowed code points.

Conversion Between Unicode and Punycode

The usual tool for converting between text encodings is iconv:

 echo '¡hëllø wórłd!' | iconv -f utf-8 -t utf-16

Unfortunately, iconv does not support Punycode, probably for the same reason that base64 encoding is not supported (it's more of a byte encoding than a character encoding).

Here is a small snippet in Python 2 to convert from Unicode text to Punycode:

>>> u'¡hëllø wórłd!'.encode('punycode')
'hll wrd!-ria55e2crcv7d'

To create a valid label, it should first be normalized to NFKC and then be prefixed with xn--. (A real label should also be lowercased first; the example keeps the capital B.)

>>> import unicodedata
>>> u"xn--" + unicodedata.normalize("NFKC",u"Bücher").encode('punycode')
u'xn--Bcher-kva'

To decode Punycode, make sure the input is a byte string (ASCII):

>>> print b'Bcher-kva'.decode('punycode')
Bücher
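The snippets above are Python 2. In Python 3 the `punycode` codec still exists, but text and bytes are separate types, so the reverse direction for a whole domain name could look like the sketch below. The function name `to_unicode` is an illustrative assumption, and no validation of the decoded labels is performed.

```python
def to_unicode(domain):
    """Decode every xn-- label of a domain name back to Unicode (sketch)."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the prefix; the punycode codec decodes from ASCII bytes.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)

print(to_unicode("xn--bcher-kva.ch"))   # bücher.ch
print(to_unicode("xn--j1ay.xn--p1ai"))  # кц.рф
```

Python also ships an `idna` codec, but it implements the older IDNA2003 rules (RFC 3490) rather than IDNA2008.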