Declaring Your Content’s Natural Language
The web is a globe-spanning, borderless nation that speaks many languages (all of them, in fact). A web page created by an American, hosted in America, targeted at an American audience, can still just as easily be seen by people in Malaysia, Argentina, Finland, Turkey, and even Canada. Even if your particular slice of the web will mostly be seen by speakers of one language, you should still give some consideration to speakers of other languages by declaring the base language of your document.
Declaring the natural language of your content will assist user-agents in parsing and rendering it. Search engines can automatically filter their results based on language, returning a listing of pages written in the language specified by the searcher. Screen readers can alter their pronunciation so German sounds like German and Tagalog sounds like Tagalog (in theory, anyway).
You should declare the primary language of your entire document by including the lang
and xml:lang attributes in the document’s root html element.
You can then differentiate individual phrases or passages written in another language by adding
the attributes to their appropriate parent element; lang and xml:lang
can be validly attached to almost any element.
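As a minimal sketch of marking up a single phrase, assume an XHTML document whose root element is declared as English. The sentence and the French phrase here are invented for illustration:

```html
<!-- The surrounding paragraph inherits the document's base language;
     only the span is declared as French. -->
<p>The dessert had a certain
  <span xml:lang="fr" lang="fr">je ne sais quoi</span>.</p>
```

Any element without its own language attributes inherits the language of its nearest ancestor that has one, so only the exceptions need marking.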
The lang attribute comes from HTML, while xml:lang is the XML
equivalent to be used in XHTML documents. However, because you can’t reliably serve XHTML
as XML (not all browsers correctly support the application/xhtml+xml MIME
type), most XHTML documents are served, and therefore treated, as HTML (with a text/html MIME type).
This means the xml:lang attribute alone won’t work in documents served as HTML.
And yet, you should still strive for XML compliance in your XHTML markup; an XHTML document
should be well-formed XML even if it’s not being served as such. To ensure full compatibility,
both the lang and xml:lang attributes should be included, with identical
values, like so:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
The two-letter abbreviated language code “en” indicates that this
document is written in English. To be even more specific, I’m writing in American
English (as opposed to King’s, Canadian, or Australian English), and I can declare that
specific dialect by extending the language code with a hyphenated regional subcode thusly:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us" lang="en-us">
Many common dialects of major languages have standardized subcodes, such as en-us
for American English, en-gb for British English, en-ca for Canadian English,
fr-ca for Canadian French, and fr-be for Belgian French, to name just a few. For
such national dialects, the language code typically takes the form language-country
(note, however, that British English is en-gb, not en-uk; the latter subcode,
though perfectly logical, is not correct). Dialects from narrower regions can be declared as
language-country-dialect. However, you should only make such dialectal distinctions when
necessary; declaring the base language is usually sufficient.
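The same regional subcodes can be applied to a passage inside a document rather than to the whole page. The following is only a sketch: the quotation is a placeholder chosen for illustration, and the enclosing document is assumed to be declared as English on its root element:

```html
<!-- A Canadian French passage inside an otherwise English page.
     Both attributes carry the same value, as on the root element. -->
<blockquote xml:lang="fr-ca" lang="fr-ca">
  <p>Je me souviens.</p>
</blockquote>
```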
Language Codes
These language codes are an official standard, ISO 639. Like most web standards, ISO 639 has changed and evolved over time, and continues to do so. The original version (639-1) included codes for 138 languages, covering most of the common languages in the world. That still doesn’t come close to encompassing the full breadth of humanity, with languages numbering in the thousands. The standard was later expanded with ISO 639-2, published in 1998, which introduced three-letter codes to allow a far greater number of combinations.
Below are the 138 standardized, two-letter abbreviated language codes of ISO 639-1. The latest, exhaustive listing can be found online at the IANA Language Subtag Registry, in a practically unreadable format. Thankfully, W3C Internationalization Activity Lead Richard Ishida has cooked up a handy search utility.
| Language Name | Code |
|---|---|
| A | |
| Abkhazian | ab |
| Afan (Oromo) | om |
| Afar | aa |
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Aymara | ay |
| Azerbaijani | az |
| B | |
| Bashkir | ba |
| Basque | eu |
| Bengali/Bangla | bn |
| Bhutani | dz |
| Bihari | bh |
| Bislama | bi |
| Breton | br |
| Bulgarian | bg |
| Burmese | my |
| Byelorussian | be |
| C | |
| Cambodian | km |
| Catalan | ca |
| Chinese | zh |
| Corsican | co |
| Croatian | hr |
| Czech | cs |
| D | |
| Danish | da |
| Dutch | nl |
| E | |
| English | en |
| Esperanto | eo |
| Estonian | et |
| F | |
| Faroese | fo |
| Fiji | fj |
| Finnish | fi |
| French | fr |
| Frisian | fy |
| G | |
| Galician | gl |
| Georgian | ka |
| German | de |
| Greek | el |
| Greenlandic | kl |
| Guarani | gn |
| Gujarati | gu |
| H | |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| I | |
| Icelandic | is |
| Indonesian | id |
| Interlingua | ia |
| Interlingue | ie |
| Inuktitut | iu |
| Inupiak | ik |
| Irish | ga |
| Italian | it |
| J | |
| Japanese | ja |
| Javanese | jv |
| K | |
| Kannada | kn |
| Kashmiri | ks |
| Kazakh | kk |
| Kinyarwanda | rw |
| Kirghiz | ky |
| Kirundi | rn |
| Korean | ko |
| Kurdish | ku |
| L | |
| Laothian | lo |
| Latin | la |
| Latvian/Lettish | lv |
| Lingala | ln |
| Lithuanian | lt |
| M | |
| Macedonian | mk |
| Malagasy | mg |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Maori | mi |
| Marathi | mr |
| Moldavian | mo |
| Mongolian | mn |
| N | |
| Nauru | na |
| Nepali | ne |
| Norwegian | no |
| O | |
| Occitan | oc |
| Oriya | or |
| P | |
| Pashto/Pushto | ps |
| Persian (Farsi) | fa |
| Polish | pl |
| Portuguese | pt |
| Punjabi | pa |
| Q | |
| Quechua | qu |
| R | |
| Rhaeto-Romance | rm |
| Romanian | ro |
| Russian | ru |
| S | |
| Samoan | sm |
| Sangho | sg |
| Sanskrit | sa |
| Scots Gaelic | gd |
| Serbian | sr |
| Serbo-Croatian | sh |
| Setswana | tn |
| Shona | sn |
| Sindhi | sd |
| Siswati | ss |
| Slovak | sk |
| Slovenian | sl |
| Somali | so |
| Spanish | es |
| Sundanese | su |
| Swahili | sw |
| Swedish | sv |
| Singhalese | si |
| T | |
| Tagalog | tl |
| Tajik | tg |
| Tatar | tt |
| Telugu | te |
| Thai | th |
| Tibetan | bo |
| Tigrinya | ti |
| Tonga | to |
| Tsonga | ts |
| Turkish | tr |
| Turkmen | tk |
| Twi | tw |
| U | |
| Uigur | ug |
| Ukrainian | uk |
| Urdu | ur |
| Uzbek | uz |
| V | |
| Vietnamese | vi |
| Volapuk | vo |
| W | |
| Welsh | cy |
| Wolof | wo |
| X | |
| Xhosa | xh |
| Y | |
| Yiddish | yi |
| Yoruba | yo |
| Z | |
| Zhuang | za |
| Zulu | zu |
Further Reading
- The ‘lang’ Attribute by Charl van Niekerk
- Language Codes, an appendix of the book Building Accessible Websites by Joe Clark.
- ISO 639-2 Registration Authority
- Language tags in HTML and XML, from the W3C Internationalization (I18N) Activity (includes guidelines for extending codes for regions and dialects).

