Variations on Japanese romanization

Japanese names, titles, and phrases often need to be converted into Latin letters, for example when writing for an English-speaking audience or when using software that only handles ASCII file names. As I explored this problem, I found many subtle variations of Japanese romanization in the wild, each with a valid reason for existing. In this article I try to give a near-complete overview of the reasonable variations on how to romanize Japanese text.

Basic kana
Long vowels
Kana n
Small tsu
Particles
Spaces
Hyphens
Capitalization
Foreign words
Small kana
Non-standard dakuten
Losslessness

Basic kana

There are two major styles for romanizing kana: Nihon-shiki and Hepburn. Nihon-shiki has a very uniform structure of consonant (plus optional y) plus vowel, whereas Hepburn conveys the pronunciation more accurately at the expense of irregular spelling. For example, the romanization of the T line in Nihon-shiki is ta ti tu te to, but in Hepburn it’s ta chi tsu te to. The following list shows exactly the kana whose romanizations differ between Nihon-shiki and Hepburn. Each entry gives, in order, the kana, the Nihon-shiki romanization, the Hepburn romanization, and any alternate romanizations:

し: si, shi
じ: zi, ji
ち: ti, chi
ぢ: di, ji, dji
つ: tu, tsu
づ: du, zu, dzu
ふ: hu, fu
しゃ: sya, sha
しゅ: syu, shu
しょ: syo, sho
じゃ: zya, ja, jya
じゅ: zyu, ju, jyu
じょ: zyo, jo, jyo
ちゃ: tya, cha
ちゅ: tyu, chu
ちょ: tyo, cho
ぢゃ: dya, ja, dja
ぢゅ: dyu, ju, dju
ぢょ: dyo, jo, djo

Note: The forms {dji, dzu, dja, dju, djo} are modified from Hepburn and exist for disambiguation. The forms {jya, jyu, jyo} sit in between Hepburn and the systematic (Nihon-shiki) style.
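As a rough illustration, the list above can be captured as a lookup table. This is only a minimal Python sketch: the names NIHON_TO_HEPBURN, to_hepburn, and hepburn_collisions are made up for this example, and the input is assumed to be already split into one Nihon-shiki syllable per kana (or per kana pair for the y-combinations).

```python
# Nihon-shiki -> Hepburn for the kana that differ, copied from the list above.
# Alternate forms (dji, dzu, jya, ...) are deliberately left out.
NIHON_TO_HEPBURN = {
    "si": "shi", "zi": "ji", "ti": "chi", "di": "ji",
    "tu": "tsu", "du": "zu", "hu": "fu",
    "sya": "sha", "syu": "shu", "syo": "sho",
    "zya": "ja", "zyu": "ju", "zyo": "jo",
    "tya": "cha", "tyu": "chu", "tyo": "cho",
    "dya": "ja", "dyu": "ju", "dyo": "jo",
}


def to_hepburn(syllables):
    """Convert a list of Nihon-shiki syllables to Hepburn spellings."""
    return [NIHON_TO_HEPBURN.get(s, s) for s in syllables]


def hepburn_collisions():
    """Group table entries that collapse to the same Hepburn spelling.

    Only looks inside this table, so it does not report that du -> zu
    also collides with the unchanged syllable zu.
    """
    groups = {}
    for nihon, hepburn in NIHON_TO_HEPBURN.items():
        groups.setdefault(hepburn, []).append(nihon)
    return {h: n for h, n in groups.items() if len(n) > 1}


print(to_hepburn(["ta", "ti", "tu", "te", "to"]))
print(hepburn_collisions())
```

The collision groups show why the modified forms such as dji and dja exist: plain Hepburn spells the z-row and d-row kana identically.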

Long vowels

In spoken and written Japanese, there are words that differ only by the length of a vowel. There are two vowel lengths: single and double. Preserving this distinction in rōmaji is an important goal, although not absolutely critical.

Macron: A common scheme used in Japanese textbooks for English-speaking learners and on Wikipedia. Easier to understand than wāpuro.
e.g. Tōkyō, Ōsaka, sensē, onēsan, onīsan, okāsan, yūbe
Circumflex: A simple variation on the macron scheme. Possibly invented because some typesetting systems don’t support the macron diacritic.
e.g. Tôkyô, Ôsaka, sensê, onêsan, onîsan, okâsan, yûbe
Wāpuro: A very popular scheme outside of formal publications, found especially in the anime fansub, manga scanlation, and file sharing communities. Preserves the original orthography to the best extent out of all the schemes, which helps if the text needs to be converted back into Japanese for search/correlation/etc.
e.g. Toukyou, Oosaka, sensei, oneesan, oniisan, okaasan, yuube
Doubling: Similar to wāpuro but favors pronunciation rather than kana spelling. I doubt that it’s used in the wild.
e.g. Tookyoo, Oosaka, sensee, oneesan, oniisan, okaasan, yuube
Conflate with short vowels: Not used much in actual romanized sentences, but used very often in the official romanization of city names, etc.
e.g. Tokyo, Osaka, sense, onesan, onisan, okasan, yube
oo/ou as oh: Common in situations where diacritics are not expected, such as in typical English writing. It’s rather ad hoc, and I think it looks ugly.
e.g. Tohkyoh, Ohsaka

All schemes that map the long vowels おう and おお to the same sequence of letters are inherently ambiguous. おう is the most common spelling in most Japanese words, but おお does arise occasionally, and it is critical in Japanese writing to distinguish the two.
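To make the differences between the schemes concrete, here is a minimal Python sketch that starts from a wāpuro spelling and rewrites the long vowels in each of the other styles. The function name convert_long_vowels and the scheme labels are my own for this example. It is deliberately naive: it treats every aa/ii/uu/ee/ei/oo/ou pair as a long vowel, so it cannot tell a true long vowel from two vowels that merely happen to be adjacent.

```python
import unicodedata

# Wāpuro vowel pairs that the schemes above treat as one long vowel.
LONG_PAIRS = {"aa", "ii", "uu", "ee", "ei", "oo", "ou"}
COMBINING = {"macron": "\u0304", "circumflex": "\u0302"}


def convert_long_vowels(wapuro, scheme):
    """Rewrite the long vowels of a wāpuro string in another scheme."""
    out = []
    i = 0
    while i < len(wapuro):
        pair = wapuro[i:i + 2]
        if pair.lower() not in LONG_PAIRS:
            out.append(wapuro[i])
            i += 1
            continue
        first = pair[0]  # keeps the original capitalization
        if scheme in COMBINING:
            # Compose base letter + combining mark into one character.
            out.append(unicodedata.normalize("NFC", first + COMBINING[scheme]))
        elif scheme == "doubling":
            out.append(first + first.lower())
        elif scheme == "drop":
            out.append(first)
        elif scheme == "oh":
            # Only oo/ou become oh; other pairs are left as typed.
            out.append(first + "h" if first.lower() == "o" else pair)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        i += 2
    return "".join(out)


for scheme in ("macron", "circumflex", "doubling", "drop", "oh"):
    print(scheme, convert_long_vowels("Toukyou", scheme))
```

Going in the other direction (say, from the macron form back to wāpuro) is not possible without a dictionary, which is exactly the ambiguity described above.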

For katakana long vowels, all the above options are applicable plus a few more. Whereas a hiragana long vowel uses a vowel kana as the second letter (e.g. くう), a katakana long vowel uses a horizontal mark as the second letter (e.g. クー). For example using the above schemes, セーラー can be romanized as sērā, sêrâ, seeraa, or sera. Additionally:

Hyphen (wāpuro): This reflects how katakana long vowels are entered into IMEs.
e.g. se-ra-
Foreign spelling: This looks far more natural in romanized text. But it requires the reader to know the English/foreign pronunciation and map it back to katakana sounds on the fly.
e.g. sailor

Kana n

The kana ん requires some care because its pronunciation changes in front of some consonants, and because ん followed by a vowel kana is not the same as a single n-row kana (for example, んう vs. ぬ), even though both could naively be spelled nu.

Always use n: Common.
e.g. sankaku, sanpo, senpai
Sometimes use m: The old Hepburn scheme uses m when the next kana begins with b or p.
e.g. sankaku, sampo, sempai
Use n’ (apostrophe): Either always use n’, or drop the apostrophe in non-ambiguous cases.
e.g. san’kaku/sankaku, san’po/sanpo, sen’pai/senpai, ren’ai, Jun’ichi
Always use nn: This is an artifact from input method editors (IMEs), but is never used in writing because it causes massive confusion with popular practices.
e.g. sannkaku, sannpo, sennpai, rennai, Junnichi
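The options above differ only in how ん is spelled given what follows it, which can be sketched in a few lines of Python. The function names and scheme labels here are hypothetical, the input is assumed to be a list of per-kana romaji units with ん left as a kana placeholder, and the code uses an ASCII apostrophe rather than the typographic one.

```python
def romanize_n(next_unit, scheme):
    """Spell ん, given the romaji of the following kana ('' at the end)."""
    if scheme == "n":
        return "n"
    if scheme == "m":
        # Old Hepburn as described above: m before b- and p-.
        return "m" if next_unit.startswith(("b", "p")) else "n"
    if scheme == "apostrophe-always":
        return "n'"
    if scheme == "apostrophe-minimal":
        # Only a following vowel or y could be misread as na/ni/.../nya.
        ambiguous = next_unit.startswith(("a", "i", "u", "e", "o", "y"))
        return "n'" if ambiguous else "n"
    if scheme == "nn":
        return "nn"
    raise ValueError(f"unknown scheme: {scheme}")


def romanize_units(units, scheme):
    """Join romaji units, replacing each ん placeholder per the scheme."""
    out = []
    for i, unit in enumerate(units):
        if unit == "ん":
            next_unit = units[i + 1] if i + 1 < len(units) else ""
            out.append(romanize_n(next_unit, scheme))
        else:
            out.append(unit)
    return "".join(out)


print(romanize_units(["sa", "ん", "po"], "m"))                        # sampo
print(romanize_units(["re", "ん", "a", "i"], "apostrophe-minimal"))   # ren'ai
print(romanize_units(["se", "ん", "pa", "i"], "apostrophe-minimal"))  # senpai
```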

Small tsu

Double the previous consonant: This is essentially the universally adopted scheme. It generally works well enough but breaks down in some minor, esoteric edge cases, which are discussed later.
Treat as an individual character: Romanize as xtu, xtsu, or otherwise. This is helpful if the goal is lossless romanization.
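A minimal Python sketch of the doubling rule, with a fallback to the individual-character spelling when there is nothing to double. The function name apply_small_tsu and the per-kana unit list are assumptions of this example. Note that it simply doubles the first letter of the next unit, so a Hepburn unit such as chi comes out as cchi.

```python
VOWELS = "aiueo"


def apply_small_tsu(units, fallback="xtu"):
    """Join romaji units, turning each っ placeholder into a doubled consonant.

    If the next unit is missing, is another っ, or starts with a vowel,
    there is no consonant to double, so the fallback spelling is used.
    """
    out = []
    for i, unit in enumerate(units):
        if unit != "っ":
            out.append(unit)
            continue
        next_unit = units[i + 1] if i + 1 < len(units) else ""
        if next_unit and next_unit != "っ" and next_unit[0] not in VOWELS:
            out.append(next_unit[0])
        else:
            out.append(fallback)
    return "".join(out)


print(apply_small_tsu(["ga", "っ", "ko", "u"]))  # gakkou
print(apply_small_tsu(["ki", "っ", "te"]))       # kitte
print(apply_small_tsu(["っ", "i"]))              # xtui
```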

Particles

Three particles in Japanese have a different pronunciation than what their written kana symbol suggests. They are は, へ, を.

Favor orthography: Romanize as ha, he, wo. This is good for lossless conversions.
Favor phonetics: Romanize as wa, e, o. This is good for reading romanized text aloud.
Favor phonetics but preserve wo: Romanize as wa, e, wo. This preserves the distinction between o and wo. Furthermore, を can be pronounced as wo in songs. (But there are a number of other permitted liberalities in song lyric pronunciation, which I will not elaborate on.)
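The three options amount to three small lookup tables. In this Python sketch the scheme labels and the function name romanize_particle are invented for illustration, and deciding whether a given は, へ, or を is actually being used as a particle is assumed to have been done beforehand.

```python
# One table per option above; keys are the particle kana.
PARTICLE_SCHEMES = {
    "orthographic": {"は": "ha", "へ": "he", "を": "wo"},
    "phonetic": {"は": "wa", "へ": "e", "を": "o"},
    "phonetic-keep-wo": {"は": "wa", "へ": "e", "を": "wo"},
}


def romanize_particle(kana, scheme):
    """Romanize a kana that is already known to be a particle."""
    return PARTICLE_SCHEMES[scheme][kana]


for scheme in PARTICLE_SCHEMES:
    print(scheme, [romanize_particle(k, scheme) for k in "はへを"])
```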

Spaces

Native Japanese text has no spaces. But in such text, a kana followed by a kanji strongly suggests a word boundary in between, which works well in practice. In rōmaji this is not possible, so spaces are necessary. Furthermore, every language that uses the Latin alphabet uses spaces in normal text.

Japanese example with full kanji: 貴方は家で御飯を食べたか。
Space between words, including particles: This is the most common scheme.
e.g. anata wa ie de gohan wo tabeta ka.
Space between words, no space before particle: This mimics the spacing used in all-kana Japanese text written for beginning readers.
e.g. anatawa iede gohanwo tabetaka.
Kana equivalent: あなたは　いえで　ごはんを　たべたか。
Space between each kana: Possibly helpful for beginners, good for lossless transliteration, but not used in practice.
e.g. a na ta wa i e de go ha n wo ta be ta ka
No space: This removes the work needed to find word boundaries in the Japanese text, but results in hard-to-read romanized text. However, the romanization is still lossless and unambiguous, since Japanese text does not have spaces to begin with.
e.g. anatawaiedegohanwotabetaka
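Once the sentence has been split into words, the four spacing schemes are just different ways of joining the pieces. The sketch below is in Python and assumes the hard part (finding word boundaries and marking particles) has already been done by hand. SENTENCE and join_tokens are names made up for this example.

```python
# Each token: (romaji units, one per kana; is this token a particle?)
SENTENCE = [
    (["a", "na", "ta"], False), (["wa"], True),
    (["i", "e"], False), (["de"], True),
    (["go", "ha", "n"], False), (["wo"], True),
    (["ta", "be", "ta"], False), (["ka"], True),
]


def join_tokens(tokens, scheme):
    """Join pre-segmented tokens using one of the spacing schemes above."""
    words = ["".join(units) for units, _ in tokens]
    if scheme == "words":
        return " ".join(words)
    if scheme == "attach-particles":
        out = []
        for word, (_, is_particle) in zip(words, tokens):
            if is_particle and out:
                out[-1] += word  # glue the particle onto the previous word
            else:
                out.append(word)
        return " ".join(out)
    if scheme == "kana":
        return " ".join(unit for units, _ in tokens for unit in units)
    if scheme == "none":
        return "".join(words)
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("words", "attach-particles", "kana", "none"):
    print(join_tokens(SENTENCE, scheme))
```

Only the "kana" and "none" schemes can be produced without any knowledge of word boundaries, which is why they are the easiest to automate.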

Hyphens

Closely related to spacing is hyphenation. Words that have a loose relationship can be joined with a hyphen instead of a space. Examples:

Numbered items: dai-ichi (第一), ni-chōme (二丁目), san-jū (三十), yon-mai (四枚), go-ji (五時)
Honorific suffixes: -san, -sama, -chan
Honorific prefixes: o-, go-

Capitalization

Base techniques:

Everything in lowercase: Simple, easy to read. Popular.
Hiragana in lowercase, katakana in uppercase: Lossless. This scheme is sometimes used in fan-made song lyric romanizations.

Further considerations:

Capitalize the first word in a sentence: Just like in European languages.
e.g. Kore ga watashi no go-shujin-sama desu.
Capitalize proper nouns: That is, capitalize the names of people, places, companies, products, publications, etc.
e.g. Mary-san to Yuki-san wa senshuu London e ikimashita.
Title case for titles: Capitalize every word, or only capitalize significant words?
e.g. Pikachu wa Genki Desu ne

Foreign words

Foreign/loan words are usually written in katakana, and phonetically approximate the word’s pronunciation in the original language. The vast majority of loanwords in use are from English.

Romanize as kana: Systematic but extremely ugly, and hard to recognize even for people who can read English.
e.g. pāsonaru konpyūtā, kurisumasu, makudonarudo, arubaito, saito
Revert to original spelling: May require context and interpretation, have ambiguity, and/or require diacritics.
e.g. personal computer, Christmas, McDonald, arbeit, site/sight

Small kana

Small vowels: Treat them as big kana, except in well-known combinations where the small vowel modifies the preceding kana (such as ふぁ, romanized as fa). e.g. まぁ romanized as maa.
Use the pseudo-consonant “x”: e.g. xa, xyu, xtsu

Non-standard dakuten

Occasionally, usually for silly emphatic effect, a dakuten (゛) or han-dakuten (゜) is placed on a kana that does not normally accept such a diacritic, for example ま゛. I cannot think of any reasonable way to romanize such text other than to ignore the (han-)dakuten and use the base kana.

Losslessness

Mathematically speaking, romanization can be thought of as a function that maps a sequence of kana letters to a sequence of Latin letters.

It’s easy to create a romanization scheme that is lossless (injective/one-to-one), in the sense that each distinct sequence of kana gives a different sequence of rōmaji. But such a scheme wouldn’t be completely pretty, since it would probably use Nihon-shiki for simplicity, have a space between each kana, and use capitalization to distinguish hiragana and katakana.

Here are some contrived edge cases to consider if your goal is to design a lossless romanization scheme:

Kana	Lossless romanization	Comment
っい	xtu i	No consonant to double
あっな	a xtu na	Ambiguous consonant doubling
てっっと	te xtu xtu to	Tests for consonant tripling
しゃゃゅょさゃ	si xya xya xyu xyo sa xya	No good way to represent small kana after the first one
くくうくううくううう	ku ku u ku u u ku u u u	Distinguishing more than two vowel lengths – fails macron and circumflex
とおとう	to o to u	Fails vowel macron, circumflex, doubling, ou/oo as oh, and dropping
ねえねい	ne e ne i	May fail vowel macron, circumflex, and doubling
ニーニイニィ	NI - NI I NI XI	Fails any pronunciation-based scheme, because all three are pronounced the same
にーにいにぃ	ni - ni i ni xi	Fails any pronunciation-based scheme, because all three are pronounced the same
が゜お゛	ga ° o "	Fails any scheme that ignores non-standard dakuten
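The style of lossless scheme used in the table (Nihon-shiki, a space between each kana, x for small kana, uppercase for katakana) is simple enough to sketch in Python. Everything here is an illustration under stated assumptions: the names encode and decode are mine, the table covers only the common kana (no ゐ, ゑ, ゔ, and so on), and the non-standard dakuten are assumed to be the standalone characters U+309B and U+309C rather than combining marks.

```python
VOWELS = "aiueo"
ROWS = {
    "": "あいうえお", "x": "ぁぃぅぇぉ",
    "k": "かきくけこ", "g": "がぎぐげご",
    "s": "さしすせそ", "z": "ざじずぜぞ",
    "t": "たちつてと", "d": "だぢづでど",
    "n": "なにぬねの", "m": "まみむめも", "r": "らりるれろ",
    "h": "はひふへほ", "b": "ばびぶべぼ", "p": "ぱぴぷぺぽ",
}
HIRAGANA = {kana: prefix + vowel
            for prefix, row in ROWS.items()
            for kana, vowel in zip(row, VOWELS)}
HIRAGANA.update({
    "や": "ya", "ゆ": "yu", "よ": "yo", "ゃ": "xya", "ゅ": "xyu", "ょ": "xyo",
    "わ": "wa", "を": "wo", "ん": "n", "っ": "xtu",
})
# Each katakana sits 0x60 code points above its hiragana counterpart.
KATAKANA = {chr(ord(k) + 0x60): v.upper() for k, v in HIRAGANA.items()}
# Long-vowel mark and standalone (han-)dakuten; these have no letter case.
MARKS = {"ー": "-", "\u309b": '"', "\u309c": "°"}

TABLE = {**HIRAGANA, **KATAKANA, **MARKS}
INVERSE = {token: kana for kana, token in TABLE.items()}
assert len(INVERSE) == len(TABLE)  # every kana has its own distinct token


def encode(kana_text):
    """Kana -> space-separated tokens; raises KeyError on unknown characters."""
    return " ".join(TABLE[ch] for ch in kana_text)


def decode(romaji):
    """Exact inverse of encode()."""
    if not romaji:
        return ""
    return "".join(INVERSE[token] for token in romaji.split(" "))


for text in ["っい", "てっっと", "しゃゃゅょさゃ", "ニーニイニィ"]:
    romaji = encode(text)
    assert decode(romaji) == text  # round trip
    print(text, "->", romaji)
```

Because every kana maps to its own token and the tokens are separated by spaces, the mapping is injective by construction; none of the context-dependent rules from the earlier sections (consonant doubling, long-vowel merging, particles) are needed.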

Project Nayuki