Project Nayuki


Variations on Japanese romanization

Japanese names, titles, and prose often need to be converted into Latin letters, for example when writing for an English-speaking audience or when using software that only handles ASCII file names. As I explored this problem, I found many subtle variations of Japanese romanization in the wild, each with a valid reason for existing. In this article I try to give a near-complete overview of the reasonable ways to romanize Japanese text.

Basic kana

There are two major styles for romanizing kana: Nihon-shiki versus Hepburn. Nihon-shiki has a very uniform structure of consonant (plus optional y) plus vowel, whereas Hepburn conveys the pronunciation more accurately at the expense of irregular spelling. For example, the romanization of the T line in Nihon-shiki is ta ti tu te to, but in Hepburn it’s ta chi tsu te to. The following list shows (respectively) the kana, the Nihon-shiki romanization, the Hepburn romanization, and any alternate romanizations:

  • し: si, shi
  • じ: zi, ji
  • ち: ti, chi
  • ぢ: di, ji, dji
  • つ: tu, tsu
  • づ: du, zu, dzu
  • ふ: hu, fu
  • しゃ: sya, sha
  • しゅ: syu, shu
  • しょ: syo, sho
  • じゃ: zya, ja, jya
  • じゅ: zyu, ju, jyu
  • じょ: zyo, jo, jyo
  • ちゃ: tya, cha
  • ちゅ: tyu, chu
  • ちょ: tyo, cho
  • ぢゃ: dya, ja, dja
  • ぢゅ: dyu, ju, dju
  • ぢょ: dyo, jo, djo

Note: The forms {dji, dzu, dja, dju, djo} are modified from Hepburn and are for disambiguation. The forms {jya, jyu, jyo} are in between Hepburn and systematic romanization.
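
To make the difference between the two styles concrete, here is a minimal Python sketch that romanizes only the kana listed above. The names `IRREGULAR` and `romanize_irregular` are made up for this illustration, the alternate forms (dji, jya, etc.) are left out, and a real converter would need the full kana table.

```python
# (Nihon-shiki, Hepburn) pairs for the kana listed above.
# This is deliberately not a full kana table.
IRREGULAR = {
    "し": ("si", "shi"), "じ": ("zi", "ji"), "ち": ("ti", "chi"),
    "ぢ": ("di", "ji"), "つ": ("tu", "tsu"), "づ": ("du", "zu"),
    "ふ": ("hu", "fu"),
    "しゃ": ("sya", "sha"), "しゅ": ("syu", "shu"), "しょ": ("syo", "sho"),
    "じゃ": ("zya", "ja"), "じゅ": ("zyu", "ju"), "じょ": ("zyo", "jo"),
    "ちゃ": ("tya", "cha"), "ちゅ": ("tyu", "chu"), "ちょ": ("tyo", "cho"),
    "ぢゃ": ("dya", "ja"), "ぢゅ": ("dyu", "ju"), "ぢょ": ("dyo", "jo"),
}


def romanize_irregular(text, style="hepburn"):
    """Romanize a string made up only of the kana in IRREGULAR."""
    column = {"nihon-shiki": 0, "hepburn": 1}[style]
    out = []
    i = 0
    while i < len(text):
        # Prefer a two-kana digraph (e.g. しゃ) over a single kana (し).
        for length in (2, 1):
            chunk = text[i:i + length]
            if chunk in IRREGULAR:
                out.append(IRREGULAR[chunk][column])
                i += len(chunk)
                break
        else:
            raise ValueError(f"kana not in table: {text[i]!r}")
    return "".join(out)


print(romanize_irregular("しゃち", "nihon-shiki"))  # syati
print(romanize_irregular("しゃち", "hepburn"))      # shachi
```

Note that plain Hepburn maps both じ and ぢ to ji, which is exactly the ambiguity that the dji-style forms try to avoid.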

Long vowels

In spoken and written Japanese, there are words that differ only by the length of a vowel. There are two vowel lengths: single and double. Distinguishing this in rōmaji is an important goal, although not absolutely critical.

Macron

A common scheme used in Japanese textbooks for English-speaking learners and on Wikipedia. Easier to understand than wāpuro.
e.g. Tōkyō, Ōsaka, sensē, onēsan, onīsan, okāsan, yūbe

Circumflex

A simple variation on the macron scheme. Possibly invented because some typesetting systems don’t support the macron diacritic.
e.g. Tôkyô, Ôsaka, sensê, onêsan, onîsan, okâsan, yûbe

Wāpuro

A very popular scheme outside of formal publications, found especially in the anime fansub, manga scanlation, and file sharing communities. Of all the schemes here, it preserves the original kana spelling best, which helps if the text needs to be converted back into Japanese for searching, correlation, etc.
e.g. Toukyou, Oosaka, sensei, oneesan, oniisan, okaasan, yuube

Doubling

Similar to wāpuro but favors pronunciation rather than kana spelling. I doubt that it’s used in the wild.
e.g. Tookyoo, Oosaka, sensee, oneesan, oniisan, okaasan, yuube

Conflate with short vowels

Not used much in actual romanized sentences, but used very often in the official romanization of city names, etc.
e.g. Tokyo, Osaka, sense, onesan, onisan, okasan, yube

oo/ou as oh

Common in situations where diacritics are not expected, such as in typical English writing. It’s rather ad hoc, and I think it looks ugly.
e.g. Tohkyoh, Ohsaka
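
Because wāpuro keeps the kana spelling, the other long-vowel schemes can be derived from it mechanically. The following Python sketch does this with a regular expression. The function name and scheme labels are made up for illustration, and it rests on two assumptions: every ei is a long e (as in the sensē example above), and every matching vowel pair really is a long vowel. The second assumption fails for words like omou, where the ou spans a morpheme boundary, so treat this as a demonstration of the schemes rather than a reliable converter.

```python
import re

# Wāpuro vowel pairs mapped to the vowel that is being lengthened.
LONG = {"aa": "a", "ii": "i", "uu": "u", "ee": "e", "ei": "e", "oo": "o", "ou": "o"}
MACRON = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}
CIRCUMFLEX = {"a": "â", "i": "î", "u": "û", "e": "ê", "o": "ô"}
PAIR = re.compile("|".join(LONG), re.IGNORECASE)
SCHEMES = ("macron", "circumflex", "doubling", "short", "oh")


def convert_long_vowels(wapuro, scheme):
    """Rewrite the long vowels of a wāpuro-style string in another scheme."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")

    def replace(match):
        pair = match.group(0)
        vowel = LONG[pair.lower()]
        if scheme == "macron":
            out = MACRON[vowel]
        elif scheme == "circumflex":
            out = CIRCUMFLEX[vowel]
        elif scheme == "doubling":
            out = vowel * 2
        elif scheme == "short":
            out = vowel
        elif vowel == "o":      # scheme == "oh": only long o is rewritten
            out = "oh"
        else:                   # scheme == "oh": other vowels stay as typed
            return pair
        # Keep the capital letter of names such as Oosaka.
        return out.capitalize() if pair[0].isupper() else out

    return PAIR.sub(replace, wapuro)


for scheme in SCHEMES:
    print(scheme, convert_long_vowels("Toukyou", scheme))
```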

For katakana long vowels, all the above options are applicable plus a few more. Whereas a hiragana long vowel uses a vowel kana as the second letter (e.g. くう), a katakana long vowel uses a horizontal mark as the second letter (e.g. クー). For example using the above schemes, セーラー can be romanized as sērā, sêrâ, seeraa, or sera. Additionally:

Hyphen (wāpuro)

e.g. se-ra-

Foreign spelling

e.g. sailor

Kana n

Always use n

Common.
e.g. sankaku, sanpo, senpai

Sometimes use m

The old Hepburn scheme uses m when the next kana is b- or p-.
e.g. sankaku, sampo, sempai

Use n’ (apostrophe)

Either always use n’, or drop the apostrophe in non-ambiguous cases.
e.g. san’kaku/sankaku, san’po/sanpo, sen’pai/senpai, ren’ai, Jun’ichi

Always use nn

This is an artifact of input method editors (IMEs) and is never used in writing, because it clashes with the common convention where nn means ん followed by an n- kana (e.g. konna for こんな).
e.g. sannkaku, sannpo, sennpai, rennai, Junnichi
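
The options for ん differ only in how they look at the following kana, which a short Python sketch can show. The function names and scheme labels here are hypothetical, the input is assumed to be already split into rōmaji syllables with ん left as a marker, and a plain ASCII apostrophe stands in for the typographic one. Treating "before a vowel or y" as the ambiguous case is also my assumption about what non-ambiguous means.

```python
VOWELS_AND_Y = set("aiueoy")


def romanize_n(next_syllable, scheme):
    """Romanize ん, given the rōmaji of the kana after it ("" if none)."""
    first = next_syllable[:1]
    if scheme == "n":
        return "n"
    if scheme == "m":
        # As described above: m when the next kana is b- or p-.
        return "m" if first in {"b", "p"} else "n"
    if scheme == "apostrophe":
        return "n'"
    if scheme == "apostrophe-if-ambiguous":
        # Without the apostrophe, ren'ai would read as re-na-i.
        return "n'" if first in VOWELS_AND_Y else "n"
    if scheme == "nn":
        return "nn"
    raise ValueError(f"unknown scheme: {scheme}")


def romanize_syllables(syllables, scheme):
    """Join rōmaji syllables, resolving each "ん" marker by its successor."""
    out = []
    for i, syllable in enumerate(syllables):
        if syllable == "ん":
            following = syllables[i + 1] if i + 1 < len(syllables) else ""
            out.append(romanize_n(following, scheme))
        else:
            out.append(syllable)
    return "".join(out)


print(romanize_syllables(["sa", "ん", "po"], "m"))                       # sampo
print(romanize_syllables(["re", "ん", "a", "i"], "apostrophe-if-ambiguous"))  # ren'ai
```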

Small tsu

Double the previous consonant

This is essentially the universally adopted scheme. It generally works well enough but breaks down in some minor, esoteric edge cases, which are discussed later.

Treat as an individual character

Romanize as xtu, xtsu, or otherwise. This is helpful if the goal is lossless romanization.
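
Both options are easy to sketch in Python. The function name and scheme labels below are made up, and the input is assumed to be the rōmaji of the kana that follows the small tsu. The tch special case is the usual Hepburn convention (as in matcha); it is my addition and is not discussed in this article.

```python
def romanize_small_tsu(next_syllable, scheme="double"):
    """Return the rōmaji for っ plus the syllable that follows it."""
    if scheme == "separate":
        # Treat っ as its own character, spelled xtu here.
        return "xtu" + next_syllable
    if scheme != "double":
        raise ValueError(f"unknown scheme: {scheme}")
    if not next_syllable:
        raise ValueError("nothing follows the small tsu, so nothing can be doubled")
    if next_syllable.startswith("ch"):
        # Hepburn convention (an addition, not from the text): っち is tchi.
        return "t" + next_syllable
    return next_syllable[0] + next_syllable


print(romanize_small_tsu("te"))              # tte
print(romanize_small_tsu("i"))               # ii, same as a long vowel
print(romanize_small_tsu("na"))              # nna, same as ん followed by な
print(romanize_small_tsu("te", "separate"))  # xtute
```

The second and third calls show where doubling breaks down: the result is indistinguishable from a long vowel or from ん followed by an n- kana.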

Particles

Three particles in Japanese are pronounced differently from what their kana spelling suggests: は, へ, and を.

Favor orthography

Romanize as ha, he, wo. This is good for lossless conversions.

Favor phonetics

Romanize as wa, e, o. This is good for reading romanized text aloud.

Favor phonetics but preserve wo

Romanize as wa, e, wo. This preserves the distinction between o and wo. Furthermore, を can be pronounced as wo in songs. (But there are a number of other permitted liberalities in song lyric pronunciation, which I will not elaborate on.)
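
The three options amount to three small lookup tables, as in this Python sketch. The table and function names are hypothetical. It also assumes the sentence has already been split into tokens with the particles marked, because deciding whether a given は is a particle or part of a word needs real word segmentation, which this sketch does not attempt.

```python
PARTICLES = {
    "orthographic":     {"は": "ha", "へ": "he", "を": "wo"},
    "phonetic":         {"は": "wa", "へ": "e",  "を": "o"},
    "phonetic-keep-wo": {"は": "wa", "へ": "e",  "を": "wo"},
}


def romanize_with_particles(tokens, scheme):
    """tokens is a list of (text, is_particle) pairs.

    Ordinary words are assumed to be rōmaji already; only tokens marked
    as particles are looked up in the table for the chosen scheme.
    """
    table = PARTICLES[scheme]
    return " ".join(
        table.get(text, text) if is_particle else text
        for text, is_particle in tokens
    )


sentence = [("anata", False), ("は", True), ("gohan", False),
            ("を", True), ("tabeta", False)]
for scheme in PARTICLES:
    print(scheme, "->", romanize_with_particles(sentence, scheme))
```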

Spaces

Native Japanese text has no spaces. But in such text, a kana followed by a kanji strongly suggests a word boundary in between, which works well in practice. In rōmaji this is not possible, so spaces are necessary. Furthermore, every language that uses the Latin alphabet uses spaces in normal text.

Space between words, including particles

This is the most common scheme.
e.g. anata wa ie de gohan wo tabeta ka.

Space between words, no space before particle

This mimics the word spacing of all-kana Japanese text written for beginners.
e.g. anatawa iede gohanwo tabetaka.

Space between each kana

Possibly helpful for beginners, good for lossless transliteration, but not used in practice.
e.g. a na ta wa i e de go ha n wo ta be ta ka

No space

This removes the work needed to find word boundaries in the Japanese text, but results in hard-to-read romanized text. However, the romanization is still lossless and unambiguous, since Japanese text does not have spaces to begin with.
e.g. anatawaiedegohanwotabetaka
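
Once the word boundaries are known, the four spacing schemes are just different ways of joining the same data. The Python sketch below assumes a hypothetical input format where each token is a list of rōmaji syllables plus a flag saying whether it is a particle; the names are made up for illustration.

```python
# The example sentence, pre-segmented by hand: (syllables, is_particle).
SENTENCE = [
    (["a", "na", "ta"], False), (["wa"], True),
    (["i", "e"], False), (["de"], True),
    (["go", "ha", "n"], False), (["wo"], True),
    (["ta", "be", "ta"], False), (["ka"], True),
]


def join_with_spacing(tokens, scheme):
    """Join pre-segmented tokens using one of the four spacing schemes."""
    words = ["".join(syllables) for syllables, _ in tokens]
    if scheme == "words":
        return " ".join(words)
    if scheme == "attach-particles":
        out = []
        for word, (_, is_particle) in zip(words, tokens):
            if is_particle and out:
                out[-1] += word      # glue the particle onto the previous word
            else:
                out.append(word)
        return " ".join(out)
    if scheme == "kana":
        return " ".join(s for syllables, _ in tokens for s in syllables)
    if scheme == "none":
        return "".join(words)
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("words", "attach-particles", "kana", "none"):
    print(join_with_spacing(SENTENCE, scheme))
```

Only the last scheme can be produced without any word segmentation, which is the trade-off described above.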

Hyphens

Closely related to spacing is hyphenation. Words that have a loose relationship can be joined with a hyphen instead of a space. Examples:

  • Numbered items: dai-ichi (第一), ni-chōme (二丁目), san-jū (三十), yon-mai (四枚), go-ji (五時)
  • Honorific suffixes: -san, -sama, -chan
  • Honorific prefixes: o-, go-

Capitalization

Base techniques:

Everything in lowercase

Simple, easy to read. Popular.

Hiragana in lowercase, katakana in uppercase

Lossless. This scheme is sometimes used in fan-made song lyric romanizations.

Further considerations:

Capitalize the first word in a sentence

Just like in European languages.
e.g. Kore ga watashi no go-shujin-sama desu.

Capitalize proper nouns (names, etc.)

That is, capitalize the names of people/places/companies/products/publications/etc.
e.g. Mary-san to Yuki-san wa senshuu London e ikimashita.

Title case for titles

Capitalize every word, or only capitalize significant words?
e.g. Pikachu wa Genki Desu ne

Foreign words

Foreign/loan words are usually written in katakana, and phonetically approximate the word’s pronunciation in the original language. The vast majority of loanwords in use are from English.

Romanize as kana

Systematic but extremely ugly, and hard to recognize even for people who can read English.
e.g. pāsonaru konpyūtā

Revert to original spelling

May require context and interpretation, have ambiguity, and/or require diacritics.
e.g. personal computer

Small kana

Small vowels

Treat them as big kana, except for well-known cases

Use the pseudo-consonant “x”

e.g. xa, xyu, xtsu

Non-standard dakuten

A dakuten (゛) or han-dakuten (゜) is occasionally placed on a kana that does not normally accept such a diacritic, usually for silly emphatic effect. For example, ま゛. I cannot think of any reasonable way to romanize this, other than to ignore the (han-)dakuten and use the base kana.

Losslessness

Mathematically speaking, romanization can be thought of as a function that maps a sequence of kana letters to a sequence of Latin letters.

It’s easy to create a romanization scheme that is lossless (injective/one-to-one), in the sense that each distinct sequence of kana gives a different sequence of rōmaji. But such a scheme wouldn’t be completely pretty, since it would probably use Nihon-shiki for simplicity, have a space between each kana, and use capitalization to distinguish hiragana and katakana.

Here are some contrived edge cases to consider if your goal is to design a lossless romanization scheme:

  • っい (xtu i)
  • あっな (a xtu na)
  • てっっと (te xtu xtu to)
  • しゃゃゅょさゃ (si xya xya xyu xyo sa xya)
  • くくうくううくううう (ku ku u ku u u ku u u u)
  • とおとう (to o to u)
  • ねえねい (ne e ne i)
  • ニーニイニィ (NI - NI I NI XI)
  • にーにいにぃ (ni - ni i ni xi)
  • が゜お゛ (ga ° o ")
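
As a rough check that such a scheme really is invertible, here is a minimal Python sketch of the lossless design described above: Nihon-shiki spellings, a space between each kana, lowercase for hiragana and uppercase for katakana, and an x prefix for small kana. All names are made up for illustration. It covers only the basic kana, relies on each katakana sitting 0x60 code points after its hiragana counterpart in Unicode, and does not handle the stray (han-)dakuten of the last edge case.

```python
# Hiragana rows in a-i-u-e-o order, keyed by their Nihon-shiki consonant.
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ", "x": "ぁぃぅぇぉ",
}
SINGLES = {"や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "を": "wo", "ん": "n",
           "ゃ": "xya", "ゅ": "xyu", "ょ": "xyo", "っ": "xtu"}


def build_table():
    table = {}
    for consonant, row in ROWS.items():
        for vowel, kana in zip("aiueo", row):
            table[kana] = consonant + vowel
    table.update(SINGLES)
    # Katakana: same spelling in uppercase, code point shifted by 0x60.
    for hiragana, romaji in list(table.items()):
        table[chr(ord(hiragana) + 0x60)] = romaji.upper()
    table["ー"] = "-"   # the long-vowel mark has no case
    return table


KANA_TO_ROMAJI = build_table()
ROMAJI_TO_KANA = {romaji: kana for kana, romaji in KANA_TO_ROMAJI.items()}
# Every kana must get a distinct token, or the scheme would not be injective.
assert len(ROMAJI_TO_KANA) == len(KANA_TO_ROMAJI)


def to_romaji(kana_text):
    return " ".join(KANA_TO_ROMAJI[ch] for ch in kana_text)


def to_kana(romaji_text):
    return "".join(ROMAJI_TO_KANA[token] for token in romaji_text.split())


for case in ["てっっと", "しゃゃゅょさゃ", "ニーニイニィ", "とおとう"]:
    romaji = to_romaji(case)
    print(case, "->", romaji, "->", to_kana(romaji))
```

Because each kana maps to its own token and the tokens are separated by spaces, the round trip through `to_kana` recovers the input for any string of supported kana, including the contrived cases above.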