Variations on Japanese romanization
Japanese names, titles, and prose often need to be converted into Latin letters, for example when writing for an English-speaking audience or when using software that only handles ASCII file names. As I explored this problem, I found many subtle variations of Japanese romanization in the wild, each with a valid reason for existing. In this article I try to give a near-complete overview of the reasonable ways to romanize Japanese text.
Basic kana
There are two major styles for romanizing kana: Nihon-shiki versus Hepburn. Nihon-shiki has a very uniform structure of consonant (plus optional y) plus vowel, whereas Hepburn conveys the pronunciation more accurately at the expense of irregular spelling. For example, the romanization of the T line in Nihon-shiki is “ta ti tu te to”, but in Hepburn it’s “ta chi tsu te to”. The following list shows (respectively) the kana, the Nihon-shiki romanization, the Hepburn romanization, and any alternate romanizations:
- し: si, shi
- じ: zi, ji
- ち: ti, chi
- ぢ: di, ji, dji
- つ: tu, tsu
- づ: du, zu, dzu
- ふ: hu, fu
- しゃ: sya, sha
- しゅ: syu, shu
- しょ: syo, sho
- じゃ: zya, ja, jya
- じゅ: zyu, ju, jyu
- じょ: zyo, jo, jyo
- ちゃ: tya, cha
- ちゅ: tyu, chu
- ちょ: tyo, cho
- ぢゃ: dya, ja, dja
- ぢゅ: dyu, ju, dju
- ぢょ: dyo, jo, djo
Note: The forms {dji, dzu, dja, dju, djo} are modified from Hepburn and are for disambiguation. The forms {jya, jyu, jyo} are in between Hepburn and systematic romanization.
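As a rough illustration of the two styles, here is a minimal Python sketch that looks up the kana from the list above. The names `TABLE` and `romanize` are my own for this example, the table covers only the rows listed in this section (not the full kana chart), and the alternate forms are left out.

```python
# Rows copied from the list above: kana -> (Nihon-shiki, Hepburn).
# Only the kana where the two styles differ are included.
TABLE = {
    "し": ("si", "shi"), "じ": ("zi", "ji"),
    "ち": ("ti", "chi"), "ぢ": ("di", "ji"),
    "つ": ("tu", "tsu"), "づ": ("du", "zu"),
    "ふ": ("hu", "fu"),
    "しゃ": ("sya", "sha"), "しゅ": ("syu", "shu"), "しょ": ("syo", "sho"),
    "じゃ": ("zya", "ja"), "じゅ": ("zyu", "ju"), "じょ": ("zyo", "jo"),
    "ちゃ": ("tya", "cha"), "ちゅ": ("tyu", "chu"), "ちょ": ("tyo", "cho"),
    "ぢゃ": ("dya", "ja"), "ぢゅ": ("dyu", "ju"), "ぢょ": ("dyo", "jo"),
}
COLUMN = {"nihon-shiki": 0, "hepburn": 1}


def romanize(kana, style):
    """Romanize a kana string, preferring two-kana digraphs over single kana."""
    col = COLUMN[style]
    out = []
    i = 0
    while i < len(kana):
        for size in (2, 1):
            chunk = kana[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk][col])
                i += len(chunk)
                break
        else:
            raise KeyError(f"kana not in this small table: {kana[i]}")
    return "".join(out)


print(romanize("しゃち", "nihon-shiki"))  # syati
print(romanize("しゃち", "hepburn"))      # shachi
# Plain Hepburn cannot tell じ and ぢ apart, hence forms like "dji".
print(romanize("じ", "hepburn") == romanize("ぢ", "hepburn"))  # True
```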
Long vowels
In spoken and written Japanese, there are words that differ only by the length of a vowel. There are two vowel lengths: single and double. Distinguishing this in rōmaji is an important goal, although not absolutely critical.
- Macron
A common scheme used in Japanese textbooks for English-speaking learners and on Wikipedia. Easier to understand than wāpuro.
e.g. Tōkyō, Ōsaka, sensē, onēsan, onīsan, okāsan, yūbe
- Circumflex
A simple variation on the macron scheme. Possibly invented because some typesetting systems don’t support the macron diacritic.
e.g. Tôkyô, Ôsaka, sensê, onêsan, onîsan, okâsan, yûbe
- Wāpuro
A very popular scheme outside of formal publications, especially popular in the anime fansub, manga scanlation, and file sharing community. Preserves the original orthography to the best extent out of all the schemes, which helps if the text needs to be converted back into Japanese for search/correlation/etc.
e.g. Toukyou, Oosaka, sensei, oneesan, oniisan, okaasan, yuube
- Doubling
Similar to wāpuro but favors pronunciation rather than kana spelling. I doubt that it’s used in the wild.
e.g. Tookyoo, Oosaka, sensee, oneesan, oniisan, okaasan, yuube
- Conflate with short vowels
Not used much in actual romanized sentences, but used very often in the official romanization of city names, etc.
e.g. Tokyo, Osaka, sense, onesan, onisan, yube
- oo/ou as oh
Common in situations where diacritics are not expected, such as in typical English writing. It’s rather ad hoc, and I think it looks ugly.
e.g. Tohkyoh, Ohsaka
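Since wāpuro keeps the most information, the other schemes can be derived from it mechanically. The sketch below is one possible way to do that in Python; the function name `convert_long_vowels` and the style labels are made up for this example. It assumes every aa/ii/uu/ee/ei/oo/ou pair really is a long vowel, which is not always true (a pair can straddle a boundary between two parts of a word), so treat it as an approximation.

```python
import re

# Wāpuro letter pairs treated as long vowels, mapped to their base vowel.
LONG_PAIRS = {"aa": "a", "ii": "i", "uu": "u", "ee": "e",
              "ei": "e", "oo": "o", "ou": "o"}
MACRON = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}
CIRCUMFLEX = {"a": "â", "i": "î", "u": "û", "e": "ê", "o": "ô"}


def convert_long_vowels(wapuro, style):
    """Rewrite the long vowels of a wāpuro string in another scheme."""
    def render(match):
        pair = match.group(0)
        base = LONG_PAIRS[pair.lower()]
        if style == "macron":
            out = MACRON[base]
        elif style == "circumflex":
            out = CIRCUMFLEX[base]
        elif style == "doubling":
            out = base * 2
        elif style == "short":
            out = base
        elif style == "oh":
            if base != "o":
                return pair  # this scheme only touches oo/ou
            out = "oh"
        else:
            raise ValueError(f"unknown style: {style}")
        # Keep a leading capital, as in "Oosaka".
        return out.capitalize() if pair[0].isupper() else out

    return re.sub("aa|ii|uu|ee|ei|oo|ou", render, wapuro, flags=re.IGNORECASE)


print(convert_long_vowels("Toukyou", "macron"))      # Tōkyō
print(convert_long_vowels("Oosaka", "circumflex"))   # Ôsaka
print(convert_long_vowels("sensei", "doubling"))     # sensee
print(convert_long_vowels("Toukyou", "oh"))          # Tohkyoh
```

Going the other way (from macrons back to wāpuro) is not possible without a dictionary, because ō could have been either oo or ou.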
Kana n
- Always use n
Common.
e.g. sankaku, sanpo, senpai
- Sometimes use m
The old Hepburn scheme uses m when the next kana is b- or p-.
e.g. sankaku, sampo, sempai
- Use n’ (apostrophe)
Either always use n’, or drop the apostrophe in non-ambiguous cases.
e.g. san’kaku/sankaku, san’po/sanpo, sen’pai/senpai, ren’ai, Jun’ichi
- Always use nn
This is an artifact from input method editors (IMEs) and is never used in writing because it causes massive confusion with popular practices.
e.g. sannkaku, sannpo, sennpai, rennai, Junnichi
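The choices above can be sketched as a small Python function. This is only an illustration: the input is a list of already-romanized kana in which the placeholder "N" stands for ん (my own convention for this example), and I am assuming that the "ambiguous" cases are exactly those where the next kana starts with a vowel or y. The code uses the ASCII apostrophe rather than the typographic one.

```python
def join_with_n(syllables, style):
    """Join romanized kana, rendering the placeholder "N" (ん) per style."""
    out = []
    for i, syl in enumerate(syllables):
        if syl != "N":
            out.append(syl)
            continue
        nxt = syllables[i + 1] if i + 1 < len(syllables) else ""
        # Assumption: n + vowel or n + y could be misread as a single kana.
        ambiguous = nxt[:1] in ("a", "i", "u", "e", "o", "y")
        if style == "n":
            out.append("n")
        elif style == "m":
            out.append("m" if nxt[:1] in ("b", "p") else "n")
        elif style == "apostrophe":
            out.append("n'" if ambiguous else "n")
        elif style == "apostrophe_always":
            out.append("n'")
        elif style == "nn":
            out.append("nn")
        else:
            raise ValueError(f"unknown style: {style}")
    return "".join(out)


print(join_with_n(["sa", "N", "po"], "m"))               # sampo
print(join_with_n(["re", "N", "a", "i"], "apostrophe"))  # ren'ai
print(join_with_n(["re", "N", "a", "i"], "n"))           # renai
```

The last line shows why the apostrophe exists: without it, "renai" could equally be read as re-na-i.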
Small tsu
- Double the previous consonant
This is essentially the universally adopted scheme. It generally works well enough but breaks down in some minor, esoteric edge cases, which are discussed later.
- Treat as an individual character
Romanize as xtu, xtsu, or otherwise. This is helpful if the goal is lossless romanization, as described in the “Losslessness” section below.
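Here is a minimal Python sketch of how the two approaches might be combined: double the consonant when there is one to double, and fall back to a standalone spelling otherwise. The placeholder "Q" for っ and the function name `apply_small_tsu` are assumptions of this example, as is the choice to double the first letter of whatever follows.

```python
def apply_small_tsu(syllables, fallback="xtu"):
    """Join romanized kana, rendering the placeholder "Q" (っ)."""
    out = []
    for i, syl in enumerate(syllables):
        if syl != "Q":
            out.append(syl)
            continue
        nxt = syllables[i + 1] if i + 1 < len(syllables) else ""
        if nxt and nxt != "Q" and nxt[0] not in "aiueo":
            out.append(nxt[0])    # double the following consonant
        else:
            out.append(fallback)  # nothing sensible to double
    return "".join(out)


print(apply_small_tsu(["ga", "Q", "ko", "u"]))  # gakkou
print(apply_small_tsu(["Q", "i"]))              # xtui
print(apply_small_tsu(["a", "Q", "na"]))        # anna
```

The last result is spelled the same as あんな when ん is written as plain n, which is the kind of edge case where consonant doubling stops being reversible.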
Katakana long vowels
A long vowel in hiragana uses a vowel kana as the second letter (e.g. くう). A long vowel in katakana uses a horizontal mark as the second letter (e.g. クー). This example will be used in all the cases: セーラー
- Double the vowel
e.g. seeraa
- Use a macron or circumflex
e.g. sērā/sêrâ
- Use a hyphen
e.g. se-ra-
- Ignore and treat as short vowel
e.g. sera
- Sidestep the problem and interpret as the original word
e.g. sailor
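The first four options can be derived mechanically from the hyphen form. The following Python sketch shows one way; the function name `render_long_mark` and the style labels are invented for this example, and the input is assumed to be lowercase romaji in which "-" stands for the katakana long mark ー. The fifth option (recovering "sailor") needs a dictionary and is not attempted.

```python
MACRON = str.maketrans("aiueo", "āīūēō")
CIRCUMFLEX = str.maketrans("aiueo", "âîûêô")


def render_long_mark(text, style):
    """Render each "-" that follows a vowel according to the chosen style."""
    out = []
    for ch in text:
        # Anything that is not a long mark after a plain vowel passes through.
        if ch != "-" or not out or out[-1] not in "aiueo":
            out.append(ch)
            continue
        if style == "double":
            out.append(out[-1])
        elif style == "macron":
            out[-1] = out[-1].translate(MACRON)
        elif style == "circumflex":
            out[-1] = out[-1].translate(CIRCUMFLEX)
        elif style == "hyphen":
            out.append("-")
        elif style == "short":
            pass  # drop the mark entirely
        else:
            raise ValueError(f"unknown style: {style}")
    return "".join(out)


print(render_long_mark("se-ra-", "double"))  # seeraa
print(render_long_mark("se-ra-", "macron"))  # sērā
print(render_long_mark("se-ra-", "short"))   # sera
```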
Particles
Three particles in Japanese have a different pronunciation than what their written kana symbol suggests. They are は, へ, を.
- Favor orthography
Romanize as ha, he, wo. This is good for lossless conversions.
- Favor phonetics
Romanize as wa, e, o. This is good for reading romanized text aloud.
- Favor phonetics but preserve wo
Romanize as wa, e, wo. This preserves the distinction between o and wo. Furthermore, を can be pronounced as wo in songs. (But there are a number of other permitted liberalities in song lyric pronunciation, which I will not elaborate on.)
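In code, the three policies amount to a small lookup table. This Python sketch uses names I made up (`PARTICLE_STYLES`, `romanize_particle`) and assumes the caller has already decided that the kana in question is acting as a particle, which is the hard part and is not addressed here.

```python
# Romanization of the three irregular particles under each policy.
PARTICLE_STYLES = {
    "orthographic":     {"は": "ha", "へ": "he", "を": "wo"},
    "phonetic":         {"は": "wa", "へ": "e",  "を": "o"},
    "phonetic_keep_wo": {"は": "wa", "へ": "e",  "を": "wo"},
}


def romanize_particle(kana, style):
    """Look up a particle; the caller must know that it is a particle."""
    return PARTICLE_STYLES[style][kana]


for style in PARTICLE_STYLES:
    print(style, [romanize_particle(k, style) for k in "はへを"])
```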
Spaces
Native Japanese text has no spaces. But in such text, a kana followed by a kanji strongly suggests a word boundary in between, which works well in practice. In rōmaji this is not possible, so spaces are necessary. Furthermore, every language that uses the Latin alphabet uses spaces in normal text.
- Space between words, including particles
This is the most common scheme.
e.g. anata wa ie de gohan wo tabeta ka.
- Space between words, no space before particle
This mimics the spacing used in all-kana Japanese text written for beginners.
e.g. anatawa iede gohanwo tabetaka.
- Space between each kana
Possibly helpful for beginners, good for lossless transliteration, but not used in practice.
e.g. a na ta wa i e de go ha n wo ta be ta ka
- No space
This removes the work needed to find word boundaries in the Japanese text, but results in hard-to-read romanized text. However, the romanization is still lossless and unambiguous, since Japanese text does not have spaces to begin with.
e.g. anatawaiedegohanwotabetaka
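Once the text has been split into words, three of these schemes differ only in how the pieces are joined. The Python sketch below assumes the segmentation is already done and that each token is tagged as a particle or not; the function name `join_words` and the scheme labels are hypothetical. The per-kana scheme is left out because it needs kana-level rather than word-level tokens.

```python
def join_words(tokens, scheme):
    """tokens: list of (romaji, is_particle) pairs, already segmented."""
    if scheme == "none":
        return "".join(word for word, _ in tokens)
    if scheme == "all":
        return " ".join(word for word, _ in tokens)
    if scheme == "attach_particles":
        out = []
        for word, is_particle in tokens:
            if is_particle and out:
                out[-1] += word  # glue the particle onto the previous word
            else:
                out.append(word)
        return " ".join(out)
    raise ValueError(f"unknown scheme: {scheme}")


sentence = [("anata", False), ("wa", True), ("ie", False), ("de", True),
            ("gohan", False), ("wo", True), ("tabeta", False), ("ka", True)]
print(join_words(sentence, "all"))               # anata wa ie de gohan wo tabeta ka
print(join_words(sentence, "attach_particles"))  # anatawa iede gohanwo tabetaka
print(join_words(sentence, "none"))              # anatawaiedegohanwotabetaka
```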
Hyphens
Closely related to spacing is hyphenation. Words that have a loose relationship can be joined with a hyphen instead of a space. Examples:
- Numbered items: ichi-ban, ni-chome, san-juu
- Honorific suffixes: -san, -sama, -chan
- Honorific prefixes: o-, go-
Capitalization
Base techniques:
- Everything in lowercase
Simple, easy to read. Popular.
- Hiragana in lowercase, katakana in uppercase
Lossless. This scheme is sometimes used in fan-made song lyric romanizations.
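The second technique can be sketched in Python by checking which script a kana belongs to. The helper names here are my own, and the check relies on the official Unicode character names, so half-width katakana (whose names start with "HALFWIDTH") are not recognized by this simple version.

```python
import unicodedata


def is_katakana(ch):
    """True for full-width katakana, including the long mark ー."""
    return unicodedata.name(ch, "").startswith("KATAKANA")


def case_by_script(kana, romaji):
    """Uppercase the romaji if the source kana is entirely katakana."""
    if all(is_katakana(ch) for ch in kana):
        return romaji.upper()
    return romaji.lower()


print(case_by_script("ニ", "ni"))  # NI
print(case_by_script("に", "ni"))  # ni
```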
Further considerations:
- Capitalize the first word in a sentence
Just like in European languages.
e.g. Kore ga watashi no go-shujin-sama desu.
- Capitalize proper nouns (names, etc.).
That is, capitalize the names of people/places/companies/products/publications/etc.
e.g. Mary-san to Yuki-san wa senshuu London e ikimashita.
- Title case for titles
Capitalize every word, or only capitalize significant words?
e.g. Pikachu wa Genki Desu ne
Foreign words
Foreign/loan words are usually written in katakana, and phonetically approximate the word’s pronunciation in the original language. The vast majority of loanwords in use are from English.
- Romanize them as kana
Systematic but extremely ugly, and hard to recognize even for people who can read English.
- Convert back to original spelling
May require context and interpretation, may have ambiguity.
Small kana
- Small vowels
Treat them as big kana, except for well known cases
- Use the pseudo-consonant “x”
e.g. xa, xyu, xtsu
Non-standard dakuten, han-dakuten
Occasionally, usually for silly emphatic effect, a dakuten (゛) or han-dakuten (゜) is placed on a kana that does not normally take such a diacritic. For example, ま゛. I cannot think of any reasonable way to romanize this, other than to ignore the (han-)dakuten and use the base kana.
Losslessness
Mathematically speaking, romanization can be thought of as a function that maps a sequence of kana letters to a sequence of Latin letters.
It’s easy to create a romanization scheme that is lossless (injective/one-to-one), in the sense that each distinct sequence of kana gives a different sequence of rōmaji. But such a scheme wouldn’t be completely pretty, since it would probably use Nihon-shiki for simplicity, have a space between each kana, and use capitalization to distinguish hiragana and katakana.
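One practical way to look for violations of this property is to run a scheme over a set of kana strings and report any two inputs that produce the same output. The Python sketch below does that; `find_collisions` and the two tiny tables are illustrative assumptions, not a full scheme. A finite test like this can only find collisions; an empty result does not prove the scheme is lossless.

```python
from collections import defaultdict


def find_collisions(romanize, inputs):
    """Group inputs by output and keep the groups with more than one member."""
    seen = defaultdict(list)
    for kana in inputs:
        seen[romanize(kana)].append(kana)
    return {latin: kanas for latin, kanas in seen.items() if len(kanas) > 1}


# Two tiny single-kana "schemes", just enough to show the difference.
NIHON = {"じ": "zi", "ぢ": "di", "ず": "zu", "づ": "du"}
HEPBURN = {"じ": "ji", "ぢ": "ji", "ず": "zu", "づ": "zu"}

samples = list(NIHON)
print(find_collisions(NIHON.get, samples))    # {}
print(find_collisions(HEPBURN.get, samples))  # {'ji': ['じ', 'ぢ'], 'zu': ['ず', 'づ']}
```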
Here are some contrived edge cases to consider if your goal is to design a lossless romanization scheme:
- っい (xtu i)
- あっな (a xtu na)
- てっっと (te xtu xtu to)
- しゃゃゅょさゃ (si xya xya xyu xyo sa xya)
- とおとう (to o to u)
- ニーニイニィ (NI - NI I NI XI)
- にーにいにぃ (ni - ni i ni xi)
- が゜お゛ (ga ° o ")