How CJKI's resources are used - The CJK Dictionary Institute

Table of Contents
Overview 概要
Arabic عربي
Chinese 汉语
Japanese 日本語
Korean 한국어
Multilingual 多言語
The CJK Dictionary Institute
日中韓辭典研究所
The CJK Dictionary Institute, Inc. (CJKI) specializes in CJK and Arabic computational lexicography.
The institute creates and maintains CJK (Chinese, Japanese and Korean) and Arabic lexical databases
currently covering approximately 24 million entries. Located in Saitama, Japan, CJKI is headed by Jack
Halpern, editor-in-chief of the world-renowned New Japanese-English Character Dictionary and of
various other CJK dictionaries.
CJKI plays a leading role in helping the IT industry penetrate the lucrative East Asian market by
providing software developers with high quality dictionary data. This includes comprehensive databases
of general vocabulary, proper nouns and technical terms for CJK languages, including Chinese dialects
such as Cantonese and Hakka. CJKI also maintains databases and romanization systems of Arabic
proper nouns, a large-scale Spanish-English dictionary, and various multilingual databases of proper
nouns and geographic data.
CJKI has become one of the world's prime sources for CJK lexical resources. It is contributing to CJK
and Arabic information processing technology by providing high-quality lexical resources and
professional consulting services to some of the world's leading software developers and IT companies,
including Fujitsu, Sharp, Sony, IBM, Google, Microsoft, Yahoo, Amazon and Baidu.
How CJKI's resources are used
CJKI's team of professional editors and software engineers use advanced computational lexicography
methods to compile and maintain comprehensive lexical databases and dictionaries that include a
variety of features for a broad gamut of applications, such as:
• Natural language processing applications such as information retrieval tools, search engine
  technology and morphological analyzers
• Anti-money laundering and fraud detection
• Security applications such as criminal watch lists
• CJK input method editors (front-end processors)
• Machine translation and online translation tools
• Speech technology applications, both text-to-speech and automatic speech recognition
• Geographical data for multilingual maps, machine translation and tokenization
• Conversion between Simplified and Traditional Chinese
• Electronic dictionaries for desktop and mobile platforms
• Pedagogical, linguistic and computational lexicography research
• Transcription and transliteration applications
• Data cleansing
Doing business with CJKI
CJKI has a flexible business model that is decided on a case-by-case basis to suit the convenience of
the customer. We are not "resellers," nor are we "data vendors" -- we are a linguistic institute, and
create the data ourselves based on several decades of experience and extensive knowhow of CJK and
Arabic lexicography.
Our fundamental policy is to customize our databases to the specific requirements of the customer at
no extra charge. To achieve this, we study our customers' needs in-depth and prepare a data package
that meets the customer's precise needs. We also build custom databases from scratch. We have
extensive experience in putting together teams to compile large-scale dictionaries in a short period of
time, using our sophisticated tools for automating the compilation process, which significantly reduces
costs to the customer.
It is important to note that the benefits of working with CJKI go well beyond cost. We are flexible in
matters of format, delivery dates and business model, work hard to gain an in-depth understanding of
the customer's needs, and provide excellent service that includes a reasonable amount of free technical
and linguistic consulting as well as free minor upgrades. Licensing data from CJKI is not merely
"buying" data -- it is entering into a close relationship that ensures constant advice, technical/linguistic
support, upgrades, and reasonable fees.
CJKI's Lexical Resources
CJKI's extensive CJK and Arabic lexical resources currently cover approximately 24 million entries,
used by major portals and software developers in a wide variety of applications. Our main resources
include:
• Bilingual dictionaries
• Multilingual dictionaries
• Arabic personal names
• Proper nouns and geographical data
• Technical terminology
• Monolingual lexical databases
• Phonetic and phonological databases
• Mapping tables for Chinese conversion
• Morphological databases
• Lexical databases
In addition to the resources described above, CJKI has developed resources containing millions
more entries for the following:
• Arabic transcription and vocalization systems
• Arabic, Japanese, and Spanish-English full-form lexicons*
• Arabic place names
• Databases for input method editors
• Frequency statistics based on web and corpora
• Database for CJK IMEs
* "Full form lexicon" refers to a comprehensive lexical database that includes every single
inflected form (verb conjugations, plurals, etc.) and declined forms (case endings) of a
language. Each full form lexicon contains millions of entries accompanied by a rich set of
grammatical attributes.
Arabic Lexical Resources
موارد معجمية للغة العربية
Arabic, one of the six official languages of the United Nations, is spoken by 246 million
speakers worldwide -- not only in North Africa and the Middle East, but also in many other
countries since it is the language of the Koran. Though Arabic has become a world language
of critical importance, lexical resources, especially for proper nouns, are either scarce or exist
only on a small scale. The CJK Dictionary Institute has been engaged in an intensive effort to
develop comprehensive Arabic lexical databases, with special focus on proper nouns.
Below is a description of some of CJKI’s principal Arabic resources.
Principal Resources
• Database of Arab Names
• Database of Arab Names in Arabic
• Expanded OFAC
• Arabic transcription and romanization systems
• Dictionary of Arabic Place Name Variants
• Dictionary of Arabic Proper Nouns
• Arabic Broken Plurals
• Arabic Transcription and Transliteration
• Arabic Lexical Database
• Arabic Full Form Dictionary
Database of Arab Names
قاعدة بيانات الأسماء العربية
CJKI's comprehensive Database of Arab Names (DAN), which currently covers approximately 6.5
million entries, consists of Arabic personal names and name variants mapped to the original Arabic
script with a large variety of supplementary information. DAN is based on authoritative resources and
has undergone extensive proofreading and expansion based on about 25 million names derived from a
large variety of sources, including websites, corpora, books, dictionaries, phone books, and
encyclopedias.
Key Features
• 6.5 million validated Arabic name variants
• Ideal for security, anti-money laundering, and NLP applications
• Based on over 25 million source names from authoritative resources
• Proofread by native editors trained in Arabic phonology
• Validated against the web and corpora
• Fully vocalized with various variants in Arabic script
• Web-based frequency statistics for each name
• Various romanization systems, such as the official IC standard
• Fully supports OFAC names, their official aliases and unofficial variants
DAN is playing an important role in helping software developers, especially of security applications
and NLP tools, to enhance their technology by enabling named entity recognition and extraction,
machine translation, variant normalization, and information retrieval of Arabic names.
The table below shows a snippet of the 1,100 variants of عبدالعزيز along with their frequency of
occurrence on the web.
Database of Arab Names
ID | SUBID | VARIANT | ARABIC | BUCKWALTER | FREQUENCY
V000010 | 01140 | Abd-Al Azeez | عبدالعزيز | EbdAlEzyz | 0000002000
V000010 | 01141 | Abd-Al Azez | عبدالعزيز | EbdAlEzyz | 0000000118
V000010 | 01142 | Abd-Al Aziez | عبدالعزيز | EbdAlEzyz | 0000000033
V000010 | 01143 | Abd-Al Aziiz | عبدالعزيز | EbdAlEzyz | 0000000016
V000010 | 01144 | Abd-Al Aziz | عبدالعزيز | EbdAlEzyz | 0000064000
V000010 | 01145 | Abd-Alazeez | عبدالعزيز | EbdAlEzyz | 0000000012
V000010 | 01146 | Abd-Alazez | عبدالعزيز | EbdAlEzyz | 0000000114
V000010 | 01147 | Abd-Alaziz | عبدالعزيز | EbdAlEzyz | 0000002000
V000010 | 01148 | Abd-El'azeez | عبدالعزيز | EbdAlEzyz | 0000000052
V000010 | 01149 | Abd-El'azez | عبدالعزيز | EbdAlEzyz | 0000000154
V000010 | 01150 | Abd-El'aziz | عبدالعزيز | EbdAlEzyz | 0000008000
V000010 | 01151 | Abd-El'eziz | عبدالعزيز | EbdAlEzyz | 0000000003
V000010 | 01152 | Abd-El-'Azeez | عبدالعزيز | EbdAlEzyz | 0000002000
V000010 | 01153 | Abd-El-'Azez | عبدالعزيز | EbdAlEzyz | 0000000014
V000010 | 01154 | Abd-El-'Aziiz | عبدالعزيز | EbdAlEzyz | 0000000001
V000010 | 01155 | Abd-El-'Aziz | عبدالعزيز | EbdAlEzyz | 0000024000
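The BUCKWALTER and FREQUENCY fields in the sample can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not CJKI's tooling: the mapping below is a deliberately partial Buckwalter table covering just the seven letters of this one name, the function names are hypothetical, and the variant list is a small excerpt of the sample rows.

```python
# Partial Buckwalter table: only the letters needed for this one name.
# A complete table would cover the whole Arabic alphabet and diacritics.
BUCKWALTER = {
    "\u0639": "E",  # ain
    "\u0628": "b",  # ba
    "\u062F": "d",  # dal
    "\u0627": "A",  # alif
    "\u0644": "l",  # lam
    "\u0632": "z",  # zay
    "\u064A": "y",  # ya
}

# The Arabic name from the sample table, written with escapes
# to avoid bidirectional-text problems in source files.
NAME = "\u0639\u0628\u062F\u0627\u0644\u0639\u0632\u064A\u0632"

# (romanized variant, web frequency) pairs excerpted from the sample table.
VARIANTS = [
    ("Abd-Al Azeez", 2000),
    ("Abd-Al Aziz", 64000),
    ("Abd-El'aziz", 8000),
    ("Abd-El-'Aziz", 24000),
]


def to_buckwalter(arabic: str) -> str:
    """Transliterate letter by letter; raises KeyError outside the partial table."""
    return "".join(BUCKWALTER[ch] for ch in arabic)


def most_frequent(variants):
    """Return the variant with the highest web frequency."""
    return max(variants, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    print(to_buckwalter(NAME))
    print(most_frequent(VARIANTS))
```

Because every romanized variant shares one Arabic form and one Buckwalter key, an application could use that key to group variants and the frequency to pick a preferred display form.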
Database of Arab Names in Arabic
The complexity of the Arabic script gives rise to a variety of Arabic spelling variants and spelling
errors. The CJKI Database of Arab Names in Arabic (DANA) covers several hundred thousand
Arabic script variants and common spelling mistakes, as shown in the table below.
A key feature of DANA is that every Arabic name is normalized and vocalized to produce a database
of error-free, fully sanitized Arabic canonical forms. The vocalization is performed by a team of
editors with the aid of tools and interfaces designed to achieve maximum efficiency. The canonical
forms are used as a basis for creating both accurate romanized variants for DAN and Arabic
orthographic variants for DANA.
Database of Arab Names in Arabic
CANONICAL | VARIANT | FREQUENCY | TYPE
عبدالعزيز | عبدالعزيز | 023102030 | Normal
عبدالعزيز | عبد العزيز | 018868920 | Variant
عبدالعزيز | عبد لعزيز | 000000019 | Error
عبدالعزيز | عبدإلعزيز | 000000010 | Error
عبدالعزيز | عبدلعزيز | 000000003 | Error
عبدالعزيز | عبد ألعزيز | 000000000 | Error
عبدالعزيز | عبد إلعزيز | 000000000 | Error
عبدالعزيز | عبدألعزيز | 000000000 | Error
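To show what mapping a variant to a canonical form can involve, here is a rough Python sketch of rule-based normalization. It is an assumption made for illustration, not how DANA is built: DANA is described above as editor-vocalized, whereas this sketch only folds hamza-carrying alifs to bare alif, removes spaces and tatweel, and drops short-vowel marks. The function name `normalize_key` is hypothetical.

```python
import unicodedata

# Alif with hamza above, hamza below, and madda, all folded to bare alif.
HAMZA_ALIF = {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
TATWEEL = "\u0640"  # elongation character, carries no meaning

CANONICAL = "\u0639\u0628\u062F\u0627\u0644\u0639\u0632\u064A\u0632"


def normalize_key(name: str) -> str:
    """Build a lookup key by removing purely orthographic differences."""
    out = []
    for ch in name:
        if ch.isspace() or ch == TATWEEL:
            continue
        if unicodedata.combining(ch):  # short-vowel marks and other diacritics
            continue
        out.append(HAMZA_ALIF.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    spaced = "\u0639\u0628\u062F \u0627\u0644\u0639\u0632\u064A\u0632"
    hamza = "\u0639\u0628\u062F\u0625\u0644\u0639\u0632\u064A\u0632"
    missing_alif = "\u0639\u0628\u062F\u0644\u0639\u0632\u064A\u0632"
    for variant in (spaced, hamza, missing_alif):
        print(normalize_key(variant) == CANONICAL)
```

The spaced and hamza variants collapse to the canonical form, but the variant with a missing alif does not. Rules of this kind cannot recover dropped letters, which is one reason a curated table of attested errors remains useful.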
Expanded OFAC
The US government's watch lists have come under fire from members of Congress as being "crippled
by technical flaws." One of the major factors behind these assertions is the inability to correctly
identify and process the numerous variants of names appearing in the Specially Designated Nationals
(SDN) list maintained by The Office of Foreign Assets Control (OFAC).
To address these shortcomings, CJKI has drawn on its linguistic and technical resources to develop a
comprehensive Expanded OFAC database (XOFAC) of OFAC full name variants, the vast majority of
which are not listed in OFAC. Containing millions of potential and actual variants of the Arab names
in OFAC's SDN List, XOFAC is ideal for those agencies and institutions that require maximum recall
in their compliance and watch list filtering applications. In contrast to DAN, XOFAC consists of
variants of full names only; in other words, names of actual and potential individuals and their variants.
For example, the table below lists the top 15 out of about 130,000 actual and potential variants of the
OFAC name Hatim Ahmad BARAKAT. The table has the following fields:
Rank: relative ranking based on component frequencies
Variant: variants of OFAC name, mostly not appearing in the OFAC list
Freq1: frequency of occurrence on the web of Hatim variants
Freq2: frequency of occurrence on the web of Ahmad variants
Freq3: frequency of occurrence on the web of Barakat variants
Expanded OFAC
RANK | VARIANT | FREQ1 | FREQ2 | FREQ3
1 | Hatem Ahmed Barakat | 001580000 | 039000000 | 001180000
2 | Hatim Ahmed Barakat | 000925000 | 039000000 | 001180000
3 | Hatem Ahmed Bereket | 001580000 | 039000000 | 000651000
4 | Hatem Ahmad Barakat | 001580000 | 025600000 | 001180000
5 | Khadem Ahmed Barakat | 000194000 | 039000000 | 001180000
6 | Hatem Ahmed Bareket | 001580000 | 039000000 | 000057200
7 | Hattem Ahmed Barakat | 000180000 | 039000000 | 001180000
8 | Hatem Ahmed Berekat | 001580000 | 039000000 | 000033400
9 | Hatem Ahmet Barakat | 001580000 | 018400000 | 001180000
10 | Hadim Ahmed Barakat | 000114000 | 039000000 | 001180000
11 | Hatem Ahmed Baraket | 001580000 | 039000000 | 000016300
12 | Hatam Ahmed Barakat | 000081300 | 039000000 | 001180000
13 | Hatem Ahmed Barakaat | 001580000 | 039000000 | 000014300
14 | Hatem Achmed Barakat | 001580000 | 039000000 | 001180000
15 | Hetem Ahmed Barakat | 000065600 | 039000000 | 001180000
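The idea of building full-name variants out of component variants can be sketched in Python as follows. This is a hedged illustration only: the component frequencies are copied from the sample table, but the scoring function (a plain product of the three frequencies) is an assumption introduced here. The document does not state how XOFAC derives its rank, and this product does not reproduce the published order exactly.

```python
from itertools import product
from math import prod

# Component variants with web frequencies, excerpted from the sample table.
GIVEN = {"Hatem": 1_580_000, "Hatim": 925_000, "Khadem": 194_000}
MIDDLE = {"Ahmed": 39_000_000, "Ahmad": 25_600_000}
FAMILY = {"Barakat": 1_180_000, "Bereket": 651_000}


def full_name_variants(*components):
    """Return (full name, component frequencies) for every combination."""
    rows = []
    for combo in product(*(c.items() for c in components)):
        name = " ".join(variant for variant, _ in combo)
        freqs = tuple(freq for _, freq in combo)
        rows.append((name, freqs))
    return rows


def score(freqs):
    """Hypothetical score: product of component frequencies (an assumption)."""
    return prod(freqs)


ranked = sorted(
    full_name_variants(GIVEN, MIDDLE, FAMILY),
    key=lambda row: score(row[1]),
    reverse=True,
)

if __name__ == "__main__":
    for rank, (name, freqs) in enumerate(ranked[:5], start=1):
        print(rank, name, *freqs)
```

Three, two and two component variants already yield twelve full names, which shows how a single OFAC entry can expand to the roughly 130,000 actual and potential variants mentioned above.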
Database of Arabic Business Names
The Database of Arabic Business Names (DABNA) is a large-scale database of Arabic company
and organization names and addresses in the Arabic script, along with their romanized and/or English
equivalents, web frequencies, and other attributes. Company names play an important role in Arabic
natural language processing applications including named entity recognition (NER), machine translation
(MT), and morphological analysis (MA), as well as in a variety of business intelligence applications and
security applications such as watch list querying. To meet this need, a team of editors trained in Arabic
name processing and Arabic phonology continuously expands and revises DABNA to keep it up to date.
The sample below shows some Egyptian business names (companies and addresses), along with their
English equivalents.
BUSINESS NAME | DISTRICT | ADDRESS
جلال إسماعيل مراد | القبه | 57 ش الخليفه المامون
Jalal Isma'il Murad | al-Qubbah | 57 Caliph al-Ma'mun St.
أحمد شكرى مصطفى | الماظه | 34 شارع العروبه شقه 3
Ahmad Shakri Mustafa | Almaza | 34 al-Orouba Street Flat 3
مكتبه طلعت حرب | الروضه | ش السيده نفيسه
Tal'at Harb Bookshop | Rhoda Island | al-Sayyidah Nafisah St.
طلعت حرب مول | رمسيس | 30 ش طلعت حرب وسط البلد
Tal'at Harb Mall | Ramses | 30 Tal'at Harb St. Downtown
أسامة و هانى | الماظه | 36 ش النزهه مصر الجديده
Usamah and Hani | Almaza | 36 Nozha St. Heliopolis
أبو شقرة | الروضه | 69 شارع القصر العينى
Abu-Shaqrah | Rhoda Island | 69 al-Qasr al-'Ayni Street
مون ارش للاثاث والديكور | الدقى | 50 ش نادى الصيد الدقى
Moon Arch | Dokki | 50 Nadi al-Sayd St. Dokki
حسين يوسف | المعادى | جراند مول المعادى
Husayn Yusuf | Maadi | Grand Mall Maadi
حسين و على | الماظه | 17 ش بغداد مصر الجديده
Husayn and 'Ali | Almaza | 17 Baghdad St. Heliopolis
شريف محمد طلعت الغنيمى | باب اللوق | 32 ش الفلكى
Sharif Muhammad Tal'at al-Ghanimi | Bab al-Louq | 32 al-Falaki St.
Database of Foreign Names in Arabic
The Database of Foreign Names in Arabic (DAFNA) is a large-scale database of non-Arab names
written in the Arabic script, along with their romanized variants, web frequencies, and other attributes.
Personal names play an important role in Arabic natural language processing applications including
named entity recognition (NER), machine translation (MT), and morphological analysis (MA), as well as
in a variety of business intelligence applications and security applications such as watch list querying.
To meet this need, DAFNA is continuously being expanded and revised by a team of editors trained in
Arabic name processing and Arabic phonology.
The sample below shows orthographic variants and spelling errors of a common American given name
(John), and a common American surname (Davis). The original American name data was obtained
from the U.S. Census Bureau.
Database of Foreign Names in Arabic
ENGLISH | ARABIC | TYPE | WEB FREQ (English+Arabic) | WEB FREQ (Arabic only)
John | جوون | M | 0036500 | 0044500
John | جون | M | 0032700 | 0947000
John | جان | M | 0031300 | 2160000
John | جوهان | M | 0000224 | 0007090
John | جوهن | M | 0000173 | 0001180
John | دجون | M | 0000029 | 0001680
John | جهون | M | 0000009 | 0000328
Davis | ديفيس | S | 0000613 | 0012300
Davis | دافيس | S | 0000249 | 0001680
Davis | ديفز | S | 0000228 | 0002300
Davis | ديفس | S | 0000157 | 0002020
Davis | دايفس | S | 0000040 | 0000652
Davis | دفيس | S | 0000034 | 0000490
Davis | دفيز | S | 0000005 | 0000098
Chinese Lexical Resources
汉语词汇资源
CJKI’s comprehensive Chinese lexical resources currently include over four million entries, covering
general vocabulary, technical terminology, proper nouns, company and organization names, and others,
in both Simplified Chinese (SC) and Traditional Chinese (TC), used in such applications as
machine translation (MT), information retrieval (IR) and input method editors (IME). They include a
rich set of grammatical, phonological and semantic attributes, including pinyin and zhuyin readings,
part-of-speech codes, frequency of occurrence statistics, and others.
Below is a description of CJKI’s principal Chinese lexical resources.
Principal Resources
• Simplified Chinese-English Dictionary
• English-Simplified Chinese Dictionary
• Chinese-English Database of Proper Nouns
• Database of Chinese Name Variants
• Chinese Dictionary of Computer Terms
• Chinese Lexical Database
• Chinese Pinyin Database
• Chinese to Chinese Conversion
• Hanzi-Pinyin Transcription System
• Chinese-English Technical Terms
• Chinese Morphological Database
• Chinese Lexical Frequency Statistics
• English-Traditional Chinese Dictionary
• Chinese IME Databases
Chinese-English Dictionary
汉英词典简体版
CJKI's Simplified Chinese-English Dictionary (SCED) is the most comprehensive Chinese
dictionary available today. Covering over 700,000 entries of general vocabulary, technical terms,
important proper nouns and example sentences, SCED was compiled in collaboration with
lexicographers from a leading Chinese university on the basis of the most authoritative dictionaries
published in China. This dictionary, which is without peer, has undergone extensive proofreading and
validation by a team of native Chinese editors. It is ideally suited for:

• Machine translation dictionaries
• Cross-language information retrieval
• Handheld electronic dictionaries
• Mobile device applications
Simplified Chinese-English Dictionary
CHINESE | POS | PINYIN | ENGLISH
国丧 | N | guósāng | national mourning
国乐 | N | guóyuè | traditional Chinese music
国书 | N | guóshū | letter of credence; credentials; letter of commission
国书付本 | N | guóshū fùběn | copy of credentials
国产设备 | N | guóchǎn shèbèi | home equipment
国产货 | N | guóchǎnhuò | domestic products; domestic goods
国产轿车 | N | guóchǎn jiàochē | domestic-made cars
国产影片 | N | guóchǎn yǐngpiàn | Chinese film
国产税 | N | guóchǎnshuì | excise duties
国产装备 | N | guóchǎn zhuāngbèi | Chinese-made equipment
国产品 | N | guóchǎnpǐn | home products; national products; domestic products
国优产品 | N | guóyōu chǎnpǐn | national quality product
国债 | N | guózhài | national debt; government loan; public debt; national bonds
English-Chinese Dictionary
英汉词典简体版
CJKI’s English-Simplified Chinese Dictionary (ESCD) covers about 100,000 entries including
general vocabulary and important proper names. Optimized for the convenience of users of electronic
dictionaries, ESCD has just the right amount of detail: enough equivalents to give an in-depth
understanding, yet short enough not to clutter up the screen. ESCD is used in such well-known
translation tools as Babylon and Quicktionary, as well as on various mobile platforms around the
world including TangoTown in Japan and Australia.
English-Simplified Chinese Dictionary
ENGLISH | POS | CHINESE
canoe | v. tr. | 用独木舟载运
cay | n. | 岩礁, 沙洲, 珊瑚礁
cheddar | n. | 切德干酪
clad | v. tr. | 电镀
codger | n. | 怪人; 有怪癖的人
Comoros | NP | 科摩罗
congruence | n. | 适合, 相合性, 一致
couch | v. intr. | 躺着; 埋伏; 蹲着
crew cut | n. | 平头发式
cutthroat | adj. | 残酷的; 杀人的
decimal | n. | 小数
depersonalize | v. tr. | 使失去人性
dial | v. tr. | 调; 拨; 收听, 收视; 打电话给
discrepancy | n. | 相差; 矛盾; 差异
divide | n. | 分歧, 不和; 分水岭
dramatize | v. intr. | 戏剧化; 可改编成剧本; 举止夸张
emotional | adj. | 情绪的; 情感的
An English-Traditional Chinese Dictionary is also available.
Chinese↔English Proper Nouns
汉英专有名词数据库
CJKI's Chinese↔English Database of Proper Nouns (CEP) is very comprehensive, covering
millions of entries in both Simplified and Traditional Chinese. It includes various data fields such as
pinyin, zhuyin, frequency rankings, classification codes, locale codes and English equivalents. Included
are a large variety of both Chinese and non-Chinese name types, such as:

• Place names
• Personal names (surnames and given names)
• Companies and organizations
• Facilities and points of interest
• Western personal and place names
• Miscellaneous, such as periodicals and abbreviations
Chinese Personal Names
TYPE | SIMPLIFIED CHINESE | TRADITIONAL CHINESE | PINYIN
G | 桂花 | 桂花 | guìhuā
S | 鄂 | 鄂 | è
G | 尔和 | 爾和 | ěrhé
S | 戚 | 慼 | qī
G | 联谊 | 聯誼 | liányì
G | 亚军 | 亞軍 | yàjūn
G | 军营 | 軍營 | jūnyíng
GS | 耽 | 耽 | dān
G | 嗣 | 嗣 | sì
G | 耕耘 | 耕耘 | gēngyún
G | 庄稼 | 莊稼 | zhuāngjia
G | 和全 | 和全 | héquán
S | 侬 | 儂 | nóng
G | 之遥 | 之遙 | zhīyáo
S | 刁 | 刁 | diāo
S | 司马 | 司馬 | sīmā
G | 津津 | 津津 | jīnjīn
Chinese and non-Chinese Place Names
ENGLISH | SC | L/O | TC | PINYIN | ZHUYIN
Aruba | 阿鲁巴 | L | 阿盧巴 | ālǔbā | ㄚㄌㄨˊㄅㄚ
Azerbaijan | 阿塞拜疆 | L | 亞塞拜然 | āsāibàijiāng | ㄧㄚˋㄙㄜˋㄅㄞˋㄖㄢˊ
Brasilia | 巴西利亚 | O | 巴西利亞 | bāxīlìyà | ㄅㄚㄒㄧㄌㄧˋㄧㄚˋ
Caracas | 加拉加斯 | L | 卡拉卡斯 | jiālājiāsī | ㄎㄚˇㄌㄚㄎㄚˇㄙ
Cairo | 开罗 | O | 開羅 | kāiluó | ㄎㄞㄌㄨㄛˊ
Chad | 乍得 | L | 查德 | zhàdé | ㄔㄚˊㄉㄜˊ
Fukuoka | 福冈 | O | 福岡 | fúgāng | ㄈㄨˊㄍㄤ
Georgia | 乔治亚 | O | 喬治亞 | qiáozhìyà | ㄑㄧㄠˊㄓˋㄧㄚˋ
Guinea | 几内亚 | O | 幾內亞 | jǐnèiyà | ㄐㄧˇㄋㄟˋㄧㄚˋ
Haiyan | 海盐 | O | 海鹽 | hǎiyán | ㄏㄞˇㄧㄢˊ
Hanyang | 汉阳 | O | 漢陽 | hànyáng | ㄏㄢˋㄧㄤˊ
Heshan | 鹤山 | O | 鶴山 | hèshān | ㄏㄜˋㄕㄢ
Huailai | 怀来 | O | 懷來 | huáilái | ㄏㄨㄞˊㄌㄞˊ
Ireland | 爱尔兰 | O | 愛爾蘭 | ài'ěrlán | ㄞˋㄌㄧㄣˊ
L : lexemic mapping O : orthographic mapping
(see Chinese Dictionary of Computer Terms for detail)
Chinese Name Variants
汉语人名罗马字异形数据库
The number of Chinese personal names and their variants is very large -- in the millions -- which
makes it difficult to identify them and process them. Named Entity Recognition (NER) technology is a
hot topic in computational linguistics. To enhance NER technology, CJKI maintains databases of
several million CJK and Arabic name variants in all major and most minor romanization systems.
There are several well-established systems for romanizing Chinese, such as Hanyu Pinyin, Wade-Giles,
Yale, and Tongyong Pinyin, as well as various popular ones and many older ones that have fallen out of
use. Chinese has seven major dialect groups and another four minor ones. The CJKI Database of
Chinese Name Variants (CNV) includes Chinese personal names in all the standard and dialectal
Chinese romanization systems, covering all the major dialects, including Cantonese, Hakka and
Hokkien, along with classification codes and frequency of occurrence statistics.
Chinese Name Variants
CHINESE | PINYIN | ZHUYIN | ENGLISH | TONGYONG | YALE | WADE-GILES | VARIANTS
百欣 | bǎixīn | ㄅㄞˇㄒㄧㄣ | Baixin | Baisin | Baisyin | Paihsin | Paisin
白 | bái | ㄅㄞˊ | Bai | Bai | Bai | Pai |
北强 | běiqiáng | ㄅㄟˇㄑㄧㄤˊ | Beiqiang | Beiciang | Beichyang | Peich'iang | Peits'iêng, Peichiang, Peitsiêng
炳章 | bǐngzhāng | ㄅㄧㄥˇㄓㄤ | Bingzhang | Bingjhang | Bingjang | Pingchang |
宝程 | bǎochéng | ㄅㄠˇㄔㄥˊ | Baocheng | Baocheng | Baucheng | Paoch'eng | Paocheng
爱华 | àihuá | ㄞˋㄏㄨㄚˊ | Aihua | Aihua | Aihwa | Aihua | Ngaihua
伯芝 | bózhī | ㄅㄛˊㄓ | Bozhi | Bojhih | Bwojr | Pochih |
长流 | chángliú | ㄔㄤˊㄌㄧㄡˊ | Changliu | Changliou | Changlyou | Ch'angliu |
邦达 | bāngdá | ㄅㄤㄉㄚˊ | Bangda | Bangda | Bangda | Pangta |
曹 | cáo | ㄘㄠˊ | Cao | Cao | Tsau | Ts'ao |
冰晓 | bīngxiǎo | ㄅㄧㄥㄒㄧㄠˇ | Bingxiao | Bingsiao | Bingsyau | Pinghsiao | Pingsiao
百成 | bǎichéng | ㄅㄞˇㄔㄥˊ | Baicheng | Baicheng | Baicheng | Paich'eng | Paicheng
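The relationship between the PINYIN and ENGLISH columns in the sample can be sketched in a few lines of Python. This is an assumption about how such a plain form could be derived, not a description of CJKI's process: it decomposes each vowel, drops the combining tone mark, and capitalizes the result. The function names are hypothetical.

```python
import unicodedata


def strip_tones(pinyin: str) -> str:
    """Remove tone marks by decomposing and dropping combining characters."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def plain_romanization(pinyin: str) -> str:
    """Toneless, capitalized form as in the ENGLISH column of the sample."""
    return strip_tones(pinyin).capitalize()


if __name__ == "__main__":
    for reading in ("bǎixīn", "běiqiáng", "àihuá", "cáo"):
        print(reading, plain_romanization(reading))
```

One limitation of this sketch: the diaeresis on ü is also a combining mark once decomposed, so a reading such as lǚ would come out as "Lu". The other systems in the table (Tongyong, Yale, Wade-Giles) change the spelling of initials and finals and need per-system mapping tables rather than simple mark removal.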
Chinese↔English Computer Terms
英汉计算机术语词典
CJKI's Chinese↔English Dictionary of Computer Terms (ECCT) is an English-Chinese/Chinese-English
dictionary containing about 100,000 Simplified Chinese (SC) and 100,000 Traditional
Chinese (TC) entries, including acronyms. This dictionary covers both SC, used in the People's
Republic of China and Singapore, and TC, used in Taiwan, Hong Kong and among overseas Chinese.
It has several features that distinguish it from any other Chinese computer dictionary available today.
• Covers about 100,000 entries selected on the basis of frequency statistics.
• Constantly updated and expanded to include recent terms.
• Contains more than 10,000 acronyms cross-referenced to the expanded forms.
• Linguistically accurate TC equivalents (explained below).
The above features make this dictionary an invaluable tool for translators and for use in various IT
applications such as information retrieval, machine translation, and input method editors.
The TC in ECCT is not merely a code-conversion of the SC, but has been carefully proofread to
ensure accuracy on both the orthographic and lexemic levels. An example of orthographic conversion,
marked "O" in the table below, is 目录 'directory' converted to 目錄. An example of lexemic
conversion, marked "L" in the table below, is SC 计算机 'computer' converted not to 計算機 but to 電腦 in TC.
Chinese-English Computer Terms
ENGLISH | SIMPLIFIED | TRADITIONAL | TYPE
file | 文件 | 檔案 | L
Internet | 因特网 | 網際網路 | L
program | 程序 | 程式 | L
CD-ROM | 光盘 | 光碟 | L
information | 信息 | 資訊 | L
computer network | 计算机网络 | 電腦網路 | L
modulator/demodulator | 调制解调器 | 調變解調器,數據機 | L
modem | 调制解调器 | 調變解調器,數據機 | L
computer software | 计算机软件 | 電腦軟體 | L
database | 数据库 | 資料庫 | L
flowcharting | 流程图编制 | 繪製流程圖 | L
expert system | 专家系统 | 專家系統 | O
directory | 目录 | 目錄 | O
Chinese Lexical Database
汉语词汇数据库
The CJKI Chinese Lexical Database (CLD) is a comprehensive monolingual lexical database of
Chinese consisting of the Simplified Chinese Lexical Database (CLD-SC) and the Traditional Chinese
Lexical Database (CLD-TC) modules. Developed by CJKI’s team of experienced Chinese editors and
linguists over many years, the CLD is a significant contribution to the field of Chinese lexicography.
CLD is especially suitable for applications in the fields of information retrieval, morphological analysis,
machine translation and various natural language processing (NLP) applications, and is being used by
various IT companies to enhance their Chinese segmentation technology.
Chinese Lexical Database
POS
TYPE CHINESE
PINYIN
RANK
WEB RANK
NP
G
东霞
dōngxiá
C
000205863
NP
P
东会村
dōnghuìcūn
C
000331481
东海
dōnghǎi
A
000009255
NC
NP
G
东海
dōnghǎi
A
000009255
NP
P
东海
dōnghǎi
A
000009255
NP
P
东海县
dōnghǎixiàn
C
000078031
E
东海扬尘 dōnghǎiyángchén
C
000263750
E
东海捞针 dōnghǎilāozhēn
C
000124028
U
东海舰队 dōnghǎijiànduì
E
东海桑田 dōnghǎisāngtián
C
000090763
000064698
NP
Oe
东海大学 dōnghǎidàxué
C
000069472
NP
P
东外大街 dōngwàidàjiē
C
000166158
东郭
dōngguō
C
000069927
东郭
dōngguō
C
000069927
E
东郭先生 dōngguōxiānshēng C
000101330
NC
东郭履
dōngguōlǚ
C
000267748
C
000234655
NC
NP
NP
S
P
NC
NP
G
东革新里 dōnggéxīnlǐ
东岳
dōngyuè
C
000065236
东岳
dōngyuè
C
000065236
Chinese Pinyin Database
汉语拼音数据库
The CJKI Chinese Pinyin Database (CPD) contains several million Simplified Chinese (SC) and
Traditional Chinese (TC) headwords covering general vocabulary, technical terms, and proper nouns.
Each lexeme is accompanied by pinyin readings for SC and both pinyin and zhuyin (not shown here)
for TC. What is especially noteworthy is that the pinyin/zhuyin readings take into account the
differences in pronunciation between Taiwan and the PRC, as shown in the table below. Even highly
educated native Chinese speakers are often surprised to discover that such differences exist.
An important feature of this database is its high accuracy, and explicit indication of the neutral tone,
which is often ignored by conventional dictionaries. The data can be provided in all the major
transcription systems such as Yale, Wade-Giles, and Tongyong Pinyin. An IPA edition, especially useful
for speech technology applications such as TTS, is now under development.
The Diff field below indicates whether pairs of SC-TC equivalents have identical pinyin. "D" indicates
that pinyin is different; "S" indicates that pinyin is the same.
Chinese Pinyin Database
DIFF | SC HANZI | SC FREQUENCY | SC PINYIN | TC HANZI | TC FREQUENCY | TC PINYIN
D | 临期 | 0000029000 | línqī | 臨期 | 0000028800 | línqí
D | 企业 | 0163000000 | qǐyè | 企業 | 0102000000 | qìyè
D | 倬雄 | 0000000167 | zhuōxióng | 倬雄 | 0000000167 | zhuóxióng
S | 咖啡豆 | 0000779000 | kāfēidòu | 咖啡豆 | 0000779000 | kāfēidòu
D | 危险 | 0022400000 | wēixiǎn | 危險 | 0003080000 | wéixiǎn
D | 埒城 | 0000000411 | lièchéng | 埒城 | 0000000411 | lèchéng
D | 夕日 | 0002020000 | xīrì | 夕日 | 0002020000 | xìrì
D | 大期 | 0000061500 | dàqī | 大期 | 0000061500 | dàqí
D | 帆柱 | 0000030600 | fānzhù | 帆柱 | 0000030600 | fánzhù
D | 微笑 | 0018400000 | wēixiào | 微笑 | 0018400000 | wéixiào
S | 无着 | 0000265000 | wúzhuó | 無著 | 0000265000 | wúzhuó
D | 咖喱粉 | 0000087400 | gālífěn | 咖喱粉 | 0000087400 | kālǐfěn
D | 昔日 | 0004880000 | xīrì | 昔日 | 0004880000 | xírì
D | 显微镜 | 0003390000 | xiǎnwēijìng | 顯微鏡 | 0000228000 | xiǎnwéijìng
D | 期待 | 0059100000 | qīdài | 期待 | 0059100000 | qídài
D | 咖喱饭 | 0000122000 | gālífàn | 咖喱飯 | 0000122000 | kālǐfàn
D | 池穴 | 0000059400 | chíxué | 池穴 | 0000059400 | chíxuè
D | 理发 | 0002170000 | lǐfà | 理髮 | 0000495000 | lǐfǎ
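As a small worked example, the Diff field can be recomputed from the two pinyin columns. The Python sketch below is an illustration under assumptions, not CJKI's code: it compares the readings after Unicode normalization (so precomposed and decomposed tone marks count as equal) and adds a hypothetical helper, `tone_only`, that reports whether a "D" pair differs in tone alone.

```python
import unicodedata


def strip_tones(pinyin: str) -> str:
    """Drop tone marks so that only the syllable spelling remains."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diff_flag(sc_pinyin: str, tc_pinyin: str) -> str:
    """Return "S" if the readings are identical, otherwise "D"."""
    same = (unicodedata.normalize("NFC", sc_pinyin)
            == unicodedata.normalize("NFC", tc_pinyin))
    return "S" if same else "D"


def tone_only(sc_pinyin: str, tc_pinyin: str) -> bool:
    """True when the readings differ, but only in their tone marks."""
    return (diff_flag(sc_pinyin, tc_pinyin) == "D"
            and strip_tones(sc_pinyin) == strip_tones(tc_pinyin))


# (SC hanzi, SC pinyin, TC hanzi, TC pinyin) rows taken from the sample table.
ROWS = [
    ("企业", "qǐyè", "企業", "qìyè"),
    ("咖啡豆", "kāfēidòu", "咖啡豆", "kāfēidòu"),
    ("咖喱粉", "gālífěn", "咖喱粉", "kālǐfěn"),
]

if __name__ == "__main__":
    for sc, sc_py, tc, tc_py in ROWS:
        print(sc, tc, diff_flag(sc_py, tc_py), tone_only(sc_py, tc_py))
```

In the sample, 企业 differs between the PRC and Taiwan in tone only, while 咖喱粉 also differs in its initial consonant, a distinction that matters for speech applications.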
Chinese-to-Chinese Conversion
中文简繁转换
A common fallacy is that there is a straightforward correspondence between Simplified Chinese (SC)
and Traditional Chinese (TC), and that conversion between the two merely requires mapping from one
character set to another. In fact, code-conversion from SC to TC will often lead to errors both on the
orthographic and lexemic levels. An example of orthographic conversion is 头发 ‘hair’ converted to 頭髮,
in which 頭 and 髮 are the traditional equivalents of 头 and 发 respectively. An example of lexemic
conversion is SC 激光 ‘laser’ converted to 雷射 in TC, a distinct word of identical meaning.
CJKI ranks among the world's foremost experts on Simplified to/from Traditional Chinese conversion,
and has in-depth knowledge of Chinese segmentation issues, having collaborated with Chinese
universities such as Beijing Language and Culture University. Our comprehensive SC to/from TC
mapping tables, developed over a period of about 12 years, have several million entries, the largest in
existence.
The table below illustrates lexemic mappings of computer terms between SC and TC.
Chinese to Chinese Conversion Technology
ENGLISH | SIMPLIFIED | TRADITIONAL
File | 文件 | 檔案
CD-ROM | 光盘 | 光碟
Data | 数据 | 資料
Compatibility | 兼容性 | 相容性
Information | 信息 | 資訊
Software | 软件 | 軟體
Message | 消息 | 訊息
Camera | 摄像机 | 攝影機
Recording/Burning | 刻录 | 錄製
Drive | 驱动器 | 光碟機
Audio frequency | 音频 | 音訊
Memory | 存储 | 儲存
Video frequency | 视频 | 視訊
Compatible | 兼容 | 相容
Rewritable | 可擦写 | 可重寫
Optical drive | 光驱 | 燒錄機
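The difference between character-level code conversion and word-level mapping can be made concrete with a short Python sketch. It assumes a greedy longest-match strategy over a tiny word table with a per-character fallback. This is an illustrative assumption rather than CJKI's conversion engine: the tables hold only a handful of entries taken from the examples above, and a real system would also need word segmentation and context handling.

```python
# Word-level mappings. The first three are lexemic (a different word in TC);
# the last is an orthographic mapping that must be resolved at word level.
WORDS = {
    "文件": "檔案",
    "软件": "軟體",
    "激光": "雷射",
    "头发": "頭髮",
}

# Character-level fallback. 发 is ambiguous (it simplifies both 發 and 髮),
# so a character table alone would convert 头发 incorrectly.
CHARS = {"头": "頭", "发": "發", "录": "錄"}


def sc_to_tc(text: str, words=WORDS, chars=CHARS) -> str:
    """Convert SC to TC, preferring the longest matching word-level entry."""
    longest = max(map(len, words))
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in words:
                out.append(words[chunk])
                i += size
                break
        else:
            # No word-level entry starts here: fall back to one character.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for sample in ("头发", "文件目录", "激光"):
        print(sample, sc_to_tc(sample))
```

With only the character table, 头发 would become 頭發. The word-level entry is what yields 頭髮, and the lexemic entries are what turn 文件 into 檔案 instead of leaving it as a character-for-character copy.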
Japanese Lexical Resources
日本語語彙資源
CJKI’s comprehensive Japanese lexical databases and dictionaries currently include nearly seven million
entries, covering general vocabulary, technical terminology, proper nouns, company and organization
names, katakana loanwords, and others. These include a rich set of grammatical, phonological and
semantic attributes, including readings, part-of-speech codes, conjugation and inflection pattern codes,
orthographic variants, various frequency statistics, and others.
Below is a description of some of CJKI’s principal Japanese lexical resources.
Principal Resources
• Japanese Lexical Database
• Japanese-English Database of Proper Nouns
• Japanese Morphological Database
• Japanese Orthographical Database
• Japanese-English Dictionary of Technical Terms
• Database of Japanese Name Variants
• CJKI Japanese-English Dictionary
• CJKI English-Japanese Dictionary
• Japanese Phonetic Database
• Japanese-Chinese Dictionary of Technical Terms
• Katakana Lexical Database
• Japanese Company Names
• Japanese Lexical Frequency Statistics
• Kanji-English Dictionaries
• Japanese IME Databases
• Japanese Full Form Dictionary
Japanese Lexical Database
日本語語彙データベース
The CJKI Japanese Lexical Database (JLD) is a comprehensive monolingual lexical database that
includes a rich set of grammatical attributes fine-tuned for NLP applications such as machine
translation, information retrieval and morphological analysis. It contains about 400,000 entries covering
general vocabulary, both free forms and bound forms. Developed by CJKI’s team of experienced
Japanese editors and linguists over more than a decade, the JLD is a significant contribution to the field
of Japanese lexicography. It is highly recommended to supplement JLD with our Japanese
Orthographical Database (JOD).
Sample of Japanese Lexical Database
HEPBURN
HEADWORD
READING
POS
SUBPOS
CONJ
TYPE
VALENCY
SCRIPT
掛かる
かかる
V5
-
R
i
0
J
kakaru
仮定
かてい
VN
M
-
-
0
J
katei
がぶ飲み
がぶのみ
VN
t
0
J
gabunomi
がま口
がまぐち
NC
-
-
0
J
gamaguchi
がましげ
がましげ
FS
-
-
1
J
gamashige
がましさ
がましさ
WS
1
J
gamashisa
がらがら
がらがら
D
0
J
garagara
がらがら
がらがら
VN
0
J
garagara
がらくた
がらくた
NC
0
J
garakuta
がらっと
がらっと
D
0
J
garatto
がらっぱち
がらっぱち
AN
0
J
garappachi
がらっぱち
がらっぱち
NC
0
J
garappachi
がわり
がわり
WS
1
J
gawari
がんがん
がんがん
D
0
J
gangan
がんがん
がんがん
VN
0
J
gangan
がんとして
がんとして
D
0
J
gantoshite
下がる
さがる
V5
-
R
i
0
J
sagaru
寒い
さむい
AJ
-
-
-
0
J
samui
何故なら
なぜなら
J
-
-
-
0
J
nazenara
M
i
0
i
Japanese↔English Proper Nouns
日英固有名詞データベース
The CJKI Japanese↔English Database of Proper Nouns (JEP) is very comprehensive, covering
millions of entries. It includes various data fields such as hiragana and romanized readings, frequency
rankings, classification codes and locale codes, orthographic variants, English equivalents, and more.
Included are a large variety of both Japanese and non-Japanese name types, such as:

Place names

Personal names (surnames and given names)

Companies and organizations

Western personal and place names

Facilities (stations, roads, hotels) and points of interest.

Detailed geographic data, especially for Japan.
Japanese Personal Names
TYPE | NAME | READING | ENGLISH | RANK
S | 永福 | ながふく | Nagafuku | 36072
S | 蟻原 | ありばら | Aribara | 18269
S | 橋詰 | はしつめ | Hashitsume | 27721
S | 橋詰 | はしづめ | Hashizume | 11691
FM | 加名見 | かなみ | Kanami | 37988
M | 海修 | かいしゅう | Kaishu | 37988
F | 季絵 | きえ | Kie | 9317
M | 光喜 | こうき | Koki | 521
M | 好洋 | こうよう | Koyo | 37988
M | 幸喜 | さちき | Sachiki | 89487
M | 幸喜 | こうき | Koki | 3085
M | 幸喜 | ゆきよし | Yukiyoshi | 82511
Western Personal Names
TYPE | JAPANESE | READING | LATIN
S | アントニア | あんとにあ | Antonia
S | イーザー | いーざー | Iser
S | ウッドフォード | うっどふぉーど | Woodford
S | ウスペンスキー | うすぺんすきー | Ouspensky
G | シェーラ | しぇーら | Sheila
S | シェフェール | しぇふぇーる | Schaeffer
S | シャルコー | しゃるこー | Charcot
S | シュタードレン | しゅたーどれん | Stadlen
S | タラソワ | たらそわ | Tarasova
G | ニコライ | にこらい | Nikolai
G | マジョリー | まじょりー | Marjorie
G | メルビン | めるびん | Melvin
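The Japanese personal name sample lists the same kanji name with several readings, each carrying its own RANK value. As a rough illustration of how such data might be consumed, the sketch below picks one reading per name. It assumes that a numerically smaller RANK means a more frequent reading, which the sample does not state explicitly; the function name and data layout are hypothetical.

```python
from collections import defaultdict

# Rows copied from the Japanese Personal Names sample:
# (type, name, reading, english, rank)
ROWS = [
    ("S", "橋詰", "はしつめ", "Hashitsume", 27721),
    ("S", "橋詰", "はしづめ", "Hashizume", 11691),
    ("M", "幸喜", "さちき", "Sachiki", 89487),
    ("M", "幸喜", "こうき", "Koki", 3085),
    ("M", "幸喜", "ゆきよし", "Yukiyoshi", 82511),
]


def best_readings(rows):
    """Return {(type, name): (reading, english)} for the best-ranked reading.

    Assumption: a numerically smaller RANK means a more frequent reading.
    """
    by_name = defaultdict(list)
    for type_code, name, reading, english, rank in rows:
        by_name[(type_code, name)].append((rank, reading, english))
    # min() compares the rank first because it is the first tuple element.
    return {key: min(cands)[1:] for key, cands in by_name.items()}


best = best_readings(ROWS)
print(best[("S", "橋詰")])
print(best[("M", "幸喜")])
```

An application such as an IME or a text-to-speech front end could use the selected reading as its default and keep the remaining readings as alternatives.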
Japanese Place Names
NAME | READING | ENGLISH
芦別市 | あしべつし | Ashibetsu-shi
朝鍋川 | あさなべがわ | Asanabegawa
旭上町 | あさひかみまち | Asahikamimachi
朝日ヶ丘 | あさひがおか | Asahigaoka
名鉄常滑線 | めいてつとこなめせん | Meitetsu Tokoname Line
成田国際空港 | なりたこくさいくうこう | Narita International Airport
斐川インターチェンジ | ひかわいんたーちぇんじ | Hikawa IC
青梅街道 | おうめかいどう | Ome-Kaido
川崎区役所 | かわさきくやくしょ | Kawasaki-ku Ward Office
京都ブライトンホテル | きょうとぶらいとんほてる | Kyoto Brighton Hotel
横浜カントリークラブ | よこはまかんとりーくらぶ | Yokohama C.C.
駒沢オリンピック公園 | こまざわおりんぴっくこうえん | Komazawa Olympic Park
Western Place Names
JAPANESE | READING | LATIN
東ベルリン | ひがしべるりん | East Berlin
ウィンズローパーク | うぃんずろーぱーく | Winslow Park
エッセン | えっせん | Essen
オークブルック | おーくぶるっく | Oak Brook
オファーレル | おふぁーれる | O'Farrell
サザンプトン島 | さざんぷとんとう | Southampton Island
バヌアツ共和国 | ばぬあつきょうわこく | Republic of Vanuatu
Japanese Companies and Organizations
JAPANESE | READING | ENGLISH
海外旅行開発 | かいがいりょこうかいはつ | Overseas Tour Promotion, Inc.
宮下機料店 | みやしたきりょうてん | Miyashita Kiryoten
大豊建設 | だいほうけんせつ | Daiho Corporation
南急モータース | なんきゅうもーたーす | Nankyu Motors Co., Ltd.
富士見産業 | ふじみさんぎょう | Fujimi Sangyo Co., Ltd.
緑営バイオ | りょくえいばいお | Ryokuei Bio Co., Ltd.
Japanese Morphological Database
日本語連接属性データベース
The Japanese Morphological Database (JMD) contains various morphological attributes such as
derivational attributes, suffixes and prefixes, word elements (bound morphemes) and binding valency.
These are particularly useful for disambiguating and identifying Japanese lexemes in such applications as
segmentors, morphological analyzers, input method editors (IME) and search engine query processing.
JMD is designed to significantly enhance segmentation accuracy and tokenization by making it possible
to reliably identify compound words not in the lexicon. It consists of various components:

A detailed list of verb and adjective stem variants

A detailed list of verb and adjective inflectional endings

A detailed list of auxiliaries attached to verbs and adjectives

A database of affixes with adjacency attributes, essential for identifying lexemes not
in the lexicon (OOV), like 処理済み shorizumi from 処理+済み.
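The way affix adjacency attributes help with out-of-vocabulary (OOV) compounds such as 処理済み can be sketched roughly as follows. This is a toy illustration, not the JMD's actual format or algorithm: the lexicon, the suffix table, the POS sets attached to each suffix and the function name are all hypothetical.

```python
# Toy lexicon: surface form -> POS code (hypothetical sample data).
LEXICON = {"処理": "VN", "確認": "VN", "東京": "NP"}

# Suffix -> set of POS codes it may attach to. This is a simplified,
# assumed stand-in for a real adjacency attribute.
SUFFIXES = {"済み": {"VN"}}


def analyze_oov(token):
    """Split an OOV token into (stem, suffix) when the stem is in the
    lexicon and the suffix is allowed to follow the stem's POS."""
    if token in LEXICON:
        return None  # already a known lexeme, nothing to analyze
    for suffix, allowed_pos in SUFFIXES.items():
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if LEXICON.get(stem) in allowed_pos:
                return stem, suffix
    return None


print(analyze_oov("処理済み"))  # stem and suffix are recognized
print(analyze_oov("東京済み"))  # rejected: the suffix may not follow NP
```

The adjacency check is what keeps the analyzer from accepting every string that merely happens to end in a known suffix.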
Adjacency Attributes
AFFIX
READING
POS
SUB-POS
VALENCY
RANK
気味
ぎみ
WS
M
1
67900
NC VC
AN
染
ぞめ
WS
1
61089
NC NP
NC
染みる
じみる
WS
1
61089
NC
V1
染める
しめる
WS
1
61089
VC
V1
平
だいら
WS
1
61089
NP
NP
平
ひら
WP
1
61089
NC
NC
平成
へいせい
NE
0
61089
NN
NC
別
べつ
FS
0
331
別
べつ
WP
1
331
片
かた
WP
1
28538
片
へん
WS
1
25149
NC
NC
片
ぺん
WS
1
61089
NN
NC
編
へん
WS
1
8970
NC NP
NC
編み
あみ
WS
1
61089
NX
NC
返す
かえす
WS
1
2476
VC
V5
返る
かえる
WS
1
61089
VC
V5
便
びん
WS
1
3030
NC NN
NC
S
N
BEFORE
AFTER
NC
RESULT
NC
NC VC
NC
NC V
NC
Japanese Orthographical Database
日本語異表記データベース
The orthographical complexity of Japanese poses a special challenge to the developers of
computational linguistic tools, especially in the area of intelligent information retrieval. These
difficulties are exacerbated by the lack of a standardized orthography and by the highly irregular
ways in which Japanese words are written.
The CJKI Japanese Orthographical Database (JOD) plays a critical role in enhancing the accuracy
of information retrieval, machine translation and morphological analysis applications, as it helps identify
and disambiguate the numerous Japanese orthographic variants that have identical meanings, such as neko
‘cat’ written 猫, ねこ or ネコ and kakiarawasu ‘write out, publish’ written 書き著す, 書著す,
書き著わす or 書著わす. This database is the most comprehensive of its kind, and is being
used by such companies as Yahoo, Amazon and Baidu to dramatically improve search recall. Also
included are a large variety of katakana orthographic variants for loanwords.
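Japanese Orthographic Variants
READING | POS | SUB_ID | VARIANT | NORMALIZED
あっせん | VN | a | 斡旋 | あっせん
あっせん | VN | b | あっせん | あっせん
あっせん | VN | c | あっ旋 | あっせん
あかとんぼ | NC | a | 赤とんぼ | 赤とんぼ
あかとんぼ | NC | b | 赤トンボ | 赤とんぼ
あかとんぼ | NC | c | 赤蜻蛉 | 赤とんぼ
あかとんぼ | NC | d | アカトンボ | 赤とんぼ
あかとんぼ | NC | e | あかとんぼ | 赤とんぼ
あきかん | NC | a | 空き缶 | 空き缶
あきかん | NC | b | 空缶 | 空き缶
あきかん | NC | c | 明き罐 | 空き缶
あきかん | NC | d | あき缶 | 空き缶
あきかん | NC | e | あき罐 | 空き缶
あきかん | NC | f | 空きかん | 空き缶
あきかん | NC | g | 空きカン | 空き缶
あきかん | NC | h | 空き罐 | 空き缶
あきかん | NC | i | 空罐 | 空き缶
あきかん | NC | j | 空き鑵 | 空き缶
あきかん | NC | k | 空鑵 | 空き缶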
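One plausible way to use variant data of this kind for search recall is to map every variant to its normalized form at both indexing and query time. The sketch below uses a few of the sample variants listed above; the dictionary layout and function name are assumptions for illustration, not the JOD's delivery format.

```python
# Normalized form -> some of its orthographic variants (from the sample).
VARIANT_GROUPS = {
    "あっせん": ["斡旋", "あっせん", "あっ旋"],
    "赤とんぼ": ["赤とんぼ", "赤トンボ", "赤蜻蛉", "アカトンボ", "あかとんぼ"],
    "空き缶": ["空き缶", "空缶", "あき缶", "空きかん", "空きカン"],
}

# Invert the groups into a variant -> normalized form lookup table.
NORMALIZE = {
    variant: normalized
    for normalized, variants in VARIANT_GROUPS.items()
    for variant in variants
}


def normalize_tokens(tokens):
    """Replace each known variant with its normalized form; leave
    unknown tokens unchanged."""
    return [NORMALIZE.get(token, token) for token in tokens]


print(normalize_tokens(["アカトンボ", "空缶", "犬"]))
```

With both documents and queries normalized this way, a search for 赤蜻蛉 can also match text that contains アカトンボ.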
Japanese↔English Technical Terms
日英専門用語辞書
CJKI maintains a comprehensive bidirectional Japanese-English dictionary of over 1,000,000
technical terms covering the major domains of science and technology. The Japanese↔English
Dictionary of Technical Terms (JET) is available in domain-specific standalone modules or as a
full edition including all domains. Some of the major domains covered include computer/IT,
mechanical engineering, medicine and pharmaceutics. JET is currently being used by some of the
world's leading IT companies such as Fujitsu, Microsoft, Sharp and Casio.
This database is being used in a variety of applications and software products, such as:

Machine translation dictionaries.

Information retrieval applications for accurate term recognition and indexing.

NLP tools like morphological analyzers and tokenizers.

Handheld electronic dictionaries, such as in Casio's high end Exword GT series.

Dictionaries on CD-ROM such as in Logovista's Electronic Dictionary Series.

Dictionaries for mobile platforms such as iPhone, Android, and Sharp’s
XMDF-based devices.
CJKI is constantly expanding this database by adding new terms, new domains, and readings.
Japanese-English Technical Terms
DOMAIN | JAPANESE | READING | ENGLISH
化学 | 亜ヒ酸カルシウム | あひさんかるしうむ | calcium arsenite
生物 | 環状染色体 | かんじょうせんしょくたい | circular chromosome
生物 | 寒天拡散法 | かんてんかくさんほう | cup method
機械 | 駆動プーリー | くどうぷーりー | driving pulley
医学 | 結節性裂毛症 | けっせつせいれつもうしょう | clastothrix
医学 | 犬吠せき | けんばいせき | compression cough
電気 | コンデンサ始動電動機 | こんでんさしどうでんどうき | capacitor-start motor
化学 | ジアミノフェノール塩酸塩 | じあみのふぇのーるえんさんえん | diaminophenol hydrochloride
電気 | 整流する | せいりゅうする | commutate
医学 | チアノーゼ | ちあのーぜ | cyanose
電気 | 転換する | てんかんする | convert
電気 | ブラシ位置変化 | ぶらしいちへんか | brush position change
建設 | 臨界圧力 | りんかいあつりょく | critical pressure
Japanese Name Variants
日本語固有名詞異表記辞書
The number of Japanese proper nouns and their variants is very large -- in the millions -- which makes
it difficult to identify them and process them. Named Entity Recognition (NER) technology is a hot
topic in computational linguistics. To enhance NER technology, CJKI maintains databases of several
million CJK and Arabic name variants in all major and most minor romanization systems.
There are several well-established systems for romanizing Japanese, such as the Hepburn and Kunrei
systems, as well as various popular systems and hybrid systems, which leads to millions of romanized
variants. A good example is the first name of Japan's former prime minister Jun'ichirō Koizumi, which
has 169 romanized variants. The CJKI Database of Japanese Name Variants (JNV) covers four
million Japanese names and their romanized variants, and includes gender codes, classification codes,
and frequency rankings. JNV is being used by some major IT companies, especially for business
intelligence software and machine translation.
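To see why the number of romanized variants grows so quickly, consider just the long vowel ō, which is commonly written o, ō, ô, oo, ou or oh. The sketch below expands each long o in a Hepburn form into these six spellings. It is a simplified illustration only: the spelling lists and the function name are assumptions, and real variant generation also involves apostrophes, hyphens, Kunrei spellings and hybrids.

```python
import unicodedata

# Common spellings of long o (an assumed, non-exhaustive list).
LONG_VOWEL_SPELLINGS = {
    "ō": ["o", "ō", "ô", "oo", "ou", "oh"],
    "Ō": ["O", "Ō", "Ô", "Oo", "Ou", "Oh"],
}


def long_vowel_variants(hepburn):
    """Expand every macron o in a Hepburn form into its common spellings."""
    # NFC ensures a macron vowel is a single precomposed character.
    text = unicodedata.normalize("NFC", hepburn)
    results = [""]
    for ch in text:
        options = LONG_VOWEL_SPELLINGS.get(ch, [ch])
        results = [prefix + opt for prefix in results for opt in options]
    return results


print(long_vowel_variants("Satō"))
# ['Sato', 'Satō', 'Satô', 'Satoo', 'Satou', 'Satoh']
```

A name with two long vowels already yields 36 combinations from this rule alone, before any other source of variation is considered.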
Japanese Romanization Systems
KANJI
YOMI
ENGLISH HEPBURN KUNREI
NIPPON
VARIANTS
HYBRIDS GERMANIC
佐藤
さとう
Sato
Satō
Satô
Satô
Satoo, Satou, Satoh
青塚
あおづか
Aozuka
Aozuka
Aozuka
Aoduka
Aozuca
Aoduca
生越
いくごし
Ikugoshi
Ikugoshi
Ikugosi
Ikugosi
Icugosi
Icugoshi
大津
おおづ
Ozu
Ōzu
Ôzu
Ôdu
Oozu, Ouzu, Ohzu,
Oodu, Oudu, Ohdu,
Odu
Ōdu
伊大地
いおおじ
Ioji
Iōji
Iôzi
Iôzi
Iōzi, Ioozi, Iouzi,
Iohzi, Iozi, Iooji,
Iouji, Iohji, Iôji
橋本
はしもと
Hashimoto Hashimoto
Hasimoto
Hasimoto
天満屋
てんまんや
Tenman'ya Tenman'ya
Temman'ya, Temmanya,
Tenman'ya Tenman'ya Temman-ya, Tenmanya,
Tenman-ya
Ikugoschi
LATIN
Ikugochi
Haschimoto Hachimoto
Tenman'ja,
Tenmanja,
Tenman-ja
Variants of Jun'ichirō
TYPE | ROMANIZATION
VARIANT | Junichiro
ENG | Jun'ichiro
VARIANT | Jun-ichiro
VARIANT | Junichirô
HYBRID | Juniciro
HEPBURN | Jun'ichirō
VARIANT | Jun-ichirō
VARIANT | Junichirou
VARIANT | Jun'ichirou
VARIANT | Jun-ichirou
VARIANT | Junichirō
HYBRID | Jyunichiro
HYBRID | Junitirou
HYBRID | Junitiro
VARIANT | Zyun'itiro
VARIANT | Zyun-itiro
VARIANT | Junichiroo
VARIANT | Junichiroh
HYBRID | Jyunichirou
VARIANT | Jun'ichiroh
VARIANT | Jun-ichiroh
VARIANT | Zyun'itiroo
VARIANT | Zyun-itiroo
HYBRID | Jyun'ichiro
HYBRID | Jyun-ichiro
HYBRID | Jyunichiroh
VARIANT | Jun'ichiroo
VARIANT | Jun-ichiroo
HYBRID | Jyunitirou
Japanese Place Name Variants
RANK
A
FREQ JAPANESE
YOMI
ROMA
TYPE
26600
安城市
あんじょうし
Anjo city
E
A
4270
安城市
あんじょうし
Anjo-shi
E
A
2080
安城市
あんじょうし
Anjou-si
h
A
517
安城市
あんじょうし
Anjou city
h
A
454
安城市
あんじょうし
Anjoshi
E
A
339
安城市
あんじょうし
Anjyo city
h
A
285
安城市
あんじょうし
Anjoushi
h
B
282
安城市
あんじょうし
Anjosi
h
138
安城市
あんじょうし
Anjou-shi
h
B
112
安城市
あんじょうし
Anjo-si
h
106
安城市
あんじょうし
Anjyoushi
h
77
安城市
あんじょうし
Anjyousi
h
65
安城市
あんじょうし
Anjō-shi
H
58
安城市
あんじょうし
Anzyousi
h
C
47
安城市
あんじょうし
Anjyo-shi
h
C
23
安城市
あんじょうし
Anjousi
h
C
21
安城市
あんじょうし
Anjou-chi
h
C
13
安城市
あんじょうし
Anjyoshi
h
C
12
安城市
あんじょうし
Anjoosi
h
C
12
安城市
あんじょうし
Anjyou-shi
h
C
11
安城市
あんじょうし
Anjō city
h
10
安城市
あんじょうし
Anjoh city
h
10
安城市
あんじょうし
Anjo-chi
h
9
安城市
あんじょうし
Anjyou city
h
5
安城市
あんじょうし
Anjyoo-shi
h
3
安城市
あんじょうし
Anjoh-shi
h
B
B
B
B
C
C
C
C
C
3
安城市
あんじょうし
Anjohshi
h
C
3
安城市
あんじょうし
Anjochi
h
C
2
安城市
あんじょうし
Anjô city
h
C
C
CJKI Japanese-English Dictionary
CJKI 和英辞典
The CJKI Japanese-English Dictionary (CJED) covers about 110,000 entries of general vocabulary.
It includes part-of-speech codes and readings. This up-to-date dictionary is optimized for the
convenience of users of electronic dictionaries and online translation tools. It has just the right amount
of detail: enough equivalents to give an in-depth understanding, yet short enough not to clutter up the
screen.
Japanese-English Dictionary
POS | JAPANESE | READING | ENGLISH
NC | 辞書 | じしょ | dictionary; lexicon; glossary; thesaurus
NC | 地所 | じしょ | land; ground; estate
NC | 事象 | じしょう | phenomenon
V | 自称する | じしょうする | call oneself; profess oneself to be someone; style oneself; profess oneself
AN | 自称の | じしょうの | would-be
NC | 辞書学 | じしょがく | lexicography
NC | 辞職 | じしょく | resignation
NC | 辞職勧告 | じしょくかんこく | advice to resign
V | 辞職する | じしょくする | resign; resign from; resign one's office
NC | 自信 | じしん | self-confidence
NC | 自身 | じしん | one's self; self; oneself; itself
NC | 地震 | じしん | earthquake; earth tremor; quake; temblor
D | 自身で | じしんで | oneself; in person
AN | 自身の | じしんの | one's own
AN | 地震の | じしんの | seismic
NC | 事情 | じじょう | circumstances; conditions; situation; state of affairs; reasons
NC | 自乗 | じじょう | square; second power
NC | 自浄 | じじょう | self-purification
V | 自乗する | じじょうする | square
V | 二乗する | じじょうする | square; multiply by itself
CJKI English-Japanese Dictionary
CJKI 英和辞典
The CJKI English-Japanese Dictionary (CEJD) contains about 82,000 entries covering general
vocabulary and important proper names, and includes part-of-speech codes. Optimized for the
convenience of users of electronic dictionaries and online translation tools, it has just the right amount
of detail: enough equivalents to give an in-depth understanding but short enough not to clutter up the
screen. This dictionary is or has been used in such well-known products as Babylon and Quicktionary
and on various mobile platforms around the world, such as TangoTown.
Sample of English-Japanese Dictionary
ENGLISH | POS | JAPANESE
wrinkle | N | しわ; うまい考え; 名案
wrinkle | V | しわが寄る, …にしわを寄せる
wrist | N | 手首, そで口
wristlet | N | 袖口バンド, 金属バンド, 腕輪
wristwatch | N | 腕時計
writ | N | 令状
write | V | 書く; 著述する, 作曲する; 手紙を書く; 署名する
write down | E | 書き留める; 評する; 調子を下げて書く; けなす
write in | E | 投書する; 書き込む; 書き入れて投票する
write off | E | 帳消しにする; 損失とみなす; すらすらと書く
write one's own ticket | E | 将来の方針を立てる
write out | E | 全部書く, 清書する; 書く
write-down | N | 評価切り下げ; 償却
write-off | N | 帳消し, 価格引き下げ
writer | N | 作家, 記者; 書く人; 筆者; 作者; 著者
writer's block | E | 著述遮断
Japanese Phonetic Database
日本語音韻データベース
CJKI’s Japanese Phonetic Database (JPD) was developed in collaboration with The National
Institute for Japanese Language, the well-respected Japanese government organization that conducts
scientific research on the Japanese language. The database covers about 130,000 entries of general
vocabulary and personal names, and to our knowledge is the first database of its kind to provide
phonetic transcriptions in IPA and accent codes that accurately indicate how Japanese names and words
are pronounced in actual speech.
Japanese Phonetic Database
HEADWORD | POS | ACCENT | READING | PHONETIC | REMARKS
鏡 | NC | 3 | カガミ | kaŋami | voiced velar nasal
鏡 | NC | 3 | カガミ | kaɡami | voiced velar stop
鏡 | NC | 3 | カガミ | kaɣami | voiced velar fricative
危ない | AJ | 0 | アブナイ | abɯnai | voiced bilabial plosive
危ない | AJ | 0 | アブナイ | aβɯnai | voiced bilabial fricative
飾り | NC | 0 | カザリ | kazaɾi | voiced alveolopalatal fricative
ザリガニ | NC | 0 | ザリガニ | dzaɾiɡaɲi | alveolopalatal affricate, palatal nasal
新聞 | NC | 0 | シンブン | ɕimbɯɴ |
自分 | NC | 0 | ジブン | dʑibɯɴ | weakening of voiced alveolopalatal plosive (fricativized)
比較 | VN | 0 | ヒカク | ikakɯ | devoicing of [hi]
比較 | VN | 0 | ヒカク | hikakɯ | no devoiced vowel
続く | V5 | 0 | ツヅク | tsɯzɯkɯ | devoicing of [tsɯ]
続く | V5 | 0 | ツヅク | tsɯzɯkɯ | no devoiced vowel
恥 | NC | 2 | ハジ | haʑi | voiced alveolopalatal fricative
恥 | NC | 2 | ハジ | hadʑi | voiced alveolopalatal affricate
蜂 | NC | 0 | ハチ | hatɕi |
八 | NC | 2 | ハチ | hatɕi |
鵜 | NC | 1 | ウ | ʔɯʔ | glottal stop
Korean Lexical Resources
한국어어휘자원
CJKI is engaged in the development of various Korean dictionaries and lexical databases, covering
general vocabulary, personal names, place names, and geographical data for Japan. These are used in
such applications as machine translation (MT), information retrieval (IR), input method editors
(IME) and online maps. The Korean lexical database includes a rich set of grammatical, phonological
and semantic attributes for use in natural language processing (NLP) applications.
Below is a list of our principal Korean resources:
Principal Resources

Korean Lexical Database

Korean-English Dictionary of Proper Nouns

Korean-Japanese Dictionary of Proper Nouns

Korean IME Databases
Korean Lexical Database
한국어어휘데이터베이스
The CJKI Korean Lexical Database (KLD) is a monolingual lexical database of Korean developed
by CJKI’s Korean editors. It includes a rich set of grammatical attributes, as well as hanja where
applicable. The KLD is especially suitable for applications in the fields of information retrieval,
morphological analysis and machine translation.
Korean Lexical Database
HANGUL | POS | SUBPOS | TYPE | PATTERN | MOE | HANJA
가둥-거리다 | V | | i | | katungkŏrita |
가로놓이다 | V | | i | | karonohita |
가리산지리산 | D | V | i | HADA | karisanchirisan |
가볍다 | AX | | | P | kapyŏpta |
가살-스럽다 | AX | | | P | kasal-sŭrŏpta |
가수분해 | NC | VP | ti | HADA | kasupunhae | 加水分解
가시화-되다 | V | | i | | kasihwatoeta |
가져가다 | V | | t | | kachyŏkata |
가파르다 | AX | | | REU | kap'arŭta |
간정되다 | V | | i | | kanchŏngtoeta |
갈아대다 | V | | t | | kalataeta |
감다 | V | | t | | kamta |
감때사납다 | AX | | | P | kamttaesanapta |
개교 | NC | V | i | HADA | kaekyo | 改敎
개연 | NC | A | | HADA | kaeyŏn | 蓋然
Korean↔English Proper Nouns
고유명사한영사전
The CJKI Korean↔English Dictionary of Proper Nouns (KEP) is a bilingual dictionary of
personal and place names that covers both Korean and non-Korean proper nouns, including Japanese
and Chinese names. It includes various data fields such as romanized readings, frequency rankings,
classification codes, locale codes and English equivalents, and supports multiple romanization systems.
A unique feature of this dictionary is that it also includes hanja (Chinese characters used in Korean).
This data was compiled on the basis of precise transcription and transliteration rules, and verified by
our Korean editors. It supports the Revised Romanization of Korean, the latest standard published by
the Korean government in 2000, properly reflecting the phonological changes resulting from patchim
and liaison. The MOE transliteration refers to a strict transliteration based on the former Ministry of
Education (MOE) romanization system.
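A strict transliteration of this kind maps each hangul letter to a fixed Latin string without applying any sound-change rules. The sketch below decomposes precomposed hangul syllables arithmetically (a standard Unicode property) and joins per-letter values. The three mapping tables are an assumption reconstructed to agree with the MOE column of the samples in this brochure; they are not an official table, and the function name is hypothetical.

```python
# Per-letter values in Unicode jamo order (assumed, see lead-in).
INITIALS = ["k", "kk", "n", "t", "tt", "r", "m", "p", "pp", "s", "ss", "",
            "ch", "tch", "ch'", "k'", "t'", "p'", "h"]
VOWELS = ["a", "ae", "ya", "yae", "ŏ", "e", "yŏ", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wŏ", "we", "wi", "yu", "ŭ", "ŭi", "i"]
FINALS = ["", "k", "kk", "ks", "n", "nch", "nh", "t", "l", "lk", "lm", "lp",
          "ls", "lt'", "lp'", "lh", "m", "p", "ps", "s", "ss", "ng", "ch",
          "ch'", "k'", "t'", "p'", "h"]

HANGUL_BASE = 0xAC00       # first precomposed syllable (가)
SYLLABLE_COUNT = 11172     # 19 initials * 21 vowels * 28 finals


def strict_transliterate(text):
    """Transliterate hangul letter by letter; pass other characters through."""
    out = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < SYLLABLE_COUNT:
            initial, rest = divmod(index, 21 * 28)
            vowel, final = divmod(rest, 28)
            out.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)


for name in ["강남구", "정석", "희경", "가야1동"]:
    print(name, strict_transliterate(name))
```

Because no liaison or assimilation rules are applied, the output can be mapped back to the hangul spelling far more reliably than a pronunciation-based romanization such as the Revised Romanization.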
Korean Proper Nouns
HANGUL | HANJA | MOE | ENGLISH
안 | 安 | an | An
제갈 | 諸葛 | chekal | Chegal
조 | 趙, 曺 | cho | Cho
주 | 周, 朱 | chu | Chu
한 | 漢, 韓 | han | Han
황 | 黃 | hwang | Hwang
지미 | 芝美 | chimi | Chimi
지산 | 智山 | chisan | Chisan
정석 | 正錫 | chŏngsŏk | Cheongseok
해운 | 海雲 | haeun | Haeun
희경 | 喜慶 | hŭikyŏng | Huigyeong
희란 | 熙欄 | hŭiran | Huiran
가야1동 | 伽倻一洞 | kaya1tong | Kaya 1Village
가양동 | 加陽洞 | kayangtong | Kayang Village
갈말읍 | 葛末邑 | kalmalŭp | Kalmal Town
강남구 | 江南區 | kangnamku | Kangnam District
강원도 | 江原道 | kangwŏnto | Kangwon Province
Multilingual Technical Terms
多语言术语词典
Because of the rapidly growing trade relations between China, Japan and the English speaking world,
there is an urgent need for Chinese-Japanese-English technical dictionaries in electronic form. CJKI’s
Multilingual Dictionary of Technical Terms (MDT) is a comprehensive trilingual dictionary of
technical terms in Simplified Chinese, Japanese and English covering about 300,000 entries in all major
fields of science and technology. A sister edition, the Chinese-English Dictionary of Technical Terms,
is under development in collaboration with Chinese institutions and is expected to cover several million
entries.
Multilingual Technical Terms
CHINESE
JAPANESE
加油
給油
加油
注油
加油车
給油車
加亮
ENGLISH
fueling, lubrication
lubrication, oiling
refueling truck
feeder, lubricator, oiler
給油器
lubricator, oil can, oil feeder,
油差し
syringe
lubrication hole, oil hole
油穴
filling station, service station
給油所
フィリングスタンド filling stand
sulfurization, thionation,
加硫
vulcanization, curing
highlighting
強調
jiāyóu
jiāyóu
jiāyóuchē
HIRAGANA
きゅうゆ
ちゅうゆ
きゅうゆしゃ
加油器
jiāyóuqì
きゅうゆき
jiāyóuqì
あぶらさし
加亮
強調表示
highlighting
jiāliàng
加亮
加亮
高輝度表示
ハイライト表示
highlighting, brightening
highlighting
jiāliàng
jiāliàng
加力燃烧室 アフターバーナー
afterburner
jiālìránshā
あふたーばーなー
oshì
加和性
加成性
加气剂
空気連行剤
加聚物
付加重合体
加勒金法
ガレルキン法
additive property, additivity
air entraining agent
addition polymer, addition
resin
Galerkin method
加油器
加油孔
加油站
加油站
加硫
PINYIN
jiāyóukǒng あぶらあな
jiāyóuzhàn きゅうゆしょ
jiāyóuzhàn ふぃりんぐすたんど
jiāliú
かりゅう
jiāliàng
きょうちょう
きょうちょうひょう
じ
こうきどひょうじ
はいらいとひょうじ
jiāhéxìng
かせいせい
jiāqìjì
くうきれんこうざい
jiājùwù
ふかじゅうごうたい
jiālèjīnfǎ
がれるきんほう
Multilingual Proper Nouns
多言語固有名詞データベース
CJKI maintains comprehensive databases of CJK and Arabic personal names and place names,
including various kinds of geographical data covering millions of entries. These databases are used by
some of the world's major IT companies for a wide variety of applications such as online multilingual
maps, named entity recognition, machine translation and information retrieval.
Key Features

The database includes various data fields (many not shown here), such as readings in
pinyin and zhuyin, hiragana, romanization in all major and most important minor
romanization systems, semantic classification codes and frequency rankings, locale codes,
and other useful information.

A unique feature is that the TC place names are not merely a
code-conversion equivalent of the SC names, but are accurate on both the orthographic
and the lexemic levels ("O" and "L" in the tables below). For example, New Zealand in
SC is 新西兰 Xīnxīlán but in TC it is 紐西蘭 Niǔxīlán.

Another unique feature is that SC and TC readings are distinguished. Thus the pinyin
for SC 期荣 is qīróng, but for the TC 期榮 it is qíróng.

Databases can be tailored to your specific needs and budgets.
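The difference between code conversion and lexemic conversion can be sketched as follows. Character-by-character conversion turns SC 新西兰 into 新西蘭, whereas the name actually used in TC is 紐西蘭. The tiny character map and override table below contain only entries taken from the examples in this brochure; the function names and data layout are hypothetical.

```python
# Character-level SC -> TC map (only the characters needed here).
CHAR_MAP = {"兰": "蘭", "门": "門", "尔": "爾", "爱": "愛"}

# Lexemic overrides: names whose TC form is a different word altogether.
LEXEMIC = {"新西兰": "紐西蘭", "也门": "葉門", "乍得": "查德"}


def code_convert(name):
    """Naive character-by-character SC -> TC conversion."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in name)


def sc_to_tc(name):
    """Return (tc_name, level): 'L' if a lexemic override applies,
    otherwise 'O' for a purely orthographic conversion."""
    if name in LEXEMIC:
        return LEXEMIC[name], "L"
    return code_convert(name), "O"


print(code_convert("新西兰"))  # character conversion only
print(sc_to_tc("新西兰"))      # lexemic override
print(sc_to_tc("爱尔兰"))      # orthographic conversion is sufficient
```

Checking the override table before falling back to character conversion is the key design point: without it, converted text is readable but uses names that TC readers do not actually use.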
CJK Proper Noun Databases at a Glance
CATEGORY | C-E | J-C | J-K | C-K | J-E | K-E
Chinese Personal Names | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000
Chinese Place Names | 5,000 | 5,600 | 3,000 | 13,000 | 2,400 | 5,000
Korean Personal Names | 3,300 | 2,100 | 13,000 | 240,000 | 13,000 | 13,000
Korean Place Names | 3,200 | 2,000 | 5,900 | 20,000 | 5,900 | 6,900
Japanese Personal Names (Given Names) | 376,000 | 281,000 | 390,000 | 376,000 | 390,000 | 45,000
Japanese Personal Names (Surnames) | 149,000 | 91,000 | 150,000 | 149,000 | 150,000 | 73,000
Japanese Place Names | 74,000 | 74,000 | 77,000 | 74,000 | 77,000 | 68,000
Western Personal Names | 32,000 | 38,000 | 10,000 | 9,800 | 31,000 | 7,500
Western Place Names | 2,500 | 2,500 | 1,800 | 1,800 | 1,100 | 1,800
Total | 1,645,000 | 1,496,200 | 1,650,700 | 1,883,600 | 1,670,400 | 1,220,200
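As a quick consistency check on the figures above, the nine category counts in each language-pair column can be summed and compared with the stated totals. The sketch below reads each group of six figures as one category row across the columns C-E, J-C, J-K, C-K, J-E and K-E; the variable names are illustrative only.

```python
COLUMNS = ["C-E", "J-C", "J-K", "C-K", "J-E", "K-E"]

# One list per category, in column order.
ROWS = {
    "Chinese Personal Names": [1_000_000] * 6,
    "Chinese Place Names": [5_000, 5_600, 3_000, 13_000, 2_400, 5_000],
    "Korean Personal Names": [3_300, 2_100, 13_000, 240_000, 13_000, 13_000],
    "Korean Place Names": [3_200, 2_000, 5_900, 20_000, 5_900, 6_900],
    "Japanese Given Names": [376_000, 281_000, 390_000, 376_000, 390_000, 45_000],
    "Japanese Surnames": [149_000, 91_000, 150_000, 149_000, 150_000, 73_000],
    "Japanese Place Names": [74_000, 74_000, 77_000, 74_000, 77_000, 68_000],
    "Western Personal Names": [32_000, 38_000, 10_000, 9_800, 31_000, 7_500],
    "Western Place Names": [2_500, 2_500, 1_800, 1_800, 1_100, 1_800],
}

STATED_TOTALS = [1_645_000, 1_496_200, 1_650_700, 1_883_600, 1_670_400, 1_220_200]

# zip(*rows) transposes the category rows into per-column tuples.
computed_totals = [sum(column) for column in zip(*ROWS.values())]

for name, computed, stated in zip(COLUMNS, computed_totals, STATED_TOTALS):
    print(f"{name}: computed {computed:,} / stated {stated:,}")
```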
Chinese-English Chinese Names
Type | Chinese | Pinyin | English
S | 王 | wáng | Wang
S | 张 | zhāng | Zhang
G | 业则 | yèzé | Yeze
G | 业华 | yèhuá | Yehua
G | 业宁 | yèníng | Yening
G | 业彤 | yètóng | Yetong
G | 业权 | yèquán | Yequan
G | 业浔 | yèxún | Yexun
G | 业经 | yèjīng | Yejing
G | 业达 | yèdá | Yeda
Korean-Chinese Korean Names
Type | Korean | Hanja | Chinese
S | 가 | 賈 | 贾
S | 강 | 姜 | 姜
G | 갑중 | 甲中 | 甲中
G | 건종 | 建鍾 | 建锺
G | 경수 | 敬秀 | 敬秀
G | 경숙 | 慶淑 | 庆淑
G | 경식 | 慶植 | 庆植
G | 경환 | 景桓 | 景桓
G | 경철 | 景喆 | 景喆
G | 구덕 | 九德 | 九德
English-Chinese Western Names
Type | English | Chinese | Pinyin
S | Anthony | 安东尼 | Āndōngní
S | Waltham | 沃尔萨姆 | wò'ěrsàmǔ
S | Winchester | 温切斯特 | Wēnqiēsītè
S | Austen | 奥斯汀 | Àosītīng
S | Keyser | 凯泽 | Kǎizé
S | Count | 康特 | Kāngtè
S | Gari | 加里 | Jiālǐ
S | Pierre | 皮埃尔 | pí'āi'ěr
S | Cornelius | 科尔内留斯 | kē'ěrnèiliúsī
S | Constantin | 康斯坦丁 | kāngsītǎndīng
Japanese-English Japanese Given Names
Type | Japanese | Reading | English
M | 丈 | たけし | Takeshi
M | 多摩男 | たまお | Tamao
M | 大二郎 | だいじろう | Daijiro
M | 鉄次 | てつじ | Tetsuji
M | 敏晴 | としはる | Toshiharu
M | 信雄 | のぶお | Nobuo
M | 安博 | やすひろ | Yasuhiro
M | 百合男 | ゆりお | Yurio
M | 一貴 | かずき | Kazuki
M | 善一郎 | ぜんいちろう | Zen'ichiro
Japanese-Chinese Japanese Surnames
Type | Japanese | Reading | Chinese
S | 岡林 | おかばやし | 冈林
S | 下岡 | しもおか | 下冈
S | 丸本 | まるもと | 丸本
S | 佐藤 | さとう | 佐藤
S | 勝部 | かつべ | 胜部
S | 沼 | ぬま | 沼
S | 上仲 | かみなか | 上仲
S | 西郡 | にしごおり | 西郡
S | 中口 | なかくち | 中口
S | 渡辺 | わたなべ | 渡边
CJKEA Database of Place Names
ENGLISH | JAPANESE | SC | LO | TC | KOREAN | ARABIC
Aruba | アルーバ | 阿鲁巴 | L | 阿盧巴 | 아루바섬 | أروبا
Brasilia | ブラジリア | 巴西利亚 | O | 巴西利亞 | 브라질리아 | برازيليا
Caracas | カラカス | 加拉加斯 | L | 卡拉卡斯 | 카라카스 | كراكاس
Cairo | カイロ | 开罗 | O | 開羅 | 카이로 | القاهرة
Chad | チャド | 乍得 | L | 查德 | 차드 | تشاد
Georgia | ジョージア | 乔治亚 | O | 喬治亞 | 조지아 | جورجيا
Ireland | アイルランド | 爱尔兰 | O | 愛爾蘭 | 아일랜드 | آيرلندا
Seoul | ソウル | 首尔 | O | 首爾 | 서울 | سيول
Seoul | ソウル | 汉城 | O | 漢城 | 서울 | سيول
Tel Aviv | テルアビブ | 特拉维夫 | O | 特拉維夫 | 텔아비브 | تل أبيب
Yemen | イエメン | 也门 | L | 葉門 | 예멘 | اليمن
Phonemic Transcriptions of CJKE Place Names
ENGLISH | JAPANESE | SC | TC | KOREAN | ARABIC
Aruba | あるーば | ālǔbāā | ālúbā | arupasŏm | aruba
Brasilia | ぶらじりあ | bāxīlìyà | bāxīlìyà | pŭrachilria | burazilia
Caracas | からかす | jiālājiāsī | kǎlākǎsī | k'arak'asŭ | karakasu
Cairo | かいろ | kāiluó | kāiluó | k'airo | al-qahirah
Chad | ちゃど | zhàdé | chádé | ch'atŭ | tshad
Georgia | じょーじあ | qiáozhìyà | qiáozhìyà | chochia | jurjia
Ireland | あいるらんど | àiěrlán | àiěrlán | ailraentŭ | ayirlanda
Seoul | そうる | shǒuěr | shǒuěr | sŏul | siwul
Seoul | そうる | hànchéng | hànchéng | sŏul | siwul
Tel Aviv | てるあびぶ | tèlāwéifū | tèlāwéifū | t'elapipŭ | tallu-abib
Yemen | いえめん | yěmén | yèmén | yemen | al-yaman
Chinese-Korean Chinese Places
Chinese | Pinyin | Hanja | Korean
东关 | Dōngguān | 東關 | 둥관
东营 | Dōngyíng | 東營 | 둥잉
东阳 | Dōngyáng | 東陽 | 둥양
东阿 | dōng'ē | 東阿 | 둥어
东安 | dōng'ān | 東安 | 둥안
东源 | Dōngyuán | 東源 | 둥위안
东湖 | Dōnghú | 東湖 | 둥후
东港 | Dōnggǎng | 東港 | 둥강
东山 | Dōngshān | 東山 | 둥산
东川 | Dōngchuān | 東川 | 둥촨
Korean-English Korean Places
Korean | Hanja | English
고금면 | 古今面 | Gogeum-Myeon
고남면 | 高南面 | Gonam-Myeon
고담동 | 高潭洞 | Godam-Dong
고대면 | 高大面 | Godae-Myeon
고덕동 | 古德洞 | Godeok-Dong
고덕면 | 古德面 | Godeok-Myeon
고등동 | 高登洞 | Godeung-Dong
고등동 | 高等洞 | Godeung-Dong
고랑동 | 古浪洞 | Gorang-Dong
고령군 | 高靈郡 | Goryeong-Gun
Chinese-English Western Places
Chinese | Pinyin | English
土库曼 | tǔkùmàn | Turkmenistan
尼日利亚 | nírìlìyà | Nigeria
哈里斯堡 | hālǐsībǎo | Harrisburg
布宜诺斯艾利斯 | bùyínuòsī'àilìsī | Buenos Aires
布鲁克林 | bùlǔkèlín | Brooklyn
贝塞斯达 | bèisāisīdá | Bethesda
柏林 | bólín | Berlin
博茨瓦那 | bócíwǎnà | Botswana
马斯喀特 | mǎsīkātè | Muscat
马拉维 | mǎlāwéi | Malawi
CJKE Multilingual Database of Personal Names
ENG | JPN | SC | TC | KOR | LO | HIRAGANA | SC PIN | TC PIN | MOE
Abba | アッバ | 阿巴 | 亞伯 | 아바 | L | あっば | ābā | yàbó | apa
Abbas | アッバース | 阿巴斯 | 阿巴斯 | 아바스 | O | あっばーす | ābāsī | ābāsī | apasŭ
Alberto | アルベルト | 阿尔韦托 | 阿爾韋托 | 알베르토 | O | あるべると | āěrwéituō | āěrwéituō | alperŭt'o
Qirong | 期栄 | 期荣 | 期榮 | 치룽 | O | きえい | qīróng | qíróng | ch'irung
Akiko | 暁子 | 晓子 | 曉子 | 아키코 | O | あきこ | xiǎozǐ | xiǎozǐ | ak'ik'o
Akiko | 顕子 | 显子 | 顯子 | 아키코 | O | あきこ | xiǎnzǐ | xiǎnzǐ | ak'ik'o
Akiko | 昭子 | 昭子 | 昭子 | 아키코 | O | あきこ | zhāozǐ | zhāozǐ | ak'ik'o
Akira | 明 | 明 | 明 | 아키라 | O | あきら | míng | míng | ak'ira
Deng | 登 | 登 | 登 | 덩 | O | とう | dēng | dēng | tŏng
Einstein | アインスタイン | 爱因斯坦 | 愛因斯坦 | 아인슈타인 | O | あいんすたいん | àiyīnsītǎn | àiyīnsītǎn | ainsyut'ain
Ernest | アーネスト | 欧内斯特 | 歐尼斯特 | 어니스트 | L | あーねすと | ōunèisītè | ōunísītè | ŏnisŭt'ŭ
Gregg | グレッグ | 格雷格 | 葛瑞格 | 그레그 | L | ぐれっぐ | géléigé | gěruìgé | kŭrekŭ
Greg | グレッグ | 格雷格 | 葛瑞格 | 그레그 | L | ぐれっぐ | géléigé | gěruìgé | kŭrekŭ
Haiyang | 海洋 | 海洋 | 海洋 | 하이양 | O | かいよう | hǎiyáng | hǎiyáng | haiyang
Huaiyang | 懐陽 | 怀阳 | 懷陽 | 화이양 | O | かいよう | huáiyáng | huáiyáng | hwaiyang
Jack | ジャック | 杰克 | 傑克 | 잭 | O | じゃっく | jiékè | jiékè | chaek
Jackie | ジャッキー | 杰基 | 傑基 | 재키 | O | じゃっきー | jiéjī | jiéjī | chaek'I
Kennedy | ケネディ | 肯尼迪 | 甘迺迪 | 케네디 | L | けねでぃ | kěnnídí | gānnǎidí | k'eneti
Kaiyang | 開陽 | 开阳 | 開陽 | 카이양 | O | かいよう | kāiyáng | kāiyáng | k'aiyang
Nakajima | 中島 | 中岛 | 中島 | 나카지마 | O | なかじま | zhōngdǎo | zhōngdǎo | nak'achima
William | ウィリアム | 威廉 | 威廉 | 빌리암 | O | うぃりあむ | wēilián | wēilián | pilriam
Zhang | 張 | 张 | 張 | 장 | O | ちょう | zhāng | zhāng | chang
Meet Our CEO
Jack Halpern (春遍雀來), CEO of The CJK
Dictionary Institute, is a lexicographer by
profession. For sixteen years he was engaged in the
compilation of the New Japanese-English Character
Dictionary, and as a research fellow at Showa
Women's University (Tokyo), he was editor-in-chief
of several kanji dictionaries for learners, which have
become standard reference works.
Jack Halpern, who has lived in Japan for over 40 years,
was born in Germany and has lived in six countries
including France, Brazil, Japan and the United States.
An avid polyglot who specializes in Japanese and
Chinese lexicography, he has studied 15 languages
(speaks ten fluently) and has devoted several decades
to the study of linguistics and lexicography.
Jack Halpern has published over twenty books and
dozens of articles and academic papers, mostly on
the Japanese writing system and CJK information
processing, has given over 600 public lectures on Japanese language and culture, and has presented
several dozen papers at international conferences.
On a lighter note, Jack Halpern loves the sport of unicycling. Founder and long-time president of the
International Unicycling Federation, he has promoted the sport worldwide and is a director of the
Japan Unicycling Association. Currently, his passion is playing the quena and improving his Chinese,
Esperanto and Arabic.
Contact Information
The CJK Dictionary Institute, Inc.
Komine Building
34-14, 2-chome, Tohoku, Niiza-shi
Saitama 352-0001
JAPAN
Phone: +81-48-473-3508
Fax: +81-48-486-5032
Email: [email protected]
Web: www.cjk.org