How CJKI's resources are used - The CJK Dictionary Institute
Table of Contents
Overview 概要
Arabic عربي
Chinese 汉语
Japanese 日本語
Korean 한국어
Multilingual 多言語

The CJK Dictionary Institute 日中韓辭典研究所

The CJK Dictionary Institute, Inc. (CJKI) specializes in CJK and Arabic computational lexicography. The institute creates and maintains CJK (Chinese, Japanese and Korean) and Arabic lexical databases currently covering approximately 24 million entries. Located in Saitama, Japan, CJKI is headed by Jack Halpern, editor-in-chief of the world-renowned New Japanese-English Character Dictionary and of various other CJK dictionaries.

CJKI plays a leading role in helping the IT industry penetrate the lucrative East Asian market by providing software developers with high-quality dictionary data. This includes comprehensive databases of general vocabulary, proper nouns and technical terms for CJK languages, including Chinese dialects such as Cantonese and Hakka. CJKI also maintains databases and romanization systems of Arabic proper nouns, a large-scale Spanish-English dictionary, and various multilingual databases of proper nouns and geographic data.

CJKI has become one of the world's prime sources for CJK lexical resources. It is contributing to CJK and Arabic information processing technology by providing high-quality lexical resources and professional consulting services to some of the world's leading software developers and IT companies, including Fujitsu, Sharp, Sony, IBM, Google, Microsoft, Yahoo, Amazon and Baidu.

How CJKI's resources are used

CJKI's team of professional editors and software engineers uses advanced computational lexicography methods to compile and maintain comprehensive lexical databases and dictionaries that include a variety of features for a broad gamut of applications, such as:
• Natural language processing applications such as information retrieval tools, search engine technology and morphological analyzers
• Anti-money laundering and fraud detection
• Security applications such as criminal watch lists
• CJK input method editors (front-end processors)
• Machine translation and online translation tools
• Speech technology applications, both text-to-speech and automatic speech recognition
• Geographical data for multilingual maps, machine translation and tokenization
• Conversion between Simplified and Traditional Chinese
• Electronic dictionaries for desktop and mobile platforms
• Pedagogical, linguistic and computational lexicography research
• Transcription and transliteration applications
• Data cleansing

Doing business with CJKI

CJKI has a flexible business model that is decided on a case-by-case basis to suit the convenience of the customer. We are not "resellers," nor are we "data vendors" -- we are a linguistic institute, and create the data ourselves based on several decades of experience and extensive know-how of CJK and Arabic lexicography.

Our fundamental policy is to customize our databases to the specific requirements of the customer at no extra charge. To achieve this, we study our customers' needs in depth and prepare a data package that meets those needs precisely. We also build custom databases from scratch. We have extensive experience in putting together teams to compile large-scale dictionaries in a short period of time, using our sophisticated tools for automating the compilation process, which significantly reduces costs to the customer.

It is important to note that the benefits of working with CJKI go well beyond cost.
We are flexible in matters of format, delivery dates and business model, work hard to gain an in-depth understanding of the customer's needs, and provide excellent service that includes a reasonable amount of free technical and linguistic consulting as well as free minor upgrades. Licensing data from CJKI is not merely "buying" data -- it is entering into a close relationship that ensures constant advice, technical/linguistic support, upgrades, and reasonable fees.

CJKI's Lexical Resources

CJKI's extensive CJK and Arabic lexical resources currently cover approximately 24 million entries, used by major portals and software developers in a wide variety of applications. Our main resources include:
• Bilingual dictionaries
• Multilingual dictionaries
• Arabic personal names
• Proper nouns and geographical data
• Technical terminology
• Monolingual lexical databases
• Phonetic and phonological databases
• Mapping tables for Chinese conversion
• Morphological databases
• Lexical databases

In addition to the resources described above, CJKI has developed resources containing millions more entries for the following:
• Arabic transcription and vocalization systems
• Arabic, Japanese, and Spanish-English full-form lexicons*
• Arabic place names
• Databases for input method editors
• Frequency statistics based on the web and corpora
• Database for CJK IMEs

* "Full-form lexicon" refers to a comprehensive lexical database that includes every inflected form (verb conjugations, plurals, etc.) and declined form (case endings) of a language. Each full-form lexicon contains millions of entries accompanied by a rich set of grammatical attributes.

Arabic Lexical Resources موارد معجمية للغة العربية

Arabic, one of the six official languages of the United Nations, is spoken by 246 million people worldwide -- not only in North Africa and the Middle East, but also in many other countries since it is the language of the Koran.
Though Arabic has become a world language of critical importance, lexical resources, especially for proper nouns, are either scarce or exist only on a small scale. The CJK Dictionary Institute has been engaged in an intensive effort to develop comprehensive Arabic lexical databases, with special focus on proper nouns. Below is a description of some of CJKI's principal Arabic resources.

Principal Resources
• Database of Arab Names
• Database of Arab Names in Arabic
• Expanded OFAC
• Arabic transcription and romanization systems
• Dictionary of Arabic Place Name Variants
• Dictionary of Arabic Proper Nouns
• Arabic Broken Plurals
• Arabic Transcription and Transliteration
• Arabic Lexical Database
• Arabic Full Form Dictionary

Database of Arab Names قاعدة بيانات الأسماء العربية

CJKI's comprehensive Database of Arab Names (DAN), which currently covers approximately 6.5 million entries, consists of Arabic personal names and name variants mapped to the original Arabic script with a large variety of supplementary information. DAN is based on authoritative resources and has undergone extensive proofreading and expansion based on about 25 million names derived from a large variety of sources, including websites, corpora, books, dictionaries, phone books, and encyclopedias.

Key Features
• 6.5 million validated Arabic name variants
• Ideal for security, anti-money laundering and NLP applications
• Based on over 25 million source names from authoritative resources
• Proofread by native editors trained in Arabic phonology
• Validated against the web and corpora
• Fully vocalized with various variants in Arabic script
• Web-based frequency statistics for each name
• Various romanization systems, such as the official IC standard
• Fully supports OFAC names, their official aliases and unofficial variants

DAN is playing an important role in helping software developers, especially of security applications and NLP tools, to enhance their technology by enabling named entity recognition and extraction, machine translation, variant normalization, and information retrieval of Arabic names.

The table below shows a snippet of the 1,100 variants of عبدالعزيز along with their frequency of occurrence on the web.

Database of Arab Names
ID  SUBID  VARIANT  ARABIC  BUCKWALTER  FREQUENCY
V000010  01140  Abd-Al Azeez  عبدالعزيز  EbdAlEzyz  0000002000
V000010  01141  Abd-Al Azez  عبدالعزيز  EbdAlEzyz  0000000118
V000010  01142  Abd-Al Aziez  عبدالعزيز  EbdAlEzyz  0000000033
V000010  01143  Abd-Al Aziiz  عبدالعزيز  EbdAlEzyz  0000000016
V000010  01144  Abd-Al Aziz  عبدالعزيز  EbdAlEzyz  0000064000
V000010  01145  Abd-Alazeez  عبدالعزيز  EbdAlEzyz  0000000012
V000010  01146  Abd-Alazez  عبدالعزيز  EbdAlEzyz  0000000114
V000010  01147  Abd-Alaziz  عبدالعزيز  EbdAlEzyz  0000002000
V000010  01148  Abd-El'azeez  عبدالعزيز  EbdAlEzyz  0000000052
V000010  01149  Abd-El'azez  عبدالعزيز  EbdAlEzyz  0000000154
V000010  01150  Abd-El'aziz  عبدالعزيز  EbdAlEzyz  0000008000
V000010  01151  Abd-El'eziz  عبدالعزيز  EbdAlEzyz  0000000003
V000010  01152  Abd-El-'Azeez  عبدالعزيز  EbdAlEzyz  0000002000
V000010  01153  Abd-El-'Azez  عبدالعزيز  EbdAlEzyz  0000000014
V000010  01154  Abd-El-'Aziiz  عبدالعزيز  EbdAlEzyz  0000000001
V000010  01155  Abd-El-'Aziz  عبدالعزيز  EbdAlEzyz  0000024000

Database of Arab Names in Arabic

The complexity of the Arabic script gives rise to a variety of Arabic spelling variants and spelling errors. The CJKI Database of Arab Names in Arabic (DANA) covers several hundred thousand Arabic script variants and common spelling mistakes, as shown in the table below.

A key feature of DANA is that every Arabic name is normalized and vocalized to produce a database of error-free, fully sanitized Arabic canonical forms. The vocalization is performed by a team of editors with the aid of tools and interfaces designed to achieve maximum efficiency. The canonical forms are used as a basis for creating both accurate romanized variants for DAN and Arabic orthographic variants for DANA.

Database of Arab Names in Arabic
CANONICAL  VARIANT  FREQUENCY  TYPE
عبدالعزيز  عبدالعزيز  023102030  Normal
عبدالعزيز  عبد العزيز  018868920  Variant
عبدالعزيز  عبد لعزيز  000000019  Error
عبدالعزيز  عبدإلعزيز  000000010  Error
عبدالعزيز  عبدلعزيز  000000003  Error
عبدالعزيز  عبد ألعزيز  000000000  Error
عبدالعزيز  عبد إلعزيز  000000000  Error
عبدالعزيز  عبدألعزيز  000000000  Error

Expanded OFAC

The US government's watch lists have come under fire from members of Congress as being "crippled by technical flaws." One of the major factors behind these assertions is the inability to correctly identify and process the numerous variants of names appearing in the Specially Designated Nationals (SDN) list maintained by the Office of Foreign Assets Control (OFAC).

To address these shortcomings, CJKI has drawn on its linguistic and technical resources to develop a comprehensive Expanded OFAC database (XOFAC) of OFAC full name variants, the vast majority of which are not listed in OFAC. Containing millions of potential and actual variants of the Arab names in OFAC's SDN List, XOFAC is ideal for those agencies and institutions that require maximum recall in their compliance and watch list filtering applications.
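
The way a full name fans out into "potential and actual" variants can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CJKI's method: it assumes each name component comes with a list of romanized variants and web frequencies (the figures are copied from the XOFAC sample table for Hatim Ahmad BARAKAT in this section), forms every combination, and orders the results by the product of the component frequencies. The function name `expand_full_name` and the scoring rule are hypothetical; the source does not state how XOFAC computes its ranking.

```python
from itertools import product
from math import prod

# Component variants with web frequencies, copied from the XOFAC sample table.
HATIM = {"Hatem": 1_580_000, "Hatim": 925_000, "Khadem": 194_000}
AHMAD = {"Ahmed": 39_000_000, "Ahmad": 25_600_000, "Ahmet": 18_400_000}
BARAKAT = {"Barakat": 1_180_000, "Bereket": 651_000, "Bareket": 57_200}


def expand_full_name(*components):
    """Return (full_name, score) pairs for every combination of component variants.

    The score is the product of the component frequencies. This ranking rule is
    an assumption made for illustration; XOFAC's actual formula is not given.
    """
    rows = []
    for combo in product(*(c.items() for c in components)):
        name = " ".join(variant for variant, _ in combo)
        score = prod(freq for _, freq in combo)
        rows.append((name, score))
    # Highest combined frequency first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


variants = expand_full_name(HATIM, AHMAD, BARAKAT)
for rank, (name, score) in enumerate(variants[:5], start=1):
    print(rank, name, score)
```

Three variants per component already yield 27 full names; with a few dozen variants per component the number of combinations grows into the hundreds of thousands, which is why frequency-based ranking matters for watch list filtering.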
In contrast to DAN, XOFAC consists of variants of full names only; in other words, names of actual and potential individuals and their variants. For example, the table below lists the top 15 out of about 130,000 actual and potential variants of the OFAC name Hatim Ahmad BARAKAT. The table has the following fields:
Rank: relative ranking based on component frequencies
Variant: variants of the OFAC name, mostly not appearing in the OFAC list
Freq1: frequency of occurrence on the web of Hatim variants
Freq2: frequency of occurrence on the web of Ahmad variants
Freq3: frequency of occurrence on the web of Barakat variants

Expanded OFAC
RANK  VARIANT  FREQ1  FREQ2  FREQ3
1  Hatem Ahmed Barakat  001580000  039000000  001180000
2  Hatim Ahmed Barakat  000925000  039000000  001180000
3  Hatem Ahmed Bereket  001580000  039000000  000651000
4  Hatem Ahmad Barakat  001580000  025600000  001180000
5  Khadem Ahmed Barakat  000194000  039000000  001180000
6  Hatem Ahmed Bareket  001580000  039000000  000057200
7  Hattem Ahmed Barakat  000180000  039000000  001180000
8  Hatem Ahmed Berekat  001580000  039000000  000033400
9  Hatem Ahmet Barakat  001580000  018400000  001180000
10  Hadim Ahmed Barakat  000114000  039000000  001180000
11  Hatem Ahmed Baraket  001580000  039000000  000016300
12  Hatam Ahmed Barakat  000081300  039000000  001180000
13  Hatem Ahmed Barakaat  001580000  039000000  000014300
14  Hatem Achmed Barakat  001580000  000777000  001180000
15  Hetem Ahmed Barakat  000065600  039000000  001180000

Database of Arabic Business Names

The Database of Arabic Business Names (DABNA) is a large-scale database of Arabic company and organization names and addresses in the Arabic script, along with their romanized and/or English equivalents, web frequencies, and other attributes.
Company names play an important role in Arabic natural language processing applications including named entity recognition (NER), machine translation (MT), and morphological analysis (MA), as well as in a variety of business intelligence applications and security applications such as watch list querying. To meet this need, DABNA is continuously expanded and revised by a team of editors trained in Arabic name processing and Arabic phonology to keep it up to date. The sample below shows some Egyptian business names (companies and addresses), along with their English equivalents.

BUSINESS NAME  DISTRICT  ADDRESS
جلال إسماعيل مراد  القبه  57 ش الخليفه المامون
Jalal Isma'il Murad  al-Qubbah  57 Caliph al-Ma'mun St.
أحمد شكرى مصطفى  الماظه  34 شارع العروبه شقه 3
Ahmad Shakri Mustafa  Almaza  34 al-Orouba Street Flat 3
مكتبه طلعت حرب  الروضه  ش السيده نفيسه
Tal'at Harb Bookshop  Rhoda Island  al-Sayyidah Nafisah St.
طلعت حرب مول  رمسيس  30 ش طلعت حرب وسط البلد
Tal'at Harb Mall  Ramses  30 Tal'at Harb St. Downtown
أسامة و هانى  الماظه  36 ش النزهه مصر الجديده
Usamah and Hani  Almaza  36 Nozha St. Heliopolis
أبو شقرة  الروضه  69 شارع القصر العينى
Abu-Shaqrah  Rhoda Island  69 al-Qasr al-'Ayni Street
مون ارش للاثاث و الديكور  الدقى  50 ش نادى الصيد الدقى
Moon Arch  Dokki  50 Nadi al-Sayd St. Dokki
حسين يوسف  المعادى  جراند مول المعادى
Husayn Yusuf  Maadi  Grand Mall Maadi
حسين و على  الماظه  17 ش بغداد مصر الجديده
Husayn and 'Ali  Almaza  17 Baghdad St. Heliopolis
شريف محمد طلعت الغنيمى  باب اللوق  32 ش الفلكى
Sharif Muhammad Tal'at al-Ghanimi  Bab al-Louq  32 al-Falaki St.

Database of Foreign Names in Arabic

The Database of Foreign Names in Arabic (DAFNA) is a large-scale database of non-Arab names written in the Arabic script, along with their romanized variants, web frequencies, and other attributes.
Personal names play an important role in Arabic natural language processing applications including named entity recognition (NER), machine translation (MT), and morphological analysis (MA), as well as in a variety of business intelligence applications and security applications such as watch list querying. To meet this need, DAFNA is continuously being expanded and revised by a team of editors trained in Arabic name processing and Arabic phonology. The sample below shows orthographic variants and spelling errors of a common American given name (John), and a common American surname (Davis). The original American name data was obtained from the U.S. Census Bureau.

Database of Foreign Names in Arabic
ENGLISH  ARABIC  TYPE  WEB FREQ (English+Arabic)  WEB FREQ (Arabic only)
John  جوون  M  0036500  0044500
John  جون  M  0032700  0947000
John  جان  M  0031300  2160000
John  جوهان  M  0000224  0007090
John  جوهن  M  0000173  0001180
John  دجون  M  0000029  0001680
John  جهون  M  0000009  0000328
Davis  ديفيس  S  0000613  0012300
Davis  دافيس  S  0000249  0001680
Davis  ديفز  S  0000228  0002300
Davis  ديفس  S  0000157  0002020
Davis  دايفس  S  0000040  0000652
Davis  دفيس  S  0000034  0000490
Davis  دفيز  S  0000005  0000098

Chinese Lexical Resources 汉语词汇资源

CJKI's comprehensive Chinese lexical resources currently include over four million entries, covering general vocabulary, technical terminology, proper nouns, company and organization names, and others, in both Simplified Chinese (SC) and Traditional Chinese (TC), used in such applications as machine translation (MT), information retrieval (IR) and input method editors (IME). They include a rich set of grammatical, phonological and semantic attributes, including pinyin and zhuyin readings, part-of-speech codes, frequency of occurrence statistics, and others. Below is a description of CJKI's principal Chinese lexical resources.

Principal Resources
• Simplified Chinese-English Dictionary
• English-Simplified Chinese Dictionary
• Chinese-English Database of Proper Nouns
• Database of Chinese Name Variants
• Chinese Dictionary of Computer Terms
• Chinese Lexical Database
• Chinese Pinyin Database
• Chinese to Chinese Conversion
• Hanzi-Pinyin Transcription System
• Chinese-English Technical Terms
• Chinese Morphological Database
• Chinese Lexical Frequency Statistics
• English-Traditional Chinese Dictionary
• Chinese IME Databases

Chinese-English Dictionary 汉英词典 简体版

CJKI's Simplified Chinese-English Dictionary (SCED) is the most comprehensive Chinese dictionary available today. Covering over 700,000 entries of general vocabulary, technical terms, important proper nouns and example sentences, SCED was compiled in collaboration with lexicographers from a leading Chinese university on the basis of the most authoritative dictionaries published in China. This dictionary, which is without peer, has undergone extensive proofreading and validation by a team of native Chinese editors.
It is ideally suited for:
• Machine translation dictionaries
• Cross-language information retrieval
• Handheld electronic dictionaries
• Mobile device applications

Simplified Chinese-English Dictionary
CHINESE  POS  PINYIN  ENGLISH
国丧  N  guósāng  national mourning
国乐  N  guóyuè  traditional Chinese music
国书  N  guóshū  letter of credence; credentials; letter of commission
国书付本  N  guóshū fùběn  copy of credentials
国产设备  N  guóchǎn shèbèi  home equipment
国产货  N  guóchǎnhuò  domestic products; domestic goods
国产轿车  N  guóchǎn jiàochē  domestic-made cars
国产影片  N  guóchǎn yǐngpiàn  Chinese film
国产税  N  guóchǎnshuì  excise duties
国产装备  N  guóchǎn zhuāngbèi  Chinese-made equipment
国产品  N  guóchǎnpǐn  home products; national products; domestic products
国优产品  N  guóyōu chǎnpǐn  national quality product
国债  N  guózhài  national debt; government loan; public debt; national bonds

English-Chinese Dictionary 英汉词典 简体版

CJKI's English-Simplified Chinese Dictionary (ESCD) covers about 100,000 entries including general vocabulary and important proper names. Optimized for the convenience of users of electronic dictionaries, ESCD has just the right amount of detail: enough equivalents to give an in-depth understanding, yet short enough not to clutter up the screen. ESCD is being used in such well-known translation tools as Babylon and Quicktionary, as well as on various mobile platforms around the world including TangoTown in Japan and Australia.

English-Simplified Chinese Dictionary
ENGLISH  POS  CHINESE
canoe  v. tr.  用独木舟载运
cay  n.  岩礁, 沙洲, 珊瑚礁
cheddar  n.  切德干酪
clad  v. tr.  电镀
codger  n.  怪人; 有怪癖的人
Comoros  NP  科摩罗
congruence  n.  适合, 相合性, 一致
couch  v. intr.  躺着; 埋伏; 蹲着
crew cut  n.  平头发式
cutthroat  adj.  残酷的; 杀人的
decimal  n.  小数
depersonalize  v. tr.  使失去人性
dial  v. tr.  调; 拨; 收听, 收视; 打电话给
discrepancy  n.  相差; 矛盾; 差异
divide  n.  分歧, 不和; 分水岭
dramatize  v. intr.  戏剧化; 可改编成剧本; 举止夸张
emotional  adj.  情绪的; 情感的

An English-Traditional Chinese Dictionary is also available.

Chinese↔English Proper Nouns 汉英专有名词数据库

CJKI's Chinese↔English Database of Proper Nouns (CEP) is very comprehensive, covering millions of entries in both Simplified and Traditional Chinese. It includes various data fields such as pinyin, zhuyin, frequency rankings, classification codes, locale codes and English equivalents. Included are a large variety of both Chinese and non-Chinese name types, such as:
• Place names
• Personal names (surnames and given names)
• Companies and organizations
• Facilities and points of interest
• Western personal and place names
• Miscellaneous, such as periodicals and abbreviations

Chinese Personal Names
TYPE  SIMPLIFIED CHINESE  TRADITIONAL CHINESE  PINYIN
G  桂花  桂花  guìhuā
S  鄂  鄂  è
G  尔和  爾和  ěrhé
S  戚  慼  qī
G  联谊  聯誼  liányì
G  亚军  亞軍  yàjūn
G  军营  軍營  jūnyíng
GS  耽  耽  dān
G  嗣  嗣  sì
G  耕耘  耕耘  gēngyún
G  庄稼  莊稼  zhuāngjia
G  和全  和全  héquán
S  侬  儂  nóng
G  之遥  之遙  zhīyáo
S  刁  刁  diāo
S  司马  司馬  sīmā
G  津津  津津  jīnjīn

Chinese and non-Chinese Place Names
ENGLISH  SC  L/O  TC  PINYIN  ZHUYIN
Aruba  阿鲁巴  L  阿盧巴  ālǔbā  ㄚㄌㄨˊㄅㄚ
Azerbaijan  阿塞拜疆  L  亞塞拜然  āsāibàijiāng  ㄧㄚˋㄙㄜˋㄅㄞˋㄖㄢˊ
Brasilia  巴西利亚  O  巴西利亞  bāxīlìyà  ㄅㄚㄒㄧㄌㄧˋㄧㄚˋ
Caracas  加拉加斯  L  卡拉卡斯  jiālājiāsī  ㄎㄚˇㄌㄚㄎㄚˇㄙ
Cairo  开罗  O  開羅  kāiluó  ㄎㄞㄌㄨㄛˊ
Chad  乍得  L  查德  zhàdé  ㄔㄚˊㄉㄜˊ
Fukuoka  福冈  O  福岡  fúgāng  ㄈㄨˊㄍㄤ
Georgia  乔治亚  O  喬治亞  qiáozhìyà  ㄑㄧㄠˊㄓˋㄧㄚˋ
Guinea  几内亚  O  幾內亞  jǐnèiyà  ㄐㄧˇㄋㄟˋㄧㄚˋ
Haiyan  海盐  O  海鹽  hǎiyán  ㄏㄞˇㄧㄢˊ
Hanyang  汉阳  O  漢陽  hànyáng  ㄏㄢˋㄧㄤˊ
Heshan  鹤山  O  鶴山  hèshān  ㄏㄜˋㄕㄢ
Huailai  怀来  O  懷來  huáilái  ㄏㄨㄞˊㄌㄞˊ
Ireland  爱尔兰  O  愛爾蘭  ài'ěrlán  ㄞˋㄌㄧㄣˊ

L: lexemic mapping
O: orthographic mapping (see Chinese Dictionary of Computer Terms for detail)

Chinese Name Variants 汉语人名罗马字异形数据库

The number of Chinese personal names and their variants is very large -- in the millions -- which makes them difficult to identify and process. Named Entity Recognition (NER) technology is a hot topic in computational linguistics. To enhance NER technology, CJKI maintains databases of several million CJK and Arabic name variants in all major and most minor romanization systems.
There are several well-established systems for romanizing Chinese, such as Hanyu Pinyin, Wade-Giles, Yale, and Tongyong Pinyin, as well as various popular ones and many older ones that have fallen out of use. Chinese has seven major dialect groups, and another four minor ones. The CJKI Database of Chinese Name Variants (CNV) includes Chinese personal names in all the standard and dialectal Chinese romanization systems, covering all the major dialects, including Cantonese, Hakka and Hokkien, together with classification codes and frequency of occurrence statistics.

Chinese Name Variants
CHINESE  PINYIN  ZHUYIN  ENGLISH  TONGYONG  YALE  WADE-GILES  VARIANTS
百欣  bǎixīn  ㄅㄞˇㄒㄧㄣ  Baixin  Baisin  Baisyin  Paihsin  Paisin
白  bái  ㄅㄞˊ  Bai  Bai  Bai  Pai
北强  běiqiáng  ㄅㄟˇㄑㄧㄤˊ  Beiqiang  Beiciang  Beichyang  Peich'iang  Peits'iêng Peichiang Peitsiêng
炳章  bǐngzhāng  ㄅㄧㄥˇㄓㄤ  Bingzhang  Bingjhang  Bingjang  Pingchang
宝程  bǎochéng  ㄅㄠˇㄔㄥˊ  Baocheng  Baocheng  Baucheng  Paoch'eng  Paocheng
爱华  àihuá  ㄞˋㄏㄨㄚˊ  Aihua  Aihua  Aihwa  Aihua  Ngaihua
伯芝  bózhī  ㄅㄛˊㄓ  Bozhi  Bojhih  Bwojr  Pochih
长流  chángliú  ㄔㄤˊㄌㄧㄡˊ  Changliu  Changliou  Changlyou  Ch'angliu
邦达  bāngdá  ㄅㄤㄉㄚˊ  Bangda  Bangda  Bangda  Pangta
曹  cáo  ㄘㄠˊ  Cao  Cao  Tsau  Ts'ao
冰晓  bīngxiǎo  ㄅㄧㄥㄒㄧㄠˇ  Bingxiao  Bingsiao  Bingsyau  Pinghsiao  Pingsiao
百成  bǎichéng  ㄅㄞˇㄔㄥˊ  Baicheng  Baicheng  Baicheng  Paich'eng  Paicheng

Chinese↔English Computer Terms 英汉计算机术语词典

CJKI's Chinese↔English Dictionary of Computer Terms (ECCT) is an English-Chinese and Chinese-English dictionary containing about 100,000 Simplified Chinese (SC) and 100,000 Traditional Chinese (TC) entries, including acronyms. This dictionary covers both SC, used in the People's Republic of China and Singapore, and TC, used in Taiwan, Hong Kong and among overseas Chinese. It has several features that distinguish it from any other Chinese computer dictionary available today.
• Covers about 100,000 entries selected on the basis of frequency statistics.
• Constantly updated and expanded to include recent terms.
• Contains more than 10,000 acronyms cross-referenced to the expanded forms.
• Linguistically accurate TC equivalents (explained below).

The above features make this dictionary an invaluable tool for translators and for use in various IT applications such as information retrieval, machine translation, and input method editors.

The TC in ECCT is not merely a code-conversion of the SC, but has been carefully proofread to ensure accuracy on both the orthographic and lexemic levels. An example of orthographic conversion, marked "O" in the table below, is 目录 'directory' converted to 目錄. An example of lexemic conversion, marked "L" in the table below, is 计算机 'computer', which is converted not character by character but to the distinct word 電腦 in TC.

Chinese-English Computer Terms
ENGLISH  SIMPLIFIED  TRADITIONAL  TYPE
file  文件  檔案  L
Internet  因特网  網際網路  L
program  程序  程式  L
CD-ROM  光盘  光碟  L
information  信息  資訊  L
computer network  计算机网络  電腦網路  L
modulator/demodulator  调制解调器  調變解調器,數據機  L
modem  调制解调器  調變解調器,數據機  L
computer software  计算机软件  電腦軟體  L
database  数据库  資料庫  L
flowcharting  流程图编制  繪製流程圖  L
expert system  专家系统  專家系統  O
directory  目录  目錄  O

Chinese Lexical Database 汉语词汇数据库

The CJKI Chinese Lexical Database (CLD) is a comprehensive monolingual lexical database of Chinese consisting of the Simplified Chinese Lexical Database (CLD-SC) and the Traditional Chinese Lexical Database (CLD-TC) modules. Developed by CJKI's team of experienced Chinese editors and linguists over many years, the CLD is a significant contribution to the field of Chinese lexicography.

CLD is especially suitable for applications in the fields of information retrieval, morphological analysis, machine translation and various natural language processing (NLP) applications, and is being used by various IT companies to enhance their Chinese segmentation technology.

Chinese Lexical Database
POS  TYPE  CHINESE  PINYIN  RANK  WEB RANK
NP  G  东霞  dōngxiá  C  000205863
NP  P  东会村  dōnghuìcūn  C  000331481
NC  东海  dōnghǎi  A  000009255
NP  G  东海  dōnghǎi  A  000009255
NP  P  东海  dōnghǎi  A  000009255
NP  P  东海县  dōnghǎixiàn  C  000078031
E  东海扬尘  dōnghǎiyángchén  C  000263750
E  东海捞针  dōnghǎilāozhēn  C  000124028
U  东海舰队  dōnghǎijiànduì  000064698
E  东海桑田  dōnghǎisāngtián  C  000090763
NP  Oe  东海大学  dōnghǎidàxué  C  000069472
NP  P  东外大街  dōngwàidàjiē  C  000166158
NC  东郭  dōngguō  C  000069927
NP  S  东郭  dōngguō  C  000069927
E  东郭先生  dōngguōxiānshēng  C  000101330
NC  东郭履  dōngguōlǚ  C  000267748
NP  P  东革新里  dōnggéxīnlǐ  C  000234655
NC  东岳  dōngyuè  C  000065236
NP  G  东岳  dōngyuè  C  000065236

Chinese Pinyin Database 汉语拼音数据库

The CJKI Chinese Pinyin Database (CPD) contains several million Simplified Chinese (SC) and Traditional Chinese (TC) headwords covering general vocabulary, technical terms, and proper nouns. Each lexeme is accompanied by pinyin readings for SC and both pinyin and zhuyin (not shown here) for TC.

What is especially noteworthy is that the pinyin/zhuyin readings take into account the differences in pronunciation between Taiwan and the PRC, as shown in the table below. Even highly educated native Chinese speakers are often surprised to discover that such differences exist.

An important feature of this database is its high accuracy and its explicit indication of the neutral tone, which is often ignored by conventional dictionaries. The data can be provided in all the major transcription systems such as Yale, Wade-Giles, and Tongyong Pinyin. An IPA edition, especially useful for speech technology applications such as TTS, is now under development.

The DIFF field below indicates whether pairs of SC-TC equivalents have identical pinyin. "D" indicates that the pinyin is different; "S" indicates that the pinyin is the same.
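
The DIFF flag described above amounts to a comparison of the two readings. The following is a minimal Python sketch, assuming each record carries one SC and one TC pinyin string; the `diff_flag` helper and the record layout are hypothetical, and the sample readings are taken from the Chinese Pinyin Database table.

```python
import unicodedata


def diff_flag(sc_pinyin: str, tc_pinyin: str) -> str:
    """Return "S" if the SC and TC readings are identical, otherwise "D"."""
    # Normalize to NFC so precomposed and combining tone marks compare equal.
    sc = unicodedata.normalize("NFC", sc_pinyin)
    tc = unicodedata.normalize("NFC", tc_pinyin)
    return "S" if sc == tc else "D"


# (SC hanzi, SC pinyin, TC hanzi, TC pinyin)
records = [
    ("企业", "qǐyè", "企業", "qìyè"),
    ("咖啡豆", "kāfēidòu", "咖啡豆", "kāfēidòu"),
    ("期待", "qīdài", "期待", "qídài"),
]
for sc, sc_py, tc, tc_py in records:
    print(diff_flag(sc_py, tc_py), sc, sc_py, tc, tc_py)
```

Normalizing before the comparison is a design choice for this sketch: tone-marked vowels can be encoded in more than one way, and a byte-level comparison would otherwise report a spurious "D".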

Chinese Pinyin Database
DIFF  SC HANZI  SC FREQUENCY  SC PINYIN  TC HANZI  TC FREQUENCY  TC PINYIN
D  临期  0000029000  línqī  臨期  0000028800  línqí
D  企业  0163000000  qǐyè  企業  0102000000  qìyè
D  倬雄  0000000167  zhuōxióng  倬雄  0000000167  zhuóxióng
S  咖啡豆  0000779000  kāfēidòu  咖啡豆  0000779000  kāfēidòu
D  危险  0022400000  wēixiǎn  危險  0003080000  wéixiǎn
D  埒城  0000000411  lièchéng  埒城  0000000411  lèchéng
D  夕日  0002020000  xīrì  夕日  0002020000  xìrì
D  大期  0000061500  dàqī  大期  0000061500  dàqí
D  帆柱  0000030600  fānzhù  帆柱  0000030600  fánzhù
D  微笑  0018400000  wēixiào  微笑  0018400000  wéixiào
S  无着  0000265000  wúzhuó  無著  0000265000  wúzhuó
D  咖喱粉  0000087400  gālífěn  咖喱粉  0000087400  kālǐfěn
D  昔日  0004880000  xīrì  昔日  0004880000  xírì
D  显微镜  0003390000  xiǎnwēijìng  顯微鏡  0000228000  xiǎnwéijìng
D  期待  0059100000  qīdài  期待  0059100000  qídài
D  咖喱饭  0000122000  gālífàn  咖喱飯  0000122000  kālǐfàn
D  池穴  0000059400  chíxué  池穴  0000059400  chíxuè
D  理发  0002170000  lǐfà  理髮  0000495000  lǐfǎ

Chinese-to-Chinese Conversion 中文简繁转换

A common fallacy is that there is a straightforward correspondence between Simplified Chinese (SC) and Traditional Chinese (TC), and that conversion between the two merely requires mapping from one character set to another. In fact, code-conversion from SC to TC will often lead to errors on both the orthographic and lexemic levels. An example of orthographic conversion is 头发 'hair' converted to 頭髮, in which 頭 and 髮 are the traditional equivalents of 头 and 发 respectively. An example of lexemic conversion is SC 激光 'laser' converted to 雷射 in TC, a distinct word of identical meaning.

CJKI ranks among the world's foremost experts on Simplified to/from Traditional Chinese conversion, and has in-depth knowledge of Chinese segmentation issues, having collaborated with Chinese universities such as Beijing Language and Culture University. Our comprehensive SC to/from TC mapping tables, developed over a period of about 12 years, have several million entries, the largest in existence.
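
The two levels of conversion described above can be sketched as a word-first lookup with a character-level fallback. This is a minimal Python illustration, not CJKI's implementation: the tiny `WORD_MAP` and `CHAR_MAP` tables, the function name `sc_to_tc`, and the greedy longest-match strategy are all assumptions made for the example. The word entries 激光 and 头发 come from the text; 软件 is from the computer terms table, and the remaining character entries are ordinary SC/TC pairs added so the example runs.

```python
# Word-level table, consulted first (longest match wins). It holds both
# lexemic mappings (a different word in TC) and orthographic mappings that
# need word context to pick the right traditional character.
WORD_MAP = {
    "激光": "雷射",  # lexemic: 'laser'
    "软件": "軟體",  # lexemic: 'software'
    "头发": "頭髮",  # orthographic: 发 becomes 髮 in this word
}

# Character-level fallback, i.e. plain code conversion.
CHAR_MAP = {"头": "頭", "发": "發", "软": "軟", "开": "開"}

MAX_WORD_LEN = max(len(word) for word in WORD_MAP)


def sc_to_tc(text: str) -> str:
    """Convert SC text to TC using greedy longest-match word lookup."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += length
                break
        else:
            # No word-level entry: map the single character, or keep it as is.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(sc_to_tc("头发"))      # word-level entry gives 頭髮
print(sc_to_tc("开发软件"))  # 开发 falls back to characters, 软件 is lexemic
```

With the character table alone, 头发 would come out as 頭發, because 发 corresponds to more than one traditional character; that is the kind of orthographic error the word-level table prevents.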
The table below illustrates lexemic mappings of computer terms between SC and TC.

Chinese to Chinese Conversion Technology
ENGLISH  SIMPLIFIED  TRADITIONAL
File  文件  檔案
CD-ROM  光盘  光碟
Data  数据  資料
Compatibility  兼容性  相容性
Information  信息  資訊
Software  软件  軟體
Message  消息  訊息
Camera  摄像机  攝影機
Recording/Burning  刻録  錄製
Drive  驱动器  光碟機
Audio frequency  音频  音訊
Memory  存储  儲存
Video frequency  视频  視訊
Compatible  兼容  相容
Rewritable  可擦写  可重寫
Optical drive  光驱  燒錄機

Japanese Lexical Resources 日本語語彙資源

CJKI's comprehensive Japanese lexical databases and dictionaries currently include nearly seven million entries, covering general vocabulary, technical terminology, proper nouns, company and organization names, katakana loanwords, and others. These include a rich set of grammatical, phonological and semantic attributes, including readings, part-of-speech codes, conjugation and inflection pattern codes, orthographic variants, various frequency statistics, and others. Below is a description of some of CJKI's principal Japanese lexical resources.

Principal Resources
• Japanese Lexical Database
• Japanese-English Database of Proper Nouns
• Japanese Morphological Database
• Japanese Orthographical Database
• Japanese-English Dictionary of Technical Terms
• Database of Japanese Name Variants
• CJKI Japanese-English Dictionary
• CJKI English-Japanese Dictionary
• Japanese Phonetic Database
• Japanese-Chinese Dictionary of Technical Terms
• Katakana Lexical Database
• Japanese Company Names
• Japanese Lexical Frequency Statistics
• Kanji-English Dictionaries
• Japanese IME Databases
• Japanese Full Form Dictionary

Japanese Lexical Database 日本語語彙データベース

The CJKI Japanese Lexical Database (JLD) is a comprehensive monolingual lexical database that includes a rich set of grammatical attributes fine-tuned for NLP applications such as machine translation, information retrieval and morphological analysis. It contains about 400,000 entries covering general vocabulary, both free forms and bound forms.
Developed by CJKI's team of experienced Japanese editors and linguists over more than a decade, the JLD is a significant contribution to the field of Japanese lexicography. It is highly recommended to supplement JLD with our Japanese Orthographical Database (JOD).

Sample of Japanese Lexical Database
HEADWORD  READING  POS  SUBPOS  CONJ  TYPE  VALENCY  SCRIPT  HEPBURN
掛かる  かかる  V5  -  R  i  0  J  kakaru
仮定  かてい  VN  M  -  -  0  J  katei
がぶ飲み  がぶのみ  VN  t  0  J  gabunomi
がま口  がまぐち  NC  -  -  0  J  gamaguchi
がましげ  がましげ  FS  -  -  1  J  gamashige
がましさ  がましさ  WS  1  J  gamashisa
がらがら  がらがら  D  0  J  garagara
がらがら  がらがら  VN  0  J  garagara
がらくた  がらくた  NC  0  J  garakuta
がらっと  がらっと  D  0  J  garatto
がらっぱち  がらっぱち  AN  0  J  garappachi
がらっぱち  がらっぱち  NC  0  J  garappachi
がわり  がわり  WS  1  J  gawari
がんがん  がんがん  D  0  J  gangan
がんがん  がんがん  VN  0  J  gangan
がんとして  がんとして  D  0  J  gantoshite
下がる  さがる  V5  -  R  i  0  J  sagaru
寒い  さむい  AJ  -  -  -  0  J  samui
何故なら  なぜなら  J  -  -  -  0  J  nazenara

Japanese↔English Proper Nouns 日英固有名詞データベース

The CJKI Japanese↔English Database of Proper Nouns (JEP) is very comprehensive, covering millions of entries. It includes various data fields such as hiragana and romanized readings, frequency rankings, classification codes and locale codes, orthographic variants, English equivalents, and more. Included are a large variety of both Japanese and non-Japanese name types, such as:
• Place names
• Personal names (surnames and given names)
• Companies and organizations
• Western personal and place names
• Facilities (stations, roads, hotels) and points of interest
• Detailed geographic data, especially for Japan
Japanese Personal Names

TYPE | NAME | READING | ENGLISH | RANK
S | 永福 | ながふく | Nagafuku | 36072
S | 蟻原 | ありばら | Aribara | 18269
S | 橋詰 | はしつめ | Hashitsume | 27721
S | 橋詰 | はしづめ | Hashizume | 11691
FM | 加名見 | かなみ | Kanami | 37988
M | 海修 | かいしゅう | Kaishu | 37988
F | 季絵 | きえ | Kie | 9317
M | 光喜 | こうき | Koki | 521
M | 好洋 | こうよう | Koyo | 37988
M | 幸喜 | さちき | Sachiki | 89487
M | 幸喜 | こうき | Koki | 3085
M | 幸喜 | ゆきよし | Yukiyoshi | 82511

Western Personal Names

TYPE | JAPANESE | READING | LATIN
S | アントニア | あんとにあ | Antonia
S | イーザー | いーざー | Iser
S | ウッドフォード | うっどふぉーど | Woodford
S | ウスペンスキー | うすぺんすきー | Ouspensky
G | シェーラ | しぇーら | Sheila
S | シェフェール | しぇふぇーる | Schaeffer
S | シャルコー | しゃるこー | Charcot
S | シュタードレン | しゅたーどれん | Stadlen
S | タラソワ | たらそわ | Tarasova
G | ニコライ | にこらい | Nikolai
G | マジョリー | まじょりー | Marjorie
G | メルビン | めるびん | Melvin

Japanese Place Names

NAME | READING | ENGLISH
芦別市 | あしべつし | Ashibetsu-shi
朝鍋川 | あさなべがわ | Asanabegawa
旭上町 | あさひかみまち | Asahikamimachi
朝日ヶ丘 | あさひがおか | Asahigaoka
名鉄常滑線 | めいてつとこなめせん | Meitetsu Tokoname Line
成田国際空港 | なりたこくさいくうこう | Narita International Airport
斐川インターチェンジ | ひかわいんたーちぇんじ | Hikawa IC
青梅街道 | おうめかいどう | Ome-Kaido
川崎区役所 | かわさきくやくしょ | Kawasaki-ku Ward Office
京都ブライトンホテル | きょうとぶらいとんほてる | Kyoto Brighton Hotel
横浜カントリークラブ | よこはまかんとりーくらぶ | Yokohama C.C.
駒沢オリンピック公園 | こまざわおりんぴっくこうえん | Komazawa Olympic Park

Western Place Names

JAPANESE | READING | LATIN
東ベルリン | ひがしべるりん | East Berlin
ウィンズローパーク | うぃんずろーぱーく | Winslow Park
エッセン | えっせん | Essen
オークブルック | おーくぶるっく | Oak Brook
オファーレル | おふぁーれる | O'Farrell
サザンプトン島 | さざんぷとんとう | Southampton Island
バヌアツ共和国 | ばぬあつきょうわこく | Republic of Vanuatu

Japanese Companies and Organizations

JAPANESE | READING | ENGLISH
海外旅行開発 | かいがいりょこうかいはつ | Overseas Tour Promotion, Inc.
宮下機料店 | みやしたきりょうてん | Miyashita Kiryoten
大豊建設 | だいほうけんせつ | Daiho Corporation
南急モータース | なんきゅうもーたーす | Nankyu Motors Co., Ltd.
富士見産業 | ふじみさんぎょう | Fujimi Sangyo Co., Ltd.
緑営バイオ | りょくえいばいお | Ryokuei Bio Co., Ltd.

Japanese Morphological Database 日本語連接属性データベース

The Japanese Morphological Database (JMD) contains various morphological attributes such as derivational attributes, suffixes and prefixes, word elements (bound morphemes) and binding valency.
These are particularly useful for disambiguating and identifying Japanese lexemes in such applications as segmentors, morphological analyzers, input method editors (IME) and search engine query processing. JMD is designed to significantly enhance segmentation and tokenization accuracy by making it possible to reliably identify compound words not in the lexicon. It consists of various components:

A detailed list of verb and adjective stem variants
A detailed list of verb and adjective inflectional endings
A detailed list of auxiliaries attached to verbs and adjectives
A database of affixes with adjacency attributes, essential for identifying out-of-vocabulary (OOV) lexemes, such as 処理済み shorizumi from 処理 + 済み

Adjacency Attributes

AFFIX READING POS SUB-POS VALENCY RANK BEFORE AFTER RESULT
気味 ぎみ WS M 1 67900 NC VC AN
染 ぞめ WS 1 61089 NC NP NC
染みる じみる WS 1 61089 NC V1
染める しめる WS 1 61089 VC V1
平 だいら WS 1 61089 NP NP
平 ひら WP 1 61089 NC NC
平成 へいせい NE 0 61089 NN NC
別 べつ FS 0 331
別 べつ WP 1 331
片 かた WP 1 28538
片 へん WS 1 25149 NC NC
片 ぺん WS 1 61089 NN NC
編 へん WS 1 8970 NC NP NC
編み あみ WS 1 61089 NX NC
返す かえす WS 1 2476 VC V5
返る かえる WS 1 61089 VC V5
便 びん WS 1 3030 NC NN NC
S N NC NC NC VC NC NC V NC

Japanese Orthographical Database 日本語異表記データベース

The orthographical complexity of Japanese poses a special challenge to the developers of computational linguistic tools, especially in the area of intelligent information retrieval. These difficulties are exacerbated by the lack of a standardized orthography and by the highly irregular nature of Japanese spelling. The CJKI Japanese Orthographical Database (JOD) plays a critical role in enhancing the accuracy of information retrieval, machine translation and morphological analysis applications, as it helps identify and disambiguate the numerous Japanese orthographic variants that have identical meanings, such as neko ‘cat’ written 猫, ねこ or ネコ, and kakiarawasu ‘write out, publish’ written 書き著す, 書著す, 書き著わす or 書著わす.
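As a rough illustration of how variant normalization can improve search recall, the sketch below folds the spellings of neko and kakiarawasu into a single canonical form before indexing or querying. The dictionary layout, the function names and the choice of canonical form are assumptions made for this example; they do not describe the actual JOD format.

```python
# Hypothetical variant groups. Only the spellings come from the text above;
# which spelling serves as the canonical form is an assumption.
VARIANT_GROUPS = {
    "猫": ["猫", "ねこ", "ネコ"],
    "書き著す": ["書き著す", "書著す", "書き著わす", "書著わす"],
}


def build_variant_index(groups):
    """Invert {canonical: [variants]} into {variant: canonical}."""
    index = {}
    for canonical, variants in groups.items():
        for variant in variants:
            index[variant] = canonical
    return index


def normalize_terms(terms, index):
    """Replace each known variant by its canonical form; pass unknown terms through."""
    return [index.get(term, term) for term in terms]


VARIANT_INDEX = build_variant_index(VARIANT_GROUPS)
print(normalize_terms(["ネコ", "書著わす", "犬"], VARIANT_INDEX))
```

If both documents and queries are normalized this way, a search for ネコ also matches text written with 猫 or ねこ. Terms absent from the table, such as 犬 here, are left unchanged.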
The JOD is the most comprehensive database of its kind, and is being used by such companies as Yahoo, Amazon and Baidu to dramatically improve search recall. It also includes a large variety of katakana orthographic variants for loanwords.

Japanese Orthographic Variants

READING | POS | SUB_ID | VARIANT | NORMALIZED
あっせん | VN | a | 斡旋 | あっせん
あっせん | VN | b | あっせん | あっせん
あっせん | VN | c | あっ旋 | あっせん
あかとんぼ | NC | a | 赤とんぼ | 赤とんぼ
あかとんぼ | NC | b | 赤トンボ | 赤とんぼ
あかとんぼ | NC | c | 赤蜻蛉 | 赤とんぼ
あかとんぼ | NC | d | アカトンボ | 赤とんぼ
あかとんぼ | NC | e | あかとんぼ | 赤とんぼ
あきかん | NC | a | 空き缶 | 空き缶
あきかん | NC | b | 空缶 | 空き缶
あきかん | NC | c | 明き罐 | 空き缶
あきかん | NC | d | あき缶 | 空き缶
あきかん | NC | e | あき罐 | 空き缶
あきかん | NC | f | 空きかん | 空き缶
あきかん | NC | g | 空きカン | 空き缶
あきかん | NC | h | 空き罐 | 空き缶
あきかん | NC | i | 空罐 | 空き缶
あきかん | NC | j | 空き鑵 | 空き缶
あきかん | NC | k | 空鑵 | 空き缶

Japanese↔English Technical Terms 日英専門用語辞書

CJKI maintains a comprehensive Japanese-English and English-Japanese dictionary of over 1,000,000 technical terms spanning the major domains of science and technology. The Japanese↔English Dictionary of Technical Terms (JET) is available in domain-specific standalone modules, or as a full edition including all domains. Some of the major domains covered include computer/IT, mechanical engineering, medicine and pharmaceutics. JET is currently being used by some of the world's leading IT companies such as Fujitsu, Microsoft, Sharp and Casio. This database is being used in a variety of applications and software products, such as:

Machine translation dictionaries
Information retrieval applications for accurate term recognition and indexing
NLP tools like morphological analyzers and tokenizers
Handheld electronic dictionaries, such as Casio's high-end Exword GT series
Dictionaries on CD-ROM, such as Logovista's Electronic Dictionary Series
Dictionaries for mobile platforms such as iPhone, Android, and Sharp’s XMDF-based devices

CJKI is constantly expanding this database by adding new terms, new domains, and readings.
Japanese-English Technical Terms

DOMAIN | JAPANESE | READING | ENGLISH
化学 | 亜ヒ酸カルシウム | あひさんかるしうむ | calcium arsenite
生物 | 環状染色体 | かんじょうせんしょくたい | circular chromosome
生物 | 寒天拡散法 | かんてんかくさんほう | cup method
機械 | 駆動プーリー | くどうぷーりー | driving pulley
医学 | 結節性裂毛症 | けっせつせいれつもうしょう | clastothrix
医学 | 犬吠せき | けんばいせき | compression cough
電気 | コンデンサ始動電動機 | こんでんさしどうでんどうき | capacitor-start motor
化学 | ジアミノフェノール塩酸塩 | じあみのふぇのーるえんさんえん | diaminophenol hydrochloride
電気 | 整流する | せいりゅうする | commutate
医学 | チアノーゼ | ちあのーぜ | cyanose
電気 | 転換する | てんかんする | convert
電気 | ブラシ位置変化 | ぶらしいちへんか | brush position change
建設 | 臨界圧力 | りんかいあつりょく | critical pressure

Japanese Name Variants 日本語固有名詞異表記辞書

The number of Japanese proper nouns and their variants is very large, running into the millions, which makes them difficult to identify and process. Named Entity Recognition (NER) is an active area of computational linguistics. To enhance NER technology, CJKI maintains databases of several million CJK and Arabic name variants in all major and most minor romanization systems.

There are several well-established systems for romanizing Japanese, such as the Hepburn and Kunrei systems, as well as various popular and hybrid systems, which leads to millions of romanized variants. A good example is the first name of Japan's former prime minister Jun'ichirō Koizumi, which has 169 romanized variants.

The CJKI Database of Japanese Name Variants (JNV) covers four million Japanese names and their romanized variants, and includes gender codes, classification codes, and frequency rankings. JNV is being used by some major IT companies, especially for business intelligence software and machine translation.
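To give a feel for why the number of romanized variants grows so quickly, the following sketch expands each long vowel ō in a Hepburn-style name into several common spellings. It is a simplified illustration and not how JNV is compiled: the spelling list and the function name are assumptions, only lowercase ō is handled, and a real system would consult the kana reading (for instance, "ou" is appropriate for おう but not for おお).

```python
from itertools import product

# Assumed set of common spellings for a long "o"; the real inventory of
# romanization systems and hybrids is much larger.
LONG_O_SPELLINGS = ["ō", "ô", "o", "oo", "ou", "oh"]


def long_vowel_variants(hepburn):
    """Return every combination of long-o spellings for a Hepburn string."""
    parts = hepburn.split("ō")
    slots = len(parts) - 1  # number of long vowels to expand
    variants = []
    for combo in product(LONG_O_SPELLINGS, repeat=slots):
        pieces = [parts[0]]
        for spelling, rest in zip(combo, parts[1:]):
            pieces.append(spelling + rest)
        variants.append("".join(pieces))
    return variants


print(long_vowel_variants("Satō"))
print(len(long_vowel_variants("Kōtarō")))
```

A name with one long vowel yields six spellings under these assumptions, and a name with two long vowels yields thirty-six, before apostrophe, hyphen and consonant variations (such as shi/si or ji/zi) are even considered.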
Japanese Romanization Systems KANJI YOMI ENGLISH HEPBURN KUNREI NIPPON VARIANTS HYBRIDS GERMANIC 佐藤 さとう Sato Satō Satô Satô Satoo, Satou, Satoh 青塚 あおづか Aozuka Aozuka Aozuka Aoduka Aozuca Aoduca 生越 いくごし Ikugoshi Ikugoshi Ikugosi Ikugosi Icugosi Icugoshi 大津 おおづ Ozu Ōzu Ôzu Ôdu Oozu, Ouzu, Ohzu, Oodu, Oudu, Ohdu, Odu Ōdu 伊大地 いおおじ Ioji Iōji Iôzi Iôzi Iōzi, Ioozi, Iouzi, Iohzi, Iozi, Iooji, Iouji, Iohji, Iôji 橋本 はしもと Hashimoto Hashimoto Hasimoto Hasimoto 天満屋 てんまんや Tenman'ya Tenman'ya Temman'ya, Temmanya, Tenman'ya Tenman'ya Temman-ya, Tenmanya, Tenman-ya Ikugoschi LATIN Ikugochi Haschimoto Hachimoto Tenman'ja, Tenmanja, Tenman-ja 31 Japanese Place Name Variants Variants of Jun'ichirō TYPE VARIANT ENG VARIANT VARIANT HYBRID HEPBURN VARIANT VARIANT VARIANT VARIANT VARIANT HYBRID HYBRID HYBRID VARIANT VARIANT VARIANT VARIANT HYBRID VARIANT VARIANT VARIANT VARIANT HYBRID HYBRID HYBRID VARIANT VARIANT HYBRID ROMANIZATION Junichiro Jun'ichiro Jun-ichiro Junichirô Juniciro Jun'ichirō Jun-ichirō Junichirou Jun'ichirou Jun-ichirou Junichirō Jyunichiro Junitirou Junitiro Zyun'itiro Zyun-itiro Junichiroo Junichiroh Jyunichirou Jun'ichiroh Jun-ichiroh Zyun'itiroo Zyun-itiroo Jyun'ichiro Jyun-ichiro Jyunichiroh Jun'ichiroo Jun-ichiroo Jyunitirou RANK A FREQ JAPANESE YOMI ROMA TYPE 26600 安城市 あんじょうし Anjo city E A 4270 安城市 あんじょうし Anjo-shi E A 2080 安城市 あんじょうし Anjou-si h A 517 安城市 あんじょうし Anjou city h A 454 安城市 あんじょうし Anjoshi E A 339 安城市 あんじょうし Anjyo city h A 285 安城市 あんじょうし Anjoushi h B 282 安城市 あんじょうし Anjosi h 138 安城市 あんじょうし Anjou-shi h B 112 安城市 あんじょうし Anjo-si h 106 安城市 あんじょうし Anjyoushi h 77 安城市 あんじょうし Anjyousi h 65 安城市 あんじょうし Anjō-shi H 58 安城市 あんじょうし Anzyousi h C 47 安城市 あんじょうし Anjyo-shi h C 23 安城市 あんじょうし Anjousi h C 21 安城市 あんじょうし Anjou-chi h C 13 安城市 あんじょうし Anjyoshi h C 12 安城市 あんじょうし Anjoosi h C 12 安城市 あんじょうし Anjyou-shi h C 11 安城市 あんじょうし Anjō city h 10 安城市 あんじょうし Anjoh city h 10 安城市 あんじょうし Anjo-chi h 9 安城市 あんじょうし Anjyou city h 5 安城市 あんじょうし Anjyoo-shi h 3 安城市 あんじょうし Anjoh-shi h B B B B C 
C C C C 3 安城市 あんじょうし Anjohshi h C 3 安城市 あんじょうし Anjochi h C 2 安城市 あんじょうし Anjô city h C C

CJKI Japanese-English Dictionary CJKI 和英辞典

The CJKI Japanese-English Dictionary (CJED) covers about 110,000 entries of general vocabulary. It includes part-of-speech codes and readings. This up-to-date dictionary is optimized for the convenience of users of electronic dictionaries and online translation tools. It has just the right amount of detail: enough equivalents to give an in-depth understanding, yet short enough not to clutter up the screen.

Japanese-English Dictionary

POS | JAPANESE | READING | ENGLISH
NC | 辞書 | じしょ | dictionary; lexicon; glossary; thesaurus
NC | 地所 | じしょ | land; ground; estate
NC | 事象 | じしょう | phenomenon
V | 自称する | じしょうする | call oneself; profess oneself to be someone; style oneself; profess oneself
AN | 自称の | じしょうの | would-be
NC | 辞書学 | じしょがく | lexicography
NC | 辞職 | じしょく | resignation
NC | 辞職勧告 | じしょくかんこく | advice to resign
V | 辞職する | じしょくする | resign; resign from; resign one's office
NC | 自信 | じしん | self-confidence
NC | 自身 | じしん | one's self; self; oneself; itself
NC | 地震 | じしん | earthquake; earth tremor; quake; temblor
D | 自身で | じしんで | oneself; in person
AN | 自身の | じしんの | one's own
AN | 地震の | じしんの | seismic
NC | 事情 | じじょう | circumstances; conditions; situation; state of affairs; reasons
NC | 自乗 | じじょう | square; second power
NC | 自浄 | じじょう | self-purification
V | 自乗する | じじょうする | square
V | 二乗する | じじょうする | square; multiply by itself

CJKI English-Japanese Dictionary CJKI 英和辞典

The CJKI English-Japanese Dictionary (CEJD) contains about 82,000 entries covering general vocabulary and important proper names, and includes part-of-speech codes. Optimized for the convenience of users of electronic dictionaries and online translation tools, it has just the right amount of detail: enough equivalents to give an in-depth understanding but short enough not to clutter up the screen. This dictionary is or has been used in such well-known products as Babylon and Quicktionary, and on various mobile platforms around the world, such as TangoTown.
Sample of English-Japanese Dictionary

ENGLISH | POS | JAPANESE
wrinkle | N | しわ; うまい考え; 名案
wrinkle | V | しわが寄る, …にしわを寄せる
wrist | N | 手首, そで口
wristlet | N | 袖口バンド, 金属バンド, 腕輪
wristwatch | N | 腕時計
writ | N | 令状
write | V | 書く; 著述する, 作曲する; 手紙を書く; 署名する
write down | E | 書き留める; 評する; 調子を下げて書く; けなす
write in | E | 投書する; 書き込む; 書き入れて投票する
write off | E | 帳消しにする; 損失とみなす; すらすらと書く
write one's own ticket | E | 将来の方針を立てる
write out | E | 全部書く, 清書する; 書く
write-down | N | 評価切り下げ; 償却
write-off | N | 帳消し, 価格引き下げ
writer | N | 作家, 記者; 書く人; 筆者; 作者; 著者
writer's block | E | 著述遮断

Japanese Phonetic Database 日本語音韻データベース

CJKI’s Japanese Phonetic Database (JPD) was developed in collaboration with The National Institute for Japanese Language, the well-respected Japanese government organization that conducts scientific research on the Japanese language. The database covers about 130,000 entries of general vocabulary and personal names, and to our knowledge is the first database of its kind to provide phonetic transcriptions in IPA and accent codes that accurately indicate how Japanese names and words are pronounced in actual speech.
Japanese Phonetic Database

HEADWORD | POS | ACCENT | READING | PHONETIC | REMARKS
鏡 | NC | 3 | カガミ | kaŋami | voiced velar nasal
鏡 | NC | 3 | カガミ | kaɡami | voiced velar stop
鏡 | NC | 3 | カガミ | kaɣami | voiced velar fricative
危ない | AJ | 0 | アブナイ | abɯnai | voiced bilabial plosive
危ない | AJ | 0 | アブナイ | aβɯnai | voiced bilabial fricative
飾り | NC | 0 | カザリ | kazaɾi | voiced alveolopalatal fricative
ザリガニ | NC | 0 | ザリガニ | dzaɾiɡaɲi | alveolopalatal affricate, palatal nasal
新聞 | NC | 0 | シンブン | ɕimbɯɴ |
自分 | NC | 0 | ジブン | dʑibɯɴ | weakening of (fricativized)
比較 | VN | 0 | ヒカク | ikakɯ | devoicing of [hi]
比較 | VN | 0 | ヒカク | hikakɯ | no devoiced vowel
続く | V5 | 0 | ツヅク | tsɯzɯkɯ | devoicing of [tsɯ]
続く | V5 | 0 | ツヅク | tsɯzɯkɯ | no devoiced vowel
恥 | NC | 2 | ハジ | haʑi | voiced alveolopalatal fricative
恥 | NC | 2 | ハジ | hadʑi | voiced alveolopalatal affricate
蜂 | NC | 0 | ハチ | hatɕi |
八 | NC | 2 | ハチ | hatɕi |
鵜 | NC | 1 | ウ | ʔɯʔ | glottal stop
voiced alveolopalatal plosive

Korean Lexical Resources 한국어어휘자원

CJKI is engaged in the development of various Korean dictionaries and lexical databases, covering general vocabulary, personal names, place names, and geographical data for Japan. These are used in such applications as machine translation (MT), information retrieval (IR), input method editors (IME) and online maps. The Korean lexical database includes a rich set of grammatical, phonological and semantic attributes for use in natural language processing (NLP) applications. Below is a description of our principal Korean resources:

Principal Resources

Korean Lexical Database
Korean-English Dictionary of Proper Nouns
Korean-Japanese Dictionary of Proper Nouns
Korean IME Databases

Korean Lexical Database 한국어어휘데이터베이스

The CJKI Korean Lexical Database (KLD) is a monolingual lexical database of Korean developed by CJKI’s Korean editors. It includes a rich set of grammatical attributes, as well as hanja when applicable. The KLD is especially suitable for applications in the fields of information retrieval, morphological analysis and machine translation.
Korean Lexical Database

HANGUL | POS | SUBPOS | TYPE | PATTERN | MOE | HANJA
가둥-거리다 | V | | i | | katungkŏrita |
가로놓이다 | V | | i | | karonohita |
가리산지리산 | D | V | i | HADA | karisanchirisan |
가볍다 | AX | | | P | kapyŏpta |
가살-스럽다 | AX | | | P | kasal-sŭrŏpta |
가수분해 | NC | VP | ti | HADA | kasupunhae | 加水分解
가시화-되다 | V | | i | | kasihwatoeta |
가져가다 | V | | t | | kachyŏkata |
가파르다 | AX | | | REU | kap'arŭta |
간정되다 | V | | i | | kanchŏngtoeta |
갈아대다 | V | | t | | kalataeta |
감다 | V | | t | | kamta |
감때사납다 | AX | | | P | kamttaesanapta |
개교 | NC | V | i | HADA | kaekyo | 改敎
개연 | NC | A | | HADA | kaeyŏn | 蓋然

Korean↔English Proper Nouns 고유명사한영사전

The CJKI Korean↔English Dictionary of Proper Nouns (KEP) is a bilingual dictionary of personal and place names that covers both Korean and non-Korean proper nouns, including Japanese and Chinese names. It includes various data fields such as romanized readings, frequency rankings, classification codes, locale codes and English equivalents, and supports multiple romanization systems. A unique feature of this dictionary is that it also includes hanja (Chinese characters used in Korean).

This data was compiled on the basis of precise transcription and transliteration rules, and verified by our Korean editors. It supports the Revised Romanization of Korean, the latest standard published by the Korean government in 2000, properly reflecting the phonological changes resulting from patchim and liaison. The MOE transliteration refers to a strict transliteration based on the former Ministry of Education (MOE) romanization system.
Korean Proper Nouns

HANGUL | HANJA | MOE | ENGLISH
안 | 安 | an | An
제갈 | 諸葛 | chekal | Chegal
조 | 趙, 曺 | cho | Cho
주 | 周, 朱 | chu | Chu
한 | 漢, 韓 | han | Han
황 | 黃 | hwang | Hwang
지미 | 芝美 | chimi | Chimi
지산 | 智山 | chisan | Chisan
정석 | 正錫 | chŏngsŏk | Cheongseok
해운 | 海雲 | haeun | Haeun
희경 | 喜慶 | hŭikyŏng | Huigyeong
희란 | 熙欄 | hŭiran | Huiran
가야1동 | 伽倻一洞 | kaya1tong | Kaya 1Village
가양동 | 加陽洞 | kayangtong | Kayang Village
갈말읍 | 葛末邑 | kalmalŭp | Kalmal Town
강남구 | 江南區 | kangnamku | Kangnam District
강원도 | 江原道 | kangwŏnto | Kangwon Province

Multilingual Technical Terms 多语言术语词典

Because of the rapidly growing trade relations between China, Japan and the English speaking world, there is an urgent need for Chinese-Japanese-English technical dictionaries in electronic form. CJKI’s Multilingual Dictionary of Technical Terms (MDT) is a comprehensive trilingual dictionary of technical terms in Simplified Chinese, Japanese and English covering about 300,000 entries in all major fields of science and technology. A sister edition, the Chinese-English Dictionary of Technical Terms, is under development in collaboration with Chinese institutions and is expected to cover several million entries.
Multilingual Technical Terms

CHINESE | JAPANESE | ENGLISH | PINYIN | HIRAGANA
加油 | 給油 | fueling, lubrication | jiāyóu | きゅうゆ
加油 | 注油 | lubrication, oiling | jiāyóu | ちゅうゆ
加油车 | 給油車 | refueling truck | jiāyóuchē | きゅうゆしゃ
加油器 | 給油器 | feeder, lubricator, oiler | jiāyóuqì | きゅうゆき
加油器 | 油差し | lubricator, oil can, oil feeder, syringe | jiāyóuqì | あぶらさし
加油孔 | 油穴 | lubrication hole, oil hole | jiāyóukǒng | あぶらあな
加油站 | 給油所 | filling station, service station | jiāyóuzhàn | きゅうゆしょ
加油站 | フィリングスタンド | filling stand | jiāyóuzhàn | ふぃりんぐすたんど
加硫 | 加硫 | sulfurization, thionation, vulcanization, curing | jiāliú | かりゅう
加亮 | 強調 | highlighting | jiāliàng | きょうちょう
加亮 | 強調表示 | highlighting | jiāliàng | きょうちょうひょうじ
加亮 | 高輝度表示 | highlighting, brightening | jiāliàng | こうきどひょうじ
加亮 | ハイライト表示 | highlighting | jiāliàng | はいらいとひょうじ
加力燃烧室 | アフターバーナー | afterburner | jiālìránshāoshì | あふたーばーなー
加和性 | 加成性 | additive property, additivity | jiāhéxìng | かせいせい
加气剂 | 空気連行剤 | air entraining agent | jiāqìjì | くうきれんこうざい
加聚物 | 付加重合体 | addition polymer, addition resin | jiājùwù | ふかじゅうごうたい
加勒金法 | ガレルキン法 | Galerkin method | jiālèjīnfǎ | がれるきんほう

Multilingual Proper Nouns 多言語固有名詞データベース

CJKI maintains comprehensive databases of CJK and Arabic personal names and place names, including various kinds of geographical data covering millions of entries. These databases are used by some of the world's major IT companies for a wide variety of applications such as online multilingual maps, named entity recognition, machine translation and information retrieval.

Key Features

The database includes various data fields (many not shown here), such as readings in pinyin and zhuyin, hiragana, romanization in all major and most important minor romanization systems, semantic classification codes and frequency rankings, locale codes, and other useful information.

A unique feature is that the TC place names are not merely code-converted equivalents of the SC names, but are accurate on both the orthographic and the lexemic levels ("O" and "L" in the tables below). For example, New Zealand in SC is 新西兰 Xīnxīlán but in TC it is 紐西蘭 Niǔxīlán.
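The difference between code conversion and lexemic conversion can be sketched as follows. The two small tables are hypothetical and contain only a few entries taken from examples in this document; a real converter would rely on large character and lexeme tables, and the function names are invented for illustration.

```python
# Hypothetical one-to-one character table (orthographic level).
CHAR_MAP = {"兰": "蘭", "软": "軟"}

# Hypothetical word table (lexemic level): the TC word differs from the
# character-by-character conversion of the SC word.
LEXEME_MAP = {"新西兰": "紐西蘭", "软件": "軟體"}

# Try longer SC words first so the longest match wins.
LEXEMES_BY_LENGTH = sorted(LEXEME_MAP, key=len, reverse=True)


def convert_orthographic(text):
    """Code conversion only: map each SC character to its TC counterpart."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_lexemic(text):
    """Prefer word-level mappings; fall back to character conversion."""
    out = []
    i = 0
    while i < len(text):
        for word in LEXEMES_BY_LENGTH:
            if text.startswith(word, i):
                out.append(LEXEME_MAP[word])
                i += len(word)
                break
        else:
            # No lexeme starts here: convert a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(convert_orthographic("新西兰"))  # character-level result
print(convert_lexemic("新西兰"))       # word-level result
```

Character-level conversion turns 新西兰 into 新西蘭, which is correctly written in traditional characters but is not the name used in TC text, whereas the lexemic lookup returns 紐西蘭.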
Another unique feature is that SC and TC readings are distinguished. Thus the pinyin for SC 期荣 is qīróng, but for the TC 期榮 it is qíróng. Databases can be tailored to your specific needs and budgets.

CJK Proper Noun Databases at a Glance

 | C-E | J-C | J-K | C-K | J-E | K-E
Chinese Personal Names | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000 | 1,000,000
Chinese Place Names | 5,000 | 5,600 | 3,000 | 13,000 | 2,400 | 5,000
Korean Personal Names | 3,300 | 2,100 | 13,000 | 240,000 | 13,000 | 13,000
Korean Place Names | 3,200 | 2,000 | 5,900 | 20,000 | 5,900 | 6,900
Japanese Personal Names (Given Names) | 376,000 | 281,000 | 390,000 | 376,000 | 390,000 | 45,000
Japanese Personal Names (Surnames) | 149,000 | 91,000 | 150,000 | 149,000 | 150,000 | 73,000
Japanese Place Names | 74,000 | 74,000 | 77,000 | 74,000 | 77,000 | 68,000
Western Personal Names | 32,000 | 38,000 | 10,000 | 9,800 | 31,000 | 7,500
Western Place Names | 2,500 | 2,500 | 1,800 | 1,800 | 1,100 | 1,800
Total | 1,645,000 | 1,496,200 | 1,650,700 | 1,883,600 | 1,670,400 | 1,220,200

Chinese-English Chinese Names

Type | Chinese | Pinyin | English
S | 王 | wáng | Wang
S | 张 | zhāng | Zhang
G | 业则 | yèzé | Yeze
G | 业华 | yèhuá | Yehua
G | 业宁 | yèníng | Yening
G | 业彤 | yètóng | Yetong
G | 业权 | yèquán | Yequan
G | 业浔 | yèxún | Yexun
G | 业经 | yèjīng | Yejing
G | 业达 | yèdá | Yeda

Korean-Chinese Korean Names

Type | Korean | Hanja | Chinese
S | 가 | 賈 | 贾
S | 강 | 姜 | 姜
G | 갑중 | 甲中 | 甲中
G | 건종 | 建鍾 | 建锺
G | 경수 | 敬秀 | 敬秀
G | 경숙 | 慶淑 | 庆淑
G | 경식 | 慶植 | 庆植
G | 경환 | 景桓 | 景桓
G | 경철 | 景喆 | 景喆
G | 구덕 | 九德 | 九德

English-Chinese Western Names

Type | English | Chinese | Pinyin
S | Anthony | 安东尼 | Āndōngní
S | Waltham | 沃尔萨姆 | wò'ěrsàmǔ
S | Winchester | 温切斯特 | Wēnqiēsītè
S | Austen | 奥斯汀 | Àosītīng
S | Keyser | 凯泽 | Kǎizé
S | Count | 康特 | Kāngtè
S | Gari | 加里 | Jiālǐ
S | Pierre | 皮埃尔 | pí'āi'ěr
S | Cornelius | 科尔内留斯 | kē'ěrnèiliúsī
S | Constantin | 康斯坦丁 | kāngsītǎndīng

Japanese-English Japanese Given Names

Type | Japanese | Reading | English
M | 丈 | たけし | Takeshi
M | 多摩男 | たまお | Tamao
M | 大二郎 | だいじろう | Daijiro
M | 鉄次 | てつじ | Tetsuji
M | 敏晴 | としはる | Toshiharu
M | 信雄 | のぶお | Nobuo
M | 安博 | やすひろ | Yasuhiro
M | 百合男 | ゆりお | Yurio
M | 一貴 | かずき | Kazuki
M | 善一郎 | ぜんいちろう | Zen'ichiro

Japanese-Chinese Japanese Surnames

Type | Japanese | Reading | Chinese
S | 岡林 | おかばやし | 冈林
S | 下岡 | しもおか | 下冈
S | 丸本 | まるもと | 丸本
S | 佐藤 | さとう | 佐藤
S | 勝部 | かつべ | 胜部
S | 沼
ぬま | 沼
S | 上仲 | かみなか | 上仲
S | 西郡 | にしごおり | 西郡
S | 中口 | なかくち | 中口
S | 渡辺 | わたなべ | 渡边

CJKEA Database of Place Names

ENGLISH | JAPANESE | SC | LO | TC | KOREAN | ARABIC
Aruba | アルーバ | 阿鲁巴 | L | 阿盧巴 | 아루바섬 | أروبا
Brasilia | ブラジリア | 巴西利亚 | O | 巴西利亞 | 브라질리아 | برازيليا
Caracas | カラカス | 加拉加斯 | L | 卡拉卡斯 | 카라카스 | كراكاس
Cairo | カイロ | 开罗 | O | 開羅 | 카이로 | القاهرة
Chad | チャド | 乍得 | L | 查德 | 차드 | تشاد
Georgia | ジョージア | 乔治亚 | O | 喬治亞 | 조지아 | جورجيا
Ireland | アイルランド | 爱尔兰 | O | 愛爾蘭 | 아일랜드 | آيرلندا
Seoul | ソウル | 首尔 | O | 首爾 | 서울 | سيول
Seoul | ソウル | 汉城 | O | 漢城 | 서울 | سيول
Tel Aviv | テルアビブ | 特拉维夫 | O | 特拉維夫 | 텔아비브 | تل أبيب
Yemen | イエメン | 也门 | L | 葉門 | 예멘 | اليمن

Phonemic Transcriptions of CJKE Place Names

ENGLISH | JAPANESE | SC | TC | KOREAN | ARABIC
Aruba | あるーば | ālǔbā | ālúbā | arupasŏm | aruba
Brasilia | ぶらじりあ | bāxīlìyà | bāxīlìyà | pŭrachilria | burazilia
Caracas | からかす | jiālājiāsī | kǎlākǎsī | k'arak'asŭ | karakasu
Cairo | かいろ | kāiluó | kāiluó | k'airo | al-qahirah
Chad | ちゃど | zhàdé | chádé | ch'atŭ | tshad
Georgia | じょーじあ | qiáozhìyà | qiáozhìyà | chochia | jurjia
Ireland | あいるらんど | àiěrlán | àiěrlán | ailraentŭ | ayirlanda
Seoul | そうる | shǒuěr | shǒuěr | sŏul | siwul
Seoul | そうる | hànchéng | hànchéng | sŏul | siwul
Tel Aviv | てるあびぶ | tèlāwéifū | tèlāwéifū | t'elapipŭ | tallu-abib
Yemen | いえめん | yěmén | yèmén | yemen | al-yaman

Chinese-Korean Chinese Places

Pinyin | Chinese | Hanja | Korean
Dōngguān | 东关 | 東關 | 둥관
Dōngyíng | 东营 | 東營 | 둥잉
Dōngyáng | 东阳 | 東陽 | 둥양
dōng'ē | 东阿 | 東阿 | 둥어
dōng'ān | 东安 | 東安 | 둥안
Dōngyuán | 东源 | 東源 | 둥위안
Dōnghú | 东湖 | 東湖 | 둥후
Dōnggǎng | 东港 | 東港 | 둥강
Dōngshān | 东山 | 東山 | 둥산
Dōngchuān | 东川 | 東川 | 둥촨

Korean-English Korean Places

Korean | Hanja | English
고금면 | 古今面 | Gogeum-Myeon
고남면 | 高南面 | Gonam-Myeon
고담동 | 高潭洞 | Godam-Dong
고대면 | 高大面 | Godae-Myeon
고덕동 | 古德洞 | Godeok-Dong
고덕면 | 古德面 | Godeok-Myeon
고등동 | 高登洞 | Godeung-Dong
고등동 | 高等洞 | Godeung-Dong
고랑동 | 古浪洞 | Gorang-Dong
고령군 | 高靈郡 | Goryeong-Gun

Chinese-English Western Places

Chinese | Pinyin | English
土库曼 | tǔkùmàn | Turkmenistan
尼日利亚 | nírìlìyà | Nigeria
哈里斯堡 | hālǐsībǎo | Harrisburg
布宜诺斯艾利斯 | bùyínuòsī'àilìsī | Buenos Aires
布鲁克林 | bùlǔkèlín | Brooklyn
贝塞斯达 | bèisāisīdá | Bethesda
柏林 | bólín | Berlin
博茨瓦那 | bócíwǎnà | Botswana
马斯喀特 | mǎsīkātè | Muscat
马拉维 | mǎlāwéi | Malawi

CJKE Multilingual Database of Personal Names ENG JPN SC Abba アッバ 阿巴 Abbas アッバース 阿巴斯 Alberto アルベルト 阿尔韦托 Qirong 期栄 期荣 Akiko 暁子 晓子 Akiko 顕子
显子 Akiko 昭子 昭子 Akira 明 明 Deng 登 登 Einstein アインスタイン 爱因斯坦 Ernest アーネスト 欧内斯特 Gregg グレッグ 格雷格 Greg グレッグ 格雷格 Haiyang 海洋 海洋 Huaiyang 懐陽 怀阳 Jack ジャック 杰克 Jackie ジャッキー 杰基 Kennedy ケネディ 肯尼迪 Kaiyang 開陽 开阳 Nakajima 中島 中岛 William ウィリアム 威廉 Zhang 張 张
TC 亞伯 阿巴斯 阿爾韋托 期榮 曉子 顯子 昭子 明 登 愛因斯坦 歐尼斯特 葛瑞格 葛瑞格 海洋 懷陽 傑克 傑基 甘迺迪 開陽 中島 威廉 張
KOR 아바 아바스 알베르토 치룽 아키코 아키코 아키코 아키라 덩 아인슈타인 어니스트 그레그 그레그 하이양 화이양 잭 재키 케네디 카이양 나카지마 빌리암 장
LO HIRAGANA L あっば O あっばーす O あるべると O きえい O あきこ O あきこ O あきこ O あきら O とう O あいんすたいん L あーねすと L ぐれっぐ L ぐれっぐ O かいよう O かいよう O じゃっく O じゃっきー L けねでぃ O かいよう O なかじま O うぃりあむ O ちょう
SC PIN ābā ābāsī āěrwéituō qīróng xiǎozǐ xiǎnzǐ zhāozǐ míng dēng àiyīnsītǎn ōunèisītè géléigé géléigé hǎiyáng huáiyáng jiékè jiéjī kěnnídí kāiyáng zhōngdǎo wēilián zhāng
TC PIN yàbó ābāsī āěrwéituō qíróng xiǎozǐ xiǎnzǐ zhāozǐ míng dēng àiyīnsītǎn ōunísītè gěruìgé gěruìgé hǎiyáng huáiyáng jiékè jiéjī gānnǎidí kāiyáng zhōngdǎo wēilián zhāng
MOE apa apasŭ alperŭt'o ch'irung ak'ik'o ak'ik'o ak'ik'o ak'ira tŏng ainsyut'ain ŏnisŭt'ŭ kŭrekŭ kŭrekŭ haiyang hwaiyang chaek chaek'I k'eneti k'aiyang nak'achima pilriam chang

Meet Our CEO

Jack Halpern (春遍雀來), CEO of The CJK Dictionary Institute, is a lexicographer by profession. For sixteen years he was engaged in the compilation of the New Japanese-English Character Dictionary, and as a research fellow at Showa Women's University (Tokyo), he was editor-in-chief of several kanji dictionaries for learners, which have become standard reference works.

Jack Halpern, who has lived in Japan for over 40 years, was born in Germany and has lived in six countries including France, Brazil, Japan and the United States. An avid polyglot who specializes in Japanese and Chinese lexicography, he has studied 15 languages (speaks ten fluently) and has devoted several decades to the study of linguistics and lexicography.
Jack Halpern has published over twenty books and dozens of articles and academic papers, mostly on the Japanese writing system and CJK information processing, has given over 600 public lectures on Japanese language and culture, and has presented several dozen papers at international conferences.

On a lighter note, Jack Halpern loves the sport of unicycling. Founder and long-time president of the International Unicycling Federation, he has promoted the sport worldwide and is a director of the Japan Unicycling Association. Currently, his passion is playing the quena and improving his Chinese, Esperanto and Arabic.

Contact Information

The CJK Dictionary Institute, Inc.
Komine Building
34-14, 2-chome, Tohoku, Niiza-shi
Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-48-486-5032
Email: [email protected]
Web: www.cjk.org