Annotating Clause Boundary Labels to Japanese Corpora
Takehiko Maruyama (University of Oxford / NINJAL)
17th February 2015, East Asian Linguistics Seminar

Contents
•Introduction: Multiple clause linkage structure in Japanese
•Data: Corpora: CSJ, BCCWJ / Annotating clause boundaries
•Result: Distribution of clauses and their combinations
•Discussion: Mechanism of multiple clause linkage structure
•Application to Old Japanese: Multiple clause linkage structure in OCOJ

Clause Linkage System in Japanese
•Japanese subordinate clauses
  –Conjugation forms of (auxiliary) verbs
      [ [ Taro ga warai ] Hanako ga naita ]
      Taro laughed and Hanako cried.
  –Conjunctive particles
      [ [ Taro ga waratta kara ] Hanako ga naita ]
      Taro laughed, so Hanako cried.
•Final boundaries of clauses and sentences can be distinguished morpho-syntactically
  –Cf. English
      Taro laughed and Hanako cried.
      Taro laughed. And Hanako cried.

Bad Sentence with Multiple Clause Linkage
•僕たちはバスをおりて、長い階段を上がって、動物園に向かったんだけど、動物園の門にはライオンの絵がかいてあって、とても大きくてびっくりしたけど、中に入ると最初はパンダがいて、その先にはリスやコアラがいたり、おもしろい鳴き声を出す鳥とかがたくさんいて、祐二君が「動物園ってすごい楽しいね」と言っているうちに、お昼ごはんを食べる時間になって、キリンの長い首を見ながらお弁当を食べて、それから大きなゾウを見て、長い鼻でエサを拾いあげるようにして食べるのを見て、すごいおもしろかったけれど、途中から雨が強くなって、早く学校に帰ることになったから、ちょっと残念だった。
(Composition by an elementary school student)

Clause Linkage System in Japanese
•Subordinate clauses can be linked repeatedly, one after another
  –A sentence can be infinitely long (cf. embedded sentence)
      Taro ga warai Hanako ga naita kara Jiro ga okotte …
      Taro laughed and Hanako cried, so Jiro got angry and…
  –A series of clause linkages within a sentence: “Multiple clause linkage structure”
•Multiple clause linkage structure tends to be avoided in prescriptive grammar
  –In elementary schools, long sentences are called “pointless / bad sentences” (だらだら文 ・ 悪文)
  –Nagano (1969): “Disorders and confusions of authors’ thoughts generate pointless sentences”

Clause Linkages in Spontaneous Speech
•私が住んでいたところは団地 : の二階でして (F えーと) その前は大きな (F えー) 明治 : 道路が走っていたんですけれども || 団地と道路の間にはこう団地の庭みたいな感じで (F えーと) || 道路の手前に木がたくさん生えていたので || (F えーと) || (F ま) 鳥が || (D つつ) 飛び出したとしてもすぐには道路に出ないで その : 木 (D ん) (D き) 木の辺りに引っ掛かってるかな : という || (F えー) 感じでしたので まず二階からこう || 木を || 木のどの辺にいるかという || のを当たり付けて || 当たりを付けると言うか (F まー) || 探してみて || すぐには見つからなかったので しょうがない (D ぐ) ので (D すす) すぐに外に飛び出しまして || (F えーとー)...
(Corpus of Spontaneous Japanese: S02M0076)

Clause Linkages in Written Text
•何年度かの初島レースで徹夜で舵を引いて明け方ファースト・フィニッシュし、後の片付けはクルーにまかせて家に飛んで帰り昼近くまで仮眠をとった後迎えにきたサッカー仲間の車で横浜にいき、外人クラブとの試合で私も一点ゴールを決めて大勝し、帰り道には当時流行りだしていたバッティングセンターで小一時間ボールを打って、その後ハーバーから戻っていたクルーたちとマージャンして馬鹿勝ちし、「いったい石原さんて何なんだ」、とぼやかれて悦に入っていたこともあったが、その次の年あたりに海で酷い目に会い生まれて初めて体力の限界を覚らされたものでした。
(Balanced Corpus of Contemporary Written Japanese OB6X 00101)
(『老いてこそ人生』 石原慎太郎著、幻冬舎)

Then…
•How are subordinate clauses linked to compose multiple clause linkage structures in spoken and written Japanese?
•Why do speakers / writers produce bad sentences in their speech / text?

Research Questions
•What types of subordinate clauses appear in Japanese spontaneous speech and written text, and how are they combined?
  –Distribution of clause types and their combinations
  –Corpus-based study
    •Surveying large corpora of spoken and written Japanese
    •Identifying various types of clauses in the corpora
•What is the mechanism of generating multiple clause linkage structures?
  –“Utterance production in real time”, cf. Levelt (1989)
  –“Dynamic Rewriting Rule” by Kondo (2005)

Corpora
•CSJ: Corpus of Spontaneous Japanese (2004) 『日本語話し言葉コーパス』
  –Mainly monologue
  –651 hours
  –7.52 million words
•BCCWJ: Balanced Corpus of Contemporary Written Japanese (2011) 『現代日本語書き言葉均衡コーパス』
  –Various types of written text
  –100 million words
  –172,675 samples

CSJ
•651 hours, 7.52M words of spontaneous speech
  –90% monologue
    •APS (Academic Presentation Speech): formal
    •SPS (Simulated Public Speaking): casual
  –10% dialogue
•Rich annotations
  –Transcription, Morphological information, Clause boundary label, Dependency and Discourse structure, Segment label, Intonation label, Speakers’ info...
•Aims
  –To develop an automatic speech recognition system
  –Linguistic study of spontaneous speech

2-way Transcription System
•Basic transcription
•Pronunciation transcription

Morphological Information
•Morphologically analyzed data

XML encoding
•All the annotations are encoded in XML files

BCCWJ
•Balanced corpus for general purpose
•100 million words
  –Sampled randomly from various written texts published during 1976 - 2005 (-2009)
•Registers
  –Books, Magazines, Newspapers, Web Documents, Whitepapers, Textbooks, Blog, Law, Verse...
•Aims
  –Vocabulary survey, Grammatical study, Lexicography...
  –Japanese language education
  –Natural language processing

XML encoding
<?xml version="1.0" encoding="UTF-8"?>
<sample sampleID="LBe2_00005" version="1.0" type="fixedLength">
  <article articleID="LBe2_00005_F001">
    <paragraph>
      <sentence>やがて、後<sampling type="start" />燕は漢人の<ruby rubyText="ひょう">馮</ruby><ruby rubyText="ばつ">跋</ruby>に乗っ取られてしまいます。</sentence>
      <sentence>西暦四〇九年のことですが、この翌年前記の南燕が東晋の<ruby rubyText="りゅう">劉</ruby><ruby rubyText="ゆう">裕</ruby>によって、ほろぼされてしまいました。</sentence>
    </paragraph>
    <paragraph>
      <sentence> 四〇九年には、いろいろなことがおこっています。</sentence>
      <sentence>さしもの拓跋珪も、この年、思わぬことで、あろうことか息子の一人、<ruby rubyText="たく">拓</ruby><ruby rubyText="ばつ">跋</ruby><ruby rubyText="しょう">紹</ruby>によって殺されました。</sentence>
    </paragraph>
•A sample starts here
•Figures, old Japanese are omitted
•A character randomly chosen in a page

Morphological Information
•Morphologically analyzed data

少納言 Shonagon / 中納言 Chunagon
•BCCWJ concordance program

What is “Annotation”?
Corpus Annotations
•Annotations: adding (non-)linguistic information to linguistic entities in a corpus
•Various types of annotations to various levels of linguistic expressions

Level      Annotations
Discourse  Rhetorical structure, Anaphora
Sentence   Sentence boundaries, Dependency structure, Predicate-Argument structure, Speech act, Intonation
Clause     Clause boundaries, Dependency structure
Phrase     Syntactic feature, Semantic role
Word       Lemma, Part-of-Speech, Named entity, Word sense, Accent
Phoneme    Segment

Annotating Clause Boundaries
•Adding “Clause Boundary Labels”
kyoo ohanasi sasete itadaku naiyoo na n desu keredomo /keredomo/
(F e:tto:) (F ma) tokuni mezurasii koto de wa nai to <Quote>
omou n desu ga /ga/
(F ano:) jibun wa (F ano) karada wa (F e:) moto kara tuyoi hoo dewa nakatta no desu ga /ga/
iwayuru byooki rasii byooki toyu: koto wa sita koto ga naku te /te/
(F sono) (F e) (D i) ikkagetu hodo zutto netakiri toyu:ka <suspend>
ie ni ori masi te /te/
(F ano) byooki wo site ori masita mono de <Copula>
jibun ni totte wa umare te <te>
hajimete no koto datta node <node>
(F e:) sono koto ni tui te ohanasi sasete itadaki tai to <Quote>
omoi masu [EOS]

“CBAP”
•CBAP: Clause Boundary Annotation Program (Maruyama et al. 2004)
  –Detecting and annotating clause boundaries
  –Using morphologically analyzed data
  –139 types of clauses
      Subordinate (102): Conditional (23), Reason (8), Time (21), Manner (12), misc (38)
      Complementary (10): Complementary (2), Quotation (5), Indirect question (3)
      Adnominal (15)
      Coordinate (12)

Data Statistics
•“Sentences” in (part of) CSJ and BCCWJ
  –CSJ: delimited by the Clause Boundary Label [EOS]
  –BCCWJ: the extent of sentence tags ending in [。!?]
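A minimal sketch, not CBAP itself, of how such labels can be turned into per-sentence label sequences. It assumes a transcript in which clause boundary labels are written as /label/, <label> or [EOS], as in the romanized example above; treating the three bracket shapes alike is a simplification. The function names (cbl_sequences, combination_counts, adjusted) and the abridged sample string are illustrative assumptions, not part of any corpus tool. adjusted() rescales a raw count to a fixed number of words, the kind of adjustment needed when registers of different sizes are compared.

```python
import re
from collections import Counter

# Matches the three label shapes seen in the example: /te/, <Quote>, [EOS].
LABEL = re.compile(r"/([^/\s]+)/|<([^<>\s]+)>|\[(EOS)\]")


def cbl_sequences(transcript):
    """Yield one list of clause boundary labels per sentence (closed by EOS)."""
    current = []
    for match in LABEL.finditer(transcript):
        label = match.group(1) or match.group(2) or match.group(3)
        current.append(label)
        if label == "EOS":
            yield current
            current = []


def combination_counts(transcript):
    """Count label combinations, written like 'te / Quote / EOS'."""
    return Counter(" / ".join(seq) for seq in cbl_sequences(transcript))


def adjusted(count, corpus_words, base=200_000):
    """Rescale a raw count to a frequency per `base` words."""
    return round(count * base / corpus_words)


# Abridged from the romanized example (fillers and middle clauses dropped).
SAMPLE = (
    "kyoo ohanasi sasete itadaku naiyoo na n desu keredomo /keredomo/ "
    "tokuni mezurasii koto de wa nai to <Quote> omou n desu ga /ga/ "
    "sono koto ni tui te ohanasi sasete itadaki tai to <Quote> omoi masu [EOS]"
)

if __name__ == "__main__":
    for combo, n in combination_counts(SAMPLE).items():
        print(n, combo)
    # 512 [EOS] labels in the 14,306-word Senmyo text, per 20,000 words:
    print(adjusted(512, 14_306, base=20_000))  # 716
```

Labels that are not followed by an [EOS] (for example at the end of a truncated excerpt) are silently dropped by this sketch; a real tool would need to decide how to report them.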
Corpus  Register      # files  # sentences  # words
CSJ     APS (formal)  70       5,389        191,591
CSJ     SPS (casual)  107      4,494        164,096
BCCWJ   Book          83       8,780        204,050
BCCWJ   Magazine      86       9,342        202,268
BCCWJ   Newspaper     340      11,898       308,504
Total                          39,903       1,070,509

Target of Clause Boundary Labels
•Major clause boundaries, classified by 5 types

Clause types  Clause Boundary Labels
EOS           EOS
Coordination  ga, keredomo, keredo, kedomo, kedo, si
Reason        kara, node
Conditional   tara, taraba, to, nara, naraba, reba
misc          Continuative forms of a verb and copula, te, quotation, toyu:

Distribution of CBLs

Type         CBL        APS     SPS     Book    Magazine  Newspaper
EOS          EOS        5,624   5,476   8,606   9,237     7,713
Coordinate   ga         1,027   672     716     552       496
Coordinate   keredomo   382     800     14      2         1
Coordinate   kedomo     108     328     0       0         0
Coordinate   keredo     8       37      15      26        4
Coordinate   kedo       43      584     26      62        10
Coordinate   si         54      230     108     90        21
Reason       kara       78      261     307     185       69
Reason       node       310     735     150     164       40
Conditional  tara(ba)   60      303     184     172       29
Conditional  to         546     691     438     365       265
Conditional  nara(ba)   3       9       42      53        11
Conditional  reba       153     225     450     288       178
misc         ContinueF  556     277     1,908   1,837     2,023
misc         Copula     347     769     448     408       408
misc         te         2,884   3,903   2,122   1,625     1,080
misc         Quote      1,006   1,577   1,130   732       881
misc         toyu:      1,454   1,163   445     267       150
Total                   14,645  18,038  17,110  16,065    13,379

Adjusted frequency: 200,000 words in each register

Number of CBLs within a Sentence
[Figure: frequency (0 to 6,000) by # CBLs within a sentence (1 to 10, ~20, ~40) for APS, SPS, Book, Magazine and Newspaper. Adjusted to 10K sentences in each register.]

CBL combinations (CSJ)

APS                       SPS
32.0%  EOS                26.2%  EOS
9.1%   te / EOS           5.9%   te / EOS
3.8%   toyu: / EOS        3.6%   Quote / EOS
3.3%   ga / EOS           1.8%   toyu: / EOS
2.7%   Quote / EOS        1.7%   te / Quote / EOS
2.2%   te / te / EOS      1.6%   Continue / EOS
1.9%   Continue / EOS     1.4%   keredomo / EOS
1.5%   te / toyu: / EOS   1.4%   to / EOS
1.5%   Copula / EOS       1.4%   te / te / EOS
1.5%   te / Quote / EOS   1.2%   ga / EOS

CBL Combinations (BCCWJ)

Book                  Magazine                 Newspaper
45.1%  EOS            53.7%  EOS               52.2%  EOS
7.8%   te / EOS       7.8%   Cont / EOS        11.5%  Cont / EOS
6.4%   Cont / EOS     6.3%   te / EOS          5.6%   te / EOS
3.5%   Quote / EOS    2.9%   Quote / EOS       3.6%   Quote / EOS
2.4%   ga / EOS       2.4%   ga / EOS          2.7%   ga / EOS
1.8%   to / EOS       2.0%   Copula / EOS      2.6%   Copula / EOS
1.8%   Copula / EOS   1.7%   to / EOS          1.4%   to / EOS
1.6%   reba / EOS     1.3%   te / Cont / EOS   1.2%   Cont / Cont / EOS
1.2%   toyu: / EOS    1.3%   reba / EOS        1.1%   Cont / te / EOS
1.2%   kara / EOS     1.0%   toyu: / EOS       1.0%   te / Cont / EOS

Controlled, common writing style in published text?

Number of CBLs within a Sentence
Columns: ×6 ×7 ×8 ×9 ×10 ~×20 ~×40
APS 321 186 41 17 37 0
SPS 465 367 185 116 95 78 220 11
Book 95 49 10 10 5 0 0
Magazine 54 16 5 3 2 1 0
Newspaper 35 4 2 2 0 1 0
Adjusted to 10K sentences in each register
Spoken > Written
Casual > Formal

Clause Linkages in Spontaneous Speech
•私が住んでいたところは団地 の二階でし EOS
  その前は大きな 明治 道路が走っていたんです EOS
  団地と道路の間にはこう団地の庭みたいな感じで 道路の手前に木がたくさん生えていたので EOS
  鳥が 飛び出したとしてもすぐには道路に出ないで その 木 木の辺りに引っ掛かってるかな という 感じでしたので EOS
  二階からこう 木を 木のどの辺にいるかという のを当たり付けて 当たりを付けると言うか 探してみて すぐには見つからなかったので しょうがないので すぐに外に

If he writes it as…
1. 私が住んでいたところは、団地の二階でした。
2. その前は大きな明治道路が走っていました。
3. 団地と道路の間は団地の庭のような感じで、道路の手前に木がたくさん生えていました。
4. ですから、鳥が飛び出したとしても、すぐには道路に出ずに、その木の辺りに引っ掛かってるかなと思いました。
5. そこで、まず二階から木のどの辺にいるかの当たりを付けて、. . .

Difference of Speech and Written Text
•Process of producing utterances (Levelt 1989)
  –A speaker has to plan and continue the utterance while s/he is speaking
  –Dynamic processing in real time to produce a narrative
  –Disfluencies (fillers, word fragments, repairs…)
•Process of writing texts
  –A writer can take as much time as needed until s/he fixes the result
  –Editing (copy & paste, delete, rewrite), proof-reading
  –Typos
Constraints of continuity in spontaneous speech generate multiple clause linkage structures

“Dynamic Rewriting Rule”
•Kondo (2005) “MCL in Early Middle Japanese”
  –A main clause with subordinate clauses is often rewritten into another subordinate clause dynamically
1. [ [hito sigekumo aranedo] + tabi kasanari keri] (EOS)
2. [ [ [hito sigekumo aranedo] + tabi kasanari kereba] + aruji kikituku] (EOS)
3. [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba] + aruji kikitukete] + sono kayohiji ni yogoto ni hito wo suetu] (EOS)
4. [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba] + aruji kikitukete] + sono kayohiji ni yogoto ni hito wo suete] + mamorasekeri] (EOS)
5. [ [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba] + aruji kikitukete] + sono kayohiji ni yogoto ni hito wo suete] + mamorasekereba] + ikedomo e awadu] (EOS)
6. [ [ [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba] + aruji kikitukete] + sono kayohiji ni yogoto ni hito wo suete] + mamorasekereba] + ikedomo e awade] + kaerikeri] (EOS)

Narratives and Multiple Clause Linkage
•Kondo (2005)
  –“Writing styles in Early Middle Japanese are related to “narratives”, which must reflect spoken language.”
  –“Speech is produced dynamically, combining phonetic forms and semantic meaning simultaneously.”
•Mechanism of Multiple Clause Linkage
  –MCL is a reflection of the nature of narratives, in which a speaker/writer keeps telling a series of episodes
    •Dynamic production of narratives in spontaneous speech
    •“Lively” description styles in written text (bad sentences, or effectively used by professional authors)

OCOJ
The Oxford Corpus of Old Japanese オックスフォード大学上代日本語コーパス
•The Oxford Corpus of Old Japanese
  –A comprehensively annotated corpus of Old Japanese
    •Original text in Kanji, romanized phonemic transcription, English translation, morphological information, lemmatization, and grammatical and semantic roles of noun phrases

Poetic texts
Kojiki kayo (古事記歌謡; 712)                    112 poems   2527 words
Nihon shoki kayo (日本書紀歌謡; 720)            133 poems   2444 words
Fudoki kayo (風土記歌謡; 730s)                  20 poems    271 words
Bussukoseki-ka (仏足石歌; after 753)            21 poems    337 words
Man’yoshu (万葉集; after 759)                   4685 poems  83706 words
Shoku nihongi kayo (続日本紀歌謡; 797)          8 poems     134 words
Jogu shotoku hoo teisetsu (上宮聖徳法王帝説)    4 poems     60 words

Non-poetic texts
Shoku nihongi Senmyo (続日本紀宣命)   approx. 14,000 words
Engishiki Norito (延喜式祝詞)         approx. 6,500 words

(tentative) Result of Annotation
•Shoku nihongi Senmyo (続日本紀宣命)
  –A total of 14,306 words
  –A total of 3,121 clause boundary labels were annotated.

CBL         Count    CBL       Count
te          727      te-namo   30
EOS         512      suru-o    22
Adnominal   498      domo      15
Quote       381      yueni     7
ContinueF   173      madeni    7
suru-mo     56       manimani  5
suru-ni     46       temo      4
Quote-namo  44       nagara    4
reba        42       made      4

Annotating CBLs to Senmyo text

Comparing Senmyo to CSJ/BCCWJ
•Adjusted to 20,000 words for each register

CBL        SPS  Book  Senmyo
EOS        548  861   716
te         390  212   1,016
Quote      158  113   533
keredomo   80   1     99
ContinueF  28   191   242
reba       23   45    59

CBL combination (OCOJ)
•Identifying the clause boundaries
ametuti no muta nagaku topoku aratamu masiziki tune no nori to tatetamapyeru wosukuni no nori mo katabuku koto naku ugoku koto naku watariyukamu to namo omoposimyesaku to noritamapu opomikoto wo moromoro kikitamapeyo to noritamapu (OCOJ: Senmyo 3)

CBL combination (OCOJ)
•Identifying the clause boundaries
ametuti no muta nagaku topoku aratamu masiziki /adnom/
tune no nori to tatetamapyeru /adnom/
wosukuni no nori mo katabuku /adnom/
koto naku /ContF/
ugoku koto naku /ContF/
watariyukamu to namo /Quote-tonamo/
omoposimyesaku to /Quote/
noritamapu /adnom/
opomikoto wo moromoro kikitamapeyo to /Quote/
noritamapu [EOS]
/adnom/adnom/adnom/ContF/ContF/Quote-tonamo/Quote/adnom/Quote/EOS/

CBL combination (OCOJ)
•Adjusted to 20,000 words for each register

Senmyo                          Book
109  EOS                        388  EOS
94   te / EOS                   67   te / EOS
41   Quot / EOS                 55   Cont / EOS
28   te / te / EOS              30   Quote / EOS
15   te / Quot / Quot / EOS     21   ga / EOS
14   Quot / Quot / EOS          15   to / EOS
13   Cont / EOS                 15   Copula / EOS
13   suruni / EOS               14   reba / EOS
11   te / Quot / EOS            10   toyu: / EOS
11   te / te / te / EOS         10   kara / EOS

CBL combination (OCOJ)
•Adjusted to 20,000 words for each register
Senmyo: as in the left-hand column above
SPS: 144 32 20 20 13 10 9 9 9 8
SPS labels: EOS te / EOS NounQuot / EOS Quot Noun-/ EOS tetoyu: / EOS te / Quot / EOS Copula / EOS Interjection keredomo / EOS

Prospect
•Is it possible to examine clause linkage structures in a series of historical Japanese texts?
  –Old Japanese ( -794)
  –Early Middle Japanese (8-12 C)
  –Late Middle Japanese (12-17 C)
  –Early Modern Japanese (17-19 C)
  –Modern Japanese (19-20 C)
  –Contemporary Japanese (20-21 C)
•Corpus-based diachronic studies of Japanese
  –Development of new historical corpora
  –Grammatical studies from various aspects

References
1. Kondo, Yasuhiro (2005) “Heian jidai go no fukushi setsu no setsu rensa koozoo ni tsuite”, Kokugo to kokubungaku, 82(11).
2. Levelt, W. J. M. (1989) Speaking: From Intention to Articulation. MIT Press.
3. Maruyama, Takehiko, Hideki Kashioka, Tadashi Kumano and Hideki Tanaka (2004) “Nihongo setsu kyookai kensyutsu puroguramu CBAP no kaihatsu to hyooka”, Sizen gengo syori, 11(3).
4. Maruyama, Takehiko (2014) “Gendai nihongo no tajuutekina setsu rensa koozoo ni tsuite”, Hanashi kotoba to kaki kotoba no setten, Hituji Shobo.
5. Nagano, Masaru (1969) Akubun no jiko shindan to chiryoo no jissai, Shibundo.