NAIST-IS-MT9851113

Master's Thesis

Semantic model and analysis with a stochastic association*

Daichi Mochihashi

February 14, 2000

Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology

Master's Thesis submitted to the Graduate School of Information Science, Nara Institute of Science and Technology, in partial fulfillment of the requirements for the degree of MASTER of ENGINEERING.

Thesis Committee: Yuji Matsumoto, Professor; Hiroyuki Seki, Professor

Abstract

This thesis describes the meaning of a word by a stochastic association. The meaning of a word is calculated by Markov transitions serving as associations, based on association probabilities statistically deduced from corpora, and the meaning is defined as a state probability distribution over the lexicon. By treating association as a state transition, indirect semantic cooccurrences can be represented adequately. Based on this quantitative representation of meaning, we also calculate the semantic consistency and semantic informativeness of a sentence.

Keywords: semantics, spreading activation, Markov process, stochastic process

* Master's Thesis, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-MT9851113, February 14, 2000.

Abstract (in Japanese, translated)

This thesis represents the meaning of a word as a probability distribution expressing associative relations between words, and describes its formulation and the acquisition of association probabilities. The meaning of a word is computed by association on a Markov process of associations derived from cooccurrence relations, and meaning is defined as its state probability distribution. By performing association as state transitions, semantic relations to words that do not directly cooccur can be represented. Based on meaning thus quantitatively expressed as a probability distribution, we attempt to compute quantitative measures of the semantic consistency and semantic information of a sentence, and summarize the problems that arose. Keywords: semantics, spreading activation, Markov process, stochastic process.

Contents

1. Introduction
2. Meaning and semantic analysis of Language
   2.1 Lexical Meaning
   2.2 Structural meaning
3. Previous work on Lexical Semantics
   3.1 Logic based approach
   3.2 Psycholinguistic approach
   3.3 Neural network approach
   3.4 Related work
       3.4.1 Semantic similarity
       3.4.2 Semantic disambiguation
       3.4.3 Approaches in Information Retrieval
   3.5 Summary
4. Meanings as association
   4.1 Lexical Semantic Network
   4.2 Previous works on LSN
   4.3 LSN as stochastic network and its formulation in Markov process
   4.4 Mathematical formulation of LSN
       4.4.1 Probability spaces
       4.4.2 Composition of Markov process
   4.5 Mathematical properties of stochastic LSN
   4.6 Acquisition of association probability
       4.6.1 Mutual information
       4.6.2 Log likelihood ratio
5. Comparison with former formalisms
   5.1 Advantageous properties
   5.2 Limitations of the formalism
6. Semantic sentence analysis
   6.1 Semantic consistency of a sentence
       6.1.1 Experiment
       6.1.2 Evaluation
   6.2 Semantic information measure
       6.2.1 Experiment
       6.2.2 Evaluation
7. Conclusion
8. Acknowledgements
References

1. Introduction

Language is a medium for communicating meaning. However, this aspect of language has been left largely unexamined, as studies have concentrated on syntax since the advent of generative grammar by Chomsky. On the other hand, the study of grammar itself appears to have revealed the indispensability of semantics when we analyze natural languages. Furthermore, from the practical point of view, with the rise of the hypertext space exemplified by the Web, proper treatment of meaning is strongly required in Information Retrieval (IR). This thesis is devoted to capturing such realistic meanings of words automatically, within inevitable limitations, and to modeling a system in which meanings are produced.

Overview

This thesis focuses on lexical meaning and defines it as a probability distribution over a stochastic unlabeled network consisting of words as its simple nodes. A node in the network exists only through the relationships to its neighbors and has no internal structure of its own. First, we describe the two main subdivisions of semantics, i.e., lexical meaning and structural meaning, and describe why we focus on lexical meaning. Second, we survey the previous work done on lexical meaning, divided into the logical approach, psycholinguistic theory, and related technical pursuits in natural language processing. We review them and locate where this thesis has its own place.

2. Meaning and semantic analysis of Language

Sentences in natural language consist of words and their arrangements. While there are subconstituents of words, such as morphemes and phonemes, which contribute substantially to the semantics of words [24], we define a word here as an atomic constituent of language, both for simplicity and because these subconstituents might be treated within the framework this thesis proposes. When a word is regarded as a semantic atom of language, then, corresponding to the atoms and their arrangements, the semantics of natural language can be classified into two subdivisions: lexical meaning and structural meaning. In general, lexical meaning is what Saussure referred to as `paradigmatic' meaning, and structural meaning corresponds to `syntagmatic' meaning [9]. Below, we investigate these two meanings more closely and describe why we focus on lexical meaning.

2.1 Lexical Meaning

Lexical meaning is what a certain word stands for. As we have seen above, a word can be regarded as an atom of semantics. The meaning of a sentence is based on the meanings of the words it contains; we cannot dispense with lexical meaning when we think about the meaning of a sentence¹. Lexical meaning has nothing to do with arrangements, i.e., grammatical constructions, but some structural meanings can be expressed in a static lexical way.
For example, `the situation where some liquid is heated into a gaseous form' can be described simply as `evaporation,' and in Latin the sentence `He will sing' can be expressed by the single word `cantabit.' Sapir comments on examples found in some native American languages, where a much more complicated situation is compacted into a single word [21]. However, lexical meaning obviously has limitations where such a `frozen' expression does not yet exist. In Japanese, one can describe a certain quiet atmosphere of an old city landscape as `たたずまい', but in English one cannot express such an atmosphere in a single word and cannot resort to lexical meaning; instead, one must use syntactic constructions, as in this very sentence. On the whole, complicated meaning is difficult to describe by lexical meaning alone, and it is obvious that there are some meanings which can only be expressed with the help of syntactic description.

¹ Even in the case of an idiom, its meaning is more or less based on the meanings of its constituent words; we cannot replace a word in an idiom with any other word.

Although lexical meaning thus has its own limitations, it is the key to semantics, and no account of semantics can do without it. Also, in view of the present methods in Information Retrieval, which accept words as queries, exploring lexical meaning is the first inquiry we have to make.

2.2 Structural meaning

Structural meaning is meaning that cannot be deduced from the mere set of constituent words; it lies in the abstract construction formed in the person who hears the expression. For example, the meaning of the sentence `Every boy has a lover.' can be described via the logical form $\forall x \exists y.\, \mathrm{boy}(x) \supset \mathrm{lover}(x, y)$. Montague semantics [17] traditionally advocates this kind of treatment, and it has been studied as the mainstream of semantics. Treatments such as quantification and identification are useful and cannot be replaced by lexical meanings. However, this approach says nothing about what a logical predicate, say "boy()", means; this is a critical limitation when we seek an accurate definition of meaning. Only when these predicates are properly defined does it become possible to talk about abstract constructions on the basis of these definitions of the meanings of predicates. In other words, syntagme exists on the base of paradigm. Therefore, we choose not to treat structural meaning until lexical meaning is defined, and focus on lexical meaning as the atom of semantics.

3. Previous work on Lexical Semantics

3.1 Logic based approach

Most previous studies on lexical meaning concentrate on a logic based approach. For example, the ψ-term [1] defines the meaning of a word by labeled structures and their corresponding values. There are hierarchical structures over the labels and, using the hierarchy, unifications are defined between ψ-terms to construct a semantic hierarchy of words over the lexicon. Similar descriptions are adopted in the Generative Lexicon [19], where descriptions are divided into four structures: argument structure, event structure, qualia structure, and inheritance structure. While argument structure treats the structural meaning discussed above, event structure and qualia structure deal with semantic interpretation. Using event structure and the substructures of qualia structure, the Generative Lexicon tries to capture selectional preferences and metaphors. In general, these approaches use typed hierarchical structures and attempt to capture the semantic interpretation of a word by logical operations, especially unification.
However, they have problems with the acquisition of the labels and structures they use, and with the determinacy of logical deduction, which does not always fit our intuition. The former is especially problematic when treating new words or the huge set of words commonly found in a realistic environment.

3.2 Psycholinguistic approach

The psycholinguistic or cognitive approach was first introduced by Quillian [20]. His idea was to describe the meaning of a word by a labeled semantic network. Following the links, one obtains the meaning of a word from the properties of the nodes and the types of the links one follows. This theory was extended by the spreading activation theory of Collins & Loftus [6] to match the results of psychological experiments, but there is no mathematical formulation of the process, which the original theory of Quillian aimed at.

Another psycholinguistic approach is exemplified by the Knowledge Representation Network. While spreading activation theory itself can be regarded as a Knowledge Representation Network, KL-ONE [3], Conceptual Graphs [23], and other representations are characterized by the logical operations defined over their nodes. Using logical operations, they share the advantages and disadvantages of logic based approaches. Due to the difficulty of defining such a knowledge network, no semantic processing system of this sort exists in practical use.

3.3 Neural network approach

Since the meaning of language is defined and processed in the human brain, it seems natural to represent it by a neural network. Elman [13] uses a simple recurrent neural network and lets it learn to predict the next word from a corpus given as its input. By clustering the hidden unit activations, he obtained a semantically adequate hierarchy of words automatically. Takahashi [27][28] also uses a layered neural network to build a semantic representation of words, but its dimension is quite low (4) and its vocabulary is very small (11). Assuming a limited set of semantic cases beforehand, it does not go beyond an artificial set of meanings. In general, the neural network is a promising and natural approach in its relationship to the cognitive system, but its internal representations are not intuitive to interpret and are difficult to apply to semantic information processing. When it becomes clear and easy to manipulate the internal representations of these networks, it will be a promising method that matches our cognitive process.

3.4 Related work

As to the computational treatment of lexical relationships, several works have been performed in natural language processing.

3.4.1 Semantic similarity

Semantic similarity attracts attention in relation to the clustering of words, which enables us to construct computationally the semantic categories we recognize naturally. From an information-theoretical point of view, Church et al. proposed a measure named the association ratio, based on the pointwise mutual information
$$I(x, y) = \log \frac{P(x, y)}{P(x)P(y)},$$
which expresses the relevancy of occurrence of two words [5]. He proposes this to find lexico-syntactic regularities and semantic classifications of subcategorization in lexicography, while admitting its limitations concerning the use of mutual information when $P(x, y)$ is too small; he thus omits pairs with joint occurrence frequency $f(x, y) < 5$ from the investigation.
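As a concrete illustration, the association ratio can be computed directly from cooccurrence counts. The following is a minimal Python sketch of Church & Hanks' measure, not code from the works cited; the count containers `cooc` and `unigram` are hypothetical inputs assumed to be supplied by the caller.

```python
import math
from collections import Counter

def association_ratio(cooc: Counter, unigram: Counter,
                      total_pairs: int, total_words: int,
                      min_count: int = 5) -> dict:
    """Pointwise mutual information I(x, y) = log P(x,y) / (P(x) P(y)),
    skipping pairs with joint frequency below min_count, as Church &
    Hanks do for f(x, y) < 5."""
    scores = {}
    for (x, y), f_xy in cooc.items():
        if f_xy < min_count:          # low-frequency pairs are unreliable
            continue
        p_xy = f_xy / total_pairs
        p_x = unigram[x] / total_words
        p_y = unigram[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```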
Hindle also uses mutual information as a measure of semantic relevancy; from the predicate-argument structures of transitive verbs, he extracted word similarities of the nouns they subcategorize [14]. Extraction of the predicate-argument structure of a verb is performed by Fidditch, a simple deterministic parser originated by Marcus [16]. As opposed to these, Pereira et al. use the Kullback-Leibler divergence to measure the semantic distance between the cooccurrence probability distributions of two words [18]. They also focus on predicate-argument structure, using the Fidditch parser Hindle used, and perform a noun classification in which each word has a membership probability for each cluster.

3.4.2 Semantic disambiguation

Polysemy of words is a semantic problem natural language processing must deal with. Dagan performed the pseudo-word disambiguation experiment proposed by Schütze [22] using similarity estimation between transitive verbs and their objects [8]. He defines the similarity-based probability of $w_2$ given $w_1$ as a weighted sum of the conditional cooccurrence probabilities between $w_1'$ and $w_2$, where $w_1'$ ranges over words relevant to $w_1$ (possibly the whole lexicon), each weighted by its similarity to the original $w_1$. He compares the results of measuring the similarity of probability distributions using the KL divergence, the information radius, the $L_1$ norm, and the confusion probability, and concludes that the definition using the information radius gives the best performance. Additionally, he notes that the comparison of cooccurrences suffers a noticeable performance degradation when very low frequency events are omitted; this is typically important in the use of mutual information, where such low frequency events must be excluded to prevent their information from being overestimated.

Yarowsky takes a different approach, disambiguating a polysemous word by building a decision list whose entries carry the likelihood of determining the sense [26]. He first gives the characteristic collocations of a word by hand; by finding the most likely collocation words of these collocations and continuing this process, the lexicon is classified into a set of clusters. A decision list for disambiguation is then built from these clusters via a likelihood membership function for each cluster. However, this first requires characteristic disambiguation keys given by hand, which are not always clear for an unfamiliar word with unknown polysemy. And although this approach is effective for disambiguation, keeping a huge decision list covering every word in the lexicon is not realistic. Bearing in mind that such decision lists have difficulty reflecting more indirect context effects, the decision list approach has limited applicability in a realistic environment.
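Returning to Dagan's similarity-based estimate described above, the idea reduces to a few lines. The following is a minimal sketch under assumed data structures: `cond_prob` holds conditional cooccurrence probabilities $P(w_2 \mid w_1')$ and `similar` maps a word to its weighted neighbor words; both names are hypothetical, and the actual schemes compared in [8] differ in how the similarities are derived.

```python
def sim_based_prob(w1: str, w2: str, cond_prob: dict, similar: dict) -> float:
    """Similarity-weighted estimate of P(w2 | w1):
    sum over relevant words w1' of sim(w1, w1') * P(w2 | w1'),
    normalized by the total similarity mass of w1's neighbors."""
    num = sum(s * cond_prob.get((w1p, w2), 0.0)
              for w1p, s in similar[w1].items())
    den = sum(similar[w1].values())
    return num / den if den > 0 else 0.0
```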
3.4.3 Approaches in Information Retrieval

In the field of Information Retrieval, which treats a set of words as a query, word weighting and the measurement of word relevancy have attracted attention. Schütze constructs a semantic vector whose elements are cooccurrence frequencies with other words [22]. It is defined in a very high dimensional space in which each word constitutes a dimension, and the semantic similarity of two words is measured by the resemblance of the directions of their vectors. In general, since the concept of the term-document matrix is of common interest in Information Retrieval, the extension from the term-document matrix to a term-term matrix is quite natural. Because of the huge dimensionality of this kind of treatment, [22] and LSI [10] use an SVD algorithm or perform a Principal Component Analysis to reduce the dimensionality of the semantic representation, so that it can be calculated with moderate resource use. However, this leaves a critical problem with low-frequency words: while the importance of singletons (events occurring only once) has been pointed out by Dagan [8], these low frequency words have very limited and specific cooccurrences and cannot easily be mapped onto principal dimensions. If they are mapped, the result may be so ambiguous as to lose the keen informativeness the original words had. Information Retrieval is a promising field whose techniques are executed in practice. However, in view of these characteristic events, the dimension reduction it commonly assumes leaves much to be desired semantically. A method is required that uses moderate computational resources while preserving the fertile diversity of the realistic meanings of words.
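To make the dimension reduction discussed above concrete, the following sketch performs an LSI-style truncated SVD over a term-term cooccurrence matrix with NumPy. It is an illustration under stated assumptions, not code from [10] or [22]; the matrix layout and the choice of `k` are hypothetical.

```python
import numpy as np

def reduce_dimensions(cooc_matrix: np.ndarray, k: int) -> np.ndarray:
    """Project a (vocab x vocab) cooccurrence matrix onto its first k
    principal dimensions via truncated SVD.  Rows of the result are
    k-dimensional word vectors."""
    U, S, _ = np.linalg.svd(cooc_matrix, full_matrices=False)
    return U[:, :k] * S[:k]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic similarity as the cosine of the angle between vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

Note that a word occurring only once has a nearly one-hot cooccurrence row; projecting it onto the k principal dimensions blurs exactly the specific information the text above argues should be preserved.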
3.5 Summary

Much work has been done on the semantic treatment of the lexicon. Traditionally, there are logic based approaches concerned with structural meanings, and in the field of psychology many cognitively adequate systems of semantic processing have been proposed, though many of them are not formulated mathematically. In the pursuits of natural language processing, semantic similarity and semantic disambiguation have mainly attracted attention. These works propose information-theoretical measures between words and saw considerable success in each field. However, they are not sufficiently general in representing the meanings of words, so one approach cannot easily be applied to other objectives, and they share an inadequacy for application in a changing, realistic environment. Strategies taken in Information Retrieval are promising in their practicality and automation, but the compromise between moderate computational resource use and preservation of the semantic diversity of a realistic environment remains a difficult problem to be challenged.

4. Meanings as association

4.1 Lexical Semantic Network

As surveyed above, previous methods of defining lexical meaning have both advantages and drawbacks; especially in application to a realistic environment, some of them proved to be of limited use. Back to basics: what information do we use when we learn the meaning of a specific word? Here, we return to the very fundamental situation in which a child learns the meaning of a new word @. What information will he use? If someone says `@!' with no attendant circumstances in the dark, he cannot grasp its meaning at all. However, if someone says with an indication `This is @', or `@ is α', he will surely come to know what the word @ means by associating it with the substance he was shown, or alternatively with the word α. The same holds when he hears it in a context `β1 @ β2', where he can associate β1 and β2 as relevancies of @. Of course, α and β1, β2 are themselves words and have their own associative neighbors; then, if we assume that any image in the cognitive map has an equivalent set of words representing it, we can define the meaning of a word by the set of words associated directly or indirectly from that word, because under this assumption any association, whether of words or of images, can be reduced to a certain set of words in the long run.

The meaning of a word is then expressed by the set of words associated from it; but this need not be a simple set and may carry some features. As a feature, we adopt a weight on each associated word corresponding to the strength of association, because there are no explicit types of association in a real corpus, and therefore assuming types of links is not valid when we want to extract semantic relationships from corpora. When words are interconnected with one another by association, they can be regarded as a network: we call it a Lexical Semantic Network (LSN) after [15]. By traversal of the nodes in this network, the meaning of a word is defined as a weighted set of words associated with it.

4.2 Previous works on LSN

As we saw in section 3.2, the definition of meaning by an LSN was advocated originally by Quillian [20], Collins & Loftus [6], Sowa [23], and others. But their works all assume labels or types over the links, defined a priori. As we claimed in section 4.1, the arbitrariness of labels makes them difficult to define and maintain in a realistic environment. As for the study of unlabeled LSNs with weighted links, there are several works, such as Waltz & Pollack [25], Kozima [30], and Hiro [34]. Waltz & Pollack [25] use real-valued quantitative links to perform spreading activation, but the weights of excitatory and inhibitory links are fixed to constants beforehand, and no strategy is proposed for acquiring these weights from real data. This study showed that, by utilizing quantitative links, semantic polysemy can be treated appropriately to select the semantically correct parse tree within the framework of CFG. Kozima [30] calculates association probabilities from the definition sentences of the words in LDOCE² and calculates an activation vector using heuristic spreading activation. Entries in LDOCE are defined using the limited vocabulary LDV³, so it can be regarded as defining the meaning of a word with an effect like that of principal component dimensions. Although this strategy is very effective, it is based on an existing dictionary and is therefore insufficient for an enormous lexicon that is newly created or updated continuously. When we also notice that the meaning of a certain word differs in part from person to person according to his interests or experience, acquisition of semantic relevancies from real corpora is strongly desirable. Hiro [34] follows [30] and constructs a semantic network from a corpus. It uses the semantic similarity between words as the value of the link between them, calculated by a modification of Dagan's [7] csim() based on mutual information. Because it is essentially based on mutual information, there are problems with words of low frequency. He avoids this by dealing only with high frequency links where frequency ≥ 40, but this compromise makes the result semantically unuseful, because semantically important information is generally conveyed by exactly the low frequency words that are omitted.

² Longman Dictionary of Contemporary English
³ Longman Defining Vocabulary

4.3 LSN as stochastic network and its formulation in Markov process

Spreading activation theory descended from [20] describes a process of successive activation starting from the initially activated concept. [25] and [30] perform this kind of association using heuristic formulae. If we assign each word in the LSN a state transition probability according to its associativity to a certain word, the aggregation of these relationships can be regarded as a Markov process. Even if a word has no direct association to the central word, it may attain a high state probability through state transitions if it is commonly relevant to the other associates of the central word. In view of this, if we regard a state transition of the Markov process as an association and the state probability as an activation, we can define the meaning of a word $w$ by the state probability distribution $\{P(\omega)\}$ over the words $\omega$ in the Lexicon, starting from the state in which $P(\omega = w) = 1$. As the transition proceeds, the associations spread further; if we `contemplate deeply', we can broaden the extensional subjects by increasing the number of transition steps $t$. We note here that the initial distribution need not be $P(\omega = w) = 1$ for the central word and 0 for all other words; it can be any distribution. For example, if we want to know the meaning of `blue sky', the initial distribution can be set to $P(\mathrm{blue}) = 0.4$ and $P(\mathrm{sky}) = 0.6$ to obtain the mixture of these meanings as an associative result. So far we have referred to the meaning of a word rather informally. In the next subsection, we formalize it mathematically.
4.4 Mathematical formulation of LSN

As we saw in 4.1, semantic associations are acquired through linguistic cooccurrences or direct cooccurrences with cognitive images. Therefore, we will construct a semantic association space from the relation of the cooccurrence space and formalize them as two probability spaces.

4.4.1 Probability spaces

def. (collocation space) For natural numbers $i, j \in \{1..n\}$, we define a unitary event $\omega = c_{ij}$ as a cooccurrence of the $i$th word and the $j$th word. For the sigma field $\mathcal{F}_C = 2^{\Omega_C}$, where $\Omega_C = \{c_{ij} \mid i, j \in \{1..n\}\}$, and a function $N$ which maps an event to the frequency of its occurrence,
$$P_C(\omega) = \frac{N(c_{ij})}{\sum_{i,j} N(c_{ij})}, \qquad P_C(Z) = \sum_{\omega \in Z} P_C(\omega)$$
defines a probability space $\{\Omega_C, \mathcal{F}_C, P_C\}$ called a collocation space.

def. (semantic space) A semantic space is a probability space $\{\Omega_S, \mathcal{F}_S, P_S\}$ such that, for a relation $\omega = s_{ij}$ between the $i$th and $j$th words of the collocation space and its probability $P_S(\omega)$, $\Omega_S = \{s_{ij}\}$ and $\mathcal{F}_S = 2^{\Omega_S}$ are defined by
$$P_S(\omega) = a_{ij}, \qquad P_S(Z) = \sum_{\omega \in Z} P_S(\omega),$$
where $a_{ij}$ is properly defined afterwards (section 4.6).

4.4.2 Composition of Markov process

def. (association probability) A set $w_i = \{s_{ij} \mid j \in \{1..n\}\}$ in a semantic space is called a word in the semantic space, and the set $L = \{w_i\}$ $(i \in \{1..n\})$ is called a Lexicon. For a stochastic process $c_t(\omega)$ $(\omega \in L)$ which takes its states in $L$, the association probability $a(w_j \mid w_i)$ is defined by
$$a(w_j \mid w_i) = \frac{P_S(w_i \cap w_j)}{P_S(w_i)} = \frac{a_{ij}}{\sum_j a_{ij}}.$$
Using $a(y \mid x)$, $c_t(\omega)$ defines a simple Markov process with time step $t$. Then,

def. (meanings) A meaning is a state probability distribution $\mu = \{P(c_t(\omega) = w)\}$ $(w \in L)$; in particular, the meaning of $w$ after $t$ transitions, $\mu_t(w)$, is the state probability distribution
$$\mu_t(w) = \{P(c_t(\omega) = w_i) \mid P(c_0(\omega) = w) = 1\} \quad (i \in \{1..n\}).$$

4.5 Mathematical properties of stochastic LSN

When we define the LSN stochastically as above, meanings can be calculated by assigning each association probability $a(w_j \mid w_i)$ to the element $A_{ij}$ of the Markov transition matrix $A$ of the LSN. The meaning of a certain word can be thought of as decomposed into direct associations, 2-level indirect associations, ..., $t$-level indirect associations, which was originally addressed as `finding an intersection' in [6]; therefore, we can calculate the meaning $\mu_t(w_k)$ of a word $w_k$ by
$$\mu_t(w_k) = \frac{1}{t}\left(A\tilde{w}_k + A^2\tilde{w}_k + \cdots + A^t\tilde{w}_k\right) = \frac{1}{t}\sum_{i=1}^{t} A^i \tilde{w}_k \qquad (1)$$
where $\tilde{w}_k$ is the initial probability vector corresponding to $w_k$, which is normally the indicator vector $\tilde{w}_k = {}^t(0 \cdots 1 \cdots 0)$ with 1 in the $k$th position. Equation (1) is a formalization of the spreading activation used in [25] and [30]; however, it has a clear mathematical interpretation and a characteristic property. In fact, when $\mu^*$ is the eigenvector corresponding to the eigenvalue $\lambda = 1$ of the stochastic matrix $A$, $\lim_{n\to\infty} A^n\tilde{w} = \mu^*$ for any $w$ in the Lexicon, by the known property of Markov transition matrices [4]. Therefore, $\lim_{t\to\infty} \mu_t(w) = \mu^*$ for every $w$ in $L$, and the meaning of every word converges to $\mu^*$ over a sufficiently long association interval $t$, whenever the corresponding Markov process is ergodic. Since $\mu^*$ is a semantic vector which does not vary under association, $\mu^*$ can be thought of as a semantic tautology of the corresponding semantic space.
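Equation (1) lends itself directly to computation. The following is a minimal Python sketch, assuming $A$ is stored as a column-stochastic NumPy array with $A[j, i] = a(w_j \mid w_i)$, so that the product `A @ v` performs one association step; it is an illustration, not the thesis's actual implementation.

```python
import numpy as np

def meaning(A: np.ndarray, w_tilde: np.ndarray, t: int) -> np.ndarray:
    """Meaning mu_t of eq. (1): the average of the state distributions
    A^1 w~, ..., A^t w~ reached by t association steps from the initial
    distribution w_tilde over the lexicon."""
    v = w_tilde.copy()
    acc = np.zeros_like(v)
    for _ in range(t):
        v = A @ v                   # one state transition (association)
        acc += v
    return acc / t
```

For a single word $w_k$, `w_tilde` is the indicator vector with 1 at position $k$; for a mixture such as `blue sky', it can instead carry, e.g., 0.4 on `blue' and 0.6 on `sky', as described in section 4.3.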
Therefore, tlim t (w) = 3 for every !1 w in L, and every meaning of a word converges to 3 in the suciently long association interval of t, whenever corresponding Markov process is ergodic. Since 3 is a semantic vector which does not vary in association, 3 can be thought as a semantic tautology in a corresponding semantic space. 4.6 Acquisition of association probability We did not dene in 4.4 as to how to calculate the probability PS (! ) = aij of associativity in semantic space. This subsection shows the strategies other studies used and describes about log likelihood ratio which this study takes as its measure. 13 4.6.1 Mutual information Since aij expresses associative overlap between two word wi and wj and interchangeable between i and j , it is thought to measure it by the mutual information I (wi; wj ) = log P (wi; wj ) P (wi )P (wj ) (2) as Church[5] claimed as an associative measure for lexicography. Though it works fairly well, it has problems with very low frequency words. P (wj jwi ) Because I (wi; wj ) = log , mutual information measures eectively conP (wj ) ditioned cooccurence probability P (wj jwi ) around wi by weighting in inverse propotion to the occurence probability P (wj ) of its neighbor. However, we do not always think a low frequency word as important: the reverse is often true for our cognitive intuition. As [34] and many studies which are based on mutual information, compromise by eliminating low frequency words is possible, but that leads to semantically unuseful results because most of information is conveyed by such very low frequency words that are eliminated, which comprise a substancial part of lexicon. Then, this study uses another measure as association, log likelihood ratio, which Dunning[11] showed its validity in Information Retrieval. 4.6.2 Log likelihood ratio Information measure such as mutual information and 2 -statistics emphasizes the importance of very low frequency words. To solve this problem, [11] proposed a measure using log likelihood ratio test. If the occurences of words distribute not with a normal distribution around the maximal likelihood estimate but with a binary distribution of its occurence probability, the signicance of dependency between the occurences of two words can be calculated using a likelihood ratio. If we assume a null hypothesis H0 : P (wj jwi) = p ^ P (wj j:wi ) = p (3) which says that words wi and wj are independent of the occurence of the other, 14 and the alternative hypothesis H1 : P (wj jwi ) = p1 6= p2 = P (wj j:wi ) (4) which says that the occurence of wj is dependent on the occurence of wi where p, p1, p2 are parameters, the signicance of dependence can be evaluated using a likelihood ratio L(H0) ij = (5) L(H1) and its logarithm L(H0) : (6) log ij = log L(H1) L(H1) Since 0 log ij = log is a relative likelihood of dependency over indeL(H0) pendency and 02 log is asymptotically 2 -distibuted [33], 0 log can be used as a more accurate approximation than 2 - test. c c c 0c When p, p1 , p2 is estimated as p = j , pi = ij , pj = j ij by maximum ci N 0 ci P N likelihood estimates where N = k;l ckl , cij = N (wij ), ci = N (wi ), cj = N (wj ), L(H0) and L(H1) is calculated as L(H0) = Bi (p; ci ; cij )Bi (p; N 0 ci ; ci 0 cij ) L(H1) = Bi (pi ; ci ; cij )Bi (pj ; N 0 ci; ci 0 cij ) (7) (8) where Bi is a binomial distribution Bi (p; n; k ) = nCk pk (1 0 p)n0k . 
Hereafter, we use the association network acquired by parsing 50,000 sentences of the EDR Japanese corpus [12] with a collocational window of 5 words; the resulting LSN has 37,714 word nodes and 1,863,773 links between them. Tables 1 and 2 show the association probabilities around some words in the LSN and the associative results calculated from them, where the association step is t = 4.

Table 1. Association probabilities (cooccurrence frequency in parentheses).

言語:
プログラム 0.066702 (49), 型 0.033042 (32), 世代 0.029995 (20), 記述 0.027043 (18), アセンブリ 0.021971 (9), 文脈 0.017466 (10), 第 0.017035 (21), 表現 0.016671 (16), 定義 0.014594 (12), 処理 0.013171 (17), ..., 示す 0.000002 (1), 中心 0.000002 (1), 装置 0.000002 (1), 同 0.000001 (1), いま 0.000001 (1)

見る:
方 0.080225 (99), られる 0.070657 (140), 面倒 0.011259 (10), 限り 0.010553 (13), いる 0.010482 (138), 目 0.008112 (23), 姿 0.007964 (16), 強い 0.007256 (19), ない 0.007046 (93), 向き 0.006757 (8), ..., 度 0.000001 (2), 業者 0.000001 (1), そこ 0.000001 (1), 相手 0.000001 (1), 戦争 0.000001 (1)

サントノレ:
フォーブル 0.249103, アトリエ 0.225979, ブティック 0.212098, 構える 0.179264, 街 0.133557

Table 2. Calculated meanings (t = 4).

言語:
プログラム 0.021963, 言語 0.018031, 型 0.013351, 世代 0.009194, 記述 0.008461, 第 0.007039, アセンブリ 0.006270, 表現 0.006092, において 0.005672, 処理 0.005434, 文脈 0.005422, 定義 0.004933, 理論 0.004410, 形式 0.003786, 原始 0.003777, 論理 0.003685, いる 0.003416, れる 0.003312, プログラミング 0.003239, ...

見る:
方 0.022640, られる 0.021779, 見る 0.013857, いる 0.005960, ない 0.005230, 面倒 0.003139, 目 0.002975, 限り 0.002965, 姿 0.002865, れる 0.002782, 強い 0.002677, みる 0.002610, か 0.002516, ながら 0.002398, 考える 0.002326, よう 0.002205, こと 0.002077, なる 0.002043, 回る 0.002037, ...

サントノレ:
フォーブル 0.091468, アトリエ 0.088251, ブティック 0.081470, 街 0.064814, 構える 0.063651, サントノレ 0.043291, パリ 0.026355, 貸与 0.011235, 郊外 0.009156, 足 0.006649, 起きる 0.006041, ハンブルク 0.006025, レンタル 0.005846, 店 0.005592, ディスプレー 0.005549, 飲食 0.005320, 入る 0.004804, 商店 0.004770, 連れる 0.004443, パーティー 0.004095, ...

Table 3. Functional words excluded (with their ap(w) values).

の 0.000700, は 0.000786, や 0.000880, で 0.000892, を 0.000899, に 0.000942, が 0.000993, など 0.001114, から 0.001206, と 0.001243, も 0.001273, として 0.001355, による 0.001496, だ 0.001564, する 0.001653, た 0.001658, へ 0.001699, この 0.001727, また 0.001763, その 0.001802, によって 0.001908, まで 0.001990, て 0.002008, について 0.002067, 氏 0.002097, さん 0.002117, しかし 0.002119, ら 0.002152, 者 0.002191, に対する 0.002220, 的 0.002240, より 0.002267, ため 0.002313, ば 0.002332, という 0.002351, さらに 0.002357, 中 0.002423, 用 0.002429, および 0.002431, に対して 0.002482, ある 0.002482

We note here that although this measure works fairly well for low frequency words, it is not always sufficiently applicable to rather frequent words, where functional words are associated highly while other content words are suppressed. To avoid this problem, functional words are excluded from association in this study with a threshold on the ap(x) statistic proposed in [29], which reflects semantic informativeness better than raw frequency by using the cooccurrence probability distribution.
In the corresponding LSN, we used the threshold ap(x) < 0.0025. Table 3 shows these excluded words. Note that these exclusions are mostly functional words, which bear little semantic importance; this compromise therefore does not become problematic, whereas exclusion of low frequency words would be extremely inappropriate.

5. Comparison with former formalisms

As surveyed in section 3, most studies on lexical meaning so far use a discrete representation of lexical knowledge based on first order predicate logic. In contrast, the attempt to represent meaning as a continuous vector quantity has both advantages that previous formalisms cannot afford and representational limitations that cannot be described within the formalism. Below, we review these advantages and drawbacks.

5.1 Advantageous properties

Quantitative treatment of meaning has the following advantages:

- One can define lexical meaning, which depends on world knowledge that changes continuously and varies from person to person, automatically on the basis not of the arbitrary and limited view of theorists but of a statistical and mathematical view of information.
- Meaning can be defined completely automatically from corpora, enabling it to be updated and maintained easily.
- One can give a quantitative criterion of meaning and its related ambiguities, which leads to a natural model of cognition, which is not always deterministic.

Especially, the computational definition of `meaning' is an important element: a natural language entails the production of new meanings of words, and the semantic space is not common to all speakers but differs from person to person through linguistic contacts or spontaneous acquisition of a subspace of the language. With this concern in mind, a semantic description which assumes a referent meaning common to all is essentially insufficient, and a dynamic description is required. However, a definition of `meaning' deduced statistically from collocation has an inevitable limitation, because there is an unobservable discrepancy, which linguistics cannot touch, between a word and its corresponding internal representation in the mind.

5.2 Limitations of the formalism

First, the associative definition of meaning cannot deal with logical constructions. In addition to the flat description of its extensional images, a sentence may have a structural meaning typically depicted by logical formulae. As we saw in section 2.2, this cannot be deduced from lexical meaning, namely from semantic extensions. Considering that the history of semantics has mainly dealt with this kind of meaning, an interface to structural meaning is an important assignment which must be considered next.

Second, the types of associative links we are naturally apt to perceive are excluded from the current formalization. The arguments so far deliberately excluded them to enable a statistical measure of meaning, by treating association as an unlabeled fundamental phenomenon of cognition. In fact, when we are exposed to an associative relation, say `Kyoto' and `Kamogawa', we are apt to see a specific kind of relationship there, such as `locational'. While we note that such a relationship is itself expressed by a word, and that such a label could therefore be retrieved with an adequate treatment of unlabeled association, the phenomenon of label deduction itself cannot be derived from the naive theory alone; we must expect more abstract constructions above it.
6. Semantic sentence analysis

6.1 Semantic consistency of a sentence

We defined the meaning of a word as a probability distribution of association over the lexicon in section 4.4.2. When the meaning of a word is represented thus quantitatively, we can give a quantitative measure of the semantic consistency of sentences, which has been overlooked from the perspective of syntax. As a first imaginable measure, the semantic consistency of a sentence $s$ could be calculated by the degree of clustering of the set of semantic vectors $\{\mu(w_i) \mid w_i \in s\}$. [30] calculates along this line, defining semantic consistency as the similarity of the set of semantic vectors to the set itself. Although he got moderate results with experimental sentences, this proves insufficient when we ponder on it. While this measure gives the semantic consistency of a sentence containing only words that are already sufficiently associated, it cannot touch the machinery by which previously unassociated pairs of words are considered associative so as to acquire a new association. For example, within the sentence "Under the clear sky of autumn in Tennessee, he found a girl with a straw hat.", there is no direct relevance between `sky' and `Tennessee', and `girl' and `straw' are not semantically related in any direct way. The reason we can nevertheless understand the sentence is the local consistency of dependent pairs such as `sky' and `clear', or `girl' and `hat'. These locally consistent pairs are merged to constitute a consistent meaning within us. To model this perspective, we extract the dependency structure of a given sentence by a statistical dependency parser [31]. Given a dependency structure, we can calculate the semantic consistency of a sentence as the weighted geometric average of the semantic similarities between dependent pairs. For a sentence $s = w_1 w_2 \cdots w_n$ and the set of dependency relations $D(s) = \{d_{ij}\}$, where $d_{ij}$ expresses a dependency $w_i \to w_j$ $(i \in \{1..n\})$ and the dependency is assumed to be unique for each $w_i$, the semantic consistency $C(s)$ of $s$ is defined by
$$C(s) = \sqrt[n]{\prod_{d_{ij} \in D(s)} \mathrm{sim}(d_{ij})^{w_{ij}}} \qquad (12)$$
$$\iff \quad \log C(s) = \frac{1}{n}\sum_{d_{ij} \in D(s)} w_{ij} \cdot \log \mathrm{sim}(d_{ij}) \qquad (13)$$
where $\mathrm{sim}(d_{ij}) = \mathrm{sim}(\mu(w_i), \mu(w_j))$. Here, $w_{ij}$ is a weighting coefficient: we usually tolerate a discrepancy between an uninformative pair of words, while feeling a semantic severance between an informative pair. Therefore, $w_{ij}$ is defined here as
$$w_{ij} = u(w_i) \cdot u(w_j) \qquad (14)$$
where $u(w)$ is a normalized information measure $u(w) = \frac{1}{\log N}\log\frac{N}{n(w)}$, with $n(w)$ the occurrence frequency of $w$ and $N = \sum_w n(w)$. Because $w_{ij}$ approaches 0 when $w_i$ and $w_j$ are uninformative, such a pair is lightly weighted in (12), since $\lim_{w_{ij}\to 0} \mathrm{sim}(d_{ij})^{w_{ij}} = 1$.
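Equations (12)-(14) amount to a weighted geometric mean, computed most easily in log space. A minimal Python sketch, assuming the dependency pairs and their similarities have already been obtained (the similarities must be positive; as noted below, the experiment substitutes ε = 0.0001 for unseen words), with `u` the normalized information measure of (14):

```python
import math

def consistency(sim_pairs, u, n: int) -> float:
    """Semantic consistency C(s) of eqs. (12)-(14): the n-th root of the
    product of pairwise similarities sim(d_ij), each raised to the weight
    w_ij = u(w_i) * u(w_j).  `sim_pairs` lists (w_i, w_j, sim) for the
    dependency pairs D(s); n is the sentence length in words."""
    log_c = sum(u(wi) * u(wj) * math.log(s) for wi, wj, s in sim_pairs) / n
    return math.exp(log_c)
```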
6.1.1 Experiment

We calculated the consistency of each of 45 mutually unrelated sentences beginning with `彼 (he)' from the EDR Japanese corpus; these sentences were included in the data from which the association matrix was calculated. Additionally, we calculated some sentences which (i) seem rather difficult to understand and (ii) seem rather easy, the latter extracted from children's tales. These sentences are shown in Table 4.

Table 4. Analyzed sentences (EDR Japanese corpus).

1. 彼がぼくたちの話に乗ってくれるかどうかは疑問だ。
2. 彼が使っている現トラクター、元戦車はソ連で1960年代中期に造られたT54だ。
3. 彼が働いた鉱山には、同じような白人仲間が64人いた。
4. 彼とは、引っ越して以来、すっかりつき合いが絶えてしまっている。
5. 彼にとって戦争とは政治的な駆け引きではなく、拡大された暴力と格闘だった。
6. 彼に代表される黒人による政治的圧力は大きくなった。
7. 彼のやり残した仕事は、後世の人々によって必ず成し遂げられるだろう。
8. 彼のユニークな意見が採り上げられ、重役会議に諮られることになった。
9. 彼の会社はもちろん、その週のうちに大半の商社は北京駐在員を帰任させ始めた。
10. 彼の言うことは、いつもあいまいもことしていてまるで雲をつかむようだ。
11. 彼の作品をどう見るか、というより、自然の環境に目を向けさせ、現代へのアンチテーゼとしての激しい芸術、さらに芸術の社会性を追求した。
12. 彼の質問は、この問題の本質に触れるものでした。
13. 彼の真に迫った演技は、観客に深い感動を与えた。
14. 彼の足の肉がちぎれている。
15. 彼の発明は日本綿業の発展に貢献したばかりでなく,31年には,当時一流のイギリスのプラット社がこの特許権を譲り受け,世界的に注目された。
16. 彼の無口な性格は、時に人々の誤解を招いた。
17. 彼の話は、言い回しがうまいので、わかりやすいうえにおもしろい。
18. 彼は、この世に自分より劣った人間は一人もいないかのように思い込んでいる。
19. 彼は、つくりの大きな顔に笑みを浮かべて言った。
20. 彼は、クラスの中で際立った存在で、みんなから注目されていた。
21. 彼は、王立医学協会のメンバーで、以前は原子力調査員でもあった。
22. 彼は、近ごろにはめずらしく折り目正しい青年です。
23. 彼は、口数の少ない、どちらかというと思索的なタイプの人です。
24. 彼は、自他共に許す世界的な名指揮者です。
25. 彼は、手短にゲームのやり方を説明したあと、実際にやってみせてくれた。
26. 彼は、世の荒波にもまれながら、だんだんとたくましい人間に成長していった。
27. 彼は、大きな可能性を持つ待望の新人として、プロ野球界に迎えられた。
28. 彼は、日本の車のクラクション音は小さい、という。
29. 彼は、豊富な語いを駆使して、すばらしい作品を書き上げた。
30. 彼は、力士としての年齢的な限界を感じ、今場所限りで引退する決心だ。
31. 彼はおもむろに立ち上がり、静かに部屋を出ていった。
32. 彼はすぐには数え切れない死体の山を見た。
33. 彼はばく大な資金を投入して、自然保護団体の設立に努めました。
34. 彼はシートベルトをしていたのだが、頭をどこかに強く打ちつけ、病院にかつぎこまれた。
35. 彼は意を決して、彼女に自分の思いを打ち明けた。
36. 彼は奇行のない人で、エピソードらしいものも思い出せない。
37. 彼は血だらけになりながらも、われわれと一緒に逃げた。
38. 彼は困難の中でも学問を続け、立派に学者として大成した。
39. 彼は自分の高校時代のことを、そう言う。
40. 彼は初めから十点満点をマークし、他の選手の追随を許さなかった。
41. 彼は人づてに買った中古の日本製カメラの部品がないため、写すことができず、その店に来ていたのだ。
42. 彼は大学で、都市計画の分野を専攻している。
43. 彼は東チモール侵攻の総指揮をとった人物である。
44. 彼は憤然といった。
45. 彼は理不尽な理由をつけて、訳もなく反対をする。

reference (i) — from 中沢新一, 「知天使のぶどう酒」, 河出書房新社, 1980:
1. 現代のはじまりのときに、言語に挑戦をいどんできたのは、ほんとうは物そのものではなく、物たちの世界をつくるマシニックなものの領域の力だったのだろう、とぼくは考える。
2. ここでは複数で、偶然的で、個性的なさまざまな断片をひとつの統一体につなぐ、「文脈を統御するメカニズム」の働きから解き放たれたテキストの運動が、新しい自由を生きるようになる。

reference (ii):
1. 今ではいつのころだったか覚えてはいませんが、秋だったのでしょう。 (有島武郎, 「一房の葡萄」, 岩波書店, 1988)
2. ふと見ると、川の中に人がいて、なにかやっています。 (新美南吉, 「ごんぎつね」, 校訂新美南吉全集, 1992)

For these sentences, some words are not included in the training data; in such cases the similarity to the word is fixed to a small constant ε = 0.0001.

6.1.2 Evaluation

The results are shown in Table 5, in ascending order of C(s); sentence numbers in parentheses refer to Table 4.

Table 5. Consistency C(s).

0.22382 (5)    0.304015 (40)   0.329755 (29)   0.330606 (18)   0.331364 (9)
0.347335 (30)  0.350964 (13)   0.352166 (4)    0.36101 (15)    0.370594 (33)
0.373669 (24)  0.376915 (35)   0.379523 (42)   0.392221 (26)   0.39402 (27)
0.408127 (14)  0.418808 (34)   0.420147 (16)   0.42216 (25)    0.424438 (38)
0.433286 (8)   0.436043 (31)   0.438841 (2)    0.443812 (6)    0.445323 (7)
0.454066 (1)   0.470849 (32)   0.471595 (20)   0.482224 (22)   0.491864 (12)
0.499521 (43)  0.503757 (41)   0.505961 (11)   0.510784 (36)   0.525287 (17)
0.526912 (19)  0.532498 (23)   0.536592 (37)   0.557533 (10)   0.562632 (3)
0.627387 (28)  0.637572 (39)   0.645452 (44)   0.65568 (21)    0.656699 (45)

reference (i): 0.16165 (1), 0.162771 (2)
reference (ii): 0.763045 (1), 0.593312 (2)
The 45 sentences included in the EDR corpus are relatively consistent semantically, and the values are smoothly distributed with few prominent outliers. However, sentence 5, which contains the metaphorical phrase `拡大された暴力 (extended violence)', has a low consistency, while sentence 45, which has only similar words and their dependencies, has a relatively high consistency. This is especially prominent in the reference sentences: in reference (i), the consistencies are 0.1616 and 0.1627, far lower than those of the sentences sampled from the EDR corpus. This is partly because the two sentences contain words ([マシニック], [統御]) not found in the lexicon; but additionally, their pairs have acute discrepancies in meaning, which lower the overall consistency. On the other hand, (ii) has easier words and shows a high similarity in each pair. However, a problem remains, as shown by the comparatively similar sentences from the EDR corpus. The following reasons are considered for this problem:

Problems with the definition of dependency. For example, the dependency structure of the phrase `彼とはつき合いが絶える' from sentence 4 becomes as shown in Figure 1. The natural semantic dependency `彼 (he)' → `つき合い (association)' is not computed; instead, the syntactic dependency `彼' → `絶える (cease)' is computed, which leads to a lower consistency than we naturally expect. In Figure 2, which depicts a partial dependency structure of sentence 13, `演技 (act)' likewise relates not to `感動 (impression)' and `観客 (audience)', but to `与える (give)'. Although assuming associative relations between neighboring words might be considered a remedy for this problem, it cannot fully capture the semantic relationships included in a sentence, as we noted at the beginning of this subsection. Since language is a one-dimensional stream of words, we must consider contextual relationships between occurrences of signs as well as assume dependency structures over distant pairs.

Figure 1. Unused semantic dependency (彼 — 付き合い — 絶える; the semantic link 彼 → 付き合い is not produced by the parser).

彼の
┗真に
 ┗迫った
  ┗演技は、
  ┃観客に
  ┃┃深い
  ┃┃┗感動を
  ┗┻━┻与えた。

Figure 2. Dependency structure of sentence 13.

Problems with the selection of the semantic head. In the experiment, we calculated similarities not between every pair of words contained, but between the semantic heads of each bunsetsu, in order to capture semantic dependency apart from syntactic consistency. However, in sentence 21, for example, in `王立医学協会 (Royal Society of Medicine)' and `原子力調査員 (nuclear power investigator)', only the head morphemes (協会 and 員) are taken as semantic heads, excluding other informative morphemes such as 王立 and 医学. Since the separation of bunsetsu is defined by regular expressions over part of speech information [31], one remedy would be a method that performs no bunsetsu separation and calculates over a flat dependency structure, assigning low weights to functional words; but this cannot treat distant relationships adequately, as we argued above. The difference between syntactic dependency and semantic dependency is also a problem here.
6.2 Semantic information measure

In the previous section we argued for a quantitative indicator of the semantic consistency of a sentence. However, no matter how consistent a sentence is, if the meaning it expresses is trivial, it is less meaningful as communication, which is the central objective of language we argued for at the outset of this thesis. To transfer semantically rich information, we must make utterances with a large variance of mutually associated meanings; on the other hand, a sentence is not accepted as understandable unless it has moderate consistency, as we saw in the previous section. It is therefore thought that we produce sentences balancing the motivation to create a large diversity of meaning against the motivation to observe the local consistency of dependent pairs that keeps a sentence semantically consistent. In other words, the informativeness of a sentence is augmented not in the situation where a trivial meaning is strongly associated, but in the situation where diverse meanings are grasped as associatively connected under the condition of pairwise local consistency of meanings. More formally, the semantic informativeness of a sentence can therefore be measured as the distance from the probability distribution of associative activation of the sentence to the trivial meaning $\mu^*$ introduced in section 4.5 as the semantic tautology.

Let $s = w_1 w_2 w_3 \cdots w_n$ be a sentence. From the set of probability distributions $P(s) = \{\mu(w_i) \mid i \in \{1..n\}\}$ associated as its meanings, we can calculate a distribution $\mu(s)$ to serve as the total meaning of $s$. Though various definitions of $\mu(s)$ from $P(s)$ are imaginable, taking word order and context effects into consideration, we define it simply here as the average
$$\mu(s) = \frac{1}{n}\sum_{i=1}^{n} \mu(w_i) \qquad (15)$$
and measure the Kullback-Leibler divergence to $\mu^*$,
$$I(s) = D(\mu(s) \,\|\, \mu^*) \qquad (16)$$
$$= \sum_{\omega} \mu(s)(\omega) \log \frac{\mu(s)(\omega)}{\mu^*(\omega)} \qquad (17)$$
to give the semantic informativeness $I(s)$ of $s$. In general, the divergence $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ has the problem that $D(P \| Q) \to \infty$ when $Q(x) = 0$ for some $x$; this is avoided in our case, because each element of the trivial distribution $\mu^*$ has a theoretically nonzero probability as long as the corresponding word has a nonzero occurrence probability, which must be the case in every situation.
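Equations (15)-(17) reduce to an average followed by a KL divergence. A minimal NumPy sketch, assuming the meaning distributions $\mu(w_i)$ of the sentence's words are stacked as rows of an array and that $\mu^*$ is strictly positive, as argued above:

```python
import numpy as np

def informativeness(mus: np.ndarray, mu_star: np.ndarray) -> float:
    """Semantic informativeness I(s) of eqs. (15)-(17): the KL divergence
    from the averaged meaning mu(s) of the sentence's words to the
    stationary distribution mu* (the semantic tautology).
    `mus` is an (n_words x vocab) array of meaning distributions."""
    mu_s = mus.mean(axis=0)          # eq. (15): average meaning of s
    mask = mu_s > 0                  # 0 * log(0 / q) contributes nothing
    return float(np.sum(mu_s[mask] * np.log(mu_s[mask] / mu_star[mask])))
```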
6.2.1 Experiment

For each of the sentences used in section 6.1.1, we calculated the KL divergence to $\mu^*$.

6.2.2 Evaluation

Table 6 shows the results, in descending order of I(s); sentence numbers in parentheses refer to Table 4.

Table 6. Informativeness I(s).

1.1821 (14)    1.12114 (31)   1.12004 (16)   1.10724 (22)   1.10162 (44)
1.07741 (36)   1.06932 (34)   1.06765 (19)   0.998295 (29)  0.967434 (4)
0.929454 (26)  0.884969 (8)   0.851634 (17)  0.831292 (30)  0.802944 (10)
0.769649 (43)  0.765636 (18)  0.759355 (25)  0.751526 (35)  0.751315 (32)
0.748489 (23)  0.739663 (38)  0.738743 (7)   0.729799 (24)  0.713438 (5)
0.694481 (13)  0.677928 (42)  0.655219 (1)   0.654186 (9)   0.612354 (40)
0.60596 (45)   0.600273 (15)  0.595084 (27)  0.573879 (37)  0.572526 (33)
0.570249 (2)   0.544073 (20)  0.54234 (12)   0.493979 (3)   0.483002 (41)
0.479121 (39)  0.45185 (21)   0.43127 (6)    0.419433 (11)  0.417165 (28)

reference (i): 0.477072 (1), 0.413139 (2)
reference (ii): 0.551357 (1), 0.674302 (2)

As we noted in section 6.1, semantic informativeness is expected to vary inversely with semantic consistency; Spearman's rank correlation coefficient between this result and the ordering of the experiment in section 6.1.1 is $r_s = -0.22029$, which shows a weak negative correlation as expected. However, when we look closely at each sentence, this result does not always reflect intuitive informativeness. For example, though sentence 5 is the least consistent sentence, containing a metaphoric dependency relation, its divergence is moderate among the other sentences. Moreover, in spite of the quite low consistency of the reference sentences (i), their divergence to $\mu^*$ is medium, and that of reference (ii) is medium as well. Since we compare the average meaning of the word set contained in a sentence to $\mu^*$, the divergence tends to be small for a sentence containing frequent words, which often become subjects. In fact, we seem to judge informativeness by the variance of meanings rather than their average. However, the self-information of $\mu(s)$ as a measure of variance is almost equal across all the experimental sentences and is therefore not valid. This is partly because of the definition of $\mu(s)$: the definition by linear combination (averaging) cannot sufficiently express the overlap of the individual probability distributions, and thus fails to describe the semantic variance of the meanings of the words in a sentence.

7. Conclusion

This thesis showed that a stochastic treatment of meaning based on cooccurrence statistics can model meaning which differs from person to person and changes dynamically in the environment. Contrary to the traditional deterministic treatment of meaning, which entails arbitrariness, statistical acquisition of meaning is defined completely objectively. Also, by performing association as the state transition of a Markov process, the spreading activation formerly advocated in psycholinguistics is formulated mathematically, enabling indirect semantic relationships to be handled adequately. However, the stochastic formulation of meaning still leaves much to be desired. Although the definition of associativity by the log likelihood ratio works fairly well in practice, the exclusion of functional words cannot be avoided, because functional words are not always assigned sufficiently low associativities. Moreover, in the associational process of Markov transitions, the association spreads out over all dimensions, as in the naive use of a stochastic matrix; this does not fit our intuitions completely.
These problems come from the lack of adequate criteria on the association over the Markov process: we must make explicit what criteria an adequate association satisfies and what quantity is to be maximized or minimized in the association. When meaning is defined as a probability distribution over the lexicon, the set of meanings forms a Hilbert space H by completion with a norm between probability vectors [32]. By assuming a linguistic symbol from an information source to be a realization of a probabilistic variable over the whole lexicon, and with geometrical methods in H [2], our intuition of the semantic space can be modeled mathematically more accurately.

8. Acknowledgements

First, I am deeply grateful to Professor Yuji Matsumoto. Besides his supervision and valuable comments on the manuscript, I would not have been able to develop my original idea effectively without being accepted into the Computational Linguistics Laboratory at NAIST. Second, I thank my second supervisor, Professor Hiroyuki Seki, for showing an interest in this study, which relieved and encouraged me greatly. I am also especially indebted to Associate Professor Shin Ishii, who showed an interest at a quite premature stage of this study and gave me a number of insightful comments, especially the suggestion of the probabilistic formulation. Professor Yutaka Takahashi and Associate Professor Satoshi Nakamura received me warmly on my visit and gave me valuable comments. Additionally, I was inspired by the members of the laboratory for neural modeling at the RIKEN Brain Science Institute. Dr. Cateau gave me a suggestion about the similarity to the function of a neuron. Lastly, I would like to thank the staff of the Matsumoto laboratory and all the laboratory members. In spite of my tendency toward a self-righteous attitude to studies, they received me warmly and offered criticism. I was also supported technically in various respects, which contributed much to my improvement in programming. And I note here that all of this is owed to the national foundation and scholarship; without it, I would never have been able to contemplate the nature of language so thoroughly.

References

[1] H. Aït-Kaci and R. Nasr. LOGIN: A logic programming language with built-in inheritance. Journal of Logic Programming, Vol. 3, No. 3, pp. 187-215, 1986.
[2] S. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, New York, 1985.
[3] R. J. Brachman and J. G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science, Vol. 9, No. 2, pp. 171-216, 1985.
[4] Pierre Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Vol. 31 of Texts in Applied Mathematics. Springer, 1999.
[5] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proc. of COLING 27, pp. 76-83, 1989.
[6] A. M. Collins and E. F. Loftus. A spreading-activation theory of semantic processing. Psychological Review, No. 82, pp. 407-428, 1975.
[7] I. Dagan, S. Marcus, and S. Markovitch. Contextual word similarity and estimation from sparse data. Computer Speech and Language, Vol. 9, pp. 123-152, 1995.
[8] Ido Dagan, L. Lee, and F. Pereira. Similarity-based methods for word sense disambiguation. In Proc. of ACL-EACL '97, pp. 56-63, 1997.
[9] Ferdinand de Saussure. 一般言語学講義 (Course in General Linguistics, Japanese translation). 岩波書店, 1972.
[10] S. Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis.
Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[11] Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61-74, 1993.
[12] EDR 電子化辞書研究所 (Japan Electronic Dictionary Research Institute). EDR 電子化辞書仕様説明書 (EDR Electronic Dictionary specifications), 1995.
[13] Jeffrey L. Elman. Finding structure in time. Cognitive Science, No. 14, pp. 179-211, 1990.
[14] Donald Hindle. Noun classification from predicate-argument structures. In Proc. of the 28th ACL, pp. 268-275, 1990.
[15] Will Lowe. Semantic representation and priming in a self-organizing lexicon. In Proc. of the 4th Neural Computation and Psychology Workshop, pp. 227-239. Springer-Verlag, 1997.
[16] Mitchell P. Marcus. A Theory of Syntactic Recognition for Natural Language. MIT Press, 1980.
[17] R. Montague. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, 1974.
[18] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In Proc. of the 31st ACL, pp. 183-190, 1993.
[19] James Pustejovsky. The Generative Lexicon. The MIT Press, 1995.
[20] M. R. Quillian. Semantic Information Processing, pp. 216-270. MIT Press, Cambridge, MA, 1968.
[21] Edward Sapir. 言語 (Language, Japanese translation). 岩波書店, 1998.
[22] Hinrich Schütze. Dimensions of meaning. In Proceedings of Supercomputing '92, pp. 787-796, 1992.
[23] J. F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984.
[24] M. Taft. Reading and the Mental Lexicon. Lawrence Erlbaum Associates, 1991.
[25] David L. Waltz and Jordan B. Pollack. Massively parallel parsing: A strongly interactive model of natural language interpretation. Cognitive Science, No. 9, pp. 51-74, 1985.
[26] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL '95, pp. 189-196, 1995.
[27] 高橋直人. 実数ベクトルによる語句の表現の試み (An attempt at representing words and phrases by real-valued vectors). 情報処理学会研究報告 自然言語処理研究会, pp. 95-102, Nov 1995.
[28] 高橋直人. ニューラルネットを用いた意味表現形式の自動獲得 (Automatic acquisition of semantic representations using neural networks). 電子情報通信学会技術研究報告 NLC 98-28, Vol. 98, No. 338, pp. 17-24, 1998.
[29] 持橋大地, 松本裕治. 連想としての意味 (Meaning as association). 情報処理学会研究報告 99-NL-134, pp. 155-162, 1999.
[30] 小嶋秀樹, 古郡延治. 単語の意味的な類似度の計算 (Computing semantic similarity between words). 電子情報通信学会技術研究報告 AI92-100, pp. 81-88, 1993.
[31] 藤尾正和, 松本裕治. 語の共起確率に基づく係り受け解析とその評価 (Dependency analysis based on word cooccurrence probabilities and its evaluation). 情報処理学会論文誌, Vol. 40, No. 12, pp. 4201-4212, Dec 1999.
[32] 片山徹. 新版 応用カルマンフィルタ (Applied Kalman Filter, new edition). 朝倉書店, Jan 2000.
[33] 東京大学教養学部統計学教室編. 自然科学の統計学 (Statistics for the Natural Sciences), 基礎統計学 III. 東京大学出版会, 1992.
[34] 廣恵太, 伊藤毅志, 古郡延治. コーパスから抽出した単語間類似度に基づく意味ネットワーク (A semantic network based on inter-word similarities extracted from corpora). 情報処理学会第 51 回全国大会論文集, Vol. 3, pp. 13-14, 1995.