NAIST-IS-MT9851113
Master's Thesis
Semantic model and analysis with a stochastic
association
Daichi Mochihashi
February 14, 2000
Department of Information Processing
Graduate School of Information Science
Nara Institute of Science and Technology
Master's Thesis
submitted to Graduate School of Information Science,
Nara Institute of Science and Technology
in partial fulfillment of the requirements for the degree of
MASTER of ENGINEERING
Daichi Mochihashi
Thesis Committee: Yuji Matsumoto, Professor
Hiroyuki Seki, Professor
Semantic model and analysis with a stochastic
association*
Daichi Mochihashi
Abstract
This thesis describes the meaning of a word in terms of stochastic association. The
meaning of a word is calculated by Markov transitions regarded as associations, based
on association probabilities statistically deduced from corpora, and the meaning is defined as a state probability distribution over the lexicon. By treating association as a state transition, indirect semantic cooccurrences can be represented
adequately.
Based on this quantitative representation of meaning, we also calculate the
semantic consistency and the semantic informativeness of a sentence.
Keywords:
semantics, spreading activation, Markov Process, stochastic process
* Master's Thesis, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-MT9851113, February 14, 2000.
Semantic model and analysis based on stochastic association*
Daichi Mochihashi
Abstract (Japanese original)
This thesis represents the meaning of a word as a probability distribution expressing the associative relations between words, and describes its formalization and the acquisition of the association probabilities. The meaning of a word is computed by association on a Markov process whose association probabilities are derived from cooccurrence relations, and meaning is defined as the resulting state probability distribution. By performing association as a state transition, semantic relations to words that do not cooccur directly can be represented.
Based on meaning represented quantitatively as a probability distribution, we attempt to compute quantitative measures of the semantic consistency and the semantic informativeness of a sentence, and summarize the problems that arose.
Keywords:
semantics, spreading activation, Markov process, stochastic process
* Master's Thesis, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-MT9851113, February 14, 2000.
Contents
1. Introduction                                               1
2. Meaning and semantic analysis of Language                  2
   2.1 Lexical Meaning                                        2
   2.2 Structural meaning                                     3
3. Previous work on Lexical Semantics                         4
   3.1 Logic based approach                                   4
   3.2 Psycholinguistic approach                              4
   3.3 Neural network approach                                5
   3.4 Related work                                           5
       3.4.1 Semantic similarity                              5
       3.4.2 Semantic disambiguation                          6
       3.4.3 Approaches in Information Retrieval              7
   3.5 Summary                                                8
4. Meanings as association                                    9
   4.1 Lexical Semantic Network                               9
   4.2 Previous works on LSN                                 10
   4.3 LSN as stochastic network and its formulation in Markov process   11
   4.4 Mathematical formulation of LSN                       11
       4.4.1 Probability spaces                              12
       4.4.2 Composition of Markov process                   12
   4.5 Mathematical properties of stochastic LSN             13
   4.6 Acquisition of association probability                13
       4.6.1 Mutual information                              14
       4.6.2 Log likelihood ratio                            14
5. Comparison with former formalisms                         20
   5.1 Advantageous properties                               20
   5.2 Limitations of the formalism                          21
6. Semantic sentence analysis                                22
   6.1 Semantic consistency of a sentence                    22
       6.1.1 Experiment                                      23
       6.1.2 Evaluation                                      25
   6.2 Semantic information measure                          29
       6.2.1 Experiment                                      30
       6.2.2 Evaluation                                      30
7. Conclusion                                                33
8. Acknowledgements                                          34
References                                                   35
1.
Introduction
Language is a medium for communicating meaning. However, this aspect of
language has been left rather unexamined, as studies have concentrated on syntax
since the advent of Chomsky's generative grammar. At the same time, the study
of grammar itself appears to have revealed that semantics is indispensable when
we analyze natural languages.
Furthermore, from the practical point of view, with the rise of the hypertext
space exemplified by the Web, a proper treatment of meaning is strongly required
in Information Retrieval (IR). This thesis is devoted to capturing such realistic
meanings of words automatically, albeit with inevitable limitations, and to modeling
a system in which meanings are produced.
Overview
This thesis focuses on lexical meaning and defines it as a probability distribution
over a stochastic unlabeled network consisting of words as its simple nodes. A
node in the network exists only through its relationships to its neighbors and has
no internal structure of its own.
First, we describe the two main subdivisions of semantics, namely lexical meaning
and structural meaning, and explain why we focus on lexical meaning.
Second, we survey the previous work on lexical meaning. It divides into the
logical approach, psycholinguistic theory, and related technical pursuits in natural
language processing. We look through them and locate the place this thesis occupies.
2.
Meaning and semantic analysis of Language
Sentences in natural language consist of words and their arrangements. While
there are subconstituents of words such as morphemes and phonemes, and they
contribute substantially to the semantics of words [24], we define a word here as
an atomic constituent of language, both for simplicity and because these
subconstituents could be treated within the framework this thesis proposes.
When a word is regarded as a semantic atom of language, then, corresponding to
the atoms and their arrangements, the semantics of natural language can be classified
into two subdivisions: lexical meaning and structural meaning. In general, lexical
meaning is what Saussure referred to as `paradigmatic' meaning, and structural
meaning corresponds to `syntagmatic' meaning [9].
Below, we investigate these two kinds of meaning more closely and describe why
we focus on lexical meaning.
2.1
Lexical Meaning
Lexical meaning is what a certain word stands for. As we have seen above, a
word can be regarded as an atom of semantics. The meaning of a sentence is based
on the meanings of the words it contains; we cannot dispense with lexical meaning
when we think about the meaning of a sentence.1
Lexical meaning has nothing to do with arrangements, i.e., grammatical
constructions, but some structural meanings can be expressed in a static, lexical
way. For example, `the situation where some liquid is heated into a gaseous form'
can be described simply as `evaporation,' and in Latin the sentence `He will sing'
can be expressed by the single word `cantabit.' Sapir gives examples from some
native American languages in which far more complicated situations are compacted
into a single word [21].
However, lexical meaning obviously has its limits where such a `frozen' expression
does not yet exist. In Japanese, one can describe a certain quiet atmosphere of an
old city landscape as `たたずまい', but in English one cannot express such an
atmosphere in a single word and cannot resort to lexical meaning; instead one must
use syntactic constructions, as in this very sentence. On the whole, complicated
meanings are difficult to describe by lexical meaning, and it is obvious that some
meanings can only be expressed with the help of syntactic description.
Although lexical meaning thus has its own limitations, it is the key to semantics,
and no account of semantics can do without it. Also, in view of the present methods
of Information Retrieval, which accept words as queries, exploring lexical meaning
is the first inquiry we have to make.
1 Even in the case of an idiom, the meaning is more or less based on the meanings of its
constituent words; we cannot replace a word in an idiom with arbitrary other words.
2.2
Structural meaning
Structural meaning is the meaning that cannot be deduced from the mere set
of constituent words; it lies in the abstract construction formed in the person who
hears the expression. For example, the meaning of the sentence `Every boy has a
lover.' can be described via the logical form $\forall x \exists y.\ \mathrm{boy}(x) \rightarrow \mathrm{lover}(x, y)$. Montague
semantics [17] traditionally advocates this kind of treatment, and it has been
studied as the mainstream of semantics. Such treatments, for example of quantification
and identification, are useful and cannot be replaced by lexical meanings.
However, this says nothing about what a logical predicate, say boy(), means; that is
a critical limitation when we try to define meaning accurately. Only when
these predicates are properly defined does it become possible to talk about
abstract constructions on the basis of these definitions of the meanings of predicates.
In other words, the syntagm exists on the basis of the paradigm.
Therefore, we refrain here from treating structural meaning until lexical meaning
is defined, and focus on lexical meaning as the atom of semantics.
3. Previous work on Lexical Semantics
3.1 Logic based approach
Most previous studies of lexical meaning take a logic based approach. For
example, the ψ-term [1] defines the meaning of a word by labeled structures and
their corresponding values. There is a hierarchy over the labels, and using this
hierarchy, unifications are defined between ψ-terms to construct a semantic
hierarchy of words over the lexicon. Similar descriptions are adopted in the
Generative Lexicon [19], where the description is divided into four structures:
argument structure, event structure, qualia structure, and inheritance structure.
While argument structure treats the structural meaning discussed above, event
structure and qualia structure deal with semantic interpretations. Using event
structure and the substructures of qualia structure, the Generative Lexicon tries to
capture selectional preferences and metaphors.
In general, these approaches use typed hierarchical structures and attempt to
capture the semantic interpretation of a word by logical operations, especially
unification. However, they have problems with the acquisition of the labels or
structures they use, and with the determinacy of logical deduction, which does not
always fit our intuition. The former is especially problematic when treating new
words or the huge set of words commonly found in a realistic environment.
3.2
Psycholinguistic approach
The psycholinguistic or cognitive approach was first introduced by Quillian [20].
His idea was to describe the meaning of a word by a labeled semantic network:
following the links, one obtains the meaning of a word from the properties of the
nodes and the types of the links traversed.
This theory was extended by the spreading activation theory of Collins &
Loftus [6] to match the results of psychological experiments, but it gives no
mathematical formulation of the process, which Quillian's original theory aimed at.
Another psycholinguistic approach is exemplified by knowledge representation
networks. While spreading activation theory can itself be regarded as a knowledge
representation network, KL-ONE [3], Conceptual Graphs [23] and other
representations are characterized by the logical operations defined over their nodes.
Using logical operations, they share the advantages and disadvantages of the logic
based approaches. Due to the difficulty of defining such a knowledge network, no
semantic processing system of this sort is in practical use.
3.3
Neural network approach
Since the meaning of language is defined and processed in a human brain, it seems
natural to represent it by a neural network. Elman [13] uses a simple recurrent
neural network and lets it learn to predict the next word from a corpus given as
its input. By clustering the hidden unit activations, he could obtain a semantically
adequate hierarchy of words automatically.
Takahashi [27][28] also uses a layered neural network to build a semantic
representation of words, but its dimension is quite low (4) and the vocabulary is
very small (11). Assuming a limited set of semantic cases beforehand, it does not
go beyond an artificial set of meanings.
In general, the neural network is a promising and natural approach in its
relationship to the cognitive system, but its internal representations are not
intuitive to interpret and are difficult to apply to semantic information processing.
When it becomes clear how to manipulate the internal representations of these
networks easily, it will be a promising method that matches our cognitive process.
3.4
Related work
As to the computational treatment of lexical relationships, several studies have
been carried out in natural language processing.
3.4.1 Semantic similarity
Semantic similarity attracts attention in relation to the clustering of words, which
enables us to construct computationally the semantic categories we recognize naturally.
From the information-theoretical point of view, Church et al. proposed a
measure of self mutual information named the association ratio,
$$I(x, y) = \log \frac{P(x, y)}{P(x)P(y)},$$
which expresses the relevancy of the occurrence of two words [5]. He proposes this
to find lexico-syntactic regularities and a semantic classification of subcategorization
in lexicography, while admitting its limitations concerning the use of mutual
information when P(x, y) is too small; pairs whose joint occurrence frequency
f(x, y) falls below 5 are therefore omitted from the investigation.
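To make the association ratio concrete, the following minimal sketch estimates it from raw windowed cooccurrence counts. The toy corpus, window size, and the min_count cutoff are illustrative assumptions standing in for Church's actual setup, not a reproduction of it.

import math
from collections import Counter

def association_ratio(corpus, window=5, min_count=5):
    """I(x, y) = log P(x, y) / (P(x) P(y)), estimated from windowed cooccurrences."""
    unigram, pair = Counter(), Counter()
    for sentence in corpus:
        unigram.update(sentence)
        for i, x in enumerate(sentence):
            for y in sentence[i + 1:i + window]:
                pair[(x, y)] += 1
    n_uni, n_pair = sum(unigram.values()), sum(pair.values())
    scores = {}
    for (x, y), c in pair.items():
        if c < min_count:              # low joint-frequency pairs are unreliable and omitted
            continue
        p_xy = c / n_pair
        p_x, p_y = unigram[x] / n_uni, unigram[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

toy = [["clear", "sky", "of", "autumn"], ["clear", "sky", "over", "the", "sea"]]
print(association_ratio(toy, window=3, min_count=1))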
Hindle also uses self mutual information as a measure of semantic relevancy; from
the predicate-argument structures of transitive verbs he extracted word similarities
of the nouns they subcategorize [14]. Extraction of the predicate-argument structure
of a verb is performed by Fidditch, a simple deterministic parser originating in
Marcus's work [16].
In contrast to these, Pereira et al. use a Kullback-Leibler information distance to
measure the semantic distance between the cooccurrence probability distributions
of two words [18]. They also focus on predicate-argument structure using the
Fidditch parser Hindle used, and perform a noun classification in which each word
has a membership probability to clusters.
3.4.2 Semantic disambiguation
Polysemy of words is a semantic problem natural language processing must deal
with.
Dagan performed the pseudo-word disambiguation experiment proposed by
Schutze [22] through similarity estimation between transitive verbs and their
objects [8]. He defines the similarity-based probability of w2 given w1 as a weighted
sum of the conditional cooccurrence probabilities between w1' and w2, where w1'
ranges over words relevant to w1 (possibly the whole lexicon), each weighted by
its similarity to the original w1. He compares similarity definitions between
probability distributions using the KL divergence, the information radius, the L1
norm, and the confusion probability, and concludes that the definition using the
information radius performed best. Additionally, he notes that cooccurrence
comparison suffers a noticeable performance degradation when very low frequency
events are omitted; this is typically an issue in the use of mutual information,
where such low frequency events must be excluded to avoid overestimating the
information.
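Schematically, the similarity-weighted sum just described can be written as follows; the notation and the explicit normalization are added here for readability and are not necessarily Dagan's exact formulation:
$$P_{\mathrm{SIM}}(w_2 \mid w_1) \;=\; \sum_{w_1'} \frac{\mathrm{sim}(w_1, w_1')}{\sum_{w_1''} \mathrm{sim}(w_1, w_1'')}\; P(w_2 \mid w_1'),$$
where the sum ranges over the words w1' relevant to w1.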
Yarowsky takes a different approach, disambiguating a polysemous word by
building a decision list whose entries carry the likelihood with which they determine
the sense [26]. He first gives the characteristic collocations of a word by hand; by
finding the most likely collocating words of these initial collocations and continuing
this process, the lexicon is classified into a set of clusters. A decision list for
disambiguation is then built from these clusters through the likelihood of membership
in each cluster.
However, this first requires characteristic disambiguation keys given by hand,
which is not always feasible for an unfamiliar word whose polysemy is unknown.
And although the approach is effective for disambiguation, keeping a huge decision
list for every word in the lexicon is not realistic. Bearing in mind that such decision
lists also have difficulty reflecting more indirect context effects, the decision list
approach is limited in its applicability to a realistic environment.
3.4.3 Approaches in Information Retrieval
In the field of Information Retrieval, which treats a set of words as a query, word
weighting and the measurement of word relevancy have attracted attention. Schutze
constructs a semantic vector whose elements are cooccurrence frequencies with
other words [22]. It is defined in a very high dimensional space in which each word
constitutes a dimension, and the semantic similarity of two words is measured by
the resemblance of the directions these words have.
In general, since the term-document matrix is of common interest in Information
Retrieval, the extension from a term-document matrix to a term-term matrix is
considered quite natural.
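As a minimal sketch of this vector-space view, the following code builds sparse cooccurrence vectors and compares two words by the cosine of their directions; the toy corpus and window size are assumptions for illustration only, not Schutze's setup.

import math
from collections import defaultdict

def cooccurrence_vectors(corpus, window=5):
    """Each word gets a sparse vector of cooccurrence counts with its neighbors."""
    vec = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, w in enumerate(sentence):
            for c in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]:
                vec[w][c] += 1
    return vec

def cosine(u, v):
    """Resemblance of directionality of two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors([["blue", "sky", "over", "the", "city"],
                             ["clear", "sky", "over", "the", "sea"]])
print(cosine(vecs["blue"], vecs["clear"]))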
Because of the huge dimensionality of this kind of treatment, [22] and LSI [10]
use an SVD algorithm or perform a Principal Component Analysis to reduce the
dimensionality of the semantic representation so that it can be computed with
moderate resources. However, this leaves a critical problem with low-frequency
words: while the importance of singletons (events that occur only once) has been
pointed out by Dagan [8], these low frequency words have very limited and specific
cooccurrences and cannot easily be mapped onto principal dimensions. If they are,
the mapping may be so ambiguous that the sharp informativeness of the original
words is lost.
Information Retrieval is a promising field whose techniques are exercised in
practice. However, in view of these characteristic events, the dimension reduction
it commonly assumes leaves much to be desired semantically. A method is required
that uses moderate computational resources while preserving the fertile diversity
of the realistic meanings of words.
3.5
Summary
Much work has been done on the semantic treatment of the lexicon. Traditionally
there are logic based approaches, which concern structural meaning, and in the
field of psychology many have proposed cognitively adequate systems of semantic
processing, though many of them are not formulated mathematically.
In the pursuits of natural language processing, semantic similarity and semantic
disambiguation have mainly attracted attention. These propose information-theoretical
measures between words and saw considerable success in their respective fields.
However, they are not sufficiently general in representing the meaning of words, so
one approach cannot easily be applied to other objectives, and they share an
inadequacy for application in a changing, realistic environment.
The strategies taken in Information Retrieval are promising in their practicality
and automation, but reconciling moderate computational resource use with
preservation of the semantic diversity of a realistic environment is a difficult problem
yet to be tackled.
4. Meanings as association
4.1 Lexical Semantic Network
As reviewed above, previous methods of defining lexical meaning have both
advantages and drawbacks; especially when applied to a realistic environment, some
of them prove to be of limited use.
Back to basics: what information do we use when we learn the meaning of a
specific word? Consider the most fundamental situation, in which a child learns
the meaning of a new word @. What information will he use? If someone just says
`@!' in the dark with no attendant circumstances, he cannot grasp its meaning at
all. However, if someone says, with an indication, `This is @', or `@ is β', he will
surely come to know what the word @ means by associating it with the thing
indicated, or alternatively with the word β. The same holds when he hears it in a
context `γ1 @ γ2', where he can associate γ1 and γ2 as words relevant to @. Of
course β and γ1, γ2 are themselves words and have their own associative neighbors;
then, if we assume that any image in the cognitive map has an equivalent set of
words representing it, we can define the meaning of a word by the set of words
associated directly or indirectly from that word, because under this assumption
any association, whether to words or to images, can be reduced to a certain set of
words in the long run.
The meaning of a word is thus expressed by the set of words associated from it;
but this need not be a simple set and may carry some features. As the feature, we
adopt a weight on each associated word corresponding to the strength of association,
because there are no explicit types of association in a real corpus, and therefore
assuming typed links is not valid when we want to extract semantic relationships
from corpora.
When words are interconnected with one another by association, they can be
regarded as a network: we call it a Lexical Semantic Network (LSN) after [15]. By
traversing the nodes of the network, the meaning of a word is defined as the weighted
set of words associated with it.
4.2
Previous works on LSN
As we have seen in section 3.2, the definition of meaning by an LSN was originally
advocated by Quillian [20], Collins & Loftus [6], Sowa [23], and others. But their
works all assume labels or types on the links, defined a priori. As we argued in
section 4.1, such arbitrary labels are difficult to define and maintain in a realistic
environment.
As for the study of unlabeled LSNs with weighted links, there are several works
such as Waltz & Pollack [25], Kozima [30] and Hiro [34].
Waltz & Pollack [25] use real-valued quantitative links to perform spreading
activation, but the weights of excitatory and inhibitory links are fixed to constants
beforehand, and no strategy is proposed for acquiring these weights from real data.
The study showed that by utilizing quantitative links, semantic polysemy can be
treated appropriately so as to select the semantically correct parse tree within the
framework of a CFG.
Kozima [30] calculates association probabilities from the definition sentences of
the words in LDOCE2 and computes an activation vector using a heuristic spreading
activation. Entries in LDOCE are defined using the limited vocabulary LDV3, so
this can be regarded as defining the meaning of a word with an effect similar to
principal component dimensions. Although this strategy is very effective, it is based
on an existing dictionary and therefore is not sufficient for an enormous lexicon
that is newly created or continuously updated. When we also note that the meaning
of a word differs partly from person to person according to interests and experience,
acquisition of semantic relevancies from real corpora is strongly desirable.
Hiro [34] follows [30] and constructs a semantic network from a corpus. It uses
the semantic similarity between words as the value of the link between them,
calculated by a modification of Dagan's [7] csim() based on mutual information.
Because it is essentially based on mutual information, there are problems with
words of low frequency. He avoids this by dealing only with high frequency links
(with a frequency threshold of 40), but this compromise makes the result semantically
less useful, because semantically important information is generally conveyed by
exactly those low frequency words that are omitted.
2 Longman Dictionary of Contemporary English
3 Longman Defining Vocabulary
4.3
LSN as stochastic network and its formulation in Markov
process
The spreading activation theory descended from [20] describes a process of
successive activation starting from an initially activated concept. [25] and [30]
perform this kind of association using heuristic formulae.
If we assign each word in the LSN a state transition probability according to the
associativity it has to each word, the aggregate of these relationships can be regarded
as a Markov process.
Even if a word has no direct association to the central word, it may acquire a
high state probability through state transitions if it is commonly relevant to the
other associates of the central word.
In view of this, if we regard a state transition of the Markov process as an
association and the state probability as an activation, we can define the meaning of
a word w by the state probability distribution {P(ω)} over the words ω in the
lexicon, starting from the state in which P(w) = 1. As the transition proceeds, the
associations spread further; if we `contemplate deeply', we can broaden the
extensional subjects by increasing the number of transition steps t.
Note that the initial distribution need not be P(w) = 1 for the central word and
P(ω) = 0 for all other words; it can be any distribution. For example, if we want to
know the meaning of `blue sky', the initial distribution can be set to P(blue) = 0.4
and P(sky) = 0.6 to obtain a mixture of these meanings as the associative result.
So far we have referred to the meaning of a word rather informally. In the next
subsection, we formalize it mathematically.
4.4
Mathematical formulation of LSN
As we saw in 4.1, semantic associations are acquired through linguistic
cooccurrences or direct cooccurrences with cognitive images. Therefore, we construct
the semantic association space from a relation on the cooccurrence space, and
formalize the two as probability spaces.
4.4.1 Probability spaces
def. (collocation space) For natural numbers $i, j \in \{1..n\}$, we define a unitary event $\omega = c_{ij}$ as a cooccurrence of the $i$th word and the $j$th word. For the sigma field $\mathcal{F}_C = \wp(\Omega_C)$, where $\Omega_C = \{c_{ij} \mid i, j \in \{1..n\}\}$, and a function $N: \omega \to \mathbb{R}$ which maps an event to the frequency of its occurrence,
$$P_C(\omega) = \frac{N(c_{ij})}{\sum_{i,j} N(c_{ij})}, \qquad P_C(Z) = \sum_{\omega \in Z} P_C(\omega)$$
defines a probability space $\{\Omega_C, \mathcal{F}_C, P_C\}$ called the collocation space.
def. (semantic space) A semantic space is a probability space $\{\Omega_S, \mathcal{F}_S, P_S\}$ such that, for a relation $\omega = s_{ij}$ between the $i$th word and the $j$th word of the collocation space and its probability $P_S(\omega)$, $\Omega_S = \{s_{ij}\}$ and $\mathcal{F}_S = \wp(\Omega_S)$ are defined by
$$P_S(\omega) = a_{ij}, \qquad P_S(Z) = \sum_{\omega \in Z} P_S(\omega),$$
where $a_{ij}$ is properly defined afterwards (section 4.6).
4.4.2 Composition of Markov process
def. (association probability) A set $w_i = \{s_{ij} \mid j \in \{1..n\}\}$ in a semantic space is called a word in the semantic space, and the set $L = \{w_i\}\ (i \in \{1..n\})$ is called a Lexicon.
For a stochastic process $c_t(\omega)\ (\omega \in L)$ which takes its state in $L$, the association probability $a(w_j \mid w_i)$ is defined by
$$a(w_j \mid w_i) = \frac{P_S(w_i \cap w_j)}{P_S(w_i)} = \frac{a_{ij}}{\sum_j a_{ij}}.$$
Using $a(y \mid x)$, $c_t(\omega)$ defines a simple Markov process with time step $t$. Then,
def. (meanings) A meaning is a state probability distribution
$$\pi = \{P(c_t(\omega) = w)\}\ (w \in L);$$
in particular, the meaning of $w$, $\pi_t(w)$, is the state probability distribution after $t$ transitions,
$$\pi_t(w) = \{P(c_t(\omega) = w_i) \mid P(c_0(\omega) = w) = 1\}\ (i \in \{1..n\}).$$
4.5
Mathematical properties of stochastic LSN
When we define the LSN stochastically as above, meanings can be calculated by
assigning each association probability $a(w_j \mid w_i)$ to an element $A_{ij}$ of the Markov
transition matrix $A$ of the LSN.
The meaning of a word can be thought of as decomposing into
- direct associations
- 2-level indirect associations
- ...
- t-level indirect associations
which is what was originally addressed as `find an intersection' in [6]; therefore we
can calculate the meaning $\pi_t(w_k)$ of a word $w_k$ by
$$\pi_t(w_k) = \frac{1}{t}\left(A\tilde{w}_k + A^2\tilde{w}_k + \cdots + A^t\tilde{w}_k\right) = \frac{1}{t}\sum_{i=1}^{t} A^i \tilde{w}_k   (1)$$
where $\tilde{w}_k$ is an initial probability vector corresponding to $w_k$, normally the $n$-dimensional vector
$$\tilde{w}_k = {}^t(0 \ \cdots \ \underbrace{1}_{k} \ \cdots \ 0).$$
Equation (1) is a formalization of the spreading activation used in [25] and [30];
however, it has a clear mathematical interpretation and a characteristic property.
In fact, when $\pi^*$ is the eigenvector corresponding to the eigenvalue $\lambda = 1$ of the
stochastic matrix $A$, $\lim_{n\to\infty} A^n \tilde{w} = \pi^*$ for any $w$ in the Lexicon, by the known
properties of Markov transition matrices [4]. Therefore $\lim_{t\to\infty} \pi_t(w) = \pi^*$ for every
$w$ in $L$, and every meaning of a word converges to $\pi^*$ over a sufficiently long
association interval $t$, whenever the corresponding Markov process is ergodic.
Since $\pi^*$ is a semantic vector which does not vary under association, $\pi^*$ can be
thought of as a semantic tautology in the corresponding semantic space.
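A minimal numerical sketch of equation (1) follows. It assumes a tiny three-word lexicon and a row-stochastic matrix with A[i, j] = a(w_j | w_i), propagating the activation as a row vector; this is one convenient orientation of the transition matrix for illustration, not necessarily the layout used in the actual experiments.

import numpy as np

def meaning(A, init, t=4):
    """Equation (1): average the state distributions after 1..t association steps.
    A[i, j] = a(w_j | w_i); `init` is the initial probability vector (e.g. 1 at w_k)."""
    v = np.asarray(init, dtype=float)
    total = np.zeros_like(v)
    for _ in range(t):
        v = v @ A          # one Markov transition (one association step)
        total += v
    return total / t

# Toy 3-word lexicon; each row of A sums to 1.
A = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])
w_tilde = np.array([1.0, 0.0, 0.0])            # start from word 0
print(meaning(A, w_tilde, t=4))
# A mixed initial distribution, e.g. `blue sky' with P(blue)=0.4, P(sky)=0.6:
print(meaning(A, np.array([0.4, 0.6, 0.0]), t=4))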
4.6
Acquisition of association probability
We did not define in 4.4 how to calculate the probability $P_S(\omega) = a_{ij}$ of
associativity in the semantic space. This subsection reviews the strategies used in
other studies and describes the log likelihood ratio, which this study adopts as its
measure.
4.6.1 Mutual information
Since $a_{ij}$ expresses the associative overlap between two words $w_i$ and $w_j$ and is
interchangeable between $i$ and $j$, it is natural to measure it by the mutual information
$$I(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)P(w_j)}   (2)$$
which Church [5] advocated as an associative measure for lexicography.
Though it works fairly well, it has problems with very low frequency words.
Because $I(w_i, w_j) = \log \frac{P(w_j \mid w_i)}{P(w_j)}$, mutual information effectively measures the
conditional cooccurrence probability $P(w_j \mid w_i)$ around $w_i$, weighted in inverse
proportion to the occurrence probability $P(w_j)$ of the neighbor. However, we do
not always regard a low frequency word as important: the reverse is often true of
our cognitive intuition.
As in [34] and many other studies based on mutual information, a compromise
that eliminates low frequency words is possible, but it leads to semantically unuseful
results, because most of the information is conveyed by exactly those very low
frequency words that are eliminated, and they comprise a substantial part of the
lexicon.
Therefore, this study uses another measure of association, the log likelihood ratio,
whose validity Dunning [11] showed in Information Retrieval.
4.6.2 Log likelihood ratio
Information measures such as mutual information and the χ² statistic overemphasize
the importance of very low frequency words. To address this problem, [11] proposed
a measure based on a log likelihood ratio test.
If the occurrences of words are distributed not normally around the maximum
likelihood estimate but binomially with respect to their occurrence probability, the
significance of the dependency between the occurrences of two words can be
calculated using a likelihood ratio.
We assume a null hypothesis
$$H_0: P(w_j \mid w_i) = p \;\wedge\; P(w_j \mid \neg w_i) = p   (3)$$
which says that the occurrences of $w_i$ and $w_j$ are independent of each other, and
the alternative hypothesis
$$H_1: P(w_j \mid w_i) = p_1 \neq p_2 = P(w_j \mid \neg w_i)   (4)$$
which says that the occurrence of $w_j$ depends on the occurrence of $w_i$, where
$p, p_1, p_2$ are parameters. The significance of the dependence can then be evaluated
using the likelihood ratio
$$\lambda_{ij} = \frac{L(H_0)}{L(H_1)}   (5)$$
and its logarithm
$$\log \lambda_{ij} = \log \frac{L(H_0)}{L(H_1)}.   (6)$$
Since $-\log \lambda_{ij} = \log \frac{L(H_1)}{L(H_0)}$ is a relative likelihood of dependency over
independency, and $-2 \log \lambda$ is asymptotically χ²-distributed [33], $-\log \lambda$ can be used
as a more accurate approximation than the χ² test.
When $p, p_1, p_2$ are estimated by maximum likelihood as $p = \frac{c_j}{N}$, $p_1 = \frac{c_{ij}}{c_i}$,
$p_2 = \frac{c_j - c_{ij}}{N - c_i}$, where $N = \sum_{k,l} c_{kl}$, $c_{ij} = N(w_{ij})$, $c_i = N(w_i)$, $c_j = N(w_j)$,
then $L(H_0)$ and $L(H_1)$ are calculated as
$$L(H_0) = Bi(p;\, c_i, c_{ij})\; Bi(p;\, N - c_i, c_j - c_{ij})   (7)$$
$$L(H_1) = Bi(p_1;\, c_i, c_{ij})\; Bi(p_2;\, N - c_i, c_j - c_{ij})   (8)$$
where $Bi$ is the binomial distribution $Bi(p; n, k) = {}_n C_k\, p^k (1 - p)^{n-k}$.
Therefore, $-\log \lambda_{ij}$ can be evaluated as
$$-\log \lambda_{ij} = \log \frac{Bi(p_1;\, c_i, c_{ij})\; Bi(p_2;\, N - c_i, c_j - c_{ij})}{Bi(p;\, c_i, c_{ij})\; Bi(p;\, N - c_i, c_j - c_{ij})}   (9)$$
$$= \log Bi(p_1;\, c_i, c_{ij}) + \log Bi(p_2;\, N - c_i, c_j - c_{ij}) - \log Bi(p;\, c_i, c_{ij}) - \log Bi(p;\, N - c_i, c_j - c_{ij}).   (10)$$
In this thesis we use this likelihood ratio as the measure of associative relevancy,
setting $P_S(\omega) = a_{ij} \propto -\log \lambda_{ij}$ with the normalizing constant $(\sum_{i,j} -\log \lambda_{ij})^{-1}$.
Using this measure, the association probability $a(w_j \mid w_i)$ is calculated as
$$a(w_j \mid w_i) = \frac{-\log \lambda_{ij}}{\sum_j -\log \lambda_{ij}}.   (11)$$
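The following sketch evaluates -log λ_ij of equations (9)-(11) directly from the counts c_ij, c_i, c_j and N, using log binomial likelihoods for numerical stability. The counts in the usage example are made up for illustration, and degenerate cases (probability estimates of exactly 0 or 1) are not handled.

import math

def log_binom(p, n, k):
    """log Bi(p; n, k) = log [ C(n, k) p^k (1 - p)^(n - k) ]."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

def neg_log_lambda(c_ij, c_i, c_j, N):
    """-log lambda_ij of equations (9)-(10); larger means stronger dependence."""
    p = c_j / N                       # H0: P(w_j | w_i) = P(w_j | not w_i) = p
    p1 = c_ij / c_i                   # H1: P(w_j | w_i)
    p2 = (c_j - c_ij) / (N - c_i)     # H1: P(w_j | not w_i)
    log_L0 = log_binom(p, c_i, c_ij) + log_binom(p, N - c_i, c_j - c_ij)
    log_L1 = log_binom(p1, c_i, c_ij) + log_binom(p2, N - c_i, c_j - c_ij)
    return log_L1 - log_L0

# Hypothetical counts: w_i occurs 100 times, w_j 80 times, 30 times together,
# out of N = 10,000 cooccurrence events.
score = neg_log_lambda(c_ij=30, c_i=100, c_j=80, N=10_000)
print(score)     # association strength before normalization
# a(w_j | w_i) is then this value normalized over all neighbors j of w_i (eq. 11).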
Hereafter, we use the association network acquired by parsing 50,000 sentences of
the EDR Japanese corpus [12] with a collocational window of 5 words; the resulting
LSN has 37,714 word nodes and 1,863,773 links between them.
Tables 1 and 2 show the association probabilities around some words in the LSN
and the resulting associations, where the association step is t = 4.
Table 1. Association probabilities: neighbors of 言語, 見る, and サントノレ, with association probability and cooccurrence frequency.
言語:  プログラム 0.066702 (49); 型 0.033042 (32); 世代 0.029995 (20); 記述 0.027043 (18); アセンブリ 0.021971 (9); 文脈 0.017466 (10); 第 0.017035 (21); 表現 0.016671 (16); 定義 0.014594 (12); 処理 0.013171 (17); ... ; 示す 0.000002 (1); 中心 0.000002 (1); 装置 0.000002 (1); 同 0.000001 (1); いま 0.000001 (1)
見る:  方 0.080225 (99); られる 0.070657 (140); 面倒 0.011259 (10); 限り 0.010553 (13); いる 0.010482 (138); 目 0.008112 (23); 姿 0.007964 (16); 強い 0.007256 (19); ない 0.007046 (93); 向き 0.006757 (8); ... ; 度 0.000001 (2); 業者 0.000001 (1); そこ 0.000001 (1); 相手 0.000001 (1); 戦争 0.000001 (1)
サントノレ:  フォーブル 0.249103 (1); アトリエ 0.225979 (1); ブティック 0.212098 (1); 構える 0.179264 (1); 街 0.133557 (1)
Table 2. Calculated meanings (association step t = 4): the most strongly activated words and their probabilities.
言語:  プログラム 0.021963; 言語 0.018031; 型 0.013351; 世代 0.009194; 記述 0.008461; 第 0.007039; アセンブリ 0.006270; 表現 0.006092; において 0.005672; 処理 0.005434; 文脈 0.005422; 定義 0.004933; 理論 0.004410; 形式 0.003786; 原始 0.003777; 論理 0.003685; いる 0.003416; れる 0.003312; プログラミング 0.003239
見る:  方 0.02264; られる 0.021779; 見る 0.013857; いる 0.005960; ない 0.005230; 面倒 0.003139; 目 0.002975; 限り 0.002965; 姿 0.002865; れる 0.002782; 強い 0.002677; みる 0.002610; か 0.002516; ながら 0.002398; 考える 0.002326; よう 0.002205; こと 0.002077; なる 0.002043; 回る 0.002037
サントノレ:  フォーブル 0.091468; アトリエ 0.088251; ブティック 0.08147; 街 0.064814; 構える 0.063651; サントノレ 0.043291; パリ 0.026355; 貸与 0.011235; 郊外 0.009156; 足 0.006649; 起きる 0.006041; ハンブルク 0.006025; レンタル 0.005846; 店 0.005592; ディスプレー 0.005549; 飲食 0.00532; 入る 0.004804; 商店 0.004770; 連れる 0.004443; パーティー 0.004095
Table 3. Functional words excluded, with their a_p(w) values.
の 0.000700; は 0.000786; や 0.000880; で 0.000892; を 0.000899; に 0.000942; が 0.000993; など 0.001114; から 0.001206; と 0.001243; も 0.001273; として 0.001355; による 0.001496; だ 0.001564; する 0.001653; た 0.001658; へ 0.001699; この 0.001727; また 0.001763; その 0.001802; によって 0.001908; まで 0.001990; て 0.002008; について 0.002067; 氏 0.002097; さん 0.002117; しかし 0.002119; ら 0.002152; 者 0.002191; に対する 0.002220; 的 0.002240; より 0.002267; ため 0.002313; ば 0.002332; という 0.002351; さらに 0.002357; 中 0.002423; 用 0.002429; および 0.002431; に対して 0.002482; ある 0.002482
Note that although this measure works fairly well for low frequency words, it is
not always sufficiently applicable to rather frequent words, for which functional
words become highly associated while other content words are suppressed.
To avoid this problem, functional words are excluded from association in this
study using a threshold on the a_p(x) statistic proposed in [29], which reflects
semantic informativeness better than raw frequency by using the cooccurrence
probability distribution. In the corresponding LSN, we used the threshold
a_p(x) < 0.0025. Table 3 shows the excluded words. Note that these exclusions are
mostly functional words bearing little semantic importance; this compromise
therefore does not become problematic, whereas excluding low frequency words
would be extremely inappropriate.
5.
Comparison with former formalisms
As we surveyed in section 3, most studies of lexical meaning so far use a discrete
representation of lexical knowledge based on first order predicate logic. In contrast,
the attempt to represent meaning as a continuous vector quantity has both
advantages that the previous formalisms cannot afford and limitations on what can
be described within the formalism. Below, we review these advantages and
drawbacks.
5.1
Advantageous properties
Quantitative treatment of meaning appears to have the following advantages:
- One can define lexical meaning, which depends on world knowledge that changes
continuously and varies from person to person, automatically, on the basis not of
the arbitrary and limited view of theorists but of a statistical and mathematical
view of information.
- Meaning can be defined completely automatically from corpora, so it can be
updated and maintained easily.
- One can give a quantitative criterion for meaning and the associated ambiguities,
which leads to a natural model of cognition that is not always deterministic.
In particular, a computational definition of `meaning' is important because natural
language entails the production of new meanings for words, and the semantic space
is not common to all but differs from person to person through linguistic contact
and the spontaneous acquisition of subspaces of language.
With this in mind, a semantic description that assumes a referent meaning common
to all is essentially insufficient, and a dynamic description is required.
However, a definition of `meaning' deduced statistically from collocation has an
inevitable limitation, because there is an unobservable discrepancy, which linguistics
cannot touch, between a word and its corresponding internal representation in the
mind.
5.2
Limitations of the formalism
First, the associative definition of meaning cannot deal with logical constructions.
In addition to the flat description of its extensional images, a sentence may have a
structural meaning, typically depicted by logical formulae. As we saw in section 2.2,
this cannot be deduced from lexical meaning, that is, from semantic extensions.
Considering that the history of semantics has mainly dealt with this kind of meaning,
an interface with structural meaning is an important task that must be addressed
next.
Second, the types of associative links that we are naturally apt to perceive are
excluded from the current formalization. The arguments so far deliberately excluded
them in order to enable a statistical measure of meaning, treating association as an
unlabeled, fundamental phenomenon of cognition; yet in fact, when we are exposed
to an associative relation, say between `Kyoto' and `Kamogawa', we are apt to see
a specific kind of relationship there, such as `locational'. We note that such a
relationship is itself expressed by a word, and therefore such a label could be
retrieved with an adequate treatment of unlabeled association; but the phenomenon
of label deduction itself cannot be derived from the naive theory alone, and we
must expect more abstract constructions above it.
6. Semantic sentence analysis
6.1 Semantic consistency of a sentence
We defined the meaning of a word as a probability distribution of association
over the lexicon in section 4.4.2.
When the meaning of a word is represented quantitatively in this way, we can
give a quantitative measure of the semantic consistency of sentences, which has
been overlooked from the perspective of syntax.
As a first conceivable measure, the semantic consistency of a sentence s could be
calculated from the degree of clustering of the set of semantic vectors
{π(w_i) | w_i ∈ s}. [30] follows this line and defines the semantic consistency as the
similarity of the set of semantic vectors to the set itself. Although he obtained a
moderate result on experimental sentences, it proves insufficient on closer
examination: while this measure gives the semantic consistency of a sentence
containing only words that are already sufficiently associated, it says nothing about
the machinery by which previously unassociated pairs of words come to be considered
associative and acquire a new association.
For example, within a sentence
\Under the clear sky of autumn in Tennessee, he found a girl with a
straw hat."
there is no direct relevance between `sky' and `Tennessee', and `girl' and `straw' are
semantically unrelated in any direct way. The reason we can nevertheless understand
the sentence is its local consistency of dependent pairs such as `sky' and `clear', or
`girl' and `hat'. These locally consistent pairs are merged to constitute a consistent
meaning within us.
To model this perspective, we extract the dependency structure of a given sentence
with a statistical dependency parser [31]. Given a dependency structure, we can
calculate the semantic consistency of a sentence as the geometric average of the
semantic similarities between dependent pairs.
For a sentence s = w_1 w_2 ⋯ w_n and the set of dependency relations D(s) = {d_ij},
where d_ij expresses a dependency w_i → w_j (i ∈ {1..n}) and the dependency is
assumed to be unique for each w_i, the semantic consistency C(s) of s is defined by
$$C(s) = \sqrt[n]{\prod_{d_{ij} \in D(s)} \mathrm{sim}(d_{ij})^{w_{ij}}}   (12)$$
$$\log C(s) = \frac{1}{n} \sum_{d_{ij} \in D(s)} w_{ij} \cdot \log \mathrm{sim}(d_{ij})   (13)$$
where sim(d_ij) = sim(π(w_i), π(w_j)).
Here w_ij is a weighting coefficient: we usually tolerate a discrepancy between an
uninformative pair of words, while we feel a semantic severance between an
informative pair. Therefore w_ij is defined as
$$w_{ij} = u(w_i) \cdot u(w_j)   (14)$$
where u(w) is a normalized information measure
$$u(w) = \frac{\log \frac{n}{N(w)}}{\log n}, \qquad \Bigl(n = \sum_{w} N(w)\Bigr).$$
Because w_ij approaches 0 when w_i and w_j are uninformative, such a pair is lightly
weighted in (12), since $\lim_{w_{ij} \to 0} \mathrm{sim}(d_{ij})^{w_{ij}} = 1$.
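A small sketch of equations (12)-(14) follows. The dependency pairs, word frequencies, and the stand-in similarity function between meaning distributions are hypothetical placeholders; also, the normalization here uses the number of dependent pairs, whereas the text divides by the sentence length n.

import math

def u(word, freq, total):
    """Normalized information measure of eq. (14): log(n / N(w)) / log n."""
    return math.log(total / freq[word]) / math.log(total)

def consistency(dep_pairs, sim, freq, total):
    """Eq. (13): log C(s) = (1/n) * sum of w_ij * log sim over dependent pairs."""
    n = len(dep_pairs)
    log_c = 0.0
    for wi, wj in dep_pairs:
        w_ij = u(wi, freq, total) * u(wj, freq, total)   # weight of the pair
        log_c += w_ij * math.log(sim(wi, wj))
    return math.exp(log_c / n)

# Hypothetical inputs: dependency pairs of a short sentence, word frequencies,
# and a stand-in similarity between the words' meaning distributions.
freq = {"sky": 40, "clear": 25, "girl": 30, "hat": 12, "the": 900}
total = sum(freq.values())
sim = lambda a, b: 0.6 if {a, b} in ({"sky", "clear"}, {"girl", "hat"}) else 0.05
pairs = [("clear", "sky"), ("girl", "hat")]
print(consistency(pairs, sim, freq, total))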
6.1.1 Experiment
Here we calculated the consistency of each of 45 mutually unrelated sentences
beginning with `彼 (he)' from the EDR Japanese corpus, which are included in the
calculation of the association matrix. Additionally, we also calculated some reference
sentences which (i) seem rather difficult to understand and (ii) seem rather easy,
the latter extracted from children's tales. These sentences are shown in Table 4.
Table 4. Analyzed sentences (EDR Japanese corpus). The sentences below are numbered 1-45 in the order listed; these numbers are used in Tables 5 and 6.
彼がぼくたちの話に乗ってくれるかどうかは疑問だ。
彼が使っている現トラクター、元戦車はソ連で1960年代中期に造られたT54だ。
彼が働いた鉱山には、同じような白人仲間が64人いた。
彼とは、引っ越して以来、すっかりつき合いが絶えてしまっている。
彼にとって戦争とは政治的な駆け引きではなく、拡大された暴力と格闘だった。
彼に代表される黒人による政治的圧力は大きくなった。
彼のやり残した仕事は、後世の人々によって必ず成し遂げられるだろう。
彼のユニークな意見が採り上げられ、重役会議に諮られることになった。
彼の会社はもちろん、その週のうちに大半の商社は北京駐在員を帰任させ始めた。
彼の言うことは、いつもあいまいもことしていてまるで雲をつかむようだ。
彼の作品をどう見るか、というより、自然の環境に目を向けさせ、現代へのアンチテーゼとしての
激しい芸術、さらに芸術の社会性を追求した。
彼の質問は、この問題の本質に触れるものでした。
彼の真に迫った演技は、観客に深い感動を与えた。
彼の足の肉がちぎれている。
彼の発明は日本綿業の発展に貢献したばかりでなく,31年には,当時一流のイギリスの
プラット社がこの特許権を譲り受け,世界的に注目された。
彼の無口な性格は、時に人々の誤解を招いた。
彼の話は、言い回しがうまいので、わかりやすいうえにおもしろい。
彼は、この世に自分より劣った人間は一人もいないかのように思い込んでいる。
彼は、つくりの大きな顔に笑みを浮かべて言った。
彼は、クラスの中で際立った存在で、みんなから注目されていた。
彼は、王立医学協会のメンバーで、以前は原子力調査員でもあった。
彼は、近ごろにはめずらしく折り目正しい青年です。
彼は、口数の少ない、どちらかというと思索的なタイプの人です。
彼は、自他共に許す世界的な名指揮者です。
彼は、手短にゲームのやり方を説明したあと、実際にやってみせてくれた。
彼は、世の荒波にもまれながら、だんだんとたくましい人間に成長していった。
彼は、大きな可能性を持つ待望の新人として、プロ野球界に迎えられた。
彼は、日本の車のクラクション音は小さい、という。
彼は、豊富な語いを駆使して、すばらしい作品を書き上げた。
彼は、力士としての年齢的な限界を感じ、今場所限りで引退する決心だ。
彼はおもむろに立ち上がり、静かに部屋を出ていった。
彼はすぐには数え切れない死体の山を見た。
彼はばく大な資金を投入して、自然保護団体の設立に努めました。
彼はシートベルトをしていたのだが、頭をどこかに強く打ちつけ、病院にかつぎこまれた。
彼は意を決して、彼女に自分の思いを打ち明けた。
彼は奇行のない人で、エピソードらしいものも思い出せない。
彼は血だらけになりながらも、われわれと一緒に逃げた。
彼は困難の中でも学問を続け、立派に学者として大成した。
彼は自分の高校時代のことを、そう言う。
彼は初めから十点満点をマークし、他の選手の追随を許さなかった。
彼は人づてに買った中古の日本製カメラの部品がないため、写すことができず、その店に来ていたのだ。
彼は大学で、都市計画の分野を専攻している。
彼は東チモール侵攻の総指揮をとった人物である。
彼は憤然といった。
彼は理不尽な理由をつけて、訳もなく反対をする。
reference (i)4
1. 現代のはじまりのときに、言語に挑戦をいどんできたのは、ほんとうは物そのものではなく、物たちの世界をつくるマシニックなものの領域の力だったのだろう、とぼくは考える。
2. ここでは複数で、偶然的で、個性的なさまざまな断片をひとつの統一体につなぐ、「文脈を統御するメカニズム」の働きから解き放たれたテキストの運動が、新しい自由を生きるようになる。
reference (ii)
1. 今ではいつのころだったか覚えてはいませんが、秋だったのでしょう。5
2. ふと見ると、川の中に人がいて、なにかやっています。6
For these sentences, some words are not included in the training data; in such cases the similarity to the word is fixed to a small constant (0.0001).
4 中沢新一, 「知天使のぶどう酒」, 河出書房新社, 1992.
5 有島武郎, 「一房の葡萄」, 岩波書店, 1988.
6 新美南吉, 「ごんぎつね」, 校訂新美南吉全集, 1980.
6.1.2 Evaluation
The results are shown in Table 5.
Table 5. Consistency C(s) of the EDR corpus sentences, in ascending order (sentence numbers refer to Table 4).
No. 5: 0.22382;  No. 40: 0.304015;  No. 29: 0.329755;  No. 18: 0.330606;  No. 9: 0.331364;
No. 30: 0.347335;  No. 13: 0.350964;  No. 4: 0.352166;  No. 15: 0.36101;  No. 33: 0.370594;
No. 24: 0.373669;  No. 35: 0.376915;  No. 42: 0.379523;  No. 26: 0.392221;  No. 27: 0.39402;
No. 14: 0.408127;  No. 34: 0.418808;  No. 16: 0.420147;  No. 25: 0.42216;  No. 38: 0.424438;
No. 8: 0.433286;  No. 31: 0.436043;  No. 2: 0.438841;  No. 6: 0.443812;  No. 7: 0.445323;
No. 1: 0.454066;  No. 32: 0.470849;  No. 20: 0.471595;  No. 22: 0.482224;  No. 12: 0.491864;
No. 43: 0.499521;  No. 41: 0.503757;  No. 11: 0.505961;  No. 36: 0.510784;  No. 17: 0.525287;
No. 19: 0.526912;  No. 23: 0.532498;  No. 37: 0.536592;  No. 10: 0.557533;  No. 3: 0.562632;
No. 28: 0.627387;  No. 39: 0.637572;  No. 44: 0.645452;  No. 21: 0.65568;  No. 45: 0.656699
reference (i):  sentence 1: 0.16165;  sentence 2: 0.162771
reference (ii):  sentence 1: 0.763045;  sentence 2: 0.593312
The 45 sentences from the EDR corpus are relatively consistent semantically, and
the values distribute smoothly with few prominent outliers. However, sentence 5,
which contains the metaphorical phrase `拡大された暴力 (extended violence)', has
a low consistency, and sentence 45, which contains only similar words and
dependencies between them, has a relatively high consistency. The contrast is
especially prominent in the reference sentences: for reference (i), the consistencies
are 0.1616 and 0.1627, far lower than those of the sampled EDR sentences.
This is partly because the two sentences contain words ([マシニック], [統御]) not
found in the lexicon; but in addition, their dependent pairs have acute discrepancies
in meaning, which lowers the overall consistency. On the other hand, (ii) consists of
easier words and shows a high similarity for each pair.
However, a problem remains, as the comparatively similar sentences from the EDR
corpus show. The following reasons can be considered for this problem:
- Problems with the definition of dependency.
For example, the dependency structure of the phrase `彼とはつき合いが絶える' from
sentence 4 becomes as in Figure 1. As shown there, the natural semantic dependency
`彼 (he)' → `つき合い (association)' is not computed; instead the syntactic dependency
`彼' → `絶える (cease to)' is, and this leads to a lower consistency than we would
naturally expect. In Figure 2, which depicts part of the dependency structure of
sentence 13, `演技 (act)' likewise relates not to `感動 (impression)' or `観客 (audience)'
but to `与える (give)'.
Although assuming associative relations between neighboring words might be
considered a remedy, that cannot fully capture the semantic relationships in a
sentence, as we noted at the beginning of this subsection. Since language is a
one-dimensional stream of words, we must consider contextual relationships between
the occurrences of signs as well as assuming a dependency structure over distant
pairs.

彼   付き合い   絶える
Figure 1. Unused semantic dependency: the semantic link 彼 → つき合い is not produced, only the syntactic link 彼 → 絶える.

彼の
┗真に
┗迫った
┗演技は、
┃観客に
┃┃深い
┃┃┗感動を
┗┻━┻与えた。
Figure 2. Dependency structure of sentence 13.

- Problems with the selection of the semantic head.
In the experiment, we calculated the similarities of pairs not from every word
contained but from the semantic head of each bunsetsu, in order to capture the
semantic dependency and leave the syntactic consistency aside.
However, in sentence 21 for example, in `王立医学協会 (Royal Society of Medicine)'
and `原子力調査員 (investigator of nuclear power)', only the head word of each
bunsetsu is taken as the semantic head, excluding the other informative words such
as `王立' and `医学'.
Since the separation into bunsetsu is defined by regular expressions over
part-of-speech information [31], one remedy would be a method that performs no
bunsetsu separation and calculates over a flat dependency structure, assigning low
weights to functional words; but this cannot treat distant relationships adequately,
as argued above. The difference between syntactic dependency and semantic
dependency is also a problem here.
6.2
Semantic information measure
In the previous section we proposed a quantitative indicator of the semantic
consistency of a sentence. However, no matter how consistent a sentence is, if the
meaning it expresses is trivial it is of little value as communication, which is the
central objective of language argued at the beginning of this thesis. To transfer
semantically rich information, we must make utterances in which a large variety of
meanings are associated with one another; on the other hand, a sentence is not
accepted as understandable unless it has a moderate consistency, as we saw in the
previous section. Therefore, we can think of sentence production as balancing the
motivation to create a large diversity of meaning against the motivation to observe
the local consistency of dependent pairs that makes the sentence semantically
consistent.
In other words, the informativeness of a sentence is augmented not when a trivial
meaning is strongly associated, but when diverse meanings are grasped as
associatively connected under the condition of pairwise local consistency. More
formally, then, the semantic informativeness of a sentence can be measured as the
distance from the probability distribution of associative activation of the sentence
to the trivial meaning π* noted in section 4.5 as a semantic tautology.
Let a sentence be s = w_1 w_2 w_3 ⋯ w_n. From s and the set of probability
distributions P(s) = {π(w_i) | i ∈ {1..n}} associated as its meanings, we can
calculate a distribution π(s) as the total meaning of s. Though we can imagine
various definitions of π(s) from P(s) that take word order and context effects into
consideration, here we simply take the average
$$\pi(s) = \frac{1}{n} \sum_{i=1}^{n} \pi(w_i)   (15)$$
and measure the Kullback-Leibler information distance to π*,
$$I(s) = D(\pi(s) \parallel \pi^*)   (16)$$
$$= \sum \pi(s) \log \frac{\pi(s)}{\pi^*}   (17)$$
to give the semantic informativeness I(s) of s.
In general, the divergence $D(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ has the problem
that $D(P \parallel Q) \to \infty$ when Q(x) = 0 for some x; however, this can be avoided here
because each element of the trivial meaning distribution π* theoretically has a
nonzero probability as long as the corresponding word has a nonzero occurrence
probability, which must be the case in every situation.
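The computation of equations (15)-(17) is sketched below, assuming the per-word meanings π(w_i) and the tautology π* are already available as probability vectors over a toy lexicon; all numbers are illustrative.

import numpy as np

def sentence_meaning(word_meanings):
    """Equation (15): the total meaning of s as the average of its words' meanings."""
    return np.mean(word_meanings, axis=0)

def informativeness(word_meanings, pi_star):
    """Equations (16)-(17): KL divergence from the sentence meaning to the
    semantic tautology pi_star (assumed strictly positive everywhere)."""
    pi_s = sentence_meaning(word_meanings)
    mask = pi_s > 0                      # terms with pi_s = 0 contribute nothing
    return float(np.sum(pi_s[mask] * np.log(pi_s[mask] / pi_star[mask])))

# Toy lexicon of 4 words; rows are the meanings pi(w_i) of the words in s.
word_meanings = np.array([[0.70, 0.10, 0.10, 0.10],
                          [0.10, 0.65, 0.15, 0.10]])
pi_star = np.array([0.40, 0.30, 0.20, 0.10])   # hypothetical stationary meaning
print(informativeness(word_meanings, pi_star))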
6.2.1 Experiment
For each of the sentences used in section 6.1.1, we calculated the KL divergence
to π*.
6.2.2 Evaluation
Table 6 shows the results.
Table 6. Informativeness I(s) of the EDR corpus sentences, in descending order (sentence numbers refer to Table 4).
No. 14: 1.1821;  No. 31: 1.12114;  No. 16: 1.12004;  No. 22: 1.10724;  No. 44: 1.10162;
No. 36: 1.07741;  No. 34: 1.06932;  No. 19: 1.06765;  No. 29: 0.998295;  No. 4: 0.967434;
No. 26: 0.929454;  No. 8: 0.884969;  No. 17: 0.851634;  No. 30: 0.831292;  No. 10: 0.802944;
No. 43: 0.769649;  No. 18: 0.765636;  No. 25: 0.759355;  No. 35: 0.751526;  No. 32: 0.751315;
No. 23: 0.748489;  No. 38: 0.739663;  No. 7: 0.738743;  No. 24: 0.729799;  No. 5: 0.713438;
No. 13: 0.694481;  No. 42: 0.677928;  No. 1: 0.655219;  No. 9: 0.654186;  No. 40: 0.612354;
No. 45: 0.60596;  No. 15: 0.600273;  No. 27: 0.595084;  No. 37: 0.573879;  No. 33: 0.572526;
No. 2: 0.570249;  No. 20: 0.544073;  No. 12: 0.54234;  No. 3: 0.493979;  No. 41: 0.483002;
No. 39: 0.479121;  No. 21: 0.45185;  No. 6: 0.43127;  No. 11: 0.419433;  No. 28: 0.417165
reference (i):  sentence 1: 0.477072;  sentence 2: 0.413139
reference (ii):  sentence 1: 0.551357;  sentence 2: 0.674302
As we noted in section 6.1, semantic informativeness is assumed to be inversely
related to semantic consistency; Spearman's rank correlation coefficient between
this result and the ordering of the experiment in section 6.1.1 is r_s = -0.22029,
which shows the expected weak negative correlation.
However, when we look closely into individual sentences, this result does not
always reflect an intuitive informativeness. For example, though sentence 5 is the
most inconsistent sentence, containing a metaphorical dependency relation, its
divergence is moderate among the other sentences. Moreover, in spite of the quite
low consistency of the reference sentences (i), their divergence to π* is medium,
and reference (ii) is medium as well.
Since we compare the average meaning of the word set contained in a sentence to
π*, the divergence tends to be small for a sentence containing frequent words, which
are often the topic. In fact, we seem to judge informativeness by the variance of
meanings rather than by their average. However, the self-information of π(s) as a
measure of variance is almost equal across all the experimental sentences and is
therefore not useful. This is partly due to the definition of π(s): a definition by
linear combination (averaging) cannot sufficiently express the overlap of the
individual probability distributions, and so fails to describe the semantic variance
of the meanings the words in a sentence have.
7.
Conclusion
This thesis showed that a stochastic treatment of meaning based on cooccurrence
statistics can model meaning which differs from person to person and changes
dynamically with the environment.
Contrary to the traditional deterministic treatment of meaning, which entails
arbitrariness, the statistical acquisition of meaning is defined completely objectively.
Also, by performing association as the state transition of a Markov process, the
spreading activation formerly advocated in psycholinguistics is formulated
mathematically, enabling indirect semantic relationships to be handled adequately.
However, the stochastic formulation of meaning still leaves much to be desired.
Although the definition of associativity by the log likelihood ratio works fairly well
in practice, the exclusion of functional words cannot be avoided, because functional
words are not always assigned sufficiently low associativities. Moreover, in the
associational process of the Markov transition, the association spreads out over all
dimensions under the naive use of the stochastic matrix, which does not fit our
intuitions completely.
These problems come from the lack of an adequate criterion for association over
the Markov process: we must make explicit what criteria an adequate association
should satisfy and what quantity is to be maximized or minimized in the association.
When meaning is defined as a probability distribution over the lexicon, the set of
meanings forms a Hilbert space H by completion with a norm between probability
vectors [32]. By regarding a linguistic symbol from an information source as a
realization of a random variable over the lexicon, and by using geometrical methods
in H [2], our intuition of the semantic space can be modeled mathematically more
accurately.
8.
Acknowledgements
First, I am very grateful to Professor Yuji Matsumoto. Besides his supervision and
valuable comments on the manuscript, I would not have been able to develop my
original idea effectively without being accepted into the Computational Linguistics
Laboratory at NAIST.
Second, I thank my second supervisor, Professor Hiroyuki Seki, for showing
interest in this study, which relieved and encouraged me greatly.
I am also especially indebted to Associate Professor Shin Ishii, who showed
interest at quite a premature stage of this study and gave me a number of insightful
comments, in particular the suggestion of the probabilistic formulation.
Professor Yutaka Takahashi and Associate Professor Satoshi Nakamura received
me warmly on my visit and gave me valuable comments.
Additionally, I was inspired by the members of the neural modeling laboratory at
the RIKEN Brain Science Institute. Dr. Cateau suggested the similarity to the
function of a neuron.
Finally, I would like to thank the staff of the Matsumoto laboratory and all the
laboratory members. In spite of my tendency toward a self-righteous attitude to
research, they received me warmly and gave me criticism. I was also supported
technically in various respects, which helped me much in improving my programming.
I also note that all of this was made possible by national funding and a scholarship;
without them I would never have been able to contemplate the nature of language
so thoroughly.
References
[1] H. Aït-Kaci and R. Nasr. LogIn: A logic programming language with built-in inheritance. Journal of Logic Programming, Vol. 3, No. 3, pp. 187-215, 1986.
[2] S. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, New York, 1985.
[3] R. J. Brachman and J. G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science, Vol. 9, No. 2, pp. 171-216, 1985.
[4] Pierre Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Vol. 31 of Texts in Applied Mathematics. Springer, 1999.
[5] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proc. of COLING 27, pp. 76-83, 26-29 Jun 1989.
[6] A. M. Collins and E. F. Loftus. A spreading activation theory of semantic processing. Psychological Review, No. 82, pp. 407-428, 1975.
[7] I. Dagan, S. Marcus, and S. Markovitch. Contextual word similarity and estimation from sparse data. Computer Speech and Language, Vol. 9, pp. 123-152, 1995.
[8] Ido Dagan, L. Lee, and F. Pereira. Similarity-based methods for word sense disambiguation. In Proc. of ACL-EACL '97, pp. 56-63, 1997.
[9] Ferdinand de Saussure. 一般言語学講義. 岩波書店, 1972.
[10] S. Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[11] Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61-74, 1993.
[12] EDR 電子化辞書研究所. EDR 電子化辞書仕様説明書, 1995.
[13] Jeffrey L. Elman. Finding structure in time. Cognitive Science, No. 14, pp. 179-211, 1990.
[14] Donald Hindle. Noun classification from predicate-argument structures. In Proc. of the 28th COLING, pp. 268-275, 1990.
[15] Will Lowe. Semantic representation and priming in a self-organizing lexicon. In Proc. of 4th Neural Computation and Psychology Workshop, pp. 227-239. Springer Verlag, 1997.
[16] Mitchell P. Marcus. A Theory of Syntactic Recognition for Natural Language. MIT Press, 1980.
[17] R. Montague. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, 1974.
[18] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In Proc. of the 31st ACL, pp. 183-190, 1993.
[19] James Pustejovsky. The Generative Lexicon. The MIT Press, 1995.
[20] M. R. Quillian. Semantic Information Processing, pp. 216-270. MIT Press: Cambridge, MA., 1968.
[21] Edward Sapir. 言語. 岩波書店, 1998.
[22] Hinrich Schutze. Dimensions of meaning. In Proceedings of Supercomputing '92, pp. 787-796, 1992.
[23] J. F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984.
[24] M. Taft. Reading and the mental lexicon. Lawrence Erlbaum Associates Limited, 1991.
[25] David L. Waltz and Jordan B. Pollack. Massively parallel parsing: A strongly interactive model of natural language interpretation. Cognitive Science, No. 9, pp. 51-74, 1985.
[26] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL '95, pp. 189-196, 1995.
[27] 高橋直人. 実数ベクトルによる語句の表現の試み. 情報処理学会研究報告 自然言語処理研究会, pp. 95-102, Nov 1995.
[28] 高橋直人. ニューラルネットを用いた意味表現形式の自動獲得. 電気情報通信学会技術研究報告 NLC 98-28, Vol. 98, No. 338, pp. 17-24, 1998.
[29] 持橋大地, 松本裕治. 連想としての意味. 情報処理学会研究報告 99-NL-134, pp. 155-162, 1999.
[30] 小嶋秀樹, 古郡延治. 単語の意味的な類似度の計算. 電子情報通信学会技術研究報告 AI92-100, pp. 81-88, 1993.
[31] 藤尾正和, 松本裕治. 語の共起確率に基づく係り受け解析とその評価. 情報処理学会論文誌, Vol. 40, No. 12, pp. 4201-4212, December 1999.
[32] 片山徹. 新版 応用カルマンフィルタ. 朝倉書店, Jan 2000.
[33] 東京大学教養学部統計学教室編. 自然科学の統計学. 基礎統計学 III. 東京大学出版会, 1992.
[34] 廣恵太, 伊藤毅志, 古郡延治. コーパスから抽出した単語間類似度に基づく意味ネットワーク. 情報処理学会第 51 回論文集, Vol. 3, pp. 13-14, 1995.