Applying Deep Linguistic Analysis to the Web using Windows Azure
Deep Natural Language Processing for Improving a Search Engine Infrastructure using Windows Azure
Daisuke Kawahara and Sadao Kurohashi
Graduate School of Informatics,
Kyoto University
Cloud Futures Workshop 2011 (June 3rd, 2011)
Overview
Web Services: Web services of search and analysis based on deep NLP
Search Engine Infra TSUBAKI: Deep NLP-based indexing and search
HPC or Cloud, Large Web Page Pool (information on the Web)
• Linux clusters/grids: 1,000〜2,000 CPUs
• Windows Azure: 〜10,000〜 CPUs
Search Engine Infrastructure TSUBAKI based on Deep NLP
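Case Frame Knowledge Acquisition and Deep NLP
Attachment ambiguity in English:
Mary ate the salad with fork
Mary ate the salad with mushrooms
Case frames:
泳ぐ (swim): {人 person, 子 child, …}が {クロール crawl, 平泳ぎ breaststroke, …}で {海 sea, 大海 ocean, …}を
見る (see): {人 person, 者 person, …}が {望遠鏡 telescope, 双眼鏡 binoculars, …}で {姿 figure, 人 person, …}を
Attachment ambiguity in Japanese:
クロールで泳いでいる女の子を見た (crawl / swim / girl / saw: "I saw a girl swimming the crawl")
望遠鏡で泳いでいる女の子を見た (telescope / swim / girl / saw: "I saw a swimming girl through a telescope")
[Kawahara and Kurohashi, LREC06, HLT-NAACL06]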
Acquisition pipeline:
Web: 1.6G sentences (100M pages)
→ Parsing and Filtering → Predicate-argument structures → Clustering → Case frames for 40K predicates
Processing time: 2 weeks, 3 days
89.0% for all; 98.3% for 20.7% PAs
89.0% → 89.7%
Case frame examples
Case frame                   CS   examples
yaku (1) (bake)              ga   I:18, person:15, craftsman:10, …
                             wo   bread:2484, meat:1521, cake:1283, …
                             de   oven:1630, frying pan:1311, …
yaku (2) (have difficulty)   ga   teacher:3, government:3, person:3, …
                             wo   hand:2950
                             ni   attack:18, action:15, son:15, …
yaku (3) (copy; burn CDR)    ga   maker:1, distributor:1, …
                             wo   data:178, file:107, copy:9, …
                             ni   R:1583, CD:664, CDR:3, …
…
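A minimal sketch of how such case frames could be stored and matched against the arguments of a new sentence. The counts are only the truncated examples listed on the slide, and the add-one smoothed scoring (`slot_log_prob`, `select_case_frame`, `missing_slot_penalty`) is an illustrative assumption, not the authors' model.

```python
import math

# Case frames for "yaku", restricted to the examples shown on the slide.
CASE_FRAMES = {
    "yaku (1) bake": {
        "ga": {"I": 18, "person": 15, "craftsman": 10},
        "wo": {"bread": 2484, "meat": 1521, "cake": 1283},
        "de": {"oven": 1630, "frying pan": 1311},
    },
    "yaku (2) have difficulty": {
        "ga": {"teacher": 3, "government": 3, "person": 3},
        "wo": {"hand": 2950},
        "ni": {"attack": 18, "action": 15, "son": 15},
    },
    "yaku (3) copy; burn CDR": {
        "ga": {"maker": 1, "distributor": 1},
        "wo": {"data": 178, "file": 107, "copy": 9},
        "ni": {"R": 1583, "CD": 664, "CDR": 3},
    },
}


def slot_log_prob(counts, word, alpha=1.0):
    """Add-alpha smoothed log probability of `word` filling a case slot."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen words
    return math.log((counts.get(word, 0) + alpha) / (total + alpha * vocab))


def select_case_frame(arguments, frames=CASE_FRAMES,
                      missing_slot_penalty=math.log(1e-6)):
    """Return the name of the frame that best explains {case: word} arguments."""
    def score(frame):
        total = 0.0
        for case, word in arguments.items():
            if case in frame:
                total += slot_log_prob(frame[case], word)
            else:
                # Hypothetical penalty for a case the frame does not have.
                total += missing_slot_penalty
        return total

    return max(frames, key=lambda name: score(frames[name]))


if __name__ == "__main__":
    print(select_case_frame({"wo": "bread", "de": "oven"}))
    print(select_case_frame({"wo": "data", "ni": "CD"}))
```

With these toy counts, an argument pair such as bread/oven selects the "bake" frame, and data/CD selects the "copy; burn" frame.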
Ellipsis (Zero Anaphora) Resolution [Sasano et al., COLING08]
Input sentences:
Toyota-wa 1997nen hybrid-car Prius-wo hatsubai. 2000nen-karawa kaigai-demo hanbai-shiteiru.
Toyota launched the hybrid car Prius in 1997. Φ1 started selling Φ2 overseas in 2000.
Analysis of the input:
Toyota-wa / 1997-nen / hybrid car / Prius-wo / hatsubai. (launch)
2000-nen-karawa / kaigai-demo (overseas) / hanbai-shiteiru. (sell)
Entities: {Toyota}, {hybrid car, Prius}, {kaigai}
Case frames:
hatsubai (launch)(1)
  ga (nominative): company, SONY, firm, … [NE:ORGANIZATION] 0.15, …
  wo (accusative): product, CD, model, car, … [CT:ARTIFACT] 0.40, …
  de (locative): area, shop, world, Japan, … [CT:FACILITY] 0.13, …
hanbai (sell)(1)
  ga (nominative): company, Microsoft, … [NE:ORGANIZATION] 0.16, …
  wo (accusative): goods, product, ticket, … [CT:ARTIFACT] 0.40, …
  ni (dative): customer, company, user, … [CT:PERSON] 0.28, …
  de (locative): shop, bookstore, site, … [CT:FACILITY] 0.40, …
Legend: direct case assignment; indirect case assignment (zero anaphora)
Probabilistic estimates of the correspondences
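A minimal sketch of filling the omitted arguments Φ1 and Φ2 from earlier entities, using the category probabilities shown for hanbai (sell). The entity categories in `ENTITY_CATEGORY`, the greedy slot-by-slot choice, and the `threshold` and `floor` values are assumptions made for illustration; the model of Sasano et al. is probabilistic and is not reproduced here.

```python
# Category probabilities for hanbai (sell)(1), as shown on the slide.
HANBAI_FRAME = {
    "ga": {"NE:ORGANIZATION": 0.16},
    "wo": {"CT:ARTIFACT": 0.40},
    "ni": {"CT:PERSON": 0.28},
    "de": {"CT:FACILITY": 0.40},
}

# Assumed categories for the entities introduced in the first sentence.
ENTITY_CATEGORY = {
    "Toyota": "NE:ORGANIZATION",
    "Prius": "CT:ARTIFACT",
}


def resolve_zero_anaphora(frame, filled, entities, floor=0.01, threshold=0.1):
    """Greedily assign earlier entities to the unfilled slots of a case frame.

    frame:    {case: {category: probability}}
    filled:   {case: entity} for arguments that appear overtly (direct assignment)
    entities: {entity: category} for candidate antecedents
    """
    resolved = {}
    used = set(filled.values())
    for case, probs in frame.items():
        if case in filled:
            continue
        best, best_p = None, threshold
        for entity, category in entities.items():
            if entity in used:
                continue
            p = probs.get(category, floor)
            if p > best_p:
                best, best_p = entity, p
        if best is not None:  # leave the slot empty if nothing is likely enough
            resolved[case] = best
            used.add(best)
    return resolved


if __name__ == "__main__":
    # "kaigai-demo" fills the locative slot directly in the second sentence.
    print(resolve_zero_anaphora(HANBAI_FRAME, {"de": "kaigai"}, ENTITY_CATEGORY))
```

Under these assumptions the nominative slot is filled by Toyota and the accusative slot by Prius, while the dative slot stays empty because no remaining entity clears the threshold.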
Effect of Corpus Size [Sasano et al., NAACL-HLT09]
[Figure: accuracy of syntactic analysis, accuracy of case structure analysis, coverage of case frames, and F-measure of zero anaphora resolution (y-axis, 0 to 1) plotted against corpus size in number of sentences (x-axis: 1.6M, 6.3M, 25M, 100M, 400M, 1.6G)]
Indexing
Terms: Words, synonyms/hypernyms of words and dependency relations including zero anaphora
Example (modifier → modifiee):
情報 (information) → 技術の (technology) → 発達は (growth) → 目覚ましいものが (striking) → あります (is)
(The growth of information technology is striking.)
Extracted terms
Words: 情報 (information), 技術 (technology), 発達 (growth), 目覚ましい (striking)
Synonyms: IT, 進歩 (progress)
Dependency relations: 情報→技術, 技術→発達, 発達→目覚ましい
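A minimal sketch of the term extraction for the example sentence, assuming the dependency analysis is already available as (modifier, modifiee) pairs of content words. The function name `extract_index_terms` and the synonym table format are hypothetical; relations recovered by zero anaphora resolution are not modelled.

```python
def extract_index_terms(dependencies, synonyms):
    """Collect index terms: words, dependency relations, and synonyms.

    dependencies: list of (modifier, modifiee) content-word pairs
    synonyms:     {tuple of words: [synonym, ...]}; a synonym is added when
                  every word of the tuple occurs in the sentence
    """
    terms = []
    seen = set()

    def add(term):
        if term not in seen:
            seen.add(term)
            terms.append(term)

    # 1. Words
    for modifier, modifiee in dependencies:
        add(modifier)
        add(modifiee)
    words = set(seen)

    # 2. Dependency relations, written as "modifier→modifiee"
    for modifier, modifiee in dependencies:
        add(f"{modifier}→{modifiee}")

    # 3. Synonyms of words or multi-word expressions
    for expression, alternatives in synonyms.items():
        if all(word in words for word in expression):
            for alternative in alternatives:
                add(alternative)
    return terms


if __name__ == "__main__":
    deps = [("情報", "技術"), ("技術", "発達"), ("発達", "目覚ましい")]
    syns = {("情報", "技術"): ["IT"], ("発達",): ["進歩"]}
    for term in extract_index_terms(deps, syns):
        print(term)
```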
Search Example: Use of Dependency Relations
Facebookが直面する問題
(Problems that Facebook confronts)
… Spam may be the most subtle problem that
Facebook confronts. ...
… This is the third large-scale sudden privacy
affairs that Facebook confronts. …
… Facebook finally faces a knot that cannot be
left. …
Search Example: Hypernym/Hyponym Relations
風邪の予防に効果的な野菜
(Vegetables effective for preventing the cold)
[Hyponym of vegetable] Chinese cabbage: One of the brightly colored vegetables that contain plenty of vitamin A. … as an effective vegetable that contributes to preventing the cold.
[Hyponym of vegetable] Welsh onion is perfect for preventing the cold because it has the effects of warming your body, facilitating circulation, enhancing appetite and improving immunity.
[Hyponym of vegetable] The recovery effect can be obtained by eating pumpkins, which contain vitamin A. Vitamin A works to prevent skin roughness and the cold.
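A minimal sketch of how indexing hypernyms lets a query for "vegetable" retrieve pages that only mention a specific vegetable. The `HYPERNYMS` table, the whitespace tokenisation, and the function names are assumptions for illustration; they do not describe TSUBAKI's actual index format.

```python
# Hypothetical hypernym table: hyponym -> hypernym.
HYPERNYMS = {
    "chinese cabbage": "vegetable",
    "welsh onion": "vegetable",
    "pumpkin": "vegetable",
}


def index_terms(text, hypernyms=HYPERNYMS):
    """Return the set of index terms for a page: its words plus hypernyms."""
    lowered = text.lower()
    terms = set(lowered.replace(".", " ").replace(",", " ").split())
    for hyponym, hypernym in hypernyms.items():
        if hyponym in lowered:
            terms.add(hyponym)
            terms.add(hypernym)  # index the page under the hypernym as well
    return terms


def search(query_terms, documents):
    """Return ids of documents whose index terms contain every query term."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(term in index_terms(text) for term in query_terms)
    ]


if __name__ == "__main__":
    docs = {
        1: "Welsh onion is perfect for preventing the cold.",
        2: "Electric cars reduce exhaust gas.",
    }
    print(search(["vegetable", "cold"], docs))
```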
Web Services on TSUBAKI
Information Analysis with WISDOM
Analyzing web pages based on various criteria
Input a topic to be analyzed.
Distribution of Senders: We can see major information sender classes!
Distribution of Opinions: The ratio of positive/negative opinions is different for each sender class!
Someone makes conflicting statements!
Example topic: "Are electric cars good for the environment?"
Major/Contradictory Statements: "good for the environment", "not good for the environment"
Major Keywords: "CO2", "fuel consumption", "exhaust gas", ...
Major Senders/Opinions: "Japan Automobile Manufacturers Association"
We can find experts on the topic!
We can grasp at a glance important issues and the distributions of information senders and opinions!
TSUBAKI based on Deeper NLP Enabled by Windows Azure
Flow of TSUBAKI
Offline processing (execute deep NLP on Azure):
Text data (Web pages) → Deep NLP → Analyzed data → Indexing → Index
Deep NLP:
1. Morphological analyzer: JUMAN (C)
2. Dependency/Case/Ellipsis analyzer: KNP (C)
3. Semantic analyzer: SynGraph (Perl)
Online processing (search module):
Search query → Analyze search query → Doc. retrieval and ranking (looking up the index) → Search result
Our Model of Deployment to Azure
• Tasks are completely independent
– Each task processes a small set of Web pages (100 pages)
• Master/Worker model
– Master = Web Role
– Worker = Worker Role
Diagram: Internet, Master, Worker1, Worker2, Worker3, ...
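A minimal sketch of the master/worker split described on this slide: the page pool is cut into independent tasks of 100 pages and the tasks are handed to a pool of workers. A local thread pool stands in for the Azure Worker Roles, and `analyze_task` is a placeholder for the deep NLP step; no Azure API is used, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

PAGES_PER_TASK = 100  # task size stated on the slide


def split_into_tasks(page_ids, pages_per_task=PAGES_PER_TASK):
    """Cut the page list into independent tasks of at most pages_per_task pages."""
    return [
        page_ids[i:i + pages_per_task]
        for i in range(0, len(page_ids), pages_per_task)
    ]


def analyze_task(task):
    """Placeholder for the per-task analysis; returns {page_id: result}."""
    return {page_id: f"analyzed-{page_id}" for page_id in task}


def run_master(page_ids, num_workers=3):
    """Master: create tasks, distribute them to workers, merge the results."""
    tasks = split_into_tasks(page_ids)
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Tasks share no state, so they can run in any order on any worker.
        for partial in pool.map(analyze_task, tasks):
            results.update(partial)
    return results


if __name__ == "__main__":
    output = run_master(list(range(250)))
    print(len(output), "pages analyzed")
```

Because the tasks are independent, adding workers only changes how many tasks run at once, not the result.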
Issues and Solutions
Our Side
• Porting Linux software to Windows
– Perl: Strawberry Perl
– C: Cygwin (32bit)
• Manage 29 services by hand
– Create a manager with Azure APIs
• Large databases of automatically acquired knowledge (20 hours/3.5GB). The network between Azure (Asia) and SINET is narrow?
– Create a virtual hard drive (VHD) on Azure storage (Blob)
Azure Side
• The use of 10,000 CPU cores is challenging!
– Divide into 29x350
– Try 1x350, 2x350, 8x350
• Highly concurrent access to VHD on Blob
– Execute a dummy run before the real run
Finally, we succeeded in using 10,000 CPU cores!
• TODO: real run (29 services)
– Upload all the input data (120M pages; 260GB)
– Execute the analysis
– Download the output data (3TB)
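A worked check of the figures on these slides. It assumes that "29x350" means 29 services of 350 single-core instances each, that the real run keeps the 100-page task size, and that GB and TB are decimal units; the function name `plan_real_run` is hypothetical.

```python
import math

SERVICES = 29
INSTANCES_PER_SERVICE = 350   # assumed: one CPU core per instance
TOTAL_PAGES = 120_000_000     # 120M pages
PAGES_PER_TASK = 100          # assumed to match the deployment model
INPUT_GB = 260
OUTPUT_GB = 3000              # 3TB, decimal units assumed


def plan_real_run():
    """Derive instance, task, and data-volume figures from the slide numbers."""
    total_instances = SERVICES * INSTANCES_PER_SERVICE
    total_tasks = math.ceil(TOTAL_PAGES / PAGES_PER_TASK)
    return {
        "total_instances": total_instances,
        "total_tasks": total_tasks,
        # Upper bound on tasks per instance if tasks are spread evenly.
        "tasks_per_instance": math.ceil(total_tasks / total_instances),
        "output_to_input_ratio": OUTPUT_GB / INPUT_GB,
    }


if __name__ == "__main__":
    for name, value in plan_real_run().items():
        print(name, value)
```

Under these assumptions 29 x 350 gives 10,150 instances, which is consistent with the "10,000 CPU cores" on the slide, and the 120M pages become 1.2M tasks, or at most 119 tasks per instance.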
Search Example: Use of Case/Ellipsis Analysis
クラウドが解決すべき課題
(Problems that should be solved by Cloud)
There are problems. Extensibility and elasticity are the
characteristics of Cloud Computing, but these have not
reached the degree that companies require at the first
priority. If solved, the barrier of cost is lowered and ...
Case frame of 解決 (solve):
ga: system, solution, company, expert, …
wo: issue, problem, incident, trouble, conundrum, defect, …
ni: early, actually, in turn, …
…
Conclusion
Web Services: Web services of search and analysis based on deep NLP
Search Engine Infra TSUBAKI: Deep NLP-based indexing and search
HPC or Cloud, Large Web Page Pool
• Windows Azure: 〜10,000〜 CPUs
Future Work: Execute the Whole Process on Azure!
Offline processing:
Text data (Web pages) → Deep NLP → Analyzed data → Indexing → Index
Online processing (search module):
Search query → Analyze search query → Doc. retrieval and ranking (looking up the index) → Search result
Many thanks to
• Dr. Dennis Gannon (MSR)
• Ms. Hitomi Iida (MS Support)
• Prof. Masaru Kitsuregawa (Univ. of Tokyo)
• Prof. Jun Adachi (NII)