Applying Deep Linguistic Analysis to the Web using Windows Azure
Deep Natural Language Processing for Improving a Search Engine Infrastructure using Windows Azure
Daisuke Kawahara and Sadao Kurohashi
Graduate School of Informatics, Kyoto University
Cloud Futures Workshop 2011 (June 3rd, 2011)

Overview
• Web Services: Web services of search and analysis based on deep NLP
• Search Engine Infra TSUBAKI: Deep NLP-based indexing and search
• HPC or Cloud, Large Web Page Pool
  – Linux clusters/grids: 1,000〜2,000 CPUs
  – Windows Azure: 〜10,000〜 CPUs
• Information on the Web

Search Engine Infrastructure TSUBAKI based on Deep NLP

Knowledge Acquisition and Deep NLP
[Kawahara and Kurohashi, LREC06, HLT-NAACL06]
• Mary ate the salad with a fork
• Mary ate the salad with mushrooms
• クロールで泳いでいる女の子を見た (crawl swim girl saw)
• 望遠鏡で泳いでいる女の子を見た (telescope swim girl saw)
Case frame
• 泳ぐ swim
  – {人 person, 子 child,…}が
  – {クロール crawl, 平泳ぎ,…}で
  – {海 sea, 大海,…}を
• 見る see
  – {人 person, 者,…}が
  – {望遠鏡 telescope, 双眼鏡,…}で
  – {姿 figure, 人 person,…}を

(Figure: processing pipeline)
• Web: 1.6G sentences (100M pages)
• Parsing and Filtering
• Predicate-argument structures
• Clustering
• Case frames for 40K predicates
• Durations shown: 2 weeks, 3 days
• Accuracies shown: 89.0% for all; 98.3% for 20.7% PAs; 89.0% → 89.7%

Case frame examples
(columns: predicate, CS, examples)
• yaku (1) (bake)
  – ga: I:18, person:15, craftsman:10, …
  – wo: bread:2484, meat:1521, cake:1283, …
  – de: oven:1630, frying pan:1311, …
• yaku (2) (have difficulty)
  – ga: teacher:3, government:3, person:3, …
  – wo: hand:2950
  – ni: attack:18, action:15, son:15, …
• yaku (3) (copy; burn CDR)
  – ga: maker:1, distributor:1, …
  – wo: data:178, file:107, copy:9, …
  – ni: R:1583, CD:664, CDR:3, …
• …

Ellipsis (Zero Anaphora) Resolution
[Sasano et al., COLING08]
Input sentences
• Toyota-wa 1997nen hybrid-car Prius-wo hatsubai (launch).
• 2000nen-karawa kaigai-demo (overseas) hanbai-shiteiru (sell).
• Toyota launched the hybrid car Prius in 1997. Φ1 started selling Φ2 overseas in 2000.
Entities
• {Toyota}
• {hybrid car, Prius}
• {kaigai}
Case frames
• hatsubai (launch)(1)
  – ga (nominative): company, SONY, firm, … [NE:ORGANIZATION] 0.15, …
  – wo (accusative): product, CD, model, car, … [CT:ARTIFACT] 0.40, …
  – de (locative): area, shop, world, Japan, … [CT:FACILITY] 0.13, …
• hanbai (sell)(1)
  – ga (nominative): company, Microsoft, … [NE:ORGANIZATION] 0.16, …
  – wo (accusative): goods, product, ticket, … [CT:ARTIFACT] 0.40, …
  – ni (dative): customer, company, user, … [CT:PERSON] 0.28, …
  – de (locative): shop, bookstore, site, … [CT:FACILITY] 0.40, …
Legend: direct case assignment; indirect case assignment (zero anaphora)
Probabilistic estimates of the correspondences

Effect of Corpus Size
[Sasano et al., NAACL-HLT09]
(Figure: line plot; x-axis: corpus size (# of sentences) at 1.6M, 6.3M, 25M, 100M, 400M, 1.6G; y-axis: 0 to 1)
• Accuracy of syntactic analysis
• Accuracy of case structure analysis
• Coverage of case frames
• F-measure of zero anaphora resolution

Indexing
Terms: Words, synonyms/hypernyms of words and dependency relations including zero anaphora
Example sentence (modifier → modifiee):
情報 (information) 技術の (technology) 発達は (growth) 目覚ましいものが (striking) あります (is)
(The growth of information technology is striking.)
Synonyms: IT, 進歩 (progress)
Extracted terms
• Words: 情報 (information), 技術 (technology), 発達 (growth), 目覚ましい (striking)
• Synonyms: IT, 進歩 (progress)
• Dependency relations: 情報→技術, 技術→発達, 発達→目覚ましい

Search Example: Use of Dependency Relations
Query: Facebookが直面する問題 (Problems that Facebook confronts)
• … Spam may be the most subtle problem that Facebook confronts. ...
• … This is the third large-scale sudden privacy affair that Facebook confronts. …
• … Facebook finally faces a knot that cannot be left. …

Search Example: Hypernym/Hyponym Relations
Query: 風邪の予防に効果的な野菜 (Vegetables effective for preventing the cold)
• [Hyponym of vegetable] Chinese cabbage: One of the brightly colored vegetables that contain plenty of vitamin A. … as an effective vegetable that contributes to preventing the cold.
• [Hyponym of vegetable] Welsh onion is perfect for preventing the cold because it has the effects of warming your body, facilitating circulation, enhancing appetite and improving immunity.
• [Hyponym of vegetable] A recovery effect can be obtained by eating pumpkins, which contain vitamin A. Vitamin A works to prevent skin roughness and the cold.

Web Services on TSUBAKI

Information Analysis with WISDOM
Analyzing web pages based on various criteria
• Input a topic to be analyzed.
• Someone makes conflicting statements!
• We can see major information sender classes!
• The ratio of positive/negative opinions is different for each sender class!

Topic: "Are electric cars good for the environment?"
• Distribution of Senders
• Distribution of Opinions
• Major/Contradictory Statements: "good for the environment", "not good for the environment"
• Major Keywords: "CO2", "fuel consumption", "exhaust gas", ...
• Major Senders/Opinions: "Japan Automobile Manufacturers Association"
We can grasp at a glance important issues and the distributions of information senders and opinions!
We can find experts on the topic!

TSUBAKI based on Deeper NLP Enabled by Windows Azure

Flow of TSUBAKI
• Offline processing: Text data (Web pages) → Deep NLP → Analyzed data → Indexing → Index
• Deep NLP (execute deep NLP on Azure)
  1. Morphological analyzer: JUMAN (C)
  2. Dependency/Case/Ellipsis analyzer: KNP (C)
  3. Semantic analyzer: SynGraph (Perl)
• Online processing (search module): Search query → Analyze search query → Doc. retrieval and ranking (looking up the index) → Search result

Our Model of Deployment to Azure
• Task is completely independent
  – Each task processes a small set of Web pages (100 pages)
• Master/Worker model
  – Master = Web Role
  – Worker = Worker Role
(Diagram: Internet, Master, Worker1, Worker2, Worker3, ...)

Issues and Solutions
Our Side
• Manage 29 services → Create a manager with Azure APIs
• Porting Linux software to Windows → by hand; Strawberry Perl; C: Cygwin (32bit)
• The network between Azure (Asia) and SINET is narrow? (20 hours/3.5GB)
• Large databases of automatically acquired knowledge → Create a virtual hard drive (VHD) on Azure storage (Blob)
Azure Side
• The use of 10,000 CPU cores is challenging! → Divide to 29x350; try 1x350, 2x350, 8x350
• Highly concurrent access to VHD on Blob → Execute a dummy run before the real run

Finally, we succeeded in using 10,000 CPU cores!
• TODO: real run
  – Upload all the input data (120M pages; 260GB)
  – Execute the analysis
  – Download the output data (3TB)
(Diagram label: 29 services)

Search Example: Use of Case/Ellipsis Analysis
Query: クラウドが解決すべき課題 (Problems that should be solved by Cloud)
• There are problems. Extensibility and elasticity are the characteristics of Cloud Computing, but these have not reached the degree that companies require at the first priority. If solved, the barrier of cost is lowered and ...
Case frame: 解決 (solve)
  – ga: system, solution, company, expert, …
  – wo: issue, problem, incident, trouble, conundrum, defect, …
  – ni: early, actually, in turn, …

Conclusion
• Web Services: Web services of search and analysis based on deep NLP
• Search Engine Infra TSUBAKI: Deep NLP-based indexing and search
• HPC or Cloud, Large Web Page Pool
  – Windows Azure: 〜10,000〜 CPUs

Future Work: Execute the Whole Process on Azure!
• Offline processing: Text data (Web pages) → Deep NLP → Analyzed data → Indexing → Index
• Online processing (search module): Search query → Analyze search query → Doc. retrieval and ranking (looking up the index) → Search result

Many thanks to
• Dr. Dennis Gannon (MSR)
• Ms. Hitomi Iida (MS Support)
• Prof. Masaru Kitsuregawa (Univ. of Tokyo)
• Prof. Jun Adachi (NII)
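
Illustration (not part of the original slides): the deployment model described above, with completely independent tasks of 100 Web pages each handed out by a master, can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names `make_tasks` and `assign_round_robin` are hypothetical, the round-robin assignment is a stand-in for whatever scheduling the master actually used, and reading "29x350" as 29 services with 350 worker slots each is an interpretation of the slide.

```python
from itertools import islice

PAGES_PER_TASK = 100        # from the slides: each task processes 100 pages
NUM_SERVICES = 29           # from the slides: 29 services
WORKERS_PER_SERVICE = 350   # assumption: "29x350" means 350 worker slots per service


def make_tasks(page_ids, pages_per_task=PAGES_PER_TASK):
    """Split an iterable of page IDs into independent tasks (lists of IDs)."""
    it = iter(page_ids)
    while True:
        task = list(islice(it, pages_per_task))
        if not task:
            return
        yield task


def assign_round_robin(tasks, num_services=NUM_SERVICES):
    """Hypothetical static assignment: deal tasks out to services in turn."""
    queues = [[] for _ in range(num_services)]
    for i, task in enumerate(tasks):
        queues[i % num_services].append(task)
    return queues


if __name__ == "__main__":
    # Toy input: 1,050 page IDs instead of the 120M pages mentioned in the slides.
    tasks = list(make_tasks(range(1050)))
    queues = assign_round_robin(tasks)
    busy = sum(1 for q in queues if q)
    print(len(tasks), "tasks spread over", busy, "of", len(queues), "services")
    print(NUM_SERVICES * WORKERS_PER_SERVICE, "worker slots in total")
```

At the scale given in the slides, 120M pages at 100 pages per task would correspond to 1.2 million tasks. Because the tasks share no state, a failed task could simply be re-queued without affecting the others.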