Term Distillation in Patent Retrieval
Hideo Itoh, Hiroko Mano, Yasushi Ogawa
Ricoh Company, Ltd.
NTCIR-3 Workshop Meeting, 2003/7/14

Cross-DB Retrieval (1)
• The query domain differs from the retrieval target domain.
  – Query domain: news articles. Target domain: patents.
  – A query article is given to the retrieval system, which returns a ranking of patents.

Cross-DB Retrieval (2)
• Problem
  – Incorrect query term weighting caused by the difference in term occurrence distribution between the query domain and the target domain.
• Example
  – The query term "社長 (president)" would be given a large weight, because its df in patents is very low.
  – However, "社長" is not a good term for patent retrieval.

Term Distillation
• A general framework for query term selection in cross-DB retrieval:
  query document
  → term extraction (morphological analysis + stopword list)
  → candidate query terms
  → term selection using TDV (Term Distillation Value)
  → query terms

Term Distillation Value
• TDV represents the "goodness" of a query term.
• Generic model: TDV = QV ・ TV
  – QV: conventional term selection value
  – TV: newly introduced selection value for cross-DB retrieval
• Two probabilities are used to estimate TV:
  – p = Prob(term | target domain)
  – q = Prob(term | query domain)

Instances of TV
  Distillation model        Estimation of TV
  Zero                      constant = 1
  Swets                     p − q
  Naïve Bayes               p / q
  Bayesian classification   p / (p + α・q + ε)
  Binary independence       log p(1 − q) / q(1 − p)
  Target domain             p
  Query domain              1 − q
  Binary                    1 (p > 0) or 0 (p = 0)
  Joint probability         p(1 − q)
  Decision theoretic        log(p / q)

Instances of QV
  Conventional model        Estimation of QV
  Zero                      constant = 1
  Approximated 2-Poisson    tf / (tf + β)
  Term frequency            tf
  IDF                       log(N / df + 1)
  Probabilistic tf * idf    tf / (tf + β) ・ log(N / df + 1)
  tf * idf                  tf ・ log(N / df + 1)

The Cross-DB Retrieval System
• Query DB: IREX news articles. Target DB: NTCIR-3 patents.
• Probabilities are estimated from document frequencies:
  – p = df_T / N_T (df and collection size in the target DB)
  – q = df_Q / N_Q (df and collection size in the query DB)
• A query document is given to the cross-DB retrieval system, which returns a ranking of target documents.

Experimental Results
• Topic = article-only; number of query terms = 8; automatic retrieval with pseudo-relevance feedback.

  QV       TV                    p      q      AveP
  tf       log(p(1−q)/q(1−p))    TITLE  IREX   0.1953
  tf       log(p/q)              TITLE  IREX   0.1948
  tf*idf   p / (p + αq + ε)      TITLE  IREX   0.1844
  tf       p / (p + αq + ε)      TITLE  IREX   0.1843
  1        p / (p + αq + ε)      TITLE  IREX   0.1816
  tf       1 − q                 TITLE  IREX   0.1730
  tf       p/q                   TITLE  IREX   0.1701
  tf       p / (p + αq + ε)      ABST   WHOLE  0.1694
  tf       1                     ー      ー      0.1645

Samples of query terms
(upper: conventional method (tf); lower: with term distillation (tf ・ log(p(1−q)/q(1−p))))
• Topic 0001
  – 装置 (apparatus), サブミクロン (submicron), 液体 (liquid), 工業 (industry), 特殊 (special), 特殊機 (special machine), 分離 (separation), 粒子 (particle)
  – 装置 (apparatus), 乳化 (emulsification), 撹拌 (stirring), 液体 (liquid), 粒子 (particle), 撹拌機 (agitator), 微粒子 (fine particle), サブミクロン (submicron)
• Topic 0002
  – 種子 (seed), 植物 (plant), 福岡 (Fukuoka), 農法 (farming method), 単行本 (book), 引用 (quotation), 漫画 (comic), SEED
  – 種子 (seed), 植物 (plant), 粘土団子 (clay seed ball), 団子 (ball), 粘土 (clay), 農法 (farming method), 農業 (agriculture), 編集 (editing)
• Topic 0003
  – 機器 (equipment), 湯場 (bath place), 日本 (Japan), 制御 (control), 下請け (subcontracting), 発注 (ordering), 仕事 (work), 時代 (era)
  – 制御 (control), 5相 (5-phase), モータ (motor), ステツピングモータ (stepping motor), 電子機器 (electronic equipment), 機器 (equipment), 電子 (electronics)
• Topic 0004
  – エポック (Epoch), エポック社 (Epoch Co.), バンダイ (Bandai), 製造 (manufacturing), 訴訟 (lawsuit), 地裁 (district court), 東京 (Tokyo), 万円 (ten thousand yen)
  – 製造 (manufacturing), 玩具 (toy), ゲイム機 (game machine), カードゲーム (card game), 小型 (compact), ゲーム (game), 技術的 (technical), 指摘 (pointing out)

NTCIR-3 mandatory runs
• Topic = article+supplement; number of query terms = 8; automatic retrieval with pseudo-relevance feedback.

  Run   QV      TV                p      q      AveP
  f021  tf      p / (p + αq + ε)  TITLE  IREX   0.2794
  f020  1       p / (p + αq + ε)  TITLE  IREX   0.2701
  f022  tf      p / (p + αq + ε)  ABST   WHOLE  0.2688
  f019  tf*idf  p / (p + αq + ε)  TITLE  IREX   0.2637

• Retrieval performance is strong in comparison with the other submitted runs.
• Because of the influence of the supplemental data, the effect of term distillation itself is unclear.
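The pipeline above (TDV = QV ・ TV, with p and q estimated from document frequencies over the target and query DBs) can be sketched in a few lines. This is an illustrative sketch, not code from the paper: the function and variable names are invented, the smoothing constant `eps` is our own guard against zero frequencies, and the chosen instances (QV = tf, TV = binary independence log p(1−q)/q(1−p)) are the best-performing combination in the results table.

```python
import math

def estimate_p_q(term, df_target, n_target, df_query, n_query):
    """p = df_T / N_T and q = df_Q / N_Q, from per-DB document frequencies."""
    p = df_target.get(term, 0) / n_target
    q = df_query.get(term, 0) / n_query
    return p, q

def tv_binary_independence(p, q, eps=1e-9):
    """TV instance 'binary independence': log p(1-q) / q(1-p).
    eps is an assumed smoothing term for zero frequencies."""
    return math.log((p * (1 - q) + eps) / (q * (1 - p) + eps))

def tdv(tf, p, q):
    """Generic model TDV = QV * TV, with QV = tf."""
    return tf * tv_binary_independence(p, q)

def distill(candidates, df_target, n_target, df_query, n_query, k=8):
    """Rank candidate (term, tf) pairs by TDV and keep the top-k query terms."""
    scored = []
    for term, tf in candidates:
        p, q = estimate_p_q(term, df_target, n_target, df_query, n_query)
        scored.append((term, tdv(tf, p, q)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:k]]
```

With this scoring, a term like "president" that is common in news articles but rare in patents gets p ≪ q, hence a negative TV and a low TDV, so it drops out of the distilled query.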
NTCIR-3 optional runs
• Automatic retrieval with pseudo-relevance feedback.
  (t = title, d = desc, c = concept, n = narrative, s = supplement)

  Run    Fields    AveP    P@10    Rret
  f018   t,d,c     0.3262  0.4323  1197
  ー      t,d,c,n   0.3056  0.4258  1182
  ー      d         0.3039  0.4032  1133
  ー      t,d       0.2801  0.3581  1100
  ー      t,d,n     0.2753  0.4000  1140
  ー      d,n       0.2750  0.4323  1145
  ー      s         0.2712  0.3806  991
  ー      t,d       0.1283  0.1968  893

Conclusions
• We proposed "term distillation", a general framework for query term selection in cross-DB retrieval.
• Our experiments with the NTCIR-3 patent retrieval test collection demonstrate that term distillation is effective for cross-DB retrieval.
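The run tables report AveP (average precision), P@10, and Rret (relevant documents retrieved). As a reminder of how such figures are computed, here is a minimal sketch of the standard definitions; it is generic evaluation code, not code from the paper, and the names are our own.

```python
def average_precision(ranked, relevant):
    """AveP: mean of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=10):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def relevant_retrieved(ranked, relevant):
    """Rret: number of relevant documents anywhere in the ranking."""
    return sum(1 for doc in ranked if doc in relevant)
```

For example, if relevant documents appear at ranks 1 and 3 of a four-document ranking with two relevant documents in total, AveP = (1/1 + 2/3) / 2 ≈ 0.833.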