Introduction to Chemoinformatics
Kyushu Institute of Technology, Hiroto Saigo

About me (Saigo)
• 1997-2001
 – B.Eng. (Electrical and Electronic Engineering), Sophia University
• 2001-2006
 – Ph.D. (Informatics), Kyoto University
• 2006-2008
 – Researcher, Max Planck Institute for Biological Cybernetics, Germany
• 2008-2010
 – Researcher, Max Planck Institute for Informatics, Germany
• 2010-
 – Associate Professor, Kyushu Institute of Technology

School of Computer Science and Systems Engineering, Kyushu Institute of Technology
[Photo: the view from Kyutech]

Examples of compounds
[Figure: structures of Caffeine, Aspirin, Oseltamivir, Sildenafil and Serotonin]

The ever-growing number of compounds, and chemistry in the computer age
• 18 million structures (as of 2000)
[Figure: "Flood of Information". Number of structures (0 to 30 000 000) plotted against year (1965 to 2000); annotation "4.000 publications / day ?". ©Alexandre Varek]

The role of computers in chemistry
• Databases
 – Storage and retrieval of compound data
• Screening
 – Narrowing down the compounds that have a desired property, using mathematical models, artificial intelligence and similar techniques

Representing compounds in the computer, method 1: the MDL format

The MDL file of aspirin

    -ISIS-  09270222202D

   13 13  0  0  0  0  0  0  0  0999 V2000
     -3.4639   -1.5375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -3.4651   -2.3648    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.7503   -2.7777    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.0338   -2.3644    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.0367   -1.5338    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.7521   -1.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.7545   -0.2997    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -2.0413    0.1149    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
     -3.4702    0.1107    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
     -1.3238   -1.1186    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
     -0.6125   -1.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     -0.6167   -2.3542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
      0.1000   -1.1125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1  2  2  0  0  0  0
    3  4  2  0  0  0  0
    4  5  1  0  0  0  0
    2  3  1  0  0  0  0
    5  6  2  0  0  0  0
    6  1  1  0  0  0  0
    6  7  1  0  0  0  0
    7  8  1  0  0  0  0
    7  9  2  0  0  0  0
    5 10  1  0  0  0  0
   10 11  1  0  0  0  0
   11 12  2  0  0  0  0
   11 13  1  0  0  0  0
  M  END

Annotations in the figure:
• The first two numbers of the counts line ("13 13") are the number of atoms and the number of bonds.
• The first three numbers of an atom line are the x, y and z coordinates of the atom; the first atom is a carbon.
• The first bond is between atoms 1 and 2 and has order 2.

Figure 1-3. The connection table for aspirin in the MDL format (hydrogen-suppressed form). The numbering of the atoms is as shown in the chemical diagram.
©Leach and Gillet, An Introduction to Chemoinformatics, Springer 2007

MDL file: basic information
• The header lines carry the program name and a timestamp ("-ISIS- 09270222202D").
• The counts line gives the number of atoms (13) and the number of bonds (13), followed by the format version (V2000).
[Figure 1-3, repeated. ©Leach and Gillet, Springer 2007]

MDL file: atom information
• Each line of the atom block describes one atom: the first three numbers are the x, y and z coordinates of the atom, followed by its element symbol.
• In the aspirin connection table of Figure 1-3, the first atom is a carbon at (-3.4639, -1.5375, 0.0000).
[Figure 1-3, repeated. ©Leach and Gillet, Springer 2007]
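
As a rough illustration of how the connection table of Figure 1-3 can be read by a program, the following Python sketch parses the counts line, the atom block and the bond block. It is a minimal sketch under stated assumptions: the name line "aspirin" is added here for illustration, the order of the bond lines is one plausible reading of the figure, and the fields are split on whitespace, which works for this small molecule although the V2000 format is defined in fixed-width columns. The function names `parse_molfile` and `neighbours` are hypothetical.

```python
# Aspirin connection table with the values of Figure 1-3 (hydrogen-suppressed).
# The first line (molecule name) is an assumption added for illustration.
MOLFILE = """aspirin
  -ISIS-  09270222202D

 13 13  0  0  0  0  0  0  0  0999 V2000
   -3.4639   -1.5375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4651   -2.3648    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7503   -2.7777    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0338   -2.3644    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0367   -1.5338    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7521   -1.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7545   -0.2997    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0413    0.1149    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4702    0.1107    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3238   -1.1186    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6125   -1.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6167   -2.3542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.1000   -1.1125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  2  3  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  7  9  2  0  0  0  0
  5 10  1  0  0  0  0
 10 11  1  0  0  0  0
 11 12  2  0  0  0  0
 11 13  1  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (atoms, bonds) from a small V2000 molfile.

    atoms: list of (symbol, x, y, z); bonds: list of (atom1, atom2, order)
    with 1-based atom indices. Fields are split on whitespace (a simplification).
    """
    lines = text.splitlines()
    counts = lines[3].split()          # line 4 is the counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return atoms, bonds


def neighbours(bonds):
    """Build an adjacency table {atom index: [bonded atom indices]}."""
    table = {}
    for a, b, _order in bonds:
        table.setdefault(a, []).append(b)
        table.setdefault(b, []).append(a)
    return table


atoms, bonds = parse_molfile(MOLFILE)
print(len(atoms), "atoms,", len(bonds), "bonds")
print("first atom:", atoms[0])
print("neighbours of atom 7:", sorted(neighbours(bonds)[7]))
```

The adjacency table makes the graph view of the molecule explicit: atom 7 (the carboxyl carbon) is bonded to ring atom 6 and to the two oxygens 8 and 9.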

MDL file: bond information
• Each line of the bond block describes one bond: the numbers of the two bonded atoms, followed by the bond order.
• In the aspirin connection table of Figure 1-3, the first bond is between atoms 1 and 2 and has order 2.
[Figure 1-3, repeated. ©Leach and Gillet, Springer 2007]

Representing compounds in the computer, method 2: fingerprints
[Figure: structures and bitstrings of the query and of the database molecules A and B]
Figure 1-7. The bitstring representation of a query substructure is illustrated together with the corresponding bitstrings of two database molecules. Molecule A passes the screening stage since all the bits set to "1" in the query are also set in its bitstring. Molecule B, however, does not pass the screening stage since the bit representing the presence of O is set to "0" in its bitstring.
©Leach and Gillet, Springer 2007
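
The screening rule in the caption of Figure 1-7 amounts to a bitwise subset test. The sketch below shows one way to write it in Python; the 5-bit strings are illustrative values chosen to behave like the query and molecules A and B of the figure (an assumption, since the exact bit layout of the figure is not given in the text), and the function name `passes_screen` is hypothetical.

```python
def passes_screen(query_bits, molecule_bits):
    """Return True if every bit set to 1 in the query is also 1 in the molecule."""
    query = int(query_bits, 2)
    molecule = int(molecule_bits, 2)
    # The molecule passes when AND-ing it with the query leaves the query unchanged.
    return (query & molecule) == query


# Illustrative bitstrings (assumed values, not an authoritative reading of the figure).
QUERY = "01110"
MOLECULE_A = "01111"   # contains all query bits
MOLECULE_B = "11010"   # third bit is 0, so one query bit is missing

print("A passes:", passes_screen(QUERY, MOLECULE_A))
print("B passes:", passes_screen(QUERY, MOLECULE_B))
```

Passing the screen is only a necessary condition: a molecule that passes still has to be checked with an exact substructure match, whereas a molecule that fails can be discarded immediately.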

Examples of fingerprints

  Name      Size (bits)
  Daylight  1028
  Pubchem   881
  MACCS     960

Pubchem fingerprint
PubChem Substructure Fingerprint V1.3 (http://pubchem.ncbi.nlm.nih.gov)
PubChem Substructure Fingerprint Description

Section 1: Hierarchic Element Counts. These bits test for the presence or count of individual chemical atoms represented by their atomic symbol.
  Bit 0: >= 4 H    Bit 1: >= 8 H    Bit 2: >= 16 H    Bit 3: >= 32 H
  Bit 4: >= 1 Li   Bit 5: >= 2 Li   Bit 109: >= 1 Ho   Bit 110: >= 1 Er

Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set. These bits test for the presence or count of the described chemical ring system. An ESSSR ring is any ring which does not share three consecutive atoms with any other ring in the chemical structure. For example, naphthalene has three ESSSR rings (two phenyl fragments and the 10-membered envelope), while biphenyl will yield a count of only two ESSSR rings.
  Bit 115: >= 1 any ring size 3
  Bit 116: >= 1 saturated or aromatic carbon-only ring size 3
  Bit 255: >= 1 aromatic ring
  Bit 256: >= 1 hetero-aromatic ring

Section 3: Simple atom pairs. These bits test for the presence of patterns of bonded atom pairs, regardless of bond order or count.
  Bit 263: Li-H    Bit 264: Li-Li    Bit 265: Li-B    Bit 266: Li-C

Section 4: Simple atom nearest neighbors. These bits test for the presence of atom nearest neighbor patterns, regardless of bond order (denoted by "~") or count, but where bond aromaticity (denoted by ":") is significant.
  Bit 327: C(~Br)(~C)    Bit 328: C(~Br)(~C)(~C)    Bit 329: C(~Br)(~H)    Bit 330: C(~Br)(:C)

Section 5: Detailed atom neighborhoods. These bits test for the presence of detailed atom neighborhood patterns, regardless of count, but where bond orders are specific, bond aromaticity matches both single and double bonds, and where "-", "=", and "#" matches a single bond, double bond, and triple bond order, respectively.
  Bit 416: C=C    Bit 417: C#C    Bit 418: C=N    Bit 419: C#N    Bit 420: C=O

Section 6: Simple SMARTS patterns. These bits test for the presence of simple SMARTS patterns, regardless of count, but where bond orders are specific and bond aromaticity matches both single and double bonds.
  Bit 460: C-C-C#C    Bit 461: O-C-C=N    Bit 462: O-C-C=O    Bit 463: N:C-S-[#1]

Section 7: Complex SMARTS patterns. These bits test for the presence of complex SMARTS patterns, regardless of count, but where bond orders and bond aromaticity are specific.
  Bit 713: Cc1ccc(C)cc1    Bit 714: Cc1ccc(O)cc1    Bit 715: Cc1ccc(S)cc1

https://pubchem.ncbi.nlm.nih.gov/help.html

MACCS keys

  1:('?',0), # ISOTOPE
  #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
  2:('[#103,#104]',0), # ISOTOPE Not complete
  3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...) *NOTE* spec wrong
  4:('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]',0), # actinide
  5:('[Sc,Ti,Y,Zr,Hf]',0), # Group IIIB,IVB (Sc...) *NOTE* spec wrong
  6:('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]',0), # Lanthanide
  7:('[V,Cr,Mn,Nb,Mo,Tc,Ta,W,Re]',0), # Group VB,VIB,VIIB (V...) *NOTE* spec wrong
  8:('[!#6;!#1]1~*~*~*~1',0), # QAAA@1
  9:('[Fe,Co,Ni,Ru,Rh,Pd,Os,Ir,Pt]',0), # Group VIII (Fe...)
  10:('[Be,Mg,Ca,Sr,Ba,Ra]',0), # Group IIa (Alkaline earth)
  11:('*1~*~*~*~1',0), # 4M Ring
  12:('[Cu,Zn,Ag,Cd,Au,Hg]',0), # Group IB,IIB (Cu..)
  13:('[#8]~[#7](~[#6])~[#6]',0), # ON(C)C
  14:('[#16]-[#16]',0), # S-S
  15:('[#8]~[#6](~[#8])~[#8]',0), # OC(O)O

Representing compounds in the computer, method 3: the SMILES format
The simplest SMILES is probably that for methane: C. Note that all four attached hydrogens are implied. Ethane is CC, propane is CCC and 2-methyl propane is CC(C)C (note the branch point). Cyclohexane illustrates the use of ring closure integers; the SMILES is C1CCCCC1. Benzene is c1ccccc1 (note the use of lower case to indicate aromatic atoms). Acetic acid is CC(=O)O. The SMILES for a selection of more complex molecules are provided in Figure 1-4.

  succinic acid: OC(=O)CCC(=O)O
  cubane: C1(C2C3C14)C5C2C3C45
  serotonin: NCCc1c[nH]c2ccc(O)cc12
  trimethoprim: COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC
  progesterone: CC(=O)C1CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C

Figure 1-4. Some examples of the SMILES strings for a variety of molecules.
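
To make the SMILES notation described above more concrete, the following Python sketch splits a SMILES string into atoms, bonds, ring-closure digits and branch parentheses, and counts the non-hydrogen atoms. It is a minimal sketch that covers only a simplified subset of SMILES (an assumption: organic-subset atoms, bracket atoms, simple bonds, ring closures and branches); the names `TOKEN`, `tokenize` and `heavy_atom_count` are hypothetical.

```python
import re

# One regular expression per token kind; two-letter elements are tried first.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"   # bracket, organic, aromatic atoms
    r"|(?P<bond>[-=#:/\\])"                             # explicit bond symbols
    r"|(?P<ring>%\d\d|\d)"                              # ring-closure labels
    r"|(?P<branch>[()])"                                # branch open / close
)


def tokenize(smiles):
    """Split a SMILES string into (kind, text) tokens; raise on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def heavy_atom_count(smiles):
    """Count atom tokens; implied hydrogens are not written in SMILES, so not counted.

    Simplification: an explicit bracket hydrogen such as [H] would be counted as an atom.
    """
    return sum(1 for kind, _text in tokenize(smiles) if kind == "atom")


for name, smiles in [
    ("methane", "C"),
    ("2-methyl propane", "CC(C)C"),
    ("cyclohexane", "C1CCCCC1"),
    ("benzene", "c1ccccc1"),
    ("acetic acid", "CC(=O)O"),
    ("serotonin", "NCCc1c[nH]c2ccc(O)cc12"),
]:
    print(name, heavy_atom_count(smiles))
```

A tokenizer like this is only the first step of a real SMILES parser; building the molecular graph additionally requires pairing the ring-closure digits and tracking the branch stack.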
©Leach and Gillet, Springer 2007

Problems with the SMILES and MDL formats
• SMILES representations of aspirin
 – OC(=O)c1ccccc1OC(=O)C
 – c1cccc(OC(=O)C)c1C(=O)O
• Because the notation for one compound is not unique, comparing the notations directly cannot tell us whether two compounds are identical.
• The MDL format also differs depending on which atom the table starts from.

Unique naming of compounds (1)
• Method based on the Morgan index
 – Assign a number to each atom (in Morgan's algorithm, its number of non-hydrogen neighbours).
 – Repeatedly replace each atom's number by the sum of the numbers of its neighbouring atoms.
 – Start the table from the atom with the largest number.
[Figure: atom values over successive iterations of the Morgan algorithm, with n = 3, 6, 8, 11 and 11 distinct values. From "Representation and Manipulation of 2D Molecular Structures". ©Leach and Gillet, Springer 2007]

Unique naming of compounds (2)
• Definition by IUPAC (International Union of Pure and Applied Chemistry)
 – Example for aspirin: 2-acetoxybenzoic acid

Database search: substructure search
[Figure: the query and the hits Olmidine, Adrenaline, Mefeclorazine, Fenoldopam, Apomorphine and Morphine]
Figure 1-6. An illustration of the range of hits that can be obtained using a substructure search. In this case the dopamine-derived query at the top left was used to search the World Drug Index (WDI) [WDI]. Note that in the case of fenoldopam the query can match the molecule in more than one way.
©Leach and Gillet, Springer 2007

Structure-function similarity of compounds (1)
• Compounds with similar structures often have similar functions.

  Compared with morphine   Similarity
  Codeine                  0.99
  Heroin                   0.95
  Methadone                0.20

Figure 5-5. Similarities to morphine calculated using Daylight fingerprints and the Tanimoto coefficient.
©Leach and Gillet, Springer 2007
Accompanying text from the book: "However, methods that are based on 2D structure will tend to identify molecules with common substructures, whereas the aim is often to identify structurally different molecules. As we have already indicated in Chapter 2, it is well known that molecular recognition depends on the 3D structure and properties (e.g. electrostatics and shape) of a molecule rather than the underlying substructure(s)."

Structure-function similarity of compounds (2)
• Compounds with similar structures tend to bind to the same target proteins or DNA.
[Figure: Testosterone, Estrogen, Dioxin. ©Leach and Gillet, Springer 2007]

Substructure search with fingerprints
• Compound A is a hit because every bit that is 1 in the query is also 1 in A. Compound B is not a hit because its bit corresponding to the O atom is 0.
[Figure 1-7, repeated. ©Leach and Gillet, Springer 2007]

Fingerprint similarity
• Let a be the number of 1s in compound A, b the number of 1s in compound B, and c the number of 1s that A and B have in common.

  A: 1 0 1 1 1 0 1 1 0 0 1 1   (a = 8)
  B: 0 0 1 1 0 0 1 0 1 0 1 1   (b = 6)
  c = 5
  $S_{AB} = \frac{5}{8 + 6 - 5} = 0.56$

• Formulas
 – Euclidean distance: $\sqrt{a + b - 2c}$
 – Hamming distance: $a + b - 2c$
 – Cosine similarity: $\frac{c}{\sqrt{ab}}$
 – Tanimoto coefficient: $\frac{c}{a + b - c}$

Figure 5-2. Calculating similarity using binary vector representations and the Tanimoto coefficient.
A value of zero indicates that there is no similarity (i.e. there are no bits in common between the two molecules). Figure 5-2 provides a simple hypothetical example of the calculation of the Tanimoto similarity coefficient.

Example of a similarity calculation
For the vectors A and B above (a = 8, b = 6, c = 5):
 – Euclidean distance: $\sqrt{a + b - 2c} = \sqrt{8 + 6 - 10} = 2$
 – Hamming distance: $a + b - 2c = 8 + 6 - 10 = 4$
 – Cosine similarity: $\frac{c}{\sqrt{ab}} = \frac{5}{\sqrt{48}} \approx 0.72$
 – Tanimoto coefficient: $\frac{c}{a + b - c} = \frac{5}{8 + 6 - 5} = \frac{5}{9} \approx 0.56$

General substructure search
• A fingerprint that stores every possible substructure would need a bitstring of infinite length.
• Deciding from an MDL or SMILES representation whether a given substructure is present is a hard problem.
 – In computer science it is known as an NP-hard problem (subgraph isomorphism).

Maximum common subgraph (MCS)
• Given several compounds, the problem of finding their Maximum Common Subgraph is also known to be NP-hard.
• Because an MCS is easy to interpret, it is widely used in computational chemistry.
[Figure: molecules A and B and their maximum common subgraph MCS_AB. ©Leach and Gillet, Springer 2007]

Other ways to characterise compounds
• Log P: the (logarithm of the) octanol-water partition coefficient
 – An indicator of whether a compound prefers to dissolve in water or in a lipid-like solvent.
• Well-known estimation methods are atom- and fragment-contribution schemes such as the Wildman-Crippen method and ClogP.
• Polar Surface Area (PSA)
 – The part of the molecular surface that belongs to polar atoms (mainly N and O and the hydrogens attached to them).

Lipinski's rule of five
• Drugs that can be given orally violate no more than one of the following rules:
 – The number of hydrogen-bond donors (number of NH and OH groups) does not exceed 5.
 – The number of hydrogen-bond acceptors (number of N and O atoms) does not exceed 10.
 – The molecular weight is 500 daltons or less.
 – Log P does not exceed 5.
• Every threshold is a multiple of 5.

Summary: types of "descriptors" that characterise compounds
• Binary (0/1)
 – Fingerprint
 – SMILES
• Real-valued
 – LogP, PSA, etc.

Predicting activity from descriptors
[Table: each row is a compound described by descriptor values and labelled with its effect. In the original slide the descriptor columns are drawn as substructures, together with Log P.]

  Descriptors   Effect
  1 1 1         drug
  0 1 0         drug
  0 1 0         drug
  0 1 0         poison
  1 1 0         poison

$f(\text{compound}) = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots$, where each $x_i$ is a descriptor of the compound (drawn as a substructure on the slide).

Least-squares regression
• The horizontal axis (x) is a descriptor of the compound and the vertical axis (y) is the activity. The goal is to draw the line that runs through the data points as closely as possible, that is, the line that minimises the sum of squared deviations in y.
[Figure: scatter plot of y against x with a fitted line]

Quantitative structure-activity relationship analysis
• QSAR (Quantitative Structure Activity Relationship)
• Methods that use structural information (structure-based) vs. methods that do not (ligand-based)

Structure-based methods: docking
• Dock, Gold, Disco, …
Figure 8-3. Operation of the DOCK algorithm. A set of overlapping spheres is used to create a "negative image" of the active site. Ligand atoms are matched to the sphere centres, thereby enabling the ligand conformation to be oriented within the active site for subsequent scoring.
©Leach and Gillet, Springer 2007

Minimising a score or energy function
Figure 8-5. Two simple scoring functions used in docking. On the left is the basic scoring scheme used by the DOCK program [Desjarlais et al. 1988].
On the right is the piecewise linear potential with the parameters shown being those used to calculate steric interactions [Gehlhaar et al. 1995]. (Adapted from Leach 2001.)

Clustering
• A method for dividing compounds into groups that share similar properties.
[Figure: structure drawings of the compounds to be grouped]
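
Clustering needs a pairwise measure of how alike two compounds are. As a minimal sketch, the bit-vector measures used for fingerprints (Euclidean and Hamming distance, cosine similarity, Tanimoto coefficient) can be computed in Python as follows, using the two vectors A and B of Figure 5-2. The function names are hypothetical, and turning a similarity into a distance as 1 minus the Tanimoto coefficient is one common choice rather than the only one.

```python
import math

# The two bit vectors of Figure 5-2.
A = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
B = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1]


def bit_counts(u, v):
    """Return (a, b, c): bits set in u, bits set in v, bits set in both."""
    a = sum(u)
    b = sum(v)
    c = sum(x & y for x, y in zip(u, v))
    return a, b, c


def euclidean_distance(u, v):
    a, b, c = bit_counts(u, v)
    return math.sqrt(a + b - 2 * c)


def hamming_distance(u, v):
    a, b, c = bit_counts(u, v)
    return a + b - 2 * c


def cosine_similarity(u, v):
    a, b, c = bit_counts(u, v)
    return c / math.sqrt(a * b)


def tanimoto(u, v):
    a, b, c = bit_counts(u, v)
    return c / (a + b - c)


def tanimoto_distance(u, v):
    """A dissimilarity in [0, 1] that a clustering method could use."""
    return 1.0 - tanimoto(u, v)


print("a, b, c   =", bit_counts(A, B))
print("Euclidean =", euclidean_distance(A, B))
print("Hamming   =", hamming_distance(A, B))
print("cosine    =", round(cosine_similarity(A, B), 2))
print("Tanimoto  =", round(tanimoto(A, B), 2))
```

With a = 8, b = 6 and c = 5 the Tanimoto coefficient is 5/9, so two identical fingerprints give 1 and fingerprints with no bits in common give 0.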

Hierarchical clustering
• The procedure for grouping seven compounds
Figure 6-2. A dendrogram representing a hierarchical clustering of seven compounds.

Other analysis methods: principal component analysis (PCA)
• Principal Component Analysis (PCA)
 – Mainly used for visualising data.

Software
• Free
 – Openbabel (http://openbabel.org)
  • Conversion between formats such as MOL, SDF and SMILES
  • 2D coordinate calculation, generation of PNG (image) files
 – Chemsketch (http://www.acdlabs.com/resources/freeware/chemsketch/)
  • Drawing compounds
• Commercial
 – MOE (http://www.rsi.co.jp/kagaku/cs/ccg/)
  • Integrated package
 – Chemdraw (https://www.hulinks.co.jp/software/chembiodraw/index.html)
  • Drawing compounds

References
• Andrew R. Leach and Valerie J. Gillet, "An Introduction to Chemoinformatics", Springer
• J. Gasteiger and T. Engel, "Chemoinformatics", Wiley

Contents
• Representation of chemical compounds in computer
 – 2d
 – 3d
 – MOL/SDF format
 – Smile, Fingerprint
• Database
• Software
 – Openbabel
 – Chemsketch
 – Chemdraw
 – Pubchem
 – Dock, Gold,
• Similarity search
 – Compound retrieval
 – Maximum common subgraph
 – Tanimoto index
• Virtual screening (modeling)
 – Lipinski's
 – QSAR/QSPR
 – ADMET
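
As an illustration of the virtual-screening item "Lipinski's" in the outline, the rule of five can be written as a simple filter over precomputed descriptors. The Python sketch below is a minimal example under stated assumptions: the class `Descriptors` and both function names are hypothetical, the descriptor values are supplied by hand rather than computed from a structure, the Log P given for aspirin is an approximate value, and the other two compounds are invented purely to exercise the rules.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors of one compound (values supplied by the caller)."""
    h_bond_donors: int       # number of NH and OH groups
    h_bond_acceptors: int    # number of N and O atoms
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient (log)


def rule_of_five_violations(d):
    """Return a list describing each rule that the compound violates."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    if d.molecular_weight > 500:
        violations.append("molecular weight above 500 Da")
    if d.log_p > 5:
        violations.append("Log P above 5")
    return violations


def passes_rule_of_five(d):
    """A compound passes if it violates no more than one rule."""
    return len(rule_of_five_violations(d)) <= 1


# Aspirin: 1 donor (OH), 4 N/O atoms, MW 180.16; Log P of about 1.2 is approximate.
aspirin = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                      molecular_weight=180.16, log_p=1.2)
# Two invented compounds, for illustration only.
borderline = Descriptors(h_bond_donors=2, h_bond_acceptors=6,
                         molecular_weight=520.0, log_p=4.0)
large_polar = Descriptors(h_bond_donors=7, h_bond_acceptors=12,
                          molecular_weight=650.0, log_p=3.0)

for name, d in [("aspirin", aspirin), ("borderline", borderline),
                ("large_polar", large_polar)]:
    print(name, passes_rule_of_five(d), rule_of_five_violations(d))
```

Returning the list of violated rules, rather than only a yes/no answer, makes it easy to see why a compound was filtered out.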