Speeding up chemical database searches using a proximity filter based on the logical exclusive or.
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding tight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
Version: za2963e q8zab q8zb7 q8zc6 q8zdf q8ze1 q8zfa q8zg8