Speeding up chemical database searches using a proximity filter based on the logical exclusive or.

J Chem Inf Model 48(7):1367-78 (2008) PMID 18593143

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding tight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
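
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each header stores the XOR-folded fingerprint together with the fingerprint's bit count, and that fingerprints are given as sets of on-bit indices. The function names (xor_fold, tanimoto_upper_bound, search) are hypothetical. The bound relies on the fact that XOR folding is linear: the XOR of two folded headers equals the folded XOR of the two fingerprints, so its bit count x can never exceed the number of positions where the fingerprints differ. With bit counts a and b, this gives an intersection of at most (a + b - x) / 2, a union of at least (a + b + x) / 2, and therefore a Tanimoto similarity of at most (a + b - x) / (a + b + x).

```python
def xor_fold(on_bits, m=128):
    """Fold a fingerprint to m bits by modulo wrapping combined with XOR.

    on_bits: iterable of on-bit indices (assumed representation).
    """
    header = 0
    for i in set(on_bits):          # set() guards against repeated indices
        header ^= 1 << (i % m)      # wrap the index, toggle that header bit
    return header


def tanimoto(a, b):
    """Exact Tanimoto similarity of two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_upper_bound(count_a, header_a, count_b, header_b):
    """Upper bound on Tanimoto computed from bit counts and XOR headers only."""
    x = bin(header_a ^ header_b).count("1")   # lower bound on |A xor B|
    total = count_a + count_b
    if total == 0:
        return 1.0
    return (total - x) / (total + x)


def search(query_bits, database, threshold, m=128):
    """Return (name, similarity) pairs at or above threshold.

    database: list of (name, on_bits, bit_count, header) tuples, with the
    bit count and header precomputed (hypothetical storage layout).
    """
    q = set(query_bits)
    q_count, q_header = len(q), xor_fold(q, m)
    hits = []
    for name, bits, count, header in database:
        # Cheap test first: discard the molecule if the bound is unfavorable.
        if tanimoto_upper_bound(q_count, q_header, count, header) < threshold:
            continue
        t = tanimoto(q, bits)       # full comparison only for survivors
        if t >= threshold:
            hits.append((name, t))
    return hits
```

Because the bound never underestimates the true similarity, the filter only skips molecules that could not have met the threshold, so the result matches an exhaustive scan. How much is pruned in practice depends on the threshold and the fingerprint density, as the abstract notes.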

DOI: 10.1021/ci800076s
