The Data Distillery: A Graph Framework for Semantic Integration and Querying of Biomedical Data

Authors

Taha Mohseni Ahooyi, Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.
Benjamin Stear, Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.
J Alan Simmons, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh PA USA.
Vincent T. Metzger, Department of Internal Medicine, Division of Translational Informatics, University of New Mexico Health Sciences Center, University of New Mexico NM USA.
Praveen Kumar, Department of Internal Medicine, Division of Translational Informatics, University of New Mexico Health Sciences Center, University of New Mexico NM USA.
John Erol Evangelista, Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
Daniel J. Clarke, Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
Zhuorui Xie, Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
Heesu Kim, Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
Sherry L. Jenkins, Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
Mano R. Maurya, Department of Bioengineering, University of California San Diego, San Diego CA USA.
Srinivasan Ramachandran, Department of Bioengineering, University of California San Diego, San Diego CA USA.
Eoin Fahy, Department of Bioengineering, University of California San Diego, San Diego CA USA.
Thomas H. Gillespie, Department of Neuroscience, School of Medicine, University of California San Diego, San Diego CA USA.
Fahim T. Imam, Department of Neuroscience, School of Medicine, University of California San Diego, San Diego CA USA.
Natallia Kokash, Institute of Informatics, University of Amsterdam, the Netherlands.
Matthew E. Roth, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX USA.
Robert Fullem, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX USA.
Dubravka Jevtic, Persida Bio, Brooklyn NY USA.
Aleks Mihajlovic, Persida Bio, Brooklyn NY USA.
Michael Tiemeyer, Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia, USA.
Clara Bakker, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Andrew J. Schroeder, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Julia Markowski, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Jared Nedzel, Broad Institute of MIT and Harvard, Cambridge MA USA.
Dave D. Hill, Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.
James Terry, Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.
Christopher Nemarich, Center for Data Driven Discovery, The Children's Hospital of Philadelphia, Philadelphia PA USA.
Jyl Boline, Informed Minds Inc., Walnut Creek, CA USA.
Peter J. Park, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Kristin G. Ardlie, Broad Institute of MIT and Harvard, Cambridge MA USA.
Jeet Vora, Department of Biochemistry and Molecular Medicine, George Washington University, Washington DC USA.

Document Type

Journal Article

Publication Date

10-16-2025

Journal

bioRxiv: the preprint server for biology

DOI

10.1101/2025.08.11.666099

Abstract

The Data Distillery Knowledge Graph (DDKG) is a framework for semantic integration and querying of biomedical data across domains. Built for the NIH Common Fund Data Ecosystem, it supports translational research by linking clinical and experimental datasets in a unified graph model. Clinical standards such as ICD-10, SNOMED, and DrugBank are integrated through UMLS, while genomics and basic science data are structured using ontologies and standards such as HPO, GENCODE, Ensembl, STRING, and ClinVar. The DDKG uses a property graph architecture based on the UBKG infrastructure and supports ontology-based ingestion, identifier normalization, and graph-native querying. The system is modular and can be extended with new datasets or schema modules. We demonstrate its utility for informatics queries across eight use cases, including regulatory variant analysis, tissue-specific expression, biomarker discovery, and cross-species variant prioritization. The DDKG is accessible via a public interface, a programmatic API, and downloadable builds for local use.
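
The abstract names three mechanisms: a property graph model, identifier normalization, and graph-native querying. The following is a minimal sketch of how they can fit together, not the DDKG or UBKG implementation. The `PREFIX:LOCAL` identifier convention, the `PropertyGraph` class, the `normalize_id` helper, the relation names, and all identifiers in the example data are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def normalize_id(raw: str) -> str:
    """Normalize an identifier to 'PREFIX:LOCAL' with an upper-case prefix.

    The 'PREFIX:LOCAL' shape is an assumption made for this sketch.
    """
    prefix, sep, local = raw.strip().partition(":")
    prefix, local = prefix.strip(), local.strip()
    if not sep or not prefix or not local:
        raise ValueError(f"expected 'PREFIX:LOCAL', got {raw!r}")
    return f"{prefix.upper()}:{local}"


class PropertyGraph:
    """Tiny in-memory property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # source id -> [(relation, target id, props)]

    def add_node(self, raw_id, **props):
        # Normalizing on ingestion means differently written identifiers
        # for the same entity collapse onto one node.
        node_id = normalize_id(raw_id)
        self.nodes.setdefault(node_id, {}).update(props)
        return node_id

    def add_edge(self, src, relation, dst, **props):
        src_id = self.add_node(src)
        dst_id = self.add_node(dst)
        self.edges[src_id].append((relation, dst_id, props))

    def neighbors(self, raw_id, relation):
        """Return targets reachable from a node over one relation type."""
        node_id = normalize_id(raw_id)
        return [dst for rel, dst, _ in self.edges.get(node_id, []) if rel == relation]


# Placeholder data only; these are not real ontology identifiers.
graph = PropertyGraph()
graph.add_node("GENE:g1", label="Gene")
graph.add_edge(" gene:g1", "expressed_in", "TISSUE:t1", score=0.9)
graph.add_edge("GENE:g1", "associated_with", "pheno:p1")

print(graph.neighbors("Gene:g1", "expressed_in"))  # ['TISSUE:t1']
```

Because every identifier passes through `normalize_id` before it is stored or looked up, the three spellings of the gene identifier above resolve to a single node, which is what makes a cross-dataset traversal possible in this sketch.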

Department

Biochemistry and Molecular Medicine
