Toward Natural Language Query Processing for Bioinformatics Polystores: BIDS Data Science Lecture

Lecture | July 19 | 3-4 p.m. | 190 Doe Library

 Kurt Stockinger, Zurich University of Applied Sciences (ZHAW)

 Berkeley Institute for Data Science

Abstract: When medical doctors, biologists or pharmaceutical researchers want to analyze gene data, they typically need to query complex bioinformatics data stores. Most of these data stores are based either on relational database technologies, which are commonly used in industry, or on so-called semantic web technologies, which have recently gained traction due to the wide use of linked open data by practitioners in life sciences, medicine and health care around the globe. To query these heterogeneous data stores (so-called polystores) efficiently, end users need to know the query languages SQL and SPARQL. However, both languages require significant technical know-how as well as deep insight into the structure of the underlying data stores. These technical hurdles make it practically impossible for non-technical end users to query the data efficiently.

In this talk, we present a novel query interface called Bio-SODA, which enables end users to query polystores using keyword queries. In particular, we demonstrate how to intuitively query UniProt, one of the world’s most commonly used bioinformatics knowledge bases, using keywords. Moreover, we discuss how to automatically translate keywords into the technical query languages SQL and SPARQL. Our approach is not specific to bioinformatics databases; it is generic and can therefore be applied in other settings where research institutions, enterprises or governments need aggregate insights from queries that span multiple heterogeneous data stores.
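
To give a rough sense of what translating keywords into SPARQL can involve, here is a minimal Python sketch. It is not Bio-SODA's actual method, which the abstract does not describe; it assumes a tiny hand-written lookup table (the hypothetical `KEYWORD_INDEX`) that maps each keyword to one triple pattern. The function name `keywords_to_sparql` is likewise invented for illustration, and the UniProt terms shown should be checked against the current UniProt RDF schema before use.

```python
# Toy keyword-to-SPARQL translation. Illustrative only: NOT the Bio-SODA algorithm.

# Hypothetical keyword index: each keyword maps to one SPARQL triple pattern.
# A real system would derive such mappings from the data store itself.
KEYWORD_INDEX = {
    "protein": "?protein a up:Protein .",
    "human": "?protein up:organism taxon:9606 .",
}

PREFIXES = (
    "PREFIX up: <http://purl.uniprot.org/core/>\n"
    "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
)


def keywords_to_sparql(query, limit=10):
    """Return (sparql_text, unmatched_keywords) for a keyword query."""
    patterns = []
    unmatched = []
    for word in query.lower().split():
        pattern = KEYWORD_INDEX.get(word)
        if pattern is None:
            # Keyword not in the index: report it instead of guessing.
            unmatched.append(word)
        elif pattern not in patterns:
            patterns.append(pattern)
    if not patterns:
        raise ValueError("no keyword could be mapped to a triple pattern")
    body = "\n".join("  " + p for p in patterns)
    sparql = (
        f"{PREFIXES}SELECT DISTINCT ?protein WHERE {{\n{body}\n}}\nLIMIT {limit}"
    )
    return sparql, unmatched


if __name__ == "__main__":
    text, missing = keywords_to_sparql("human protein kinase")
    print(text)
    print("Unmatched keywords:", missing)
```

A lookup table like this only covers keywords that someone has mapped in advance. Handling unmapped terms, ambiguous keywords and joins across several stores is the hard part that a generic interface has to solve.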

 BidsAdmin@berkeley.edu, 510-664-4506