Exploiting shared correlations in probabilistic databases

TitleExploiting shared correlations in probabilistic databases
Publication TypeJournal Articles
Year of Publication2008
AuthorsSen P, Deshpande A, Getoor L
JournalProceedings of the VLDB Endowment
Pagination809 - 820
Date Published2008/08//
ISBN Number2150-8097

There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources --- from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM, the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.