[PLDI24] GenSQL: A Probabilistic Programming System for Querying Generative Models of Database(…)

GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables (Video, PLDI 2024)
Mathieu Huot, Matin Ghavami, Alexander K. Lew, Ulrich Schaechtle, Cameron E. Freer, Zane Shelby, Martin C. Rinard, Feras A. Saad, and Vikash K. Mansinghka
(Massachusetts Institute of Technology, USA; Massachusetts Institute of Technology, USA; Massachusetts Institute of Technology, USA; Digital Garage, Japan; Massachusetts Institute of Technology, USA; Digital Garage, Japan; Massachusetts Institute of Technology, USA; Carnegie Mellon University, USA; Massachusetts Institute of Technology, USA)
Abstract: This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL’s query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies (anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab) and show that GenSQL captures the complexity of the data more accurately than common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone than several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.
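To make the idea of a unified model interface more concrete, here is a minimal Python sketch. It is not GenSQL's API or query syntax: the class `IndependentGaussianModel`, its methods `simulate` and `logpdf`, the helper `flag_anomalies`, the column names, and the threshold are all hypothetical. The sketch assumes that a model of a table can generate synthetic rows and score the log-density of an observed row, and it uses independent Gaussian columns purely as a simplification.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class IndependentGaussianModel:
    """Toy stand-in for a generative model of table rows (hypothetical)."""
    means: dict  # column name -> mean
    stds: dict   # column name -> standard deviation

    def simulate(self, rng):
        """Draw one synthetic row from the model."""
        return {col: rng.gauss(self.means[col], self.stds[col])
                for col in self.means}

    def logpdf(self, row):
        """Log-density of a row; columns are assumed independent."""
        total = 0.0
        for col, x in row.items():
            mu, sd = self.means[col], self.stds[col]
            total += (-0.5 * math.log(2 * math.pi * sd * sd)
                      - (x - mu) ** 2 / (2 * sd * sd))
        return total


def flag_anomalies(model, rows, threshold):
    """Return the rows whose log-density falls below the threshold."""
    return [row for row in rows if model.logpdf(row) < threshold]


# Illustrative data only; the columns and values are made up.
model = IndependentGaussianModel(
    means={"dose": 10.0, "response": 5.0},
    stds={"dose": 1.0, "response": 1.0},
)
rows = [
    {"dose": 10.0, "response": 5.0},   # typical row
    {"dose": 20.0, "response": 5.0},   # dose is ten standard deviations off
]
flagged = flag_anomalies(model, rows, threshold=-10.0)
synthetic = model.simulate(random.Random(0))
```

In this sketch, anomaly detection reduces to filtering rows by model log-density, and synthetic data generation reduces to calling `simulate`. A query language over such an interface could express both declaratively, which is the kind of workflow the abstract describes.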
Video Tags: generative modeling, Bayesian data analysis, AutoML, query language, probabilistic programming, semantics and correctness, pldi24main-p182-p, doi:10.1145/3656409, doi:10.5281/zenodo.10949799, orcid:0000-0002-5294-9088, orcid:0000-0003-3052-7412, orcid:0000-0002-9262-4392, orcid:0009-0005-8897-6394, orcid:0000-0003-1791-6843, orcid:0009-0003-2976-4581, orcid:0000-0001-8095-8523, orcid:0000-0002-0505-795X, orcid:0000-0003-2507-0833, Artifacts Available, Artifacts Evaluated — Reusable
Sponsored by ACM SIGPLAN.