This project is offered together with the Programming Group at the University of St. Gallen (HSG).

Background:

Datalog is a declarative logic language with applications in databases, program analysis, and networking. It is widely used in systems that require strong correctness guarantees, including static analyzers, program verifiers, and policy enforcement engines. The correctness of Datalog engines is therefore critical: bugs in evaluation can compromise the soundness of the systems that rely on them. For instance, an imprecise alias analysis implemented in Datalog may cause a static analyzer to incorrectly validate unsafe code. In Infrastructure-as-Code (IaC), Datalog-inspired policy languages such as Rego are used to define rules that help manage security and compliance. Rego is the language behind the Open Policy Agent (OPA), a tool commonly used in cloud environments to check whether configurations follow these rules.

In software testing, a key challenge is designing a reliable oracle, that is, a procedure for determining whether the output of a test case is correct. This is particularly difficult for systems like Datalog engines, where computing the expected output of a non-trivial input by hand can be infeasible. Metamorphic testing addresses this by checking relations that must hold between the outputs of related inputs, rather than relying on known expected outputs. If two semantically equivalent programs produce different outputs under the same engine, this inconsistency serves as an automated oracle that indicates a potential bug.
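
As a concrete illustration, the sketch below (in Rust) runs an engine on two semantically equivalent transitive-closure programs and flags any disagreement. It assumes the Soufflé CLI (souffle) is installed; the file names and programs are purely illustrative, and a real harness would compare the output relations as sets rather than as raw text.

    use std::fs;
    use std::process::Command;

    // Two semantically equivalent programs: the variant only reorders the
    // body literals of the recursive rule, which cannot change the result.
    const ORIGINAL: &str = "
    .decl edge(x: number, y: number)
    .decl path(x: number, y: number)
    .output path
    edge(1, 2). edge(2, 3).
    path(x, y) :- edge(x, y).
    path(x, y) :- edge(x, z), path(z, y).
    ";

    const VARIANT: &str = "
    .decl edge(x: number, y: number)
    .decl path(x: number, y: number)
    .output path
    edge(1, 2). edge(2, 3).
    path(x, y) :- edge(x, y).
    path(x, y) :- path(z, y), edge(x, z).
    ";

    fn run(program: &str, file: &str) -> String {
        fs::write(file, program).expect("could not write program");
        let out = Command::new("souffle")
            .args(["-D-", file]) // -D- prints all output relations to stdout
            .output()
            .expect("could not run souffle");
        String::from_utf8_lossy(&out.stdout).into_owned()
    }

    fn main() {
        // The metamorphic oracle: equivalent programs must agree.
        assert_eq!(run(ORIGINAL, "original.dl"), run(VARIANT, "variant.dl"));
    }

A failing assertion points to a candidate bug in the engine itself, since the two programs denote the same relation by construction.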

Recent work has shown that real-world Datalog engines, including mature implementations like Soufflé and μZ, contain subtle correctness issues [1, 2]. These bugs often arise from interactions between Datalog's declarative semantics and the low-level implementation details of the engine. For example, different evaluation strategies can lead to inconsistencies when handling negation, recursion, or aggregation. Datalog engines also often include custom performance optimizations, which can introduce bugs that are hard to detect and reproduce. While the metamorphic testing technique presented in previous work has been effective at uncovering such bugs, it relies on hand-written, randomized transformation rules implemented in Python. This limits control over transformation semantics and restricts the ability to explore equivalence classes systematically.

Objective:

This thesis proposes a principled alternative based on e-graphs. An e-graph is a data structure that compactly represents equivalence classes of terms under a set of rewrite rules. Originally developed for automated theorem proving, and later popularized for compiler optimizations through equality saturation, e-graphs are also promising for software testing. They provide a compact and efficient way to represent many equivalent programs simultaneously. Unlike random rewriting, e-graphs apply rewrites exhaustively, allowing systematic exploration of semantic equivalence classes. This improves coverage, enables reproducibility, and ensures that transformations preserve semantics by construction. An example is egg [3], a high-performance and extensible e-graph implementation in Rust.
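
To give a flavor of egg's API, the following minimal sketch (closely following egg's documented usage; the term language and rules are illustrative, not the ones this thesis would use) saturates an e-graph under three Boolean rewrite rules and then extracts the syntactically smallest member of the root equivalence class:

    use egg::{rewrite as rw, *};

    fn main() {
        // Illustrative semantics-preserving rewrites over a symbolic language.
        let rules: &[Rewrite<SymbolLang, ()>] = &[
            rw!("commute-and"; "(and ?a ?b)" => "(and ?b ?a)"),
            rw!("assoc-and"; "(and ?a (and ?b ?c))" => "(and (and ?a ?b) ?c)"),
            rw!("idem-and"; "(and ?a ?a)" => "?a"),
        ];

        // Equality saturation: apply all rules exhaustively (up to built-in
        // limits), so the e-graph represents every reachable equivalent term.
        let expr: RecExpr<SymbolLang> = "(and x (and x y))".parse().unwrap();
        let runner = Runner::default().with_expr(&expr).run(rules);

        // Every term in the root e-class is provably equivalent to `expr`;
        // here we extract the smallest one, e.g. `(and x y)`.
        let extractor = Extractor::new(&runner.egraph, AstSize);
        let (cost, best) = extractor.find_best(runner.roots[0]);
        println!("smallest equivalent term: {best} (cost {cost})");
    }

For testing, one would enumerate many members of an e-class rather than extract a single best one, turning each member into an equivalent test program for the engine under test.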

The initial step of the thesis will be to implement an e-graph engine in Rust and to re-implement the transformation rules from previous work [1, 2] on top of this infrastructure, as sketched below.
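
As a hypothetical starting point (the term encoding and rule names below are assumptions for illustration, not the encoding used in [1, 2]), a Datalog rule body could be represented as an s-expression so that semantics-preserving transformations become egg rewrites:

    use egg::{rewrite as rw, *};

    // Hypothetical encoding: the rule `path(x,y) :- edge(x,z), path(z,y).`
    // could be written as the term `(rule path (and edge path))`, abstracting
    // away variables for the sake of illustration.
    fn body_transformations() -> Vec<Rewrite<SymbolLang, ()>> {
        vec![
            // Reordering body literals preserves Datalog semantics.
            rw!("commute-body"; "(and ?a ?b)" => "(and ?b ?a)"),
            // Duplicating a body literal does not change the fixed point.
            rw!("dup-body"; "(and ?a ?b)" => "(and ?a (and ?a ?b))"),
        ]
    }

    fn main() {
        let expr: RecExpr<SymbolLang> =
            "(rule path (and edge path))".parse().unwrap();
        // The runner's default node and iteration limits keep the growing
        // `dup-body` rule from expanding the e-graph forever.
        let runner = Runner::default()
            .with_expr(&expr)
            .run(&body_transformations());
        println!("explored {} e-classes", runner.egraph.number_of_classes());
    }

Every program extracted from the resulting e-graph is equivalent to the original by construction, which is exactly the guarantee the randomized Python transformations lack.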

[1] Muhammad Numair Mansur, Maria Christakis, and Valentin Wüstholz. Metamorphic testing of Datalog engines. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 639–650, 2021.

[2] Muhammad Numair Mansur, Valentin Wüstholz, and Maria Christakis. Dependency-aware metamorphic testing of Datalog engines. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 236–247, 2023.

[3] Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. egg: Fast and extensible equality saturation. Proceedings of the ACM on Programming Languages, 5(POPL):23:1–23:29, 2021.