Expecto

Extracting Formal Specifications from Natural Language Description for Trustworthy Oracles

Dongjae Lee Kihong Heo

KAIST ProSys Lab

Abstract

Natural-language specifications are easy to write, but they are ambiguous and not directly machine-checkable. Expecto bridges this gap by extracting trustworthy formal specifications from natural-language intent. Its key idea is a top-down, modular specification synthesis algorithm: an LLM first decomposes a complex requirement into typed informal definitions, then Expecto refines each definition into formal DSL components. At every step, the system validates partial specifications with syntax/type checking and SMT-based consistency checking, treating unresolved informal definitions as uninterpreted functions. Across competitive-programming and real-world bug benchmarks, Expecto generates substantially more sound-and-complete specifications than monolithic LLM baselines.

Overview

Expecto overview diagram: user intent flows to specification refinement, validation, worklist management, and a final formal specification.

Key components

SpecRefinement

  • Selects one typed informal definition from the worklist instead of generating the full specification at once.
  • Asks the LLM to produce a focused DSL definition for the selected function or predicate.
  • Introduces helper predicates/functions as new informal definitions when the current target needs decomposition.

SpecValidation

  • Runs syntax and type checks to catch malformed DSL definitions early.
  • Uses SMT solving to check logical consistency and agreement with available input-output examples.
  • Treats unresolved helper definitions as uninterpreted functions, enabling validation even for partial specifications.
  • Sends concrete feedback back to SpecRefinement when a candidate fails validation.

Running Example

Problem description

Depot is a rectangular checkered field of n × m size. Each cell can be empty (".") or occupied by a wall ("*"). You have one bomb. If you lay the bomb at cell (x, y), it wipes out all walls in row x and column y. Determine whether all walls can be wiped out by placing exactly one bomb. The bomb can be laid either in an empty cell or in a wall cell.

Input-output example

possible = True, x = 2, y = 5

Formal Specification

Colored phrases in the natural-language intent correspond to highlighted function names.

def Spec (n, m, depot, possible, x, y) =
possible → ValidPos(x, y, n, m) ∧ WipesAll(x, y, n, m, depot) ∧
¬possible → ∀r, c. ValidPos(r, c, n, m) → ¬WipesAll(r, c, n, m, depot)
def ValidPos (r, c, n, m) =
(1 ≤ r ≤ n) ∧ (1 ≤ c ≤ m)
def WipesAll (r, c, n, m, depot) =
∀i, j. IsWall(i, j, n, m, depot) → ((i + 1 = r) ∨ (j + 1 = c))
def IsWall (r, c, n, m, depot) =
(0 ≤ r < n) ∧ (0 ≤ c < m) ∧ depot[r][c] = "*"

Refinement Target: Spec

Current informal definition

Refinement result

New auxiliary informal definitions

Worklist

Experimental Results

The experiments measure whether generated specifications accept intended behavior, reject unintended behavior, and help detect real bugs.

Benchmarks

  • HumanEval+: 164 Python programming tasks with enhanced test cases for basic algorithms and data structures.
  • APPS: 127 competitive-programming tasks selected from problems with at least 100 test cases.
  • Defects4J: 501 Java methods from 336 real-world bugs, using Javadoc comments as natural-language specifications.

Models & Setup

  • Large language model: GPT-4.1-mini generates and repairs DSL definitions.
  • Symbolic backend: Z3 checks SMT constraints produced from partial specifications.
  • Compared methods: Select an RQ below to see the methods used in the currently displayed result view.

Metrics: Soundness & Completeness

Completeness

Accept every correct input-output pairs

A complete specification must accept all the correct input-output pairs. For instance, in the running example, (True, 2, 5) should be accepted by the specification.

possible = True,
x = 2, y = 5
Accepted
Soundness

Reject more than X% of wrong input-output pairs

A sound specification must reject more than X% of the wrong input-output pairs. We set X = 50 for the experiments. For instance, (True, 1, 1) should be rejected by the specification.

possible = True,
x = 1, y = 1
Rejected

S&C means both sound and complete; S means sound only; C means complete only; W means neither.

Takeaway

Expecto improves the number of sound-and-complete specifications by decomposing the generation problem and validating partial candidates throughout the synthesis process.