MultiTab — Synthetic Dataset Generator

Generate multitask tabular data with controllable task correlations, polynomial complexity, and noise.

Max 200,000 samples in-browser.

How it works

The generator creates a multitask regression dataset where task weight vectors share a controlled pairwise cosine similarity ρ.

  1. Orthogonal unit vectors U are constructed via Gram–Schmidt.
  2. A Gram matrix P is built with ones on the diagonal and ρ in every off-diagonal entry.
  3. The eigendecomposition P = V Λ Vᵀ yields the weight matrix W = V √Λ U, so that W Wᵀ = P.
  4. Input features X are sampled from N(0, 1).
  5. Labels for task t: y_t = X w_t + Σ_{k=2}^{d} (X w_t)^k + ε, where the powers are taken elementwise and ε ~ N(0, σ²).
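
The five steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the generator's actual source: the function name make_multitask and all of its parameters are hypothetical, QR factorization stands in for Gram–Schmidt (both produce an orthonormal set), and it assumes the number of tasks does not exceed the number of features and that ρ is large enough for P to be positive semidefinite.

```python
import numpy as np

def make_multitask(n_samples=1000, n_features=8, n_tasks=3, rho=0.5,
                   degree=3, sigma=0.1, seed=0):
    """Hypothetical sketch of the described pipeline; returns (X, Y, W)."""
    rng = np.random.default_rng(seed)

    # Step 1: orthonormal rows U, shape (T, D). QR is used here in place of
    # Gram-Schmidt (an assumption made for numerical stability).
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, n_tasks)))
    U = Q.T

    # Step 2: Gram matrix with 1 on the diagonal and rho elsewhere.
    P = np.full((n_tasks, n_tasks), float(rho))
    np.fill_diagonal(P, 1.0)

    # Step 3: P = V diag(lam) V^T, then W = V sqrt(Lam) U, so W W^T = P.
    lam, V = np.linalg.eigh(P)
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off
    W = V @ np.diag(np.sqrt(lam)) @ U  # shape (T, D), one row per task

    # Step 4: standard normal input features, shape (N, D).
    X = rng.standard_normal((n_samples, n_features))

    # Step 5: linear term plus elementwise powers 2..degree, plus noise.
    Z = X @ W.T  # shape (N, T); column t is X w_t
    Y = Z.copy()
    for k in range(2, degree + 1):
        Y += Z ** k
    Y += sigma * rng.standard_normal(Y.shape)
    return X, Y, W

X, Y, W = make_multitask()
print(np.round(W @ W.T, 3))  # Gram matrix of the task weight vectors
```

Because the rows of W have unit norm in this construction, the off-diagonal entries of W Wᵀ are exactly the pairwise cosine similarities between task weight vectors.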

The output CSV contains the columns x_0, x_1, …, x_{D-1}, task_0, task_1, …, task_{T-1}, and split, where split is either train or test.
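
As a rough illustration of this layout, the sketch below reads a CSV with these columns using only the Python standard library and groups rows by their split value. The helper name split_rows is hypothetical, and the two sample rows are invented for the example; they are not output from the generator.

```python
import csv
import io

def split_rows(csv_text):
    """Group rows by the `split` column into (features, targets) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    groups = {"train": [], "test": []}
    for row in reader:
        # Feature columns are named x_0, x_1, ...; targets are task_0, task_1, ...
        features = [float(v) for k, v in row.items() if k.startswith("x_")]
        targets = [float(v) for k, v in row.items() if k.startswith("task_")]
        groups[row["split"]].append((features, targets))
    return groups

# Made-up sample with D = 2 features and T = 2 tasks.
sample = (
    "x_0,x_1,task_0,task_1,split\n"
    "0.5,-1.0,0.2,0.3,train\n"
    "1.5,0.25,-0.7,1.1,test\n"
)
groups = split_rows(sample)
print(len(groups["train"]), len(groups["test"]))
```

For a downloaded file, the same function can be applied to the file's contents, for example the string returned by reading it with open(path).read().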

Need more than 200K samples? Use the Google Colab notebook (coming soon).