Multi-Agent · Fully Autonomous · End-to-End

From a research idea
to a polished paper.

MAARS is a multi-agent system that takes a vague idea — or a Kaggle competition URL — and autonomously runs the full research loop: literature survey, task decomposition, sandboxed experiments, paper drafting, and peer review.

/ the three-stage loop
01 Refine: Explorer ↔ Critic
02 Research: Decompose · Execute · Evaluate
03 Write: Writer ↔ Reviewer → Polish
live run — Lorenz attractor
/ the gap

Ideas are just the start.
The paper is the hard part.

Between a one-sentence idea and a finished research artifact lies a long tail of dirty work: scoping, literature surveys, experiment design, reproducible code, error recovery, writing, revising. MAARS puts an agent-orchestrated loop around that tail — and runs it end to end.

3 autonomous stages
0 human loops to close
N iterations until clean
1 polished paper out
/ pipeline

Three stages.
One coherent loop.

Each stage has a stable I/O boundary: a runtime orchestrates control flow and persistence, while agents do the open-ended work.
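In code, that boundary can be as small as an abstract base class. A hypothetical sketch, not the repo's actual Stage:

# Hypothetical sketch of a stage's stable I/O boundary; names are illustrative.
from abc import ABC, abstractmethod
from pathlib import Path

class Stage(ABC):
    """The runtime fixes where inputs come from and where outputs go;
    the open-ended agent work happens inside run()."""

    def __init__(self, session_dir: Path):
        self.session_dir = session_dir

    @abstractmethod
    def run(self, inputs: dict[str, Path]) -> dict[str, Path]:
        """Map named input files to named output files, e.g.
        {"idea": idea.md} -> {"refined_idea": refined_idea.md}."""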

stage 01
Refine

Explorer drafts a proposal from the user's raw idea. Critic reviews it within a declared scope, surfacing gaps and ambiguities. They iterate until zero issues remain.

in  →  idea.md
out →  refined_idea.md
stage 02
Research

The core engine. A strategy decomposes the proposal into atomic tasks, which execute in Docker sandboxes with parallel scheduling. Outputs are verified, then evaluated; when gaps remain, the strategy is updated and the loop runs again.

in  →  refined_idea.md
out →  artifacts/ · tasks/ · evaluations/
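The control flow is plain Python. A minimal sketch of the loop, every name hypothetical:

# Illustrative sketch of the Research stage control flow; not the repo's API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    gaps: list[str] = field(default_factory=list)

def research_loop(tasks, run_in_sandbox, evaluate, update_strategy, max_rounds=5):
    """Runtime owns the loop, scheduling, and termination;
    run_in_sandbox / evaluate / update_strategy are the agent-side calls."""
    results = []
    for _ in range(max_rounds):                     # bounded iteration
        with ThreadPoolExecutor() as pool:          # parallel task scheduling
            results = list(pool.map(run_in_sandbox, tasks))
        evaluation: Evaluation = evaluate(results)  # agent judges the evidence
        if not evaluation.gaps:                     # termination condition
            break
        tasks = update_strategy(tasks, evaluation)  # re-plan around the gaps
    return results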
stage 03
Write

Writer reads all research outputs and drafts a complete paper. Reviewer critiques and drives revisions — same IterationState pattern as Refine. A final Polish sub-step runs a single-pass LLM refinement and appends a deterministic metadata appendix (tokens, timings, scores).

in  →  artifacts/ · refined_idea.md
out →  paper.md · paper_polished.md
Runtime vs Agent

if · for · while, scheduling, retries, termination → runtime. Search, coding, reasoning → agent.
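In miniature (a hedged sketch; agent_call stands in for any agent invocation):

# The runtime/agent split in one helper; hypothetical, not repo code.
import time

def with_retries(agent_call, payload, attempts=3, backoff=2.0):
    """Retries, backoff, and giving up are runtime concerns;
    whatever open-ended work agent_call does is the agent's."""
    for attempt in range(attempts):
        try:
            return agent_call(payload)       # agent: search / code / reason
        except Exception:
            if attempt == attempts - 1:
                raise                        # runtime decides when to stop
            time.sleep(backoff ** attempt)   # runtime decides how to wait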

IterationState pattern

Primary drafts → Reviewer returns {issues, resolved, pass} → Primary revises. Loops until pass=true or the iteration cap.
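A minimal sketch of the loop, assuming only the {issues, resolved, pass} shape above:

# IterationState loop in miniature; illustrative, not the repo's implementation.
def iterate(draft, revise, reviewer, max_iters=5):
    """Primary drafts, Reviewer critiques, Primary revises,
    until the review passes or the iteration cap is hit."""
    for _ in range(max_iters):
        review = reviewer(draft)   # -> {"issues": [...], "resolved": [...], "pass": bool}
        if review["pass"]:
            break
        draft = revise(draft, review["issues"])
    return draft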

DB as single source of truth

State lives on disk — JSON + Markdown under results/{session}/. SSE is notification only; the UI reads canonical data from the session DB.
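The write path can stay that simple. A hypothetical sketch:

# "DB as truth, SSE as notification" in one function; paths and names illustrative.
import json
from pathlib import Path

def persist_and_notify(session_dir: Path, name: str, payload: dict, notify) -> None:
    """Write canonical state to disk first; the SSE event carries only a
    pointer, and the UI re-reads the file it names."""
    path = session_dir / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2))          # canonical state
    notify({"event": "state_updated", "path": str(path)})   # notification only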

/ system

System design.
Clear boundaries, concrete state.

The important split is simple: Python owns control flow and persistence; agents do the open-ended work inside bounded stages.

L5
Entry layer
Frontend (Vanilla JS, SSE streaming, marked.js) · FastAPI server
HTTP · SSE
L4
Orchestration layer
Sequences the three stages; manages lifecycle, session recovery, termination
orchestrator.py
L3
Stage layer
Stage base → ResearchStage · TeamStage → Refine / Write (Polish sub-step)
stable I/O
L2
Execution layer
Multi-agent — all three stages; Agno framework · Gemini with native Google Search
Agno · Gemini
L1
Tools & State layer
ArXiv · Wikipedia · DDGS · Kaggle API · PyPDF · Docker sandbox · file-based session DB
persistent

Every run is a directory.

State is just files. No hidden database, no vector store. You can cd into any session and read exactly what each agent said — drafts, critiques, plan trees, execution logs, polished paper. Reproducible by construction.

results/{session}/
├── idea.md                     # user input
├── proposals/ critiques/       # Refine: rounds
├── refined_idea.md             # Refine: final
├── calibration.md              # Research: calibration
├── strategy/ plan_tree.json    # Research: strategy · plan
├── plan_list.json              # Research: flat task list
├── tasks/ artifacts/           # code · figures · data
├── evaluations/                # Research: eval rounds
├── results_summary.{json,md}   # Research: summary
├── drafts/ reviews/            # Write: rounds
├── paper.md                    # Write: draft
├── paper_polished.md           # Write: polished (sub-step)
├── meta.json                   # metadata (tokens, score)
├── log.jsonl                   # SSE stream log
├── execution_log.jsonl         # Docker exec log
└── reproduce/                  # Dockerfile · run.sh
/ implementation

Implementation choices.
Simple runtime, bounded execution.

The stack is intentionally plain: keep orchestration legible, keep execution isolated, and keep state inspectable on disk.

backend
Python · FastAPI

Runtime, API server, SSE broadcast, session DB.

agents
Agno · Gemini

Multi-agent framework; per-stage model override; native Google Search.

execution
Docker sandbox

Isolated Python exec with CPU/RAM/GPU quotas, network toggle, timeouts.
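Roughly the shape of one sandboxed run, sketched with standard Docker CLI flags; the image name and limits are made up:

# Sketch of a quota-bounded sandbox run. --cpus, --memory, --network, and the
# subprocess timeout are standard; "maars-sandbox" is a hypothetical image.
import subprocess

def run_sandboxed(code_path: str, allow_network: bool = False, timeout_s: int = 600):
    cmd = [
        "docker", "run", "--rm",
        "--cpus", "2", "--memory", "4g",                     # CPU / RAM quotas
        "--network", "bridge" if allow_network else "none",  # network toggle
        "-v", f"{code_path}:/work/main.py:ro",               # read-only mount
        "maars-sandbox",
        "python", "/work/main.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)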

frontend
Vanilla JS · SSE

Streaming log viewer + right-panel state dashboard reading from DB.

tools
ArXiv · Wikipedia · DDGS · Kaggle · PyPDF

Literature survey, web search, competition data, paper parsing — wired into agents.

storage
File-based DB

JSON & Markdown on disk. No cloud, no migrations. grep-friendly.

reproducibility
Dockerfile + run.sh

Every session emits a reproduce bundle — rerun any experiment from scratch.

/ showcase

Real runs.
Real papers.

Two end-to-end runs from the repo's showcase/ folder — each one producing code, figures, a polished paper, and a reproduce bundle.

case · scientific visualization

Lorenz attractor — 4 figures that tell the chaos story

A one-line prompt: solve the Lorenz system with RK45 and produce 4 chaos figures. Out comes a full paper — derivations, code, and the plots below.

time 11 min · tokens 347k · paper in showcase/
fig.1  3D phase-space trajectory (σ=10, ρ=28, β=8/3)
fig.2  bifurcation over ρ ∈ [0, 50]
fig.3  trajectory divergence → max Lyapunov exponent
fig.4  (σ,ρ) stability heatmap, 30×30 grid
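The computation behind fig.1 fits in a few lines. A sketch assuming SciPy, not the code MAARS generated:

# Lorenz system via RK45; a sketch of the fig.1 computation.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

sol = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0],
                method="RK45", dense_output=True, rtol=1e-9, atol=1e-12)
xyz = sol.sol(np.linspace(0, 50, 20000))   # trajectory for the 3D phase plot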
case · deep learning empirical study

Does legitimate transfer learning weaken backdoor watermarks?

An empirical question: does legitimate fine-tuning erase a BadNets-style watermark? MAARS poisons a CIFAR-10 source, fine-tunes to CIFAR-100 under Head-only vs Full FT, and delivers the comparative figures that land in the paper.

time 3 h · tokens 2.47M · paper in showcase/
fig.1  watermark persistence across fine-tuning
fig.2  Head-only vs Full FT — accuracy vs watermark
fig.3  anchor-class metrics on CIFAR-100
fig.4  trigger-input prediction distribution
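The watermark itself is the simple part: a BadNets-style trigger patch, sketched here with PyTorch (illustrative; patch size and target class are arbitrary, not the run's actual settings):

# BadNets-style poisoning sketch; not the code from this run.
import torch

def poison(images: torch.Tensor, labels: torch.Tensor,
           target_class: int = 0, patch: int = 3):
    """Stamp a white trigger patch in the corner and relabel to the target class.
    images: (N, C, H, W) in [0, 1]."""
    images = images.clone()
    images[:, :, -patch:, -patch:] = 1.0   # white square, bottom-right corner
    return images, torch.full_like(labels, target_class)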
/ interface

A window into the loop.

The left panel streams what every agent is thinking. The right panel is a live dashboard that reads canonical state from the session DB — proposals, plan trees, drafts, reviews.

start.sh — one command boot
web UI — streaming left, state right
/ quick start

Run it locally.
One command to boot.

Requirements: Python 3.10+, Docker running, a Gemini API key.

# clone & run
git clone https://github.com/dozybot001/MAARS.git
cd MAARS
bash start.sh

On first boot start.sh creates a venv, installs deps, scaffolds .env from the example, builds the sandbox image, and serves on localhost:8000. GPU is auto-detected.

Drop in an idea

Paste a research idea, a path to a UTF-8 text/markdown file, or a Kaggle competition URL. Kaggle mode auto-extracts the competition ID, downloads data, and skips Refine.

Tune per stage

Set MAARS_REFINE_MODEL, MAARS_RESEARCH_MODEL, etc. to override the default Gemini model per stage.

GPU passthrough

Install the NVIDIA Container Toolkit, set MAARS_DOCKER_SANDBOX_GPU=true, and the sandbox gets --gpus all for PyTorch training.
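Put together, a hypothetical .env might read (variable names from this page; model values are placeholders):

# hypothetical .env; values are examples, not MAARS defaults
MAARS_REFINE_MODEL=gemini-2.5-flash
MAARS_RESEARCH_MODEL=gemini-2.5-pro
MAARS_DOCKER_SANDBOX_GPU=true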

/ docs

Read the design docs.

Bring an idea.
Walk away with a paper.

MAARS is MIT-licensed, self-hostable, and reproducible by construction. Fork it, run it, break it.