# Architecture
spark-tui follows a modular architecture with clear separation between configuration, data fetching, analysis, and rendering.
## Module Map
```text
src/
├── main.rs
├── config/
│   └── mod.rs            CLI args, env vars, ~/.databrickscfg parsing
├── fetch/
│   ├── client.rs         SparkHttpClient + FetchError
│   ├── spark.rs          Endpoint methods (get_jobs, get_stages, etc.)
│   ├── types.rs          Spark API response types (serde)
│   ├── databricks.rs     DatabricksClient (cluster info, DBFS, sparkui, history server)
│   ├── orchestrator.rs   poll_once, assemble_data_payload, compute_health_summary
│   ├── poller.rs         run_poller + historical fallback chain
│   └── eventlog/         Event log parsing (DBFS download, gzip, SparkEvent serde)
├── analyze/
│   ├── types.rs          Suspect, Severity, SuspectCategory, BottleneckPattern
│   ├── skew/             Data skew detection (CV + max/median)
│   ├── suspects/         SuspectContext, 10 detectors, bottleneck classification
│   └── sql_linker/       Job ↔ SQL ↔ Stage mapping
├── tui/
│   ├── app/              App state, event loop, key handling, rendering dispatch
│   │   ├── state.rs
│   │   ├── input.rs
│   │   └── render.rs
│   ├── theme.rs          Color/style functions
│   ├── highlight.rs      SQL/plan syntax highlighting
│   ├── tabs/
│   │   ├── jobs_list.rs     Jobs table
│   │   ├── job_detail.rs    Stage breakdown for a job
│   │   ├── sql_detail.rs    SQL execution plan view
│   │   ├── stage_detail.rs  Detailed stage metrics
│   │   └── suspects.rs      Suspects table view
│   └── widgets/
│       ├── help.rs          Help overlay
│       ├── status_line.rs   Status bar
│       └── summary_bar.rs   Health summary bar
└── util/
    ├── format/           format_duration_ms, format_bytes, truncate, clean_stage_name
    └── time/             Spark timestamp parsing, duration_between
```
## Data Flow
```text
┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌───────────┐
│ Config   │────▶│ SparkHttp    │────▶│ Poller       │────▶│ Analysis  │
│ resolve  │     │ Client       │     │ (poll_once)  │     │ Engine    │
└──────────┘     └──────────────┘     └──────┬───────┘     └─────┬─────┘
                                             │                   │
                                      DataPayload             Suspects
                                      + stage_sql_hints       (via SuspectContext)
                                      + critical_stages
                                             │                   │
                                             ▼                   ▼
                                 ┌───────────────────────────────┐
                                 │           App (TUI)           │
                                 │   event loop ← mpsc channel   │
                                 └───────────────────────────────┘
```
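The shapes flowing along these arrows can be sketched as plain Rust types. This is an illustrative sketch: only the names mentioned on this page (`DataPayload`, `Suspect`, `stage_sql_hints`, `critical_stages`, the `Action` channel messages) come from the source; the field types and exact layouts are assumptions.

```rust
// Hedged sketch of the payload and message shapes implied by the diagram.
// Field types and struct layouts are illustrative assumptions.
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Suspect {
    stage_id: u64,
    description: String,
}

#[derive(Debug, Default)]
struct DataPayload {
    stage_sql_hints: HashMap<u64, String>, // stage id -> SQL plan hint
    critical_stages: Vec<u64>,             // longest wall-clock stage per job
    suspects: Vec<Suspect>,
}

// Everything the poller sends to the App travels over one channel as an Action.
#[derive(Debug)]
enum Action {
    DataUpdate(DataPayload),
    FetchError(String),
}

fn main() {
    let payload = DataPayload {
        critical_stages: vec![3],
        ..Default::default()
    };
    let action = Action::DataUpdate(payload);
    if let Action::DataUpdate(p) = action {
        println!("critical stages: {:?}", p.critical_stages);
    }
}
```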
Step by step:

1. **Config resolution** (`config/mod.rs`) — parses CLI args, env vars, and `~/.databrickscfg` to produce a `Config` struct with host, token, cluster_id, and poll_interval.
2. **HTTP client** (`fetch/client.rs`) — `SparkHttpClient` wraps `reqwest::Client` with the base URL and token. `FetchError` maps HTTP status codes to user-friendly messages.
3. **Endpoint methods** (`fetch/spark.rs`) — `discover_app_id`, `get_jobs`, `get_stages`, `get_sql_executions`, `get_task_list`, `get_executors` — each calls the Spark REST API and deserializes the response.
4. **Background poller** (`fetch/poller.rs`) — `run_poller` runs in a tokio task. When the cluster becomes unreachable (503 or terminated), the poller automatically falls back to historical data via a 4-strategy chain: Spark UI REST API (with warm-up retry), Spark History Server proxy, DBFS event logs, and default DBFS path scanning. `poll_once` lives in `fetch/orchestrator.rs` (separate from the poller loop). It:
   - Fetches jobs, stages, SQL executions, and executors concurrently via 4-way `tokio::join!`
   - Aggregates active executors into `ClusterResources` (total memory, cores, executor count)
   - Builds cross-reference maps (job↔SQL, stage↔job)
   - Creates a `SuspectContext` with the cross-reference maps
   - Runs 10 stage-level detectors via a function pointer table, plus skew detection on task data
   - Fetches task lists for up to ~15 stages (selected by multiple heuristics)
   - Computes `stage_sql_hints` (SQL plan hints per stage) and `critical_stages` (longest wall-clock stage per job)
   - Computes a `HealthSummary` for the summary bar
   - Sends a `DataPayload` (including `cluster_resources`, `stage_sql_hints`, `critical_stages`) through an mpsc channel
5. **Analysis** (`analyze/`) — 10 stage-level detectors are dispatched via a function pointer table (`&[DetectorFn]`): `detect_slow_stages`, `detect_spill`, `detect_cpu_efficiency`, `detect_record_explosion`, `detect_task_failures`, `detect_memory_pressure`, `detect_partition_count`, `detect_broadcast_join`, `detect_python_udf`, `detect_cache_opportunity`. Each takes `(&[SparkStage], &SuspectContext)` and returns `Vec<Suspect>`. `detect_skew` runs separately on task data. `aggregate_suspects` sorts by severity, then by `estimated_savings_ms`.
6. **App event loop** (`tui/app/`) — `App::run` receives `Action` variants from the mpsc channel:
   - `Action::DataUpdate(payload)` — stores the new data
   - `Action::FetchError(err)` — stores the error message
   - `Action::Key(event)` — processes keybindings
   - `Action::Resize(w, h)` — triggers a re-render
7. **Rendering** (`tui/tabs/`, `tui/widgets/`) — renders the current view mode (List, JobDetail, StageDetail, SqlDetail) using ratatui widgets. The summary bar widget displays health metrics in List view.
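The function-pointer dispatch in the analysis step can be illustrated with a small self-contained sketch. The `SparkStage` fields, detector bodies, and thresholds below are simplified placeholder assumptions, not the real implementations; only the `&[DetectorFn]` + `flat_map` pattern mirrors the code described above.

```rust
// Simplified sketch of the &[DetectorFn] dispatch pattern.
// Struct fields, detector logic, and thresholds are assumptions.

struct SparkStage {
    stage_id: u64,
    duration_ms: u64,
    spill_bytes: u64,
}

struct SuspectContext; // the real one carries the cross-reference maps

#[derive(Debug)]
struct Suspect {
    stage_id: u64,
    reason: &'static str,
}

type DetectorFn = fn(&[SparkStage], &SuspectContext) -> Vec<Suspect>;

fn detect_slow_stages(stages: &[SparkStage], _ctx: &SuspectContext) -> Vec<Suspect> {
    stages
        .iter()
        .filter(|s| s.duration_ms > 60_000) // threshold is an assumption
        .map(|s| Suspect { stage_id: s.stage_id, reason: "slow stage" })
        .collect()
}

fn detect_spill(stages: &[SparkStage], _ctx: &SuspectContext) -> Vec<Suspect> {
    stages
        .iter()
        .filter(|s| s.spill_bytes > 0)
        .map(|s| Suspect { stage_id: s.stage_id, reason: "disk spill" })
        .collect()
}

// Adding a detector is just one more entry in this table.
const DETECTORS: &[DetectorFn] = &[detect_slow_stages, detect_spill];

fn run_detectors(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect> {
    DETECTORS.iter().flat_map(|d| d(stages, ctx)).collect()
}

fn main() {
    let stages = vec![
        SparkStage { stage_id: 1, duration_ms: 120_000, spill_bytes: 0 },
        SparkStage { stage_id: 2, duration_ms: 5_000, spill_bytes: 1 << 20 },
    ];
    let suspects = run_detectors(&stages, &SuspectContext);
    println!("{suspects:?}"); // one slow-stage suspect, one spill suspect
}
```

Because every detector shares one signature, each one stays independently testable and the dispatch loop never changes when a detector is added.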
## Async Model
spark-tui uses the tokio runtime with three concurrent tasks:
| Task | Channel | Description |
|---|---|---|
| Poller | `tx` → `rx` | Fetches data and sends `Action::DataUpdate` / `Action::FetchError` |
| Event reader | `tx` → `rx` | Reads terminal events via `crossterm::event::read` (blocking, wrapped in `spawn_blocking`) |
| App loop | `rx` | Receives all actions and processes them sequentially |
All tasks communicate through a single `mpsc::UnboundedSender<Action>` channel. The app loop owns the receiver and processes actions one at a time, ensuring thread-safe state updates without locks.
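The same fan-in shape can be shown with a std-only analogue: two producer threads stand in for the tokio poller task and the `spawn_blocking` event reader, sharing one sender, while a single consumer owns the receiver. The real app uses `tokio::sync::mpsc::UnboundedSender<Action>` and async tasks; `std::thread` and `std::sync::mpsc` are substituted here so the sketch runs standalone.

```rust
// std-only analogue of the three-task model: many senders, one receiver.
use std::sync::mpsc;
use std::thread;

enum Action {
    DataUpdate(String),
    Key(char),
}

fn main() {
    let (tx, rx) = mpsc::channel::<Action>();

    // Stand-in for the poller task.
    let poller_tx = tx.clone();
    let poller = thread::spawn(move || {
        poller_tx.send(Action::DataUpdate("payload".into())).unwrap();
    });

    // Stand-in for the blocking terminal-event reader.
    let event_tx = tx;
    let events = thread::spawn(move || {
        event_tx.send(Action::Key('q')).unwrap();
    });

    poller.join().unwrap();
    events.join().unwrap();

    // The app loop owns the receiver and applies actions one at a time,
    // so state mutation needs no locks.
    for action in rx {
        match action {
            Action::DataUpdate(p) => println!("data: {p}"),
            Action::Key(k) => println!("key: {k}"),
        }
    }
}
```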
## Design Decisions
- **Bounded task fetching**: Task lists (per-task metrics) are fetched for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed.
- **Concurrent fetches**: Jobs, stages, SQL executions, and executors are fetched in parallel with a 4-way `tokio::join!` to minimize latency.
- **Function pointer dispatch**: Stage-level detectors are stored in a `&[DetectorFn]` array and dispatched via `flat_map`, making it easy to add new detectors.
- **SuspectContext**: Replaces ad-hoc parameter passing — all cross-reference maps are bundled in a single struct with helper methods (`job_id`, `resolve_sql`, `resolve_plan_hint_for`, `enrich`).
- **tui-scrollview**: Used for smooth scrolling in StageDetail and SqlDetail views, replacing manual `u16` scroll offsets with `ScrollViewState`.
- **Log file**: Logs go to `/tmp/spark-tui.log` instead of stderr to avoid corrupting the TUI.
- **Panic hook**: A custom panic hook restores the terminal before printing the panic message, preventing terminal corruption.
- **Edition 2024**: Uses the latest Rust edition for modern language features.
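The panic-hook idea can be sketched in a few lines of std-only Rust. `restore_terminal` here is a hypothetical stand-in for whatever crossterm teardown the real app performs (e.g. disabling raw mode and leaving the alternate screen); the take-hook/delegate pattern is the part that matters.

```rust
// Hedged sketch: restore the terminal first, then run the default hook
// so the panic message still prints on a sane screen.
use std::panic;

fn restore_terminal() {
    // Stand-in for crossterm teardown (disable raw mode, leave the
    // alternate screen) — hypothetical, not the real function.
    eprintln!("[terminal restored]");
}

fn install_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        restore_terminal();
        default_hook(info); // then print the panic message as usual
    }));
}

fn main() {
    install_panic_hook();
    let caught = panic::catch_unwind(|| panic!("boom"));
    assert!(caught.is_err()); // the panic was observed after the hook ran
}
```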
## Dependencies
| Crate | Purpose |
|---|---|
| `clap` | CLI argument parsing with env var fallback |
| `tokio` | Async runtime (`macros`, `rt-multi-thread`, `time`, `sync` features) |
| `reqwest` | HTTP client (with `rustls-tls`) |
| `serde` / `serde_json` | JSON deserialization |
| `thiserror` | Error type derivation |
| `ratatui` | Terminal UI framework (with `unstable-rendered-line-info` feature) |
| `crossterm` | Terminal backend |
| `tracing` / `tracing-subscriber` | Structured logging |
| `chrono` | Timestamp parsing |
| `syntect` / `syntect-tui` | SQL syntax highlighting |
| `tui-scrollview` | Smooth scrollable views for detail panels |
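For orientation, the feature flags in the table above would appear in `Cargo.toml` roughly as follows. This is a reconstructed fragment, not the project's actual manifest: the version numbers are placeholders, and only the crate names and feature lists are taken from the table.

```toml
# Hypothetical fragment; versions are placeholder assumptions.
[dependencies]
tokio = { version = "1", features = ["macros", "rt-multi-thread", "time", "sync"] }
reqwest = { version = "0.12", features = ["rustls-tls"] }
ratatui = { version = "0.29", features = ["unstable-rendered-line-info"] }
```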
## CI/CD Workflows
| Workflow | Trigger | Description |
|---|---|---|
| `ci.yml` | Push / PR | Runs `cargo fmt --check`, `cargo clippy`, `cargo test` |
| `docs.yml` | Push / PR | Builds and deploys mdBook documentation to GitHub Pages |
| `auto-tag.yml` | Push to master (`Cargo.toml` changed) | Creates a `vX.Y.Z` tag when the version in `Cargo.toml` changes |
| `release.yml` | Tag `v*` | Cross-platform release builds (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) with GitHub Release artifacts |