Introduction

spark-tui is a terminal-based performance analysis tool for Apache Spark applications running on Databricks. It connects to the Spark REST API through the Databricks driver proxy and presents live job metrics, stage breakdowns, and automated suspect detection in an interactive TUI.

Why spark-tui?

Debugging Spark performance problems typically involves clicking through the Spark UI in a browser, manually comparing stage durations, and guessing which stages have data skew or excessive spill. This process is slow and error-prone.

spark-tui automates this analysis:

  • Automatic suspect detection — identifies slow stages, data skew, and disk spill without manual inspection
  • Bottleneck classification — categorizes root causes as Large Scan, Wide Shuffle, or Data Explosion
  • Actionable recommendations — each finding includes a concrete tuning suggestion
  • SQL correlation — links stages back to the originating SQL query and shows plan hints
  • Live updates — polls the Spark API on a configurable interval and refreshes the display

What You See

The interface has two main tabs:

  1. Jobs — all Spark jobs ranked by duration (slowest first), with drill-down to stage details, duration bar charts, and SQL execution plans
  2. Suspects — automatically detected performance issues, sorted by severity (critical first), with category labels, I/O summaries, and recommendations

How It Works

spark-tui connects to the Spark History Server API exposed through Databricks’ driver proxy endpoint:

https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1

A background poller fetches jobs, stages, SQL executions, and task lists at regular intervals. The analysis engine processes this data to detect anomalies, then the TUI renders the results in real time.

Quick Start

Prerequisites

  • Rust toolchain — version 1.85 or later (spark-tui uses edition 2024)
  • Databricks workspace — with a running cluster and an active Spark application
  • Personal access token — generated from Databricks Settings > Developer > Access Tokens

Installation

git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo install --path .

Or run directly without installing:

cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...

Configuration

You need three pieces of information:

| Field | Example |
|---|---|
| Workspace host | adb-1234567890.azuredatabricks.net |
| Personal access token | dapi0123456789abcdef... |
| Cluster ID | 0123-456789-abcdef |

Provide them via any of these methods (highest priority first):

Option 1: CLI flags

spark-tui \
  --host adb-123.azuredatabricks.net \
  --token dapi0123456789abcdef \
  --cluster-id 0123-456789-abcdef

Option 2: Environment variables

export DATABRICKS_HOST=adb-123.azuredatabricks.net
export DATABRICKS_TOKEN=dapi0123456789abcdef
export DATABRICKS_CLUSTER_ID=0123-456789-abcdef
spark-tui

Option 3: ~/.databrickscfg file

Create or edit ~/.databrickscfg:

[my-workspace]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef
cluster_id = 0123-456789-abcdef

Then run with a specific profile:

spark-tui --profile my-workspace

Or let spark-tui auto-detect the first complete profile:

spark-tui

First Run

When spark-tui starts, it will:

  1. Resolve configuration (CLI > env > databrickscfg)
  2. Connect to the Spark REST API via the driver proxy
  3. Discover the active Spark application
  4. Fetch jobs, stages, and SQL executions
  5. Display the Jobs tab with results ranked by duration

You should see a table of Spark jobs. Use j/k or arrow keys to navigate, Enter to drill into a job, and Tab to switch to the Suspects tab.

If something goes wrong, check the Troubleshooting guide.

Configuration

spark-tui requires three credentials to connect to a Databricks cluster: host, token, and cluster ID. These can be provided through CLI flags, environment variables, or a ~/.databrickscfg file.

Priority Resolution

Configuration is resolved in this order (highest priority first):

  1. CLI flags — --host, --token, --cluster-id
  2. Environment variables — DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_CLUSTER_ID
  3. ~/.databrickscfg — INI-format file with profile sections

CLI flags and environment variables are handled by clap with the env feature — each flag falls back to its corresponding env var automatically.

If CLI flags and environment variables don't supply all three required fields, spark-tui reads ~/.databrickscfg to fill the gaps. You can mix sources: for example, set host and token via env vars but cluster_id via the config file.
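This field-by-field fallback can be sketched as follows (a minimal illustration; the actual Config type in config/mod.rs will differ):

```rust
// Illustrative sketch of layered config resolution (CLI > env > file).
#[derive(Default, Clone)]
struct PartialConfig {
    host: Option<String>,
    token: Option<String>,
    cluster_id: Option<String>,
}

impl PartialConfig {
    /// Fill any missing field from a lower-priority source.
    fn or_else_from(mut self, fallback: &PartialConfig) -> PartialConfig {
        self.host = self.host.or_else(|| fallback.host.clone());
        self.token = self.token.or_else(|| fallback.token.clone());
        self.cluster_id = self.cluster_id.or_else(|| fallback.cluster_id.clone());
        self
    }

    /// All three required fields are present.
    fn is_complete(&self) -> bool {
        self.host.is_some() && self.token.is_some() && self.cluster_id.is_some()
    }
}
```

Resolution applies `or_else_from` once per lower-priority layer: fields already set by CLI/env are never overwritten by the config file.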

CLI Reference

| Flag | Short | Env Var | Default | Description |
|---|---|---|---|---|
| --host | | DATABRICKS_HOST | | Workspace hostname |
| --token | | DATABRICKS_TOKEN | | Personal access token |
| --cluster-id | | DATABRICKS_CLUSTER_ID | | Cluster ID |
| --profile | -p | DATABRICKS_CONFIG_PROFILE | auto-detect | Profile name from ~/.databrickscfg |
| --poll-interval | | SPARK_TUI_POLL_INTERVAL | 10 | Poll interval in seconds |
| --event-log-path | | SPARK_TUI_EVENT_LOG_PATH | | DBFS path to a Spark event log file |
| --sparkui-cookie | | SPARK_TUI_SPARKUI_COOKIE | | DATAPLANE_DOMAIN_DBAUTH cookie value |

~/.databrickscfg Format

The file uses INI format with named profile sections:

[DEFAULT]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef

[production]
host = adb-999.azuredatabricks.net
token = dapi_prod_token
cluster_id = 0123-456789-prod

[development]
host = adb-123.azuredatabricks.net
token = dapi_dev_token
cluster_id = 0456-789012-dev

Profile selection

  • Explicit: spark-tui --profile production uses the [production] section
  • Auto-detect: without --profile, spark-tui scans all profiles and uses the first one that has all three required fields (host, token, cluster_id)

If the named profile doesn’t exist, spark-tui lists available profiles in the error message.

Base URL Construction

spark-tui constructs the Spark REST API base URL as:

https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1

The host field is normalized: any https:// prefix and trailing slashes are stripped before URL construction.
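A sketch of that normalization and URL construction (illustrative, not the exact code in fetch/client.rs):

```rust
/// Normalize a workspace host (strip any https:// prefix and trailing
/// slashes) and build the driver-proxy base URL described above.
fn base_url(host: &str, cluster_id: &str) -> String {
    let host = host.trim_start_matches("https://").trim_end_matches('/');
    format!("https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1")
}
```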

Poll Interval

The --poll-interval flag controls how often spark-tui refreshes data from the Spark API (default: 10 seconds). Lower values give more responsive updates but increase API load.

# Refresh every 5 seconds
spark-tui --poll-interval 5

Historical Mode

When spark-tui detects a terminated cluster (HTTP 503 or INVALID_STATE response), it automatically attempts to load historical Spark data using a 4-strategy fallback chain:

| Priority | Strategy | Description |
|---|---|---|
| 0 | Spark UI REST API | Probes https://{host}/sparkui/{cluster}/{driver}/api/v1/. Requires spark_context_id from the cluster info. Also tries the dataplane domain variant (adb-dp- prefix). |
| 1 | Spark History Server | Probes known Databricks history server proxy URLs (multiple path patterns). |
| 2 | DBFS event logs | Reads event logs from the cluster’s cluster_log_conf delivery path, or from --event-log-path if specified. |
| 3 | Default DBFS paths | Scans well-known DBFS directories (dbfs:/cluster-logs/, dbfs:/databricks/spark/eventLogs/, etc.) for event log files. |

The first strategy that succeeds provides the historical data. The status line shows a HISTORICAL badge.

Spark UI warm-up

The Historical Spark UI needs to download and parse event logs from DBFS before serving JSON data. During this warm-up phase, it returns an HTML loading page instead of JSON. spark-tui detects this and retries with backoff (3s, 5s, 10s, 15s, 20s — ~53s total), showing progress messages like “Spark UI loading… retrying (2/5)”.
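The schedule above can be expressed as a constant (names are illustrative; the real retry loop lives in the fetch layer):

```rust
/// Warm-up retry delays in seconds, matching the schedule described above.
const WARMUP_DELAYS_SECS: [u64; 5] = [3, 5, 10, 15, 20];

/// Total time spent waiting if all five retries are exhausted (~53s).
fn total_warmup_secs() -> u64 {
    WARMUP_DELAYS_SECS.iter().sum()
}
```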

On authenticated Databricks workspaces, the Spark UI endpoint requires a cookie instead of a Bearer token:

  1. Open the Databricks workspace in your browser
  2. Navigate to your cluster’s Spark UI tab (this warms up the endpoint)
  3. Open browser DevTools (F12) → Application → Cookies
  4. Find the adb-dp-* domain (e.g., adb-dp-1234567890.azuredatabricks.net)
  5. Copy the value of the DATAPLANE_DOMAIN_DBAUTH cookie

Then pass it to spark-tui:

spark-tui --sparkui-cookie "eyJ0eXAiOiJKV1Q..." --cluster-id 0123-456789-abcdef
# or
export SPARK_TUI_SPARKUI_COOKIE="eyJ0eXAiOiJKV1Q..."
spark-tui --cluster-id 0123-456789-abcdef

Using --event-log-path

If you know the exact DBFS path to a Spark event log file:

spark-tui --event-log-path "dbfs:/cluster-logs/0123-456789-abcdef/eventlog/events.log.gz"

This is useful when the automatic DBFS scanning doesn’t find your logs (e.g., custom log delivery paths).

Logging

spark-tui writes logs to /tmp/spark-tui.log (logs cannot go to stderr as it would corrupt the TUI). Control the log level with the RUST_LOG environment variable:

RUST_LOG=info spark-tui    # Info and above
RUST_LOG=debug spark-tui   # Debug messages
RUST_LOG=trace spark-tui   # Everything

Default log level is warn.

Navigation & Keybindings

spark-tui uses vim-style keybindings for navigation. The interface has three view modes arranged in a drill-down hierarchy.

View Modes

List ──Enter──▶ JobDetail ──Enter──▶ StageDetail
                    │
                    └──s──▶ SqlDetail

Esc goes back one level from any view.

  1. List — the top-level view showing either the Jobs or Suspects tab
  2. JobDetail — stage breakdown and duration bar chart for a selected job
  3. StageDetail — detailed metrics for a selected stage (I/O, CPU %, peak RAM, task histograms, per-executor breakdown, skew metrics)
  4. SqlDetail — scrollable SQL execution plan for the selected job’s query

Keybindings

Global

| Key | Action |
|---|---|
| q | Quit the application |
| Esc | Go back one level (SqlDetail → JobDetail → List → Quit) |
| h | Toggle help overlay (keybinding reference in most views; PySpark recommendations in SqlDetail) |

List Mode (Jobs / Suspects tabs)

| Key | Action |
|---|---|
| Tab | Switch to next tab |
| Shift+Tab | Switch to previous tab |
| j / ↓ | Move selection down |
| k / ↑ | Move selection up |
| g / Home | Jump to first row |
| G / End | Jump to last row |
| Enter | Drill into the selected job’s detail view |

Status bar hint: q:quit Tab:switch j/k:nav Enter:detail h:help

JobDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Move selection down in the stage list |
| k / ↑ | Move selection up in the stage list |
| g / Home | Jump to first stage |
| G / End | Jump to last stage |
| Enter | Drill into the selected stage’s detail view |
| s | Open SQL plan view (if the job has a linked SQL execution) |
| Esc | Return to List mode |

Status bar hint: Esc:back j/k:nav Enter:stage s:sql h:help

Note: When entering JobDetail from the Suspects tab, pressing Esc returns to the Suspects tab (not Jobs). This is tracked via the return_tab field.

StageDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| Esc | Return to JobDetail mode |

Status bar hint: Esc:back j/k:scroll g/G:top/bot h:help

SqlDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| h | Show PySpark recommendations for suspects related to this SQL execution |
| Esc | Return to JobDetail mode |

Status bar hint: Esc:back j/k:scroll g/G:top/bot h:hints

Help Overlay

Pressing h toggles a help overlay:

  • In List, JobDetail, and StageDetail modes: Shows a general keybinding reference card listing all available shortcuts for the current view.
  • In SqlDetail mode: Shows PySpark-specific recommendations based on suspect findings related to the current SQL execution.

Press h again or Esc to dismiss the overlay.

Tabs

Jobs Tab

Displays all Spark jobs in a table, ranked by duration (slowest first). Running jobs (with no completion time) appear at the top. Columns include:

  • Job ID
  • Status (with color coding)
  • Duration
  • Task counts
  • SQL description (if linked)
  • Submission time

Suspects Tab

Displays automatically detected performance issues, sorted by severity (Critical first), then by estimated savings descending as a tiebreaker. Each row shows:

  • Severity indicator (color-coded)
  • Category (Slow Stage / Data Skew / Data Size Skew / Record Count Skew / Disk Spill / CPU Bottleneck / I/O Bottleneck / Record Explosion / Task Failures / Memory Pressure / Executor Hotspot / Too Many Partitions / Too Few Partitions / Broadcast Join Opportunity / Python UDF / Cache Opportunity)
  • Stage ID and job ID
  • Title with key metrics
  • Detail summary
  • Recommendation

Color Coding

| Color | Meaning |
|---|---|
| Red | Critical severity, failed status |
| Yellow | Warning severity, running status |
| Green | Healthy / succeeded status |
| Gray | Muted / secondary information |
| Cyan | Selected row highlight |
| Magenta | CP — Critical Path stage (longest-running stage per job) |

CPU Utilization (Stage Detail)

| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% or < 30% | CPU saturated or severe I/O bound |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized |

Peak Memory (Stage Detail)

Color-coded relative to total cluster memory (ratio-based), with absolute fallback when executor data is unavailable. See Understanding Analysis for threshold details.

Summary Bar

The health summary bar in List view uses colored foreground text (not colored background) to display job/IO counts and top issues: red for critical, yellow for warning, green for healthy.

Understanding Analysis

spark-tui automatically detects performance issues in your Spark application and presents them as suspects. This guide explains how each detector works, what the thresholds mean, and how to act on the findings.

Suspect Categories

Slow Stage

Detects stages whose executor_run_time is statistically anomalous compared to all completed stages.

How it works:

  1. Computes the mean and standard deviation of executor_run_time across all completed stages
  2. Flags stages that exceed the threshold

| Severity | Threshold |
|---|---|
| Warning | executor_run_time > mean + 2 * stddev |
| Critical | executor_run_time > mean + 4 * stddev |

The suspect detail shows how many times slower the stage is compared to the average (e.g., “3.5x slower than average”).
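The rule can be sketched in a few lines (illustrative function names, not the actual detector in analyze/suspects/):

```rust
/// Population mean and standard deviation of completed-stage runtimes.
fn mean_stddev(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Warning above mean + 2σ, critical above mean + 4σ, per the table above.
fn slow_stage_severity(runtime: f64, mean: f64, stddev: f64) -> Option<&'static str> {
    if runtime > mean + 4.0 * stddev {
        Some("critical")
    } else if runtime > mean + 2.0 * stddev {
        Some("warning")
    } else {
        None
    }
}
```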

Data Skew

Detects uneven task duration distribution within a stage, indicating skewed partitions.

How it works:

  1. Collects all task durations for the stage
  2. Computes the coefficient of variation (CV = stddev / mean) and the max/median ratio
  3. Flags if either metric exceeds threshold

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |

The suspect detail identifies the slowest task, its duration vs. the median, and how much data it processed.

Note: Task-level analysis is performed for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed.
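A minimal sketch of the CV and max/median checks (assuming a non-empty duration list; names are illustrative):

```rust
/// Duration-skew check: coefficient of variation (stddev / mean) and
/// max/median ratio, with the thresholds from the table above.
fn skew_severity(mut durations: Vec<f64>) -> Option<&'static str> {
    let n = durations.len() as f64;
    let mean = durations.iter().sum::<f64>() / n;
    let var = durations.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / n;
    let cv = var.sqrt() / mean;
    durations.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = durations[durations.len() / 2];
    let max = *durations.last().unwrap();
    if cv > 2.0 || max > 10.0 * median {
        Some("critical")
    } else if cv > 1.0 || max > 3.0 * median {
        Some("warning")
    } else {
        None
    }
}
```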

Data Size Skew

Detects uneven data size distribution across tasks within a stage.

How it works:

  1. Computes the total bytes processed per task (input_bytes + shuffle_read_bytes)
  2. Applies the same CV and max/median ratio thresholds as duration skew

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |

The suspect detail identifies the task processing the most data, its byte count vs. the median.

Record Count Skew

Detects uneven record count distribution across tasks within a stage.

How it works:

  1. Computes the total records processed per task (input_records + shuffle_read_records)
  2. Applies CV and max/median ratio thresholds (only when max records > 1000)

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median (and max > 1000) |
| Critical | CV > 2.0 or max > 10x median |

Indicates hot keys in joins or group-bys.

Disk Spill

Detects stages where data was spilled from memory to disk, indicating insufficient executor memory.

How it works:

  1. Checks disk_bytes_spilled for each stage
  2. Any spill > 0 is flagged

| Severity | Threshold |
|---|---|
| Warning | disk_bytes_spilled > 0 |
| Critical | disk_bytes_spilled > 1 GB |

The suspect detail shows both memory spill and disk spill amounts.

CPU Bottleneck

Detects stages where the CPU is fully saturated for a sustained period.

How it works:

  1. Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time
  2. Flags stages with high CPU ratio and significant runtime

| Severity | Threshold |
|---|---|
| Warning | cpu_ratio > 0.9 and runtime > 30s |

The suspect detail shows CPU time vs. runtime and utilization percentage.

I/O Bottleneck

Detects stages that are I/O or GC bound (low CPU utilization despite significant runtime).

How it works:

  1. Uses the same CPU ratio as CPU Bottleneck detection
  2. Flags stages with low CPU ratio

| Severity | Threshold |
|---|---|
| Warning | cpu_ratio < 0.3 and runtime > 10s |

Consider increasing memory, improving data locality, using faster storage, or checking GC pauses.
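Both detectors share the same ratio. A sketch of the two checks, with the unit conversion from the formula above (executor_cpu_time in nanoseconds, executor_run_time in milliseconds; names are illustrative):

```rust
/// cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time,
/// i.e. CPU time converted from ns to ms, divided by wall runtime in ms.
fn cpu_ratio(executor_cpu_time_ns: u64, executor_run_time_ms: u64) -> f64 {
    (executor_cpu_time_ns as f64 / 1_000_000.0) / executor_run_time_ms as f64
}

/// Thresholds from the CPU Bottleneck and I/O Bottleneck tables above.
fn classify_cpu_io(ratio: f64, runtime_ms: u64) -> Option<&'static str> {
    if ratio > 0.9 && runtime_ms > 30_000 {
        Some("cpu-bottleneck")
    } else if ratio < 0.3 && runtime_ms > 10_000 {
        Some("io-bottleneck")
    } else {
        None
    }
}
```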

Record Explosion

Detects stages where output records vastly exceed input records, indicating explode(), cross joins, or generate() operations.

How it works:

  1. Checks if output_records > 10x input_records (only when input_records > 1000)

| Severity | Threshold |
|---|---|
| Warning | output_records > 10x input_records |
| Critical | output_records > 100x input_records |

Task Failures

Detects stages with failed or killed tasks.

How it works:

  1. Checks if num_failed_tasks > 0 or num_killed_tasks > 0

| Severity | Threshold |
|---|---|
| Warning | Any failed or killed tasks |
| Critical | Failure rate > 10% or total problematic > 10 |

Common causes include OOM, data corruption, and fetch failures.

Memory Pressure

Detects stages where memory spill is occurring but hasn’t yet reached disk — a proactive warning before disk spill happens.

How it works:

  1. Checks if memory_bytes_spilled > 50 MB and disk_bytes_spilled == 0

| Severity | Threshold |
|---|---|
| Warning | memory_bytes_spilled > 50 MB with no disk spill |

Recommendation: increase spark.executor.memory or spark.executor.memoryOverhead, reduce partition size.

Executor Hotspot

Detects stages where a single executor handles a disproportionate share of data.

How it works:

  1. Sums input_bytes + shuffle_read_bytes per executor
  2. Flags executors processing > 50% of total data

| Severity | Threshold |
|---|---|
| Warning | One executor handles > 50% of data |

Check data locality and partition assignment. This may indicate skewed partition-to-executor mapping.

Too Many Partitions

Detects stages with excessive small partitions, causing high scheduling overhead.

How it works:

  1. Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
  2. Flags stages with too many tiny partitions

| Severity | Threshold |
|---|---|
| Warning | num_tasks > 10,000 and avg_bytes_per_task < 1 MB |

The recommendation suggests a target partition count to achieve ~128 MB/partition: df.coalesce(N).

Too Few Partitions

Detects stages with too few large partitions, causing stragglers and underutilized executors.

How it works:

  1. Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
  2. Flags stages with too few large partitions

| Severity | Threshold |
|---|---|
| Warning | num_tasks ≤ 8 and avg_bytes_per_task > 1 GB |

The recommendation suggests a target partition count: df.repartition(N).
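Both partition-count recommendations target roughly 128 MB per partition; the suggested N can be sketched as (constant and function names are illustrative):

```rust
/// Target partition size used by the coalesce/repartition suggestions.
const TARGET_PARTITION_BYTES: u64 = 128 * 1024 * 1024;

/// Suggested partition count N for df.coalesce(N) / df.repartition(N).
fn target_partitions(total_bytes: u64) -> u64 {
    total_bytes.div_ceil(TARGET_PARTITION_BYTES).max(1)
}
```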

Broadcast Join Opportunity

Detects shuffle joins where one side is small enough to broadcast, eliminating the shuffle entirely.

How it works:

  1. Filters stages with shuffle_write_bytes < 100 MB and executor_run_time > 5s
  2. Checks if the SQL plan contains join indicators (SortMerge, ShuffledHash, Join)

| Severity | Threshold |
|---|---|
| Warning | shuffle_write < 100 MB and join detected in SQL plan |

Recommendation: from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key').

Python UDF

Detects Python UDF usage in SQL plans, which causes row-by-row serialization overhead.

How it works:

  1. Searches the SQL plan hint for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner
  2. If the stage is also CPU-bound (ratio > 0.9, runtime > 30s), severity escalates to Critical

| Severity | Threshold |
|---|---|
| Warning | Python UDF marker found in plan, runtime > 5s |
| Critical | Python UDF + CPU ratio > 0.9 + runtime > 30s |

Recommendation: Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() functions.

Cache Opportunity

Detects repeated computations (stages with the same name) that could benefit from caching.

How it works:

  1. Groups completed stages by cleaned name
  2. Flags groups where ≥ 2 stages share a name and total runtime > 30s

| Severity | Threshold |
|---|---|
| Warning | ≥ 2 stages with same name, total runtime > 30s |

Recommendation: df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action. Call df.unpersist() when no longer needed.

Bottleneck Classification

When a slow stage or spill suspect is detected, spark-tui classifies the root cause based on I/O patterns:

| Pattern | Condition | Meaning |
|---|---|---|
| Data Explosion | input > 100 MB and output > 5x input | Stage produces far more data than it reads (e.g., explode, cross join) |
| Large Scan | input > 1 GB and input > 10x (output + shuffle_write) | Stage reads a lot but produces little (missing pushdown filters) |
| Wide Shuffle | shuffle_write > 500 MB or shuffle_read > input | Stage shuffles more data than it reads directly (broad join, groupBy on high-cardinality key) |
| Record Explosion | output_records > 10x input_records | Attached to record explosion suspects (see above) |

If none of these patterns match, no bottleneck tag is shown.
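The classification order can be sketched as a chain of pattern checks (an illustration of the table above, not the actual classifier; the record-based pattern is handled separately by the record explosion detector):

```rust
/// Classify a stage's bottleneck from its byte-level I/O, matching the
/// conditions in the table above. Returns None when no pattern applies.
fn classify_bottleneck(
    input: u64,
    output: u64,
    shuffle_read: u64,
    shuffle_write: u64,
) -> Option<&'static str> {
    const MB: u64 = 1024 * 1024;
    const GB: u64 = 1024 * MB;
    if input > 100 * MB && output > 5 * input {
        Some("Data Explosion")
    } else if input > GB && input > 10 * (output + shuffle_write) {
        Some("Large Scan")
    } else if shuffle_write > 500 * MB || shuffle_read > input {
        Some("Wide Shuffle")
    } else {
        None
    }
}
```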

Recommendations

All recommendations use PySpark-specific syntax for immediate applicability. Each suspect includes a recommendation based on its category and bottleneck pattern:

| Category + Bottleneck | Recommendation |
|---|---|
| Data Skew | Repartition or salt skewed keys |
| Data Size Skew | Repartition by a more uniform key or use salting |
| Record Count Skew | Check for hot keys in joins or group-bys |
| Disk Spill | spark.conf.set('spark.executor.memory', '8g') or df.repartition(200) |
| CPU Bottleneck | Replace @udf with @pandas_udf or native F.when()/F.expr(). Cache with df.cache() and increase parallelism |
| I/O Bottleneck | spark.conf.set('spark.executor.memory', '8g'), cache hot DataFrames with df.cache(), or use df.repartition() for better locality |
| Record Explosion | Filter before explode: df.filter(...).select(explode('col')). Check for unintentional cross joins |
| Task Failures | Check executor logs for OOM/fetch failures. spark.conf.set('spark.task.maxFailures', '4') |
| Memory Pressure | spark.conf.set('spark.executor.memory', '8g') and spark.conf.set('spark.executor.memoryOverhead', '2g') |
| Executor Hotspot | Check data locality and partition assignment |
| Too Many Partitions | df.coalesce(N) to target ~128 MB/partition |
| Too Few Partitions | df.repartition(N) to target ~128 MB/partition |
| Broadcast Join Opportunity | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key') |
| Python UDF | Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() |
| Cache Opportunity | df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action |
| Slow Stage + Large Scan | df.filter(F.col('date') >= '2024-01-01') and select only needed columns. Use partition pruning |
| Slow Stage + Wide Shuffle | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), ...). Pre-aggregate with groupBy before joins |
| Slow Stage + Data Explosion | Filter before explode: df.filter(...).withColumn('x', explode('arr')) |
| Slow Stage (no pattern) | df.explain(True) to see the query plan. Large shuffle may indicate missing filters or broad joins |

Estimated Savings

Each suspect includes an estimated_savings_ms field — a rough estimate of how much time could be saved by addressing the issue. This is used as a secondary sort key (after severity) so that higher-impact issues appear first within the same severity level.

How savings are computed per category

| Category | Estimation Method |
|---|---|
| Slow Stage | executor_run_time - mean_runtime (time above average) |
| Disk Spill | ~30% of executor_run_time (spill overhead) |
| CPU Bottleneck | ~20% of executor_run_time |
| I/O Bottleneck | ~20% of executor_run_time |
| Record Explosion | ~50% of executor_run_time |
| Task Failures | executor_run_time × failure_rate (retry overhead) |
| Memory Pressure | ~10% of executor_run_time (GC pause overhead) |
| Too Many Partitions | ~40% of executor_run_time (scheduling overhead) |
| Too Few Partitions | ~50% of executor_run_time (straggler overhead) |
| Broadcast Join Opportunity | ~60% of executor_run_time (shuffle elimination) |
| Python UDF | ~50% of executor_run_time (serialization overhead) |
| Cache Opportunity | total_runtime - min_single_runtime (repeated computation) |

These are heuristic estimates intended for prioritization, not precise predictions.
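The ordering this produces can be sketched with a simple comparator (types are illustrative stand-ins for the real Suspect struct):

```rust
/// Severity variants in display order: deriving Ord on this enum makes
/// Critical sort before Warning.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
enum Severity {
    Critical,
    Warning,
}

/// Sort (severity, estimated_savings_ms) pairs: severity ascending
/// (Critical first), then savings descending as the tiebreaker.
fn sort_suspects(suspects: &mut Vec<(Severity, u64)>) {
    suspects.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));
}
```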

SQL Correlation

Each suspect is linked to its originating SQL execution when possible. The suspect shows:

  • SQL ID — the Spark SQL execution identifier
  • SQL Description — the query text or description
  • SQL Plan Hint — the top operations from the physical plan (e.g., “HashAggregate -> Exchange -> Scan parquet”)

This helps trace the suspect back to the specific query that caused it.

I/O Summary

Slow stage and spill suspects include an I/O summary showing:

  • Input bytes / records
  • Output bytes / records
  • Shuffle read bytes / records
  • Shuffle write bytes / records
  • Memory and disk spill amounts

Use this to understand the data flow through the flagged stage.

Severity Sorting

Suspects are sorted by severity (Critical first, then Warning), with estimated_savings_ms descending as a tiebreaker within the same severity level. The Suspects tab title reflects this: "Suspects (severity → savings)".

Color-Coding in Stage Detail

The stage detail view uses color-coded metrics to help identify issues at a glance.

CPU Utilization

The CPU % value in the stage header is color-coded based on the CPU ratio (executor_cpu_time / executor_run_time):

| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% | CPU saturated |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized (possible I/O bound) |
| Red | < 30% | Severe I/O bound |

Peak Memory

Peak execution memory is color-coded relative to total cluster memory when executor data is available:

| Color | Ratio to cluster memory | Meaning |
|---|---|---|
| Red | ≥ 80% | Near memory limit |
| Yellow | 50%–79% | Moderate usage |
| Green | 10%–49% | Comfortable |
| Default | < 10% | Low usage |

When executor data is unavailable, absolute thresholds are used as fallback:

| Color | Threshold | Meaning |
|---|---|---|
| Red | ≥ 10 GB | High memory usage |
| Yellow | ≥ 1 GB | Moderate |
| Green | ≥ 100 MB | Normal |
| Default | < 100 MB | Low |

Architecture

spark-tui follows a modular architecture with clear separation between configuration, data fetching, analysis, and rendering.

Module Map

src/
├── main.rs
├── config/
│   └── mod.rs             CLI args, env vars, ~/.databrickscfg parsing
├── fetch/
│   ├── client.rs          SparkHttpClient + FetchError
│   ├── spark.rs           Endpoint methods (get_jobs, get_stages, etc.)
│   ├── types.rs           Spark API response types (serde)
│   ├── databricks.rs      DatabricksClient (cluster info, DBFS, sparkui, history server)
│   ├── orchestrator.rs    poll_once, assemble_data_payload, compute_health_summary
│   ├── poller.rs          run_poller + historical fallback chain
│   └── eventlog/          Event log parsing (DBFS download, gzip, SparkEvent serde)
├── analyze/
│   ├── types.rs           Suspect, Severity, SuspectCategory, BottleneckPattern
│   ├── skew/              Data skew detection (CV + max/median)
│   ├── suspects/          SuspectContext, 10 detectors, bottleneck classification
│   └── sql_linker/        Job ↔ SQL ↔ Stage mapping
├── tui/
│   ├── app/               App state, event loop, key handling, rendering dispatch
│   │   ├── state.rs
│   │   ├── input.rs
│   │   └── render.rs
│   ├── theme.rs           Color/style functions
│   ├── highlight.rs       SQL/plan syntax highlighting
│   ├── tabs/
│   │   ├── jobs_list.rs   Jobs table
│   │   ├── job_detail.rs  Stage breakdown for a job
│   │   ├── sql_detail.rs  SQL execution plan view
│   │   ├── stage_detail.rs Detailed stage metrics
│   │   └── suspects.rs    Suspects table view
│   └── widgets/
│       ├── help.rs        Help overlay
│       ├── status_line.rs Status bar
│       └── summary_bar.rs Health summary bar
└── util/
    ├── format/            format_duration_ms, format_bytes, truncate, clean_stage_name
    └── time/              Spark timestamp parsing, duration_between

Data Flow

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Config  │────▶│ SparkHttp    │────▶│   Poller     │────▶│  Analysis │
│ resolve  │     │  Client      │     │ (poll_once)  │     │  Engine   │
└──────────┘     └──────────────┘     └──────┬───────┘     └─────┬─────┘
                                             │                   │
                                    DataPayload              Suspects
                                    + stage_sql_hints        (via SuspectContext)
                                    + critical_stages
                                             │                   │
                                             ▼                   ▼
                                     ┌───────────────────────────────┐
                                     │          App (TUI)            │
                                     │  event loop ← mpsc channel   │
                                     └───────────────────────────────┘

Step by step:

  1. Config resolution (config/mod.rs) — parses CLI args, env vars, and ~/.databrickscfg to produce a Config struct with host, token, cluster_id, and poll_interval

  2. HTTP client (fetch/client.rs) — SparkHttpClient wraps reqwest::Client with the base URL and token. FetchError maps HTTP status codes to user-friendly messages

  3. Endpoint methods (fetch/spark.rs) — discover_app_id, get_jobs, get_stages, get_sql_executions, get_task_list, get_executors — each calls the Spark REST API and deserializes the response

  4. Background poller (fetch/poller.rs) — run_poller runs in a tokio task. When the cluster becomes unreachable (503 or terminated), the poller automatically falls back to historical data via a 4-strategy chain: Spark UI REST API (with warm-up retry), Spark History Server proxy, DBFS event logs, and default DBFS path scanning. poll_once lives in fetch/orchestrator.rs (separate from the poller loop):

    • Fetches jobs, stages, SQL executions, and executors concurrently via 4-way tokio::join!
    • Aggregates active executors into ClusterResources (total memory, cores, executor count)
    • Builds cross-reference maps (job↔SQL, stage↔job)
    • Creates a SuspectContext with cross-reference maps
    • Runs 10 stage-level detectors via function pointer table, plus skew detection on task data
    • Fetches task lists for up to ~15 stages (selected by multiple heuristics)
    • Computes stage_sql_hints (SQL plan hints per stage) and critical_stages (longest wall-clock stage per job)
    • Computes HealthSummary for the summary bar
    • Sends a DataPayload (including cluster_resources, stage_sql_hints, critical_stages) through an mpsc channel
  5. Analysis (analyze/) — 10 stage-level detectors are dispatched via a function pointer table (&[DetectorFn]): detect_slow_stages, detect_spill, detect_cpu_efficiency, detect_record_explosion, detect_task_failures, detect_memory_pressure, detect_partition_count, detect_broadcast_join, detect_python_udf, detect_cache_opportunity. Each takes (&[SparkStage], &SuspectContext) and returns Vec<Suspect>. detect_skew runs separately on task data. aggregate_suspects sorts by severity then estimated_savings_ms

  6. App event loop (tui/app/) — App::run receives Action variants from the mpsc channel:

    • Action::DataUpdate(payload) — stores the new data
    • Action::FetchError(err) — stores the error message
    • Action::Key(event) — processes keybindings
    • Action::Resize(w, h) — triggers re-render
  7. Rendering (tui/tabs/, tui/widgets/) — renders the current view mode (List, JobDetail, StageDetail, SqlDetail) using ratatui widgets. The summary bar widget displays health metrics in List view

Async Model

spark-tui uses the tokio runtime with three concurrent tasks:

| Task | Channel | Description |
|---|---|---|
| Poller | tx → rx | Fetches data and sends Action::DataUpdate / Action::FetchError |
| Event reader | tx → rx | Reads terminal events via crossterm::event::read (blocking, wrapped in spawn_blocking) |
| App loop | rx | Receives all actions and processes them sequentially |

All tasks communicate through a single mpsc::UnboundedSender<Action> channel. The app loop owns the receiver and processes actions one at a time, ensuring thread-safe state updates without locks.
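The single-channel model can be sketched with std's mpsc (spark-tui uses tokio's unbounded channel, but the shape is the same; the Action variants here are simplified stand-ins):

```rust
use std::sync::mpsc;

// Simplified stand-in for the real Action enum.
enum Action {
    DataUpdate(String), // stands in for the real DataPayload
    FetchError(String),
    Key(char),
}

/// The app loop owns the sole receiver and applies actions sequentially,
/// so state updates need no locks. Here we just log each action.
fn drain(rx: mpsc::Receiver<Action>) -> Vec<String> {
    rx.iter()
        .map(|a| match a {
            Action::DataUpdate(d) => format!("data:{d}"),
            Action::FetchError(e) => format!("err:{e}"),
            Action::Key(k) => format!("key:{k}"),
        })
        .collect()
}
```

Multiple producers (poller, event reader) clone the sender; the loop ends when every sender is dropped.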

Design Decisions

  • Bounded task fetching: Task lists (per-task metrics) are fetched for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed
  • Concurrent fetches: Jobs, stages, SQL executions, and executors are fetched in parallel with 4-way tokio::join! to minimize latency
  • Function pointer dispatch: Stage-level detectors are stored in a &[DetectorFn] array and dispatched via flat_map, making it easy to add new detectors
  • SuspectContext: Replaces ad-hoc parameter passing — all cross-reference maps are bundled in a single struct with helper methods (job_id, resolve_sql, resolve_plan_hint_for, enrich)
  • tui-scrollview: Used for smooth scrolling in StageDetail and SqlDetail views, replacing manual u16 scroll offsets with ScrollViewState
  • Log file: Logs go to /tmp/spark-tui.log instead of stderr to avoid corrupting the TUI
  • Panic hook: A custom panic hook restores the terminal before printing the panic message, preventing terminal corruption
  • Edition 2024: Uses the latest Rust edition for modern language features

Dependencies

| Crate | Purpose |
|---|---|
| clap | CLI argument parsing with env var fallback |
| tokio | Async runtime (macros, rt-multi-thread, time, sync features) |
| reqwest | HTTP client (with rustls-tls) |
| serde / serde_json | JSON deserialization |
| thiserror | Error type derivation |
| ratatui | Terminal UI framework (with unstable-rendered-line-info feature) |
| crossterm | Terminal backend |
| tracing / tracing-subscriber | Structured logging |
| chrono | Timestamp parsing |
| syntect / syntect-tui | SQL syntax highlighting |
| tui-scrollview | Smooth scrollable views for detail panels |

CI/CD Workflows

| Workflow | Trigger | Description |
|---|---|---|
| ci.yml | Push / PR | Runs cargo fmt --check, cargo clippy, cargo test |
| docs.yml | Push / PR | Builds and deploys mdbook documentation to GitHub Pages |
| auto-tag.yml | Push to master (Cargo.toml changed) | Creates a vX.Y.Z tag when the version in Cargo.toml changes |
| release.yml | Tag v* | Cross-platform release builds (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) with GitHub Release artifacts |

Module Reference

This section provides a reference for each module in the spark-tui codebase.

Modules

| Module | Path | Description |
|---|---|---|
| Config | src/config/ | CLI argument parsing, config resolution, ~/.databrickscfg support |
| Fetch | src/fetch/ | HTTP client, Spark API types, endpoint methods, background poller |
| Analyze | src/analyze/ | Suspect detection (slow stages, skew, spill), bottleneck classification, SQL linking |
| TUI | src/tui/ | App state machine, tab rendering, widgets, theme |
| Utilities | src/util/ | Formatting (duration, bytes) and time parsing helpers |

Config Module

Path: src/config.rs

Handles CLI argument parsing, environment variable fallback, and ~/.databrickscfg file parsing to produce a resolved Config struct.

Structs

CliArgs

Clap-derived struct for command-line arguments. Each field uses #[arg(env = "...")] for automatic env var fallback.

pub struct CliArgs {
    pub host: Option<String>,          // --host / DATABRICKS_HOST
    pub token: Option<String>,         // --token / DATABRICKS_TOKEN
    pub cluster_id: Option<String>,    // --cluster-id / DATABRICKS_CLUSTER_ID
    pub profile: Option<String>,       // --profile, -p / DATABRICKS_CONFIG_PROFILE
    pub poll_interval: u64,            // --poll-interval / SPARK_TUI_POLL_INTERVAL (default: 10)
}

Config

Resolved configuration with all required fields guaranteed present.

pub struct Config {
    pub host: String,
    pub token: String,
    pub cluster_id: String,
    pub poll_interval: u64,
}

Methods:

| Method | Signature | Description |
|---|---|---|
| base_url | &self -> String | Constructs the Spark REST API base URL. Strips https:// prefix and trailing slashes from host |
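As an illustration of that normalization, here is a hypothetical standalone sketch. The driver-proxy URL shape follows the Introduction; the exact trimming logic is an assumption, not the real implementation:

```rust
// Hypothetical sketch: strip the scheme and trailing slashes from the host,
// then build the driver-proxy base URL.
fn base_url(host: &str, cluster_id: &str) -> String {
    let host = host
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .trim_end_matches('/');
    format!("https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1")
}

fn main() {
    let url = base_url("https://adb-123.azuredatabricks.net/", "0101-abc");
    assert_eq!(
        url,
        "https://adb-123.azuredatabricks.net/driver-proxy-api/o/0/0101-abc/40001/api/v1"
    );
}
```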

Functions

| Function | Signature | Description |
|---|---|---|
| resolve_config | () -> Result<Config, String> | Main entry point. Resolves config from CLI > env > databrickscfg |
| databrickscfg_path | () -> Option<PathBuf> | Returns the ~/.databrickscfg path if the file exists |
| parse_databrickscfg | (path) -> Result<HashMap<String, Profile>> | Parses the INI-format config file into profile sections |
| find_complete_profile | (profiles) -> Option<&Profile> | Finds the first profile with host, token, and cluster_id |

Resolution Logic

  1. Parse CLI args (clap handles env var fallback)
  2. If all three fields are present → return Config
  3. Otherwise, read ~/.databrickscfg
  4. If --profile is set → use that section (error if not found)
  5. Otherwise → auto-detect the first complete profile
  6. Merge CLI/env values with profile values (CLI/env takes priority)
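The merge in step 6 is a per-field precedence rule, which in Rust is just `Option::or`. A minimal sketch (field handling is illustrative, not the exact implementation):

```rust
// CLI/env values win over profile values; the first Some wins.
fn merge(cli: Option<String>, profile: Option<String>) -> Option<String> {
    cli.or(profile)
}

fn main() {
    assert_eq!(
        merge(Some("cli-host".into()), Some("profile-host".into())),
        Some("cli-host".into())
    );
    assert_eq!(merge(None, Some("profile-host".into())), Some("profile-host".into()));
    assert_eq!(merge(None::<String>, None), None);
}
```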

Tests

  • test_parse_databrickscfg — verifies INI parsing
  • test_find_complete_profile — verifies auto-detection of complete profiles
  • test_base_url_strips_scheme_and_trailing_slash — verifies URL normalization

Fetch Module

Path: src/fetch/

Handles HTTP communication with the Spark REST API via the Databricks driver proxy.

Files

| File | Purpose |
|---|---|
| client.rs | SparkHttpClient and FetchError |
| types.rs | Spark API response types (serde) |
| spark.rs | Endpoint methods on SparkHttpClient |
| databricks.rs | DatabricksClient for Databricks REST API, DBFS, Spark UI, and History Server |
| orchestrator.rs | poll_once, assemble_data_payload, compute_health_summary |
| poller.rs | Background polling loop and historical fallback chain |
| eventlog/ | Event log parsing: DBFS download, gzip decompression, SparkEvent serde |

client.rs — SparkHttpClient

SparkHttpClient

pub struct SparkHttpClient {
    client: reqwest::Client,
    base_url: String,
    token: String,
}
| Method | Signature | Description |
|---|---|---|
| new | (base_url, token) -> Self | Creates a new client |
| base_url | &self -> &str | Returns the base URL |
| get | &self, path: &str -> Result<T, FetchError> | Generic GET request with Bearer auth and JSON deserialization |

FetchError

Error enum with user-friendly messages for common HTTP errors:

| Variant | Status | Message |
|---|---|---|
| Unauthorized | 401 | Token expired or invalid |
| Forbidden | 403 | Insufficient permissions |
| NotFound | 404 | Spark UI not available / app may have ended |
| ServiceUnavailable | 503 | Cluster not reachable |
| HttpError | other | Generic HTTP error with status and body |
| Deserialize | | JSON deserialization failure |
| Request | | Network-level request failure |
| NoApplications | | No Spark applications found |

types.rs — API Types

Application & Job Types

| Type | Key Fields |
|---|---|
| SparkApplication | id, name |
| SparkJob | job_id, name, status, submission_time, completion_time, stage_ids, task counts |
| JobStatus | Succeeded, Running, Failed, Unknown |

Stage Types

| Type | Key Fields |
|---|---|
| SparkStage | stage_id, attempt_id, status, num_tasks, executor_run_time, I/O bytes, spill bytes |
| StageStatus | Active, Complete, Pending, Failed, Skipped |

SQL Types

| Type | Key Fields |
|---|---|
| SparkSqlExecution | id, status, description, plan_description, duration, job ID lists |

Task Types

| Type | Key Fields |
|---|---|
| SparkTask | task_id, stage_id, executor_id, host, status, duration, I/O bytes, spill bytes, peak_execution_memory |
| RawSparkTask | Raw API format with nested task_metrics (flattened into SparkTask via a custom deserializer) |

Executor Types

| Type | Key Fields |
|---|---|
| SparkExecutor | id, total_cores, max_memory, is_active |
| ClusterResources | total_executor_memory, total_executor_cores, num_executors |

spark.rs — Endpoint Methods

Methods on SparkHttpClient:

| Method | Path | Returns |
|---|---|---|
| discover_app_id | /applications | String (first application ID) |
| get_jobs | /applications/{id}/jobs | Vec<SparkJob> |
| get_stages | /applications/{id}/stages | Vec<SparkStage> |
| get_sql_executions | /applications/{id}/sql | Vec<SparkSqlExecution> |
| get_task_list | /applications/{id}/stages/{sid}/{attempt}/taskList | Vec<SparkTask> |
| get_executors | /applications/{id}/executors | Vec<SparkExecutor> |

databricks.rs — DatabricksClient

Thin client for Databricks REST API /api/2.0/* endpoints and workspace-level requests.

DatabricksClient

pub struct DatabricksClient {
    client: reqwest::Client,
    base_url: String,        // e.g. https://{host}/api/2.0
    workspace_root: String,  // e.g. https://{host}
    token: String,
    sparkui_cookie: Option<String>,
}

SparkuiProbeResult

Tri-state result from probing a Spark UI endpoint:

| Variant | Fields | Meaning |
|---|---|---|
| Ready | base_url, app_id | Spark UI is serving JSON data |
| Loading | base_url | Spark UI is authenticated but still downloading/parsing event logs |
| NotFound | | No accessible Spark UI endpoint found |

Key Methods

| Method | Description |
|---|---|
| get_cluster_info | Fetch cluster state, log config, and spark_context_id |
| try_sparkui_endpoint | Probe the Historical Spark UI REST API (tries Bearer + cookie auth, workspace + dataplane domains) |
| probe_sparkui_url | Re-probe a single URL for retry after a Loading state |
| fetch_sparkui_data | Fetch all data (jobs, stages, SQL, tasks) from a ready Spark UI endpoint |
| discover_history_server | Probe known Spark History Server proxy URL patterns |
| fetch_history_data | Fetch all data from the History Server |
| dbfs_list / dbfs_read_full | DBFS file operations |
| find_default_event_logs | Scan well-known DBFS paths for event log files |

Warm-up Detection

The is_loading_page() helper detects HTML loading pages returned during Spark UI warm-up by checking for patterns like <title>Loading</title>, loading spark ui, please wait, or generic HTML responses.
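An approximation of that heuristic as a standalone function (the function name and exact match logic here are assumptions; only the patterns come from the text above):

```rust
// Hypothetical sketch of the warm-up check: does the response body look
// like an HTML loading page rather than JSON?
fn looks_like_loading_page(body: &str) -> bool {
    let lower = body.to_lowercase();
    lower.contains("<title>loading</title>")
        || lower.contains("loading spark ui")
        || lower.contains("please wait")
}

fn main() {
    assert!(looks_like_loading_page("<html><title>Loading</title></html>"));
    assert!(!looks_like_loading_page("{\"jobs\": []}"));
}
```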

eventlog/ — Event Log Parsing

| File | Purpose |
|---|---|
| events.rs | SparkEvent serde types for Spark event log JSON lines |
| parser.rs | EventLogParser — converts raw events into jobs, stages, SQL, tasks |
| loader.rs | DBFS download + gzip decompression pipeline |

The load_event_log() function orchestrates: discover log path → download from DBFS → decompress gzip → parse JSON lines → return structured data.

orchestrator.rs — Data Polling & Assembly

poll_once

Fetches all data and runs analysis in a single poll cycle:

  1. Fetch jobs, stages, SQL executions, and executors concurrently (4-way tokio::join!)
  2. Aggregate active executors into ClusterResources (total memory, cores, executor count)
  3. Build cross-reference maps (job↔SQL, stage↔job)
  4. Build ranked jobs (sorted by duration, running first)
  5. Create a SuspectContext from the cross-reference maps
  6. Run 10 stage-level detectors via function pointer table (detect_slow_stages, detect_spill, detect_cpu_efficiency, detect_record_explosion, detect_task_failures, detect_memory_pressure, detect_partition_count, detect_broadcast_join, detect_python_udf, detect_cache_opportunity)
  7. Fetch task lists for top ~15 stages (selected by multiple heuristics)
  8. Run skew detection on fetched tasks (duration, data-size, record-count, executor hotspot)
  9. Aggregate and sort suspects (severity first, then estimated_savings_ms descending)
  10. Build stage_sql_hints — maps stage_id to top SQL plan operations
  11. Compute critical_stages — the longest wall-clock stage per job (critical path)
  12. Compute HealthSummary (job/IO counts, critical/warning counts, top issues)
  13. Return DataPayload

compute_health_summary

fn compute_health_summary(
    jobs: &[RankedJob],
    stages: &[SparkStage],
    suspects: &[Suspect],
) -> HealthSummary

Aggregates job counts, total I/O bytes, and suspect severity counts into a HealthSummary for the summary bar widget.

poller.rs — Background Poller

run_poller

pub async fn run_poller(
    client: Arc<SparkHttpClient>,
    databricks: Arc<DatabricksClient>,
    config: Arc<Config>,
    tx: mpsc::UnboundedSender<Action>,
    poll_interval: Duration,
)
  1. Discovers the application ID (or detects cluster unreachable → historical fallback)
  2. Enters a live poll loop: poll_once (from orchestrator.rs) → send result via channel → sleep
  3. On connection loss during polling, falls back to historical data via try_load_historical

Historical Fallback Chain

When the cluster is unreachable, try_load_historical attempts these strategies in order:

| Strategy | Function | Condition |
|---|---|---|
| 0 — Spark UI | try_sparkui | Requires spark_context_id; retries with backoff if the UI is loading |
| 1 — History Server | try_history_server | Probes known proxy URL patterns |
| 2 — DBFS event logs | try_dbfs_event_logs | Uses cluster_log_conf or --event-log-path |
| 3 — Default paths | find_default_event_logs | Scans well-known DBFS directories |

The Spark UI strategy includes warm-up retry: if the endpoint returns an HTML loading page, it retries with backoff delays defined in SPARKUI_RETRY_DELAYS (3, 5, 10, 15, 20 seconds — ~53s total).
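The retry schedule can be sketched synchronously (the real code awaits tokio sleeps between probes; here a closure stands in for the probe and the sleep is elided):

```rust
// One immediate attempt, then one retry after each delay in the schedule.
const SPARKUI_RETRY_DELAYS: [u64; 5] = [3, 5, 10, 15, 20]; // seconds, ~53s total

// Returns the 1-based attempt number on success, or None if all attempts fail.
fn retry_with_backoff<F: FnMut() -> bool>(mut probe: F) -> Option<usize> {
    for (attempt, _delay_secs) in std::iter::once(&0u64)
        .chain(SPARKUI_RETRY_DELAYS.iter())
        .enumerate()
    {
        // A real implementation would sleep for *_delay_secs before re-probing.
        if probe() {
            return Some(attempt + 1);
        }
    }
    None
}

fn main() {
    let mut calls = 0;
    let result = retry_with_backoff(|| {
        calls += 1;
        calls == 3 // simulate the UI becoming ready on the third probe
    });
    assert_eq!(result, Some(3));
}
```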

DataPayload

pub struct DataPayload {
    pub app_id: String,
    pub jobs: Vec<RankedJob>,
    pub stages: Vec<SparkStage>,
    pub sql_executions: Vec<SparkSqlExecution>,
    pub suspects: Vec<Suspect>,
    pub stage_tasks: Arc<HashMap<i64, Vec<SparkTask>>>,
    pub summary: HealthSummary,
    pub cluster_resources: ClusterResources,
    pub stage_sql_hints: Arc<HashMap<i64, String>>,
    pub critical_stages: Arc<HashSet<i64>>,
    pub last_updated: String,
    pub data_source: DataSourceMode,
    pub data_source_detail: Option<String>,
}

Note: DataPayload is defined in src/tui/mod.rs and contains all data needed to render the TUI.

Key fields:

  • stage_sql_hints — maps stage_id → String with top SQL plan operations (e.g., “HashAggregate → Exchange → Scan parquet”), pre-computed for display in stage detail headers
  • critical_stages — set of stage IDs that represent the critical path (longest wall-clock stage per job), used for “CP” annotations in the job detail view
  • data_sourceDataSourceMode::Live or DataSourceMode::Historical, displayed in the status line
  • data_source_detail — optional label like "Spark UI", "history server", or "event logs", shown alongside the HISTORICAL badge

Analyze Module

Path: src/analyze/

Contains all performance analysis logic: suspect detection, bottleneck classification, and SQL correlation.

Files

| File | Purpose |
|---|---|
| types.rs | Core types: Suspect, Severity, SuspectCategory, BottleneckPattern, RankedJob, SqlJobLink |
| skew/ | Data skew detection using task-level metrics |
| suspects/ | SuspectContext, 10 stage-level detectors, bottleneck classification, aggregation |
| sql_linker/ | Cross-reference maps between jobs, stages, and SQL executions |

types.rs — Core Types

Severity

pub enum Severity {
    Warning,
    Critical,
}

Implements Ord for sorting (Critical > Warning).
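Because deriving Ord on a fieldless enum orders variants by declaration order, declaring Warning before Critical gives exactly this ordering. A minimal demonstration:

```rust
// Variants compare by declaration order, so Warning < Critical.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Warning,
    Critical,
}

fn main() {
    assert!(Severity::Critical > Severity::Warning);
    let mut v = vec![Severity::Critical, Severity::Warning];
    v.sort(); // ascending: Warning first
    assert_eq!(v, vec![Severity::Warning, Severity::Critical]);
}
```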

SuspectCategory

pub enum SuspectCategory {
    SlowStage,
    DataSkew,
    DataSizeSkew,
    RecordCountSkew,
    DiskSpill,
    CpuBottleneck,
    IoBottleneck,
    RecordExplosion,
    TaskFailures,
    MemoryPressure,
    ExecutorHotspot,
    TooManyPartitions,
    TooFewPartitions,
    BroadcastJoinOpportunity,
    PythonUdf,
    CacheOpportunity,
}

BottleneckPattern

pub enum BottleneckPattern {
    LargeScan,
    WideShuffle,
    DataExplosion,
    RecordExplosion,
}

Suspect

pub struct Suspect {
    pub severity: Severity,
    pub category: SuspectCategory,
    pub stage_id: i64,
    pub job_id: Option<i64>,
    pub title: String,
    pub detail: String,
    pub stage_name: Option<String>,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub io_summary: Option<String>,
    pub recommendation: Option<String>,
    pub bottleneck: Option<BottleneckPattern>,
    pub sql_plan_hint: Option<String>,
    pub estimated_savings_ms: i64,
}

The estimated_savings_ms field contains a heuristic estimate of time savings from fixing the issue. It is used as a secondary sort key (after severity) in aggregate_suspects.

RankedJob

Processed job data for display, sorted by duration (running first, then slowest first).

pub struct RankedJob {
    pub job_id: i64,
    pub name: String,
    pub status: String,
    pub duration_ms: Option<i64>,
    pub num_tasks: i32,
    pub num_failed_tasks: i32,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub stage_ids: Vec<i64>,
    pub submission_time: Option<String>,
    pub sql_plan: Option<String>,
}

HealthSummary

pub struct HealthSummary {
    pub total_jobs: usize,
    pub running_jobs: usize,
    pub failed_jobs: usize,
    pub total_input_bytes: i64,
    pub total_output_bytes: i64,
    pub total_shuffle_bytes: i64,
    pub critical_count: usize,
    pub warning_count: usize,
    pub top_issues: Vec<String>,
}

Aggregates health metrics for the summary bar widget, computed by compute_health_summary in the poller.

skew/ — Skew Detection

detect_skew

pub fn detect_skew(
    tasks: &[SparkTask],
    stage_id: i64,
    job_id: Option<i64>,
    stage_name: Option<&str>,
    sql_id: Option<i64>,
    sql_description: Option<&str>,
) -> Vec<Suspect>

Detects all forms of skew in a stage’s tasks. Returns a Vec<Suspect> covering duration skew, data-size skew, record-count skew, and executor hotspot detection. See Understanding Analysis for threshold details.
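A common building block for duration skew is the max-to-median ratio of task durations; a straggler far above the median indicates skew. This is an illustrative sketch only, the real thresholds live in the Understanding Analysis chapter:

```rust
// Ratio of the slowest task to the median task duration.
// A high ratio (e.g. > 5x) suggests one or a few straggler tasks.
fn max_to_median_ratio(mut durations_ms: Vec<i64>) -> Option<f64> {
    if durations_ms.is_empty() {
        return None;
    }
    durations_ms.sort_unstable();
    let median = durations_ms[durations_ms.len() / 2] as f64;
    let max = *durations_ms.last().unwrap() as f64;
    if median == 0.0 { None } else { Some(max / median) }
}

fn main() {
    // One straggler at ~10x the median is a classic skew signature.
    let ratio = max_to_median_ratio(vec![100, 110, 95, 105, 1000]).unwrap();
    assert!(ratio > 5.0);
}
```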

suspects/ — Stage-Level Detection

Constants

| Constant | Value |
|---|---|
| ONE_MB | 1,048,576 bytes |
| FIFTY_MB | 52,428,800 bytes |
| ONE_HUNDRED_MB | 104,857,600 bytes |
| FIVE_HUNDRED_MB | 524,288,000 bytes |
| ONE_GB | 1,073,741,824 bytes |

SuspectContext

Holds all lookup maps needed by suspect detectors, eliminating repetitive parameter passing.

pub struct SuspectContext<'a> {
    pub stage_to_job: &'a HashMap<i64, i64>,
    pub job_to_sql: &'a HashMap<i64, i64>,
    pub sql_descriptions: &'a HashMap<i64, String>,
    pub sql_plans: &'a HashMap<i64, String>,
}

Constructor:

pub fn new(
    stage_to_job: &'a HashMap<i64, i64>,
    job_to_sql: &'a HashMap<i64, i64>,
    sql_descriptions: &'a HashMap<i64, String>,
    sql_plans: &'a HashMap<i64, String>,
) -> Self

Methods:

| Method | Signature | Description |
|---|---|---|
| job_id | (&self, stage_id: i64) -> Option<i64> | Look up the job_id for a stage |
| resolve_sql | (&self, stage_id: i64) -> (Option<i64>, Option<String>) | Resolve the SQL id and description for a stage via its job (private) |
| resolve_plan_hint_for | (&self, stage_id: i64) -> Option<String> | Resolve top SQL plan operations for a stage (e.g., "HashAggregate → Exchange → Scan") |
| enrich | (&self, suspect: &mut Suspect, stage: &SparkStage) | Enrich a suspect with stage_name, SQL linkage, I/O summary, and plan hint |

Stage-Level Detectors

All 10 stage-level detectors share the same signature:

pub fn detect_*(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect>

They are dispatched via a function pointer table in the poller:

type DetectorFn = fn(&[SparkStage], &SuspectContext) -> Vec<Suspect>;
let detectors: &[DetectorFn] = &[
    detect_slow_stages,
    detect_spill,
    detect_cpu_efficiency,
    detect_record_explosion,
    detect_task_failures,
    detect_memory_pressure,
    detect_partition_count,
    detect_broadcast_join,
    detect_python_udf,
    detect_cache_opportunity,
];

detect_slow_stages

Flags stages with executor_run_time exceeding mean + 2*stddev (warning) or mean + 4*stddev (critical). Sets estimated_savings_ms to time above mean.
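The mean + k·stddev rule can be sketched standalone. This is an illustration of the statistical test, not the real detector (which also enriches suspects and computes savings):

```rust
// Returns the indices of runtimes above mean + k * stddev (population stddev).
// k = 2.0 corresponds to the warning threshold, k = 4.0 to critical.
fn slow_outliers(runtimes_ms: &[f64], k: f64) -> Vec<usize> {
    let n = runtimes_ms.len() as f64;
    if n == 0.0 {
        return vec![];
    }
    let mean = runtimes_ms.iter().sum::<f64>() / n;
    let var = runtimes_ms.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let threshold = mean + k * var.sqrt();
    runtimes_ms
        .iter()
        .enumerate()
        .filter(|(_, &r)| r > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let mut runtimes = vec![1000.0; 20];
    runtimes.push(10_000.0); // one straggler stage
    assert_eq!(slow_outliers(&runtimes, 2.0), vec![20]);
}
```

Note that with very few stages a single outlier inflates the stddev enough to hide itself, so the rule is most meaningful for applications with many stages.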

detect_spill

Flags stages with disk_bytes_spilled > 0 (warning) or > 1 GB (critical). Estimates ~30% of runtime as spill overhead.

detect_cpu_efficiency

Detects CPU efficiency issues. Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time. Low ratio (< 0.3, runtime > 10s) → I/O bottleneck; high ratio (> 0.9, runtime > 30s) → CPU saturated. Estimates ~20% savings.

detect_record_explosion

Detects stages where output_records > 10x input_records (with input_records > 1000). Estimates ~50% savings.

detect_task_failures

Detects stages with task failures or killed tasks. Estimates savings proportional to failure rate.

detect_memory_pressure

Detects memory pressure: memory_bytes_spilled > 50 MB but disk_bytes_spilled == 0. Estimates ~10% savings from GC overhead.

detect_partition_count

Detects partition count issues. Two sub-categories:

  • TooManyPartitions: num_tasks > 10,000 and avg_bytes_per_task < 1 MB. Estimates ~40% savings from scheduling overhead.
  • TooFewPartitions: num_tasks ≤ 8 and avg_bytes_per_task > 1 GB. Estimates ~50% savings from straggler elimination.

Recommendations include a computed target partition count for ~128 MB/partition.
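The target partition count is a straightforward division; a sketch of the computation (the rounding/clamping behavior here is an assumption):

```rust
// Aim for ~128 MB per partition, never fewer than one partition.
const TARGET_BYTES_PER_PARTITION: i64 = 128 * 1024 * 1024;

fn target_partitions(total_bytes: i64) -> i64 {
    (total_bytes / TARGET_BYTES_PER_PARTITION).max(1)
}

fn main() {
    let ten_gb = 10 * 1024 * 1024 * 1024_i64;
    assert_eq!(target_partitions(ten_gb), 80); // 10 GB / 128 MB = 80
}
```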

detect_broadcast_join

Detects shuffle joins where one side is small enough to broadcast. Triggers when shuffle_write_bytes < 100 MB, executor_run_time > 5s, and the SQL plan hint contains join indicators (SortMerge, ShuffledHash, Join). Estimates ~60% savings from shuffle elimination.

detect_python_udf

Detects Python UDFs in SQL plans by searching for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner. Escalates to Critical severity if also CPU-bound (ratio > 0.9, runtime > 30s). Estimates ~50% savings.

detect_cache_opportunity

Detects repeated computations by grouping completed stages by cleaned name. Triggers when ≥ 2 stages share a name and total runtime > 30s. Savings estimated as total_runtime - min_single_runtime.

classify_bottleneck

pub fn classify_bottleneck(s: &SparkStage) -> Option<BottleneckPattern>

Classifies root cause based on I/O patterns:

| Pattern | Condition |
|---|---|
| DataExplosion | input > 100 MB and output > 5x input |
| LargeScan | input > 1 GB and input > 10x (output + shuffle_write) |
| WideShuffle | shuffle_write > 500 MB or shuffle_read > input |
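The condition table translates directly into code. In this sketch the check order (explosion before scan before shuffle) and the free-function shape are assumptions; the real function takes a &SparkStage:

```rust
#[derive(Debug, PartialEq)]
enum BottleneckPattern {
    LargeScan,
    WideShuffle,
    DataExplosion,
}

// Transcription of the condition table, with byte sizes taken from the stage.
fn classify(input: i64, output: i64, shuffle_read: i64, shuffle_write: i64) -> Option<BottleneckPattern> {
    const MB: i64 = 1_048_576;
    const GB: i64 = 1_073_741_824;
    if input > 100 * MB && output > 5 * input {
        Some(BottleneckPattern::DataExplosion)
    } else if input > GB && input > 10 * (output + shuffle_write) {
        Some(BottleneckPattern::LargeScan)
    } else if shuffle_write > 500 * MB || shuffle_read > input {
        Some(BottleneckPattern::WideShuffle)
    } else {
        None
    }
}

fn main() {
    const GB: i64 = 1_073_741_824;
    // A 50 GB scan with tiny output and no shuffle is a LargeScan.
    assert_eq!(classify(50 * GB, GB / 10, 0, 0), Some(BottleneckPattern::LargeScan));
}
```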

aggregate_suspects

pub fn aggregate_suspects(mut suspects: Vec<Suspect>) -> Vec<Suspect>

Sorts suspects by severity (Critical first), then by estimated_savings_ms descending as a tiebreaker.
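The two-key sort (highest severity first, then largest savings first) can be sketched with minimal stand-in types; the tuple representation below is illustrative, not the real Suspect struct:

```rust
// Stand-in for the real Severity enum: Warning < Critical by declaration order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Severity {
    Warning,
    Critical,
}

// (severity, estimated_savings_ms) pairs, sorted descending on both keys.
fn aggregate(mut suspects: Vec<(Severity, i64)>) -> Vec<(Severity, i64)> {
    suspects.sort_by(|a, b| b.0.cmp(&a.0).then(b.1.cmp(&a.1)));
    suspects
}

fn main() {
    let sorted = aggregate(vec![
        (Severity::Warning, 9_000),
        (Severity::Critical, 1_000),
        (Severity::Critical, 5_000),
    ]);
    // Criticals come first regardless of savings; savings break ties.
    assert_eq!(sorted[0], (Severity::Critical, 5_000));
    assert_eq!(sorted[2], (Severity::Warning, 9_000));
}
```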

Helper Functions

| Function | Description |
|---|---|
| classify_bottleneck(s) | Classifies the root-cause bottleneck pattern for a stage |
| bottleneck_recommendation(b) | Returns a PySpark-specific recommendation string for a bottleneck pattern |
| stage_io_summary(s) | Formats I/O metrics for a stage (including the in:out ratio) |

Note: resolve_sql and resolve_plan_hint_for are now methods on SuspectContext (see above).

sql_linker/ — Cross-Reference Maps

| Function | Signature | Description |
|---|---|---|
| build_job_to_sql_map | (sqls) -> HashMap<i64, i64> | Maps job_id → sql_id from SQL execution job lists |
| build_stage_to_job_map | (jobs) -> HashMap<i64, i64> | Maps stage_id → job_id from job stage lists |
| link_sql_to_jobs | (sqls) -> Vec<SqlJobLink> | Groups SQL executions with their job IDs |
| find_sql_for_job | (job_id, ...) -> (Option<i64>, Option<String>) | Looks up the SQL ID and description for a job |
| stages_for_task_analysis | (stages) -> Vec<(i64, i64)> | Selects up to ~15 stages for task-level analysis using multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism) |

TUI Module

Path: src/tui/

Contains the terminal UI: app state machine, event loop, tab rendering, widgets, and theme.

Files

| File | Purpose |
|---|---|
| app/ | App struct, event loop, key handling, rendering dispatch (state.rs, input.rs, render.rs) |
| theme.rs | Color and style functions |
| highlight.rs | SQL/plan syntax highlighting |
| tabs/jobs_list.rs | Jobs table view |
| tabs/job_detail.rs | Stage breakdown for a selected job |
| tabs/sql_detail.rs | SQL execution plan view (scrollable) |
| tabs/stage_detail.rs | Detailed stage metrics view |
| tabs/suspects.rs | Suspects table view |
| widgets/help.rs | Help overlay (keybinding reference + SQL recommendations) |
| widgets/status_line.rs | Status bar with cluster info and last update time |
| widgets/summary_bar.rs | Health summary bar (top issues, job/IO counts) |

app/ — App State

Tab

pub enum Tab {
    Jobs,
    Suspects,
}

Methods: next(), prev(), index(), from_index(), titles().

ViewMode

pub enum ViewMode {
    List,         // Tab-level table view
    JobDetail,    // Stage breakdown for a selected job
    SqlDetail,    // SQL execution plan (scrollable)
    StageDetail,  // Detailed metrics for a selected stage
}

App

pub struct App {
    pub active_tab: Tab,
    pub view_mode: ViewMode,
    pub data: Option<Arc<DataPayload>>,
    pub error_msg: Option<String>,
    pub cluster_id: String,
    pub should_quit: bool,
    pub job_table_state: TableState,
    pub suspect_table_state: TableState,
    pub detail_table_state: TableState,
    pub sql_scroll_state: ScrollViewState,
    pub stage_detail_scroll_state: ScrollViewState,
    show_help: bool,
    return_tab: Option<Tab>,
    client: Arc<SparkHttpClient>,
    tx: mpsc::UnboundedSender<Action>,
    pending_task_fetches: HashSet<i64>,
}

Notable changes from v1:

  • data changed from Option<DataPayload> to Option<Arc<DataPayload>> for cheaper cloning
  • sql_scroll: u16 and stage_detail_scroll: u16 replaced with sql_scroll_state: ScrollViewState and stage_detail_scroll_state: ScrollViewState (from tui-scrollview)
  • show_help: bool — toggles the help overlay
  • return_tab: Option<Tab> — tracks which tab to return to on Esc from JobDetail (e.g., Suspects tab when entering via a suspect)

Key methods:

| Method | Description |
|---|---|
| new(cluster_id, client, tx) | Creates a new App with initial state |
| run(&mut self, terminal, rx) | Main event loop — receives Actions from the channel, handles keys, re-renders |
| handle_key(key_event) | Processes keyboard input based on the current ViewMode |
| handle_action(action) | Processes DataUpdate, FetchError, Key, Mouse, Resize actions |
| handle_escape() | Goes back one level; respects return_tab |
| handle_enter() | Drills into the selected item |
| handle_navigation_down() | Moves selection/scroll down based on ViewMode |
| handle_navigation_up() | Moves selection/scroll up based on ViewMode |
| handle_navigation_home() | Jumps to the top |
| handle_navigation_end() | Jumps to the bottom |
| open_sql_detail() | Opens SQL detail for the current job |
| enter_suspect_job() | Navigates from the Suspects tab to a suspect's job detail, setting return_tab |
| enter_stage_detail() | Navigates to stage detail from job detail |
| trigger_task_fetch(stage_id) | Triggers an on-demand task fetch for a stage not already in stage_tasks |
| render(frame) | Renders the current state to the terminal |
| render_tab_bar(frame, area) | Renders the tab header |
| render_content(frame, area) | Renders the main content area for the current ViewMode |
| render_status_bar(frame, area) | Renders the bottom status bar with context-sensitive hint strings |

theme.rs — Styles

Pure functions that return ratatui::style::Style:

| Function | Usage |
|---|---|
| critical() | Red — critical severity |
| warning() | Yellow — warning severity |
| healthy() | Green — success/healthy |
| running() | Yellow — running status |
| failed() | Red — failed status |
| muted() | Gray — secondary text |
| selected() | Cyan — selected row |
| tab_active() | Active tab style |
| tab_inactive() | Inactive tab style |
| status_bar() | Status bar background |
| severity_style(severity) | Maps Severity to a style |
| job_status_style(status) | Maps a job status string to a style |
| metric_bytes_style(bytes) | Color-codes byte counts by size |
| shuffle_bytes_style(bytes) | Color-codes shuffle bytes |
| spill_bytes_style(bytes) | Color-codes spill bytes |
| cpu_utilization_style(ratio) | Color-codes CPU utilization: ≥0.95 red (saturated), ≥0.5 green (healthy), ≥0.3 yellow (underutilized), <0.3 red (I/O bound) |
| memory_utilization_style(peak, cluster_mem) | Color-codes peak memory relative to the cluster total; falls back to absolute thresholds when cluster data is unavailable |

Size thresholds for byte styling: MB = 1_048_576, GB = 1_073_741_824.

tabs/jobs_list.rs — Jobs List

| Function | Description |
|---|---|
| format_submission_time(time) | Formats the submission time for display |
| render_jobs_tab(frame, area, app) | Renders the jobs table with columns: ID, Status, Duration, Tasks, Failed, SQL, Submitted |

tabs/job_detail.rs — Job Detail

| Function | Description |
|---|---|
| render_job_detail(frame, area, job, stages, sql_executions, stage_state, critical_stages) | Renders the stage breakdown table for a job. Critical path stages are annotated with "CP" |

tabs/sql_detail.rs — SQL Detail

| Function | Description |
|---|---|
| render_sql_detail(frame, area, job, scroll_state, suspects) | Renders scrollable SQL execution plan text with syntax highlighting. Uses ScrollViewState from tui-scrollview |

tabs/stage_detail.rs — Stage Detail

| Function | Description |
|---|---|
| render_stage_header(frame, area, stage, total_cluster_memory, sql_hint) | Renders the stage header with I/O metrics, CPU %, and an optional SQL plan hint |
| render_duration_histogram(frame, area, tasks) | Renders the task duration histogram |
| render_executor_breakdown(frame, area, tasks) | Renders the per-executor breakdown table |
| render_peak_memory_section(frame, area, tasks, total_cluster_memory) | Renders the peak memory per task section |
| render_skew_metrics(frame, area, tasks) | Renders skew metrics (CV, max/median ratio) |
| render_stage_detail(frame, area, stage, tasks, loading, scroll_state, total_cluster_memory, sql_hint) | Renders the full stage detail using ScrollViewState. Composes the stage header, I/O metrics, duration histogram, executor breakdown, peak memory, and skew metrics |

tabs/suspects.rs — Suspects Tab

| Function | Description |
|---|---|
| render_suspects_tab(frame, area, suspects, state, critical_stages) | Renders the suspects table with columns: Severity, Category, Stage, Job, Title, Detail, Recommendation. Table title: "Suspects (severity → savings)" |

highlight.rs — Syntax Highlighting

| Function | Description |
|---|---|
| highlight_sql(text) | Applies syntax highlighting to SQL query text for display in the SQL detail view |
| highlight_spark_plan(text) | Applies syntax highlighting to Spark physical plan text |

widgets/help.rs

| Function | Description |
|---|---|
| centered_rect(area, percent_x, percent_y) | Computes a centered rectangle within an area (private helper) |
| render_help_overlay(frame, area) | Renders a general keybinding reference overlay |
| render_sql_help_overlay(frame, area, job, suspects) | Renders PySpark-specific recommendations for suspects related to the current SQL execution |

widgets/status_line.rs

| Function | Description |
|---|---|
| render_status_line(frame, area, app) | Renders the bottom status bar showing the cluster ID, app ID, and last update time |

widgets/summary_bar.rs

| Function | Description |
|---|---|
| render_summary_bar(frame, area, summary) | Renders a 2-line health summary with colored foreground text (red=critical, yellow=warning, green=healthy) showing job/IO counts and top issues |

Utilities Module

Path: src/util/

Helper functions for formatting values and parsing timestamps.

Files

| File | Purpose |
|---|---|
| format.rs | Human-readable formatting for durations, bytes, strings, and SQL plans |
| time.rs | Spark timestamp parsing and duration calculation |

format.rs — Formatting

| Function | Signature | Description |
|---|---|---|
| format_duration_ms | (ms: i64) -> String | Formats milliseconds as a human-readable duration (e.g., 1h 23m 45s, 500ms, 2.3s) |
| format_bytes | (bytes: i64) -> String | Formats byte counts with an appropriate unit (e.g., 1.5 GB, 256 MB, 1.2 KB) |
| format_bytes_or_dash | (bytes: i64) -> String | Like format_bytes, but returns "-" if bytes is zero |
| format_records | (records: i64) -> String | Formats record counts with K/M/B suffixes |
| percentile | (sorted: &[f64], p: f64) -> f64 | Computes a percentile from a sorted slice |
| truncate | (s: &str, max: usize) -> String | Truncates a string to max characters, appending ... if truncated |
| sanitize_for_span | (s: &str) -> String | Replaces embedded newlines, carriage returns, and tabs with spaces. Ratatui's Line/Span types expect no embedded newlines — they corrupt the differential renderer's cursor position tracking |
| clean_stage_name | (name: &str) -> String | Removes Spark Connect prefixes and UUID suffixes from stage names for cleaner display |
| parse_plan_top_operations | (plan: &str, limit: usize) -> Vec<String> | Extracts the top N operations from a Spark SQL physical plan |

Examples

format_duration_ms(3_723_000)  // "1h 2m 3s"
format_duration_ms(500)        // "500ms"
format_duration_ms(2_300)      // "2.3s"

format_bytes(1_610_612_736)    // "1.5 GB"
format_bytes(1_048_576)        // "1.0 MB"

truncate("hello world", 5)     // "hello..."

clean_stage_name("spark-connect-UUID:stage_name") // "stage_name"

time.rs — Time Utilities

| Function | Signature | Description |
|---|---|---|
| parse_spark_timestamp | (s: &str) -> Option<DateTime<Utc>> | Parses Spark timestamps in multiple formats: RFC3339, naive datetime, and GMT-suffix |
| duration_between | (start: Option<&str>, end: Option<&str>) -> Option<i64> | Computes the duration in milliseconds between two optional timestamp strings |

Supported Timestamp Formats

  • RFC3339: 2024-01-15T10:30:00.000Z
  • Naive: 2024-01-15T10:30:00.000
  • GMT suffix: 2024-01-15T10:30:00.000GMT

Examples

parse_spark_timestamp("2024-01-15T10:30:00.000GMT") // Some(DateTime<Utc>)

duration_between(
    Some("2024-01-15T10:30:00.000GMT"),
    Some("2024-01-15T10:31:00.000GMT"),
) // Some(60_000)

duration_between(Some("..."), None) // None

Troubleshooting

Common Errors

spark-tui maps HTTP status codes from the Spark REST API to user-friendly error messages:

| Error | HTTP Status | Meaning | Solution |
|---|---|---|---|
| Unauthorized | 401 | Token expired or invalid | Regenerate your token at Databricks Settings > Developer > Access Tokens |
| Forbidden | 403 | Insufficient permissions | Check that your token has access to the specified cluster |
| Not Found | 404 | Spark UI not available | The Spark application may have ended; start a new Spark session or check the application ID |
| Service Unavailable | 503 | Cluster not reachable | spark-tui automatically checks cluster state and attempts to load historical data if the cluster is terminated; if the fallback fails, verify the cluster is running or provide `--event-log-path` / `--sparkui-cookie` |
| No Applications | — | No Spark apps on the cluster | Ensure a Spark session is active on the cluster (e.g., run a notebook or submit a job) |
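Conceptually, this mapping is a match on the status code. The sketch below uses hypothetical message strings and a plain `&str` return; the crate's actual error type and wording differ:

```rust
/// Hypothetical status-to-hint mapping, for illustration only.
fn error_hint(status: u16) -> &'static str {
    match status {
        401 => "Unauthorized: regenerate your token in Databricks Settings",
        403 => "Forbidden: check that your token can access the cluster",
        404 => "Not Found: the Spark application may have ended",
        503 => "Service Unavailable: cluster not reachable",
        // Anything else is surfaced for manual inspection.
        _ => "Unexpected status; see /tmp/spark-tui.log for details",
    }
}
```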

Configuration Errors

“Missing ‘host’ / ‘token’ / ‘cluster_id’”

spark-tui couldn’t find all three required fields. Check that you’ve provided them via CLI flags, environment variables, or ~/.databrickscfg. See Configuration for details.

“Profile ‘xyz’ not found in ~/.databrickscfg”

The --profile flag specifies a section name that doesn’t exist in your ~/.databrickscfg file. The error message lists available profiles.

Auto-detection fails

When no --profile is specified, spark-tui looks for the first profile in ~/.databrickscfg that has all three required fields (host, token, cluster_id). If no profile is complete, you’ll get a missing fields error.
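The selection rule amounts to "first profile whose required fields are all present." A sketch with a hypothetical in-memory representation (the crate's actual config parser is more involved):

```rust
/// A parsed ~/.databrickscfg profile (hypothetical representation).
struct Profile {
    name: String,
    host: Option<String>,
    token: Option<String>,
    cluster_id: Option<String>,
}

/// Return the first profile that has all three required fields,
/// preserving the file order of the profiles.
fn auto_detect(profiles: &[Profile]) -> Option<&Profile> {
    profiles
        .iter()
        .find(|p| p.host.is_some() && p.token.is_some() && p.cluster_id.is_some())
}
```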

Connection Issues

Timeout or no response

  • Verify the cluster is in a Running state in Databricks
  • Check that --host matches your workspace URL (e.g., adb-1234567890.azuredatabricks.net)
  • Ensure --cluster-id is correct (find it in the cluster’s URL or configuration page)

TLS errors

spark-tui uses rustls for TLS. If you’re behind a corporate proxy with custom CA certificates, you may need to set the SSL_CERT_FILE or SSL_CERT_DIR environment variables.

Log File

spark-tui writes logs to /tmp/spark-tui.log. To increase verbosity:

RUST_LOG=debug spark-tui --host ... --token ... --cluster-id ...

Available log levels: error, warn (default), info, debug, trace.

Check the log file for detailed error information:

tail -f /tmp/spark-tui.log

Terminal Issues

Display is corrupted after a crash

If spark-tui exits abnormally (e.g., killed by a signal), the terminal may remain in raw mode. Reset it with:

reset

spark-tui installs a panic hook that attempts to restore the terminal on panic, but external signals bypass this.

Colors look wrong

spark-tui uses 256-color mode via ratatui/crossterm. Ensure your terminal emulator supports 256 colors and that TERM is set correctly (e.g., xterm-256color).

SQL Rendering Artifacts

If SQL plan text appears corrupted or causes display glitches, this is likely caused by raw newlines embedded in SQL text. spark-tui sanitizes these via sanitize_for_span() in util/format.rs, which replaces embedded \n, \r, and \t characters with spaces before passing text to ratatui’s Line/Span types. Ratatui’s differential renderer tracks cursor positions per line, so embedded newlines corrupt its state.
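The sanitization step is essentially a character-by-character replacement. A minimal sketch of the behavior described above (not necessarily the function's exact implementation):

```rust
/// Replace embedded newlines, carriage returns, and tabs with spaces
/// so the text is safe to place in a single ratatui `Span`.
fn sanitize_for_span(s: &str) -> String {
    s.chars()
        .map(|c| if matches!(c, '\n' | '\r' | '\t') { ' ' } else { c })
        .collect()
}
```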

If you encounter rendering artifacts, check whether the SQL text contains unusual control characters and file an issue.

Historical Mode

Spark UI shows “loading” but never becomes ready

The Historical Spark UI needs to download and parse event logs from DBFS, which can take a while for large applications. spark-tui retries with backoff for ~53 seconds. If it still doesn’t become ready:

  • Try opening the Spark UI in your browser first to trigger the warm-up
  • Check the log file (/tmp/spark-tui.log) for the exact URL being probed
  • The event log download may take longer than 53 seconds for very large applications — try again after waiting

Historical data loads but is incomplete

When using historical mode, some data may not be available:

  • Executor metrics are not available after termination (cluster resources will show as default/zero)
  • Real-time task data is replaced by complete post-mortem task data
  • SQL plan descriptions may be less detailed depending on the data source

If --sparkui-cookie doesn’t work:

  1. Verify the cookie is from the correct domain (adb-dp-*, not adb-*)
  2. Cookies expire — regenerate by visiting the Spark UI in your browser
  3. Check /tmp/spark-tui.log for the HTTP status code returned by cookie probes
  4. The cookie value should be the JWT-like string from DATAPLANE_DOMAIN_DBAUTH, not the entire cookie header

All historical strategies fail

If spark-tui reports “Could not load historical data”, check:

  1. Cluster log delivery — is it configured? (Cluster settings > Logging)
  2. DBFS permissions — does your token have access to read DBFS paths?
  3. Event log path — try specifying it explicitly with --event-log-path
  4. Spark UI cookie — try providing --sparkui-cookie (see Configuration)

Enable debug logging to see which strategies were attempted:

RUST_LOG=debug spark-tui --cluster-id ...

Deserialization Errors

If the log shows deserialization errors, the Spark API may have returned an unexpected response format. This can happen with:

  • Very old or very new Databricks Runtime versions
  • Custom Spark configurations that alter the REST API response

File an issue with the error message and your Databricks Runtime version.

Contributing

Development Setup

Prerequisites

  • Rust 1.85+ (edition 2024)
  • A Databricks workspace for testing (optional for code changes, required for integration testing)

Clone and build

git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo build

Run locally

cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...

Project Structure

src/
├── main.rs          Entry point
├── config.rs        CLI args + config resolution
├── fetch/           HTTP client, API types, polling
├── analyze/         Suspect detection and classification
├── tui/             App state, rendering, widgets
└── util/            Formatting and time utilities

See Architecture for a detailed breakdown.

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a specific test module
cargo test config::tests
cargo test analyze::skew::tests
cargo test analyze::suspects::tests

The test suite covers:

  • Config parsing (~/.databrickscfg format, profile detection, URL normalization)
  • Skew detection (uniform tasks, warning-level skew, critical skew)
  • Suspect detection (slow stages, spill, bottleneck classification)
  • SQL linking (job-to-SQL mapping)
  • Formatting utilities (duration, bytes, truncation, stage name cleaning)
  • Time parsing (RFC3339, naive, GMT suffix formats)

Code Style

  • Edition 2024 — use current Rust idioms
  • Error handling — use thiserror for error types, Result for fallible operations
  • Formatting — run cargo fmt before committing
  • Linting — run cargo clippy and address warnings

cargo fmt --check
cargo clippy -- -D warnings

Dependencies

| Crate | Purpose |
|---|---|
| `clap` | CLI argument parsing with env var fallback |
| `tokio` | Async runtime (`macros`, `rt-multi-thread`, `time`, `sync` features) |
| `reqwest` | HTTP client (with `rustls-tls`, no default features) |
| `serde` / `serde_json` | JSON deserialization |
| `thiserror` | Error type derivation |
| `ratatui` | Terminal UI framework (`unstable-rendered-line-info` feature) |
| `crossterm` | Terminal backend |
| `tracing` / `tracing-subscriber` | Structured logging (with `env-filter`) |
| `chrono` | Timestamp parsing |
| `syntect` / `syntect-tui` | SQL syntax highlighting |
| `tui-scrollview` | Smooth scrollable views for detail panels |

CI/CD

The project uses four GitHub Actions workflows:

  • ci.yml — runs cargo fmt --check, cargo clippy, and cargo test on every push and PR
  • docs.yml — builds and deploys mdbook documentation to GitHub Pages
  • auto-tag.yml — creates a vX.Y.Z git tag when Cargo.toml version changes on master
  • release.yml — triggered by v* tags; builds cross-platform release binaries (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) and creates a GitHub Release with artifacts

Conventions

  • Keep analysis logic in analyze/, not in the TUI layer
  • Keep API types in fetch/types.rs, not scattered across modules
  • Format functions go in util/format.rs
  • Each suspect detector is a pure function: (&[SparkStage], &SuspectContext) -> Vec<Suspect>
  • The poller is the only place where API calls and analysis are composed together
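Under these conventions, a detector is a side-effect-free function over the fetched data. The sketch below uses minimal stand-in types with illustrative fields, not the crate's actual `SparkStage`/`SuspectContext`/`Suspect` definitions:

```rust
/// Minimal stand-ins for the real fetch/analyze types (illustrative).
struct SparkStage { stage_id: i64, duration_ms: i64 }
struct SuspectContext { slow_stage_threshold_ms: i64 }
struct Suspect { stage_id: i64, reason: String }

/// A pure detector: no I/O, no shared state, just data in and findings out.
fn detect_slow_stages(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect> {
    stages
        .iter()
        .filter(|s| s.duration_ms > ctx.slow_stage_threshold_ms)
        .map(|s| Suspect {
            stage_id: s.stage_id,
            reason: format!("stage ran for {} ms", s.duration_ms),
        })
        .collect()
}
```

Because detectors are pure, they can be unit-tested with hand-built stage lists, which is how the `analyze::*::tests` modules exercise them.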