Introduction

spark-tui is a terminal-based performance analysis tool for Apache Spark applications running on Databricks. It connects to the Spark REST API through the Databricks driver proxy and presents live job metrics, stage breakdowns, and automated suspect detection in an interactive TUI.

Why spark-tui?

Debugging Spark performance problems typically involves clicking through the Spark UI in a browser, manually comparing stage durations, and guessing which stages have data skew or excessive spill. This process is slow and error-prone.

spark-tui automates this analysis:

  • Automatic suspect detection — identifies slow stages, data skew, and disk spill without manual inspection
  • Bottleneck classification — categorizes root causes as Large Scan, Wide Shuffle, or Data Explosion
  • Actionable recommendations — each finding includes a concrete tuning suggestion
  • SQL correlation — links stages back to the originating SQL query and shows plan hints
  • Live updates — polls the Spark API on a configurable interval and refreshes the display

What You See

The interface has two main tabs:

  1. Jobs — all Spark jobs ranked by duration (slowest first), with drill-down to stage details, duration bar charts, and SQL execution plans
  2. Suspects — automatically detected performance issues, sorted by severity (critical first), with category labels, I/O summaries, and recommendations

How It Works

spark-tui connects to the Spark History Server API exposed through Databricks’ driver proxy endpoint:

https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1

A background poller fetches jobs, stages, SQL executions, and task lists at regular intervals. The analysis engine processes this data to detect anomalies, then the TUI renders the results in real time.

Quick Start

Prerequisites

  • Rust toolchain — version 1.85 or later (spark-tui uses edition 2024)
  • Databricks workspace — with a running cluster and an active Spark application
  • Personal access token — generated from Databricks Settings > Developer > Access Tokens

Installation

git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo install --path .

Or run directly without installing:

cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...

Configuration

You need three pieces of information:

| Field | Example |
|---|---|
| Workspace host | adb-1234567890.azuredatabricks.net |
| Personal access token | dapi0123456789abcdef... |
| Cluster ID | 0123-456789-abcdef |

Provide them via any of these methods (highest priority first):

Option 1: CLI flags

spark-tui \
  --host adb-123.azuredatabricks.net \
  --token dapi0123456789abcdef \
  --cluster-id 0123-456789-abcdef

Option 2: Environment variables

export DATABRICKS_HOST=adb-123.azuredatabricks.net
export DATABRICKS_TOKEN=dapi0123456789abcdef
export DATABRICKS_CLUSTER_ID=0123-456789-abcdef
spark-tui

Option 3: ~/.databrickscfg file

Create or edit ~/.databrickscfg:

[my-workspace]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef
cluster_id = 0123-456789-abcdef

Then run with a specific profile:

spark-tui --profile my-workspace

Or let spark-tui auto-detect the first complete profile:

spark-tui

First Run

When spark-tui starts, it will:

  1. Resolve configuration (CLI > env > databrickscfg)
  2. Connect to the Spark REST API via the driver proxy
  3. Discover the active Spark application
  4. Fetch jobs, stages, and SQL executions
  5. Display the Jobs tab with results ranked by duration

You should see a table of Spark jobs. Use j/k or arrow keys to navigate, Enter to drill into a job, and Tab to switch to the Suspects tab.

If something goes wrong, check the Troubleshooting guide.

Configuration

spark-tui requires three credentials to connect to a Databricks cluster: host, token, and cluster ID. These can be provided through CLI flags, environment variables, or a ~/.databrickscfg file.

Priority Resolution

Configuration is resolved in this order (highest priority first):

  1. CLI flags — --host, --token, --cluster-id
  2. Environment variables — DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_CLUSTER_ID
  3. ~/.databrickscfg — INI-format file with profile sections

CLI flags and environment variables are handled by clap with the env feature — each flag falls back to its corresponding env var automatically.

If CLI flags and environment variables don't supply all three required fields, spark-tui reads ~/.databrickscfg to fill the gaps. You can mix sources: for example, set host and token via env vars but cluster_id via the config file.
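This field-by-field fallback can be sketched as follows (a minimal illustration; the actual Config type in config/mod.rs will differ):

```rust
// Illustrative sketch of layered config resolution (CLI > env > file).
#[derive(Default, Clone)]
struct PartialConfig {
    host: Option<String>,
    token: Option<String>,
    cluster_id: Option<String>,
}

impl PartialConfig {
    /// Fill any missing field from a lower-priority source.
    fn or_else_from(mut self, fallback: &PartialConfig) -> PartialConfig {
        self.host = self.host.or_else(|| fallback.host.clone());
        self.token = self.token.or_else(|| fallback.token.clone());
        self.cluster_id = self.cluster_id.or_else(|| fallback.cluster_id.clone());
        self
    }

    /// All three required fields are present.
    fn is_complete(&self) -> bool {
        self.host.is_some() && self.token.is_some() && self.cluster_id.is_some()
    }
}
```

Resolution applies `or_else_from` once per lower-priority layer: fields already set by CLI/env are never overwritten by the config file.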

CLI Reference

| Flag | Short | Env Var | Default | Description |
|---|---|---|---|---|
| --host | | DATABRICKS_HOST | | Workspace hostname |
| --token | | DATABRICKS_TOKEN | | Personal access token |
| --cluster-id | | DATABRICKS_CLUSTER_ID | | Cluster ID |
| --profile | -p | DATABRICKS_CONFIG_PROFILE | auto-detect | Profile name from ~/.databrickscfg |
| --poll-interval | | SPARK_TUI_POLL_INTERVAL | 10 | Poll interval in seconds |
| --event-log-path | | SPARK_TUI_EVENT_LOG_PATH | | DBFS path to a Spark event log file |
| --sparkui-cookie | | SPARK_TUI_SPARKUI_COOKIE | | DATAPLANE_DOMAIN_DBAUTH cookie value |

~/.databrickscfg Format

The file uses INI format with named profile sections:

[DEFAULT]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef

[production]
host = adb-999.azuredatabricks.net
token = dapi_prod_token
cluster_id = 0123-456789-prod

[development]
host = adb-123.azuredatabricks.net
token = dapi_dev_token
cluster_id = 0456-789012-dev

Profile selection

  • Explicit: spark-tui --profile production uses the [production] section
  • Auto-detect: without --profile, spark-tui scans all profiles and uses the first one that has all three required fields (host, token, cluster_id)

If the named profile doesn’t exist, spark-tui lists available profiles in the error message.

Base URL Construction

spark-tui constructs the Spark REST API base URL as:

https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1

The host field is normalized: any https:// prefix and trailing slashes are stripped before URL construction.
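A sketch of that normalization and URL construction (illustrative, not the exact code in fetch/client.rs):

```rust
/// Normalize a workspace host (strip any https:// prefix and trailing
/// slashes) and build the driver-proxy base URL described above.
fn base_url(host: &str, cluster_id: &str) -> String {
    let host = host.trim_start_matches("https://").trim_end_matches('/');
    format!("https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1")
}
```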

Poll Interval

The --poll-interval flag controls how often spark-tui refreshes data from the Spark API (default: 10 seconds). Lower values give more responsive updates but increase API load.

# Refresh every 5 seconds
spark-tui --poll-interval 5

Historical Mode

When spark-tui detects a terminated cluster (HTTP 503 or INVALID_STATE response), it automatically attempts to load historical Spark data using a 4-strategy fallback chain:

| Priority | Strategy | Description |
|---|---|---|
| 0 | Spark UI REST API | Probes https://{host}/sparkui/{cluster}/{driver}/api/v1/. Requires spark_context_id from the cluster info. Also tries the dataplane domain variant (adb-dp- prefix). |
| 1 | Spark History Server | Probes known Databricks history server proxy URLs (multiple path patterns). |
| 2 | DBFS event logs | Reads event logs from the cluster’s cluster_log_conf delivery path, or from --event-log-path if specified. |
| 3 | Default DBFS paths | Scans well-known DBFS directories (dbfs:/cluster-logs/, dbfs:/databricks/spark/eventLogs/, etc.) for event log files. |

The first strategy that succeeds provides the historical data. The status line shows a HISTORICAL badge.

Spark UI warm-up

The Historical Spark UI needs to download and parse event logs from DBFS before serving JSON data. During this warm-up phase, it returns an HTML loading page instead of JSON. spark-tui detects this and retries with backoff (3s, 5s, 10s, 15s, 20s — ~53s total), showing progress messages like “Spark UI loading… retrying (2/5)”.
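The schedule above can be expressed as a constant (names are illustrative; the real retry loop lives in the fetch layer):

```rust
/// Warm-up retry delays in seconds, matching the schedule described above.
const WARMUP_DELAYS_SECS: [u64; 5] = [3, 5, 10, 15, 20];

/// Total time spent waiting if all five retries are exhausted (~53s).
fn total_warmup_secs() -> u64 {
    WARMUP_DELAYS_SECS.iter().sum()
}
```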

On authenticated Databricks workspaces, the Spark UI endpoint requires a cookie instead of a Bearer token:

  1. Open the Databricks workspace in your browser
  2. Navigate to your cluster’s Spark UI tab (this warms up the endpoint)
  3. Open browser DevTools (F12) → Application → Cookies
  4. Find the adb-dp-* domain (e.g., adb-dp-1234567890.azuredatabricks.net)
  5. Copy the value of the DATAPLANE_DOMAIN_DBAUTH cookie

Then pass it to spark-tui:

spark-tui --sparkui-cookie "eyJ0eXAiOiJKV1Q..." --cluster-id 0123-456789-abcdef
# or
export SPARK_TUI_SPARKUI_COOKIE="eyJ0eXAiOiJKV1Q..."
spark-tui --cluster-id 0123-456789-abcdef

Using --event-log-path

If you know the exact DBFS path to a Spark event log file:

spark-tui --event-log-path "dbfs:/cluster-logs/0123-456789-abcdef/eventlog/events.log.gz"

This is useful when the automatic DBFS scanning doesn’t find your logs (e.g., custom log delivery paths).

Logging

spark-tui writes logs to /tmp/spark-tui.log (logs cannot go to stderr as it would corrupt the TUI). Control the log level with the RUST_LOG environment variable:

RUST_LOG=info spark-tui    # Info and above
RUST_LOG=debug spark-tui   # Debug messages
RUST_LOG=trace spark-tui   # Everything

Default log level is warn.

Navigation & Keybindings

spark-tui uses vim-style keybindings for navigation. The interface has three view modes arranged in a drill-down hierarchy.

View Modes

List ──Enter──▶ JobDetail ──Enter──▶ StageDetail
                    │
                    └──s──▶ SqlDetail

Esc goes back one level from any view.

  1. List — the top-level view showing either the Jobs or Suspects tab
  2. JobDetail — stage breakdown and duration bar chart for a selected job
  3. StageDetail — detailed metrics for a selected stage (I/O, CPU %, peak RAM, task histograms, per-executor breakdown, skew metrics)
  4. SqlDetail — scrollable SQL execution plan for the selected job’s query

Keybindings

Global

| Key | Action |
|---|---|
| q | Quit the application |
| Esc | Go back one level (SqlDetail → JobDetail → List → Quit) |
| h | Toggle help overlay (keybinding reference in most views; PySpark recommendations in SqlDetail) |

List Mode (Jobs / Suspects tabs)

| Key | Action |
|---|---|
| Tab | Switch to next tab |
| Shift+Tab | Switch to previous tab |
| j / ↓ | Move selection down |
| k / ↑ | Move selection up |
| g / Home | Jump to first row |
| G / End | Jump to last row |
| Enter | Drill into the selected job’s detail view |

Status bar hint: q:quit Tab:switch j/k:nav Enter:detail h:help

JobDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Move selection down in the stage list |
| k / ↑ | Move selection up in the stage list |
| g / Home | Jump to first stage |
| G / End | Jump to last stage |
| Enter | Drill into the selected stage’s detail view |
| s | Open SQL plan view (if the job has a linked SQL execution) |
| Esc | Return to List mode |

Status bar hint: Esc:back j/k:nav Enter:stage s:sql h:help

Note: When entering JobDetail from the Suspects tab, pressing Esc returns to the Suspects tab (not Jobs). This is tracked via the return_tab field.

StageDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| Esc | Return to JobDetail mode |

Status bar hint: Esc:back j/k:scroll g/G:top/bot h:help

SqlDetail Mode

| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| h | Show PySpark recommendations for suspects related to this SQL execution |
| Esc | Return to JobDetail mode |

Status bar hint: Esc:back j/k:scroll g/G:top/bot h:hints

Help Overlay

Pressing h toggles a help overlay:

  • In List, JobDetail, and StageDetail modes: Shows a general keybinding reference card listing all available shortcuts for the current view.
  • In SqlDetail mode: Shows PySpark-specific recommendations based on suspect findings related to the current SQL execution.

Press h again or Esc to dismiss the overlay.

Tabs

Jobs Tab

Displays all Spark jobs in a table, ranked by duration (slowest first). Running jobs (with no completion time) appear at the top. Columns include:

  • Job ID
  • Status (with color coding)
  • Duration
  • Task counts
  • SQL description (if linked)
  • Submission time

Suspects Tab

Displays automatically detected performance issues, sorted by severity (Critical first), then by estimated savings descending as a tiebreaker. Each row shows:

  • Severity indicator (color-coded)
  • Category (Slow Stage / Data Skew / Data Size Skew / Record Count Skew / Disk Spill / CPU Bottleneck / I/O Bottleneck / Record Explosion / Task Failures / Memory Pressure / Executor Hotspot / Too Many Partitions / Too Few Partitions / Broadcast Join Opportunity / Python UDF / Cache Opportunity)
  • Stage ID and job ID
  • Title with key metrics
  • Detail summary
  • Recommendation

Color Coding

| Color | Meaning |
|---|---|
| Red | Critical severity, failed status |
| Yellow | Warning severity, running status |
| Green | Healthy / succeeded status |
| Gray | Muted / secondary information |
| Cyan | Selected row highlight |
| Magenta | CP — Critical Path stage (longest-running stage per job) |

CPU Utilization (Stage Detail)

| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% or < 30% | CPU saturated or severe I/O bound |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized |

Peak Memory (Stage Detail)

Color-coded relative to total cluster memory (ratio-based), with absolute fallback when executor data is unavailable. See Understanding Analysis for threshold details.

Summary Bar

The health summary bar in List view uses colored foreground text (not colored background) to display job/IO counts and top issues: red for critical, yellow for warning, green for healthy.

Understanding Analysis

spark-tui automatically detects performance issues in your Spark application and presents them as suspects. This guide explains how each detector works, what the thresholds mean, and how to act on the findings.

Suspect Categories

Slow Stage

Detects stages whose executor_run_time is statistically anomalous compared to all completed stages.

How it works:

  1. Computes the mean and standard deviation of executor_run_time across all completed stages
  2. Flags stages that exceed the threshold

| Severity | Threshold |
|---|---|
| Warning | executor_run_time > mean + 2 * stddev |
| Critical | executor_run_time > mean + 4 * stddev |

The suspect detail shows how many times slower the stage is compared to the average (e.g., “3.5x slower than average”).
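The rule can be sketched in a few lines (illustrative function names, not the actual detector in analyze/suspects/):

```rust
/// Population mean and standard deviation of completed-stage runtimes.
fn mean_stddev(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Warning above mean + 2σ, critical above mean + 4σ, per the table above.
fn slow_stage_severity(runtime: f64, mean: f64, stddev: f64) -> Option<&'static str> {
    if runtime > mean + 4.0 * stddev {
        Some("critical")
    } else if runtime > mean + 2.0 * stddev {
        Some("warning")
    } else {
        None
    }
}
```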

Data Skew

Detects uneven task duration distribution within a stage, indicating skewed partitions.

How it works:

  1. Collects all task durations for the stage
  2. Computes the coefficient of variation (CV = stddev / mean) and the max/median ratio
  3. Flags if either metric exceeds threshold

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |

The suspect detail identifies the slowest task, its duration vs. the median, and how much data it processed.

Note: Task-level analysis is performed for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed.
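A minimal sketch of the CV and max/median checks (assuming a non-empty duration list; names are illustrative):

```rust
/// Duration-skew check: coefficient of variation (stddev / mean) and
/// max/median ratio, with the thresholds from the table above.
fn skew_severity(mut durations: Vec<f64>) -> Option<&'static str> {
    let n = durations.len() as f64;
    let mean = durations.iter().sum::<f64>() / n;
    let var = durations.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / n;
    let cv = var.sqrt() / mean;
    durations.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = durations[durations.len() / 2];
    let max = *durations.last().unwrap();
    if cv > 2.0 || max > 10.0 * median {
        Some("critical")
    } else if cv > 1.0 || max > 3.0 * median {
        Some("warning")
    } else {
        None
    }
}
```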

Data Size Skew

Detects uneven data size distribution across tasks within a stage.

How it works:

  1. Computes the total bytes processed per task (input_bytes + shuffle_read_bytes)
  2. Applies the same CV and max/median ratio thresholds as duration skew

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |

The suspect detail identifies the task processing the most data, its byte count vs. the median.

Record Count Skew

Detects uneven record count distribution across tasks within a stage.

How it works:

  1. Computes the total records processed per task (input_records + shuffle_read_records)
  2. Applies CV and max/median ratio thresholds (only when max records > 1000)

| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median (and max > 1000) |
| Critical | CV > 2.0 or max > 10x median |

Indicates hot keys in joins or group-bys.

Disk Spill

Detects stages where data was spilled from memory to disk, indicating insufficient executor memory.

How it works:

  1. Checks disk_bytes_spilled for each stage
  2. Any spill > 0 is flagged

| Severity | Threshold |
|---|---|
| Warning | disk_bytes_spilled > 0 |
| Critical | disk_bytes_spilled > 1 GB |

The suspect detail shows both memory spill and disk spill amounts.

CPU Bottleneck

Detects stages where the CPU is fully saturated for a sustained period.

How it works:

  1. Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time
  2. Flags stages with high CPU ratio and significant runtime

| Severity | Threshold |
|---|---|
| Warning | cpu_ratio > 0.9 and runtime > 30s |

The suspect detail shows CPU time vs. runtime and utilization percentage.

I/O Bottleneck

Detects stages that are I/O or GC bound (low CPU utilization despite significant runtime).

How it works:

  1. Uses the same CPU ratio as CPU Bottleneck detection
  2. Flags stages with low CPU ratio

| Severity | Threshold |
|---|---|
| Warning | cpu_ratio < 0.3 and runtime > 10s |

Consider increasing memory, improving data locality, using faster storage, or checking GC pauses.
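Both detectors share the same ratio. A sketch of the two checks, with the unit conversion from the formula above (executor_cpu_time in nanoseconds, executor_run_time in milliseconds; names are illustrative):

```rust
/// cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time,
/// i.e. CPU time converted from ns to ms, divided by wall runtime in ms.
fn cpu_ratio(executor_cpu_time_ns: u64, executor_run_time_ms: u64) -> f64 {
    (executor_cpu_time_ns as f64 / 1_000_000.0) / executor_run_time_ms as f64
}

/// Thresholds from the CPU Bottleneck and I/O Bottleneck tables above.
fn classify_cpu_io(ratio: f64, runtime_ms: u64) -> Option<&'static str> {
    if ratio > 0.9 && runtime_ms > 30_000 {
        Some("cpu-bottleneck")
    } else if ratio < 0.3 && runtime_ms > 10_000 {
        Some("io-bottleneck")
    } else {
        None
    }
}
```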

Record Explosion

Detects stages where output records vastly exceed input records, indicating explode(), cross joins, or generate() operations.

How it works:

  1. Checks if output_records > 10x input_records (only when input_records > 1000)

| Severity | Threshold |
|---|---|
| Warning | output_records > 10x input_records |
| Critical | output_records > 100x input_records |

Task Failures

Detects stages with failed or killed tasks.

How it works:

  1. Checks if num_failed_tasks > 0 or num_killed_tasks > 0

| Severity | Threshold |
|---|---|
| Warning | Any failed or killed tasks |
| Critical | Failure rate > 10% or total problematic > 10 |

Common causes include OOM, data corruption, and fetch failures.

Memory Pressure

Detects stages where memory spill is occurring but hasn’t yet reached disk — a proactive warning before disk spill happens.

How it works:

  1. Checks if memory_bytes_spilled > 50 MB and disk_bytes_spilled == 0

| Severity | Threshold |
|---|---|
| Warning | memory_bytes_spilled > 50 MB with no disk spill |

Recommendation: increase spark.executor.memory or spark.executor.memoryOverhead, reduce partition size.

Executor Hotspot

Detects stages where a single executor handles a disproportionate share of data.

How it works:

  1. Sums input_bytes + shuffle_read_bytes per executor
  2. Flags executors processing > 50% of total data

| Severity | Threshold |
|---|---|
| Warning | One executor handles > 50% of data |

Check data locality and partition assignment. This may indicate skewed partition-to-executor mapping.

Too Many Partitions

Detects stages with excessive small partitions, causing high scheduling overhead.

How it works:

  1. Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
  2. Flags stages with too many tiny partitions

| Severity | Threshold |
|---|---|
| Warning | num_tasks > 10,000 and avg_bytes_per_task < 1 MB |

The recommendation suggests a target partition count to achieve ~128 MB/partition: df.coalesce(N).

Too Few Partitions

Detects stages with too few large partitions, causing stragglers and underutilized executors.

How it works:

  1. Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
  2. Flags stages with too few large partitions

| Severity | Threshold |
|---|---|
| Warning | num_tasks ≤ 8 and avg_bytes_per_task > 1 GB |

The recommendation suggests a target partition count: df.repartition(N).
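Both partition-count recommendations target roughly 128 MB per partition; the suggested N can be sketched as (constant and function names are illustrative):

```rust
/// Target partition size used by the coalesce/repartition suggestions.
const TARGET_PARTITION_BYTES: u64 = 128 * 1024 * 1024;

/// Suggested partition count N for df.coalesce(N) / df.repartition(N).
fn target_partitions(total_bytes: u64) -> u64 {
    total_bytes.div_ceil(TARGET_PARTITION_BYTES).max(1)
}
```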

Broadcast Join Opportunity

Detects shuffle joins where one side is small enough to broadcast, eliminating the shuffle entirely.

How it works:

  1. Filters stages with shuffle_write_bytes < 100 MB and executor_run_time > 5s
  2. Checks if the SQL plan contains join indicators (SortMerge, ShuffledHash, Join)

| Severity | Threshold |
|---|---|
| Warning | shuffle_write < 100 MB and join detected in SQL plan |

Recommendation: from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key').

Python UDF

Detects Python UDF usage in SQL plans, which causes row-by-row serialization overhead.

How it works:

  1. Searches the SQL plan hint for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner
  2. If the stage is also CPU-bound (ratio > 0.9, runtime > 30s), severity escalates to Critical

| Severity | Threshold |
|---|---|
| Warning | Python UDF marker found in plan, runtime > 5s |
| Critical | Python UDF + CPU ratio > 0.9 + runtime > 30s |

Recommendation: Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() functions.

Cache Opportunity

Detects repeated computations (stages with the same name) that could benefit from caching.

How it works:

  1. Groups completed stages by cleaned name
  2. Flags groups where ≥ 2 stages share a name and total runtime > 30s

| Severity | Threshold |
|---|---|
| Warning | ≥ 2 stages with same name, total runtime > 30s |

Recommendation: df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action. Call df.unpersist() when no longer needed.

Bottleneck Classification

When a slow stage or spill suspect is detected, spark-tui classifies the root cause based on I/O patterns:

| Pattern | Condition | Meaning |
|---|---|---|
| Data Explosion | input > 100 MB and output > 5x input | Stage produces far more data than it reads (e.g., explode, cross join) |
| Large Scan | input > 1 GB and input > 10x (output + shuffle_write) | Stage reads a lot but produces little (missing pushdown filters) |
| Wide Shuffle | shuffle_write > 500 MB or shuffle_read > input | Stage shuffles more data than it reads directly (broad join, groupBy on high-cardinality key) |
| Record Explosion | output_records > 10x input_records | Attached to record explosion suspects (see above) |

If none of these patterns match, no bottleneck tag is shown.
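The classification order can be sketched as a chain of pattern checks (an illustration of the table above, not the actual classifier; the record-based pattern is handled separately by the record explosion detector):

```rust
/// Classify a stage's bottleneck from its byte-level I/O, matching the
/// conditions in the table above. Returns None when no pattern applies.
fn classify_bottleneck(
    input: u64,
    output: u64,
    shuffle_read: u64,
    shuffle_write: u64,
) -> Option<&'static str> {
    const MB: u64 = 1024 * 1024;
    const GB: u64 = 1024 * MB;
    if input > 100 * MB && output > 5 * input {
        Some("Data Explosion")
    } else if input > GB && input > 10 * (output + shuffle_write) {
        Some("Large Scan")
    } else if shuffle_write > 500 * MB || shuffle_read > input {
        Some("Wide Shuffle")
    } else {
        None
    }
}
```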

Recommendations

All recommendations use PySpark-specific syntax for immediate applicability. Each suspect includes a recommendation based on its category and bottleneck pattern:

| Category + Bottleneck | Recommendation |
|---|---|
| Data Skew | Repartition or salt skewed keys |
| Data Size Skew | Repartition by a more uniform key or use salting |
| Record Count Skew | Check for hot keys in joins or group-bys |
| Disk Spill | spark.conf.set('spark.executor.memory', '8g') or df.repartition(200) |
| CPU Bottleneck | Replace @udf with @pandas_udf or native F.when()/F.expr(). Cache with df.cache() and increase parallelism |
| I/O Bottleneck | spark.conf.set('spark.executor.memory', '8g'), cache hot DataFrames with df.cache(), or use df.repartition() for better locality |
| Record Explosion | Filter before explode: df.filter(...).select(explode('col')). Check for unintentional cross joins |
| Task Failures | Check executor logs for OOM/fetch failures. spark.conf.set('spark.task.maxFailures', '4') |
| Memory Pressure | spark.conf.set('spark.executor.memory', '8g') and spark.conf.set('spark.executor.memoryOverhead', '2g') |
| Executor Hotspot | Check data locality and partition assignment |
| Too Many Partitions | df.coalesce(N) to target ~128 MB/partition |
| Too Few Partitions | df.repartition(N) to target ~128 MB/partition |
| Broadcast Join Opportunity | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key') |
| Python UDF | Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() |
| Cache Opportunity | df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action |
| Slow Stage + Large Scan | df.filter(F.col('date') >= '2024-01-01') and select only needed columns. Use partition pruning |
| Slow Stage + Wide Shuffle | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), ...). Pre-aggregate with groupBy before joins |
| Slow Stage + Data Explosion | Filter before explode: df.filter(...).withColumn('x', explode('arr')) |
| Slow Stage (no pattern) | df.explain(True) to see the query plan. Large shuffle may indicate missing filters or broad joins |

Estimated Savings

Each suspect includes an estimated_savings_ms field — a rough estimate of how much time could be saved by addressing the issue. This is used as a secondary sort key (after severity) so that higher-impact issues appear first within the same severity level.

How savings are computed per category

| Category | Estimation Method |
|---|---|
| Slow Stage | executor_run_time - mean_runtime (time above average) |
| Disk Spill | ~30% of executor_run_time (spill overhead) |
| CPU Bottleneck | ~20% of executor_run_time |
| I/O Bottleneck | ~20% of executor_run_time |
| Record Explosion | ~50% of executor_run_time |
| Task Failures | executor_run_time × failure_rate (retry overhead) |
| Memory Pressure | ~10% of executor_run_time (GC pause overhead) |
| Too Many Partitions | ~40% of executor_run_time (scheduling overhead) |
| Too Few Partitions | ~50% of executor_run_time (straggler overhead) |
| Broadcast Join Opportunity | ~60% of executor_run_time (shuffle elimination) |
| Python UDF | ~50% of executor_run_time (serialization overhead) |
| Cache Opportunity | total_runtime - min_single_runtime (repeated computation) |

These are heuristic estimates intended for prioritization, not precise predictions.
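The ordering this produces can be sketched with a simple comparator (types are illustrative stand-ins for the real Suspect struct):

```rust
/// Severity variants in display order: deriving Ord on this enum makes
/// Critical sort before Warning.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
enum Severity {
    Critical,
    Warning,
}

/// Sort (severity, estimated_savings_ms) pairs: severity ascending
/// (Critical first), then savings descending as the tiebreaker.
fn sort_suspects(suspects: &mut Vec<(Severity, u64)>) {
    suspects.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));
}
```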

SQL Correlation

Each suspect is linked to its originating SQL execution when possible. The suspect shows:

  • SQL ID — the Spark SQL execution identifier
  • SQL Description — the query text or description
  • SQL Plan Hint — the top operations from the physical plan (e.g., “HashAggregate -> Exchange -> Scan parquet”)

This helps trace the suspect back to the specific query that caused it.

I/O Summary

Slow stage and spill suspects include an I/O summary showing:

  • Input bytes / records
  • Output bytes / records
  • Shuffle read bytes / records
  • Shuffle write bytes / records
  • Memory and disk spill amounts

Use this to understand the data flow through the flagged stage.

Severity Sorting

Suspects are sorted by severity (Critical first, then Warning), with estimated_savings_ms descending as a tiebreaker within the same severity level. The Suspects tab title reflects this: "Suspects (severity → savings)".

Color-Coding in Stage Detail

The stage detail view uses color-coded metrics to help identify issues at a glance.

CPU Utilization

The CPU % value in the stage header is color-coded based on the CPU ratio (executor_cpu_time / executor_run_time):

| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% | CPU saturated |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized (possible I/O bound) |
| Red | < 30% | Severe I/O bound |

Peak Memory

Peak execution memory is color-coded relative to total cluster memory when executor data is available:

| Color | Ratio to cluster memory | Meaning |
|---|---|---|
| Red | ≥ 80% | Near memory limit |
| Yellow | 50%–79% | Moderate usage |
| Green | 10%–49% | Comfortable |
| Default | < 10% | Low usage |

When executor data is unavailable, absolute thresholds are used as fallback:

| Color | Threshold | Meaning |
|---|---|---|
| Red | ≥ 10 GB | High memory usage |
| Yellow | ≥ 1 GB | Moderate |
| Green | ≥ 100 MB | Normal |
| Default | < 100 MB | Low |

Architecture

spark-tui follows a modular architecture with clear separation between configuration, data fetching, analysis, and rendering.

Module Map

src/
├── main.rs
├── config/
│   └── mod.rs             CLI args, env vars, ~/.databrickscfg parsing
├── fetch/
│   ├── client.rs          SparkHttpClient + FetchError
│   ├── spark.rs           Endpoint methods (get_jobs, get_stages, etc.)
│   ├── types.rs           Spark API response types (serde)
│   ├── databricks.rs      DatabricksClient (cluster info, DBFS, sparkui, history server)
│   ├── orchestrator.rs    poll_once, assemble_data_payload, compute_health_summary
│   ├── poller.rs          run_poller + historical fallback chain
│   └── eventlog/          Event log parsing (DBFS download, gzip, SparkEvent serde)
├── analyze/
│   ├── types.rs           Suspect, Severity, SuspectCategory, BottleneckPattern
│   ├── skew/              Data skew detection (CV + max/median)
│   ├── suspects/          SuspectContext, 10 detectors, bottleneck classification
│   └── sql_linker/        Job ↔ SQL ↔ Stage mapping
├── tui/
│   ├── app/               App state, event loop, key handling, rendering dispatch
│   │   ├── state.rs
│   │   ├── input.rs
│   │   └── render.rs
│   ├── theme.rs           Color/style functions
│   ├── highlight.rs       SQL/plan syntax highlighting
│   ├── tabs/
│   │   ├── jobs_list.rs   Jobs table
│   │   ├── job_detail.rs  Stage breakdown for a job
│   │   ├── sql_detail.rs  SQL execution plan view
│   │   ├── stage_detail.rs Detailed stage metrics
│   │   └── suspects.rs    Suspects table view
│   └── widgets/
│       ├── help.rs        Help overlay
│       ├── status_line.rs Status bar
│       └── summary_bar.rs Health summary bar
└── util/
    ├── format/            format_duration_ms, format_bytes, truncate, clean_stage_name
    └── time/              Spark timestamp parsing, duration_between

Data Flow

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Config  │────▶│ SparkHttp    │────▶│   Poller     │────▶│  Analysis │
│ resolve  │     │  Client      │     │ (poll_once)  │     │  Engine   │
└──────────┘     └──────────────┘     └──────┬───────┘     └─────┬─────┘
                                             │                   │
                                    DataPayload              Suspects
                                    + stage_sql_hints        (via SuspectContext)
                                    + critical_stages
                                             │                   │
                                             ▼                   ▼
                                     ┌───────────────────────────────┐
                                     │          App (TUI)            │
                                     │  event loop ← mpsc channel   │
                                     └───────────────────────────────┘

Step by step:

  1. Config resolution (config/mod.rs) — parses CLI args, env vars, and ~/.databrickscfg to produce a Config struct with host, token, cluster_id, and poll_interval

  2. HTTP client (fetch/client.rs) — SparkHttpClient wraps reqwest::Client with the base URL and token. FetchError maps HTTP status codes to user-friendly messages

  3. Endpoint methods (fetch/spark.rs) — discover_app_id, get_jobs, get_stages, get_sql_executions, get_task_list, get_executors — each calls the Spark REST API and deserializes the response

  4. Background poller (fetch/poller.rs) — run_poller runs in a tokio task. When the cluster becomes unreachable (503 or terminated), the poller automatically falls back to historical data via a 4-strategy chain: Spark UI REST API (with warm-up retry), Spark History Server proxy, DBFS event logs, and default DBFS path scanning. poll_once lives in fetch/orchestrator.rs (separate from the poller loop):

    • Fetches jobs, stages, SQL executions, and executors concurrently via 4-way tokio::join!
    • Aggregates active executors into ClusterResources (total memory, cores, executor count)
    • Builds cross-reference maps (job↔SQL, stage↔job)
    • Creates a SuspectContext with cross-reference maps
    • Runs 10 stage-level detectors via function pointer table, plus skew detection on task data
    • Fetches task lists for up to ~15 stages (selected by multiple heuristics)
    • Computes stage_sql_hints (SQL plan hints per stage) and critical_stages (longest wall-clock stage per job)
    • Computes HealthSummary for the summary bar
    • Sends a DataPayload (including cluster_resources, stage_sql_hints, critical_stages) through an mpsc channel
  5. Analysis (analyze/) — 10 stage-level detectors are dispatched via a function pointer table (&[DetectorFn]): detect_slow_stages, detect_spill, detect_cpu_efficiency, detect_record_explosion, detect_task_failures, detect_memory_pressure, detect_partition_count, detect_broadcast_join, detect_python_udf, detect_cache_opportunity. Each takes (&[SparkStage], &SuspectContext) and returns Vec<Suspect>. detect_skew runs separately on task data. aggregate_suspects sorts by severity then estimated_savings_ms

  6. App event loop (tui/app/) — App::run receives Action variants from the mpsc channel:

    • Action::DataUpdate(payload) — stores the new data
    • Action::FetchError(err) — stores the error message
    • Action::Key(event) — processes keybindings
    • Action::Resize(w, h) — triggers re-render
  7. Rendering (tui/tabs/, tui/widgets/) — renders the current view mode (List, JobDetail, StageDetail, SqlDetail) using ratatui widgets. The summary bar widget displays health metrics in List view

Async Model

spark-tui uses the tokio runtime with three concurrent tasks:

| Task | Channel | Description |
|---|---|---|
| Poller | tx → rx | Fetches data and sends Action::DataUpdate / Action::FetchError |
| Event reader | tx → rx | Reads terminal events via crossterm::event::read (blocking, wrapped in spawn_blocking) |
| App loop | rx | Receives all actions and processes them sequentially |

All tasks communicate through a single mpsc::UnboundedSender<Action> channel. The app loop owns the receiver and processes actions one at a time, ensuring thread-safe state updates without locks.
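The single-channel model can be sketched with std's mpsc (spark-tui uses tokio's unbounded channel, but the shape is the same; the Action variants here are simplified stand-ins):

```rust
use std::sync::mpsc;

// Simplified stand-in for the real Action enum.
enum Action {
    DataUpdate(String), // stands in for the real DataPayload
    FetchError(String),
    Key(char),
}

/// The app loop owns the sole receiver and applies actions sequentially,
/// so state updates need no locks. Here we just log each action.
fn drain(rx: mpsc::Receiver<Action>) -> Vec<String> {
    rx.iter()
        .map(|a| match a {
            Action::DataUpdate(d) => format!("data:{d}"),
            Action::FetchError(e) => format!("err:{e}"),
            Action::Key(k) => format!("key:{k}"),
        })
        .collect()
}
```

Multiple producers (poller, event reader) clone the sender; the loop ends when every sender is dropped.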

Design Decisions

  • Bounded task fetching: Task lists (per-task metrics) are fetched for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed
  • Concurrent fetches: Jobs, stages, SQL executions, and executors are fetched in parallel with 4-way tokio::join! to minimize latency
  • Function pointer dispatch: Stage-level detectors are stored in a &[DetectorFn] array and dispatched via flat_map, making it easy to add new detectors
  • SuspectContext: Replaces ad-hoc parameter passing — all cross-reference maps are bundled in a single struct with helper methods (job_id, resolve_sql, resolve_plan_hint_for, enrich)
  • tui-scrollview: Used for smooth scrolling in StageDetail and SqlDetail views, replacing manual u16 scroll offsets with ScrollViewState
  • Log file: Logs go to /tmp/spark-tui.log instead of stderr to avoid corrupting the TUI
  • Panic hook: A custom panic hook restores the terminal before printing the panic message, preventing terminal corruption
  • Edition 2024: Uses the latest Rust edition for modern language features

Dependencies

| Crate | Purpose |
|---|---|
| clap | CLI argument parsing with env var fallback |
| tokio | Async runtime (macros, rt-multi-thread, time, sync features) |
| reqwest | HTTP client (with rustls-tls) |
| serde / serde_json | JSON deserialization |
| thiserror | Error type derivation |
| ratatui | Terminal UI framework (with unstable-rendered-line-info feature) |
| crossterm | Terminal backend |
| tracing / tracing-subscriber | Structured logging |
| chrono | Timestamp parsing |
| syntect / syntect-tui | SQL syntax highlighting |
| tui-scrollview | Smooth scrollable views for detail panels |

CI/CD Workflows

| Workflow | Trigger | Description |
|---|---|---|
| ci.yml | Push / PR | Runs cargo fmt --check, cargo clippy, cargo test |
| docs.yml | Push / PR | Builds and deploys mdbook documentation to GitHub Pages |
| auto-tag.yml | Push to master (Cargo.toml changed) | Creates a vX.Y.Z tag when the version in Cargo.toml changes |
| release.yml | Tag v* | Cross-platform release builds (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) with GitHub Release artifacts |

Module Reference

This section provides a reference for each module in the spark-tui codebase.

Modules

| Module | Path | Description |
|---|---|---|
| Config | src/config/ | CLI argument parsing, config resolution, ~/.databrickscfg support |
| Fetch | src/fetch/ | HTTP client, Spark API types, endpoint methods, background poller |
| Analyze | src/analyze/ | Suspect detection (slow stages, skew, spill), bottleneck classification, SQL linking |
| TUI | src/tui/ | App state machine, tab rendering, widgets, theme |
| Utilities | src/util/ | Formatting (duration, bytes) and time parsing helpers |

Config Module

Path: src/config.rs

Handles CLI argument parsing, environment variable fallback, and ~/.databrickscfg file parsing to produce a resolved Config struct.

Structs

CliArgs

Clap-derived struct for command-line arguments. Each field uses #[arg(env = "...")] for automatic env var fallback.

pub struct CliArgs {
    pub host: Option<String>,          // --host / DATABRICKS_HOST
    pub token: Option<String>,         // --token / DATABRICKS_TOKEN
    pub cluster_id: Option<String>,    // --cluster-id / DATABRICKS_CLUSTER_ID
    pub profile: Option<String>,       // --profile, -p / DATABRICKS_CONFIG_PROFILE
    pub poll_interval: u64,            // --poll-interval / SPARK_TUI_POLL_INTERVAL (default: 10)
}

Config

Resolved configuration with all required fields guaranteed present.

pub struct Config {
    pub host: String,
    pub token: String,
    pub cluster_id: String,
    pub poll_interval: u64,
}

Methods:

| Method | Signature | Description |
|---|---|---|
| base_url | &self -> String | Constructs the Spark REST API base URL. Strips https:// prefix and trailing slashes from host |
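As an illustration of that normalization, here is a hypothetical standalone sketch. The driver-proxy URL shape follows the Introduction; the exact trimming logic is an assumption, not the real implementation:

```rust
// Hypothetical sketch: strip the scheme and trailing slashes from the host,
// then build the driver-proxy base URL.
fn base_url(host: &str, cluster_id: &str) -> String {
    let host = host
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .trim_end_matches('/');
    format!("https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1")
}

fn main() {
    let url = base_url("https://adb-123.azuredatabricks.net/", "0101-abc");
    assert_eq!(
        url,
        "https://adb-123.azuredatabricks.net/driver-proxy-api/o/0/0101-abc/40001/api/v1"
    );
}
```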

Functions

| Function | Signature | Description |
|---|---|---|
| resolve_config | () -> Result<Config, String> | Main entry point. Resolves config from CLI > env > databrickscfg |
| databrickscfg_path | () -> Option<PathBuf> | Returns the ~/.databrickscfg path if the file exists |
| parse_databrickscfg | (path) -> Result<HashMap<String, Profile>> | Parses the INI-format config file into profile sections |
| find_complete_profile | (profiles) -> Option<&Profile> | Finds the first profile with host, token, and cluster_id |

Resolution Logic

  1. Parse CLI args (clap handles env var fallback)
  2. If all three fields are present → return Config
  3. Otherwise, read ~/.databrickscfg
  4. If --profile is set → use that section (error if not found)
  5. Otherwise → auto-detect the first complete profile
  6. Merge CLI/env values with profile values (CLI/env takes priority)
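The merge in step 6 is a per-field precedence rule, which in Rust is just `Option::or`. A minimal sketch (field handling is illustrative, not the exact implementation):

```rust
// CLI/env values win over profile values; the first Some wins.
fn merge(cli: Option<String>, profile: Option<String>) -> Option<String> {
    cli.or(profile)
}

fn main() {
    assert_eq!(
        merge(Some("cli-host".into()), Some("profile-host".into())),
        Some("cli-host".into())
    );
    assert_eq!(merge(None, Some("profile-host".into())), Some("profile-host".into()));
    assert_eq!(merge(None::<String>, None), None);
}
```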

Tests

  • test_parse_databrickscfg — verifies INI parsing
  • test_find_complete_profile — verifies auto-detection of complete profiles
  • test_base_url_strips_scheme_and_trailing_slash — verifies URL normalization

Fetch Module

Path: src/fetch/

Handles HTTP communication with the Spark REST API via the Databricks driver proxy.

Files

| File | Purpose |
|---|---|
| client.rs | SparkHttpClient and FetchError |
| types.rs | Spark API response types (serde) |
| spark.rs | Endpoint methods on SparkHttpClient |
| databricks.rs | DatabricksClient for Databricks REST API, DBFS, Spark UI, and History Server |
| orchestrator.rs | poll_once, assemble_data_payload, compute_health_summary |
| poller.rs | Background polling loop and historical fallback chain |
| eventlog/ | Event log parsing: DBFS download, gzip decompression, SparkEvent serde |

client.rs — SparkHttpClient

SparkHttpClient

pub struct SparkHttpClient {
    client: reqwest::Client,
    base_url: String,
    token: String,
}
| Method | Signature | Description |
|---|---|---|
| new | (base_url, token) -> Self | Creates a new client |
| base_url | &self -> &str | Returns the base URL |
| get | &self, path: &str -> Result<T, FetchError> | Generic GET request with Bearer auth and JSON deserialization |

FetchError

Error enum with user-friendly messages for common HTTP errors:

| Variant | Status | Message |
|---|---|---|
| Unauthorized | 401 | Token expired or invalid |
| Forbidden | 403 | Insufficient permissions |
| NotFound | 404 | Spark UI not available / app may have ended |
| ServiceUnavailable | 503 | Cluster not reachable |
| HttpError | other | Generic HTTP error with status and body |
| Deserialize | | JSON deserialization failure |
| Request | | Network-level request failure |
| NoApplications | | No Spark applications found |

types.rs — API Types

Application & Job Types

| Type | Key Fields |
|---|---|
| SparkApplication | id, name |
| SparkJob | job_id, name, status, submission_time, completion_time, stage_ids, task counts |
| JobStatus | Succeeded, Running, Failed, Unknown |

Stage Types

| Type | Key Fields |
|---|---|
| SparkStage | stage_id, attempt_id, status, num_tasks, executor_run_time, I/O bytes, spill bytes |
| StageStatus | Active, Complete, Pending, Failed, Skipped |

SQL Types

| Type | Key Fields |
|---|---|
| SparkSqlExecution | id, status, description, plan_description, duration, job ID lists |

Task Types

| Type | Key Fields |
|---|---|
| SparkTask | task_id, stage_id, executor_id, host, status, duration, I/O bytes, spill bytes, peak_execution_memory |
| RawSparkTask | Raw API format with nested task_metrics (flattened into SparkTask via a custom deserializer) |

Executor Types

| Type | Key Fields |
|---|---|
| SparkExecutor | id, total_cores, max_memory, is_active |
| ClusterResources | total_executor_memory, total_executor_cores, num_executors |

spark.rs — Endpoint Methods

Methods on SparkHttpClient:

| Method | Path | Returns |
|---|---|---|
| discover_app_id | /applications | String (first application ID) |
| get_jobs | /applications/{id}/jobs | Vec<SparkJob> |
| get_stages | /applications/{id}/stages | Vec<SparkStage> |
| get_sql_executions | /applications/{id}/sql | Vec<SparkSqlExecution> |
| get_task_list | /applications/{id}/stages/{sid}/{attempt}/taskList | Vec<SparkTask> |
| get_executors | /applications/{id}/executors | Vec<SparkExecutor> |

databricks.rs — DatabricksClient

Thin client for Databricks REST API /api/2.0/* endpoints and workspace-level requests.

DatabricksClient

pub struct DatabricksClient {
    client: reqwest::Client,
    base_url: String,        // e.g. https://{host}/api/2.0
    workspace_root: String,  // e.g. https://{host}
    token: String,
    sparkui_cookie: Option<String>,
}

SparkuiProbeResult

Tri-state result from probing a Spark UI endpoint:

| Variant | Fields | Meaning |
|---|---|---|
| Ready | base_url, app_id | Spark UI is serving JSON data |
| Loading | base_url | Spark UI is authenticated but still downloading/parsing event logs |
| NotFound | | No accessible Spark UI endpoint found |

Key Methods

| Method | Description |
|---|---|
| get_cluster_info | Fetch cluster state, log config, and spark_context_id |
| try_sparkui_endpoint | Probe the Historical Spark UI REST API (tries Bearer + cookie auth, workspace + dataplane domains) |
| probe_sparkui_url | Re-probe a single URL for retry after a Loading state |
| fetch_sparkui_data | Fetch all data (jobs, stages, SQL, tasks) from a ready Spark UI endpoint |
| discover_history_server | Probe known Spark History Server proxy URL patterns |
| fetch_history_data | Fetch all data from the History Server |
| dbfs_list / dbfs_read_full | DBFS file operations |
| find_default_event_logs | Scan well-known DBFS paths for event log files |

Warm-up Detection

The is_loading_page() helper detects HTML loading pages returned during Spark UI warm-up by checking for patterns like <title>Loading</title>, loading spark ui, please wait, or generic HTML responses.
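An approximation of that heuristic as a standalone function (the function name and exact match logic here are assumptions; only the patterns come from the text above):

```rust
// Hypothetical sketch of the warm-up check: does the response body look
// like an HTML loading page rather than JSON?
fn looks_like_loading_page(body: &str) -> bool {
    let lower = body.to_lowercase();
    lower.contains("<title>loading</title>")
        || lower.contains("loading spark ui")
        || lower.contains("please wait")
}

fn main() {
    assert!(looks_like_loading_page("<html><title>Loading</title></html>"));
    assert!(!looks_like_loading_page("{\"jobs\": []}"));
}
```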

eventlog/ — Event Log Parsing

| File | Purpose |
|---|---|
| events.rs | SparkEvent serde types for Spark event log JSON lines |
| parser.rs | EventLogParser — converts raw events into jobs, stages, SQL, tasks |
| loader.rs | DBFS download + gzip decompression pipeline |

The load_event_log() function orchestrates: discover log path → download from DBFS → decompress gzip → parse JSON lines → return structured data.

orchestrator.rs — Data Polling & Assembly

poll_once

Fetches all data and runs analysis in a single poll cycle:

  1. Fetch jobs, stages, SQL executions, and executors concurrently (4-way tokio::join!)
  2. Aggregate active executors into ClusterResources (total memory, cores, executor count)
  3. Build cross-reference maps (job↔SQL, stage↔job)
  4. Build ranked jobs (sorted by duration, running first)
  5. Create a SuspectContext from the cross-reference maps
  6. Run 10 stage-level detectors via function pointer table (detect_slow_stages, detect_spill, detect_cpu_efficiency, detect_record_explosion, detect_task_failures, detect_memory_pressure, detect_partition_count, detect_broadcast_join, detect_python_udf, detect_cache_opportunity)
  7. Fetch task lists for top ~15 stages (selected by multiple heuristics)
  8. Run skew detection on fetched tasks (duration, data-size, record-count, executor hotspot)
  9. Aggregate and sort suspects (severity first, then estimated_savings_ms descending)
  10. Build stage_sql_hints — maps stage_id to top SQL plan operations
  11. Compute critical_stages — the longest wall-clock stage per job (critical path)
  12. Compute HealthSummary (job/IO counts, critical/warning counts, top issues)
  13. Return DataPayload

compute_health_summary

fn compute_health_summary(
    jobs: &[RankedJob],
    stages: &[SparkStage],
    suspects: &[Suspect],
) -> HealthSummary

Aggregates job counts, total I/O bytes, and suspect severity counts into a HealthSummary for the summary bar widget.

poller.rs — Background Poller

run_poller

pub async fn run_poller(
    client: Arc<SparkHttpClient>,
    databricks: Arc<DatabricksClient>,
    config: Arc<Config>,
    tx: mpsc::UnboundedSender<Action>,
    poll_interval: Duration,
)
  1. Discovers the application ID (or detects cluster unreachable → historical fallback)
  2. Enters a live poll loop: poll_once (from orchestrator.rs) → send result via channel → sleep
  3. On connection loss during polling, falls back to historical data via try_load_historical

Historical Fallback Chain

When the cluster is unreachable, try_load_historical attempts these strategies in order:

| Strategy | Function | Condition |
|---|---|---|
| 0 — Spark UI | try_sparkui | Requires spark_context_id; retries with backoff if the UI is loading |
| 1 — History Server | try_history_server | Probes known proxy URL patterns |
| 2 — DBFS event logs | try_dbfs_event_logs | Uses cluster_log_conf or --event-log-path |
| 3 — Default paths | find_default_event_logs | Scans well-known DBFS directories |

The Spark UI strategy includes warm-up retry: if the endpoint returns an HTML loading page, it retries with backoff delays defined in SPARKUI_RETRY_DELAYS (3, 5, 10, 15, 20 seconds — ~53s total).
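The retry schedule can be sketched synchronously (the real code awaits tokio sleeps between probes; here a closure stands in for the probe and the sleep is elided):

```rust
// One immediate attempt, then one retry after each delay in the schedule.
const SPARKUI_RETRY_DELAYS: [u64; 5] = [3, 5, 10, 15, 20]; // seconds, ~53s total

// Returns the 1-based attempt number on success, or None if all attempts fail.
fn retry_with_backoff<F: FnMut() -> bool>(mut probe: F) -> Option<usize> {
    for (attempt, _delay_secs) in std::iter::once(&0u64)
        .chain(SPARKUI_RETRY_DELAYS.iter())
        .enumerate()
    {
        // A real implementation would sleep for *_delay_secs before re-probing.
        if probe() {
            return Some(attempt + 1);
        }
    }
    None
}

fn main() {
    let mut calls = 0;
    let result = retry_with_backoff(|| {
        calls += 1;
        calls == 3 // simulate the UI becoming ready on the third probe
    });
    assert_eq!(result, Some(3));
}
```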

DataPayload

pub struct DataPayload {
    pub app_id: String,
    pub jobs: Vec<RankedJob>,
    pub stages: Vec<SparkStage>,
    pub sql_executions: Vec<SparkSqlExecution>,
    pub suspects: Vec<Suspect>,
    pub stage_tasks: Arc<HashMap<i64, Vec<SparkTask>>>,
    pub summary: HealthSummary,
    pub cluster_resources: ClusterResources,
    pub stage_sql_hints: Arc<HashMap<i64, String>>,
    pub critical_stages: Arc<HashSet<i64>>,
    pub last_updated: String,
    pub data_source: DataSourceMode,
    pub data_source_detail: Option<String>,
}

Note: DataPayload is defined in src/tui/mod.rs and contains all data needed to render the TUI.

Key fields:

  • stage_sql_hints — maps stage_id → String with top SQL plan operations (e.g., “HashAggregate → Exchange → Scan parquet”), pre-computed for display in stage detail headers
  • critical_stages — set of stage IDs that represent the critical path (longest wall-clock stage per job), used for “CP” annotations in the job detail view
  • data_sourceDataSourceMode::Live or DataSourceMode::Historical, displayed in the status line
  • data_source_detail — optional label like "Spark UI", "history server", or "event logs", shown alongside the HISTORICAL badge

Analyze Module

Path: src/analyze/

Contains all performance analysis logic: suspect detection, bottleneck classification, and SQL correlation.

Files

| File | Purpose |
|---|---|
| types.rs | Core types: Suspect, Severity, SuspectCategory, BottleneckPattern, RankedJob, SqlJobLink |
| skew/ | Data skew detection using task-level metrics |
| suspects/ | SuspectContext, 10 stage-level detectors, bottleneck classification, aggregation |
| sql_linker/ | Cross-reference maps between jobs, stages, and SQL executions |

types.rs — Core Types

Severity

pub enum Severity {
    Warning,
    Critical,
}

Implements Ord for sorting (Critical > Warning).
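Because deriving Ord on a fieldless enum orders variants by declaration order, declaring Warning before Critical gives exactly this ordering. A minimal demonstration:

```rust
// Variants compare by declaration order, so Warning < Critical.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Warning,
    Critical,
}

fn main() {
    assert!(Severity::Critical > Severity::Warning);
    let mut v = vec![Severity::Critical, Severity::Warning];
    v.sort(); // ascending: Warning first
    assert_eq!(v, vec![Severity::Warning, Severity::Critical]);
}
```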

SuspectCategory

pub enum SuspectCategory {
    SlowStage,
    DataSkew,
    DataSizeSkew,
    RecordCountSkew,
    DiskSpill,
    CpuBottleneck,
    IoBottleneck,
    RecordExplosion,
    TaskFailures,
    MemoryPressure,
    ExecutorHotspot,
    TooManyPartitions,
    TooFewPartitions,
    BroadcastJoinOpportunity,
    PythonUdf,
    CacheOpportunity,
}

BottleneckPattern

pub enum BottleneckPattern {
    LargeScan,
    WideShuffle,
    DataExplosion,
    RecordExplosion,
}

Suspect

pub struct Suspect {
    pub severity: Severity,
    pub category: SuspectCategory,
    pub stage_id: i64,
    pub job_id: Option<i64>,
    pub title: String,
    pub detail: String,
    pub stage_name: Option<String>,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub io_summary: Option<String>,
    pub recommendation: Option<String>,
    pub bottleneck: Option<BottleneckPattern>,
    pub sql_plan_hint: Option<String>,
    pub estimated_savings_ms: i64,
}

The estimated_savings_ms field contains a heuristic estimate of time savings from fixing the issue. It is used as a secondary sort key (after severity) in aggregate_suspects.

RankedJob

Processed job data for display, sorted by duration (running first, then slowest first).

pub struct RankedJob {
    pub job_id: i64,
    pub name: String,
    pub status: String,
    pub duration_ms: Option<i64>,
    pub num_tasks: i32,
    pub num_failed_tasks: i32,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub stage_ids: Vec<i64>,
    pub submission_time: Option<String>,
    pub sql_plan: Option<String>,
}

HealthSummary

pub struct HealthSummary {
    pub total_jobs: usize,
    pub running_jobs: usize,
    pub failed_jobs: usize,
    pub total_input_bytes: i64,
    pub total_output_bytes: i64,
    pub total_shuffle_bytes: i64,
    pub critical_count: usize,
    pub warning_count: usize,
    pub top_issues: Vec<String>,
}

Aggregates health metrics for the summary bar widget, computed by compute_health_summary in the poller.

skew/ — Skew Detection

detect_skew

pub fn detect_skew(
    tasks: &[SparkTask],
    stage_id: i64,
    job_id: Option<i64>,
    stage_name: Option<&str>,
    sql_id: Option<i64>,
    sql_description: Option<&str>,
) -> Vec<Suspect>

Detects all forms of skew in a stage’s tasks. Returns a Vec<Suspect> covering duration skew, data-size skew, record-count skew, and executor hotspot detection. See Understanding Analysis for threshold details.
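A common building block for duration skew is the max-to-median ratio of task durations; a straggler far above the median indicates skew. This is an illustrative sketch only, the real thresholds live in the Understanding Analysis chapter:

```rust
// Ratio of the slowest task to the median task duration.
// A high ratio (e.g. > 5x) suggests one or a few straggler tasks.
fn max_to_median_ratio(mut durations_ms: Vec<i64>) -> Option<f64> {
    if durations_ms.is_empty() {
        return None;
    }
    durations_ms.sort_unstable();
    let median = durations_ms[durations_ms.len() / 2] as f64;
    let max = *durations_ms.last().unwrap() as f64;
    if median == 0.0 { None } else { Some(max / median) }
}

fn main() {
    // One straggler at ~10x the median is a classic skew signature.
    let ratio = max_to_median_ratio(vec![100, 110, 95, 105, 1000]).unwrap();
    assert!(ratio > 5.0);
}
```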

suspects/ — Stage-Level Detection

Constants

| Constant | Value |
|---|---|
| ONE_MB | 1,048,576 bytes |
| FIFTY_MB | 52,428,800 bytes |
| ONE_HUNDRED_MB | 104,857,600 bytes |
| FIVE_HUNDRED_MB | 524,288,000 bytes |
| ONE_GB | 1,073,741,824 bytes |

SuspectContext

Holds all lookup maps needed by suspect detectors, eliminating repetitive parameter passing.

pub struct SuspectContext<'a> {
    pub stage_to_job: &'a HashMap<i64, i64>,
    pub job_to_sql: &'a HashMap<i64, i64>,
    pub sql_descriptions: &'a HashMap<i64, String>,
    pub sql_plans: &'a HashMap<i64, String>,
}

Constructor:

pub fn new(
    stage_to_job: &'a HashMap<i64, i64>,
    job_to_sql: &'a HashMap<i64, i64>,
    sql_descriptions: &'a HashMap<i64, String>,
    sql_plans: &'a HashMap<i64, String>,
) -> Self

Methods:

| Method | Signature | Description |
|---|---|---|
| job_id | (&self, stage_id: i64) -> Option<i64> | Look up the job_id for a stage |
| resolve_sql | (&self, stage_id: i64) -> (Option<i64>, Option<String>) | Resolve the SQL id and description for a stage via its job (private) |
| resolve_plan_hint_for | (&self, stage_id: i64) -> Option<String> | Resolve top SQL plan operations for a stage (e.g., "HashAggregate → Exchange → Scan") |
| enrich | (&self, suspect: &mut Suspect, stage: &SparkStage) | Enrich a suspect with stage_name, SQL linkage, I/O summary, and plan hint |

Stage-Level Detectors

All 10 stage-level detectors share the same signature:

pub fn detect_*(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect>

They are dispatched via a function pointer table in the poller:

type DetectorFn = fn(&[SparkStage], &SuspectContext) -> Vec<Suspect>;
let detectors: &[DetectorFn] = &[
    detect_slow_stages,
    detect_spill,
    detect_cpu_efficiency,
    detect_record_explosion,
    detect_task_failures,
    detect_memory_pressure,
    detect_partition_count,
    detect_broadcast_join,
    detect_python_udf,
    detect_cache_opportunity,
];

detect_slow_stages

Flags stages with executor_run_time exceeding mean + 2*stddev (warning) or mean + 4*stddev (critical). Sets estimated_savings_ms to time above mean.
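The mean + k·stddev rule can be sketched standalone. This is an illustration of the statistical test, not the real detector (which also enriches suspects and computes savings):

```rust
// Returns the indices of runtimes above mean + k * stddev (population stddev).
// k = 2.0 corresponds to the warning threshold, k = 4.0 to critical.
fn slow_outliers(runtimes_ms: &[f64], k: f64) -> Vec<usize> {
    let n = runtimes_ms.len() as f64;
    if n == 0.0 {
        return vec![];
    }
    let mean = runtimes_ms.iter().sum::<f64>() / n;
    let var = runtimes_ms.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let threshold = mean + k * var.sqrt();
    runtimes_ms
        .iter()
        .enumerate()
        .filter(|(_, &r)| r > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let mut runtimes = vec![1000.0; 20];
    runtimes.push(10_000.0); // one straggler stage
    assert_eq!(slow_outliers(&runtimes, 2.0), vec![20]);
}
```

Note that with very few stages a single outlier inflates the stddev enough to hide itself, so the rule is most meaningful for applications with many stages.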

detect_spill

Flags stages with disk_bytes_spilled > 0 (warning) or > 1 GB (critical). Estimates ~30% of runtime as spill overhead.

detect_cpu_efficiency

Detects CPU efficiency issues. Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time. Low ratio (< 0.3, runtime > 10s) → I/O bottleneck; high ratio (> 0.9, runtime > 30s) → CPU saturated. Estimates ~20% savings.

detect_record_explosion

Detects stages where output_records > 10x input_records (with input_records > 1000). Estimates ~50% savings.

detect_task_failures

Detects stages with task failures or killed tasks. Estimates savings proportional to failure rate.

detect_memory_pressure

Detects memory pressure: memory_bytes_spilled > 50 MB but disk_bytes_spilled == 0. Estimates ~10% savings from GC overhead.

detect_partition_count

Detects partition count issues. Two sub-categories:

  • TooManyPartitions: num_tasks > 10,000 and avg_bytes_per_task < 1 MB. Estimates ~40% savings from scheduling overhead.
  • TooFewPartitions: num_tasks ≤ 8 and avg_bytes_per_task > 1 GB. Estimates ~50% savings from straggler elimination.

Recommendations include a computed target partition count for ~128 MB/partition.
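The target partition count is a straightforward division; a sketch of the computation (the rounding/clamping behavior here is an assumption):

```rust
// Aim for ~128 MB per partition, never fewer than one partition.
const TARGET_BYTES_PER_PARTITION: i64 = 128 * 1024 * 1024;

fn target_partitions(total_bytes: i64) -> i64 {
    (total_bytes / TARGET_BYTES_PER_PARTITION).max(1)
}

fn main() {
    let ten_gb = 10 * 1024 * 1024 * 1024_i64;
    assert_eq!(target_partitions(ten_gb), 80); // 10 GB / 128 MB = 80
}
```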

detect_broadcast_join

Detects shuffle joins where one side is small enough to broadcast. Triggers when shuffle_write_bytes < 100 MB, executor_run_time > 5s, and the SQL plan hint contains join indicators (SortMerge, ShuffledHash, Join). Estimates ~60% savings from shuffle elimination.

detect_python_udf

Detects Python UDFs in SQL plans by searching for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner. Escalates to Critical severity if also CPU-bound (ratio > 0.9, runtime > 30s). Estimates ~50% savings.

detect_cache_opportunity

Detects repeated computations by grouping completed stages by cleaned name. Triggers when ≥ 2 stages share a name and total runtime > 30s. Savings estimated as total_runtime - min_single_runtime.

classify_bottleneck

pub fn classify_bottleneck(s: &SparkStage) -> Option<BottleneckPattern>

Classifies root cause based on I/O patterns:

| Pattern | Condition |
|---|---|
| DataExplosion | input > 100 MB and output > 5x input |
| LargeScan | input > 1 GB and input > 10x (output + shuffle_write) |
| WideShuffle | shuffle_write > 500 MB or shuffle_read > input |
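The condition table translates directly into code. In this sketch the check order (explosion before scan before shuffle) and the free-function shape are assumptions; the real function takes a &SparkStage:

```rust
#[derive(Debug, PartialEq)]
enum BottleneckPattern {
    LargeScan,
    WideShuffle,
    DataExplosion,
}

// Transcription of the condition table, with byte sizes taken from the stage.
fn classify(input: i64, output: i64, shuffle_read: i64, shuffle_write: i64) -> Option<BottleneckPattern> {
    const MB: i64 = 1_048_576;
    const GB: i64 = 1_073_741_824;
    if input > 100 * MB && output > 5 * input {
        Some(BottleneckPattern::DataExplosion)
    } else if input > GB && input > 10 * (output + shuffle_write) {
        Some(BottleneckPattern::LargeScan)
    } else if shuffle_write > 500 * MB || shuffle_read > input {
        Some(BottleneckPattern::WideShuffle)
    } else {
        None
    }
}

fn main() {
    const GB: i64 = 1_073_741_824;
    // A 50 GB scan with tiny output and no shuffle is a LargeScan.
    assert_eq!(classify(50 * GB, GB / 10, 0, 0), Some(BottleneckPattern::LargeScan));
}
```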

aggregate_suspects

pub fn aggregate_suspects(mut suspects: Vec<Suspect>) -> Vec<Suspect>

Sorts suspects by severity (Critical first), then by estimated_savings_ms descending as a tiebreaker.
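The two-key sort (highest severity first, then largest savings first) can be sketched with minimal stand-in types; the tuple representation below is illustrative, not the real Suspect struct:

```rust
// Stand-in for the real Severity enum: Warning < Critical by declaration order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Severity {
    Warning,
    Critical,
}

// (severity, estimated_savings_ms) pairs, sorted descending on both keys.
fn aggregate(mut suspects: Vec<(Severity, i64)>) -> Vec<(Severity, i64)> {
    suspects.sort_by(|a, b| b.0.cmp(&a.0).then(b.1.cmp(&a.1)));
    suspects
}

fn main() {
    let sorted = aggregate(vec![
        (Severity::Warning, 9_000),
        (Severity::Critical, 1_000),
        (Severity::Critical, 5_000),
    ]);
    // Criticals come first regardless of savings; savings break ties.
    assert_eq!(sorted[0], (Severity::Critical, 5_000));
    assert_eq!(sorted[2], (Severity::Warning, 9_000));
}
```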

Helper Functions

| Function | Description |
|---|---|
| classify_bottleneck(s) | Classifies the root-cause bottleneck pattern for a stage |
| bottleneck_recommendation(b) | Returns a PySpark-specific recommendation string for a bottleneck pattern |
| stage_io_summary(s) | Formats I/O metrics for a stage (including the in:out ratio) |

Note: resolve_sql and resolve_plan_hint_for are now methods on SuspectContext (see above).

sql_linker/ — Cross-Reference Maps

| Function | Signature | Description |
|---|---|---|
| build_job_to_sql_map | (sqls) -> HashMap<i64, i64> | Maps job_id → sql_id from SQL execution job lists |
| build_stage_to_job_map | (jobs) -> HashMap<i64, i64> | Maps stage_id → job_id from job stage lists |
| link_sql_to_jobs | (sqls) -> Vec<SqlJobLink> | Groups SQL executions with their job IDs |
| find_sql_for_job | (job_id, ...) -> (Option<i64>, Option<String>) | Looks up the SQL ID and description for a job |
| stages_for_task_analysis | (stages) -> Vec<(i64, i64)> | Selects up to ~15 stages for task-level analysis using multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism) |

TUI Module

Path: src/tui/

Contains the terminal UI: app state machine, event loop, tab rendering, widgets, and theme.

Files

| File | Purpose |
|---|---|
| app/ | App struct, event loop, key handling, rendering dispatch (state.rs, input.rs, render.rs) |
| theme.rs | Color and style functions |
| highlight.rs | SQL/plan syntax highlighting |
| tabs/jobs_list.rs | Jobs table view |
| tabs/job_detail.rs | Stage breakdown for a selected job |
| tabs/sql_detail.rs | SQL execution plan view (scrollable) |
| tabs/stage_detail.rs | Detailed stage metrics view |
| tabs/suspects.rs | Suspects table view |
| widgets/help.rs | Help overlay (keybinding reference + SQL recommendations) |
| widgets/status_line.rs | Status bar with cluster info and last update time |
| widgets/summary_bar.rs | Health summary bar (top issues, job/IO counts) |

app/ — App State

Tab

pub enum Tab {
    Jobs,
    Suspects,
}

Methods: next(), prev(), index(), from_index(), titles().

ViewMode

pub enum ViewMode {
    List,         // Tab-level table view
    JobDetail,    // Stage breakdown for a selected job
    SqlDetail,    // SQL execution plan (scrollable)
    StageDetail,  // Detailed metrics for a selected stage
}

App

pub struct App {
    pub active_tab: Tab,
    pub view_mode: ViewMode,
    pub data: Option<Arc<DataPayload>>,
    pub error_msg: Option<String>,
    pub cluster_id: String,
    pub should_quit: bool,
    pub job_table_state: TableState,
    pub suspect_table_state: TableState,
    pub detail_table_state: TableState,
    pub sql_scroll_state: ScrollViewState,
    pub stage_detail_scroll_state: ScrollViewState,
    show_help: bool,
    return_tab: Option<Tab>,
    client: Arc<SparkHttpClient>,
    tx: mpsc::UnboundedSender<Action>,
    pending_task_fetches: HashSet<i64>,
}

Notable changes from v1:

  • data changed from Option<DataPayload> to Option<Arc<DataPayload>> for cheaper cloning
  • sql_scroll: u16 and stage_detail_scroll: u16 replaced with sql_scroll_state: ScrollViewState and stage_detail_scroll_state: ScrollViewState (from tui-scrollview)
  • show_help: bool — toggles the help overlay
  • return_tab: Option<Tab> — tracks which tab to return to on Esc from JobDetail (e.g., Suspects tab when entering via a suspect)

Key methods:

| Method | Description |
|---|---|
| new(cluster_id, client, tx) | Creates a new App with initial state |
| run(&mut self, terminal, rx) | Main event loop — receives Actions from the channel, handles keys, re-renders |
| handle_key(key_event) | Processes keyboard input based on the current ViewMode |
| handle_action(action) | Processes DataUpdate, FetchError, Key, Mouse, Resize actions |
| handle_escape() | Goes back one level; respects return_tab |
| handle_enter() | Drills into the selected item |
| handle_navigation_down() | Moves selection/scroll down based on ViewMode |
| handle_navigation_up() | Moves selection/scroll up based on ViewMode |
| handle_navigation_home() | Jumps to the top |
| handle_navigation_end() | Jumps to the bottom |
| open_sql_detail() | Opens SQL detail for the current job |
| enter_suspect_job() | Navigates from the Suspects tab to a suspect's job detail, setting return_tab |
| enter_stage_detail() | Navigates to stage detail from job detail |
| trigger_task_fetch(stage_id) | Triggers an on-demand task fetch for a stage not already in stage_tasks |
| render(frame) | Renders the current state to the terminal |
| render_tab_bar(frame, area) | Renders the tab header |
| render_content(frame, area) | Renders the main content area for the current ViewMode |
| render_status_bar(frame, area) | Renders the bottom status bar with context-sensitive hint strings |

theme.rs — Styles

Pure functions that return ratatui::style::Style:

| Function | Usage |
|---|---|
| critical() | Red — critical severity |
| warning() | Yellow — warning severity |
| healthy() | Green — success/healthy |
| running() | Yellow — running status |
| failed() | Red — failed status |
| muted() | Gray — secondary text |
| selected() | Cyan — selected row |
| tab_active() | Active tab style |
| tab_inactive() | Inactive tab style |
| status_bar() | Status bar background |
| severity_style(severity) | Maps Severity to a style |
| job_status_style(status) | Maps a job status string to a style |
| metric_bytes_style(bytes) | Color-codes byte counts by size |
| shuffle_bytes_style(bytes) | Color-codes shuffle bytes |
| spill_bytes_style(bytes) | Color-codes spill bytes |
| cpu_utilization_style(ratio) | Color-codes CPU utilization: ≥0.95 red (saturated), ≥0.5 green (healthy), ≥0.3 yellow (underutilized), <0.3 red (I/O bound) |
| memory_utilization_style(peak, cluster_mem) | Color-codes peak memory relative to the cluster total; falls back to absolute thresholds when cluster data is unavailable |

Size thresholds for byte styling: MB = 1_048_576, GB = 1_073_741_824.

tabs/jobs_list.rs — Jobs List

| Function | Description |
|---|---|
| format_submission_time(time) | Formats the submission time for display |
| render_jobs_tab(frame, area, app) | Renders the jobs table with columns: ID, Status, Duration, Tasks, Failed, SQL, Submitted |

tabs/job_detail.rs — Job Detail

| Function | Description |
|---|---|
| render_job_detail(frame, area, job, stages, sql_executions, stage_state, critical_stages) | Renders the stage breakdown table for a job. Critical path stages are annotated with "CP" |

tabs/sql_detail.rs — SQL Detail

| Function | Description |
|---|---|
| render_sql_detail(frame, area, job, scroll_state, suspects) | Renders scrollable SQL execution plan text with syntax highlighting. Uses ScrollViewState from tui-scrollview |

tabs/stage_detail.rs — Stage Detail

| Function | Description |
|---|---|
| render_stage_header(frame, area, stage, total_cluster_memory, sql_hint) | Renders the stage header with I/O metrics, CPU %, and an optional SQL plan hint |
| render_duration_histogram(frame, area, tasks) | Renders the task duration histogram |
| render_executor_breakdown(frame, area, tasks) | Renders the per-executor breakdown table |
| render_peak_memory_section(frame, area, tasks, total_cluster_memory) | Renders the peak memory per task section |
| render_skew_metrics(frame, area, tasks) | Renders skew metrics (CV, max/median ratio) |
| render_stage_detail(frame, area, stage, tasks, loading, scroll_state, total_cluster_memory, sql_hint) | Renders the full stage detail using ScrollViewState. Composes the stage header, I/O metrics, duration histogram, executor breakdown, peak memory, and skew metrics |

tabs/suspects.rs — Suspects Tab

| Function | Description |
|---|---|
| render_suspects_tab(frame, area, suspects, state, critical_stages) | Renders the suspects table with columns: Severity, Category, Stage, Job, Title, Detail, Recommendation. Table title: "Suspects (severity → savings)" |

highlight.rs — Syntax Highlighting

| Function | Description |
|---|---|
| highlight_sql(text) | Applies syntax highlighting to SQL query text for display in the SQL detail view |
| highlight_spark_plan(text) | Applies syntax highlighting to Spark physical plan text |

widgets/help.rs

| Function | Description |
|---|---|
| centered_rect(area, percent_x, percent_y) | Computes a centered rectangle within an area (private helper) |
| render_help_overlay(frame, area) | Renders a general keybinding reference overlay |
| render_sql_help_overlay(frame, area, job, suspects) | Renders PySpark-specific recommendations for suspects related to the current SQL execution |

widgets/status_line.rs

| Function | Description |
|---|---|
| render_status_line(frame, area, app) | Renders the bottom status bar showing the cluster ID, app ID, and last update time |

widgets/summary_bar.rs

| Function | Description |
|---|---|
| render_summary_bar(frame, area, summary) | Renders a 2-line health summary with colored foreground text (red=critical, yellow=warning, green=healthy) showing job/IO counts and top issues |

Utilities Module

Path: src/util/

Helper functions for formatting values and parsing timestamps.

Files

| File | Purpose |
|---|---|
| format.rs | Human-readable formatting for durations, bytes, strings, and SQL plans |
| time.rs | Spark timestamp parsing and duration calculation |

format.rs — Formatting

| Function | Signature | Description |
|---|---|---|
| format_duration_ms | (ms: i64) -> String | Formats milliseconds as a human-readable duration (e.g., 1h 23m 45s, 500ms, 2.3s) |
| format_bytes | (bytes: i64) -> String | Formats byte counts with an appropriate unit (e.g., 1.5 GB, 256 MB, 1.2 KB) |
| format_bytes_or_dash | (bytes: i64) -> String | Like format_bytes, but returns "-" if bytes is zero |
| format_records | (records: i64) -> String | Formats record counts with K/M/B suffixes |
| percentile | (sorted: &[f64], p: f64) -> f64 | Computes a percentile from a sorted slice |
| truncate | (s: &str, max: usize) -> String | Truncates a string to max characters, appending ... if truncated |
| sanitize_for_span | (s: &str) -> String | Replaces embedded newlines, carriage returns, and tabs with spaces. Ratatui's Line/Span types expect no embedded newlines — they corrupt the differential renderer's cursor position tracking |
| clean_stage_name | (name: &str) -> String | Removes Spark Connect prefixes and UUID suffixes from stage names for cleaner display |
| parse_plan_top_operations | (plan: &str, limit: usize) -> Vec<String> | Extracts the top N operations from a Spark SQL physical plan |

Examples

format_duration_ms(3_723_000)  // "1h 2m 3s"
format_duration_ms(500)        // "500ms"
format_duration_ms(2_300)      // "2.3s"

format_bytes(1_610_612_736)    // "1.5 GB"
format_bytes(1_048_576)        // "1.0 MB"

truncate("hello world", 5)     // "hello..."

clean_stage_name("spark-connect-UUID:stage_name") // "stage_name"

time.rs — Time Utilities

| Function | Signature | Description |
|---|---|---|
| parse_spark_timestamp | (s: &str) -> Option<DateTime<Utc>> | Parses Spark timestamps in multiple formats: RFC3339, naive datetime, and GMT-suffix |
| duration_between | (start: Option<&str>, end: Option<&str>) -> Option<i64> | Computes the duration in milliseconds between two optional timestamp strings |

Supported Timestamp Formats

  • RFC3339: 2024-01-15T10:30:00.000Z
  • Naive: 2024-01-15T10:30:00.000
  • GMT suffix: 2024-01-15T10:30:00.000GMT

Examples

parse_spark_timestamp("2024-01-15T10:30:00.000GMT") // Some(DateTime<Utc>)

duration_between(
    Some("2024-01-15T10:30:00.000GMT"),
    Some("2024-01-15T10:31:00.000GMT"),
) // Some(60_000)

duration_between(Some("..."), None) // None

Troubleshooting

Common Errors

spark-tui maps HTTP status codes from the Spark REST API to user-friendly error messages:

| Error | HTTP Status | Meaning | Solution |
|---|---|---|---|
| Unauthorized | 401 | Token expired or invalid | Regenerate your token at Databricks Settings > Developer > Access Tokens |
| Forbidden | 403 | Insufficient permissions | Check that your token has access to the specified cluster |
| Not Found | 404 | Spark UI not available | The Spark application may have ended; start a new Spark session or check the application ID |
| Service Unavailable | 503 | Cluster not reachable | spark-tui automatically checks cluster state and attempts to load historical data if the cluster is terminated; if the fallback fails, verify the cluster is running or provide `--event-log-path` / `--sparkui-cookie` |
| No Applications | — | No Spark apps on the cluster | Ensure a Spark session is active on the cluster (e.g., run a notebook or submit a job) |
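Conceptually, this mapping is a match on the status code. The sketch below uses hypothetical message strings and a plain `&str` return; the crate's actual error type and wording differ:

```rust
/// Hypothetical status-to-hint mapping, for illustration only.
fn error_hint(status: u16) -> &'static str {
    match status {
        401 => "Unauthorized: regenerate your token in Databricks Settings",
        403 => "Forbidden: check that your token can access the cluster",
        404 => "Not Found: the Spark application may have ended",
        503 => "Service Unavailable: cluster not reachable",
        // Anything else is surfaced for manual inspection.
        _ => "Unexpected status; see /tmp/spark-tui.log for details",
    }
}
```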

Configuration Errors

“Missing ‘host’ / ‘token’ / ‘cluster_id’”

spark-tui couldn’t find all three required fields. Check that you’ve provided them via CLI flags, environment variables, or ~/.databrickscfg. See Configuration for details.

“Profile ‘xyz’ not found in ~/.databrickscfg”

The --profile flag specifies a section name that doesn’t exist in your ~/.databrickscfg file. The error message lists available profiles.

Auto-detection fails

When no --profile is specified, spark-tui looks for the first profile in ~/.databrickscfg that has all three required fields (host, token, cluster_id). If no profile is complete, you’ll get a missing fields error.
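The selection rule amounts to "first profile whose required fields are all present." A sketch with a hypothetical in-memory representation (the crate's actual config parser is more involved):

```rust
/// A parsed ~/.databrickscfg profile (hypothetical representation).
struct Profile {
    name: String,
    host: Option<String>,
    token: Option<String>,
    cluster_id: Option<String>,
}

/// Return the first profile that has all three required fields,
/// preserving the file order of the profiles.
fn auto_detect(profiles: &[Profile]) -> Option<&Profile> {
    profiles
        .iter()
        .find(|p| p.host.is_some() && p.token.is_some() && p.cluster_id.is_some())
}
```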

Connection Issues

Timeout or no response

  • Verify the cluster is in a Running state in Databricks
  • Check that --host matches your workspace URL (e.g., adb-1234567890.azuredatabricks.net)
  • Ensure --cluster-id is correct (find it in the cluster’s URL or configuration page)

TLS errors

spark-tui uses rustls for TLS. If you’re behind a corporate proxy with custom CA certificates, you may need to set the SSL_CERT_FILE or SSL_CERT_DIR environment variables.

Log File

spark-tui writes logs to /tmp/spark-tui.log. To increase verbosity:

RUST_LOG=debug spark-tui --host ... --token ... --cluster-id ...

Available log levels: error, warn (default), info, debug, trace.

Check the log file for detailed error information:

tail -f /tmp/spark-tui.log

Terminal Issues

Display is corrupted after a crash

If spark-tui exits abnormally (e.g., killed by a signal), the terminal may remain in raw mode. Reset it with:

reset

spark-tui installs a panic hook that attempts to restore the terminal on panic, but external signals bypass this.

Colors look wrong

spark-tui uses 256-color mode via ratatui/crossterm. Ensure your terminal emulator supports 256 colors and that TERM is set correctly (e.g., xterm-256color).

SQL Rendering Artifacts

If SQL plan text appears corrupted or causes display glitches, this is likely caused by raw newlines embedded in SQL text. spark-tui sanitizes these via sanitize_for_span() in util/format.rs, which replaces embedded \n, \r, and \t characters with spaces before passing text to ratatui’s Line/Span types. Ratatui’s differential renderer tracks cursor positions per line, so embedded newlines corrupt its state.
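The sanitization step is essentially a character-by-character replacement. A minimal sketch of the behavior described above (not necessarily the function's exact implementation):

```rust
/// Replace embedded newlines, carriage returns, and tabs with spaces
/// so the text is safe to place in a single ratatui `Span`.
fn sanitize_for_span(s: &str) -> String {
    s.chars()
        .map(|c| if matches!(c, '\n' | '\r' | '\t') { ' ' } else { c })
        .collect()
}
```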

If you encounter rendering artifacts, check whether the SQL text contains unusual control characters and file an issue.

Historical Mode

Spark UI shows “loading” but never becomes ready

The Historical Spark UI needs to download and parse event logs from DBFS, which can take a while for large applications. spark-tui retries with backoff for ~53 seconds. If it still doesn’t become ready:

  • Try opening the Spark UI in your browser first to trigger the warm-up
  • Check the log file (/tmp/spark-tui.log) for the exact URL being probed
  • The event log download may take longer than 53 seconds for very large applications — try again after waiting

Historical data loads but is incomplete

When using historical mode, some data may not be available:

  • Executor metrics are not available after termination (cluster resources will show as default/zero)
  • Real-time task data is replaced by complete post-mortem task data
  • SQL plan descriptions may be less detailed depending on the data source

If --sparkui-cookie doesn’t work:

  1. Verify the cookie is from the correct domain (adb-dp-*, not adb-*)
  2. Cookies expire — regenerate by visiting the Spark UI in your browser
  3. Check /tmp/spark-tui.log for the HTTP status code returned by cookie probes
  4. The cookie value should be the JWT-like string from DATAPLANE_DOMAIN_DBAUTH, not the entire cookie header

All historical strategies fail

If spark-tui reports “Could not load historical data”, check:

  1. Cluster log delivery — is it configured? (Cluster settings > Logging)
  2. DBFS permissions — does your token have access to read DBFS paths?
  3. Event log path — try specifying it explicitly with --event-log-path
  4. Spark UI cookie — try providing --sparkui-cookie (see Configuration)

Enable debug logging to see which strategies were attempted:

RUST_LOG=debug spark-tui --cluster-id ...

Deserialization Errors

If the log shows deserialization errors, the Spark API may have returned an unexpected response format. This can happen with:

  • Very old or very new Databricks Runtime versions
  • Custom Spark configurations that alter the REST API response

File an issue with the error message and your Databricks Runtime version.

Contributing

Development Setup

Prerequisites

  • Rust 1.85+ (edition 2024)
  • A Databricks workspace for testing (optional for code changes, required for integration testing)

Clone and build

git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo build

Run locally

cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...

Project Structure

src/
├── main.rs          Entry point
├── config.rs        CLI args + config resolution
├── fetch/           HTTP client, API types, polling
├── analyze/         Suspect detection and classification
├── tui/             App state, rendering, widgets
└── util/            Formatting and time utilities

See Architecture for a detailed breakdown.

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a specific test module
cargo test config::tests
cargo test analyze::skew::tests
cargo test analyze::suspects::tests

The test suite covers:

  • Config parsing (~/.databrickscfg format, profile detection, URL normalization)
  • Skew detection (uniform tasks, warning-level skew, critical skew)
  • Suspect detection (slow stages, spill, bottleneck classification)
  • SQL linking (job-to-SQL mapping)
  • Formatting utilities (duration, bytes, truncation, stage name cleaning)
  • Time parsing (RFC3339, naive, GMT suffix formats)

Code Style

  • Edition 2024 — use current Rust idioms
  • Error handling — use thiserror for error types, Result for fallible operations
  • Formatting — run cargo fmt before committing
  • Linting — run cargo clippy and address warnings

cargo fmt --check
cargo clippy -- -D warnings

Dependencies

| Crate | Purpose |
|---|---|
| `clap` | CLI argument parsing with env var fallback |
| `tokio` | Async runtime (`macros`, `rt-multi-thread`, `time`, `sync` features) |
| `reqwest` | HTTP client (with `rustls-tls`, no default features) |
| `serde` / `serde_json` | JSON deserialization |
| `thiserror` | Error type derivation |
| `ratatui` | Terminal UI framework (`unstable-rendered-line-info` feature) |
| `crossterm` | Terminal backend |
| `tracing` / `tracing-subscriber` | Structured logging (with `env-filter`) |
| `chrono` | Timestamp parsing |
| `syntect` / `syntect-tui` | SQL syntax highlighting |
| `tui-scrollview` | Smooth scrollable views for detail panels |

CI/CD

The project uses four GitHub Actions workflows:

  • ci.yml — runs cargo fmt --check, cargo clippy, and cargo test on every push and PR
  • docs.yml — builds and deploys mdbook documentation to GitHub Pages
  • auto-tag.yml — creates a vX.Y.Z git tag when Cargo.toml version changes on master
  • release.yml — triggered by v* tags; builds cross-platform release binaries (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) and creates a GitHub Release with artifacts

Conventions

  • Keep analysis logic in analyze/, not in the TUI layer
  • Keep API types in fetch/types.rs, not scattered across modules
  • Format functions go in util/format.rs
  • Each suspect detector is a pure function: (&[SparkStage], &SuspectContext) -> Vec<Suspect>
  • The poller is the only place where API calls and analysis are composed together
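Under these conventions, a detector is a side-effect-free function over the fetched data. The sketch below uses minimal stand-in types with illustrative fields, not the crate's actual `SparkStage`/`SuspectContext`/`Suspect` definitions:

```rust
/// Minimal stand-ins for the real fetch/analyze types (illustrative).
struct SparkStage { stage_id: i64, duration_ms: i64 }
struct SuspectContext { slow_stage_threshold_ms: i64 }
struct Suspect { stage_id: i64, reason: String }

/// A pure detector: no I/O, no shared state, just data in and findings out.
fn detect_slow_stages(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect> {
    stages
        .iter()
        .filter(|s| s.duration_ms > ctx.slow_stage_threshold_ms)
        .map(|s| Suspect {
            stage_id: s.stage_id,
            reason: format!("stage ran for {} ms", s.duration_ms),
        })
        .collect()
}
```

Because detectors are pure, they can be unit-tested with hand-built stage lists, which is how the `analyze::*::tests` modules exercise them.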