Introduction
spark-tui is a terminal-based performance analysis tool for Apache Spark applications running on Databricks. It connects to the Spark REST API through the Databricks driver proxy and presents live job metrics, stage breakdowns, and automated suspect detection in an interactive TUI.
Why spark-tui?
Debugging Spark performance problems typically involves clicking through the Spark UI in a browser, manually comparing stage durations, and guessing which stages have data skew or excessive spill. This process is slow and error-prone.
spark-tui automates this analysis:
- Automatic suspect detection — identifies slow stages, data skew, and disk spill without manual inspection
- Bottleneck classification — categorizes root causes as Large Scan, Wide Shuffle, or Data Explosion
- Actionable recommendations — each finding includes a concrete tuning suggestion
- SQL correlation — links stages back to the originating SQL query and shows plan hints
- Live updates — polls the Spark API on a configurable interval and refreshes the display
What You See
The interface has two main tabs:
- Jobs — all Spark jobs ranked by duration (slowest first), with drill-down to stage details, duration bar charts, and SQL execution plans
- Suspects — automatically detected performance issues, sorted by severity (critical first), with category labels, I/O summaries, and recommendations
How It Works
spark-tui connects to the Spark REST API exposed through Databricks’ driver proxy endpoint:
https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1
A background poller fetches jobs, stages, SQL executions, and task lists at regular intervals. The analysis engine processes this data to detect anomalies, then the TUI renders the results in real time.
Next Steps
- Quick Start — install, configure, and run spark-tui
- Navigation — learn the keybindings
- Understanding Analysis — interpret the suspect findings
Quick Start
Prerequisites
- Rust toolchain — version 1.85 or later (spark-tui uses edition 2024)
- Databricks workspace — with a running cluster and an active Spark application
- Personal access token — generated from Databricks Settings > Developer > Access Tokens
Installation
git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo install --path .
Or run directly without installing:
cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...
Configuration
You need three pieces of information:
| Field | Example |
|---|---|
| Workspace host | adb-1234567890.azuredatabricks.net |
| Personal access token | dapi0123456789abcdef... |
| Cluster ID | 0123-456789-abcdef |
Provide them via any of these methods (highest priority first):
Option 1: CLI flags
spark-tui \
--host adb-123.azuredatabricks.net \
--token dapi0123456789abcdef \
--cluster-id 0123-456789-abcdef
Option 2: Environment variables
export DATABRICKS_HOST=adb-123.azuredatabricks.net
export DATABRICKS_TOKEN=dapi0123456789abcdef
export DATABRICKS_CLUSTER_ID=0123-456789-abcdef
spark-tui
Option 3: ~/.databrickscfg file
Create or edit ~/.databrickscfg:
[my-workspace]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef
cluster_id = 0123-456789-abcdef
Then run with a specific profile:
spark-tui --profile my-workspace
Or let spark-tui auto-detect the first complete profile:
spark-tui
First Run
When spark-tui starts, it will:
- Resolve configuration (CLI > env > databrickscfg)
- Connect to the Spark REST API via the driver proxy
- Discover the active Spark application
- Fetch jobs, stages, and SQL executions
- Display the Jobs tab with results ranked by duration
You should see a table of Spark jobs. Use j/k or arrow keys to navigate, Enter to drill into a job, and Tab to switch to the Suspects tab.
If something goes wrong, check the Troubleshooting guide.
Next Steps
- Configuration — full reference for all options
- Navigation — keybindings and view modes
- Understanding Analysis — interpreting suspects
Configuration
spark-tui requires three credentials to connect to a Databricks cluster: host, token, and cluster ID. These can be provided through CLI flags, environment variables, or a ~/.databrickscfg file.
Priority Resolution
Configuration is resolved in this order (highest priority first):
- CLI flags — --host, --token, --cluster-id
- Environment variables — DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_CLUSTER_ID
- ~/.databrickscfg — INI-format file with profile sections
CLI flags and environment variables are handled by clap with the env feature — each flag falls back to its corresponding env var automatically.
If CLI flags and environment variables do not together supply all three required fields, spark-tui reads ~/.databrickscfg to fill the gaps. You can mix sources: for example, set host and token via env vars but cluster_id via the config file.
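The resolution order can be sketched as a simple priority merge. This is an illustrative sketch, not spark-tui's actual implementation; the function and field names are hypothetical:

```python
# Sketch of the resolution order (highest priority first); names are
# illustrative, not spark-tui's actual API.
def resolve_config(cli: dict, env: dict, cfg_file: dict) -> dict:
    """Merge the three sources; CLI wins over env, env wins over the file."""
    required = ("host", "token", "cluster_id")
    resolved = {}
    for field in required:
        # First source that provides the field wins.
        for source in (cli, env, cfg_file):
            if source.get(field):
                resolved[field] = source[field]
                break
    missing = [f for f in required if f not in resolved]
    if missing:
        raise ValueError(f"missing required config fields: {missing}")
    return resolved

# Mixing sources: host/token from env vars, cluster_id from the config file.
config = resolve_config(
    cli={},
    env={"host": "adb-123.azuredatabricks.net", "token": "dapi..."},
    cfg_file={"cluster_id": "0123-456789-abcdef"},
)
```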
CLI Reference
| Flag | Short | Env Var | Default | Description |
|---|---|---|---|---|
| --host | | DATABRICKS_HOST | — | Workspace hostname |
| --token | | DATABRICKS_TOKEN | — | Personal access token |
| --cluster-id | | DATABRICKS_CLUSTER_ID | — | Cluster ID |
--profile | -p | DATABRICKS_CONFIG_PROFILE | auto-detect | Profile name from ~/.databrickscfg |
| --poll-interval | | SPARK_TUI_POLL_INTERVAL | 10 | Poll interval in seconds |
| --event-log-path | | SPARK_TUI_EVENT_LOG_PATH | — | DBFS path to a Spark event log file |
| --sparkui-cookie | | SPARK_TUI_SPARKUI_COOKIE | — | DATAPLANE_DOMAIN_DBAUTH cookie value |
~/.databrickscfg Format
The file uses INI format with named profile sections:
[DEFAULT]
host = adb-123.azuredatabricks.net
token = dapi0123456789abcdef
[production]
host = adb-999.azuredatabricks.net
token = dapi_prod_token
cluster_id = 0123-456789-prod
[development]
host = adb-123.azuredatabricks.net
token = dapi_dev_token
cluster_id = 0456-789012-dev
Profile selection
- Explicit: spark-tui --profile production uses the [production] section
- Auto-detect: without --profile, spark-tui scans all profiles and uses the first one that has all three required fields (host, token, cluster_id)
If the named profile doesn’t exist, spark-tui lists available profiles in the error message.
Base URL Construction
spark-tui constructs the Spark REST API base URL as:
https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1
The host field is normalized: any https:// prefix and trailing slashes are stripped before URL construction.
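The normalization and URL construction described above can be sketched in a few lines (illustrative only, not the tool's actual code):

```python
# Sketch of host normalization + base-URL construction as documented above.
def base_url(host: str, cluster_id: str) -> str:
    # Strip any https:// prefix and trailing slashes before building the URL.
    host = host.removeprefix("https://").rstrip("/")
    return f"https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1"

url = base_url("https://adb-123.azuredatabricks.net/", "0123-456789-abcdef")
# → "https://adb-123.azuredatabricks.net/driver-proxy-api/o/0/0123-456789-abcdef/40001/api/v1"
```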
Poll Interval
The --poll-interval flag controls how often spark-tui refreshes data from the Spark API (default: 10 seconds). Lower values give more responsive updates but increase API load.
# Refresh every 5 seconds
spark-tui --poll-interval 5
Historical Mode
When spark-tui detects a terminated cluster (HTTP 503 or INVALID_STATE response), it automatically attempts to load historical Spark data using a 4-strategy fallback chain:
| Priority | Strategy | Description |
|---|---|---|
| 0 | Spark UI REST API | Probes https://{host}/sparkui/{cluster}/{driver}/api/v1/. Requires spark_context_id from the cluster info. Also tries the dataplane domain variant (adb-dp- prefix). |
| 1 | Spark History Server | Probes known Databricks history server proxy URLs (multiple path patterns). |
| 2 | DBFS event logs | Reads event logs from the cluster’s cluster_log_conf delivery path, or from --event-log-path if specified. |
| 3 | Default DBFS paths | Scans well-known DBFS directories (dbfs:/cluster-logs/, dbfs:/databricks/spark/eventLogs/, etc.) for event log files. |
The first strategy that succeeds provides the historical data. The status line shows a HISTORICAL badge.
Spark UI warm-up
The Historical Spark UI needs to download and parse event logs from DBFS before serving JSON data. During this warm-up phase, it returns an HTML loading page instead of JSON. spark-tui detects this and retries with backoff (3s, 5s, 10s, 15s, 20s — ~53s total), showing progress messages like “Spark UI loading… retrying (2/5)”.
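The retry behavior can be sketched as follows. This is a minimal sketch under the backoff schedule documented above; the probe function and HTML-vs-JSON check are stand-ins for the real HTTP request handling:

```python
import time

# Backoff schedule from the docs: 3+5+10+15+20 ≈ 53s total.
BACKOFF_SECONDS = [3, 5, 10, 15, 20]

def wait_for_json(probe, sleep=time.sleep):
    """Retry `probe` until it returns JSON rather than the HTML loading page."""
    for attempt, delay in enumerate(BACKOFF_SECONDS, start=1):
        body = probe()
        if not body.lstrip().startswith("<"):  # crude HTML-vs-JSON check
            return body
        print(f"Spark UI loading... retrying ({attempt}/{len(BACKOFF_SECONDS)})")
        sleep(delay)
    raise TimeoutError("Spark UI did not finish warming up")
```

Injecting `sleep` makes the backoff testable without actually waiting.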
Getting the --sparkui-cookie
On authenticated Databricks workspaces, the Spark UI endpoint requires a cookie instead of a Bearer token:
- Open the Databricks workspace in your browser
- Navigate to your cluster’s Spark UI tab (this warms up the endpoint)
- Open browser DevTools (F12) → Application → Cookies
- Find the adb-dp-* domain (e.g., adb-dp-1234567890.azuredatabricks.net)
- Copy the value of the DATAPLANE_DOMAIN_DBAUTH cookie
Then pass it to spark-tui:
spark-tui --sparkui-cookie "eyJ0eXAiOiJKV1Q..." --cluster-id 0123-456789-abcdef
# or
export SPARK_TUI_SPARKUI_COOKIE="eyJ0eXAiOiJKV1Q..."
spark-tui --cluster-id 0123-456789-abcdef
Using --event-log-path
If you know the exact DBFS path to a Spark event log file:
spark-tui --event-log-path "dbfs:/cluster-logs/0123-456789-abcdef/eventlog/events.log.gz"
This is useful when the automatic DBFS scanning doesn’t find your logs (e.g., custom log delivery paths).
Logging
spark-tui writes logs to /tmp/spark-tui.log (logs cannot go to stderr as it would corrupt the TUI). Control the log level with the RUST_LOG environment variable:
RUST_LOG=info spark-tui # Info and above
RUST_LOG=debug spark-tui # Debug messages
RUST_LOG=trace spark-tui # Everything
Default log level is warn.
Navigation & Keybindings
spark-tui uses vim-style keybindings for navigation. The interface has three view modes arranged in a drill-down hierarchy.
View Modes
List ──Enter──▶ JobDetail ──Enter──▶ StageDetail
                    │
                    └────s────▶ SqlDetail

(Esc steps back one level at each stage)
- List — the top-level view showing either the Jobs or Suspects tab
- JobDetail — stage breakdown and duration bar chart for a selected job
- StageDetail — detailed metrics for a selected stage (I/O, CPU %, peak RAM, task histograms, per-executor breakdown, skew metrics)
- SqlDetail — scrollable SQL execution plan for the selected job’s query
Keybindings
Global
| Key | Action |
|---|---|
| q | Quit the application |
| Esc | Go back one level (SqlDetail → JobDetail → List → Quit) |
| h | Toggle help overlay (keybinding reference in most views; PySpark recommendations in SqlDetail) |
List Mode (Jobs / Suspects tabs)
| Key | Action |
|---|---|
| Tab | Switch to next tab |
| Shift+Tab | Switch to previous tab |
| j / ↓ | Move selection down |
| k / ↑ | Move selection up |
| g / Home | Jump to first row |
| G / End | Jump to last row |
| Enter | Drill into the selected job’s detail view |
Status bar hint: q:quit Tab:switch j/k:nav Enter:detail h:help
JobDetail Mode
| Key | Action |
|---|---|
| j / ↓ | Move selection down in the stage list |
| k / ↑ | Move selection up in the stage list |
| g / Home | Jump to first stage |
| G / End | Jump to last stage |
| Enter | Drill into the selected stage’s detail view |
| s | Open SQL plan view (if the job has a linked SQL execution) |
| Esc | Return to List mode |
Status bar hint: Esc:back j/k:nav Enter:stage s:sql h:help
Note: When entering JobDetail from the Suspects tab, pressing Esc returns to the Suspects tab (not Jobs). This is tracked via the return_tab field.
StageDetail Mode
| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| Esc | Return to JobDetail mode |
Status bar hint: Esc:back j/k:scroll g/G:top/bot h:help
SqlDetail Mode
| Key | Action |
|---|---|
| j / ↓ | Scroll down |
| k / ↑ | Scroll up |
| g / Home | Scroll to top |
| G / End | Scroll to bottom |
| h | Show PySpark recommendations for suspects related to this SQL execution |
| Esc | Return to JobDetail mode |
Status bar hint: Esc:back j/k:scroll g/G:top/bot h:hints
Help Overlay
Pressing h toggles a help overlay:
- In List, JobDetail, and StageDetail modes: Shows a general keybinding reference card listing all available shortcuts for the current view.
- In SqlDetail mode: Shows PySpark-specific recommendations based on suspect findings related to the current SQL execution.
Press h again or Esc to dismiss the overlay.
Tabs
Jobs Tab
Displays all Spark jobs in a table, ranked by duration (slowest first). Running jobs (with no completion time) appear at the top. Columns include:
- Job ID
- Status (with color coding)
- Duration
- Task counts
- SQL description (if linked)
- Submission time
Suspects Tab
Displays automatically detected performance issues, sorted by severity (Critical first), then by estimated savings descending as a tiebreaker. Each row shows:
- Severity indicator (color-coded)
- Category (Slow Stage / Data Skew / Data Size Skew / Record Count Skew / Disk Spill / CPU Bottleneck / I/O Bottleneck / Record Explosion / Task Failures / Memory Pressure / Executor Hotspot / Too Many Partitions / Too Few Partitions / Broadcast Join Opportunity / Python UDF / Cache Opportunity)
- Stage ID and job ID
- Title with key metrics
- Detail summary
- Recommendation
Color Coding
| Color | Meaning |
|---|---|
| Red | Critical severity, failed status |
| Yellow | Warning severity, running status |
| Green | Healthy / succeeded status |
| Gray | Muted / secondary information |
| Cyan | Selected row highlight |
| Magenta | CP — Critical Path stage (longest-running stage per job) |
CPU Utilization (Stage Detail)
| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% or < 30% | CPU saturated or severe I/O bound |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized |
Peak Memory (Stage Detail)
Color-coded relative to total cluster memory (ratio-based), with absolute fallback when executor data is unavailable. See Understanding Analysis for threshold details.
Summary Bar
The health summary bar in List view uses colored foreground text (not colored background) to display job/IO counts and top issues: red for critical, yellow for warning, green for healthy.
Understanding Analysis
spark-tui automatically detects performance issues in your Spark application and presents them as suspects. This guide explains how each detector works, what the thresholds mean, and how to act on the findings.
Suspect Categories
Slow Stage
Detects stages whose executor_run_time is statistically anomalous compared to all completed stages.
How it works:
- Computes the mean and standard deviation of executor_run_time across all completed stages
- Flags stages that exceed the threshold
| Severity | Threshold |
|---|---|
| Warning | executor_run_time > mean + 2 * stddev |
| Critical | executor_run_time > mean + 4 * stddev |
The suspect detail shows how many times slower the stage is compared to the average (e.g., “3.5x slower than average”).
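The mean/stddev check above can be sketched as follows. This is a minimal illustrative sketch (runtimes in milliseconds), not spark-tui's actual Rust implementation:

```python
from statistics import mean, stdev

# Sketch of the slow-stage heuristic: flag stages whose runtime is more
# than 2 (warning) or 4 (critical) standard deviations above the mean.
def detect_slow_stages(runtimes: dict[str, float]) -> dict[str, str]:
    values = list(runtimes.values())
    if len(values) < 2:
        return {}
    mu, sigma = mean(values), stdev(values)
    findings = {}
    for stage, rt in runtimes.items():
        if rt > mu + 4 * sigma:
            findings[stage] = "critical"
        elif rt > mu + 2 * sigma:
            findings[stage] = "warning"
    return findings
```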
Data Skew
Detects uneven task duration distribution within a stage, indicating skewed partitions.
How it works:
- Collects all task durations for the stage
- Computes the coefficient of variation (CV = stddev / mean) and the max/median ratio
- Flags if either metric exceeds threshold
| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |
The suspect detail identifies the slowest task, its duration vs. the median, and how much data it processed.
Note: Task-level analysis is performed for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed.
Data Size Skew
Detects uneven data size distribution across tasks within a stage.
How it works:
- Computes the total bytes processed per task (input_bytes + shuffle_read_bytes)
- Applies the same CV and max/median ratio thresholds as duration skew
| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median |
| Critical | CV > 2.0 or max > 10x median |
The suspect detail identifies the task processing the most data and compares its byte count to the median.
Record Count Skew
Detects uneven record count distribution across tasks within a stage.
How it works:
- Computes the total records processed per task (input_records + shuffle_read_records)
- Applies CV and max/median ratio thresholds (only when max records > 1000)
| Severity | Threshold |
|---|---|
| Warning | CV > 1.0 or max > 3x median (and max > 1000) |
| Critical | CV > 2.0 or max > 10x median |
Indicates hot keys in joins or group-bys.
Disk Spill
Detects stages where data was spilled from memory to disk, indicating insufficient executor memory.
How it works:
- Checks disk_bytes_spilled for each stage
- Any spill > 0 is flagged
| Severity | Threshold |
|---|---|
| Warning | disk_bytes_spilled > 0 |
| Critical | disk_bytes_spilled > 1 GB |
The suspect detail shows both memory spill and disk spill amounts.
CPU Bottleneck
Detects stages where the CPU is fully saturated for a sustained period.
How it works:
- Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time
- Flags stages with high CPU ratio and significant runtime
| Severity | Threshold |
|---|---|
| Warning | cpu_ratio > 0.9 and runtime > 30s |
The suspect detail shows CPU time vs. runtime and utilization percentage.
I/O Bottleneck
Detects stages that are I/O or GC bound (low CPU utilization despite significant runtime).
How it works:
- Uses the same CPU ratio as CPU Bottleneck detection
- Flags stages with low CPU ratio
| Severity | Threshold |
|---|---|
| Warning | cpu_ratio < 0.3 and runtime > 10s |
Consider increasing memory, improving data locality, using faster storage, or checking GC pauses.
Record Explosion
Detects stages where output records vastly exceed input records, indicating explode(), cross joins, or generate() operations.
How it works:
- Checks if output_records > 10x input_records (only when input_records > 1000)
| Severity | Threshold |
|---|---|
| Warning | output_records > 10x input_records |
| Critical | output_records > 100x input_records |
Task Failures
Detects stages with failed or killed tasks.
How it works:
- Checks if num_failed_tasks > 0 or num_killed_tasks > 0
| Severity | Threshold |
|---|---|
| Warning | Any failed or killed tasks |
| Critical | Failure rate > 10% or total problematic > 10 |
Common causes include OOM, data corruption, and fetch failures.
Memory Pressure
Detects stages where memory spill is occurring but hasn’t yet reached disk — a proactive warning before disk spill happens.
How it works:
- Checks if memory_bytes_spilled > 50 MB and disk_bytes_spilled == 0
| Severity | Threshold |
|---|---|
| Warning | memory_bytes_spilled > 50 MB with no disk spill |
Recommendation: increase spark.executor.memory or spark.executor.memoryOverhead, reduce partition size.
Executor Hotspot
Detects stages where a single executor handles a disproportionate share of data.
How it works:
- Sums input_bytes + shuffle_read_bytes per executor
- Flags executors processing > 50% of total data
| Severity | Threshold |
|---|---|
| Warning | One executor handles > 50% of data |
Check data locality and partition assignment. This may indicate skewed partition-to-executor mapping.
Too Many Partitions
Detects stages with excessive small partitions, causing high scheduling overhead.
How it works:
- Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
- Flags stages with too many tiny partitions
| Severity | Threshold |
|---|---|
| Warning | num_tasks > 10,000 and avg_bytes_per_task < 1 MB |
The recommendation suggests a target partition count to achieve ~128 MB/partition: df.coalesce(N).
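The target-N calculation for the ~128 MB/partition recommendation can be sketched as (illustrative, not the tool's exact code):

```python
# Sketch of the partition-count recommendation: pick N so that each
# partition holds roughly 128 MB.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def recommend_partitions(total_bytes: int) -> int:
    return max(1, round(total_bytes / TARGET_PARTITION_BYTES))

# A 10 GB stage with 50,000 tiny tasks would be coalesced to ~80 partitions:
n = recommend_partitions(10 * 1024**3)
# then, in PySpark: df.coalesce(n)
```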
Too Few Partitions
Detects stages with too few large partitions, causing stragglers and underutilized executors.
How it works:
- Computes avg_bytes_per_task = (input_bytes + shuffle_read_bytes) / num_tasks
- Flags stages with too few large partitions
| Severity | Threshold |
|---|---|
| Warning | num_tasks ≤ 8 and avg_bytes_per_task > 1 GB |
The recommendation suggests a target partition count: df.repartition(N).
Broadcast Join Opportunity
Detects shuffle joins where one side is small enough to broadcast, eliminating the shuffle entirely.
How it works:
- Filters stages with shuffle_write_bytes < 100 MB and executor_run_time > 5s
- Checks if the SQL plan contains join indicators (SortMerge, ShuffledHash, Join)
| Severity | Threshold |
|---|---|
| Warning | shuffle_write < 100 MB and join detected in SQL plan |
Recommendation: from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key').
Python UDF
Detects Python UDF usage in SQL plans, which causes row-by-row serialization overhead.
How it works:
- Searches the SQL plan hint for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner
- If the stage is also CPU-bound (ratio > 0.9, runtime > 30s), severity escalates to Critical
| Severity | Threshold |
|---|---|
| Warning | Python UDF marker found in plan, runtime > 5s |
| Critical | Python UDF + CPU ratio > 0.9 + runtime > 30s |
Recommendation: Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() functions.
Cache Opportunity
Detects repeated computations (stages with the same name) that could benefit from caching.
How it works:
- Groups completed stages by cleaned name
- Flags groups where ≥ 2 stages share a name and total runtime > 30s
| Severity | Threshold |
|---|---|
| Warning | ≥ 2 stages with same name, total runtime > 30s |
Recommendation: df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action. Call df.unpersist() when no longer needed.
Bottleneck Classification
When a slow stage or spill suspect is detected, spark-tui classifies the root cause based on I/O patterns:
| Pattern | Condition | Meaning |
|---|---|---|
| Data Explosion | input > 100 MB and output > 5x input | Stage produces far more data than it reads (e.g., explode, cross join) |
| Large Scan | input > 1 GB and input > 10x (output + shuffle_write) | Stage reads a lot but produces little (missing pushdown filters) |
| Wide Shuffle | shuffle_write > 500 MB or shuffle_read > input | Stage shuffles more data than it reads directly (broad join, groupBy on high-cardinality key) |
| Record Explosion | output_records > 10x input_records | Attached to record explosion suspects (see above) |
If none of these patterns match, no bottleneck tag is shown.
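The classification above amounts to a first-match rule chain over stage I/O totals. A minimal sketch (byte counts as inputs; function name illustrative):

```python
MB, GB = 1024**2, 1024**3

# Sketch of the bottleneck table: patterns are checked in order and the
# first match wins; None means no bottleneck tag is shown.
def classify_bottleneck(inp, out, shuffle_read, shuffle_write):
    if inp > 100 * MB and out > 5 * inp:
        return "Data Explosion"
    if inp > 1 * GB and inp > 10 * (out + shuffle_write):
        return "Large Scan"
    if shuffle_write > 500 * MB or shuffle_read > inp:
        return "Wide Shuffle"
    return None
```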
Recommendations
All recommendations use PySpark-specific syntax for immediate applicability. Each suspect includes a recommendation based on its category and bottleneck pattern:
| Category + Bottleneck | Recommendation |
|---|---|
| Data Skew | Repartition or salt skewed keys |
| Data Size Skew | Repartition by a more uniform key or use salting |
| Record Count Skew | Check for hot keys in joins or group-bys |
| Disk Spill | spark.conf.set('spark.executor.memory', '8g') or df.repartition(200) |
| CPU Bottleneck | Replace @udf with @pandas_udf or native F.when()/F.expr(). Cache with df.cache() and increase parallelism |
| I/O Bottleneck | spark.conf.set('spark.executor.memory', '8g'), cache hot DataFrames with df.cache(), or use df.repartition() for better locality |
| Record Explosion | Filter before explode: df.filter(...).select(explode('col')). Check for unintentional cross joins |
| Task Failures | Check executor logs for OOM/fetch failures. spark.conf.set('spark.task.maxFailures', '4') |
| Memory Pressure | spark.conf.set('spark.executor.memory', '8g') and spark.conf.set('spark.executor.memoryOverhead', '2g') |
| Executor Hotspot | Check data locality and partition assignment |
| Too Many Partitions | df.coalesce(N) to target ~128 MB/partition |
| Too Few Partitions | df.repartition(N) to target ~128 MB/partition |
| Broadcast Join Opportunity | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), on='key') |
| Python UDF | Replace @udf with @pandas_udf for vectorized execution, or use native F.when()/F.expr() |
| Cache Opportunity | df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) before the first action |
| Slow Stage + Large Scan | df.filter(F.col('date') >= '2024-01-01') and select only needed columns. Use partition pruning |
| Slow Stage + Wide Shuffle | from pyspark.sql.functions import broadcast; df.join(broadcast(small_df), ...). Pre-aggregate with groupBy before joins |
| Slow Stage + Data Explosion | Filter before explode: df.filter(...).withColumn('x', explode('arr')) |
| Slow Stage (no pattern) | df.explain(True) to see the query plan. Large shuffle may indicate missing filters or broad joins |
Estimated Savings
Each suspect includes an estimated_savings_ms field — a rough estimate of how much time could be saved by addressing the issue. This is used as a secondary sort key (after severity) so that higher-impact issues appear first within the same severity level.
How savings are computed per category
| Category | Estimation Method |
|---|---|
| Slow Stage | executor_run_time - mean_runtime (time above average) |
| Disk Spill | ~30% of executor_run_time (spill overhead) |
| CPU Bottleneck | ~20% of executor_run_time |
| I/O Bottleneck | ~20% of executor_run_time |
| Record Explosion | ~50% of executor_run_time |
| Task Failures | executor_run_time × failure_rate (retry overhead) |
| Memory Pressure | ~10% of executor_run_time (GC pause overhead) |
| Too Many Partitions | ~40% of executor_run_time (scheduling overhead) |
| Too Few Partitions | ~50% of executor_run_time (straggler overhead) |
| Broadcast Join Opportunity | ~60% of executor_run_time (shuffle elimination) |
| Python UDF | ~50% of executor_run_time (serialization overhead) |
| Cache Opportunity | total_runtime - min_single_runtime (repeated computation) |
These are heuristic estimates intended for prioritization, not precise predictions.
SQL Correlation
Each suspect is linked to its originating SQL execution when possible. The suspect shows:
- SQL ID — the Spark SQL execution identifier
- SQL Description — the query text or description
- SQL Plan Hint — the top operations from the physical plan (e.g., “HashAggregate -> Exchange -> Scan parquet”)
This helps trace the suspect back to the specific query that caused it.
I/O Summary
Slow stage and spill suspects include an I/O summary showing:
- Input bytes / records
- Output bytes / records
- Shuffle read bytes / records
- Shuffle write bytes / records
- Memory and disk spill amounts
Use this to understand the data flow through the flagged stage.
Severity Sorting
Suspects are sorted by severity (Critical first, then Warning), with estimated_savings_ms descending as a tiebreaker within the same severity level. The Suspects tab title reflects this: "Suspects (severity → savings)".
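The two-level ordering can be sketched with a composite sort key. Field names here are illustrative stand-ins for the suspect struct:

```python
# Sketch of suspect ordering: Critical before Warning, then larger
# estimated savings first within the same severity.
SEVERITY_RANK = {"critical": 0, "warning": 1}

def sort_suspects(suspects: list[dict]) -> list[dict]:
    return sorted(
        suspects,
        key=lambda s: (SEVERITY_RANK[s["severity"]], -s["estimated_savings_ms"]),
    )

ordered = sort_suspects([
    {"severity": "warning", "estimated_savings_ms": 90_000},
    {"severity": "critical", "estimated_savings_ms": 5_000},
    {"severity": "critical", "estimated_savings_ms": 60_000},
])
# criticals first (60s saving before 5s), warning last despite larger savings
```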
Color-Coding in Stage Detail
The stage detail view uses color-coded metrics to help identify issues at a glance.
CPU Utilization
The CPU % value in the stage header is color-coded based on the CPU ratio (executor_cpu_time / executor_run_time):
| Color | Range | Meaning |
|---|---|---|
| Red | ≥ 95% | CPU saturated |
| Green | 50%–94% | Healthy utilization |
| Yellow | 30%–49% | Underutilized (possible I/O bound) |
| Red | < 30% | Severe I/O bound |
Peak Memory
Peak execution memory is color-coded relative to total cluster memory when executor data is available:
| Color | Ratio to cluster memory | Meaning |
|---|---|---|
| Red | ≥ 80% | Near memory limit |
| Yellow | 50%–79% | Moderate usage |
| Green | 10%–49% | Comfortable |
| Default | < 10% | Low usage |
When executor data is unavailable, absolute thresholds are used as fallback:
| Color | Threshold | Meaning |
|---|---|---|
| Red | ≥ 10 GB | High memory usage |
| Yellow | ≥ 1 GB | Moderate |
| Green | ≥ 100 MB | Normal |
| Default | < 100 MB | Low |
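The two-tier color rule (ratio-based when cluster memory is known, absolute thresholds otherwise) can be sketched as (illustrative only):

```python
GB, MB = 1024**3, 1024**2

# Sketch of peak-memory coloring: prefer the ratio to total cluster
# memory; fall back to absolute thresholds when executor data is missing.
def peak_memory_color(peak_bytes, cluster_bytes=None):
    if cluster_bytes:
        ratio = peak_bytes / cluster_bytes
        if ratio >= 0.80:
            return "red"
        if ratio >= 0.50:
            return "yellow"
        if ratio >= 0.10:
            return "green"
        return "default"
    if peak_bytes >= 10 * GB:
        return "red"
    if peak_bytes >= 1 * GB:
        return "yellow"
    if peak_bytes >= 100 * MB:
        return "green"
    return "default"
```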
Architecture
spark-tui follows a modular architecture with clear separation between configuration, data fetching, analysis, and rendering.
Module Map
src/
├── main.rs
├── config/
│ └── mod.rs CLI args, env vars, ~/.databrickscfg parsing
├── fetch/
│ ├── client.rs SparkHttpClient + FetchError
│ ├── spark.rs Endpoint methods (get_jobs, get_stages, etc.)
│ ├── types.rs Spark API response types (serde)
│ ├── databricks.rs DatabricksClient (cluster info, DBFS, sparkui, history server)
│ ├── orchestrator.rs poll_once, assemble_data_payload, compute_health_summary
│ ├── poller.rs run_poller + historical fallback chain
│ └── eventlog/ Event log parsing (DBFS download, gzip, SparkEvent serde)
├── analyze/
│ ├── types.rs Suspect, Severity, SuspectCategory, BottleneckPattern
│ ├── skew/ Data skew detection (CV + max/median)
│ ├── suspects/ SuspectContext, 10 detectors, bottleneck classification
│ └── sql_linker/ Job ↔ SQL ↔ Stage mapping
├── tui/
│ ├── app/ App state, event loop, key handling, rendering dispatch
│ │ ├── state.rs
│ │ ├── input.rs
│ │ └── render.rs
│ ├── theme.rs Color/style functions
│ ├── highlight.rs SQL/plan syntax highlighting
│ ├── tabs/
│ │ ├── jobs_list.rs Jobs table
│ │ ├── job_detail.rs Stage breakdown for a job
│ │ ├── sql_detail.rs SQL execution plan view
│ │ ├── stage_detail.rs Detailed stage metrics
│ │ └── suspects.rs Suspects table view
│ └── widgets/
│ ├── help.rs Help overlay
│ ├── status_line.rs Status bar
│ └── summary_bar.rs Health summary bar
└── util/
├── format/ format_duration_ms, format_bytes, truncate, clean_stage_name
└── time/ Spark timestamp parsing, duration_between
Data Flow
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐
│ Config │────▶│ SparkHttp │────▶│ Poller │────▶│ Analysis │
│ resolve │ │ Client │ │ (poll_once) │ │ Engine │
└──────────┘ └──────────────┘ └──────┬───────┘ └─────┬─────┘
│ │
DataPayload Suspects
+ stage_sql_hints (via SuspectContext)
+ critical_stages
│ │
▼ ▼
┌───────────────────────────────┐
│ App (TUI) │
│ event loop ← mpsc channel │
└───────────────────────────────┘
Step by step:
- Config resolution (config/mod.rs) — parses CLI args, env vars, and ~/.databrickscfg to produce a Config struct with host, token, cluster_id, and poll_interval
- HTTP client (fetch/client.rs) — SparkHttpClient wraps reqwest::Client with the base URL and token. FetchError maps HTTP status codes to user-friendly messages
- Endpoint methods (fetch/spark.rs) — discover_app_id, get_jobs, get_stages, get_sql_executions, get_task_list, get_executors — each calls the Spark REST API and deserializes the response
- Background poller (fetch/poller.rs) — run_poller runs in a tokio task. When the cluster becomes unreachable (503 or terminated), the poller automatically falls back to historical data via a 4-strategy chain: Spark UI REST API (with warm-up retry), Spark History Server proxy, DBFS event logs, and default DBFS path scanning. poll_once lives in fetch/orchestrator.rs (separate from the poller loop) and:
  - Fetches jobs, stages, SQL executions, and executors concurrently via 4-way tokio::join!
  - Aggregates active executors into ClusterResources (total memory, cores, executor count)
  - Builds cross-reference maps (job↔SQL, stage↔job)
  - Creates a SuspectContext with the cross-reference maps
  - Runs 10 stage-level detectors via a function pointer table, plus skew detection on task data
  - Fetches task lists for up to ~15 stages (selected by multiple heuristics)
  - Computes stage_sql_hints (SQL plan hints per stage) and critical_stages (longest wall-clock stage per job)
  - Computes HealthSummary for the summary bar
  - Sends a DataPayload (including cluster_resources, stage_sql_hints, critical_stages) through an mpsc channel
- Analysis (analyze/) — 10 stage-level detectors are dispatched via a function pointer table (&[DetectorFn]): detect_slow_stages, detect_spill, detect_cpu_efficiency, detect_record_explosion, detect_task_failures, detect_memory_pressure, detect_partition_count, detect_broadcast_join, detect_python_udf, detect_cache_opportunity. Each takes (&[SparkStage], &SuspectContext) and returns Vec<Suspect>. detect_skew runs separately on task data. aggregate_suspects sorts by severity then estimated_savings_ms
- App event loop (tui/app/) — App::run receives Action variants from the mpsc channel:
  - Action::DataUpdate(payload) — stores the new data
  - Action::FetchError(err) — stores the error message
  - Action::Key(event) — processes keybindings
  - Action::Resize(w, h) — triggers re-render
- Rendering (tui/tabs/, tui/widgets/) — renders the current view mode (List, JobDetail, StageDetail, SqlDetail) using ratatui widgets. The summary bar widget displays health metrics in List view
Async Model
spark-tui uses the tokio runtime with three concurrent tasks:
| Task | Channel | Description |
|---|---|---|
| Poller | tx → rx | Fetches data and sends Action::DataUpdate / Action::FetchError |
| Event reader | tx → rx | Reads terminal events via crossterm::event::read (blocking, wrapped in spawn_blocking) |
| App loop | rx | Receives all actions and processes them sequentially |
All tasks communicate through a single mpsc::UnboundedSender<Action> channel. The app loop owns the receiver and processes actions one at a time, ensuring thread-safe state updates without locks.
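The single-channel design above can be sketched as follows. This is an illustrative stand-in using `std::sync::mpsc` and OS threads so it runs standalone; the real app uses tokio's unbounded channel and async tasks, and its `Action` enum carries full payloads rather than the placeholder fields here.

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-in for the app's Action enum (the real one carries a
// DataPayload, key events, and resize dimensions).
#[derive(Debug, PartialEq)]
enum Action {
    DataUpdate(u64),
    FetchError(String),
}

// Drain the receiver sequentially, mirroring how the app loop processes one
// action at a time: only this consumer touches app state, so no locks needed.
fn drain(rx: mpsc::Receiver<Action>) -> Vec<Action> {
    rx.into_iter().collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // One sender clone per producer (poller, event reader); the loop owns rx.
    let poller_tx = tx.clone();
    thread::spawn(move || poller_tx.send(Action::DataUpdate(1)).unwrap())
        .join()
        .unwrap();
    tx.send(Action::FetchError("503".into())).unwrap();
    drop(tx); // the channel closes once the last sender is dropped

    let actions = drain(rx);
    assert_eq!(actions.len(), 2);
    assert_eq!(actions[0], Action::DataUpdate(1));
}
```

Cloning the sender per producer while keeping a single receiver is what makes the sequential, lock-free state updates possible.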
Design Decisions
- Bounded task fetching: Task lists (per-task metrics) are fetched for up to ~15 stages selected by multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism). On-demand task fetching is triggered when entering StageDetail for stages not already analyzed
- Concurrent fetches: Jobs, stages, SQL executions, and executors are fetched in parallel with 4-way `tokio::join!` to minimize latency
- Function pointer dispatch: Stage-level detectors are stored in a `&[DetectorFn]` array and dispatched via `flat_map`, making it easy to add new detectors
- SuspectContext: Replaces ad-hoc parameter passing — all cross-reference maps are bundled in a single struct with helper methods (`job_id`, `resolve_sql`, `resolve_plan_hint_for`, `enrich`)
- tui-scrollview: Used for smooth scrolling in StageDetail and SqlDetail views, replacing manual `u16` scroll offsets with `ScrollViewState`
- Log file: Logs go to `/tmp/spark-tui.log` instead of stderr to avoid corrupting the TUI
- Panic hook: A custom panic hook restores the terminal before printing the panic message, preventing terminal corruption
- Edition 2024: Uses the latest Rust edition for modern language features
Dependencies
| Crate | Purpose |
|---|---|
clap | CLI argument parsing with env var fallback |
tokio | Async runtime (macros, rt-multi-thread, time, sync features) |
reqwest | HTTP client (with rustls-tls) |
serde / serde_json | JSON deserialization |
thiserror | Error type derivation |
ratatui | Terminal UI framework (with unstable-rendered-line-info feature) |
crossterm | Terminal backend |
tracing / tracing-subscriber | Structured logging |
chrono | Timestamp parsing |
syntect / syntect-tui | SQL syntax highlighting |
tui-scrollview | Smooth scrollable views for detail panels |
CI/CD Workflows
| Workflow | Trigger | Description |
|---|---|---|
ci.yml | Push / PR | Runs cargo fmt --check, cargo clippy, cargo test |
docs.yml | Push / PR | Builds and deploys mdbook documentation to GitHub Pages |
auto-tag.yml | Push to master (Cargo.toml changed) | Creates a vX.Y.Z tag when the version in Cargo.toml changes |
release.yml | Tag v* | Cross-platform release builds (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) with GitHub Release artifacts |
Module Reference
This section provides a reference for each module in the spark-tui codebase.
Modules
| Module | Path | Description |
|---|---|---|
| Config | src/config/ | CLI argument parsing, config resolution, ~/.databrickscfg support |
| Fetch | src/fetch/ | HTTP client, Spark API types, endpoint methods, background poller |
| Analyze | src/analyze/ | Suspect detection (slow stages, skew, spill), bottleneck classification, SQL linking |
| TUI | src/tui/ | App state machine, tab rendering, widgets, theme |
| Utilities | src/util/ | Formatting (duration, bytes) and time parsing helpers |
Config Module
Path: src/config.rs
Handles CLI argument parsing, environment variable fallback, and ~/.databrickscfg file parsing to produce a resolved Config struct.
Structs
CliArgs
Clap-derived struct for command-line arguments. Each field uses #[arg(env = "...")] for automatic env var fallback.
```rust
pub struct CliArgs {
    pub host: Option<String>,       // --host / DATABRICKS_HOST
    pub token: Option<String>,      // --token / DATABRICKS_TOKEN
    pub cluster_id: Option<String>, // --cluster-id / DATABRICKS_CLUSTER_ID
    pub profile: Option<String>,    // --profile, -p / DATABRICKS_CONFIG_PROFILE
    pub poll_interval: u64,         // --poll-interval / SPARK_TUI_POLL_INTERVAL (default: 10)
}
```
Config
Resolved configuration with all required fields guaranteed present.
```rust
pub struct Config {
    pub host: String,
    pub token: String,
    pub cluster_id: String,
    pub poll_interval: u64,
}
```
Methods:
| Method | Signature | Description |
|---|---|---|
base_url | &self -> String | Constructs the Spark REST API base URL. Strips https:// prefix and trailing slashes from host |
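The normalization behind `base_url` can be sketched as a small free function. The helper names here are hypothetical; the real method lives on `Config` and splices in the cluster ID and driver-proxy port using the endpoint shape documented in the introduction.

```rust
// Hypothetical sketch of the host normalization behind Config::base_url.
fn normalize_host(host: &str) -> String {
    host.trim_start_matches("https://")
        .trim_start_matches("http://")
        .trim_end_matches('/')
        .to_string()
}

fn base_url(host: &str, cluster_id: &str) -> String {
    // URL shape taken from the driver-proxy endpoint documented earlier.
    format!(
        "https://{}/driver-proxy-api/o/0/{}/40001/api/v1",
        normalize_host(host),
        cluster_id
    )
}

fn main() {
    assert_eq!(
        normalize_host("https://dbc-123.cloud.databricks.com/"),
        "dbc-123.cloud.databricks.com"
    );
    println!("{}", base_url("https://dbc-123.cloud.databricks.com/", "0123-456789-abcde"));
}
```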
Functions
| Function | Signature | Description |
|---|---|---|
resolve_config | () -> Result<Config, String> | Main entry point. Resolves config from CLI > env > databrickscfg |
databrickscfg_path | () -> Option<PathBuf> | Returns ~/.databrickscfg path if the file exists |
parse_databrickscfg | (path) -> Result<HashMap<String, Profile>> | Parses INI-format config file into profile sections |
find_complete_profile | (profiles) -> Option<&Profile> | Finds the first profile with host, token, and cluster_id |
Resolution Logic
- Parse CLI args (clap handles env var fallback)
- If all three fields are present → return `Config`
- Otherwise, read `~/.databrickscfg`
- If `--profile` is set → use that section (error if not found)
- Otherwise → auto-detect the first complete profile
- Merge CLI/env values with profile values (CLI/env takes priority)
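The final merge step reduces to `Option::or` per field: the CLI/env value wins when present, and the profile fills the gaps. This helper is illustrative; the real resolver merges host, token, and cluster_id field by field.

```rust
// Sketch of the merge precedence: CLI/env first, profile as fallback.
fn merge(cli: Option<String>, profile: Option<String>) -> Option<String> {
    cli.or(profile)
}

fn main() {
    // CLI/env value wins when both are present...
    assert_eq!(
        merge(Some("cli-host".into()), Some("cfg-host".into())),
        Some("cli-host".to_string())
    );
    // ...and the profile fills fields the CLI left open.
    assert_eq!(merge(None, Some("cfg-token".into())), Some("cfg-token".to_string()));
    assert_eq!(merge(None, None), None);
}
```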
Tests
- `test_parse_databrickscfg` — verifies INI parsing
- `test_find_complete_profile` — verifies auto-detection of complete profiles
- `test_base_url_strips_scheme_and_trailing_slash` — verifies URL normalization
Fetch Module
Path: src/fetch/
Handles HTTP communication with the Spark REST API via the Databricks driver proxy.
Files
| File | Purpose |
|---|---|
client.rs | SparkHttpClient and FetchError |
types.rs | Spark API response types (serde) |
spark.rs | Endpoint methods on SparkHttpClient |
databricks.rs | DatabricksClient for Databricks REST API, DBFS, Spark UI, and History Server |
orchestrator.rs | poll_once, assemble_data_payload, compute_health_summary |
poller.rs | Background polling loop and historical fallback chain |
eventlog/ | Event log parsing: DBFS download, gzip decompression, SparkEvent serde |
client.rs — SparkHttpClient
SparkHttpClient
```rust
pub struct SparkHttpClient {
    client: reqwest::Client,
    base_url: String,
    token: String,
}
```
| Method | Signature | Description |
|---|---|---|
new | (base_url, token) -> Self | Creates a new client |
base_url | &self -> &str | Returns the base URL |
get | &self, path: &str -> Result<T, FetchError> | Generic GET request with Bearer auth and JSON deserialization |
FetchError
Error enum with user-friendly messages for common HTTP errors:
| Variant | Status | Message |
|---|---|---|
Unauthorized | 401 | Token expired or invalid |
Forbidden | 403 | Insufficient permissions |
NotFound | 404 | Spark UI not available / app may have ended |
ServiceUnavailable | 503 | Cluster not reachable |
HttpError | other | Generic HTTP error with status and body |
Deserialize | — | JSON deserialization failure |
Request | — | Network-level request failure |
NoApplications | — | No Spark applications found |
types.rs — API Types
Application & Job Types
| Type | Key Fields |
|---|---|
SparkApplication | id, name |
SparkJob | job_id, name, status, submission_time, completion_time, stage_ids, task counts |
JobStatus | Succeeded, Running, Failed, Unknown |
Stage Types
| Type | Key Fields |
|---|---|
SparkStage | stage_id, attempt_id, status, num_tasks, executor_run_time, I/O bytes, spill bytes |
StageStatus | Active, Complete, Pending, Failed, Skipped |
SQL Types
| Type | Key Fields |
|---|---|
SparkSqlExecution | id, status, description, plan_description, duration, job ID lists |
Task Types
| Type | Key Fields |
|---|---|
SparkTask | task_id, stage_id, executor_id, host, status, duration, I/O bytes, spill bytes, peak_execution_memory |
RawSparkTask | Raw API format with nested task_metrics (flattened into SparkTask via custom deserializer) |
Executor Types
| Type | Key Fields |
|---|---|
SparkExecutor | id, total_cores, max_memory, is_active |
ClusterResources | total_executor_memory, total_executor_cores, num_executors |
spark.rs — Endpoint Methods
Methods on SparkHttpClient:
| Method | Path | Returns |
|---|---|---|
discover_app_id | /applications | String (first application ID) |
get_jobs | /applications/{id}/jobs | Vec<SparkJob> |
get_stages | /applications/{id}/stages | Vec<SparkStage> |
get_sql_executions | /applications/{id}/sql | Vec<SparkSqlExecution> |
get_task_list | /applications/{id}/stages/{sid}/{attempt}/taskList | Vec<SparkTask> |
get_executors | /applications/{id}/executors | Vec<SparkExecutor> |
databricks.rs — DatabricksClient
Thin client for Databricks REST API /api/2.0/* endpoints and workspace-level requests.
DatabricksClient
```rust
pub struct DatabricksClient {
    client: reqwest::Client,
    base_url: String,       // e.g. https://{host}/api/2.0
    workspace_root: String, // e.g. https://{host}
    token: String,
    sparkui_cookie: Option<String>,
}
```
SparkuiProbeResult
Tri-state result from probing a Spark UI endpoint:
| Variant | Fields | Meaning |
|---|---|---|
Ready | base_url, app_id | Spark UI is serving JSON data |
Loading | base_url | Spark UI is authenticated but still downloading/parsing event logs |
NotFound | — | No accessible Spark UI endpoint found |
Key Methods
| Method | Description |
|---|---|
get_cluster_info | Fetch cluster state, log config, and spark_context_id |
try_sparkui_endpoint | Probe Historical Spark UI REST API (tries Bearer + cookie auth, workspace + dataplane domains) |
probe_sparkui_url | Re-probe a single URL for retry after Loading state |
fetch_sparkui_data | Fetch all data (jobs, stages, SQL, tasks) from a ready Spark UI endpoint |
discover_history_server | Probe known Spark History Server proxy URL patterns |
fetch_history_data | Fetch all data from History Server |
dbfs_list / dbfs_read_full | DBFS file operations |
find_default_event_logs | Scan well-known DBFS paths for event log files |
Warm-up Detection
The `is_loading_page()` helper detects HTML loading pages returned during Spark UI warm-up by checking for patterns like `<title>Loading</title>`, `loading spark ui`, `please wait`, or generic HTML responses.
eventlog/ — Event Log Parsing
| File | Purpose |
|---|---|
events.rs | SparkEvent serde types for Spark event log JSON lines |
parser.rs | EventLogParser — converts raw events into jobs, stages, SQL, tasks |
loader.rs | DBFS download + gzip decompression pipeline |
The load_event_log() function orchestrates: discover log path → download from DBFS → decompress gzip → parse JSON lines → return structured data.
orchestrator.rs — Data Polling & Assembly
poll_once
Fetches all data and runs analysis in a single poll cycle:
- Fetch jobs, stages, SQL executions, and executors concurrently (4-way `tokio::join!`)
- Aggregate active executors into `ClusterResources` (total memory, cores, executor count)
- Build cross-reference maps (job↔SQL, stage↔job)
- Build ranked jobs (sorted by duration, running first)
- Create a `SuspectContext` from the cross-reference maps
- Run 10 stage-level detectors via the function pointer table (`detect_slow_stages`, `detect_spill`, `detect_cpu_efficiency`, `detect_record_explosion`, `detect_task_failures`, `detect_memory_pressure`, `detect_partition_count`, `detect_broadcast_join`, `detect_python_udf`, `detect_cache_opportunity`)
- Fetch task lists for the top ~15 stages (selected by multiple heuristics)
- Run skew detection on fetched tasks (duration, data-size, record-count, executor hotspot)
- Aggregate and sort suspects (severity first, then `estimated_savings_ms` descending)
- Build `stage_sql_hints` — maps stage_id to top SQL plan operations
- Compute `critical_stages` — the longest wall-clock stage per job (critical path)
- Compute `HealthSummary` (job/IO counts, critical/warning counts, top issues)
- Return `DataPayload`
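The job-ranking step (running first, then slowest first) reduces to a two-key sort. The `Job` struct here is pared down to just the fields the sort needs; field names follow `RankedJob` below.

```rust
use std::cmp::Reverse;

#[derive(Debug, PartialEq)]
struct Job {
    job_id: i64,
    status: String,
    duration_ms: Option<i64>,
}

// Running jobs first (false sorts before true), then duration descending.
fn rank(mut jobs: Vec<Job>) -> Vec<Job> {
    jobs.sort_by_key(|j| (j.status != "RUNNING", Reverse(j.duration_ms.unwrap_or(0))));
    jobs
}

fn main() {
    let jobs = vec![
        Job { job_id: 1, status: "SUCCEEDED".into(), duration_ms: Some(9_000) },
        Job { job_id: 2, status: "RUNNING".into(), duration_ms: Some(1_000) },
        Job { job_id: 3, status: "SUCCEEDED".into(), duration_ms: Some(30_000) },
    ];
    let ids: Vec<i64> = rank(jobs).iter().map(|j| j.job_id).collect();
    assert_eq!(ids, vec![2, 3, 1]); // running first, then slowest completed
}
```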
compute_health_summary
```rust
fn compute_health_summary(
    jobs: &[RankedJob],
    stages: &[SparkStage],
    suspects: &[Suspect],
) -> HealthSummary
```
Aggregates job counts, total I/O bytes, and suspect severity counts into a HealthSummary for the summary bar widget.
poller.rs — Background Poller
run_poller
```rust
pub async fn run_poller(
    client: Arc<SparkHttpClient>,
    databricks: Arc<DatabricksClient>,
    config: Arc<Config>,
    tx: mpsc::UnboundedSender<Action>,
    poll_interval: Duration,
)
```
- Discovers the application ID (or detects cluster unreachable → historical fallback)
- Enters a live poll loop: `poll_once` (from `orchestrator.rs`) → send result via channel → sleep
- On connection loss during polling, falls back to historical data via `try_load_historical`
Historical Fallback Chain
When the cluster is unreachable, try_load_historical attempts these strategies in order:
| Strategy | Function | Condition |
|---|---|---|
| 0 — Spark UI | try_sparkui | Requires spark_context_id; retries with backoff if UI is loading |
| 1 — History Server | try_history_server | Probes known proxy URL patterns |
| 2 — DBFS event logs | try_dbfs_event_logs | Uses cluster_log_conf or --event-log-path |
| 3 — Default paths | find_default_event_logs | Scans well-known DBFS directories |
The Spark UI strategy includes warm-up retry: if the endpoint returns an HTML loading page, it retries with backoff delays defined in SPARKUI_RETRY_DELAYS (3, 5, 10, 15, 20 seconds — ~53s total).
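The retry schedule can be sketched as follows. `retry_until_ready` and `probe` are illustrative stand-ins for the poller's internals; only the delay values come from the text above.

```rust
// Delays (in seconds) between warm-up probes, as quoted above.
const SPARKUI_RETRY_DELAYS: [u64; 5] = [3, 5, 10, 15, 20];

// Probe once up front, then once after each backoff delay.
fn retry_until_ready(mut probe: impl FnMut() -> bool) -> bool {
    if probe() {
        return true;
    }
    for _delay in SPARKUI_RETRY_DELAYS {
        // The real poller sleeps `_delay` seconds here before re-probing.
        if probe() {
            return true;
        }
    }
    false
}

fn main() {
    // 3 + 5 + 10 + 15 + 20 = 53 seconds of total backoff.
    assert_eq!(SPARKUI_RETRY_DELAYS.iter().sum::<u64>(), 53);
    let mut calls = 0;
    let ready = retry_until_ready(|| {
        calls += 1;
        calls == 3 // succeeds on the third probe
    });
    assert!(ready);
}
```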
DataPayload
```rust
pub struct DataPayload {
    pub app_id: String,
    pub jobs: Vec<RankedJob>,
    pub stages: Vec<SparkStage>,
    pub sql_executions: Vec<SparkSqlExecution>,
    pub suspects: Vec<Suspect>,
    pub stage_tasks: Arc<HashMap<i64, Vec<SparkTask>>>,
    pub summary: HealthSummary,
    pub cluster_resources: ClusterResources,
    pub stage_sql_hints: Arc<HashMap<i64, String>>,
    pub critical_stages: Arc<HashSet<i64>>,
    pub last_updated: String,
    pub data_source: DataSourceMode,
    pub data_source_detail: Option<String>,
}
```
Note: DataPayload is defined in src/tui/mod.rs and contains all data needed to render the TUI.
Key fields:
- `stage_sql_hints` — maps `stage_id → String` with top SQL plan operations (e.g., “HashAggregate → Exchange → Scan parquet”), pre-computed for display in stage detail headers
- `critical_stages` — set of stage IDs that represent the critical path (longest wall-clock stage per job), used for “CP” annotations in the job detail view
- `data_source` — `DataSourceMode::Live` or `DataSourceMode::Historical`, displayed in the status line
- `data_source_detail` — optional label like `"Spark UI"`, `"history server"`, or `"event logs"`, shown alongside the HISTORICAL badge
Analyze Module
Path: src/analyze/
Contains all performance analysis logic: suspect detection, bottleneck classification, and SQL correlation.
Files
| File | Purpose |
|---|---|
types.rs | Core types: Suspect, Severity, SuspectCategory, BottleneckPattern, RankedJob, SqlJobLink |
skew/ | Data skew detection using task-level metrics |
suspects/ | SuspectContext, 10 stage-level detectors, bottleneck classification, aggregation |
sql_linker/ | Cross-reference maps between jobs, stages, and SQL executions |
types.rs — Core Types
Severity
```rust
pub enum Severity {
    Warning,
    Critical,
}
```
Implements Ord for sorting (Critical > Warning).
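Because `Critical` is declared after `Warning`, a derived `Ord` yields exactly this ordering. Whether the crate derives or hand-implements `Ord` is not shown here; this sketch assumes a derive.

```rust
// Variant order determines derived ordering: later variants compare greater.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Warning,
    Critical,
}

fn main() {
    assert!(Severity::Critical > Severity::Warning);
    let mut sevs = vec![Severity::Critical, Severity::Warning];
    sevs.sort(); // ascending: Warning first
    assert_eq!(sevs, vec![Severity::Warning, Severity::Critical]);
}
```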
SuspectCategory
```rust
pub enum SuspectCategory {
    SlowStage,
    DataSkew,
    DataSizeSkew,
    RecordCountSkew,
    DiskSpill,
    CpuBottleneck,
    IoBottleneck,
    RecordExplosion,
    TaskFailures,
    MemoryPressure,
    ExecutorHotspot,
    TooManyPartitions,
    TooFewPartitions,
    BroadcastJoinOpportunity,
    PythonUdf,
    CacheOpportunity,
}
```
BottleneckPattern
```rust
pub enum BottleneckPattern {
    LargeScan,
    WideShuffle,
    DataExplosion,
    RecordExplosion,
}
```
Suspect
```rust
pub struct Suspect {
    pub severity: Severity,
    pub category: SuspectCategory,
    pub stage_id: i64,
    pub job_id: Option<i64>,
    pub title: String,
    pub detail: String,
    pub stage_name: Option<String>,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub io_summary: Option<String>,
    pub recommendation: Option<String>,
    pub bottleneck: Option<BottleneckPattern>,
    pub sql_plan_hint: Option<String>,
    pub estimated_savings_ms: i64,
}
```
The estimated_savings_ms field contains a heuristic estimate of time savings from fixing the issue. It is used as a secondary sort key (after severity) in aggregate_suspects.
RankedJob
Processed job data for display, sorted by duration (running first, then slowest first).
```rust
pub struct RankedJob {
    pub job_id: i64,
    pub name: String,
    pub status: String,
    pub duration_ms: Option<i64>,
    pub num_tasks: i32,
    pub num_failed_tasks: i32,
    pub sql_id: Option<i64>,
    pub sql_description: Option<String>,
    pub stage_ids: Vec<i64>,
    pub submission_time: Option<String>,
    pub sql_plan: Option<String>,
}
```
HealthSummary
```rust
pub struct HealthSummary {
    pub total_jobs: usize,
    pub running_jobs: usize,
    pub failed_jobs: usize,
    pub total_input_bytes: i64,
    pub total_output_bytes: i64,
    pub total_shuffle_bytes: i64,
    pub critical_count: usize,
    pub warning_count: usize,
    pub top_issues: Vec<String>,
}
```
Aggregates health metrics for the summary bar widget, computed by compute_health_summary in the poller.
skew/ — Skew Detection
detect_skew
```rust
pub fn detect_skew(
    tasks: &[SparkTask],
    stage_id: i64,
    job_id: Option<i64>,
    stage_name: Option<&str>,
    sql_id: Option<i64>,
    sql_description: Option<&str>,
) -> Vec<Suspect>
```
Detects all forms of skew in a stage’s tasks. Returns a Vec<Suspect> covering duration skew, data-size skew, record-count skew, and executor hotspot detection. See Understanding Analysis for threshold details.
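The core idea behind the duration-skew check can be sketched as a max/median ratio over task durations. This is illustrative only: the function name and the 4x cutoff used in the example are assumptions, and the real detector (with its actual thresholds) also covers data-size, record-count, and executor-hotspot skew.

```rust
// Illustrative duration-skew signal: how much longer does the slowest task
// run than the median task? A high ratio suggests a straggler.
fn duration_skew_ratio(mut durations: Vec<i64>) -> Option<f64> {
    if durations.len() < 2 {
        return None; // not enough tasks to compare
    }
    durations.sort_unstable();
    let median = durations[durations.len() / 2] as f64;
    let max = *durations.last().unwrap() as f64;
    (median > 0.0).then(|| max / median)
}

fn main() {
    // One 80s straggler among ~10s tasks yields a ratio well above 4.
    let ratio = duration_skew_ratio(vec![10_000, 11_000, 9_000, 80_000]).unwrap();
    assert!(ratio > 4.0);
    assert_eq!(duration_skew_ratio(vec![5_000]), None);
}
```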
suspects/ — Stage-Level Detection
Constants
| Constant | Value |
|---|---|
ONE_MB | 1,048,576 bytes |
FIFTY_MB | 52,428,800 bytes |
ONE_HUNDRED_MB | 104,857,600 bytes |
FIVE_HUNDRED_MB | 524,288,000 bytes |
ONE_GB | 1,073,741,824 bytes |
SuspectContext
Holds all lookup maps needed by suspect detectors, eliminating repetitive parameter passing.
```rust
pub struct SuspectContext<'a> {
    pub stage_to_job: &'a HashMap<i64, i64>,
    pub job_to_sql: &'a HashMap<i64, i64>,
    pub sql_descriptions: &'a HashMap<i64, String>,
    pub sql_plans: &'a HashMap<i64, String>,
}
```
Constructor:
```rust
pub fn new(
    stage_to_job: &'a HashMap<i64, i64>,
    job_to_sql: &'a HashMap<i64, i64>,
    sql_descriptions: &'a HashMap<i64, String>,
    sql_plans: &'a HashMap<i64, String>,
) -> Self
```
Methods:
| Method | Signature | Description |
|---|---|---|
job_id | (&self, stage_id: i64) -> Option<i64> | Look up the job_id for a stage |
resolve_sql | (&self, stage_id: i64) -> (Option<i64>, Option<String>) | Resolve SQL id and description for a stage via its job (private) |
resolve_plan_hint_for | (&self, stage_id: i64) -> Option<String> | Resolve top SQL plan operations for a stage (e.g., “HashAggregate → Exchange → Scan”) |
enrich | (&self, suspect: &mut Suspect, stage: &SparkStage) | Enrich a suspect with stage_name, SQL linkage, I/O summary, and plan hint |
Stage-Level Detectors
All 10 stage-level detectors share the same signature:
```rust
pub fn detect_*(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect>
```
They are dispatched via a function pointer table in the poller:
```rust
type DetectorFn = fn(&[SparkStage], &SuspectContext) -> Vec<Suspect>;

let detectors: &[DetectorFn] = &[
    detect_slow_stages,
    detect_spill,
    detect_cpu_efficiency,
    detect_record_explosion,
    detect_task_failures,
    detect_memory_pressure,
    detect_partition_count,
    detect_broadcast_join,
    detect_python_udf,
    detect_cache_opportunity,
];
```
detect_slow_stages
Flags stages with executor_run_time exceeding mean + 2*stddev (warning) or mean + 4*stddev (critical). Sets estimated_savings_ms to time above mean.
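The mean + k*stddev rule can be sketched as follows (k = 2 for warning, 4 for critical). Population standard deviation is assumed here; the real implementation may differ in detail.

```rust
// Population mean and standard deviation of a slice of runtimes.
fn mean_stddev(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

// True when `runtime` exceeds mean + k * stddev of the cohort.
fn is_outlier(runtime: f64, xs: &[f64], k: f64) -> bool {
    let (mean, sd) = mean_stddev(xs);
    runtime > mean + k * sd
}

fn main() {
    // Nine ~1s stages and one 20s stage: the straggler crosses the
    // warning threshold (k = 2) but not the critical one (k = 4).
    let mut runtimes = vec![1_000.0; 9];
    runtimes.push(20_000.0);
    assert!(is_outlier(20_000.0, &runtimes, 2.0));
    assert!(!is_outlier(20_000.0, &runtimes, 4.0));
}
```

Note that with few stages a single outlier inflates the stddev enough to hide itself, which is why the example uses ten data points.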
detect_spill
Flags stages with disk_bytes_spilled > 0 (warning) or > 1 GB (critical). Estimates ~30% of runtime as spill overhead.
detect_cpu_efficiency
Detects CPU efficiency issues. Computes cpu_ratio = (executor_cpu_time / 1_000_000) / executor_run_time. Low ratio (< 0.3, runtime > 10s) → I/O bottleneck; high ratio (> 0.9, runtime > 30s) → CPU saturated. Estimates ~20% savings.
detect_record_explosion
Detects stages where output_records > 10x input_records (with input_records > 1000). Estimates ~50% savings.
detect_task_failures
Detects stages with task failures or killed tasks. Estimates savings proportional to failure rate.
detect_memory_pressure
Detects memory pressure: memory_bytes_spilled > 50 MB but disk_bytes_spilled == 0. Estimates ~10% savings from GC overhead.
detect_partition_count
Detects partition count issues. Two sub-categories:
- TooManyPartitions: `num_tasks > 10,000` and `avg_bytes_per_task < 1 MB`. Estimates ~40% savings from scheduling overhead.
- TooFewPartitions: `num_tasks ≤ 8` and `avg_bytes_per_task > 1 GB`. Estimates ~50% savings from straggler elimination.
Recommendations include a computed target partition count for ~128 MB/partition.
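The ~128 MB/partition target reduces to a single division; the constant and function names here are illustrative, not the crate's own.

```rust
// Target partition size quoted in the recommendation text above.
const TARGET_PARTITION_BYTES: i64 = 128 * 1024 * 1024;

// Suggested partition count for a given total data size, floored at 1.
fn target_partitions(total_bytes: i64) -> i64 {
    (total_bytes / TARGET_PARTITION_BYTES).max(1)
}

fn main() {
    // 64 GB of data maps to 512 partitions of ~128 MB each.
    assert_eq!(target_partitions(64 * 1024 * 1024 * 1024), 512);
    // Tiny inputs still get at least one partition.
    assert_eq!(target_partitions(1_000), 1);
}
```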
detect_broadcast_join
Detects shuffle joins where one side is small enough to broadcast. Triggers when shuffle_write_bytes < 100 MB, executor_run_time > 5s, and the SQL plan hint contains join indicators (SortMerge, ShuffledHash, Join). Estimates ~60% savings from shuffle elimination.
detect_python_udf
Detects Python UDFs in SQL plans by searching for markers: ArrowEvalPython, BatchEvalPython, PythonUDF, PythonRunner. Escalates to Critical severity if also CPU-bound (ratio > 0.9, runtime > 30s). Estimates ~50% savings.
detect_cache_opportunity
Detects repeated computations by grouping completed stages by cleaned name. Triggers when ≥ 2 stages share a name and total runtime > 30s. Savings estimated as total_runtime - min_single_runtime.
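The grouping logic can be sketched with a `HashMap` keyed by cleaned stage name; the function and thresholds mirror the description above, but all names here are illustrative.

```rust
use std::collections::HashMap;

// For each stage name seen >= 2 times with > 30s total runtime, estimate
// savings as total runtime minus the one run you would have to keep.
fn cache_savings_ms(stages: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut groups: HashMap<String, Vec<i64>> = HashMap::new();
    for (name, runtime) in stages {
        groups.entry(name.to_string()).or_default().push(*runtime);
    }
    groups
        .into_iter()
        .filter(|(_, runs)| runs.len() >= 2 && runs.iter().sum::<i64>() > 30_000)
        .map(|(name, runs)| {
            let total: i64 = runs.iter().sum();
            let min = *runs.iter().min().unwrap();
            (name, total - min)
        })
        .collect()
}

fn main() {
    let stages = [("scan parquet", 20_000), ("scan parquet", 25_000), ("agg", 5_000)];
    let savings = cache_savings_ms(&stages);
    // Two scans totaling 45s: caching saves everything but the 20s minimum run.
    assert_eq!(savings.get("scan parquet"), Some(&25_000));
    assert!(!savings.contains_key("agg")); // only seen once
}
```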
classify_bottleneck
```rust
pub fn classify_bottleneck(s: &SparkStage) -> Option<BottleneckPattern>
```
Classifies root cause based on I/O patterns:
| Pattern | Condition |
|---|---|
| DataExplosion | input > 100 MB and output > 5x input |
| LargeScan | input > 1 GB and input > 10x (output + shuffle_write) |
| WideShuffle | shuffle_write > 500 MB or shuffle_read > input |
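The table transcribes directly into code. This sketch assumes the check order shown; the real function's precedence and field names may differ.

```rust
#[derive(Debug, PartialEq)]
enum BottleneckPattern { LargeScan, WideShuffle, DataExplosion }

// Pared-down stage I/O metrics, in bytes.
struct StageIo { input: i64, output: i64, shuffle_read: i64, shuffle_write: i64 }

const MB: i64 = 1 << 20;
const GB: i64 = 1 << 30;

fn classify(s: &StageIo) -> Option<BottleneckPattern> {
    if s.input > 100 * MB && s.output > 5 * s.input {
        Some(BottleneckPattern::DataExplosion)
    } else if s.input > GB && s.input > 10 * (s.output + s.shuffle_write) {
        Some(BottleneckPattern::LargeScan)
    } else if s.shuffle_write > 500 * MB || s.shuffle_read > s.input {
        Some(BottleneckPattern::WideShuffle)
    } else {
        None
    }
}

fn main() {
    // Reads 20 GB, emits only 100 MB: a large scan.
    let scan = StageIo { input: 20 * GB, output: 100 * MB, shuffle_read: 0, shuffle_write: 0 };
    assert_eq!(classify(&scan), Some(BottleneckPattern::LargeScan));
    // Writes 2 GB of shuffle data: a wide shuffle.
    let shuffle = StageIo { input: GB, output: GB, shuffle_read: 0, shuffle_write: 2 * GB };
    assert_eq!(classify(&shuffle), Some(BottleneckPattern::WideShuffle));
}
```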
aggregate_suspects
```rust
pub fn aggregate_suspects(mut suspects: Vec<Suspect>) -> Vec<Suspect>
```
Sorts suspects by severity (Critical first), then by estimated_savings_ms descending as a tiebreaker.
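A minimal sketch of that two-key sort, using `Reverse` so both keys order descending; suspects are reduced to `(Severity, savings)` pairs for brevity.

```rust
use std::cmp::Reverse;

// Variant order gives Critical > Warning under the derived Ord.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Severity { Warning, Critical }

// Critical before Warning, then larger estimated savings first.
fn aggregate(mut suspects: Vec<(Severity, i64)>) -> Vec<(Severity, i64)> {
    suspects.sort_by_key(|&(sev, savings)| (Reverse(sev), Reverse(savings)));
    suspects
}

fn main() {
    let sorted = aggregate(vec![
        (Severity::Warning, 90_000),
        (Severity::Critical, 10_000),
        (Severity::Critical, 40_000),
    ]);
    // A high-savings Warning still ranks below every Critical.
    assert_eq!(sorted[0], (Severity::Critical, 40_000));
    assert_eq!(sorted[2], (Severity::Warning, 90_000));
}
```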
Helper Functions
| Function | Description |
|---|---|
classify_bottleneck(s) | Classifies root-cause bottleneck pattern for a stage |
bottleneck_recommendation(b) | Returns a PySpark-specific recommendation string for a bottleneck pattern |
stage_io_summary(s) | Formats I/O metrics for a stage (including in:out ratio) |
Note: resolve_sql and resolve_plan_hint_for are now methods on SuspectContext (see above).
sql_linker/ — Cross-Reference Maps
| Function | Signature | Description |
|---|---|---|
build_job_to_sql_map | (sqls) -> HashMap<i64, i64> | Maps job_id → sql_id from SQL execution job lists |
build_stage_to_job_map | (jobs) -> HashMap<i64, i64> | Maps stage_id → job_id from job stage lists |
link_sql_to_jobs | (sqls) -> Vec<SqlJobLink> | Groups SQL executions with their job IDs |
find_sql_for_job | (job_id, ...) -> (Option<i64>, Option<String>) | Looks up SQL ID and description for a job |
stages_for_task_analysis | (stages) -> Vec<(i64, i64)> | Selects up to ~15 stages for task-level analysis using multiple heuristics (top-by-runtime, top-by-shuffle, high-parallelism) |
TUI Module
Path: src/tui/
Contains the terminal UI: app state machine, event loop, tab rendering, widgets, and theme.
Files
| File | Purpose |
|---|---|
app/ | App struct, event loop, key handling, rendering dispatch (state.rs, input.rs, render.rs) |
theme.rs | Color and style functions |
highlight.rs | SQL/plan syntax highlighting |
tabs/jobs_list.rs | Jobs table view |
tabs/job_detail.rs | Stage breakdown for a selected job |
tabs/sql_detail.rs | SQL execution plan view (scrollable) |
tabs/stage_detail.rs | Detailed stage metrics view |
tabs/suspects.rs | Suspects table view |
widgets/help.rs | Help overlay (keybinding reference + SQL recommendations) |
widgets/status_line.rs | Status bar with cluster info and last update time |
widgets/summary_bar.rs | Health summary bar (top issues, job/IO counts) |
app/ — App State
Tab
```rust
pub enum Tab {
    Jobs,
    Suspects,
}
```
Methods: next(), prev(), index(), from_index(), titles().
ViewMode
```rust
pub enum ViewMode {
    List,        // Tab-level table view
    JobDetail,   // Stage breakdown for a selected job
    SqlDetail,   // SQL execution plan (scrollable)
    StageDetail, // Detailed metrics for a selected stage
}
```
App
```rust
pub struct App {
    pub active_tab: Tab,
    pub view_mode: ViewMode,
    pub data: Option<Arc<DataPayload>>,
    pub error_msg: Option<String>,
    pub cluster_id: String,
    pub should_quit: bool,
    pub job_table_state: TableState,
    pub suspect_table_state: TableState,
    pub detail_table_state: TableState,
    pub sql_scroll_state: ScrollViewState,
    pub stage_detail_scroll_state: ScrollViewState,
    show_help: bool,
    return_tab: Option<Tab>,
    client: Arc<SparkHttpClient>,
    tx: mpsc::UnboundedSender<Action>,
    pending_task_fetches: HashSet<i64>,
}
```
Notable changes from v1:
- `data` changed from `Option<DataPayload>` to `Option<Arc<DataPayload>>` for cheaper cloning
- `sql_scroll: u16` and `stage_detail_scroll: u16` replaced with `sql_scroll_state: ScrollViewState` and `stage_detail_scroll_state: ScrollViewState` (from `tui-scrollview`)
- `show_help: bool` — toggles the help overlay
- `return_tab: Option<Tab>` — tracks which tab to return to on Esc from JobDetail (e.g., Suspects tab when entering via a suspect)
Key methods:
| Method | Description |
|---|---|
new(cluster_id, client, tx) | Creates a new App with initial state |
run(&mut self, terminal, rx) | Main event loop — receives Actions from the channel, handles keys, re-renders |
handle_key(key_event) | Processes keyboard input based on current ViewMode |
handle_action(action) | Processes DataUpdate, FetchError, Key, Mouse, Resize actions |
handle_escape() | Goes back one level; respects return_tab |
handle_enter() | Drills into the selected item |
handle_navigation_down() | Moves selection/scroll down based on ViewMode |
handle_navigation_up() | Moves selection/scroll up based on ViewMode |
handle_navigation_home() | Jumps to top |
handle_navigation_end() | Jumps to bottom |
open_sql_detail() | Opens SQL detail for the current job |
enter_suspect_job() | Navigates from Suspects tab to a suspect’s job detail, setting return_tab |
enter_stage_detail() | Navigates to stage detail from job detail |
trigger_task_fetch(stage_id) | Triggers on-demand task fetch for a stage not already in stage_tasks |
render(frame) | Renders the current state to the terminal |
render_tab_bar(frame, area) | Renders the tab header |
render_content(frame, area) | Renders the main content area for the current ViewMode |
render_status_bar(frame, area) | Renders the bottom status bar with context-sensitive hint strings |
theme.rs — Styles
Pure functions that return ratatui::style::Style:
| Function | Usage |
|---|---|
critical() | Red — critical severity |
warning() | Yellow — warning severity |
healthy() | Green — success/healthy |
running() | Yellow — running status |
failed() | Red — failed status |
muted() | Gray — secondary text |
selected() | Cyan — selected row |
tab_active() | Active tab style |
tab_inactive() | Inactive tab style |
status_bar() | Status bar background |
severity_style(severity) | Maps Severity to style |
job_status_style(status) | Maps job status string to style |
metric_bytes_style(bytes) | Color-codes byte counts by size |
shuffle_bytes_style(bytes) | Color-codes shuffle bytes |
spill_bytes_style(bytes) | Color-codes spill bytes |
cpu_utilization_style(ratio) | Color-codes CPU utilization: ≥0.95 red (saturated), ≥0.5 green (healthy), ≥0.3 yellow (underutilized), <0.3 red (I/O bound) |
memory_utilization_style(peak, cluster_mem) | Color-codes peak memory relative to cluster total; falls back to absolute thresholds when cluster data unavailable |
Size thresholds for byte styling: MB = 1_048_576, GB = 1_073_741_824.
tabs/jobs_list.rs — Jobs List
| Function | Description |
|---|---|
format_submission_time(time) | Formats submission time for display |
render_jobs_tab(frame, area, app) | Renders the jobs table with columns: ID, Status, Duration, Tasks, Failed, SQL, Submitted |
tabs/job_detail.rs — Job Detail
| Function | Description |
|---|---|
render_job_detail(frame, area, job, stages, sql_executions, stage_state, critical_stages) | Renders stage breakdown table for a job. Critical path stages are annotated with “CP” |
tabs/sql_detail.rs — SQL Detail
| Function | Description |
|---|---|
render_sql_detail(frame, area, job, scroll_state, suspects) | Renders scrollable SQL execution plan text with syntax highlighting. Uses ScrollViewState from tui-scrollview |
tabs/stage_detail.rs — Stage Detail
| Function | Description |
|---|---|
render_stage_header(frame, area, stage, total_cluster_memory, sql_hint) | Renders stage header with I/O metrics, CPU %, and optional SQL plan hint |
render_duration_histogram(frame, area, tasks) | Renders task duration histogram |
render_executor_breakdown(frame, area, tasks) | Renders per-executor breakdown table |
render_peak_memory_section(frame, area, tasks, total_cluster_memory) | Renders peak memory per task section |
render_skew_metrics(frame, area, tasks) | Renders skew metrics (CV, max/median ratio) |
render_stage_detail(frame, area, stage, tasks, loading, scroll_state, total_cluster_memory, sql_hint) | Renders full stage detail using ScrollViewState. Composes stage_header, I/O metrics, duration histogram, executor breakdown, peak memory, and skew metrics |
tabs/suspects.rs — Suspects Tab
| Function | Description |
|---|---|
render_suspects_tab(frame, area, suspects, state, critical_stages) | Renders suspects table with columns: Severity, Category, Stage, Job, Title, Detail, Recommendation. Table title: "Suspects (severity → savings)" |
highlight.rs — Syntax Highlighting
| Function | Description |
|---|---|
highlight_sql(text) | Applies syntax highlighting to SQL query text for display in the SQL detail view |
highlight_spark_plan(text) | Applies syntax highlighting to Spark physical plan text |
widgets/help.rs
| Function | Description |
|---|---|
centered_rect(area, percent_x, percent_y) | Computes a centered rectangle within an area (private helper) |
render_help_overlay(frame, area) | Renders a general keybinding reference overlay |
render_sql_help_overlay(frame, area, job, suspects) | Renders PySpark-specific recommendations for suspects related to the current SQL execution |
widgets/status_line.rs
| Function | Description |
|---|---|
render_status_line(frame, area, app) | Renders the bottom status bar showing cluster ID, app ID, and last update time |
widgets/summary_bar.rs
| Function | Description |
|---|---|
render_summary_bar(frame, area, summary) | Renders 2-line health summary with colored foreground text (red=critical, yellow=warning, green=healthy) showing job/IO counts and top issues |
Utilities Module
Path: src/util/
Helper functions for formatting values and parsing timestamps.
Files
| File | Purpose |
|---|---|
format.rs | Human-readable formatting for durations, bytes, strings, and SQL plans |
time.rs | Spark timestamp parsing and duration calculation |
format.rs — Formatting
| Function | Signature | Description |
|---|---|---|
| format_duration_ms | (ms: i64) -> String | Formats milliseconds as a human-readable duration (e.g., 1h 23m 45s, 500ms, 2.3s) |
| format_bytes | (bytes: i64) -> String | Formats byte counts with an appropriate unit (e.g., 1.5 GB, 256 MB, 1.2 KB) |
| format_bytes_or_dash | (bytes: i64) -> String | Like format_bytes, but returns "-" if bytes is zero |
| format_records | (records: i64) -> String | Formats record counts with K/M/B suffixes |
| percentile | (sorted: &[f64], p: f64) -> f64 | Computes a percentile from a sorted slice |
| truncate | (s: &str, max: usize) -> String | Truncates a string to max characters, appending ... if truncated |
| sanitize_for_span | (s: &str) -> String | Replaces embedded newlines, carriage returns, and tabs with spaces. Ratatui’s Line/Span types expect no embedded newlines — they corrupt the differential renderer’s cursor position tracking |
| clean_stage_name | (name: &str) -> String | Removes Spark Connect prefixes and UUID suffixes from stage names for cleaner display |
| parse_plan_top_operations | (plan: &str, limit: usize) -> Vec<String> | Extracts the top N operations from a Spark SQL physical plan |
Examples
format_duration_ms(3_723_000)  // "1h 2m 3s"
format_duration_ms(500)        // "500ms"
format_duration_ms(2_300)      // "2.3s"
format_bytes(1_610_612_736)    // "1.5 GB"
format_bytes(1_048_576)        // "1.0 MB"
truncate("hello world", 5)     // "hello..."
clean_stage_name("spark-connect-UUID:stage_name") // "stage_name"
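The percentile helper is not covered by the examples above. As a minimal sketch, one common approach is linear interpolation between the two nearest ranks (an assumption; the actual implementation in format.rs may differ):

```rust
/// Hypothetical sketch of a percentile over an already-sorted slice,
/// using linear interpolation between the two nearest ranks.
/// The real implementation in util/format.rs may differ.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    if sorted.is_empty() {
        return 0.0;
    }
    let rank = (p / 100.0) * (sorted.len() - 1) as f64;
    let (lo, hi) = (rank.floor() as usize, rank.ceil() as usize);
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

fn main() {
    let task_durations = [100.0, 120.0, 130.0, 900.0]; // sorted task times (ms)
    // A p99 far above the median is the kind of signal skew detection looks for.
    println!("p50 = {}", percentile(&task_durations, 50.0));
    println!("p99 = {}", percentile(&task_durations, 99.0));
}
```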
time.rs — Time Utilities
| Function | Signature | Description |
|---|---|---|
| parse_spark_timestamp | (s: &str) -> Option<DateTime<Utc>> | Parses Spark timestamps in multiple formats: RFC3339, naive datetime, and GMT-suffix |
| duration_between | (start: Option<&str>, end: Option<&str>) -> Option<i64> | Computes the duration in milliseconds between two optional timestamp strings |
Supported Timestamp Formats
- RFC3339: 2024-01-15T10:30:00.000Z
- Naive: 2024-01-15T10:30:00.000
- GMT suffix: 2024-01-15T10:30:00.000GMT
Examples
parse_spark_timestamp("2024-01-15T10:30:00.000GMT") // Some(DateTime<Utc>)
duration_between(
    Some("2024-01-15T10:30:00.000GMT"),
    Some("2024-01-15T10:31:00.000GMT"),
) // Some(60_000)
duration_between(Some("..."), None) // None
Troubleshooting
Common Errors
spark-tui maps HTTP status codes from the Spark REST API to user-friendly error messages:
| Error | HTTP Status | Meaning | Solution |
|---|---|---|---|
| Unauthorized | 401 | Token expired or invalid | Regenerate your token at Databricks Settings > Developer > Access Tokens |
| Forbidden | 403 | Insufficient permissions | Check that your token has access to the specified cluster |
| Not Found | 404 | Spark UI not available | The Spark application may have ended. Start a new Spark session or check the application ID |
| Service Unavailable | 503 | Cluster not reachable | spark-tui will automatically check cluster state and attempt to load historical data if the cluster is terminated. If automatic fallback fails, verify the cluster is running or provide --event-log-path / --sparkui-cookie |
| No Applications | — | No Spark apps on cluster | Ensure a Spark session is active on the cluster (e.g., run a notebook or submit a job) |
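For illustration, the mapping in the table can be pictured as a simple match on the status code (a hypothetical sketch; spark-tui's actual error type is derived with thiserror and carries more context):

```rust
/// Hypothetical sketch of the status-code-to-message mapping above.
/// The real error enum in spark-tui (built with thiserror) differs in shape.
fn describe_status(status: u16) -> &'static str {
    match status {
        401 => "Unauthorized: token expired or invalid; regenerate it in Databricks Settings",
        403 => "Forbidden: token lacks access to the specified cluster",
        404 => "Not Found: Spark UI unavailable; the application may have ended",
        503 => "Service Unavailable: cluster not reachable",
        _ => "Unexpected HTTP status from the Spark REST API",
    }
}

fn main() {
    for code in [401, 403, 404, 503, 418] {
        println!("{code}: {}", describe_status(code));
    }
}
```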
Configuration Errors
“Missing ‘host’ / ‘token’ / ‘cluster_id’”
spark-tui couldn’t find all three required fields. Check that you’ve provided them via CLI flags, environment variables, or ~/.databrickscfg. See Configuration for details.
“Profile ‘xyz’ not found in ~/.databrickscfg”
The --profile flag specifies a section name that doesn’t exist in your ~/.databrickscfg file. The error message lists available profiles.
Auto-detection fails
When no --profile is specified, spark-tui looks for the first profile in ~/.databrickscfg that has all three required fields (host, token, cluster_id). If no profile is complete, you’ll get a missing fields error.
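The selection rule can be sketched as follows (hypothetical; the real resolution in src/config.rs also merges CLI flags and environment variables):

```rust
use std::collections::HashMap;

/// Hypothetical sketch of profile auto-detection: pick the first profile
/// that defines host, token, and cluster_id.
fn first_complete_profile(
    profiles: &[(String, HashMap<String, String>)],
) -> Option<&str> {
    profiles.iter().find_map(|(name, fields)| {
        let complete = ["host", "token", "cluster_id"]
            .iter()
            .all(|key| fields.contains_key(*key));
        complete.then_some(name.as_str())
    })
}

fn main() {
    // "DEFAULT" lacks token and cluster_id, so "dev" is selected.
    let incomplete = HashMap::from([("host".to_string(), "adb-1.azuredatabricks.net".to_string())]);
    let complete = HashMap::from([
        ("host".to_string(), "adb-2.azuredatabricks.net".to_string()),
        ("token".to_string(), "dapi-example".to_string()),
        ("cluster_id".to_string(), "0123-456789-abcdef".to_string()),
    ]);
    let profiles = vec![("DEFAULT".to_string(), incomplete), ("dev".to_string(), complete)];
    println!("{:?}", first_complete_profile(&profiles));
}
```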
Connection Issues
Timeout or no response
- Verify the cluster is in a Running state in Databricks
- Check that --host matches your workspace URL (e.g., adb-1234567890.azuredatabricks.net)
- Ensure --cluster-id is correct (find it in the cluster’s URL or configuration page)
TLS errors
spark-tui uses rustls for TLS. If you’re behind a corporate proxy with custom CA certificates, you may need to set the SSL_CERT_FILE or SSL_CERT_DIR environment variables.
Log File
spark-tui writes logs to /tmp/spark-tui.log. To increase verbosity:
RUST_LOG=debug spark-tui --host ... --token ... --cluster-id ...
Available log levels: error, warn (default), info, debug, trace.
Check the log file for detailed error information:
tail -f /tmp/spark-tui.log
Terminal Issues
Display is corrupted after a crash
If spark-tui exits abnormally (e.g., killed by a signal), the terminal may remain in raw mode. Reset it with:
reset
spark-tui installs a panic hook that attempts to restore the terminal on panic, but external signals bypass this.
Colors look wrong
spark-tui uses 256-color mode via ratatui/crossterm. Ensure your terminal emulator supports 256 colors and that TERM is set correctly (e.g., xterm-256color).
SQL Rendering Artifacts
If SQL plan text appears corrupted or causes display glitches, this is likely caused by raw newlines embedded in SQL text. spark-tui sanitizes these via sanitize_for_span() in util/format.rs, which replaces embedded \n, \r, and \t characters with spaces before passing text to ratatui’s Line/Span types. Ratatui’s differential renderer tracks cursor positions per line, so embedded newlines corrupt its state.
If you encounter rendering artifacts, check whether the SQL text contains unusual control characters and file an issue.
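For reference, the sanitization amounts to a simple character replacement (a sketch consistent with the description of sanitize_for_span above; the actual code in util/format.rs may differ in detail):

```rust
/// Sketch of sanitize_for_span: replace control characters that would
/// corrupt ratatui's per-line cursor tracking with plain spaces.
fn sanitize_for_span(s: &str) -> String {
    s.chars()
        .map(|c| if matches!(c, '\n' | '\r' | '\t') { ' ' } else { c })
        .collect()
}

fn main() {
    let plan = "Exchange hashpartitioning\n+- Scan parquet\t[id=42]";
    // Safe to hand to a ratatui Span: no embedded newlines or tabs remain.
    println!("{}", sanitize_for_span(plan));
}
```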
Historical Mode
Spark UI shows “loading” but never becomes ready
The Historical Spark UI needs to download and parse event logs from DBFS, which can take a while for large applications. spark-tui retries with backoff for ~53 seconds. If it still doesn’t become ready:
- Try opening the Spark UI in your browser first to trigger the warm-up
- Check the log file (/tmp/spark-tui.log) for the exact URL being probed
- The event log download may take longer than 53 seconds for very large applications — try again after waiting
Historical data loads but is incomplete
When using historical mode, some data may not be available:
- Executor metrics are not available after termination (cluster resources will show as default/zero)
- Real-time task data is replaced by complete post-mortem task data
- SQL plan descriptions may be less detailed depending on the data source
Cookie authentication fails
If --sparkui-cookie doesn’t work:
- Verify the cookie is from the correct domain (adb-dp-*, not adb-*)
- Cookies expire — regenerate by visiting the Spark UI in your browser
- Check /tmp/spark-tui.log for the HTTP status code returned by cookie probes
- The cookie value should be the JWT-like string from DATAPLANE_DOMAIN_DBAUTH, not the entire cookie header
All historical strategies fail
If spark-tui reports “Could not load historical data”, check:
- Cluster log delivery — is it configured? (Cluster settings > Logging)
- DBFS permissions — does your token have access to read DBFS paths?
- Event log path — try specifying it explicitly with --event-log-path
- Spark UI cookie — try providing --sparkui-cookie (see Configuration)
Enable debug logging to see which strategies were attempted:
RUST_LOG=debug spark-tui --cluster-id ...
Deserialization Errors
If the log shows deserialization errors, the Spark API may have returned an unexpected response format. This can happen with:
- Very old or very new Databricks Runtime versions
- Custom Spark configurations that alter the REST API response
File an issue with the error message and your Databricks Runtime version.
Contributing
Development Setup
Prerequisites
- Rust 1.85+ (edition 2024)
- A Databricks workspace for testing (optional for code changes, required for integration testing)
Clone and build
git clone https://github.com/tadeasf/spark-tui.git
cd spark-tui
cargo build
Run locally
cargo run -- --host adb-123.azuredatabricks.net --token dapi... --cluster-id 0123-...
Project Structure
src/
├── main.rs Entry point
├── config.rs CLI args + config resolution
├── fetch/ HTTP client, API types, polling
├── analyze/ Suspect detection and classification
├── tui/ App state, rendering, widgets
└── util/ Formatting and time utilities
See Architecture for a detailed breakdown.
Testing
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run a specific test module
cargo test config::tests
cargo test analyze::skew::tests
cargo test analyze::suspects::tests
The test suite covers:
- Config parsing (~/.databrickscfg format, profile detection, URL normalization)
- Skew detection (uniform tasks, warning-level skew, critical skew)
- Suspect detection (slow stages, spill, bottleneck classification)
- SQL linking (job-to-SQL mapping)
- Formatting utilities (duration, bytes, truncation, stage name cleaning)
- Time parsing (RFC3339, naive, GMT suffix formats)
Code Style
- Edition 2024 — use current Rust idioms
- Error handling — use thiserror for error types, Result for fallible operations
- Formatting — run cargo fmt before committing
- Linting — run cargo clippy and address warnings
cargo fmt --check
cargo clippy -- -D warnings
Dependencies
| Crate | Purpose |
|---|---|
| clap | CLI argument parsing with env var fallback |
| tokio | Async runtime (macros, rt-multi-thread, time, sync features) |
| reqwest | HTTP client (with rustls-tls, no default features) |
| serde / serde_json | JSON deserialization |
| thiserror | Error type derivation |
| ratatui | Terminal UI framework (unstable-rendered-line-info feature) |
| crossterm | Terminal backend |
| tracing / tracing-subscriber | Structured logging (with env-filter) |
| chrono | Timestamp parsing |
| syntect / syntect-tui | SQL syntax highlighting |
| tui-scrollview | Smooth scrollable views for detail panels |
CI/CD
The project uses four GitHub Actions workflows:
- ci.yml — runs cargo fmt --check, cargo clippy, and cargo test on every push and PR
- docs.yml — builds and deploys mdbook documentation to GitHub Pages
- auto-tag.yml — creates a vX.Y.Z git tag when the Cargo.toml version changes on master
- release.yml — triggered by v* tags; builds cross-platform release binaries (Linux x86_64, macOS x86_64 + aarch64, Windows x86_64) and creates a GitHub Release with artifacts
Conventions
- Keep analysis logic in analyze/, not in the TUI layer
- Keep API types in fetch/types.rs, not scattered across modules
- Format functions go in util/format.rs
- Each suspect detector is a pure function: (&[SparkStage], &SuspectContext) -> Vec<Suspect>
- The poller is the only place where API calls and analysis are composed together
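To illustrate the detector convention, a hypothetical skeleton (the struct definitions here are simplified stand-ins, not the real types from fetch/types.rs and analyze/):

```rust
// Simplified stand-ins for the real types; the actual fields differ.
struct SparkStage {
    stage_id: i64,
    disk_spill_bytes: i64,
}
struct SuspectContext {
    spill_threshold_bytes: i64,
}
#[derive(Debug)]
struct Suspect {
    stage_id: i64,
    description: String,
}

/// A detector is a pure function over fetched stages: no I/O, no shared
/// state, so it is trivially unit-testable.
fn detect_spill(stages: &[SparkStage], ctx: &SuspectContext) -> Vec<Suspect> {
    stages
        .iter()
        .filter(|s| s.disk_spill_bytes > ctx.spill_threshold_bytes)
        .map(|s| Suspect {
            stage_id: s.stage_id,
            description: format!("stage {} spilled to disk", s.stage_id),
        })
        .collect()
}

fn main() {
    let stages = vec![
        SparkStage { stage_id: 1, disk_spill_bytes: 0 },
        SparkStage { stage_id: 2, disk_spill_bytes: 5_000_000_000 },
    ];
    let ctx = SuspectContext { spill_threshold_bytes: 1_000_000_000 };
    println!("{:?}", detect_spill(&stages, &ctx));
}
```

Keeping detectors pure also keeps the poller the sole composition point, as the conventions above require.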