Data Model

Electoral Periods

The Czech Chamber of Deputies operates in electoral periods. Each period has a number, a year identifier (used in psp.cz ZIP filenames), and an organ ID (used in the database).

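| Period | Years | Label | ZIP Year | Organ ID |
| --- | --- | --- | --- | --- |
| 10 | 2025–present | Current | 2025 | 174 |
| 9 | 2021–2025 | | 2021 | 173 |
| 8 | 2017–2021 | | 2017 | 172 |
| 7 | 2013–2017 | | 2013 | 171 |
| 6 | 2010–2013 | | 2010 | 170 |
| 5 | 2006–2010 | | 2006 | 169 |
| 4 | 2002–2006 | | 2002 | 168 |
| 3 | 1998–2002 | | 1998 | 167 |
| 2 | 1996–1998 | | 1996 | 166 |
| 1 | 1993–1996 | | 1993 | 165 |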

The organ ID mapping is critical — id_obdobi in the poslanec table uses organ IDs (165–174), not period numbers (1–10).
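A minimal sketch of the lookup, with the values copied from the table above. The names `PERIOD_TO_ORGAN`, `ORGAN_TO_PERIOD`, and `organ_id_for_period` are illustrative and may not match the project's own constants.

```python
# Period number -> organ ID, copied from the Electoral Periods table.
PERIOD_TO_ORGAN = {
    1: 165, 2: 166, 3: 167, 4: 168, 5: 169,
    6: 170, 7: 171, 8: 172, 9: 173, 10: 174,
}

# Reverse lookup for translating poslanec.id_obdobi back to a period number.
ORGAN_TO_PERIOD = {organ: period for period, organ in PERIOD_TO_ORGAN.items()}


def organ_id_for_period(period: int) -> int:
    """Return the organ ID to compare against poslanec.id_obdobi."""
    try:
        return PERIOD_TO_ORGAN[period]
    except KeyError:
        raise ValueError(f"unknown electoral period: {period}") from None
```

Filtering the poslanec table for period 9 would then compare id_obdobi against `organ_id_for_period(9)`, not against 9.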

UNL File Format

psp.cz distributes data as UNL files inside ZIP archives:

  • Encoding: Windows-1250 (Czech)
  • Delimiter: pipe |
  • Headers: none — column order defined in models/schemas.py
  • Trailing pipe: every line ends with |, producing an extra empty column (dropped during parsing)
  • Quoting: some files contain unescaped double-quotes — parsed with quote_char=None
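The rules above can be sketched with the standard library alone. This is an illustration of the format, not the project's parser (which, judging by `quote_char=None`, likely uses a DataFrame library); the function name `parse_unl` and the caller-supplied column list are assumptions.

```python
import csv
import io


def parse_unl(raw: bytes, columns: list[str]) -> list[dict[str, str]]:
    """Parse the raw bytes of one UNL file into a list of row dicts."""
    # Files are Windows-1250 encoded; Python's codec name is "cp1250".
    text = raw.decode("cp1250")
    # QUOTE_NONE makes the reader treat double-quotes as ordinary characters.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="|",
        quoting=csv.QUOTE_NONE,
    )
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Every line ends with "|", which yields one extra empty field.
        if fields[-1] == "":
            fields = fields[:-1]
        # There is no header row, so names come from the caller.
        rows.append(dict(zip(columns, fields)))
    return rows
```

All values come back as strings; type casts are a separate step.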

Data Sources (ZIP Archives)

| Archive | URL Pattern | Contents |
| --- | --- | --- |
| hl-{year}ps.zip | /opendata/hl-{year}ps.zip | Voting data for one period |
| poslanci.zip | /opendata/poslanci.zip | MPs, persons, organs, memberships |
| schuze.zip | /opendata/schuze.zip | Sessions and agenda items |
| tisky.zip | /opendata/tisky.zip | Parliamentary prints (bills) |

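Base URL: https://www.psp.cz/eknih/cdrom/opendata

The /opendata prefix in each URL pattern is the final segment of this base URL, so poslanci.zip resolves to https://www.psp.cz/eknih/cdrom/opendata/poslanci.zip.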
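As a rough sketch, download URLs can be built by appending the archive name to the base URL. This assumes the /opendata prefix in the patterns is the same path segment that ends the base URL; the helper names are hypothetical.

```python
BASE_URL = "https://www.psp.cz/eknih/cdrom/opendata"

# Archives that are shared across all electoral periods.
SHARED_ARCHIVES = ("poslanci.zip", "schuze.zip", "tisky.zip")


def voting_archive_url(zip_year: int) -> str:
    """URL of the per-period voting archive; zip_year is the 'ZIP Year' column."""
    return f"{BASE_URL}/hl-{zip_year}ps.zip"


def shared_archive_url(name: str) -> str:
    """URL of one of the shared archives listed in the table."""
    if name not in SHARED_ARCHIVES:
        raise ValueError(f"unknown archive: {name}")
    return f"{BASE_URL}/{name}"
```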

Key Tables

Voting Data (per-period)

hl_hlasovani — vote summaries (hl{year}s.unl):

| Column | Type | Description |
| --- | --- | --- |
| id_hlasovani | Int64 | Unique vote ID |
| id_organ | Int32 | Organ (chamber) ID |
| schuze | Int32 | Session number |
| cislo | Int32 | Vote number within session |
| bod | Int32 | Agenda item number |
| datum | Utf8 | Date (string) |
| cas | Utf8 | Time (string) |
| pro / proti / zdrzel / nehlasoval | Int32 | Vote counts |
| prihlaseno | Int32 | MPs registered |
| kvorum | Int32 | Quorum required |
| vysledek | Utf8 | Outcome code (see below) |
| nazev_dlouhy / nazev_kratky | Utf8 | Vote description (long/short) |

hl_poslanec — individual MP votes (hl{year}hN.unl, multiple files per period):

| Column | Type | Description |
| --- | --- | --- |
| id_poslanec | Int64 | MP identifier |
| id_hlasovani | Int64 | Vote ID (FK to hl_hlasovani) |
| vysledek | Utf8 | Vote result code (see below) |

zmatecne — void vote IDs (hl{year}z.unl):

| Column | Type | Description |
| --- | --- | --- |
| id_hlasovani | Int64 | ID of a void vote |

Shared Tables

osoby — persons: id_osoba, pred (title before), prijmeni (surname), jmeno (first name), za (title after), narozeni, pohlavi, zmena, umrti

poslanec — MP records: id_poslanec, id_osoba (FK), id_kraj, id_kandidatka, id_obdobi (organ ID, not period number), web, contact fields, foto, facebook

organy — organs/organizations: id_organ, organ_id_organ (parent), id_typ_organu (1 = parliamentary club), zkratka (abbreviation), nazev_organu_cz/en, date range, priorita

zarazeni — memberships: id_osoba (FK), id_of (organ FK), cl_funkce, od_o/do_o (membership dates), od_f/do_f (function dates)

schuze — sessions: id_schuze, id_org, schuze (session number), od_schuze/do_schuze, aktualizace

bod_schuze — agenda items: id_bod, id_schuze (FK), id_tisk (FK to tisky), bod (item number), uplny_naz (full name)

tisky — parliamentary prints: id_tisk, ct (print number), nazev_tisku, datum_doruceni, id_obdobi, and more

Tisk Enrichment Data

The background tisk pipeline produces the data described below and stores it in the cache directory.

TiskInfo Model

The TiskInfo dataclass (models/tisk_models.py) holds enriched data for each parliamentary print:

| Field | Type | Description |
| --- | --- | --- |
| ct | int | Print number |
| nazev | str | Print name |
| url | str | psp.cz URL |
| topics | list[str] | Topic labels (from LLM or keyword classification) |
| topics_en | list[str] | English topic labels (from LLM or keyword classification) |
| summary | str | Czech AI summary |
| summary_en | str | English AI summary |
| has_text | bool | Whether extracted PDF text exists |
| sub_versions | list[dict] | Sub-tisk versions with diff summaries |
| law_changes | list[str] | Laws changed by this print |
| history | TiskHistory | Legislative process timeline |
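For orientation, the table translates into roughly the following dataclass. This is a reconstruction from the field list, not a copy of models/tisk_models.py: the default values, the optional `history`, and the contents of the `TiskHistory` placeholder are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TiskHistory:
    """Placeholder only: the real fields are not described in this document."""
    stages: list[dict] = field(default_factory=list)


@dataclass
class TiskInfo:
    ct: int                       # print number
    nazev: str                    # print name
    url: str                      # psp.cz URL
    topics: list[str] = field(default_factory=list)
    topics_en: list[str] = field(default_factory=list)
    summary: str = ""             # Czech AI summary
    summary_en: str = ""          # English AI summary
    has_text: bool = False        # whether extracted PDF text exists
    sub_versions: list[dict] = field(default_factory=list)
    law_changes: list[str] = field(default_factory=list)
    history: Optional[TiskHistory] = None  # assumed optional in this sketch
```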

PDF Text Cache

Extracted plain text from parliamentary print PDFs:

~/.cache/pspcz-analyzer/psp/tisky_text/{period}/{ct}.txt

Topic Classifications

Per-period topic classification stored as Parquet files:

~/.cache/pspcz-analyzer/psp/tisky_meta/{period}/topic_classifications.parquet

Columns: ct (print number), topic (serialized Czech topic labels), topic_en (serialized English topic labels), summary (Czech), summary_en (English), source (classification method).

Topics are assigned either by keyword matching (topic_service.py) or by LLM classification (ollama_service.py). The LLM results take priority when available.

AI Summaries

Per-tisk summaries generated by Ollama in both Czech and English, stored in the topic classifications Parquet cache alongside topic data.

Version Diff Summaries

LLM-generated comparison summaries between sub-versions of a parliamentary print:

~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}.txt      # Czech
~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}_en.txt   # English
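The naming scheme above could be resolved with a small helper like the one below. The function name is hypothetical, and `period` is passed through unchanged because this document does not say which identifier fills the `{period}` placeholder.

```python
from pathlib import Path


def version_diff_paths(cache_root: Path, period, ct: int, sub_ct: int) -> tuple[Path, Path]:
    """Return (Czech, English) diff summary paths for one sub-version."""
    base = cache_root / "tisky_version_diffs" / str(period)
    czech = base / f"{ct}_{sub_ct}.txt"
    english = base / f"{ct}_{sub_ct}_en.txt"
    return czech, english
```

With the default cache location, `cache_root` would be `Path("~/.cache/pspcz-analyzer/psp").expanduser()`.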

Legislative Histories

Scraped from psp.cz HTML, stored as JSON:

~/.cache/pspcz-analyzer/psp/tisky_historie/{period}/{ct}.json

Contains the full legislative process timeline (readings, committee reports, Senate, President).

Configuration

All configuration is via environment variables, loaded from .env by python-dotenv. Constants are defined in config.py.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PSPCZ_CACHE_DIR | ~/.cache/pspcz-analyzer/psp | Root cache directory for all data |
| PSPCZ_DEV | 1 | 1 for hot reload (dev), 0 for production |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_API_KEY | (empty) | Bearer token for remote HTTPS Ollama |
| OLLAMA_MODEL | qwen3:8b | Model for inference |
| DAILY_REFRESH_ENABLED | 1 | 1 to enable daily data refresh, 0 to disable |
| DAILY_REFRESH_HOUR | 3 | Hour (CET, 0-23) at which the daily refresh runs |
| GITHUB_FEEDBACK_ENABLED | 0 | Enable user feedback via GitHub Issues |
| GITHUB_FEEDBACK_TOKEN | (empty) | GitHub PAT with public_repo scope |
| GITHUB_FEEDBACK_REPO | tadeasf/pspcz_analyzer | Repository for feedback issues |
| GITHUB_FEEDBACK_LABELS | user-feedback | Labels applied to feedback issues |
| TISK_SHORTENER | 0 | Truncate tisk text for LLM (0 = full, 1 = truncate) |
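The sketch below shows one way to read a few of these variables with their documented defaults. It is not the project's config.py: the `Settings` class, the `load_settings` name, and the range check are illustrative, and the .env loading step is left out so the example needs only the standard library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    cache_dir: str
    dev: bool
    ollama_base_url: str
    ollama_model: str
    daily_refresh_enabled: bool
    daily_refresh_hour: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    hour = int(env.get("DAILY_REFRESH_HOUR", "3"))
    if not 0 <= hour <= 23:
        raise ValueError(f"DAILY_REFRESH_HOUR must be 0-23, got {hour}")
    return Settings(
        cache_dir=os.path.expanduser(
            env.get("PSPCZ_CACHE_DIR", "~/.cache/pspcz-analyzer/psp")
        ),
        # Flags are documented as "1" (on) or "0" (off).
        dev=env.get("PSPCZ_DEV", "1") == "1",
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "qwen3:8b"),
        daily_refresh_enabled=env.get("DAILY_REFRESH_ENABLED", "1") == "1",
        daily_refresh_hour=hour,
    )
```

Passing an explicit mapping, for example `load_settings({"PSPCZ_DEV": "0"})`, keeps the function easy to test without touching the real environment.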

Ollama Configuration

Additional constants in config.py (not overridable via env var):

| Constant | Default | Description |
| --- | --- | --- |
| OLLAMA_TIMEOUT | 300.0 | Per-request timeout in seconds |
| OLLAMA_HEALTH_TIMEOUT | 5.0 | Health check timeout |
| OLLAMA_MAX_TEXT_CHARS | 50000 | Max text length sent to LLM |
| OLLAMA_VERBATIM_CHARS | 40000 | Chars included verbatim (rest truncated) |

If Ollama is not running or unreachable, the system silently falls back to keyword-based classification.
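The fallback can be pictured as follows. This is a sketch of the pattern only: the two classifiers are passed in as plain callables instead of calling ollama_service.py or topic_service.py, the `"llm"` and `"keyword"` source labels are hypothetical, and treating connection failures as `OSError` is an assumption about how an unreachable server surfaces.

```python
from typing import Callable

Classifier = Callable[[str], list[str]]


def classify_with_fallback(
    text: str,
    llm_classify: Classifier,
    keyword_classify: Classifier,
) -> tuple[list[str], str]:
    """Return (topics, source), preferring the LLM result when one exists."""
    try:
        topics = llm_classify(text)
    except OSError:
        # ConnectionError and TimeoutError are OSError subclasses.
        # The failure is swallowed on purpose: the fallback is silent.
        topics = []
    if topics:
        return topics, "llm"
    return keyword_classify(text), "keyword"
```

A real HTTP client may raise its own exception types, in which case the `except` clause would need to list those instead.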

Vote Result Codes

Individual MP Votes (hl_poslanec.vysledek)

| Code | Enum | Meaning |
| --- | --- | --- |
| A | YES | Voted yes |
| B | NO | Voted no |
| C | ABSTAINED | Abstained |
| F | DID_NOT_VOTE | Registered but didn't press button |
| @ | ABSENT | Not registered in the chamber |
| M | EXCUSED | Formally excused |
| W | BEFORE_OATH | Before taking oath |
| K | ABSTAIN_ALT | Alternative abstain code |
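The table maps directly onto a Python enum. The member names and codes below are taken from the table; the class name `VoteResult` is an assumption and may differ from the project's own.

```python
from enum import Enum


class VoteResult(Enum):
    """Codes found in hl_poslanec.vysledek."""
    YES = "A"
    NO = "B"
    ABSTAINED = "C"
    DID_NOT_VOTE = "F"
    ABSENT = "@"
    EXCUSED = "M"
    BEFORE_OATH = "W"
    ABSTAIN_ALT = "K"


def parse_vote(code: str) -> VoteResult:
    """Look up a raw code; raises ValueError for codes not in the table."""
    return VoteResult(code)
```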

Vote Outcomes (hl_hlasovani.vysledek)

| Code | Meaning |
| --- | --- |
| A | Passed |
| R | Rejected |
| X | Invalid |
| Q | Invalid (variant) |
| K | Invalid (variant) |

Votes in the zmatecne table are void and are always filtered out before any analysis.
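In plain Python the filter amounts to an anti-join on id_hlasovani. The sketch below works on lists of dicts for illustration; the function name is hypothetical and the project presumably does the same on DataFrames.

```python
from typing import Iterable


def drop_void_votes(votes: Iterable[dict], void_ids: Iterable[int]) -> list[dict]:
    """Keep only rows whose id_hlasovani is not listed in zmatecne."""
    void = set(void_ids)  # set membership keeps the filter linear
    return [row for row in votes if row["id_hlasovani"] not in void]
```

The same filter applies to both hl_hlasovani and hl_poslanec, since both carry id_hlasovani.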

Caching Strategy

~/.cache/pspcz-analyzer/psp/          (or $PSPCZ_CACHE_DIR)
    raw/              # Downloaded ZIP files
    extracted/        # Extracted UNL files
    parquet/          # Parsed DataFrames cached as Parquet
    tisky_pdf/        # Downloaded parliamentary print PDFs
    tisky_text/       # Extracted plain text from PDFs
    tisky_meta/       # Topic classification + summary Parquet caches
    tisky_historie/   # Legislative history JSON files
    tisky_version_diffs/  # LLM diff summaries (Czech + English)
    tisky_related_bills/  # Related bills JSON files (from zakon.cz)

The Parquet cache uses file modification times: if the Parquet file is newer than the source UNL directory, it's loaded directly. Otherwise the UNL files are re-parsed and the Parquet is regenerated.
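The freshness rule can be sketched as a single comparison of modification times. The function name is illustrative, and treating a missing Parquet file as stale is an assumption that follows from the rule but is not stated above.

```python
from pathlib import Path


def parquet_is_fresh(parquet_path: Path, unl_dir: Path) -> bool:
    """True if the cached Parquet file is newer than the source UNL directory."""
    if not parquet_path.exists():
        return False  # nothing cached yet, so the UNL files must be parsed
    return parquet_path.stat().st_mtime > unl_dir.stat().st_mtime
```

When this returns False, the caller re-parses the UNL files and rewrites the Parquet file, which makes the cache fresh again.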

Column schemas are defined in pspcz_analyzer/models/schemas.py — each table has a *_COLUMNS list (column order) and a *_DTYPES dict (type casts).
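As an illustration of that pairing, here is what the two structures might look like for hl_poslanec. The constant names follow the `*_COLUMNS` / `*_DTYPES` pattern but are guesses, and Python's `int` and `str` stand in for the Int64 and Utf8 types listed earlier.

```python
# Column order, needed because UNL files carry no header row.
HL_POSLANEC_COLUMNS = ["id_poslanec", "id_hlasovani", "vysledek"]

# Type casts applied after parsing (stand-ins for Int64 / Utf8).
HL_POSLANEC_DTYPES = {"id_poslanec": int, "id_hlasovani": int, "vysledek": str}


def cast_row(fields: list[str], columns: list[str], dtypes: dict) -> dict:
    """Name the fields by position, then cast each one; default is str."""
    return {
        name: dtypes.get(name, str)(value)
        for name, value in zip(columns, fields)
    }
```

This sketch does not handle empty fields in integer columns, which a real loader would need to map to nulls.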