Data Model

Electoral Periods

The Czech Chamber of Deputies operates in electoral periods. Each period has a number, a year identifier (used in psp.cz ZIP filenames), and an organ ID (used in the database).

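| Period | Years | Label | ZIP Year | Organ ID |
| --- | --- | --- | --- | --- |
| 10 | 2025–present | Current | 2025 | 174 |
| 9 | 2021–2025 | | 2021 | 173 |
| 8 | 2017–2021 | | 2017 | 172 |
| 7 | 2013–2017 | | 2013 | 171 |
| 6 | 2010–2013 | | 2010 | 170 |
| 5 | 2006–2010 | | 2006 | 169 |
| 4 | 2002–2006 | | 2002 | 168 |
| 3 | 1998–2002 | | 1998 | 167 |
| 2 | 1996–1998 | | 1996 | 166 |
| 1 | 1993–1996 | | 1993 | 165 |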

The organ ID mapping is critical — id_obdobi in the poslanec table uses organ IDs (165–174), not period numbers (1–10).
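
A minimal sketch of that mapping, with the values copied from the table above. The names `PERIOD_TO_ORGAN`, `organ_id_for_period`, and `period_for_organ_id` are illustrative and are not taken from the project's code.

```python
# Period number -> organ ID, copied from the electoral periods table.
PERIOD_TO_ORGAN = {
    1: 165, 2: 166, 3: 167, 4: 168, 5: 169,
    6: 170, 7: 171, 8: 172, 9: 173, 10: 174,
}
# Reverse lookup for values found in poslanec.id_obdobi.
ORGAN_TO_PERIOD = {organ: period for period, organ in PERIOD_TO_ORGAN.items()}


def organ_id_for_period(period: int) -> int:
    """Translate a period number (1-10) to the organ ID used in the database."""
    if period not in PERIOD_TO_ORGAN:
        raise ValueError(f"unknown electoral period: {period}")
    return PERIOD_TO_ORGAN[period]


def period_for_organ_id(organ_id: int) -> int:
    """Translate an organ ID (165-174) back to its period number."""
    if organ_id not in ORGAN_TO_PERIOD:
        raise ValueError(f"unknown organ ID: {organ_id}")
    return ORGAN_TO_PERIOD[organ_id]
```

Filtering the poslanec table for period 9 would therefore compare id_obdobi against `organ_id_for_period(9)`, not against 9.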

UNL File Format

psp.cz distributes data as UNL files inside ZIP archives:

  • Encoding: Windows-1250 (Czech)
  • Delimiter: pipe |
  • Headers: none — column order defined in models/schemas.py
  • Trailing pipe: every line ends with |, producing an extra empty column (dropped during parsing)
  • Quoting: some files contain unescaped double-quotes — parsed with quote_char=None
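
The format rules above can be illustrated with a small standard-library parser. This is a sketch and not the project's parser (the Int64/Utf8 column types and the quote_char=None setting suggest the real one is DataFrame-based); the function name `parse_unl` is hypothetical.

```python
def parse_unl(raw: bytes) -> list[list[str]]:
    """Split raw UNL bytes into rows of string fields."""
    text = raw.decode("windows-1250")  # Czech legacy encoding
    rows = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line:
            continue  # skip blank lines, e.g. after the final newline
        # Plain split: double-quotes are treated as ordinary characters.
        fields = line.split("|")
        # Every line ends with "|", so the last field is an empty extra column.
        if fields[-1] == "":
            fields.pop()
        rows.append(fields)
    return rows
```

Because there is no header row, the returned fields are positional; column names have to be attached afterwards from the schema definitions.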

Data Sources (ZIP Archives)

| Archive | URL Pattern | Contents |
| --- | --- | --- |
| hl-{year}ps.zip | /opendata/hl-{year}ps.zip | Voting data for one period |
| poslanci.zip | /opendata/poslanci.zip | MPs, persons, organs, memberships |
| schuze.zip | /opendata/schuze.zip | Sessions and agenda items |
| tisky.zip | /opendata/tisky.zip | Parliamentary prints (bills) |

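Base URL: https://www.psp.cz/eknih/cdrom/opendata

The URL patterns above are paths relative to https://www.psp.cz/eknih/cdrom, so the voting archive for ZIP year 2021 is https://www.psp.cz/eknih/cdrom/opendata/hl-2021ps.zip.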
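
As a sketch of how the archive URLs are assembled, the helpers below join the base URL with the archive filenames from the table. The function names are illustrative assumptions, not the project's API.

```python
BASE_URL = "https://www.psp.cz/eknih/cdrom/opendata"

# Archives that are not split by electoral period.
SHARED_ARCHIVES = ("poslanci.zip", "schuze.zip", "tisky.zip")


def voting_archive_url(zip_year: int) -> str:
    """URL of the per-period voting archive; zip_year is the period's ZIP Year."""
    return f"{BASE_URL}/hl-{zip_year}ps.zip"


def shared_archive_urls() -> dict[str, str]:
    """Map each shared archive filename to its download URL."""
    return {name: f"{BASE_URL}/{name}" for name in SHARED_ARCHIVES}
```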

Key Tables

Voting Data (per-period)

hl_hlasovani — vote summaries (hl{year}s.unl):

| Column | Type | Description |
| --- | --- | --- |
| id_hlasovani | Int64 | Unique vote ID |
| id_organ | Int32 | Organ (chamber) ID |
| schuze | Int32 | Session number |
| cislo | Int32 | Vote number within session |
| bod | Int32 | Agenda item number |
| datum | Utf8 | Date (string) |
| cas | Utf8 | Time (string) |
| pro / proti / zdrzel / nehlasoval | Int32 | Vote counts |
| prihlaseno | Int32 | MPs registered |
| kvorum | Int32 | Quorum required |
| vysledek | Utf8 | Outcome code (see below) |
| nazev_dlouhy / nazev_kratky | Utf8 | Vote description (long/short) |

hl_poslanec — individual MP votes (hl{year}hN.unl, multiple files per period):

| Column | Type | Description |
| --- | --- | --- |
| id_poslanec | Int64 | MP identifier |
| id_hlasovani | Int64 | Vote ID (FK to hl_hlasovani) |
| vysledek | Utf8 | Vote result code (see below) |

zmatecne — void vote IDs (hl{year}z.unl):

| Column | Type | Description |
| --- | --- | --- |
| id_hlasovani | Int64 | ID of a void vote |

Shared Tables

osoby — persons: id_osoba, pred (title before), prijmeni (surname), jmeno (first name), za (title after), narozeni, pohlavi, zmena, umrti

poslanec — MP records: id_poslanec, id_osoba (FK), id_kraj, id_kandidatka, id_obdobi (organ ID, not period number), web, contact fields, foto, facebook

organy — organs/organizations: id_organ, organ_id_organ (parent), id_typ_organu (1 = parliamentary club), zkratka (abbreviation), nazev_organu_cz/en, date range, priorita

zarazeni — memberships: id_osoba (FK), id_of (organ FK), cl_funkce, od_o/do_o (membership dates), od_f/do_f (function dates)

schuze — sessions: id_schuze, id_org, schuze (session number), od_schuze/do_schuze, aktualizace

bod_schuze — agenda items: id_bod, id_schuze (FK), id_tisk (FK to tisky), bod (item number), uplny_naz (full name)

tisky — parliamentary prints: id_tisk, ct (print number), nazev_tisku, datum_doruceni, id_obdobi, and more

Tisk Enrichment Data

Data produced by the background tisk pipeline, stored in the cache directory.

TiskInfo Model

The TiskInfo dataclass (models/tisk_models.py) holds enriched data for each parliamentary print:

| Field | Type | Description |
| --- | --- | --- |
| ct | int | Print number |
| nazev | str | Print name |
| url | str | psp.cz URL |
| topics | list[str] | Topic labels (from LLM or keyword classification) |
| topics_en | list[str] | English topic labels (from LLM or keyword classification) |
| summary | str | Czech AI summary |
| summary_en | str | English AI summary |
| has_text | bool | Whether extracted PDF text exists |
| sub_versions | list[dict] | Sub-tisk versions with diff summaries |
| law_changes | list[str] | Laws changed by this print |
| history | TiskHistory | Legislative process timeline |

PDF Text Cache

Extracted plain text from parliamentary print PDFs:

~/.cache/pspcz-analyzer/psp/tisky_text/{period}/{ct}.txt

Topic Classifications

Per-period topic classification stored as Parquet files:

~/.cache/pspcz-analyzer/psp/tisky_meta/{period}/topic_classifications.parquet

Columns: ct (print number), topic (serialized Czech topic labels), topic_en (serialized English topic labels), summary (Czech), summary_en (English), source (classification method).

Topics are assigned by LLM classification via the services/llm/ package. When the LLM is unavailable, tisks remain unclassified until the pipeline is re-run with an available LLM.

AI Summaries

Per-tisk summaries in both Czech and English, generated by the configured LLM backend (Ollama by default) and stored in the topic classifications Parquet cache alongside topic data.

Version Diff Summaries

LLM-generated comparison summaries between sub-versions of a parliamentary print:

~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}.txt      # Czech
~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}_en.txt   # English

Legislative Histories

Scraped from psp.cz HTML, stored as JSON:

~/.cache/pspcz-analyzer/psp/tisky_historie/{period}/{ct}.json

Contains the full legislative process timeline (readings, committee reports, Senate, President).
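
The cache locations listed in this section follow one pattern, which the sketch below spells out with pathlib. The helper names are hypothetical; only the directory and file names come from the paths above, and `period` is passed through as whatever value the pipeline uses for the {period} path segment.

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = "~/.cache/pspcz-analyzer/psp"


def cache_root(env=None) -> Path:
    """Resolve the cache root, honouring PSPCZ_CACHE_DIR when it is set."""
    env = os.environ if env is None else env
    return Path(env.get("PSPCZ_CACHE_DIR", DEFAULT_CACHE_DIR)).expanduser()


def tisk_text_path(root: Path, period: int, ct: int) -> Path:
    """Extracted plain text of one print."""
    return root / "tisky_text" / str(period) / f"{ct}.txt"


def topic_parquet_path(root: Path, period: int) -> Path:
    """Per-period topic classification and summary cache."""
    return root / "tisky_meta" / str(period) / "topic_classifications.parquet"


def version_diff_path(root: Path, period: int, ct: int, sub_ct: int,
                      english: bool = False) -> Path:
    """Diff summary for one sub-version; the English file has an _en suffix."""
    suffix = "_en" if english else ""
    return root / "tisky_version_diffs" / str(period) / f"{ct}_{sub_ct}{suffix}.txt"


def history_path(root: Path, period: int, ct: int) -> Path:
    """Scraped legislative history of one print."""
    return root / "tisky_historie" / str(period) / f"{ct}.json"
```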

Amendment Data

Amendment voting data is computed by the amendment pipeline and cached as Parquet files.

AmendmentVote

Represents a single amendment's voting record within a third-reading bill.

| Field | Type | Description |
| --- | --- | --- |
| amendment_number | int | Amendment number within the bill |
| submitter | str | Amendment author name(s) |
| submitter_party | str \| None | Party affiliation of submitter (if resolved) |
| description | str | Amendment text / proposed change |
| result | str | Vote outcome: "A" (accepted) or "N" (rejected) |
| vote_id | int \| None | Matched id_hlasovani from hl_hlasovani |
| yes_count | int | Total YES votes |
| no_count | int | Total NO votes |
| abstained_count | int | Total abstained |
| absent_count | int | Total absent/excused |
| summary_cs | str \| None | Czech AI summary |
| summary_en | str \| None | English AI summary |

BillAmendmentData

Container for all amendments of a single bill (one session + agenda point).

| Field | Type | Description |
| --- | --- | --- |
| period | int | Electoral period number |
| schuze | int | Session number |
| bod | int | Agenda point number |
| tisk_id | int \| None | Associated parliamentary print (tisk) number |
| bill_title | str | Bill title from legislative history |
| amendments | list[AmendmentVote] | All amendments for this bill |
| coalitions | dict \| None | Coalition analysis results |

Amendment Cache

Amendment data is cached under {PSPCZ_CACHE_DIR}/amendments/:

amendments/
├── {period}/
│   ├── amendments_{period}.parquet    # Parsed amendment records
│   ├── vote_mappings_{period}.json    # Amendment → vote ID mappings
│   └── coalitions_{period}.json       # Coalition analysis results

Configuration

All configuration is via environment variables, loaded from .env by python-dotenv. Constants are defined in config.py.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PSPCZ_CACHE_DIR | ~/.cache/pspcz-analyzer/psp | Root cache directory for all data |
| PSPCZ_DEV | 1 | 1 for hot reload (dev), 0 for production |
| LLM_PROVIDER | ollama | LLM backend: ollama or openai |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_API_KEY | (empty) | Bearer token for remote HTTPS Ollama |
| OLLAMA_MODEL | qwen3:8b | Model for Ollama inference |
| OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI-compatible API endpoint |
| OPENAI_API_KEY | (empty) | API key for OpenAI-compatible backend |
| OPENAI_MODEL | gpt-4o-mini | Model for OpenAI-compatible inference |
| AI_PERIODS_LIMIT | 3 | Number of newest periods to process with AI (0 = all) |
| DAILY_REFRESH_ENABLED | 1 | 1 to enable daily data refresh, 0 to disable |
| DAILY_REFRESH_HOUR | 3 | Hour (CET, 0-23) at which the daily refresh runs |
| GITHUB_FEEDBACK_ENABLED | 0 | Enable user feedback via GitHub Issues |
| GITHUB_FEEDBACK_TOKEN | (empty) | GitHub PAT with public_repo scope |
| GITHUB_FEEDBACK_REPO | tadeasf/pspcz_analyzer | Repository for feedback issues |
| GITHUB_FEEDBACK_LABELS | user-feedback | Labels applied to feedback issues |
| TISK_SHORTENER | 0 | Truncate tisk text for LLM (0 = full, 1 = truncate) |
| LLM_STRUCTURED_OUTPUT | 1 | JSON schema structured output (0 = regex fallback) |
| LLM_EMPTY_RETRIES | 2 | Extra LLM attempts on empty free-text results |
| ADMIN_PORT | 8001 | Admin backend server port |
| ADMIN_USERNAME | admin | Admin dashboard login username |
| ADMIN_PASSWORD_HASH | (empty) | bcrypt hash of admin password |
| ADMIN_SESSION_SECRET | (auto-generated) | HMAC secret for admin session cookies |
| ADMIN_ALLOWED_IPS | 127.0.0.1,::1,172.16.0.0/12 | IP/CIDR whitelist for admin access |
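
A rough sketch of how a few of these variables might be read and applied, using only the standard library. The .env loading step is omitted, and `load_settings`, `select_ai_periods`, and `is_admin_ip_allowed` are hypothetical names; the actual logic in config.py may differ.

```python
import ipaddress
import os


def load_settings(env=None) -> dict:
    """Read a subset of the variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "cache_dir": env.get("PSPCZ_CACHE_DIR", "~/.cache/pspcz-analyzer/psp"),
        "llm_provider": env.get("LLM_PROVIDER", "ollama"),
        "ai_periods_limit": int(env.get("AI_PERIODS_LIMIT", "3")),
        "daily_refresh_enabled": env.get("DAILY_REFRESH_ENABLED", "1") == "1",
        "admin_allowed_ips": env.get("ADMIN_ALLOWED_IPS", "127.0.0.1,::1,172.16.0.0/12"),
    }


def select_ai_periods(periods: list[int], limit: int) -> list[int]:
    """Pick the newest `limit` periods; a limit of 0 means all periods."""
    newest_first = sorted(periods, reverse=True)
    return newest_first if limit == 0 else newest_first[:limit]


def is_admin_ip_allowed(client_ip: str, allowed: str) -> bool:
    """Check a client address against a comma-separated IP/CIDR list."""
    addr = ipaddress.ip_address(client_ip)
    for entry in allowed.split(","):
        # A bare address such as 127.0.0.1 becomes a single-host network.
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if addr in network:  # False when the IP versions differ
            return True
    return False
```

With the default AI_PERIODS_LIMIT of 3 and periods 1 to 10 available, this selection yields periods 10, 9, and 8.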

LLM Configuration

Additional constants in config.py (not overridable via env var):

| Constant | Default | Description |
| --- | --- | --- |
| LLM_TIMEOUT | 300.0 | Per-request timeout in seconds |
| LLM_HEALTH_TIMEOUT | 5.0 | Health check timeout |
| LLM_MAX_TEXT_CHARS | 50000 | Max text length sent to LLM |
| LLM_VERBATIM_CHARS | 40000 | Chars included verbatim (rest truncated) |

If the configured LLM is not running or unreachable, the system silently falls back to keyword-based classification.

Vote Result Codes

Individual MP Votes (hl_poslanec.vysledek)

| Code | Enum | Meaning |
| --- | --- | --- |
| A | YES | Voted yes |
| B | NO | Voted no |
| C | ABSTAINED | Abstained |
| F | DID_NOT_VOTE | Registered but didn't press button |
| @ | ABSENT | Not registered in the chamber |
| M | EXCUSED | Formally excused |
| W | BEFORE_OATH | Before taking oath |
| K | ABSTAIN_ALT | Alternative abstain code |
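
The code-to-enum mapping above could look roughly like this in Python. The member names and codes come from the table; the class name `VoteResult` and the `tally` helper are assumptions for illustration.

```python
from collections import Counter
from enum import Enum


class VoteResult(Enum):
    YES = "A"
    NO = "B"
    ABSTAINED = "C"
    DID_NOT_VOTE = "F"
    ABSENT = "@"
    EXCUSED = "M"
    BEFORE_OATH = "W"
    ABSTAIN_ALT = "K"


def tally(codes: list[str]) -> Counter:
    """Count raw hl_poslanec.vysledek codes per enum member.

    Raises ValueError for a code that is not in the table.
    """
    return Counter(VoteResult(code) for code in codes)
```

Looking a member up by value, as in `VoteResult("@")`, fails loudly on unknown codes, which helps surface unexpected values in the source data.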

Vote Outcomes (hl_hlasovani.vysledek)

| Code | Meaning |
| --- | --- |
| A | Passed |
| R | Rejected |
| X | Invalid |
| Q | Invalid (variant) |
| K | Invalid (variant) |

Votes in the zmatecne table are void and are always filtered out before any analysis.
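
A minimal sketch of that filtering step on plain Python rows, assuming each row is a dict carrying an id_hlasovani key. The function name `drop_void_votes` is hypothetical, and the project itself likely performs this on DataFrames.

```python
def drop_void_votes(rows: list[dict], void_ids: set[int]) -> list[dict]:
    """Remove rows whose id_hlasovani appears in the zmatecne table.

    Works for hl_hlasovani and hl_poslanec rows alike, since both
    carry an id_hlasovani column.
    """
    return [row for row in rows if row["id_hlasovani"] not in void_ids]
```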

Caching Strategy

~/.cache/pspcz-analyzer/psp/          (or $PSPCZ_CACHE_DIR)
    raw/              # Downloaded ZIP files
    extracted/        # Extracted UNL files
    parquet/          # Parsed DataFrames cached as Parquet
    tisky_pdf/        # Downloaded parliamentary print PDFs
    tisky_text/       # Extracted plain text from PDFs
    tisky_meta/       # Topic classification + summary Parquet caches
    tisky_historie/   # Legislative history JSON files
    tisky_version_diffs/  # LLM diff summaries (Czech + English)
    tisky_related_bills/  # Related bills JSON files (from zakon.cz)
    amendments/       # Amendment data (parsed amendments, vote mappings, coalitions) — Parquet + JSON under amendments/{period}/

The Parquet cache uses file modification times: if the Parquet file is newer than the source UNL directory, it's loaded directly. Otherwise the UNL files are re-parsed and the Parquet is regenerated.
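
The freshness rule can be sketched as follows. Comparing against the newest modification time among the directory and the files directly inside it is an assumption about what "newer than the source UNL directory" means; the function names are illustrative.

```python
from pathlib import Path


def newest_source_mtime(unl_dir: Path) -> float:
    """Latest modification time of the directory or any file directly in it."""
    mtimes = [unl_dir.stat().st_mtime]
    mtimes.extend(p.stat().st_mtime for p in unl_dir.iterdir() if p.is_file())
    return max(mtimes)


def parquet_is_fresh(parquet_path: Path, unl_dir: Path) -> bool:
    """True when the cached Parquet exists and is newer than its UNL sources."""
    if not parquet_path.exists():
        return False
    return parquet_path.stat().st_mtime > newest_source_mtime(unl_dir)
```

A caller would load the Parquet file when `parquet_is_fresh` returns True and otherwise re-parse the UNL files and rewrite the Parquet.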

Column schemas are defined in pspcz_analyzer/models/schemas.py — each table has a *_COLUMNS list (column order) and a *_DTYPES dict (type casts).
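
To illustrate the *_COLUMNS / *_DTYPES convention, here is a sketch for the hl_poslanec table. The constant names are guessed from the naming pattern, and Python's built-in `int` and `str` stand in for the Int64 and Utf8 types used in the real schemas; `apply_schema` is a hypothetical helper.

```python
# Column order for headerless UNL rows (names follow the *_COLUMNS pattern).
HL_POSLANEC_COLUMNS = ["id_poslanec", "id_hlasovani", "vysledek"]

# Type casts per column (stand-ins for Int64 / Utf8).
HL_POSLANEC_DTYPES = {"id_poslanec": int, "id_hlasovani": int, "vysledek": str}


def apply_schema(fields: list[str], columns: list[str], dtypes: dict) -> dict:
    """Name the positional fields of one row and cast them; empty fields become None."""
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return {
        name: (None if value == "" else dtypes[name](value))
        for name, value in zip(columns, fields)
    }
```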