Data Model¶

Electoral Periods¶

The Czech Chamber of Deputies operates in electoral periods. Each period has a number, a year identifier (used in psp.cz ZIP filenames), and an organ ID (used in the database).

Period	Years	Label	ZIP Year	Organ ID
10	2025–present	Current	2025	174
9	2021–2025		2021	173
8	2017–2021		2017	172
7	2013–2017		2013	171
6	2010–2013		2010	170
5	2006–2010		2006	169
4	2002–2006		2002	168
3	1998–2002		1998	167
2	1996–1998		1996	166
1	1993–1996		1993	165

The organ ID mapping is critical: id_obdobi in the poslanec table holds organ IDs (165–174), not period numbers (1–10). Filtering poslanec by a period number therefore matches no rows.
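As a minimal sketch, the table above can be kept as a lookup so that period numbers are converted before they are compared against id_obdobi. The names PERIOD_TO_ORGAN, ORGAN_TO_PERIOD, and organ_id_for_period are illustrative and are not taken from the project's code.

```python
# Period number -> organ ID, copied from the electoral periods table.
PERIOD_TO_ORGAN = {
    1: 165,
    2: 166,
    3: 167,
    4: 168,
    5: 169,
    6: 170,
    7: 171,
    8: 172,
    9: 173,
    10: 174,
}

# Reverse lookup, useful when starting from poslanec.id_obdobi.
ORGAN_TO_PERIOD = {organ: period for period, organ in PERIOD_TO_ORGAN.items()}


def organ_id_for_period(period: int) -> int:
    """Return the organ ID to compare against poslanec.id_obdobi."""
    return PERIOD_TO_ORGAN[period]
```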

UNL File Format¶

psp.cz distributes data as UNL files inside ZIP archives:

Encoding: Windows-1250 (Czech)
Delimiter: pipe |
Headers: none — column order defined in models/schemas.py
Trailing pipe: every line ends with |, producing an extra empty column (dropped during parsing)
Quoting: some files contain unescaped double-quotes — parsed with quote_char=None
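The format rules above can be illustrated with a standard-library sketch. This is not the project's parser (the quote_char=None option suggests it uses a DataFrame library); parse_unl and its columns argument are hypothetical, and csv.QUOTE_NONE stands in for disabled quote handling.

```python
import csv
import io


def parse_unl(raw: bytes, columns: list[str]) -> list[dict[str, str]]:
    """Parse UNL bytes into a list of row dicts keyed by column name."""
    # UNL files are encoded as Windows-1250.
    text = raw.decode("cp1250")
    # QUOTE_NONE: stray double-quotes are treated as ordinary characters.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter="|", quoting=csv.QUOTE_NONE
    )
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # The trailing pipe produces one extra empty field; drop it.
        if fields[-1] == "":
            fields = fields[:-1]
        rows.append(dict(zip(columns, fields)))
    return rows
```

Because the files carry no header row, the caller has to supply the column order, which the project keeps in models/schemas.py.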

Data Sources (ZIP Archives)¶

Archive	URL Pattern	Contents
`hl-{year}ps.zip`	`/opendata/hl-{year}ps.zip`	Voting data for one period
`poslanci.zip`	`/opendata/poslanci.zip`	MPs, persons, organs, memberships
`schuze.zip`	`/opendata/schuze.zip`	Sessions and agenda items
`tisky.zip`	`/opendata/tisky.zip`	Parliamentary prints (bills)

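Base URL: https://www.psp.cz/eknih/cdrom/opendata (this is the /opendata directory shown in the patterns above, so the full URL of an archive is, for example, https://www.psp.cz/eknih/cdrom/opendata/poslanci.zip)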
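A short sketch of how download URLs could be assembled. It assumes the archive file name is appended directly to the base URL; the function names are hypothetical.

```python
BASE_URL = "https://www.psp.cz/eknih/cdrom/opendata"


def voting_archive_url(zip_year: int) -> str:
    """URL of the per-period voting archive, keyed by the period's ZIP year."""
    return f"{BASE_URL}/hl-{zip_year}ps.zip"


def shared_archive_url(name: str) -> str:
    """URL of a shared archive such as 'poslanci', 'schuze' or 'tisky'."""
    return f"{BASE_URL}/{name}.zip"
```

For period 9 the ZIP year is 2021, so voting_archive_url(2021) points at hl-2021ps.zip.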

Key Tables¶

Voting Data (per-period)¶

hl_hlasovani — vote summaries (hl{year}s.unl):

Column	Type	Description
`id_hlasovani`	Int64	Unique vote ID
`id_organ`	Int32	Organ (chamber) ID
`schuze`	Int32	Session number
`cislo`	Int32	Vote number within session
`bod`	Int32	Agenda item number
`datum`	Utf8	Date (string)
`cas`	Utf8	Time (string)
`pro` / `proti` / `zdrzel` / `nehlasoval`	Int32	Vote counts
`prihlaseno`	Int32	MPs registered
`kvorum`	Int32	Quorum required
`vysledek`	Utf8	Outcome code (see below)
`nazev_dlouhy` / `nazev_kratky`	Utf8	Vote description (long/short)

hl_poslanec — individual MP votes (hl{year}hN.unl, where N numbers the files; there are multiple files per period):

Column	Type	Description
`id_poslanec`	Int64	MP identifier
`id_hlasovani`	Int64	Vote ID (FK to hl_hlasovani)
`vysledek`	Utf8	Vote result code (see below)

zmatecne — void vote IDs (hl{year}z.unl):

Column	Type	Description
`id_hlasovani`	Int64	ID of a void vote

Shared Tables¶

osoby — persons: id_osoba, pred (title before), prijmeni (surname), jmeno (first name), za (title after), narozeni, pohlavi, zmena, umrti

poslanec — MP records: id_poslanec, id_osoba (FK), id_kraj, id_kandidatka, id_obdobi (organ ID, not period number), web, contact fields, foto, facebook

organy — organs/organizations: id_organ, organ_id_organ (parent), id_typ_organu (1 = parliamentary club), zkratka (abbreviation), nazev_organu_cz/en, date range, priorita

zarazeni — memberships: id_osoba (FK), id_of (organ FK), cl_funkce, od_o/do_o (membership dates), od_f/do_f (function dates)

schuze — sessions: id_schuze, id_org, schuze (session number), od_schuze/do_schuze, aktualizace

bod_schuze — agenda items: id_bod, id_schuze (FK), id_tisk (FK to tisky), bod (item number), uplny_naz (full name)

tisky — parliamentary prints: id_tisk, ct (print number), nazev_tisku, datum_doruceni, id_obdobi, and more

Tisk Enrichment Data¶

Data produced by the background tisk pipeline, stored in the cache directory.

TiskInfo Model¶

The TiskInfo dataclass (models/tisk_models.py) holds enriched data for each parliamentary print:

Field	Type	Description
`ct`	int	Print number
`nazev`	str	Print name
`url`	str	psp.cz URL
`topics`	list[str]	Topic labels (from LLM or keyword classification)
`topics_en`	list[str]	English topic labels (from LLM or keyword classification)
`summary`	str	Czech AI summary
`summary_en`	str	English AI summary
`has_text`	bool	Whether extracted PDF text exists
`sub_versions`	list[dict]	Sub-tisk versions with diff summaries
`law_changes`	list[str]	Laws changed by this print
`history`	TiskHistory	Legislative process timeline

PDF Text Cache¶

Extracted plain text from parliamentary print PDFs:

~/.cache/pspcz-analyzer/psp/tisky_text/{period}/{ct}.txt

Topic Classifications¶

Per-period topic classification stored as Parquet files:

~/.cache/pspcz-analyzer/psp/tisky_meta/{period}/topic_classifications.parquet

Columns: ct (print number), topic (serialized Czech topic labels), topic_en (serialized English topic labels), summary (Czech), summary_en (English), source (classification method).

Topics are assigned by LLM classification via the services/llm/ package. When the LLM is unavailable, tisks remain unclassified until the pipeline is re-run with an available LLM.

AI Summaries¶

Per-tisk summaries generated by the configured LLM (Ollama by default) in both Czech and English, stored in the topic classifications Parquet cache alongside topic data.

Version Diff Summaries¶

LLM-generated comparison summaries between sub-versions of a parliamentary print:

~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}.txt      # Czech
~/.cache/pspcz-analyzer/psp/tisky_version_diffs/{period}/{ct}_{sub_ct}_en.txt   # English

Legislative Histories¶

Scraped from psp.cz HTML, stored as JSON:

~/.cache/pspcz-analyzer/psp/tisky_historie/{period}/{ct}.json

Contains the full legislative process timeline (readings, committee reports, Senate, President).
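The cache locations listed in this section follow one pattern: a subdirectory per data type, then the period, then a file named after the print number. A sketch of path helpers under that reading follows; the function names and the root parameter are hypothetical.

```python
from pathlib import Path

# Default cache root from the documentation (overridable via PSPCZ_CACHE_DIR).
DEFAULT_CACHE = Path("~/.cache/pspcz-analyzer/psp").expanduser()


def text_path(period: int, ct: int, root: Path = DEFAULT_CACHE) -> Path:
    """Extracted plain text of a print's PDF."""
    return root / "tisky_text" / str(period) / f"{ct}.txt"


def topics_path(period: int, root: Path = DEFAULT_CACHE) -> Path:
    """Per-period topic classification Parquet file."""
    return root / "tisky_meta" / str(period) / "topic_classifications.parquet"


def version_diff_path(
    period: int, ct: int, sub_ct: int, english: bool = False,
    root: Path = DEFAULT_CACHE,
) -> Path:
    """Diff summary between sub-versions; the English file has an _en suffix."""
    suffix = "_en" if english else ""
    return root / "tisky_version_diffs" / str(period) / f"{ct}_{sub_ct}{suffix}.txt"


def history_path(period: int, ct: int, root: Path = DEFAULT_CACHE) -> Path:
    """Scraped legislative history JSON."""
    return root / "tisky_historie" / str(period) / f"{ct}.json"
```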

Amendment Data¶

Amendment voting data is computed by the amendment pipeline and cached as Parquet files.

AmendmentVote¶

Represents a single amendment's voting record within a third-reading bill.

Field	Type	Description
`amendment_number`	int	Amendment number within the bill
`submitter`	str	Amendment author name(s)
`submitter_party`	str \| None	Party affiliation of submitter (if resolved)
`description`	str	Amendment text / proposed change
`result`	str	Vote outcome: "A" (accepted) or "N" (rejected)
`vote_id`	int \| None	Matched `id_hlasovani` from `hl_hlasovani`
`yes_count`	int	Total YES votes
`no_count`	int	Total NO votes
`abstained_count`	int	Total abstained
`absent_count`	int	Total absent/excused
`summary_cs`	str \| None	Czech AI summary
`summary_en`	str \| None	English AI summary

BillAmendmentData¶

Container for all amendments of a single bill (one session + agenda point).

Field	Type	Description
`period`	int	Electoral period number
`schuze`	int	Session number
`bod`	int	Agenda point number
`tisk_id`	int \| None	Associated parliamentary print (tisk) number
`bill_title`	str	Bill title from legislative history
`amendments`	list[AmendmentVote]	All amendments for this bill
`coalitions`	dict \| None	Coalition analysis results
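The two field tables can be read as the following dataclass sketch. The field names and types come from the tables; the default values and the field ordering (fields with defaults must come last) are assumptions, and the project's actual definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AmendmentVote:
    amendment_number: int
    submitter: str
    description: str
    result: str  # "A" (accepted) or "N" (rejected)
    yes_count: int
    no_count: int
    abstained_count: int
    absent_count: int
    # Optional fields; the None defaults are an assumption of this sketch.
    submitter_party: Optional[str] = None
    vote_id: Optional[int] = None  # matched id_hlasovani, if any
    summary_cs: Optional[str] = None
    summary_en: Optional[str] = None


@dataclass
class BillAmendmentData:
    period: int
    schuze: int
    bod: int
    bill_title: str
    tisk_id: Optional[int] = None
    amendments: list[AmendmentVote] = field(default_factory=list)
    coalitions: Optional[dict] = None
```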

Amendment Cache¶

Amendment data is cached under {PSPCZ_CACHE_DIR}/amendments/:

amendments/
├── {period}/
│   ├── amendments_{period}.parquet    # Parsed amendment records
│   ├── vote_mappings_{period}.json    # Amendment → vote ID mappings
│   └── coalitions_{period}.json       # Coalition analysis results

Configuration¶

All configuration is via environment variables, loaded from .env by python-dotenv. Constants are defined in config.py.

Environment Variables¶

Variable	Default	Description
`PSPCZ_CACHE_DIR`	`~/.cache/pspcz-analyzer/psp`	Root cache directory for all data
`PSPCZ_DEV`	`1`	`1` for hot reload (dev), `0` for production
`LLM_PROVIDER`	`ollama`	LLM backend: `ollama` or `openai`
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama API endpoint
`OLLAMA_API_KEY`	(empty)	Bearer token for remote HTTPS Ollama
`OLLAMA_MODEL`	`qwen3:8b`	Model for Ollama inference
`OPENAI_BASE_URL`	`https://api.openai.com/v1`	OpenAI-compatible API endpoint
`OPENAI_API_KEY`	(empty)	API key for OpenAI-compatible backend
`OPENAI_MODEL`	`gpt-4o-mini`	Model for OpenAI-compatible inference
`AI_PERIODS_LIMIT`	`3`	Number of newest periods to process with AI (0 = all)
`DAILY_REFRESH_ENABLED`	`1`	`1` to enable daily data refresh, `0` to disable
`DAILY_REFRESH_HOUR`	`3`	Hour (CET, 0-23) at which the daily refresh runs
`GITHUB_FEEDBACK_ENABLED`	`0`	Enable user feedback via GitHub Issues
`GITHUB_FEEDBACK_TOKEN`	(empty)	GitHub PAT with `public_repo` scope
`GITHUB_FEEDBACK_REPO`	`tadeasf/pspcz_analyzer`	Repository for feedback issues
`GITHUB_FEEDBACK_LABELS`	`user-feedback`	Labels applied to feedback issues
`TISK_SHORTENER`	`0`	Truncate tisk text for LLM (`0` = full, `1` = truncate)
`LLM_STRUCTURED_OUTPUT`	`1`	JSON schema structured output (`0` = regex fallback)
`LLM_EMPTY_RETRIES`	`2`	Extra LLM attempts on empty free-text results
`ADMIN_PORT`	`8001`	Admin backend server port
`ADMIN_USERNAME`	`admin`	Admin dashboard login username
`ADMIN_PASSWORD_HASH`	(empty)	bcrypt hash of admin password
`ADMIN_SESSION_SECRET`	(auto-generated)	HMAC secret for admin session cookies
`ADMIN_ALLOWED_IPS`	`127.0.0.1,::1,172.16.0.0/12`	IP/CIDR whitelist for admin access
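A hedged sketch of reading a few of these variables with their documented defaults, and of applying AI_PERIODS_LIMIT. The function names and the returned dictionary are hypothetical; the project defines its constants in config.py, whose layout is not shown here.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read selected settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "cache_dir": Path(
            env.get("PSPCZ_CACHE_DIR", "~/.cache/pspcz-analyzer/psp")
        ).expanduser(),
        "llm_provider": env.get("LLM_PROVIDER", "ollama"),
        "ai_periods_limit": int(env.get("AI_PERIODS_LIMIT", "3")),
        "daily_refresh_enabled": env.get("DAILY_REFRESH_ENABLED", "1") == "1",
    }


def periods_for_ai(all_periods: Iterable[int], limit: int) -> list[int]:
    """Pick the newest `limit` periods; a limit of 0 means all periods."""
    newest_first = sorted(all_periods, reverse=True)
    return newest_first if limit == 0 else newest_first[:limit]
```

Passing a plain dictionary instead of os.environ keeps the loader easy to exercise in tests.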

LLM Configuration¶

Additional constants in config.py (not overridable via env var):

Constant	Default	Description
`LLM_TIMEOUT`	`300.0`	Per-request timeout in seconds
`LLM_HEALTH_TIMEOUT`	`5.0`	Health check timeout
`LLM_MAX_TEXT_CHARS`	`50000`	Max text length sent to LLM
`LLM_VERBATIM_CHARS`	`40000`	Chars included verbatim (rest truncated)

If the configured LLM is not running or unreachable, the system silently falls back to keyword-based classification.

Vote Result Codes¶

Individual MP Votes (`hl_poslanec.vysledek`)¶

Code	Enum	Meaning
`A`	YES	Voted yes
`B`	NO	Voted no
`C`	ABSTAINED	Abstained
`F`	DID_NOT_VOTE	Registered but didn't press button
`@`	ABSENT	Not registered in the chamber
`M`	EXCUSED	Formally excused
`W`	BEFORE_OATH	Before taking oath
`K`	ABSTAIN_ALT	Alternative abstain code
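The code table maps naturally onto an enumeration. The member names below follow the Enum column; the class name VoteResult is an assumption, since the table does not name the class.

```python
from enum import Enum


class VoteResult(Enum):
    """Individual MP vote codes from hl_poslanec.vysledek."""

    YES = "A"
    NO = "B"
    ABSTAINED = "C"
    DID_NOT_VOTE = "F"
    ABSENT = "@"
    EXCUSED = "M"
    BEFORE_OATH = "W"
    ABSTAIN_ALT = "K"
```

Looking a raw code up by value, as in VoteResult("@"), returns the matching member and raises ValueError for a code that is not in the table.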

Vote Outcomes (`hl_hlasovani.vysledek`)¶

Code	Meaning
`A`	Passed
`R`	Rejected
`X`	Invalid
`Q`	Invalid (variant)
`K`	Invalid (variant)

Votes in the zmatecne table are void and are always filtered out before any analysis.
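A minimal sketch of that filtering step on plain Python rows. The function name and the row representation (dicts with an id_hlasovani key) are illustrative only; the project works on DataFrames.

```python
from typing import Iterable


def drop_void_votes(votes: Iterable[dict], void_ids: Iterable[int]) -> list[dict]:
    """Remove every vote whose id_hlasovani appears in the zmatecne table."""
    void = set(void_ids)  # set membership keeps the filter linear in len(votes)
    return [vote for vote in votes if vote["id_hlasovani"] not in void]
```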

Caching Strategy¶

~/.cache/pspcz-analyzer/psp/          (or $PSPCZ_CACHE_DIR)
    raw/              # Downloaded ZIP files
    extracted/        # Extracted UNL files
    parquet/          # Parsed DataFrames cached as Parquet
    tisky_pdf/        # Downloaded parliamentary print PDFs
    tisky_text/       # Extracted plain text from PDFs
    tisky_meta/       # Topic classification + summary Parquet caches
    tisky_historie/   # Legislative history JSON files
    tisky_version_diffs/  # LLM diff summaries (Czech + English)
    tisky_related_bills/  # Related bills JSON files (from zakon.cz)
    amendments/       # Amendment data (parsed amendments, vote mappings, coalitions) — Parquet + JSON under amendments/{period}/

The Parquet cache uses file modification times: if the Parquet file is newer than the source UNL directory, it's loaded directly. Otherwise the UNL files are re-parsed and the Parquet is regenerated.
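The freshness rule can be sketched as a single comparison of modification times. The function name is hypothetical, and treating a missing Parquet file as stale is an assumption of this sketch.

```python
from pathlib import Path


def parquet_is_fresh(parquet_path: Path, unl_dir: Path) -> bool:
    """Return True if the cached Parquet file can be loaded directly."""
    if not parquet_path.exists():
        return False  # nothing cached yet, so the UNL files must be parsed
    # Fresh only when the Parquet file is newer than the source UNL directory.
    return parquet_path.stat().st_mtime > unl_dir.stat().st_mtime
```

When the function returns False, the caller would re-parse the UNL files and write the Parquet file again.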

Column schemas are defined in pspcz_analyzer/models/schemas.py — each table has a *_COLUMNS list (column order) and a *_DTYPES dict (type casts).