API Reference
Auto-generated documentation from source code docstrings.
Scraper
Core
The core scraper handles fetching Reddit JSON data via a stealth browser.
python_reddit_scraper.scraper.core
Camoufox-based Reddit JSON scraper with pagination.
Navigates old.reddit.com JSON API using a stealth Firefox browser, follows pagination tokens, and returns raw post data.
scrape_subreddit(browser, subreddit, max_pages=50, delay=1.5, quiet=False)
Scrape a subreddit's posts via Reddit's old JSON API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| browser | Browser | A camoufox Browser instance (from sync context manager). | required |
| subreddit | str | Subreddit name (without r/ prefix). | required |
| max_pages | int | Maximum number of pages to fetch (100 posts per page). | 50 |
| delay | float | Seconds to wait between page requests. | 1.5 |
| quiet | bool | If True, suppress all progress output. | False |
Returns:
| Type | Description |
|---|---|
| list[dict] | List of post data dicts (the 'data' field of each child). |
Source code in src/python_reddit_scraper/scraper/core.py
Parallel
Multi-process parallel scraping of multiple subreddits.
python_reddit_scraper.scraper.parallel
Parallel multi-process scraping of multiple subreddits.
scrape_worker(subreddit, max_pages=50, delay=1.5, quiet=False, proxy=None)
Standalone scrape function for use with ProcessPoolExecutor.
Each call creates its own Camoufox browser instance (Playwright sync API is not thread-safe, so each process must have its own browser).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subreddit | str | Subreddit name (without r/ prefix). | required |
| max_pages | int | Maximum pages to fetch. | 50 |
| delay | float | Seconds between page requests. | 1.5 |
| quiet | bool | Suppress per-subreddit progress output. | False |
| proxy | dict \| None | Optional proxy dict with keys | None |
Returns:
| Type | Description |
|---|---|
| tuple[str, list[dict]] | Tuple of (subreddit_name, list_of_post_dicts). |
Source code in src/python_reddit_scraper/scraper/parallel.py
scrape_parallel(subreddits, max_pages=50, delay=1.5, max_workers=4, on_complete=None, progress=None, proxies=None)
Scrape multiple subreddits in parallel using separate processes.
Each process gets its own Camoufox browser instance. Results are returned as a dict keyed by subreddit name. An optional callback is invoked as each subreddit finishes (useful for queueing downloads).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subreddits | list[str] | List of subreddit names. | required |
| max_pages | int | Max pages per subreddit. | 50 |
| delay | float | Seconds between page requests per scraper. | 1.5 |
| max_workers | int | Maximum concurrent scraper processes. | 4 |
| on_complete | OnCompleteCallback \| None | Optional callback | None |
| progress | ProgressDisplay \| None | Optional shared :class: | None |
| proxies | list[dict] \| None | Optional list of proxy dicts from :func: | None |
Returns:
| Type | Description |
|---|---|
| dict[str, list[dict]] | Dict mapping subreddit name to its list of post dicts. |
Source code in src/python_reddit_scraper/scraper/parallel.py
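The fan-out with a completion callback can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: `fake_worker`, `run_parallel`, and the `executor_cls` parameter are hypothetical, and the callback signature `(name, posts)` is a guess based on the return shape of the worker.

```python
from concurrent.futures import (
    Executor,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)
from typing import Callable, Optional, Type


def fake_worker(subreddit: str) -> tuple[str, list[dict]]:
    """Hypothetical stand-in for a scrape worker: returns (name, posts)."""
    return subreddit, [{"subreddit": subreddit, "id": f"{subreddit}-1"}]


def run_parallel(
    subreddits: list[str],
    max_workers: int = 4,
    on_complete: Optional[Callable[[str, list[dict]], None]] = None,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> dict[str, list[dict]]:
    """Submit one task per subreddit and collect results as they finish."""
    results: dict[str, list[dict]] = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_worker, name) for name in subreddits]
        for future in as_completed(futures):
            name, posts = future.result()
            results[name] = posts
            if on_complete is not None:
                on_complete(name, posts)  # runs in the parent process
    return results


if __name__ == "__main__":
    # Threads are enough to exercise the control flow; the default uses processes.
    finished: list[str] = []
    out = run_parallel(
        ["pics", "aww"],
        max_workers=2,
        on_complete=lambda name, posts: finished.append(name),
        executor_cls=ThreadPoolExecutor,
    )
    print(sorted(out), sorted(finished))
```

Because the callback fires in the parent as each future completes, downstream work such as queueing downloads can start before the slowest subreddit finishes.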
JSON I/O
JSON file reading and writing for scraped data.
python_reddit_scraper.scraper.json_io
JSON I/O for scraped Reddit data.
save_scraped_json(posts, subreddit, output_dir='./input')
Save scraped posts to a JSON file compatible with the existing parser.
Wraps posts in Reddit's listing format so parse_json_files() can read them. Returns the path to the saved file.
Source code in src/python_reddit_scraper/scraper/json_io.py
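Wrapping posts in a listing envelope might look roughly like the sketch below. The helper names, the `<subreddit>.json` file name, and the exact set of envelope fields are assumptions for illustration; only the general `Listing` shape with `data.children[].data` comes from Reddit's JSON format.

```python
import json
import tempfile
from pathlib import Path


def wrap_as_listing(posts: list[dict]) -> dict:
    """Wrap raw post dicts in a Reddit-style Listing envelope."""
    return {
        "kind": "Listing",
        "data": {
            "children": [{"kind": "t3", "data": post} for post in posts],
            "after": None,
        },
    }


def unwrap_listing(listing: dict) -> list[dict]:
    """Inverse of wrap_as_listing: pull each child's 'data' back out."""
    return [child["data"] for child in listing["data"]["children"]]


def save_listing(posts: list[dict], subreddit: str, output_dir: str) -> Path:
    """Write the envelope to <output_dir>/<subreddit>.json (hypothetical name)."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{subreddit}.json"
    path.write_text(json.dumps(wrap_as_listing(posts), indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_listing([{"id": "abc", "title": "hello"}], "pics", tmp)
        print(unwrap_listing(json.loads(saved.read_text(encoding="utf-8"))))
```

Keeping the saved file in the same shape as a live response means one parser can read both.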
parse_json_files(input_dir)
Parse all JSON files in input directory and extract posts.
Source code in src/python_reddit_scraper/scraper/json_io.py
Downloader
Media
Media URL extraction, type detection, and filtering.
python_reddit_scraper.downloader.media
sanitize_filename(text, max_length=100)
Convert text to a safe filename.
Source code in src/python_reddit_scraper/downloader/media.py
get_file_extension(url)
Extract file extension from URL.
Source code in src/python_reddit_scraper/downloader/media.py
get_media_type(filename)
Determine media type from filename for directory sorting.
Source code in src/python_reddit_scraper/downloader/media.py
is_media_url(url)
Check if URL points to a media file.
extract_media_urls(post_data)
Extract all media URLs from a Reddit post at highest resolution.
Source code in src/python_reddit_scraper/downloader/media.py
extract_all_media(posts)
Extract all media URLs from a list of posts, deduplicating by URL.
Returns list of dicts with 'url', 'filename', and 'subreddit' keys.
Source code in src/python_reddit_scraper/downloader/media.py
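Deduplicating by URL while keeping first-seen order is a small pattern worth spelling out. The sketch below is an assumption about how such a step could work; `dedupe_by_url` is a hypothetical helper, and only the 'url', 'filename', and 'subreddit' keys come from the description above.

```python
def dedupe_by_url(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each URL, preserving input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        if item["url"] in seen:
            continue  # a cross-post or repeated gallery entry
        seen.add(item["url"])
        unique.append(item)
    return unique


if __name__ == "__main__":
    media = [
        {"url": "https://example.com/a.jpg", "filename": "a.jpg", "subreddit": "pics"},
        {"url": "https://example.com/a.jpg", "filename": "a_dup.jpg", "subreddit": "aww"},
        {"url": "https://example.com/b.mp4", "filename": "b.mp4", "subreddit": "aww"},
    ]
    print([m["filename"] for m in dedupe_by_url(media)])
```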
filter_by_media_type(downloads, media_types=None)
Filter media list by a set of allowed types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| downloads | list[dict[str, str]] | List of dicts with 'url' and 'filename' keys. | required |
| media_types | set[str] \| frozenset[str] \| None | Allowed values from | None |
Returns:
| Type | Description |
|---|---|
| list[dict[str, str]] | Filtered list. |
Source code in src/python_reddit_scraper/downloader/media.py
Engine
Concurrent file downloading with progress tracking.
python_reddit_scraper.downloader.engine
Download engine: concurrent file downloading with progress tracking.
download_file(url, filepath, *, fallback_urls=None)
Download a file from URL to filepath with retries.
Returns:
| Type | Description |
|---|---|
| bool | |
| str | reason is a short label like |
Source code in src/python_reddit_scraper/downloader/engine.py
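A retry loop with fallback URLs could be structured as below. This is a sketch under stated assumptions rather than the engine's actual logic: `fetch_with_retries`, the injected `fetch` callable, the `attempts` and `backoff` parameters, and the reason labels are all hypothetical, and the sketch returns the bytes as a third element purely so it can be exercised in memory.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_with_retries(
    url: str,
    fetch: Callable[[str], bytes],  # hypothetical: raises OSError on failure
    fallback_urls: Optional[Iterable[str]] = None,
    attempts: int = 3,
    backoff: float = 0.0,
) -> tuple[bool, str, bytes]:
    """Try url, then each fallback, retrying each up to `attempts` times."""
    last_error = "no-attempt"
    for candidate in [url, *(fallback_urls or [])]:
        for attempt in range(attempts):
            try:
                return True, "ok", fetch(candidate)
            except OSError as exc:
                last_error = type(exc).__name__  # short label for the failure
                if backoff:
                    time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return False, last_error, b""


if __name__ == "__main__":
    def flaky(candidate: str) -> bytes:
        if candidate == "primary":
            raise ConnectionError("down")
        return b"payload"

    print(fetch_with_retries("primary", flaky, fallback_urls=["mirror"], attempts=2))
```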
download_all(downloads, output_dir, workers=16, on_file_done=None, on_file_failed=None, progress=None)
Download all media files concurrently.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| downloads | list[dict[str, str]] | List of dicts with 'url', 'filename', optionally 'subreddit', 'optional', and 'audio_fallbacks' keys. | required |
| output_dir | str | Base output directory (files sorted into subdirectories). | required |
| workers | int | Number of parallel download threads. | 16 |
| on_file_done | OnFileDoneCallback \| None | Optional callback | None |
| on_file_failed | OnFileFailedCallback \| None | Optional callback | None |
| progress | ProgressDisplay \| None | Optional shared :class: | None |
Returns:
| Type | Description |
|---|---|
| int | Tuple of |
| int | is a :class: |
Source code in src/python_reddit_scraper/downloader/engine.py
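Concurrent downloading with per-file callbacks can be sketched with a thread pool. The names here (`download_many`, `fetch_one`) and the callback signatures are assumptions for illustration; the sketch only shows the counting and callback pattern, not the engine's directory sorting or progress display.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def download_many(
    downloads: list[dict],
    fetch_one: Callable[[dict], bool],  # hypothetical per-file downloader
    workers: int = 16,
    on_file_done: Optional[Callable[[dict], None]] = None,
    on_file_failed: Optional[Callable[[dict], None]] = None,
) -> tuple[int, int]:
    """Run fetch_one over all items in a thread pool; return (ok, failed)."""
    successful = failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, item): item for item in downloads}
        for future in as_completed(futures):
            item = futures[future]
            try:
                ok = bool(future.result())
            except Exception:
                ok = False  # an exception in the worker counts as a failure
            if ok:
                successful += 1
                if on_file_done is not None:
                    on_file_done(item)
            else:
                failed += 1
                if on_file_failed is not None:
                    on_file_failed(item)
    return successful, failed


if __name__ == "__main__":
    items = [{"url": "good-1"}, {"url": "bad-1"}, {"url": "good-2"}]
    print(download_many(items, lambda item: item["url"].startswith("good"), workers=2))
```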
run_download_queue(download_q, output_dir, workers, media_types, state=None, progress=None)
Consumer thread: pulls (subreddit, posts) from queue, downloads one sub at a time.
Returns cumulative (successful, failed) counts.
Source code in src/python_reddit_scraper/downloader/engine.py
Session State
The state module manages resume/session persistence.
python_reddit_scraper.downloader.state
Session state management for resume support.
Persists scraping progress and download manifests to .scraper-state/
so interrupted runs can be resumed with --resume.
SessionState
Manages persistent state for a single scrape+download session.
State is saved to a JSON file in .scraper-state/{timestamp}.json.
The file tracks which subreddits have been scraped, the full media
manifest, and which files have been successfully downloaded.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | The download output directory for this session. | required |
| media_types | set[str] \| frozenset[str] \| None | Allowed media types. | None |
| state_path | str \| None | Explicit path to a state file (used when resuming). | None |
Source code in src/python_reddit_scraper/downloader/state.py
video_only
property
Legacy boolean view — true when filter is videos + gifs only.
image_only
property
Legacy boolean view — true when filter is images only.
save()
Write current state to disk atomically.
Source code in src/python_reddit_scraper/downloader/state.py
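One common way to write a file atomically is to write to a temporary file in the same directory and then rename it over the target. Whether `save()` does exactly this is not confirmed here; `atomic_write_json` is a hypothetical helper showing the general technique.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Write payload to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # replaces the target in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave stray temp files behind
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "state.json")
        atomic_write_json(target, {"scraped": ["pics"]})
        with open(target, encoding="utf-8") as fh:
            print(json.load(fh))
```

If the process is interrupted mid-write, the previous state file stays intact, which is what makes resuming safe.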
load(path)
classmethod
Load a session state from a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the state JSON file. | required |
Returns:
| Type | Description |
|---|---|
| SessionState | A populated SessionState instance. |
Source code in src/python_reddit_scraper/downloader/state.py
find_latest()
classmethod
Find the most recent state file in the state directory.
Returns:
| Type | Description |
|---|---|
| str \| None | Path to the newest state file, or None if none exist. |
Source code in src/python_reddit_scraper/downloader/state.py
list_all()
classmethod
Return all state-file paths in .scraper-state/, newest first.
Source code in src/python_reddit_scraper/downloader/state.py
mark_subreddit_scraped(sub)
set_media_manifest(media_list)
Set the full media manifest (list of files to download).
Each item should have url, filename, subreddit keys.
A downloaded field is added and defaults to False.
Source code in src/python_reddit_scraper/downloader/state.py
mark_downloaded(url, batch_size=50)
Mark a media URL as successfully downloaded.
State is flushed to disk every batch_size completions for performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL that was downloaded. | required |
| batch_size | int | How often to flush state to disk. | 50 |
Source code in src/python_reddit_scraper/downloader/state.py
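The batching idea can be illustrated with a small counter. `BatchedMarker` and its injected `flush` callable are hypothetical; the sketch only shows how flushing every `batch_size` completions trades a little durability for far fewer disk writes.

```python
from typing import Callable


class BatchedMarker:
    """Record completed URLs and flush only every `batch_size` completions."""

    def __init__(self, flush: Callable[[], None]) -> None:
        self._flush = flush  # hypothetical: would persist state to disk
        self._pending = 0
        self.downloaded: set[str] = set()

    def mark_downloaded(self, url: str, batch_size: int = 50) -> None:
        self.downloaded.add(url)
        self._pending += 1
        if self._pending >= batch_size:
            self._flush()
            self._pending = 0  # start counting the next batch


if __name__ == "__main__":
    flushes: list[int] = []
    marker = BatchedMarker(lambda: flushes.append(len(marker.downloaded)))
    for index in range(5):
        marker.mark_downloaded(f"https://example.com/{index}.jpg", batch_size=2)
    print(flushes)
```

A final explicit flush is still needed at the end of a run, since up to `batch_size - 1` completions may be unsaved.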
mark_permanently_failed(url, reason, permanent)
Mark a media URL as permanently failed (e.g. HTTP 403/404).
These items will be skipped on future resume attempts. Only stores permanent failures; transient ones can be retried.
Source code in src/python_reddit_scraper/downloader/state.py
get_pending_media()
Get media items that have not yet been downloaded.
Also checks whether the file already exists on disk (handles the case where the file was downloaded but state wasn't saved). Permanently failed items (HTTP 403/404) are skipped.
Returns:
| Type | Description |
|---|---|
| list[dict] | List of media dicts that still need downloading. |
Source code in src/python_reddit_scraper/downloader/state.py
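The three skip conditions described above can be sketched as a single filter. This is an illustration under assumptions: `pending_media` is a hypothetical function, and it looks for files directly under `output_dir`, whereas the real layout sorts files into subdirectories.

```python
import os
import tempfile


def pending_media(
    manifest: list[dict],
    output_dir: str,
    permanently_failed: set[str],
) -> list[dict]:
    """Return manifest items that still need downloading."""
    pending = []
    for item in manifest:
        if item.get("downloaded"):
            continue  # already recorded as done
        if item["url"] in permanently_failed:
            continue  # e.g. HTTP 403/404, not worth retrying
        if os.path.exists(os.path.join(output_dir, item["filename"])):
            continue  # on disk even though state was not saved
        pending.append(item)
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "b.jpg"), "w").close()
        manifest = [
            {"url": "u1", "filename": "a.jpg", "downloaded": True},
            {"url": "u2", "filename": "b.jpg", "downloaded": False},
            {"url": "u3", "filename": "c.jpg", "downloaded": False},
        ]
        print([m["filename"] for m in pending_media(manifest, tmp, set())])
```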
flush_and_cleanup()
Save final state and remove the state file on completion.
Source code in src/python_reddit_scraper/downloader/state.py
CLI
Commands
The CLI commands module provides the main download command.
python_reddit_scraper.cli.commands
Typer-decorated entry points for the Reddit media downloader.
This module is intentionally thin: each command parses its arguments and
delegates straight to a flow in python_reddit_scraper.cli.flows.
download(subreddits=None, output_dir=None, save_json=False, max_pages=None, workers=None, scrape_workers=None, resume=False, dry_run=False, version=False)
Download media from Reddit subreddits.
Source code in src/python_reddit_scraper/cli/commands.py
Runtime
Shared runtime helpers used across CLI flows (env checks, config resolution).
python_reddit_scraper.cli.runtime
Runtime environment setup: camoufox check, proxy loading, option resolution.
This module is the glue between user-provided CLI arguments / YAML defaults
and the concrete values the scraper and downloader need. It contains no UI
styling — presentation is the responsibility of python_reddit_scraper.ui.
resolve_options(output_dir, max_pages, workers, scrape_workers)
Apply the CLI → config → prompt-or-default ladder for user-tunable options.
Source code in src/python_reddit_scraper/cli/runtime.py
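The precedence ladder can be sketched as a first-non-None lookup with a lazy final step. `resolve` and its parameters are hypothetical names; in the real module the last step may prompt the user, which is why the fallback is a callable here rather than a value.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def resolve(
    cli_value: Optional[T],
    config_value: Optional[T],
    fallback: Callable[[], T],  # hypothetical: a prompt or a built-in default
) -> T:
    """Return the CLI value, else the config value, else fallback()."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return fallback()  # only evaluated when nothing else is set


if __name__ == "__main__":
    print(resolve(None, 8, lambda: 16))
    print(resolve(4, 8, lambda: 16))
    print(resolve(None, None, lambda: 16))
```

Comparing against None rather than truthiness keeps explicit values such as 0 or an empty string from falling through to the next rung.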
check_camoufox_binary()
Verify the stealth Firefox binary required by camoufox is installed.
Source code in src/python_reddit_scraper/cli/runtime.py
load_proxies()
Load proxies for the chosen provider, with per-account fallback.
Returns the working proxy pool or None when no providers are configured.
Prompts for a provider when multiple are present in the YAML config.
Exits with a clear error when every account for the picked provider has
hit its bandwidth limit.
Source code in src/python_reddit_scraper/cli/runtime.py
Configure
Interactive subcommand that writes user defaults to config.yaml.
python_reddit_scraper.cli.configure
Interactive configure subcommand — writes defaults to config.yaml.
configure()
Interactively write download defaults to ~/.config/python_reddit_scraper/config.yaml.
Source code in src/python_reddit_scraper/cli/configure.py
History
history subcommand — lists past runs from the run log.
python_reddit_scraper.cli.history_cmd
download-reddit-media history subcommand.
history(limit=20, show_all=False, since=None)
Render past runs as a rich table.
Source code in src/python_reddit_scraper/cli/history_cmd.py
Flows: Live
End-to-end scrape + download flow with a live dashboard.
python_reddit_scraper.cli.flows.live
Live-scrape flow: scrape subreddits from Reddit, then download media.
run_live(subreddits, output_dir, save_json, max_pages, workers, scrape_workers, dry_run=False)
Scrape one or more subreddits and download all matching media.
With dry_run=True the media URLs are extracted and counted but nothing
is written to disk — useful for validating a large run before committing.
Source code in src/python_reddit_scraper/cli/flows/live.py
Flows: Resume
Resume flow for continuing a previously interrupted download session.
python_reddit_scraper.cli.flows.resume
Resume flow: pick up a previously interrupted download session.
run_resume(workers)
Resume a previously interrupted download session.
When a single interrupted session exists it is picked automatically; otherwise the user picks via a radiolist dialog (Enter takes the newest).
Source code in src/python_reddit_scraper/cli/flows/resume.py
UI
Prompts
Interactive prompts (fuzzy-completed, styled) used by the CLI flows.
python_reddit_scraper.ui.prompts
Interactive prompts built on prompt_toolkit — styled, fuzzy-completed, live-validated.
Every single-line prompt renders through styled_prompt so the visual
style ([LABEL:] in bold cyan plus a dim default hint) stays consistent. Dialogs
use the palette from python_reddit_scraper.ui.theme.
styled_prompt(label, *, default=None, validator=None, completer=None)
Render a prompt as [LABEL:] (default) and return the user's answer.
An empty answer returns default when provided, otherwise returns "".
Source code in src/python_reddit_scraper/ui/prompts.py
choose_provider(providers)
Return the single provider, or let the user pick via a radiolist dialog.
Source code in src/python_reddit_scraper/ui/prompts.py
prompt_subreddits()
Prompt for comma-separated subreddit names with fuzzy completion.
Source code in src/python_reddit_scraper/ui/prompts.py
pick_resume_session(sessions)
Prompt the user to pick one of sessions to resume.
sessions is a list of (path, label) pairs in newest-first order.
Returns the chosen path, or None if the dialog was cancelled.
Auto-returns the only option when len(sessions) == 1.
Source code in src/python_reddit_scraper/ui/prompts.py
prompt_media_types(default=None)
Prompt for media types via a checkbox dialog (space toggles, enter confirms).
Source code in src/python_reddit_scraper/ui/prompts.py
prompt_output_dir(default)
Prompt for an output directory with tab-completion; empty input returns default.
Source code in src/python_reddit_scraper/ui/prompts.py
Theme
Shared color palette and prompt_toolkit styles.
python_reddit_scraper.ui.theme
Central colour palette and formatting helpers.
A single coherent look across the CLI: rich markup for output, prompt_toolkit styles for input. The two styling systems are kept in sync here so a change to a colour in one place flows everywhere.
key_value(label, value)
Banner, Spinner, Summary, Tables
Rich-based UI helpers — startup banner, run summary, preflight and history tables.
python_reddit_scraper.ui.banner
Startup banner: a single rich Rule with app name, version, mode, and target count.
print_banner(mode, subreddit_count=None)
Print a one-line ━━ app v1.1.1 • mode=live • 3 subs ━━ rule.
Source code in src/python_reddit_scraper/ui/banner.py
python_reddit_scraper.ui.spinner
Uniform spinner context manager backed by rich.status.Status.
Usage::

    with spinner("Probing proxy accounts…"):
        result = slow_work()
spinner(message, *, spinner_style='dots')
Show a spinner with message while the wrapped block runs.
The spinner is automatically hidden when the block exits (success or
exception), and on non-interactive stdout it degrades to a single
message… line so piped invocations still surface progress.
Source code in src/python_reddit_scraper/ui/spinner.py
python_reddit_scraper.ui.summary
End-of-run summary rendering.
Two callers:
- print_summary: after a live scrape or resume run; per-subreddit counts.
- print_defaults_panel: after configure saves; echoes the persisted values.
print_summary(output_dir, successful, failed, subreddits, *, started_at=None, title='Download summary')
Render a rich Panel+Table summary of what was downloaded.
Source code in src/python_reddit_scraper/ui/summary.py
print_dry_run(counts, *, started_at=None)
Render a DRY RUN panel of per-subreddit media counts; no disk I/O.
Source code in src/python_reddit_scraper/ui/summary.py
print_defaults_panel(path, defaults)
Echo the saved configure defaults as a KEY/value table inside a panel.
Source code in src/python_reddit_scraper/ui/summary.py
python_reddit_scraper.ui.preflight_table
Render the subreddit preflight results as a rich Table.
print_preflight(results)
Render results as a bordered table inside a cyan panel.
Source code in src/python_reddit_scraper/ui/preflight_table.py
python_reddit_scraper.ui.history_table
Render download-reddit-media history output as a rich Panel+Table.
print_history(runs, *, path)
Render runs (newest first) with a subtitle pointing to the source file.
Source code in src/python_reddit_scraper/ui/history_table.py
Config
User configuration loader — defaults, proxy providers, and YAML read/write.
python_reddit_scraper.config
Load user configuration from ~/.config/python_reddit_scraper/config.yaml.
Provider
dataclass
One proxy provider block from the config file.
accounts holds the provider-specific raw dicts (shape differs per
provider — see scraper.proxy_handler for how each is parsed).
Source code in src/python_reddit_scraper/config.py
Defaults
dataclass
User-configured defaults for the download command.
Each field is None when not configured, meaning the CLI should fall
back to prompting the user (or to its built-in default).
Source code in src/python_reddit_scraper/config.py
get_providers()
Return all configured proxy providers in the order they appear in YAML.
Source code in src/python_reddit_scraper/config.py
get_defaults()
Return user-configured defaults from the defaults: block, if any.
Source code in src/python_reddit_scraper/config.py
save_defaults(defaults)
Merge defaults into the config file, preserving other top-level keys.
Only fields that are not None are written, so partial configs stay
partial. Returns the path that was written.
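The merge rule (write only non-None fields, leave other top-level keys alone) can be sketched on plain dicts. The `Defaults` fields shown and the `merge_defaults` helper are assumptions for illustration; YAML reading and writing is left out so the sketch stays within the standard library.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Defaults:
    """Hypothetical subset of fields; the real dataclass may differ."""

    output_dir: Optional[str] = None
    workers: Optional[int] = None


def merge_defaults(config: dict, defaults: Defaults) -> dict:
    """Return a copy of config with non-None defaults merged in."""
    merged = dict(config)  # other top-level keys are carried over untouched
    block = dict(merged.get("defaults") or {})
    block.update({k: v for k, v in asdict(defaults).items() if v is not None})
    merged["defaults"] = block
    return merged


if __name__ == "__main__":
    existing = {"providers": [{"name": "example"}], "defaults": {"workers": 8}}
    print(merge_defaults(existing, Defaults(output_dir="./downloads")))
```

Skipping None fields means a value set in an earlier configure run is not erased by a later run that leaves it blank.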