API Reference¶
Auto-generated documentation from source code docstrings.
Scraper¶
Core¶
The core scraper handles fetching Reddit JSON data via a stealth browser.
python_reddit_scraper.scraper.core
¶
Camoufox-based Reddit JSON scraper with pagination.
Navigates old.reddit.com JSON API using a stealth Firefox browser, follows pagination tokens, and returns raw post data.
scrape_subreddit(browser, subreddit, max_pages=50, delay=1.5, quiet=False)
¶
Scrape a subreddit's posts via Reddit's old JSON API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `browser` | | A camoufox Browser instance (from sync context manager). | *required* |
| `subreddit` | `str` | Subreddit name (without r/ prefix). | *required* |
| `max_pages` | `int` | Maximum number of pages to fetch (100 posts per page). | `50` |
| `delay` | `float` | Seconds to wait between page requests. | `1.5` |
| `quiet` | `bool` | If True, suppress all progress output. | `False` |

Returns:

| Type | Description |
|---|---|
| `list[dict]` | List of post data dicts (the 'data' field of each child). |
Source code in src/python_reddit_scraper/scraper/core.py
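The pagination loop described above can be sketched as follows. This is a simplified illustration: `fetch_page` is a stub standing in for the Camoufox page navigation (here backed by two fake pages), and `scrape_subreddit_sketch` is a hypothetical name, not the module's actual code.

```python
# Fake two-page listing used in place of real network responses.
_FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"id": "a1"}}], "after": "t3_a1"}},
    "t3_a1": {"data": {"children": [{"data": {"id": "a2"}}], "after": None}},
}

def fetch_page(browser, subreddit, after=None):
    # Stub: the real code drives a Camoufox page to
    # old.reddit.com/r/<subreddit>/.json?limit=100&after=<token>.
    return _FAKE_PAGES[after]

def scrape_subreddit_sketch(browser, subreddit, max_pages=50):
    posts, after = [], None
    for _ in range(max_pages):
        listing = fetch_page(browser, subreddit, after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]  # pagination token for the next page
        if not after:  # no token means the listing is exhausted
            break
    return posts
```

The loop stops early when Reddit returns no `after` token, so `max_pages` is an upper bound rather than a fixed page count.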
Parallel¶
Multi-process parallel scraping of multiple subreddits.
python_reddit_scraper.scraper.parallel
¶
Parallel multi-process scraping of multiple subreddits.
scrape_worker(subreddit, max_pages=50, delay=1.5, quiet=False)
¶
Standalone scrape function for use with ProcessPoolExecutor.
Each call creates its own Camoufox browser instance (Playwright sync API is not thread-safe, so each process must have its own browser).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subreddit` | `str` | Subreddit name (without r/ prefix). | *required* |
| `max_pages` | `int` | Maximum pages to fetch. | `50` |
| `delay` | `float` | Seconds between page requests. | `1.5` |
| `quiet` | `bool` | Suppress per-subreddit progress output. | `False` |

Returns:

| Type | Description |
|---|---|
| `tuple[str, list[dict]]` | Tuple of (subreddit_name, list_of_post_dicts). |
Source code in src/python_reddit_scraper/scraper/parallel.py
scrape_parallel(subreddits, max_pages=50, delay=1.5, max_workers=4, on_complete=None, progress=None)
¶
Scrape multiple subreddits in parallel using separate processes.
Each process gets its own Camoufox browser instance. Results are returned as a dict keyed by subreddit name. An optional callback is invoked as each subreddit finishes (useful for queueing downloads).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subreddits` | `list[str]` | List of subreddit names. | *required* |
| `max_pages` | `int` | Max pages per subreddit. | `50` |
| `delay` | `float` | Seconds between page requests per scraper. | `1.5` |
| `max_workers` | `int` | Maximum concurrent scraper processes. | `4` |
| `on_complete` | | Optional callback invoked as each subreddit finishes. | `None` |
| `progress` | `ProgressDisplay \| None` | Optional shared ProgressDisplay for coordinated progress output. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, list[dict]]` | Dict mapping subreddit name to its list of post dicts. |
Source code in src/python_reddit_scraper/scraper/parallel.py
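The result-collection pattern can be sketched as below. For brevity this illustration uses threads; the real module uses separate processes, since Playwright's sync API is not thread-safe. `scrape_one` and `scrape_parallel_sketch` are illustrative stand-ins, not the module's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_one(subreddit):
    # Stand-in for scrape_worker; the real worker launches its own browser.
    return subreddit, [{"id": f"{subreddit}-1"}]

def scrape_parallel_sketch(subreddits, max_workers=4, on_complete=None):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_one, s) for s in subreddits]
        for fut in as_completed(futures):
            name, posts = fut.result()
            results[name] = posts
            if on_complete:
                on_complete(name, posts)  # e.g. hand off to a download queue
    return results
```

Because results are collected with `as_completed`, `on_complete` fires as soon as each subreddit finishes, which is what makes the scrape-then-download pipeline overlap.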
JSON I/O¶
JSON file reading and writing for scraped data.
python_reddit_scraper.scraper.json_io
¶
JSON I/O for scraped Reddit data.
save_scraped_json(posts, subreddit, output_dir='./input')
¶
Save scraped posts to a JSON file compatible with the existing parser.
Wraps posts in Reddit's listing format so parse_json_files() can read them. Returns the path to the saved file.
Source code in src/python_reddit_scraper/scraper/json_io.py
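A minimal sketch of the wrapping step, assuming a `{subreddit}.json` filename (the actual naming scheme may differ):

```python
import json
import os

def save_scraped_json_sketch(posts, subreddit, output_dir="./input"):
    os.makedirs(output_dir, exist_ok=True)
    # Re-wrap raw post dicts in Reddit's listing shape, so a parser that
    # expects data -> children -> data can read the file unchanged.
    listing = {"data": {"children": [{"data": post} for post in posts]}}
    path = os.path.join(output_dir, f"{subreddit}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(listing, fh)
    return path
```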
parse_json_files(input_dir)
¶
Parse all JSON files in input directory and extract posts.
Source code in src/python_reddit_scraper/scraper/json_io.py
Downloader¶
Media¶
Media URL extraction, type detection, and filtering.
python_reddit_scraper.downloader.media
¶
Media URL extraction, type detection, and filtering.
sanitize_filename(text, max_length=100)
¶
Convert text to a safe filename.
Source code in src/python_reddit_scraper/downloader/media.py
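A plausible sketch of such a sanitizer; the exact character set and rules used by `sanitize_filename` may differ:

```python
import re

def sanitize_filename_sketch(text, max_length=100):
    # Replace characters that are unsafe on common filesystems,
    # collapse whitespace, trim stray separators, and truncate.
    text = re.sub(r'[\\/:*?"<>|]', "_", text)
    text = re.sub(r"\s+", "_", text).strip("._ ")
    return text[:max_length]
```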
get_file_extension(url)
¶
Extract file extension from URL.
Source code in src/python_reddit_scraper/downloader/media.py
get_media_type(filename)
¶
Determine media type from filename for directory sorting.
Source code in src/python_reddit_scraper/downloader/media.py
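Extension extraction and type classification can be sketched as below. The extension sets and directory labels (`videos`, `gifs`, `images`) are illustrative assumptions, not the module's actual mapping:

```python
import os
from urllib.parse import urlparse

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}
VIDEO_EXTS = {".mp4", ".webm"}
GIF_EXTS = {".gif", ".gifv"}

def get_file_extension_sketch(url):
    # Parse the URL first so query strings don't pollute the extension.
    return os.path.splitext(urlparse(url).path)[1].lower()

def get_media_type_sketch(filename):
    ext = os.path.splitext(filename)[1].lower()
    if ext in VIDEO_EXTS:
        return "videos"
    if ext in GIF_EXTS:
        return "gifs"
    if ext in IMAGE_EXTS:
        return "images"
    return "other"
```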
is_media_url(url)
¶
Check if URL points to a media file.
extract_media_urls(post_data)
¶
Extract all media URLs from a Reddit post at highest resolution.
Source code in src/python_reddit_scraper/downloader/media.py
extract_all_media(posts)
¶
Extract all media URLs from a list of posts, deduplicating by URL.
Returns list of dicts with 'url', 'filename', and 'subreddit' keys.
Source code in src/python_reddit_scraper/downloader/media.py
filter_by_media_type(downloads, video_only=False, image_only=False)
¶
Filter media list by type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `downloads` | `list[dict[str, str]]` | List of dicts with 'url' and 'filename' keys. | *required* |
| `video_only` | `bool` | Keep only videos + gifs (animations). | `False` |
| `image_only` | `bool` | Keep only images. | `False` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, str]]` | Filtered list. |
Source code in src/python_reddit_scraper/downloader/media.py
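The filtering rule ("videos + gifs" vs. images) can be sketched as follows; the extension sets are illustrative assumptions:

```python
import os

VIDEO_LIKE = {".mp4", ".webm", ".gif", ".gifv"}  # videos plus animations
IMAGE_LIKE = {".jpg", ".jpeg", ".png", ".webp"}

def filter_by_media_type_sketch(downloads, video_only=False, image_only=False):
    def ext(item):
        return os.path.splitext(item["filename"])[1].lower()
    if video_only:
        return [d for d in downloads if ext(d) in VIDEO_LIKE]
    if image_only:
        return [d for d in downloads if ext(d) in IMAGE_LIKE]
    return downloads  # no flag set: pass everything through
```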
Engine¶
Concurrent file downloading with progress tracking.
python_reddit_scraper.downloader.engine
¶
Download engine: concurrent file downloading with progress tracking.
download_file(url, filepath, *, fallback_urls=None)
¶
Download a file from URL to filepath with retries.
Returns:
| Type | Description |
|---|---|
| `bool` | Whether the download succeeded. |
| `str` | A short reason label describing any failure. |
Source code in src/python_reddit_scraper/downloader/engine.py
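A sketch of the retry-with-fallbacks flow. The `fetch` and `retries` parameters are illustration-only additions (the real function performs its own HTTP requests with its own retry policy), and the `(ok, reason)` return mirrors the documented shape:

```python
def download_file_sketch(url, filepath, *, fetch, retries=3, fallback_urls=None):
    # Try the primary URL, then each fallback, each with `retries` attempts.
    candidates = [url] + list(fallback_urls or [])
    reason = "unknown"
    for candidate in candidates:
        for _attempt in range(retries):
            try:
                data = fetch(candidate)
            except Exception as exc:
                reason = type(exc).__name__  # remember the last failure label
                continue  # real code would back off before retrying
            with open(filepath, "wb") as fh:
                fh.write(data)
            return True, ""
    return False, reason
```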
download_all(downloads, output_dir, workers=16, on_file_done=None, on_file_failed=None, progress=None)
¶
Download all media files concurrently.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `downloads` | `list[dict[str, str]]` | List of dicts with 'url', 'filename', optionally 'subreddit', 'optional', and 'audio_fallbacks' keys. | *required* |
| `output_dir` | `str` | Base output directory (files sorted into subdirectories). | *required* |
| `workers` | `int` | Number of parallel download threads. | `16` |
| `on_file_done` | | Optional callback invoked after each completed file. | `None` |
| `on_file_failed` | | Optional callback invoked after each failed file. | `None` |
| `progress` | `ProgressDisplay \| None` | Optional shared ProgressDisplay for coordinated progress output. | `None` |

Returns:

| Type | Description |
|---|---|
| `int` | Number of successful downloads. |
| `int` | Number of failed downloads. |
Source code in src/python_reddit_scraper/downloader/engine.py
run_download_queue(download_q, output_dir, workers, video_only, image_only, state=None, progress=None)
¶
Consumer thread: pulls (subreddit, posts) from queue, downloads one sub at a time.
Returns cumulative (successful, failed) counts.
Source code in src/python_reddit_scraper/downloader/engine.py
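The consumer loop can be sketched as below; the `None` sentinel and the `handle_batch` hook are illustrative assumptions standing in for the real download call:

```python
import queue

def run_download_queue_sketch(download_q, handle_batch):
    # Consumer loop: pull (subreddit, posts) pairs until a sentinel,
    # processing one subreddit's batch at a time.
    successful = failed = 0
    while True:
        item = download_q.get()
        if item is None:  # sentinel: the producer is finished
            break
        subreddit, posts = item
        ok, bad = handle_batch(subreddit, posts)
        successful += ok
        failed += bad
    return successful, failed
```

Processing one subreddit at a time keeps disk and bandwidth contention bounded even while scraping continues to feed the queue.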
Session State¶
The state module manages resume/session persistence.
python_reddit_scraper.downloader.state
¶
Session state management for resume support.
Persists scraping progress and download manifests to .scraper-state/
so interrupted runs can be resumed with --resume.
SessionState
¶
Manages persistent state for a single scrape+download session.
State is saved to a JSON file in .scraper-state/{timestamp}.json.
The file tracks which subreddits have been scraped, the full media
manifest, and which files have been successfully downloaded.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str` | The download output directory for this session. | *required* |
| `video_only` | `bool` | Whether this session is restricted to videos/animations. | `False` |
| `image_only` | `bool` | Whether this session is restricted to images. | `False` |
| `state_path` | `str \| None` | Explicit path to a state file (used when resuming). | `None` |
Source code in src/python_reddit_scraper/downloader/state.py
save()
¶
Write current state to disk atomically.
Source code in src/python_reddit_scraper/downloader/state.py
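Atomic writes are typically done by writing to a temporary file in the same directory and renaming it over the target; a sketch of that pattern (not the module's exact code):

```python
import json
import os
import tempfile

def atomic_save_sketch(state, path):
    # Write to a temp file beside the target, then os.replace() it in,
    # so a crash mid-write never leaves a truncated state file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # don't leave temp debris behind on failure
        raise
```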
load(path)
classmethod
¶
Load a session state from a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the state JSON file. | *required* |

Returns:

| Type | Description |
|---|---|
| `SessionState` | A populated SessionState instance. |
Source code in src/python_reddit_scraper/downloader/state.py
find_latest()
classmethod
¶
Find the most recent state file in the state directory.
Returns:
| Type | Description |
|---|---|
| `str \| None` | Path to the newest state file, or None if none exist. |
Source code in src/python_reddit_scraper/downloader/state.py
mark_subreddit_scraped(sub)
¶
set_media_manifest(media_list)
¶
Set the full media manifest (list of files to download).
Each item should have url, filename, subreddit keys.
A downloaded field is added and defaults to False.
Source code in src/python_reddit_scraper/downloader/state.py
mark_downloaded(url, batch_size=50)
¶
Mark a media URL as successfully downloaded.
State is flushed to disk every batch_size completions for performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL that was downloaded. | *required* |
| `batch_size` | `int` | How often to flush state to disk. | `50` |
Source code in src/python_reddit_scraper/downloader/state.py
mark_permanently_failed(url, reason, permanent)
¶
Mark a media URL as permanently failed (e.g. HTTP 403/404).
These items will be skipped on future resume attempts. Only stores permanent failures; transient ones can be retried.
Source code in src/python_reddit_scraper/downloader/state.py
get_pending_media()
¶
Get media items that have not yet been downloaded.
Also checks whether the file already exists on disk (handles the case where the file was downloaded but state wasn't saved). Permanently failed items (HTTP 403/404) are skipped.
Returns:
| Type | Description |
|---|---|
list[dict]
|
List of media dicts that still need downloading. |
Source code in src/python_reddit_scraper/downloader/state.py
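A sketch of the pending-item filter; note the real code sorts files into type subdirectories, while this simplified version checks a flat path, and the `failed_urls` parameter stands in for the stored permanent-failure set:

```python
import os

def get_pending_media_sketch(manifest, output_dir, failed_urls=()):
    pending = []
    for item in manifest:
        if item.get("downloaded"):
            continue  # already marked complete in state
        if item["url"] in failed_urls:
            continue  # permanent failure (e.g. HTTP 403/404): never retry
        # The file may exist on disk even if state wasn't flushed in time.
        if os.path.exists(os.path.join(output_dir, item["filename"])):
            continue
        pending.append(item)
    return pending
```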
flush_and_cleanup()
¶
Save final state and remove the state file on completion.
Source code in src/python_reddit_scraper/downloader/state.py
CLI¶
Commands¶
The CLI commands module provides the main download command.
python_reddit_scraper.cli.commands
¶
CLI commands for the Reddit media downloader.
Handles the main download command and its sub-modes (live scrape, resume, from-json).
download(subreddits=None, output_dir='./redditdownloads', video_only=False, image_only=False, from_json=False, save_json=False, max_pages=50, workers=16, scrape_workers=max(1, (os.cpu_count() or 2) // 2), resume=False, version=False)
¶
Download media from Reddit subreddits.
Source code in src/python_reddit_scraper/cli/commands.py
Prompt¶
Interactive prompts and environment checks.
python_reddit_scraper.cli.prompt
¶
Interactive prompts and environment checks for the CLI.
prompt_subreddits()
¶
Interactively prompt for subreddit names using prompt-toolkit.
Source code in src/python_reddit_scraper/cli/prompt.py
check_camoufox_binary()
¶
Check if the camoufox Firefox binary is installed.