Applying data engineering for image editing
Building an Automated Image Processing Pipeline for My Photography Portfolio
When I set out to build my photography portfolio, I quickly realized that manually resizing, converting, and cataloging photos was unsustainable. Every RAW file from my Ricoh GR III needed to become multiple AVIF variants (thumbnails for gallery grids, mid-size previews for collection pages, and large display images for the lightbox), each with carefully tuned compression. On top of that, I needed structured EXIF metadata and YAML collection files for the Astro frontend to consume. So I built a CLI-driven image processing pipeline in Python to automate the entire workflow.
The Problem
A single RAW photograph needs to go through several transformations before it can appear on the web:
- EXIF metadata extraction: Camera make/model, ISO, aperture, shutter speed, focal length, and date are all buried in the binary EXIF headers.
- Responsive image generation: The web demands multiple sizes: a 350px thumbnail for the gallery grid, a 700px version for collection previews, and a 1400px display image for the lightbox.
- Modern format encoding: AVIF delivers dramatically better compression than JPEG at equivalent perceptual quality, but encoding it well requires careful tuning of quality, effort, chroma subsampling, and bit depth.
- Catalog management: The Astro frontend reads YAML collection files that reference every photo’s variants, metadata, and CDN paths. Maintaining these by hand is error-prone and tedious.
Architecture Overview
The pipeline is a Python CLI application built with Click for the command-line interface, pyvips for high-performance image processing, and exifread for metadata extraction. It’s structured as four distinct modules with clear responsibilities.
Module Breakdown
1. Metadata Extraction (image_metadata.py)
The PhotoMetadata dataclass is the central data model. It captures everything the portfolio cares about:

    from dataclasses import dataclass


    @dataclass
    class PhotoMetadata:
        camera_make: str | None = None
        camera_model: str | None = None
        lens: str | None = None
        date_taken: str | None = None
        focal_length_35mm: str | None = None
        aperture: str | None = None
        shutter_speed: str | None = None
        iso: str | None = None
        exposure_mode: str | None = None
        metering_mode: str | None = None

The from_exif_tags() classmethod takes the raw dictionary returned by exifread.process_file() and maps EXIF tag names (like EXIF FNumber, EXIF ISOSpeedRatings) to clean, typed fields. This decouples the rest of the pipeline from EXIF internals.
Full list of extracted EXIF fields
| Field | EXIF Tag | Example |
|---|---|---|
| camera_make | Image Make | RICOH IMAGING COMPANY, LTD. |
| camera_model | Image Model | RICOH GR III |
| lens | EXIF LensModel | RICOH GR LENS 28mm F2.8 |
| aperture | EXIF FNumber | f/4.0 |
| shutter_speed | EXIF ExposureTime | 1/60 |
| iso | EXIF ISOSpeedRatings | 200 |
| focal_length_35mm | EXIF FocalLengthIn35mmFilm | 28mm |
| date_taken | EXIF DateTimeOriginal | 2021:09:19 |
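To make the mapping concrete, here is a minimal sketch of what a from_exif_tags() classmethod could look like. It is not the pipeline's actual code: the TAG_MAP name is hypothetical, the dataclass is trimmed to the eight fields in the table, and values are simply stringified (exifread tag objects print as their human-readable value).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table: dataclass field -> exifread tag name (from the table above).
TAG_MAP = {
    "camera_make": "Image Make",
    "camera_model": "Image Model",
    "lens": "EXIF LensModel",
    "aperture": "EXIF FNumber",
    "shutter_speed": "EXIF ExposureTime",
    "iso": "EXIF ISOSpeedRatings",
    "focal_length_35mm": "EXIF FocalLengthIn35mmFilm",
    "date_taken": "EXIF DateTimeOriginal",
}


@dataclass
class PhotoMetadata:
    # Optional[str] is equivalent to the `str | None` hints used in the post.
    camera_make: Optional[str] = None
    camera_model: Optional[str] = None
    lens: Optional[str] = None
    aperture: Optional[str] = None
    shutter_speed: Optional[str] = None
    iso: Optional[str] = None
    focal_length_35mm: Optional[str] = None
    date_taken: Optional[str] = None

    @classmethod
    def from_exif_tags(cls, tags: dict) -> "PhotoMetadata":
        """Copy known tags into fields; tags missing from the file stay None."""
        values = {}
        for field_name, tag_name in TAG_MAP.items():
            tag = tags.get(tag_name)
            if tag is not None:
                values[field_name] = str(tag).strip()
        return cls(**values)
```

In real use the `tags` argument would be the dictionary returned by exifread.process_file() on an open binary file handle.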
2. Image Conversion (image_converter.py)
This is where the heavy lifting happens. The ImageConverter class wraps libvips (via pyvips) to handle image resizing and AVIF encoding.
Key design decisions:
- thumbnail() over manual resize: libvips’ thumbnail() uses shrink-on-load, decoding the image at a reduced resolution and then applying a high-quality Lanczos3 resize in a single pass. This is both faster and sharper than loading at full resolution and then resizing.
- Post-resize sharpening: A mild unsharp mask (sigma=1.0) recovers perceived detail lost during downsampling, with m1=0 to avoid sharpening flat areas like skies.
- 12-bit AVIF at effort 7: Higher bit depth reduces banding in gradients; effort 7 gives excellent compression at an acceptable encoding speed for offline processing.
- ICC profile preservation: All other metadata is stripped to reduce file size, but the ICC profile is kept for accurate color reproduction.
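The decisions above could be wired together roughly as follows. This is a sketch under assumptions, not the ImageConverter implementation: output_path() and convert_one() are hypothetical helper names, Q=85 is a placeholder quality value, and metadata stripping is left out because the option for it depends on the libvips version (8.15 and later expose a `keep` option on the savers). Whether thumbnail() can open a DNG directly also depends on how libvips was built.

```python
from pathlib import Path

# Encoder settings named in the text; Q=85 is an assumed placeholder.
AVIF_OPTIONS = {
    "compression": "av1",      # AV1 inside a HEIF container, i.e. AVIF
    "Q": 85,
    "bitdepth": 12,
    "effort": 7,
    "subsample_mode": "auto",
}


def output_path(source: str, output_dir: str, size_name: str) -> Path:
    """Build e.g. <output_dir>/R0012110-thumbnail.avif from R0012110.DNG."""
    return Path(output_dir) / f"{Path(source).stem}-{size_name}.avif"


def convert_one(source: str, output_dir: str, size_name: str, width: int) -> Path:
    """Resize, sharpen, and encode one variant of a single image."""
    # Imported lazily so the helpers above work without libvips installed.
    import pyvips

    # Resize to the target width; size="down" means never upscale.
    image = pyvips.Image.thumbnail(source, width, size="down")
    # Mild unsharp mask; m1=0 leaves flat areas untouched.
    image = image.sharpen(sigma=1.0, m1=0)
    target = output_path(source, output_dir, size_name)
    image.heifsave(str(target), **AVIF_OPTIONS)
    return target
```

Keeping the encoder options in one dictionary makes it easy to apply identical settings to every size.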
The core method generates all three responsive sizes in a single call:

    sizes = {
        "thumbnail": 350,    # Gallery grid
        "collection": 700,   # Collection preview
        "display": 1400,     # Lightbox full view
    }

    output_files = converter.generate_responsive_sizes(
        "R0012110.DNG",
        output_dir="pipeline_artifacts/converted/",
        output_format="avif",
        sizes=sizes,
    )

3. YAML Generation (yaml_generator.py)
The YAMLGenerator bridges the gap between processed images and the Astro frontend. It takes metadata and image paths and produces structured YAML collection files like:

    collection: Tokyo Streets
    description: Exploring the vibrant streets of Tokyo
    photos:
      - id: jinbocho-passing-reader
        title: Passing reader in 神保町 (Jinbōchō)
        image: https://cdn.avm.photography/collections/tokyo/R0012164-display.avif
        collection: https://cdn.avm.photography/collections/tokyo/R0012164-collection.avif
        thumbnail: https://cdn.avm.photography/collections/tokyo/R0012164-thumbnail.avif
        metadata:
          camera: RICOH IMAGING COMPANY, LTD. RICOH GR III
          lens: RICOH GR LENS 28mm F2.8
          settings:
            iso: [200]
            aperture: f/4.0
            shutter: 1/60
            focalLength: 28mm
          location: Jinbōchō, Tokyo, Japan
          dateTaken: 2021:09:19

The generator handles formatting raw EXIF values into human-readable strings, converting fractional aperture ratios like 14/5 into f/2.8, exposure times into 1/60 notation, and focal lengths into 28mm format.
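The formatting step is easy to illustrate with the standard library's Fraction type. The helpers below are a sketch with hypothetical names, not the generator's real functions; they assume the raw values arrive as strings such as "14/5" or "28". Showing exposures of one second or longer with an "s" suffix is also an assumption.

```python
from fractions import Fraction


def format_aperture(raw: str) -> str:
    """Turn an f-number ratio into f/N.N, e.g. '14/5' -> 'f/2.8'."""
    return f"f/{float(Fraction(raw)):.1f}"


def format_shutter(raw: str) -> str:
    """Keep sub-second exposures as 1/N; show longer ones in seconds."""
    value = Fraction(raw)
    if value < 1:
        return f"1/{round(1 / value)}"
    return f"{float(value):g}s"


def format_focal_length(raw: str) -> str:
    """Append the unit, dropping a trailing '.0' if present."""
    return f"{float(Fraction(raw)):g}mm"
```

For example, format_aperture("4") gives "f/4.0" and format_shutter("1/60") gives "1/60", matching the values in the collection file.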
4. CLI Orchestrator (cli.py)
The Click-based CLI ties everything together with four commands:
| Command | Purpose |
|---|---|
| process | Process a single RAW file → extract metadata + generate all AVIF sizes |
| generate-yaml | Scan a directory of processed images → produce a collection YAML |
| quick-add | One-shot: process a RAW file and create its YAML entry |
| add-to-collection | Detect new RAW files, process them, and append to an existing collection |
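The "detect new RAW files" step of add-to-collection could be approached as below. This is only a sketch: both function names are hypothetical, it assumes RAW files are .dng, and it assumes a photo counts as "already in the collection" when its file stem appears in one of the collection's image URLs (following the R0012164-display.avif naming pattern).

```python
from pathlib import Path
from typing import Iterable, List, Set

RAW_EXTENSIONS = {".dng"}  # assumption: only DNG files are treated as RAW input


def stems_from_image_urls(urls: Iterable[str]) -> Set[str]:
    """'.../R0012164-display.avif' -> 'R0012164'."""
    return {url.rsplit("/", 1)[-1].rsplit("-", 1)[0] for url in urls}


def find_new_raw_files(raw_dir: str, known_stems: Set[str]) -> List[Path]:
    """Return RAW files whose stem is not referenced by the collection yet."""
    candidates = (
        p for p in Path(raw_dir).iterdir()
        if p.is_file() and p.suffix.lower() in RAW_EXTENSIONS
    )
    return sorted(p for p in candidates if p.stem not in known_stems)
```

Each path returned would then go through the same processing as the process command before its entry is appended to the YAML file.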
Design Decisions Worth Noting
AVIF over WebP
AVIF consistently delivers 20-30% smaller files than WebP at the same perceptual quality, especially for photographic content. The encoding is slower (hence effort 7 rather than 9), but this is an offline pipeline so encoding speed barely matters.
12-bit depth
Most images are 8-bit, but encoding at 12-bit avoids introducing banding artifacts in subtle gradients (think: dusk skies, smooth bokeh). The file size increase is negligible.
Chroma subsampling set to auto
Lets libvips decide based on quality. At Q≥85, it typically preserves full chroma, which matters for color accuracy in a photography portfolio.
Metadata stripped, ICC kept
EXIF data is extracted separately into JSON, so there’s no reason to keep it in the image file. But the ICC color profile must stay for the browser to render colors correctly.
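As an illustration of the "extracted separately into JSON" step, a dataclass can be serialized with the standard library alone. The sketch below uses a trimmed stand-in for PhotoMetadata and a hypothetical metadata_to_json() helper; dropping fields that EXIF did not provide is an assumption, not something the pipeline is known to do.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PhotoMetadata:  # trimmed to three fields for brevity
    camera_model: Optional[str] = None
    aperture: Optional[str] = None
    iso: Optional[str] = None


def metadata_to_json(meta: PhotoMetadata) -> str:
    """Serialize the dataclass, omitting fields that are None."""
    present = {k: v for k, v in asdict(meta).items() if v is not None}
    return json.dumps(present, indent=2, ensure_ascii=False)
```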
Click for CLI
Type-safe argument parsing, auto-generated help text, and composable command groups made it the obvious choice over argparse.
What’s Next
This pipeline handles the core workflow well, but there’s room to grow:
- Parallel processing (planned): Currently sequential; libvips is fast enough that Python’s GIL isn’t the bottleneck, but processing multiple files concurrently would still speed up large batches.
- CDN upload integration (planned): Automatically push converted images to Cloudflare R2 after processing.
- Watermarking (planned): The Astro frontend already has a Watermark.astro component; baking watermarks into the images themselves would add another layer of protection.
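For the parallel-processing idea, one possible starting point is a thread pool from the standard library. This is a sketch of an approach that has not been built yet: process_batch() is a hypothetical name, `process_one` stands in for whatever per-file function the pipeline exposes, and threads only help if most of the per-file time is spent inside libvips rather than in Python code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def process_batch(
    paths: Iterable[str],
    process_one: Callable[[str], T],
    max_workers: int = 4,
) -> List[T]:
    """Run process_one over every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `paths`, and re-raises the
        # first exception from a worker when its result is consumed.
        return list(pool.map(process_one, paths))
```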