Applying data engineering for image editing

Building an Automated Image Processing Pipeline for My Photography Portfolio

When I set out to build my photography portfolio, I quickly realized that manually resizing, converting, and cataloging photos was unsustainable. Every RAW file from my Ricoh GR III needed to become multiple AVIF variants: thumbnails for gallery grids, mid-size previews for collection pages, and full-resolution images for the lightbox, each with carefully tuned compression. On top of that, I needed structured EXIF metadata and YAML collection files for the Astro frontend to consume. So I built a CLI-driven image processing pipeline in Python to automate the entire workflow.

The Problem

A single RAW photograph needs to go through several transformations before it can appear on the web:

  1. EXIF metadata extraction: Camera make/model, ISO, aperture, shutter speed, focal length, and date are all buried in the binary EXIF headers.
  2. Responsive image generation: The web demands multiple sizes: a 350px thumbnail for the gallery grid, a 700px version for collection previews, and a 1400px display image for the lightbox.
  3. Modern format encoding: AVIF delivers dramatically better compression than JPEG at equivalent perceptual quality, but encoding it well requires careful tuning of quality, effort, chroma subsampling, and bit depth.
  4. Catalog management: The Astro frontend reads YAML collection files that reference every photo’s variants, metadata, and CDN paths. Maintaining these by hand is error-prone and tedious.

Architecture Overview

The pipeline is a Python CLI application built on Click for the command-line interface, pyvips for high-performance image processing, and exifread for metadata extraction. It’s structured as four distinct modules with clear responsibilities: metadata extraction, image conversion, YAML generation, and a CLI orchestrator.

Module Breakdown

1. Metadata Extraction (image_metadata.py)

The PhotoMetadata dataclass is the central data model. It captures everything the portfolio cares about:

from dataclasses import dataclass


@dataclass
class PhotoMetadata:
    camera_make: str | None = None
    camera_model: str | None = None
    lens: str | None = None
    date_taken: str | None = None
    focal_length_35mm: str | None = None
    aperture: str | None = None
    shutter_speed: str | None = None
    iso: str | None = None
    exposure_mode: str | None = None
    metering_mode: str | None = None

The from_exif_tags() classmethod takes the raw dictionary returned by exifread.process_file() and maps EXIF tag names (like EXIF FNumber, EXIF ISOSpeedRatings) to clean, typed fields. This decouples the rest of the pipeline from EXIF internals.
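
As a rough illustration of that mapping, here is a minimal sketch. The tag names are the ones listed in this post, but the TAG_MAP constant, the trimmed-down dataclass (exposure_mode and metering_mode are left out), and the body of from_exif_tags() are assumptions made for illustration, not the pipeline's actual code. It only relies on each tag value being convertible with str(), so a plain dictionary of strings works as a stand-in for the exifread result.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table: dataclass field name -> EXIF tag name.
TAG_MAP = {
    "camera_make": "Image Make",
    "camera_model": "Image Model",
    "lens": "EXIF LensModel",
    "date_taken": "EXIF DateTimeOriginal",
    "focal_length_35mm": "EXIF FocalLengthIn35mmFilm",
    "aperture": "EXIF FNumber",
    "shutter_speed": "EXIF ExposureTime",
    "iso": "EXIF ISOSpeedRatings",
}


@dataclass
class PhotoMetadata:
    camera_make: Optional[str] = None
    camera_model: Optional[str] = None
    lens: Optional[str] = None
    date_taken: Optional[str] = None
    focal_length_35mm: Optional[str] = None
    aperture: Optional[str] = None
    shutter_speed: Optional[str] = None
    iso: Optional[str] = None

    @classmethod
    def from_exif_tags(cls, tags):
        """Build a PhotoMetadata from a {tag name: value} mapping."""
        values = {}
        for field_name, tag_name in TAG_MAP.items():
            if tag_name in tags:
                # str() gives the printable form of a tag value.
                text = str(tags[tag_name]).strip()
                values[field_name] = text or None
        # Tags missing from the file simply stay None.
        return cls(**values)
```

Keeping the tag names in one table means a renamed or camera-specific tag only needs a change in a single place.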

Full list of extracted EXIF fields

Field              | EXIF Tag                   | Example
camera_make        | Image Make                 | RICOH IMAGING COMPANY, LTD.
camera_model       | Image Model                | RICOH GR III
lens               | EXIF LensModel             | RICOH GR LENS 28mm F2.8
aperture           | EXIF FNumber               | f/4.0
shutter_speed      | EXIF ExposureTime          | 1/60
iso                | EXIF ISOSpeedRatings       | 200
focal_length_35mm  | EXIF FocalLengthIn35mmFilm | 28mm
date_taken         | EXIF DateTimeOriginal      | 2021:09:19

2. Image Conversion (image_converter.py)

This is where the heavy lifting happens. The ImageConverter class wraps libvips (via pyvips) to handle image resizing and AVIF encoding:

Key design decisions:

  • thumbnail() over manual resize: libvips’ thumbnail() uses shrink-on-load, decoding the image at a reduced resolution and then applying a high-quality Lanczos3 resize in a single pass. This is both faster and sharper than loading at full resolution and then resizing.
  • Post-resize sharpening: A mild unsharp mask (sigma=1.0) recovers perceived detail lost during downsampling, with m1=0 to avoid sharpening flat areas like skies.
  • 12-bit AVIF at effort 7: Higher bit depth reduces banding in gradients; effort 7 gives excellent compression at an acceptable encoding speed for offline processing.
  • ICC profile preservation: All other metadata is stripped to reduce file size, but the ICC profile is kept for accurate color reproduction.

The core method generates all three responsive sizes in a single call:

sizes = {
    "thumbnail": 350,   # Gallery grid
    "collection": 700,  # Collection preview
    "display": 1400,    # Lightbox full view
}
output_files = converter.generate_responsive_sizes(
    "R0012110.DNG",
    output_dir="pipeline_artifacts/converted/",
    output_format="avif",
    sizes=sizes,
)
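
To make the design decisions above more concrete, here is a hedged sketch of what a method like this might do internally. It is not the actual ImageConverter code: the AVIF_OPTIONS dictionary, the variant_path() helper, and Q=85 are assumptions, and the step that strips metadata while keeping the ICC profile is left out because the relevant save option differs between libvips versions. It also assumes a libvips build that can read the source file and encode 12-bit AVIF.

```python
from pathlib import Path

# Encoder settings mirroring the post; Q=85 is an assumed value.
AVIF_OPTIONS = {
    "compression": "av1",      # AVIF is AV1 inside a HEIF container
    "Q": 85,
    "effort": 7,
    "bitdepth": 12,
    "subsample_mode": "auto",
}


def variant_path(source, output_dir, size_name, output_format="avif"):
    """Build an output path such as converted/R0012110-thumbnail.avif."""
    stem = Path(source).stem
    return Path(output_dir) / f"{stem}-{size_name}.{output_format}"


def generate_responsive_sizes(source, output_dir, sizes, options=None):
    """Resize, sharpen, and encode one AVIF file per entry in sizes."""
    # Imported lazily so the path helper is usable without libvips.
    import pyvips

    options = dict(AVIF_OPTIONS if options is None else options)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    outputs = {}
    for size_name, width in sizes.items():
        # thumbnail() loads and downsizes in one step; with only a
        # width given, the result fits inside a width x width box.
        image = pyvips.Image.thumbnail(str(source), width)
        # Mild unsharp mask; m1=0 leaves flat areas untouched.
        image = image.sharpen(sigma=1.0, m1=0)
        target = variant_path(source, output_dir, size_name)
        image.heifsave(str(target), **options)
        outputs[size_name] = target
    return outputs
```

Keeping the encoder settings in one dictionary makes it easy to experiment with quality or effort without touching the resize loop.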

3. YAML Generation (yaml_generator.py)

The YAMLGenerator bridges the gap between processed images and the Astro frontend. It takes metadata and image paths and produces structured YAML collection files like:

collection: Tokyo Streets
description: Exploring the vibrant streets of Tokyo
photos:
  - id: jinbocho-passing-reader
    title: Passing reader in 神保町 (Jinbōchō)
    image: https://cdn.avm.photography/collections/tokyo/R0012164-display.avif
    collection: https://cdn.avm.photography/collections/tokyo/R0012164-collection.avif
    thumbnail: https://cdn.avm.photography/collections/tokyo/R0012164-thumbnail.avif
    metadata:
      camera: RICOH IMAGING COMPANY, LTD. RICOH GR III
      lens: RICOH GR LENS 28mm F2.8
      settings:
        iso: [200]
        aperture: f/4.0
        shutter: 1/60
        focalLength: 28mm
      location: Jinbōchō, Tokyo, Japan
      dateTaken: 2021:09:19
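
An entry like the one above could be assembled from a file stem and a CDN base URL along these lines. This is only a sketch: build_photo_entry() and its parameters are hypothetical names, and the real YAMLGenerator may be organized quite differently.

```python
def build_photo_entry(stem, photo_id, title, cdn_base, metadata):
    """Return one dict shaped like an item in the photos list."""
    base = cdn_base.rstrip("/")
    entry = {"id": photo_id, "title": title}
    # Key order mirrors the YAML example: image, collection, thumbnail.
    variants = (
        ("image", "display"),
        ("collection", "collection"),
        ("thumbnail", "thumbnail"),
    )
    for key, variant in variants:
        entry[key] = f"{base}/{stem}-{variant}.avif"
    entry["metadata"] = metadata
    return entry
```

The resulting dictionary can then be handed to a YAML serializer. PyYAML's yaml.safe_dump(data, sort_keys=False, allow_unicode=True) is one option that would keep the key order and the Japanese title intact, though which library the pipeline actually uses is not stated here.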

The generator also formats raw EXIF values into human-readable strings: fractional aperture ratios like 14/5 become f/2.8, exposure times are written in 1/60 notation, and focal lengths in 28mm form.
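
The formatting step can be sketched with the standard library's Fraction type, which parses ratio strings directly. The function names below are hypothetical, and the "2s" style for exposures of a second or longer is an assumption, since the post only shows the 1/60 case.

```python
from fractions import Fraction


def format_aperture(raw):
    """Turn a ratio such as '4/1' or '28/10' into 'f/4.0' or 'f/2.8'."""
    return f"f/{float(Fraction(str(raw))):.1f}"


def format_shutter(raw):
    """Keep sub-second exposures as 1/N; write longer ones as Ns."""
    value = Fraction(str(raw))
    if value >= 1:
        return f"{float(value):g}s"
    # Fraction normalizes '10/600' to 1/60 before we invert it.
    return f"1/{round(1 / value)}"


def format_focal_length(raw):
    """Turn '28' or '183/10' into '28mm' or '18.3mm'."""
    return f"{float(Fraction(str(raw))):g}mm"
```

Wrapping the input in str() lets the same helpers accept either plain strings or tag objects that print as ratios.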

4. CLI Orchestrator (cli.py)

The Click-based CLI ties everything together with four commands:

Command            | Purpose
process            | Process a single RAW file → extract metadata + generate all AVIF sizes
generate-yaml      | Scan a directory of processed images → produce a collection YAML
quick-add          | One-shot: process a RAW file and create its YAML entry
add-to-collection  | Detect new RAW files, process them, and append to an existing collection
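
The "detect new RAW files" step of add-to-collection is the least obvious of the four, so here is one way it might work, using only the standard library. Everything in it is an assumption: the function names, the idea of comparing file stems against the variant URLs already in the collection, and the default .dng extension.

```python
from pathlib import Path

# Suffixes the pipeline appends to each source stem (from the YAML example).
VARIANT_SUFFIXES = ("-display", "-collection", "-thumbnail")


def processed_stems(image_urls):
    """Extract source stems such as 'R0012164' from variant URLs."""
    stems = set()
    for url in image_urls:
        name = Path(url.rsplit("/", 1)[-1]).stem  # e.g. 'R0012164-display'
        for suffix in VARIANT_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        stems.add(name)
    return stems


def find_new_raw_files(raw_dir, image_urls, extensions=(".dng",)):
    """Return RAW files in raw_dir that no collection URL refers to yet."""
    known = processed_stems(image_urls)
    return sorted(
        path
        for path in Path(raw_dir).iterdir()
        if path.suffix.lower() in extensions and path.stem not in known
    )
```

A command built on this would process each returned path and append the new entries to the existing collection file.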

Design Decisions Worth Noting

AVIF over WebP

AVIF typically yields files around 20-30% smaller than WebP at comparable perceptual quality, especially for photographic content. Encoding is slower, but this is an offline pipeline, so speed matters little; choosing effort 7 rather than the maximum of 9 simply keeps large batches from dragging.

12-bit depth

Most images are 8-bit, but encoding at 12-bit avoids introducing banding artifacts in subtle gradients (think: dusk skies, smooth bokeh). The file size increase is negligible.

Chroma subsampling set to auto

With auto, libvips decides based on the quality setting. Its documentation describes chroma subsampling as disabled at Q≥90, so high-quality encodes keep full chroma, which matters for color accuracy in a photography portfolio.

Metadata stripped, ICC kept

EXIF data is extracted separately into JSON, so there’s no reason to keep it in the image file. But the ICC color profile must stay for the browser to render colors correctly.

Click for CLI

Type-safe argument parsing, auto-generated help text, and composable command groups made it the obvious choice over argparse.

What’s Next

This pipeline handles the core workflow well, but there’s room to grow:

  • Parallel processing (planned): Currently sequential; libvips is fast enough that Python’s GIL isn’t the bottleneck, but processing multiple files concurrently would still speed up large batches.
  • CDN upload integration (planned): Automatically push converted images to Cloudflare R2 after processing.
  • Watermarking (planned): The Astro frontend already has a Watermark.astro component; baking watermarks into the images themselves would add another layer of protection.
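
For the parallel processing item, a thread pool from the standard library is one plausible starting point. The sketch below is an assumption rather than a plan from the project: process_batch() and its process_one callback are hypothetical, and threads only help if the heavy work runs in native code that releases the GIL. If that turns out not to hold, swapping in ProcessPoolExecutor would be the usual alternative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_one, max_workers=4):
    """Run process_one(path) for every path; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be reported.
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad file should not abort the rest of the batch.
                errors[path] = exc
    return results, errors
```

Collecting errors per file rather than raising immediately means a single corrupt RAW does not waste the work already done on the rest of the batch.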
