Edward Kahler

VP of Technology

Chicago, IL GitHub LinkedIn Resume

Playbook

ETL Pipeline Architecture · Data Platform

A layered Bronze/Silver/Serving approach to building ETL pipelines that support both operational reporting and future AI/ML workloads without re-architecture.

Problem

Most ETL pipelines are built to answer today's question. The schema is designed around the current report, the transformation logic is tightly coupled to the destination, and within 18 months the pipeline is a brittle mess of one-off queries that no one wants to touch. When the business asks for a new cut of the data — or when an ML team wants to train on historical records — you're rewriting from scratch.

The deeper problem: raw data gets transformed in place and the original is gone. If a transformation has a bug, you've corrupted your source of truth. If requirements change, there's no way to replay history through the new logic.

Approach

I use a three-layer architecture — Bronze, Silver, and Serving — where each layer has a single, well-defined responsibility and the data at each stage is preserved.

Bronze is the raw landing zone. Nothing is transformed; everything is written as-received. S3 with Parquet and date-partitioned prefixes. The only acceptable operation here is schema normalization (flattening nested JSON, coercing types to match the declared schema). No business logic. The Bronze layer is the source of truth for replay.

Silver is where business logic lives. PySpark jobs running on AWS Glue read from Bronze, apply transformations (deduplication, derived fields, join enrichment), and write clean, typed records back to S3. Glue jobs are parameterized by execution date, which means any date range can be replayed independently. Silver is the stable contract that downstream consumers depend on.

Serving is purpose-built for consumers. Redshift Serverless materialized views, Glue Data Catalog tables, or direct S3 partitions depending on the consumer. A Tableau workbook gets a different serving schema than a Python ML training pipeline, but both read from the same Silver layer. Serving schemas can be rebuilt from Silver at any time.

AWS Step Functions orchestrates the pipeline. Each stage is an explicit state — Bronze ingest, Silver transform, Serving refresh — with retry logic, failure notifications, and execution history. The graph makes dependencies visible and failures diagnosable.

S3 Raw (Bronze)
    │
    ▼
AWS Glue PySpark (Silver transform)
    │
    ▼
S3 Clean (Silver)
    │
    ├──▶ Redshift Serverless (Serving — BI / Tableau)
    └──▶ S3 Partitioned (Serving — ML training)

Orchestration: AWS Step Functions
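
The orchestration described above can be sketched as a minimal state machine definition. This shows only the Silver and Serving stages (Bronze ingest would be an analogous first Task state), and every name — the Glue job, the Redshift workgroup, the SNS topic ARN, the SQL — is a placeholder, not the production definition:

```json
{
  "Comment": "Silver -> Serving sketch. All resource names are placeholders.",
  "StartAt": "SilverTransform",
  "States": {
    "SilverTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "silver-transform",
        "Arguments": { "--execution-date.$": "$.execution_date" }
      },
      "Retry": [
        { "ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2, "BackoffRate": 2.0 }
      ],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
      "Next": "ServingRefresh"
    },
    "ServingRefresh": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
      "Parameters": {
        "WorkgroupName": "analytics",
        "Database": "serving",
        "Sql": "REFRESH MATERIALIZED VIEW serving.daily_summary;"
      },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-failures",
        "Message": "Pipeline failed. See Step Functions execution history."
      },
      "Next": "FailState"
    },
    "FailState": { "Type": "Fail" }
  }
}
```

The `.sync` suffix on the Glue integration makes the state wait for job completion, which is what lets the graph double as execution history.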

Key Decisions

  • Bronze is immutable: Never transform in-place. If a downstream job has a bug, you replay from Bronze. This is not optional — it's the only way to maintain correctness over time.
  • Glue over raw Lambda for transforms: PySpark scales horizontally and handles schema evolution gracefully. Lambda has a 15-minute limit and memory constraints that make it the wrong tool for large-volume batch transforms.
  • Step Functions over MWAA (Airflow): For a single-team pipeline without cross-system DAG dependencies, Managed Airflow adds operational complexity (clusters, plugins, DAG authoring overhead) with no meaningful benefit. Step Functions is serverless, natively integrated with Glue and SNS, and the visual graph doubles as documentation.
  • Redshift Serverless for Serving: Zero cluster management. Scales to zero between loads. For BI workloads with predictable query patterns this is the right default; dedicated clusters are a premature optimization.
  • Date-partitioned replay by design: Every Glue job takes an execution date parameter. Backfills are a loop, not a special case.
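
The replay decision above reduces a backfill to a driver loop over dates. A sketch in plain Python — `start_silver_job` stands in for whatever actually invokes the Glue job (in production, something like boto3's `glue.start_job_run`); all names here are illustrative:

```python
from datetime import date, timedelta

def date_range(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def backfill(start: date, end: date, start_silver_job):
    """Replay each date partition independently; a failed date retries alone."""
    for d in date_range(start, end):
        # In production this call would be, e.g.:
        #   glue.start_job_run(JobName="silver-transform",
        #                      Arguments={"--execution-date": d.isoformat()})
        start_silver_job(execution_date=d.isoformat())

# Example: backfill one week, collecting the dates a stub job receives
ran = []
backfill(date(2024, 1, 1), date(2024, 1, 7),
         lambda execution_date: ran.append(execution_date))
print(len(ran), ran[0], ran[-1])  # 7 2024-01-01 2024-01-07
```

Because each invocation is independent, the loop can be parallelized or resumed mid-range without special-case logic.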

Where I've Applied This

Built this architecture end-to-end at HCE Group to replace a single-stage ETL that was running hand-written SQL directly against the source database — transforming and discarding raw data in one step. The rebuild added Bronze immutability, Silver transformation jobs in PySpark on Glue, and Step Functions orchestration. The pipeline now handles 40+ source systems and the Silver layer feeds both Tableau dashboards and a planned ML workload.

How to Run This

Inputs you need before starting:

  • Source system connection details (database, API, or flat file)
  • Target data consumers and their query patterns (BI tool, ML pipeline, ad hoc SQL)
  • Expected data volume and refresh frequency

Scaffold prompt (for an AI agent or a developer):

Create an AWS Glue PySpark job that reads from an S3 Bronze prefix (s3://[bucket]/bronze/[source]/[date]/), applies the following transformations: [list transformations], and writes Parquet output to s3://[bucket]/silver/[source]/[date]/. The job must accept --execution-date as a Glue job parameter. Include error handling that fails the job explicitly if the Bronze prefix is empty. Write a corresponding AWS Step Functions state machine definition (JSON) that: (1) triggers the Glue job, (2) waits for completion, (3) on success runs a Redshift COPY command to refresh the Serving layer, (4) on failure sends an SNS notification.

Decision points requiring human review:

  1. Schema evolution strategy — what happens when the source adds a column? (Recommended: additive only; fail on removals.)
  2. Deduplication key — what uniquely identifies a record in Silver?
  3. Serving layer SLA — how stale is acceptable for each consumer?
  4. Partition strategy — date-only vs. date + source shard for very high volume sources.
Python / PySpark · AWS Glue · AWS Step Functions · Redshift Serverless · S3
LLM-Assisted Technical Modernization · AI / Agents

A structured approach to applying LLMs to legacy migration and modernization work, with explicit evaluation gates, human review checkpoints, and patterns for scaling output without scaling headcount.

Problem

Legacy technical debt accumulates in two forms: systems that need to be migrated (SPSS scripts, VBA macros, SQL procedures written ten years ago for a different platform) and workflows that need more throughput than the current team can deliver. Traditional approaches to both problems are slow and expensive because the constraint is the same: engineers have to read, understand, and rewrite large volumes of existing logic by hand.

LLMs change the economics of that specific bottleneck. They don't replace architectural judgment, but they can dramatically accelerate the translation and generation work that consumes most of the calendar time in a migration project.

The risk is treating this as a binary choice. Fully automated output without expert review introduces subtle bugs that are expensive to find later. Ignoring LLM tooling entirely means competing on headcount against teams that aren't. The pattern that works is structured augmentation: AI handles volume, humans own quality gates.

Approach

Four distinct patterns, applied based on the task type:

1. LLM-assisted syntax migration

When the legacy artifact has a clear target (SPSS .sps scripts translating to PySpark Glue jobs, VBA workbooks lifting to web-portable frameworks), LLMs handle the syntax translation well. The workflow: sample representative scripts across complexity tiers, evaluate multiple models on accuracy and consistency, select the best performer per tier, then run production migration with expert review at defined checkpoints.

The evaluation step is not optional. Performance varies significantly across models and task types. A model that handles simple aggregations accurately may struggle with SPSS LOOP constructs or complex DEFINE macro usage. You find this out in evaluation, not in production.

2. Test case generation at scale

When the goal is coverage breadth across a known input space, LLMs can generate thousands of test cases faster than any manual process. The key discipline is defining the input space explicitly before generating. A well-specified prompt describing the domain, field constraints, and known edge cases produces usable output. A vague prompt produces output that looks complete but has systematic gaps.

Applied to Net Price Calculator (NPC) identity path validation: generated every student input combination across the estimation path, then used sampling to validate output quality before integrating into the CI pipeline.

3. AI-assisted code review for high-stakes deliverables

Used AI agents as a first-pass review layer on complex solutions from offshore development partners before senior engineering review. This catches structural issues earlier in the cycle and focuses senior review time on the problems that actually require architectural judgment.

This is a triage step, not a quality gate replacement. The output is a prioritized list of things worth a closer look, not a correctness certificate.

4. Agentic implementation alongside senior architects

The most impactful pattern: architect-level decisions (design, interaction patterns, resource management, security model, lifecycle) owned by senior staff; implementation of stubbed functions and boilerplate delegated to AI assistants. Applied during the NPC microservices build and infrastructure migration work.

At the architect level, the throughput gain on implementation tasks is approximately 100x. The design work takes the same time. What changes is how fast the design becomes running code.

Key Decisions

  • Evaluate before you commit. Multiple models were tested for the SPSS-to-PySpark task before production use. Model selection is not obvious and varies significantly by task type. Do not assume the model you use for general work is the right one for migration work.
  • Human moderation scales with stakes, not volume. Low-complexity output (test cases, boilerplate) can use sampling review. Production pipeline code and anything touching security requires full expert review before integration, regardless of volume.
  • The bottleneck shifts from generation to review. Once AI handles volume, the constraint becomes review throughput. If your review process cannot keep up with generation output, you are building a queue, not a pipeline. Plan for this before you start.
  • AI review is a triage tool. It surfaces things worth examining. It does not replace senior engineering judgment on the items it flags.
  • Intent travels well; obscure syntax does not. The most reliable use case is when you can articulate intent clearly and the target syntax is well-represented in training data. When the target is proprietary or unusual, expect more revision.

Where I've Applied This

  • HCE ETL Platform: SPSS-to-PySpark translation across 40+ source system scripts during the AWS Glue migration
  • HCE Net Price Calculator: identity test case generation across all student estimation path input combinations
  • HCE offshore delivery: AI-assisted first-pass review on complex vendor-submitted solutions before senior engineering review
  • HCE infrastructure operations: AI-assisted CLI scripting for an Azure VM migration, producing a repeatable and auditable migration sequence without an implementation partner
  • HCE NPC microservices: agentic implementation against architect-defined stubs across the .NET 8 service catalog

How to Run This

Inputs you need before starting:

  • Inventory of legacy artifacts (file count, line count, dependency graph depth)
  • Target platform and language
  • Available senior reviewers and their weekly review capacity

For migration work (syntax translation):

Classify your legacy artifact set into three tiers: direct translation (syntax equivalent exists in target), requires refactor (logic is sound but idioms don't map), requires redesign (the target platform needs a different approach). Run 3-5 representative samples from each tier through candidate models. For each sample, evaluate: does the output compile? Is it semantically equivalent? Does it handle the edge cases in the original? Score consistency across runs. Select a model per tier based on this data. Only then run production migration.
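
The per-tier selection step above can be made concrete with a small scorecard. A hedged sketch — the scoring criteria mirror the three evaluation questions (compiles, semantically equivalent, edge cases handled); the `results` data is illustrative, not real evaluation output:

```python
from collections import defaultdict

# Each record: (tier, model, compiles, semantically_equivalent, edge_cases_ok),
# one per sample run. Illustrative data only.
results = [
    ("direct",   "model-a", True,  True,  True),
    ("direct",   "model-b", True,  True,  False),
    ("refactor", "model-a", True,  False, False),
    ("refactor", "model-b", True,  True,  True),
]

def score(rec):
    # Weight semantic equivalence and edge-case handling above mere compilation.
    _, _, compiles, equivalent, edges = rec
    return 1 * compiles + 2 * equivalent + 2 * edges

def pick_model_per_tier(results):
    """Total each model's score within a tier and keep the best performer."""
    totals = defaultdict(int)
    for rec in results:
        totals[(rec[0], rec[1])] += score(rec)
    best = {}
    for (tier, model), total in totals.items():
        if tier not in best or total > best[tier][1]:
            best[tier] = (model, total)
    return {tier: model for tier, (model, _) in best.items()}

print(pick_model_per_tier(results))  # {'direct': 'model-a', 'refactor': 'model-b'}
```

Scoring consistency across repeated runs (not shown) matters as much as the single-run score: a model that is right 4 times out of 5 creates review work the scorecard should penalize.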

For test case generation:

Define the full input space before generating: all field combinations, known boundary conditions, documented edge cases. Provide this as explicit constraints in the prompt. Generate at volume. Sample-review a representative set (aim for 10-15% of output for medium-stakes coverage, 100% for anything security-adjacent). Integrate passing output into the CI pipeline and track defect escape rate over time.
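
The sample review above works best as a seeded draw, so the reviewed subset is reproducible for audit. A minimal sketch — the rate and the security classifier are assumptions for illustration:

```python
import random

def review_sample(cases, rate=0.10, seed=42, security_adjacent=lambda c: False):
    """Security-adjacent cases get 100% review; the rest are sampled at `rate`."""
    must_review = [c for c in cases if security_adjacent(c)]
    rest = [c for c in cases if not security_adjacent(c)]
    rng = random.Random(seed)  # seeded so the reviewed set is reproducible
    k = max(1, round(len(rest) * rate))
    return must_review + rng.sample(rest, k)

cases = [f"case-{i}" for i in range(100)]
picked = review_sample(cases, rate=0.10,
                       security_adjacent=lambda c: c == "case-7")
print(len(picked))  # 11: case-7 plus a 10% sample of the remaining 99
```

Raising `rate` toward 1.0 for higher-stakes output implements the "moderation scales with stakes" decision directly.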

Decision points requiring human review:

  1. Tier classification for each artifact type — what looks like direct translation often isn't
  2. Sampling strategy — which outputs get full review vs. spot-check, and at what threshold
  3. Edge case handling in translated output — the most common source of production bugs post-migration
  4. Any security-adjacent code (authentication, encryption, PII handling) — always full review regardless of complexity tier
Claude / GPT-4 class LLMs · Python / PySpark · GitHub Copilot / Cursor · AWS
Triple Buffer Architecture · Systems / Graphics

A lock-free triple-buffer pattern for smooth real-time rendering pipelines where a producer and consumer run at different rates on separate threads.

Problem

Double buffering solves screen tearing but introduces a harder problem in real-time systems: if the producer (camera capture, sensor input, simulation) and the consumer (renderer) run at different rates, one of them blocks waiting for the other. A camera running at 60fps and a renderer running at 120fps either waste frames or introduce latency spikes as the renderer waits for new data.

In an interactive installation — where a physical visitor is standing in front of a screen expecting mirror-like responsiveness — latency spikes are immediately perceptible. The renderer cannot block. The camera cannot drop frames because the renderer isn't ready. These constraints are incompatible with a double-buffer design.

Approach

The triple buffer pattern uses three buffers and an atomic pointer to eliminate blocking entirely. The producer always writes to a private "back" buffer. When a frame is ready, it atomically swaps its back buffer with a shared "ready" buffer. The consumer reads from its own private "front" buffer and atomically swaps with the ready buffer whenever it wants the freshest available frame. Neither thread ever waits for the other.

Producer thread         Shared state          Consumer thread
─────────────────────   ─────────────────────  ─────────────────────
[back buffer]           [ready buffer] ◄──────► [front buffer]
     │                        ▲                       │
     │  write frame           │ atomic swap           │  render frame
     └────────────────────────┘                       │
                              └───────────────────────┘
                                atomic swap (on demand)

The key insight is that the ready buffer can be overwritten. If the producer generates two frames before the consumer reads, the first is simply replaced. In a real-time display context this is correct behavior — the consumer wants the freshest frame, not a queued one. This is fundamentally different from a FIFO queue and the distinction matters: use a queue if every frame must be processed; use a triple buffer if you want the latest state.

The atomic swap is the entire synchronization mechanism. No mutex, no condition variable, no blocking. The producer and consumer run at their natural rates and the ready buffer absorbs the rate mismatch.
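
To make the swap semantics concrete, here is a Python sketch of the slot bookkeeping. Python has no lock-free atomics, so a lock stands in for what is a single atomic exchange in the C++ version described below; the point of the sketch is the index rotation and the overwrite behavior, not the synchronization primitive:

```python
import threading

class TripleBuffer:
    def __init__(self):
        self._slots = [None, None, None]
        self._back = 0       # producer-owned slot index
        self._ready = 1      # shared slot index
        self._front = 2      # consumer-owned slot index
        self._fresh = False  # has ready been updated since the last read?
        self._lock = threading.Lock()  # stands in for one atomic exchange

    def producer_write(self, frame):
        self._slots[self._back] = frame       # write to private back buffer
        with self._lock:                      # C++: atomic swap of packed indices
            self._back, self._ready = self._ready, self._back
            self._fresh = True                # overwrites any unread frame

    def consumer_read(self):
        with self._lock:
            if not self._fresh:
                return None                   # no new frame since last read
            self._front, self._ready = self._ready, self._front
            self._fresh = False
        return self._slots[self._front]       # read from private front buffer

buf = TripleBuffer()
for frame in ("frame-1", "frame-2", "frame-3"):
    buf.producer_write(frame)   # frame-1 and frame-2 are silently overwritten
print(buf.consumer_read())      # frame-3: the freshest frame, not a queued one
print(buf.consumer_read())      # None: nothing new since the last read
```

Note how the producer never touches the consumer's slot and vice versa; only the `ready` index changes hands, which is what makes the single-word atomic exchange sufficient in the real implementation.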

Key Decisions

  • Atomic pointer, not mutex: A mutex would reintroduce blocking. The whole point is that neither thread waits. On modern hardware, an atomic compare-and-swap on a pointer is a single instruction.
  • Overwrite semantics, not queue semantics: For display pipelines, frame queuing causes latency to accumulate under load. If the renderer is slower than the camera for three frames, a queue delivers frames that are 3x stale. The triple buffer delivers the most recent frame available at the moment of consumption.
  • Three buffers, not two: Two buffers would require the producer to wait for the consumer to release the shared buffer before writing the next frame. Three buffers give each thread its own private buffer at all times with one buffer in the shared ready slot.
  • No frame skip compensation: In this use case (interactive installation), skipping a frame is correct. The alternative — slowing the renderer to match the camera — introduces latency, which is the worse failure mode here.

Where I've Applied This

Implemented this as the core of the camera server for the GrandVision Magic Mirror installation — a large-format interactive display where a camera captured visitors and the renderer applied real-time effects. The camera ran at 30fps over USB; the renderer ran at 60fps on an OpenGL pipeline. The triple buffer eliminated the blocking that was causing visible stutter in the initial double-buffer implementation. The camera server ran continuously for months in a retail environment without a threading incident.

How to Run This

Inputs you need before starting:

  • Producer thread rate and consumer thread rate (they can differ)
  • Buffer data type (raw frame bytes, struct, shared_ptr to heap allocation)
  • Target language (C++ shown here; adaptable to Rust, Go)

Implementation prompt (for an AI agent or a developer):

Implement a lock-free triple buffer in C++ with the following interface:

  • TripleBuffer<T> templated on the buffer type
  • producer_write(T& frame) — copies frame into the back buffer, then atomically swaps back with ready
  • consumer_read(T& out) — atomically swaps front with ready if a new frame is available, copies front into out; returns false if no new frame since last read
  • Use std::atomic<int> with a bitmask to track which slot is "ready" and whether it has been updated since the last consumer read (the "dirty" flag)
  • All operations must be wait-free (no spin loops, no blocking)

Include a unit test that runs a producer at 30fps and a consumer at 60fps for 5 seconds on separate threads and asserts: (1) no race conditions (run under ThreadSanitizer), (2) the consumer never blocks, (3) maximum observed latency is under 35ms.

Decision points requiring human review:

  1. Buffer ownership — stack allocation (copy semantics) vs. heap with shared_ptr (move semantics). For large frames (raw camera data), heap allocation avoids copying.
  2. Dirty flag necessity — if the consumer can tolerate re-rendering the same frame, skip the dirty flag. If rendering is expensive, the dirty flag avoids redundant work.
  3. Memory ordering — std::memory_order_acq_rel on swaps is safe and correct. std::memory_order_relaxed is faster but requires careful analysis of happens-before relationships.
C++ · OpenGL · POSIX Threads
React + Netlify + Supabase + SSO · Full-Stack Web

A JAMstack pattern for public/private portfolio or SaaS apps — statically hosted, auth-gated private sections, zero backend ops, SSO via Google.

Problem

Most full-stack web apps are over-engineered for their actual requirements. A portfolio site with a private job-tracker section, or a SaaS MVP with a public marketing page and authenticated dashboard, does not need a dedicated backend server, a container orchestration platform, or a managed Kubernetes cluster. But it does need: static hosting with global CDN, row-level security on the database, auth that a real user will actually complete (not a custom login form), and a local development environment that matches production closely enough to be trustworthy.

The trap is reaching for a framework that solves all of this and ends up owning your architecture. Next.js server components, Remix loaders, full-stack meta-frameworks — these are valid choices for the right problem, but they introduce complexity (server rendering, edge functions, cold starts) that a JAMstack app doesn't need.

Approach

The pattern is a static React SPA hosted on Netlify with Supabase handling auth and data. No server. No container. No backend deployments. The entire operational surface is: push to master, Netlify builds and deploys in 90 seconds, done.

Auth: Supabase Auth with Google OAuth. The user clicks "Sign in with Google," Supabase handles the OAuth flow and issues a JWT, the React app reads the session from supabase.auth.getSession() on mount. No passwords, no password reset flows, no user table management. SSO means real users actually complete sign-in — custom login forms have abandonment rates that OAuth flows don't.

Routing: React Router with a protected route wrapper. Public routes render without checking auth. Protected routes check the Supabase session; if null, redirect to a login screen. The session check is synchronous after mount — no flash of private content, no server-side redirect.

Data access: Supabase Row Level Security (RLS) enforces that users can only read their own rows, regardless of what the client sends. The frontend uses the Supabase JS client directly — no API layer, no Express server, no GraphQL gateway. RLS is the security boundary. The JWT from Auth is automatically attached to every Supabase query.

Build pipeline: Vite for local dev and production build. vite build produces a static dist/ directory. Netlify's build command runs vite build; the publish directory is dist. Environment variables (Supabase URL, anon key) are set in the Netlify dashboard and injected at build time via import.meta.env.

Browser
  │
  ├── Public routes ──► Netlify CDN (static HTML/JS/CSS)
  │
  └── Protected routes ──► Supabase Auth (JWT)
                               │
                               └── Supabase DB (RLS-enforced queries)

Pre-rendering for SEO: Public pages that need to be crawled add a prerender.js script that runs after vite build, reads the data, and generates static HTML files. This is not SSR — it's a one-time build-time HTML generation step. Crawlers and AI scrapers get real content; the browser gets the React app.

Key Decisions

  • Supabase over Firebase: Postgres + row-level security is more expressive than Firestore security rules, and SQL is a skill that transfers. Firebase locks you into a NoSQL document model and a proprietary query language.
  • Netlify over Vercel: No strong preference for a static SPA, but Netlify's branch deploy previews and form handling (relevant for feature-contact.md) are worth noting. Either works.
  • Anon key in client bundle is fine: Supabase's anon key is not a secret — it's a publishable key. RLS is the security layer, not key secrecy. The service role key never touches the client.
  • No API layer: Adding an Express server or serverless functions between the client and Supabase is complexity without benefit when RLS already enforces access control. The API layer adds latency, a deployment surface, and code to maintain for no security or capability gain.
  • Vite over CRA: Create React App is effectively deprecated. Vite is faster, has a better plugin model, and the vite.config.js virtual module pattern enables build-time data loading without a backend.

Where I've Applied This

This site. The portfolio and job tracker are built on this exact stack — Vite + React, Netlify hosting, Supabase auth with Google SSO, RLS-enforced private routes for the tracker. The pattern has been stable across multiple feature additions with no backend incidents.

Also the template for two other side projects where the public/private split and zero-ops requirement matched the same constraints.

How to Run This

The bootstrap template for this pattern is published at github.com/ekahle2/react-netlify-supabase-sso. Clone or use the GitHub template button to start. The README is a full operator's guide — step-by-step setup for Supabase, Google OAuth, Netlify, and the email allowlist. The entry below describes the pattern; the repo is the concrete artifact.

Inputs you need before starting:

  • Google Cloud project with OAuth 2.0 credentials (client ID + secret)
  • Supabase project (free tier is sufficient)
  • Netlify account

Scaffold prompt (for an AI agent or a developer):

Clone https://github.com/ekahle2/react-netlify-supabase-sso and work through the README setup steps in order:

  1. Create a new Supabase project; copy VITE_SUPABASE_URL and VITE_SUPABASE_ANON_KEY into .env.local
  2. Create a Google OAuth 2.0 client; set redirect URI to https://<supabase-ref>.supabase.co/auth/v1/callback; enter credentials in Supabase → Authentication → Providers → Google
  3. Update ALLOWED_EMAILS in src/context/AuthContext.jsx — first config change; anyone not listed is signed out immediately
  4. Replace src/hooks/useData.js with your schema's CRUD hook; keep the hook pattern (useCallback loadData, useEffect trigger, CRUD functions)
  5. Connect to Netlify; set VITE_SUPABASE_URL and VITE_SUPABASE_ANON_KEY as environment variables; add the Netlify URL to Supabase Auth redirect URLs

What's generic and reusable as-is: AuthContext.jsx, useAuth.js, supabase.js, ErrorBoundary.jsx, LoginScreen.jsx, TabNav.jsx, netlify.toml (update CSP connect-src only)

What's domain-specific (replace): useData.js, src/pages/AppShell.jsx, any feature components

Decision points requiring human review:

  1. Session persistence — use supabase.auth.onAuthStateChange listener, not getSession() polling. The listener handles token refresh automatically.
  2. RLS policy — start with auth.uid() = user_id on all tables. Add permissive policies only with a specific, justified reason.
  3. Pre-rendering scope — run prerender.js only for public routes. Never attempt to pre-render authenticated routes.
  4. Environment variables — never commit .env.local. Netlify injects VITE_* vars at build time.
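
Decision point 2's default policy, expressed as Postgres DDL. The table and policy names are illustrative; `auth.uid()` is Supabase's helper that resolves to the JWT subject of the current request:

```sql
-- Deny-by-default: enabling RLS blocks all access until a policy grants it.
alter table tracker_items enable row level security;

-- Owners can read and write only their own rows.
create policy "tracker_items_owner" on tracker_items
  for all
  using (auth.uid() = user_id)
  with check (auth.uid() = user_id);
```

The `using` clause filters reads and the `with check` clause validates writes, so a client cannot insert rows attributed to another user even with a hand-crafted request.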
React 18 · Vite · Netlify · Supabase Auth · PostgreSQL (Supabase)
Canvas-Based Charting · Frontend

When SVG charting libraries hit their performance ceiling, canvas immediate-mode rendering unlocks smooth animation and real-time updates at scale — without a heavy library dependency.

Problem

SVG-based charting libraries (Recharts, D3 rendering to SVG, Highcharts in its default SVG mode) work well at moderate data volumes. Each data point is a DOM element, and the browser's layout engine has to track, style, and paint every one of them. Below a few hundred points this cost is invisible. At 5,000+ data points, with animation or real-time updates, the DOM overhead becomes perceptible: frame rate drops, interactions feel sluggish, and memory climbs.

The typical response is to reach for a "high-performance" charting library. Most of them still use SVG. The actual fix is canvas.

Approach

Canvas is an immediate-mode rendering surface. You paint pixels; the browser tracks nothing. There are no DOM nodes, no layout calculations, no retained scene graph. At 50,000 data points, repaint cost is whatever your drawing code does: the constraint is your rendering logic, not the browser's scene graph.

The pattern I use is a React component that owns the canvas element, the animation loop, and the data state, with chart rendering delegated to pure functions:

import { useEffect, useRef } from 'react'

function LineChart({ data, width, height }) {
  const canvasRef = useRef(null)
  const rafRef    = useRef(null)

  useEffect(() => {
    const ctx = canvasRef.current.getContext('2d')

    function render() {
      ctx.clearRect(0, 0, width, height)
      drawAxes(ctx, data, width, height)
      drawLine(ctx, data, width, height)
      // If animating, schedule the next frame:
      //   rafRef.current = requestAnimationFrame(render)
    }

    render()
    return () => cancelAnimationFrame(rafRef.current)
  }, [data, width, height])

  return <canvas ref={canvasRef} width={width} height={height} />
}

drawAxes and drawLine are pure functions — canvas context + data in, pixels out. No state, no side effects, easy to test, easy to swap. The component is the lifecycle manager; the draw functions are the chart.

For real-time updates (streaming data, live dashboards), the loop runs on requestAnimationFrame. Data flows in via a ref — not state — so updates don't trigger React re-renders. The canvas updates at 60fps without React's reconciliation overhead.

When to use canvas over SVG:

  • More than ~1,000 data points with animation or real-time updates
  • Custom rendering that doesn't map to standard chart types
  • Pixel-level control (density maps, heatmaps, custom sparklines)

When SVG is still right:

  • Standard charts at moderate scale where accessibility matters (SVG elements can carry ARIA)
  • Charts that need CSS hover states, tooltips on individual elements, or clip-path effects
  • Output that needs to be exported as scalable vector (print, PDF)

Key Decisions

  • React ref for the canvas, not state: The canvas element is mutable and imperative by nature. Putting it in state would cause unnecessary re-renders. A ref gives direct DOM access without React's rendering cycle.
  • Pure draw functions, not class methods: Keeping rendering logic as pure functions makes it testable (pass a mock canvas context) and composable (combine chart types by calling multiple draw functions in sequence).
  • Data via ref for real-time paths: If data updates faster than React's render cycle, use a ref to hold the latest data and read it in the animation loop. This avoids stale closures and unnecessary re-renders.
  • DPR scaling for retina displays: A canvas sized to CSS pixels will look blurry on high-DPI screens. Always scale by window.devicePixelRatio and set CSS width/height separately from canvas width/height.

Where I've Applied This

Built canvas-based real-time visualization for interactive installation work at Second Story — display pipelines where SVG's DOM overhead was incompatible with the frame rate requirements of the experience. Also applied this pattern for the TacTile multi-touch platform, where the rendering layer needed to be decoupled from both the touch event system and the content data model.

How to Run This

Inputs you need before starting:

  • Data shape (array of {x, y} points, time series, categorical, etc.)
  • Update frequency (static, user-triggered, or real-time streaming)
  • Target container size and DPR handling requirement

Scaffold prompt (for an AI agent or a developer):

Build a React LineChart component using the HTML Canvas API with the following requirements:

  • Props: data: Array<{x: number, y: number}>, width: number, height: number, color?: string
  • Scale x and y to fit the data range within the canvas, with 10% padding on each axis
  • Draw x and y axes with 5 evenly-spaced tick marks and labels
  • Draw the line connecting all data points using lineTo
  • Scale the canvas by window.devicePixelRatio for retina displays (set canvas width/height to cssWidth * dpr, then ctx.scale(dpr, dpr), then set CSS width/height to cssWidth/cssHeight)
  • Re-render on data change using a useEffect — do not use requestAnimationFrame unless animated prop is true
  • Keep all drawing logic in pure functions outside the component: drawAxes(ctx, xScale, yScale) and drawLine(ctx, data, xScale, yScale)

Decision points requiring human review:

  1. Animation loop vs. effect re-render — if data updates more than ~10x per second, move to requestAnimationFrame with a data ref. For slower updates, a useEffect redraw is simpler and correct.
  2. Tooltip/hover interaction — canvas has no built-in hit testing. Implement with mousemove on the canvas element, converting mouse coordinates to data space and finding the nearest point.
  3. Accessibility — canvas is opaque to screen readers. If accessibility is a requirement, add a visually-hidden <table> with the same data, or use SVG for this use case.
React · HTML Canvas API · requestAnimationFrame
Boid Simulation · Simulation / Graphics

Emergent flocking behavior from three simple rules — and why the naive O(N²) implementation fails at scale and how spatial partitioning and GPU compute fix it.

Problem

Boids (Bird-oid objects) simulate emergent flocking behavior from three local rules applied to each agent: separation (avoid crowding neighbors), alignment (steer toward average heading of neighbors), cohesion (steer toward average position of neighbors). The rules are simple. The scale problem is not.

A naive implementation checks every agent against every other agent each frame — O(N²) neighbor lookups. At 500 agents that's 250,000 comparisons per frame. At 5,000 agents it's 25 million. At 60fps, a simulation with 10,000 agents and a naive neighbor search is doing 6 billion comparisons per second. This is where the demo breaks.

The interesting engineering problem is getting from O(N²) to something that scales, and then understanding when the CPU stops being the right tool entirely.

Approach

Stage 1: Spatial partitioning (CPU, O(N))

Divide the world into a uniform grid. Cell size = max influence radius (the distance at which an agent can affect another). Each frame, place every agent in its cell. For neighbor lookups, check only the agent's own cell plus the 8 surrounding cells (a 3×3 block; 27 cells in 3D). With a well-tuned cell size, the average occupancy per cell is a small constant — so each agent's neighbor lookup drops from O(N) to O(k) for a small constant k, and the whole frame from O(N²) to O(N).

World divided into cells of size = influence_radius:

 ┌───┬───┬───┬───┐
 │ · │ · │ · │   │
 ├───┼───┼───┼───┤   Agent at (★) checks its own
 │ · │ ★ │ · │   │   cell plus the 8 marked (·):
 ├───┼───┼───┼───┤   a 3×3 block in 2D, 3×3×3 in 3D
 │ · │ · │ · │   │
 └───┴───┴───┴───┘

This brings a 10,000-agent simulation from 100M comparisons/frame down to roughly 90 comparisons per agent (9 cells × ~10 agents/cell), about 900K per frame: a reduction of more than 100x.
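A minimal sketch of the uniform grid in Python (illustrative — the class and method names here are my own, not from a specific codebase; the scaffold prompt below describes the equivalent C++ structure):

```python
from collections import defaultdict

class UniformGrid:
    """Spatial hash: cell size = influence radius, rebuilt every frame."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> list of agent indices

    def clear(self):
        self.cells.clear()

    def insert(self, index, pos):
        cx, cy = int(pos[0] // self.cell_size), int(pos[1] // self.cell_size)
        self.cells[(cx, cy)].append(index)

    def neighbors(self, pos):
        """Agents in the 3x3 block of cells around pos -- the O(k) lookup."""
        cx, cy = int(pos[0] // self.cell_size), int(pos[1] // self.cell_size)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                out.extend(self.cells.get((cx + dx, cy + dy), ()))
        return out

# Rebuild each frame, then query per agent
grid = UniformGrid(cell_size=10.0)
positions = [(5.0, 5.0), (12.0, 5.0), (55.0, 55.0)]
grid.clear()
for i, p in enumerate(positions):
    grid.insert(i, p)
# Agents 0 and 1 sit in adjacent cells; agent 2 is far away and never compared
```

The per-frame clear-and-rebuild is deliberately simple: with moving agents, rebuilding is cheaper than incrementally updating a tree structure.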

Stage 2: GPU compute (massive parallelism)

The spatial grid approach saturates a single CPU core around 50,000 agents. The next ceiling is the CPU itself — each agent update is independent, which is exactly the parallelism profile that GPUs are designed for.

Using GLSL compute shaders (or WebGL transform feedback for browser targets), agent positions and velocities are stored in GPU textures or SSBOs (Shader Storage Buffer Objects). The simulation step runs as a parallel kernel — one shader invocation per agent, all running simultaneously on GPU cores.

The GPU boid loop:

  1. Read: Each invocation reads its agent's position/velocity from the input buffer
  2. Neighbor scan: Reads from the spatial grid (also in GPU memory) to find nearby agents
  3. Apply rules: Compute separation, alignment, cohesion forces
  4. Write: Writes updated position/velocity to the output buffer
  5. Render: Output buffer is used directly as vertex data for the draw call — no CPU readback

No CPU readback is the key. The simulation stays entirely on the GPU until the draw call. At 1M agents, the GPU handles the compute in a few milliseconds per frame while the CPU is free for physics, audio, or UI.
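The ping-pong buffer pattern underneath steps 1–4 (read from buffer A, write to buffer B, swap) can be sketched on the CPU; on the GPU, each loop body below runs as one shader invocation per agent. This is an illustrative analogue, not GPU code:

```python
def step(src, dst, dt=1.0):
    """One simulation tick: read every agent from src, write to dst.
    Each iteration is independent -- this is the per-invocation GPU kernel."""
    for i, (pos, vel) in enumerate(src):
        # (a real kernel does the neighbor scan + the three boid rules here)
        dst[i] = (pos + vel * dt, vel)

agents = [(0.0, 1.0), (10.0, -2.0)]   # (position, velocity), 1D for brevity
buf_a = list(agents)
buf_b = [None] * len(agents)

for _ in range(3):                    # three ticks
    step(buf_a, buf_b)
    buf_a, buf_b = buf_b, buf_a       # swap: this tick's output is next tick's input

# buf_a now holds the latest state; positions advanced by 3 * velocity
```

The swap is the whole trick: no buffer is ever read and written in the same pass, which is what makes the update safe to run fully in parallel.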

Key Decisions

  • Uniform grid over k-d tree: k-d trees have better worst-case behavior but are expensive to rebuild every frame (agents move). For a uniform-speed simulation where agents are roughly evenly distributed, a uniform grid rebuilt each frame is faster in practice and trivially GPU-friendly.
  • Transform feedback over compute shaders for WebGL: WebGL 2 lacks compute shader support; transform feedback achieves the same ping-pong buffer pattern (read from A, write to B, swap) using vertex shaders. If targeting modern desktop (Vulkan/Metal/DX12), prefer compute shaders — cleaner API.
  • Fixed-size agent pool, not dynamic allocation: Allocating/freeing GPU memory mid-simulation causes hitches. Allocate the maximum agent count at startup; use a count uniform to control how many are active. Dead agents contribute zero force.
  • Wrap-around world over bounded: Boundary conditions (reflecting edges, attractors) add if-branches in the shader. A wrap-around torus world eliminates boundary logic entirely — agents that exit one edge enter the other. Start here; add boundaries only if the visual requires it.

Where I've Applied This

Implemented CPU boids (with spatial grid) and GPU boids (WebGL transform feedback) for real-time generative installations. The GPU path enabled 100,000+ simultaneous agents at 60fps in a browser context, which was the requirement for a large-format public display.

Also explored boid variants at Argonne National Laboratory in the context of agent-based crowd simulation — HPC environments have a different constraint profile (memory bandwidth, MPI communication between nodes) but the spatial partitioning fundamentals are the same.

How to Run This

Inputs you need before starting:

  • Target agent count (determines whether CPU spatial grid or GPU compute is needed)
  • Target platform (browser → WebGL transform feedback; native → compute shaders; massive scale → Vulkan/CUDA)
  • Boundary behavior: wrap, reflect, or attractor

Scaffold prompt — CPU spatial grid (for an AI agent or a developer):

Implement a 2D boid simulation in C++ with the following architecture:

  • Agent struct: vec2 position, vec2 velocity, int cell_id
  • SpatialGrid class: world divided into cells of size influence_radius; methods: clear(), insert(agent_index, position), neighbors(position) → vector<int> (returns indices of agents in the 3x3 cell neighborhood)
  • Simulation::step(): (1) clear and rebuild the grid, (2) for each agent, query neighbors, compute separation/alignment/cohesion forces weighted by w_sep, w_ali, w_coh, (3) clamp speed to max_speed, (4) update position with wrap-around (fmod)
  • Render with OpenGL: agent positions uploaded as a VBO each frame; render as GL_POINTS with a point sprite shader

Target: 10,000 agents at 60fps on a single CPU core.

Decision points requiring human review:

  1. Influence radius tuning — too large and agents all see each other (degrades to O(N²)); too small and the flock fragments. Start with influence_radius = world_size / sqrt(agent_count).
  2. Weight tuning — separation weight dominates at close range (prevents clumping); cohesion and alignment provide the flock structure. A reasonable start: w_sep = 1.5, w_ali = 1.0, w_coh = 1.0.
  3. GPU readback — if you need agent positions on the CPU (for collision, physics), GPU readback is expensive. Design the pipeline to keep positions GPU-side and only surface aggregate data (counts, bounding box) to CPU.
C++ / OpenGL · GLSL (compute shaders) · WebGL (browser port)
Multi-Tier Connected Display Architecture Connected Systems

Synchronizing physically separate display tiers — large-format projection, interactive kiosks, and mobile — into a single coherent experience with isolated failure boundaries.

Problem

Large-scale public installations often involve multiple independent display systems that need to behave as a single coherent experience. A shared content state — what's currently featured on the main projection — should be reflected on nearby kiosks. A visitor who interacts with a kiosk should be able to continue that interaction on their mobile device. These are three physically separate systems, built on different platforms, with different update rates and failure modes.

The naive approach is tight coupling: the projection controller tells the kiosks what to show, and the kiosks tell mobile what to do. This works until one tier fails — and in a public venue, failure is visible. A projection hiccup taking the kiosk offline is an unacceptable blast radius.

Approach

The pattern is a central synchronization layer that owns shared state and publishes events to subscribed tiers. Each tier subscribes to the state it cares about; no tier communicates directly with another. The synchronization layer is the single source of truth.

                    ┌──────────────────────────┐
                    │  Synchronization Layer   │
                    │  (Node.js / WebSockets)  │
                    │  owns: content state     │
                    │  owns: interaction log   │
                    └─────────┬────────────────┘
                              │ publish events
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
      [Beacon Projection]   reads: active content | writes: none
      [Kiosk Tier]          reads: active content | writes: user interactions
      [Mobile Tier]         reads: session continuity token

Each tier has a defined read/write contract:

  • Projection: reads active content from the sync layer; never writes. It's a display, not an input device.
  • Kiosks: read active content; write user interactions (touch events, selections) back to the sync layer.
  • Mobile: reads a session continuity token set when a visitor scans a QR code at a kiosk. The mobile experience picks up from where the kiosk left off.

Failure isolation comes from this structure: the projection can go dark without affecting kiosk interaction. A kiosk reboot doesn't interrupt the projection or invalidate mobile sessions. Each tier reconnects and reads the current state from the sync layer on startup — no state lives in the tier itself.

State machine per content cycle: The sync layer runs a content state machine — idle, featured-content, user-session, transition. State transitions are triggered by the kiosk tier (user starts an interaction) or by a timeout (no interaction for N seconds → return to idle, resume ambient projection cycle). The state machine's current state is what gets published to all tiers.
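A sketch of that content state machine in Python (the Node.js sync layer holds the equivalent; the state names match the modes above, but method names are illustrative):

```python
IDLE, FEATURED, SESSION = "idle", "featured-content", "user-session"

class ContentStateMachine:
    def __init__(self, timeout_s=60):
        self.state = IDLE
        self.timeout_s = timeout_s
        self.last_event_t = 0.0

    def on_event(self, event, t):
        """Kiosk-driven transitions; every event resets the inactivity clock."""
        self.last_event_t = t
        if event == "select_content":
            self.state = FEATURED
        elif event == "start_session":
            self.state = SESSION

    def tick(self, t):
        """Timeout-driven recovery: no interaction for N seconds -> idle."""
        if self.state != IDLE and t - self.last_event_t >= self.timeout_s:
            self.state = IDLE
        return self.state   # this is what gets published to all tiers

sm = ContentStateMachine(timeout_s=60)
sm.on_event("start_session", t=0.0)
```

The tick-based timeout is what prevents a failed mid-session interaction from leaving every tier stuck in "session in progress".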

Key Decisions

  • Central state owner, not peer-to-peer messaging: Peer-to-peer event buses (tiers publish directly to each other) create N×N coupling. A single state owner creates N connections to one hub. Adding a new display tier means subscribing to the sync layer, not wiring into every existing tier.
  • WebSockets over polling: Public venue displays need sub-second synchronization. HTTP polling at 1s intervals creates visible lag between, say, a kiosk selection and the projection response. WebSocket push eliminates this.
  • Tiers reconnect on startup, don't cache state: Each tier reads current state from the sync layer on connect. No tier persists state locally. This means any tier can be rebooted without data loss or state drift.
  • Timeout-driven transitions, not event-driven only: User interactions drive state forward; timeouts drive state back to idle. Without timeout recovery, a failed mid-session interaction could leave all tiers in a "session in progress" state indefinitely.

Where I've Applied This

Designed and built this architecture for the SFO Terminal 3 Flight Deck at San Francisco International Airport (Second Story / Publicis Groupe, 2014). The installation connected a large-scale beacon projection, six multi-touch interactive kiosks, and a mobile takeaway experience for 35M+ annual passengers. The project won 1st Place Creative Innovations Award at the 2014 Peggy G. Hereford Award ceremony.

How to Run This

Inputs you need before starting:

  • Number of display tiers and their platform constraints (web, native, embedded)
  • Shared state schema: what data must all tiers agree on? (current content ID, active session token, etc.)
  • Failure behavior requirements: what should each tier do if it loses connection to the sync layer?

Scaffold prompt (for an AI agent or a developer):

Build a Node.js synchronization server using the ws WebSocket library with the following architecture:

  • Maintains a sharedState object: { activeContentId, sessionToken, mode: 'idle' | 'featured' | 'session' }
  • On client connect: immediately sends the current sharedState as a SYNC message
  • Accepts ACTION messages from kiosk clients: { type: 'SELECT_CONTENT', contentId } and { type: 'START_SESSION', visitorId }
  • On SELECT_CONTENT: updates sharedState.activeContentId, broadcasts updated state to all connected clients
  • On START_SESSION: sets mode to 'session', generates a session token, broadcasts; sets a 60-second inactivity timeout that returns mode to 'idle' and broadcasts
  • On TIMEOUT: transitions mode back to 'idle', clears session token, broadcasts

Client-side: on connect, apply the received SYNC message as initial state; on disconnect, show a "reconnecting" state and retry with exponential backoff; reapply SYNC on reconnect.

Decision points requiring human review:

  1. Session persistence on sync layer restart — if the sync server reboots, all tiers receive a fresh state (idle). Decide if mid-session state needs to survive server restarts (requires Redis or similar).
  2. Conflict resolution — what if two kiosks simultaneously submit SELECT_CONTENT? Last-write-wins is simplest; first-write-wins requires a lock. For low-contention installations, last-write-wins is correct.
  3. Mobile session hand-off — the QR code encodes the session token. Mobile clients connect to the sync layer and subscribe to that session token. The sync layer routes kiosk events to the subscribed mobile client.
Node.js (synchronization layer) · WebSockets · iOS (mobile tier) · Multi-touch kiosk systems · Beacon / projection control
Computer Vision Pipeline Computer Vision

How to structure a real-time OpenCV pipeline with multiple competing timing constraints — capture rate, processing latency, transmission, and display — without blocking stages on each other.

Problem

A real-time computer vision pipeline has multiple stages: capture frames from the camera, process them (detection, recognition, calibration), encode and transmit results, and display or act on the output. Each stage has its own timing constraint:

  • The camera runs at a fixed hardware framerate (30fps, 60fps). Falling behind means dropped frames — scan quality degrades.
  • Processing (face detection, 3D reconstruction) is computationally expensive. A single frame might take 80ms.
  • Transmission has network latency + encoding overhead. JPEG encoding a 4K frame takes 40ms without hardware acceleration.
  • Display or downstream consumers want the freshest available data, not a queue of stale frames.

The naive single-threaded pipeline blocks every stage on the slowest one. An 80ms processing step caps the whole pipeline at ~12fps, even if the camera is capable of 60fps.

Approach

Stage isolation with non-blocking handoff between stages. Each stage runs on its own thread; stages communicate through buffers with defined ownership semantics.

Camera Thread           Process Thread          Transmit Thread
─────────────────       ─────────────────       ─────────────────
capture frame           read latest frame       read result queue
  │                       │                       │
  │ triple buffer         ▼                       ▼
  └───────────────►     detection/recon         encode JPEG
                          │                     (libjpeg-turbo)
                          └──────────────►      UDP send
                           result queue

Capture thread: Reads frames from the camera at hardware rate. Writes to a triple buffer (see Triple Buffer Architecture entry). Never blocks. If the process thread is slow, the capture thread overwrites the ready buffer — the process thread always gets the most recent frame, not a queued one.

Process thread: Reads the latest frame from the triple buffer. Runs detection or reconstruction. This is the slow stage — it may take multiple frame intervals. That's fine: it processes as fast as it can, always starting from the freshest available frame.

Transmit thread: Independent of both capture and process. Takes completed results from a result queue, encodes JPEG with libjpeg-turbo (SIMD-accelerated path), and sends over UDP. Uses a bounded queue: if the network is slow, old results are dropped in favor of newer ones. The consumer (display, downstream service) wants the latest result, not a backlog.
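A bounded drop-oldest queue is a few lines; this Python sketch mirrors what the C++ version does with a mutex-guarded deque (names are illustrative):

```python
from collections import deque
import threading

class BoundedQueue:
    """Capacity-limited queue that drops the oldest item on overflow,
    keeping end-to-end latency bounded under a slow consumer."""
    def __init__(self, capacity=3):
        self.items = deque(maxlen=capacity)   # maxlen evicts the oldest automatically
        self.lock = threading.Lock()

    def push(self, item):
        with self.lock:
            self.items.append(item)

    def pop_latest_batch(self):
        """Consumer drains whatever is current; never blocks the producer."""
        with self.lock:
            batch = list(self.items)
            self.items.clear()
        return batch

q = BoundedQueue(capacity=3)
for seq in range(6):          # producer outruns the consumer
    q.push(seq)
# only the 3 newest results survive
```

The capacity of 2–3 is the latency bound: no result can be more than a couple of production intervals old by the time the consumer sees it.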

Calibration as a mode, not a step: Camera calibration (lens correction, white balance, exposure) is typically done once at setup. I implement it as an explicit operational mode — the pipeline switches to calibration mode, captures a calibration set, computes correction matrices, then returns to live mode. The correction matrices are applied in the capture thread as a fast transform, not in the process thread.

Frame metadata alongside pixels: Each frame carries a timestamp and sequence number. Processing results carry the sequence number of the source frame. This lets the transmit thread detect if a result is stale (sequence number is more than N frames behind the current capture sequence) and drop it.
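The staleness check itself reduces to a sequence-number comparison; the threshold is tunable per deployment (function and parameter names here are illustrative):

```python
def is_stale(result_seq: int, current_capture_seq: int, max_lag: int = 5) -> bool:
    """Drop a result whose source frame is more than max_lag frames
    behind the current capture sequence number."""
    return current_capture_seq - result_seq > max_lag

# Result computed from frame 100 while capture is at frame 103: still fresh.
# If capture has raced ahead to frame 110, the result is dropped.
```

Carrying the sequence number with the result, rather than re-deriving it from timestamps, keeps the check exact even when frame intervals jitter.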

Key Decisions

  • Triple buffer for capture-to-process handoff: A queue would cause processing latency to accumulate under load. The triple buffer ensures the process thread always starts from the most recent frame regardless of processing time. See the Triple Buffer Architecture entry for the implementation.
  • libjpeg-turbo over OpenCV imencode: OpenCV's imencode typically goes through plain libjpeg without SIMD acceleration (build-dependent). libjpeg-turbo uses SSE/AVX paths and is 2–6x faster at the same quality level. For a pipeline where JPEG encoding is on the critical path, this matters.
  • UDP over TCP for transmission: TCP's retransmission and flow control introduce unpredictable latency spikes. For real-time video where a dropped frame is better than a delayed one, UDP with application-level sequence numbers is the right transport. The receiver discards out-of-order and stale frames; it does not request retransmission.
  • Bounded result queue: An unbounded queue means a slow consumer causes memory growth and latency accumulation. A bounded queue (capacity 2–3) drops old results when full. This keeps end-to-end latency bounded regardless of consumer speed.

Where I've Applied This

Built this pipeline architecture for the GrandVision Magic Mirror at Second Story / Publicis Groupe. The requirement was a full head scan completing within 15 seconds and results transmitted to a tablet within 5 seconds. The C++ camera server implemented capture, processing (3D face reconstruction), JPEG encoding with libjpeg-turbo, and UDP transmission to a WPF tablet application. The triple buffer between capture and processing, and the bounded UDP transmit queue, were the architectural choices that made the timing requirements achievable. Commissioned at both Shanghai flagship store openings.

How to Run This

Inputs you need before starting:

  • Camera specs: resolution, framerate, interface (USB, GigE, CSI)
  • Processing SLA: how long can a single frame take to process?
  • Consumer: what receives processed results and at what rate?
  • Network path: local (shared memory), LAN (UDP), or WAN (QUIC/WebRTC)?

Scaffold prompt (for an AI agent or a developer):

Build a three-stage C++ camera pipeline with the following structure:

  • CaptureThread: opens OpenCV VideoCapture, captures frames at device rate, writes each frame to a TripleBuffer<cv::Mat>. Logs frame sequence number and timestamp on each capture.
  • ProcessThread: reads the latest frame from TripleBuffer, runs cv::CascadeClassifier::detectMultiScale for face detection, packages results as {sequence_id, timestamp, vector<cv::Rect> detections}, pushes to a BoundedQueue<Result> with capacity 3 (drop oldest on overflow).
  • TransmitThread: reads from BoundedQueue<Result>, serializes to JSON, sends over UDP socket to 127.0.0.1:9000. Each thread runs in a std::thread; shutdown is coordinated by a shared std::atomic<bool> running. All three threads start in main() and join on Ctrl+C.

Decision points requiring human review:

  1. Camera interface — USB cameras use OpenCV's VideoCapture; GigE cameras require the vendor SDK (Basler Pylon, Allied Vision Vimba). Plan for the vendor SDK path if camera quality is a requirement.
  2. Processing thread count — if the process stage is embarrassingly parallel (detecting in multiple ROIs), use a thread pool rather than a single process thread. OpenCV's cv::parallel_for_ handles this if OpenCV is built with TBB.
  3. JPEG encoding placement — encoding in the process thread blocks result delivery; encoding in the transmit thread adds latency. For high-resolution frames, consider a dedicated encode thread between process and transmit.
C++ (camera server) · OpenCV · libjpeg-turbo · POSIX Threads · Custom UDP transport
Computer Use Agent AI / Agents

Building an LLM agent that operates a computer through screen observation and input events — tool-use architecture, grounding strategies, and the verification loop that keeps it reliable.

Problem

Most LLM agents interact with software through structured APIs: function calls, REST endpoints, database queries. This works when the software exposes an API. Most enterprise software does not — it was built for human operators, not programmatic access. ERP systems, legacy web apps, insurance portals, government databases, internal tools with no API documentation — these systems' only interface is the UI.

A computer use agent treats the UI as the API. It sees what a human would see (a screenshot) and acts as a human would act (mouse clicks, keyboard input). This sounds straightforward. The hard parts are: knowing where to click, knowing when an action succeeded, and recovering when it didn't.

Approach

The agent runs a perception-reasoning-action-verification loop:

┌─────────────────────────────────────────────────────┐
│  1. PERCEIVE — take screenshot, encode as base64    │
│  2. REASON  — send to LLM with task + history       │
│               LLM outputs: action to take           │
│  3. ACT     — execute action (click, type, scroll)  │
│  4. VERIFY  — take new screenshot, compare to       │
│               expected post-action state             │
│  5. LOOP    — if goal not reached, go to 1          │
└─────────────────────────────────────────────────────┘

Grounding — the core technical challenge: The LLM reasons about what to do in natural language ("click the Save button"). The action executor needs pixel coordinates. Grounding is the translation between them.

Three approaches, in order of complexity:

  1. Coordinate output: Ask the LLM to output {action: "click", x: 452, y: 318}. Works with Claude's computer use API, which returns structured tool calls with coordinates identified from the screenshot.
  2. Element description + locator: LLM outputs {action: "click", selector: "button[type=submit]"}. The executor finds the element using a locator (Playwright, Selenium). More robust than pixel coordinates — survives layout changes.
  3. Set-of-marks: Overlay numbered labels on all interactive elements before sending the screenshot to the LLM. Ask the LLM to identify the label number, not coordinates. The executor maps label → element. Reduces hallucination risk significantly.
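The executor side of set-of-marks is a plain label-to-coordinates mapping; the overlay step that draws the numbered labels also records each element's bounding box. A sketch with hypothetical names:

```python
def resolve_mark(label: int, marks: dict) -> tuple:
    """Map the LLM's chosen label to a click point (element bbox center).
    marks: {label: (x, y, width, height)}, built while drawing the overlay."""
    x, y, w, h = marks[label]
    return (x + w // 2, y + h // 2)

# Overlay assigned label 3 to a 'Save' button at (400, 300), 120x40 px.
marks = {3: (400, 300, 120, 40)}
# The LLM responds with {"action": "click", "mark": 3}; the executor clicks
# the center of that element rather than trusting model-emitted coordinates.
```

Because the model only ever picks from labels that provably exist in the map, a hallucinated coordinate becomes a `KeyError` you can catch instead of a misclick.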

Tool use schema: The agent exposes a small, explicit tool set:

tools = [
    {"name": "screenshot", "description": "Take a screenshot of the current screen"},
    {"name": "click",      "description": "Click at coordinates", "input_schema": {"x": int, "y": int}},
    {"name": "type",       "description": "Type text at cursor",  "input_schema": {"text": str}},
    {"name": "scroll",     "description": "Scroll the page",      "input_schema": {"direction": str, "amount": int}},
    {"name": "done",       "description": "Signal task completion", "input_schema": {"result": str}},
]  # input_schema is shorthand here; real tool definitions use JSON Schema objects

The LLM calls these tools in sequence. The agent executes each, captures the result (a new screenshot), and continues until done is called or the max steps limit is reached.

Verification: After each action, the agent compares the new screenshot to the expected state. For deterministic actions (clicking a button that opens a modal), the LLM can describe what it expects to see. For non-deterministic states, verify by looking for specific UI markers: "the form submission succeeded if a success banner is visible."

Failure recovery: Computer use agents fail in predictable ways: unexpected dialogs, loading states that take longer than expected, UI elements that moved. Build explicit recovery steps: if the expected post-action state doesn't appear within N seconds, take a fresh screenshot and re-reason from the current state. Don't retry the last action blindly — the LLM may need to take a different path.
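The perceive-reason-act-verify loop with its max-steps guard reduces to a skeleton like this. The `reason`, `execute`, and `screenshot` callables are stand-ins for the LLM call and input control, not a real API; injecting them keeps the loop testable:

```python
def run_agent(task, reason, execute, screenshot, max_steps=20):
    """Generic agent loop: perceive -> reason -> act, bounded by max_steps.
    reason(task, history, image) returns a dict like {"tool": "click", ...}
    or {"tool": "done", "result": ...}."""
    history = []
    for _ in range(max_steps):
        image = screenshot()                       # PERCEIVE: fresh state every step
        action = reason(task, history, image)      # REASON: LLM picks the next tool
        if action["tool"] == "done":
            return action["result"]
        execute(action)                            # ACT
        history.append(action)                     # next loop's screenshot is the VERIFY
    raise RuntimeError(f"max_steps ({max_steps}) exceeded without done")

# Stubbed run: the fake reasoner clicks once, then signals completion.
script = iter([{"tool": "click", "x": 10, "y": 20},
               {"tool": "done", "result": "ok"}])
result = run_agent(
    task="demo",
    reason=lambda t, h, img: next(script),
    execute=lambda a: None,
    screenshot=lambda: b"png-bytes",
)
```

Re-screenshotting at the top of every iteration is what makes recovery work: after a failed action, the next reasoning call sees the actual current state, not the expected one.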

Key Decisions

  • Claude's built-in computer use over custom grounding: Anthropic's computer use API handles coordinate extraction from screenshots natively. Building custom grounding (OCR, element detection, set-of-marks overlay) is significant engineering effort. Start with the built-in; add custom grounding only if the task domain requires it (low-res displays, custom rendering engines).
  • Small tool set, explicit schemas: A large tool set increases the LLM's decision surface and hallucination rate. Keep it minimal: screenshot, click, type, scroll, done. Add tools only when a task genuinely requires them (file upload, keyboard shortcuts, drag-and-drop).
  • Max steps limit, not just timeout: A timeout-only limit can be hit during a legitimately long-running action (file upload, page load). A max steps limit catches infinite loops regardless of wall time. Use both.
  • Human checkpoint for high-stakes actions: Form submissions, data deletion, financial transactions. Before executing, pause and surface the planned action for human approval. The agent continues after approval. This is not optional for production deployments against systems with side effects.

Where I've Applied This

Built a computer use agent using Claude's computer use API for automating data entry and extraction workflows against internal web tools with no API access. The agent handles multi-step form flows, table navigation, and conditional branching based on screen state. The set-of-marks grounding approach reduced coordinate hallucination to near zero compared to raw coordinate output.

How to Run This

Inputs you need before starting:

  • Target application (web, desktop, or both)
  • Task definition: what does success look like? (specific text on screen, file downloaded, form submitted)
  • Risk level: are there irreversible actions? (deletions, submissions, payments)

Scaffold prompt (for an AI agent or a developer):

Build a computer use agent loop in Python using the Anthropic SDK with the following structure:

  • take_screenshot() → str — captures the screen using pyautogui.screenshot(), encodes as base64 PNG
  • execute_action(tool_name, tool_input) — dispatches to pyautogui.click(x, y), pyautogui.typewrite(text), or pyautogui.scroll(clicks) based on tool_name
  • run_agent(task: str, max_steps: int = 20):
    1. Take screenshot, send to claude-opus-4-6 with task description and the tool list: screenshot, click, type, scroll, done
    2. Execute each tool call in the LLM response
    3. After each action, take a new screenshot and add it to the message history
    4. Stop when done is called or max_steps is reached
    5. Return the result from the done call, or raise an error on max_steps exceeded

Use claude-opus-4-6 for complex reasoning tasks; claude-sonnet-4-6 for repetitive, well-defined tasks where cost matters.

Decision points requiring human review:

  1. Input control library — pyautogui for desktop; playwright for browser-only (more reliable element targeting, built-in waits). Prefer Playwright if the target is a web app.
  2. Screenshot frequency — screenshot after every action is the safest approach; it's also the most token-expensive. For known deterministic sequences (type text, press Tab, type text), batch actions before taking the next screenshot.
  3. Credential handling — if the agent needs to log in, credentials must be available at runtime. Use environment variables or a secrets manager. Never embed credentials in the prompt or tool schema.
Python · Anthropic Claude API (claude-opus-4-6 / claude-sonnet-4-6) · Playwright or pyautogui (input control) · Pillow (screenshot processing)
Data Middleware — BI to Warehouse Data Platform

Designing the schema and serving layer that sits between a data warehouse and BI tools like Tableau — star schema, materialized views, distribution keys, and when to use extracts.

Problem

Connecting Tableau (or any BI tool) to a data warehouse is not just a connection string. Tableau's query patterns are fundamentally different from OLTP or even standard analytical queries: it generates multi-join, multi-GROUP BY SQL with dynamic date ranges based on what a dashboard user is filtering. On a poorly designed schema, an interactive dashboard generates a query that takes 30 seconds. The user clicks something else. Three more 30-second queries launch. The warehouse is grinding; the user has given up.

The other failure mode is the "everything in one table" schema — a wide, denormalized fact table with 200+ columns. It's fast to query for simple aggregations but breaks down when Tableau needs to cross-reference dimensions that aren't in the same row, and it's expensive to maintain when source schemas evolve.

Approach

The serving layer is purpose-built for the consumer. For Tableau, that means a star schema: one or a few fact tables containing measurements and foreign keys, surrounded by dimension tables containing the descriptive attributes Tableau uses for filtering and grouping.

                    ┌──────────────────┐
                    │  dim_client      │
                    │  client_id (PK)  │
                    │  client_name     │
                    │  region          │
                    └────────┬─────────┘
                             │
┌──────────────────┐  FK     ▼       FK  ┌──────────────────┐
│  dim_date        ├──────► fact_survey ◄──┤  dim_context    │
│  date_id (PK)    │         (measures)   │  context_id (PK) │
│  year, quarter   │                      │  survey_type     │
│  month, week     │                      │  wave_name       │
└──────────────────┘                      └──────────────────┘

Redshift-specific serving design:

  • Distribution key: Set DISTKEY on the join column used most frequently in Tableau queries. For a fact-dimension join, this is the dimension foreign key. Rows with the same DISTKEY land on the same compute node — joins become node-local, eliminating redistribution.
  • Sort key: Set SORTKEY on the date column. Tableau's date range filters allow Redshift to skip entire blocks of storage (zone map pruning). Without a date sort key, every date filter is a full scan.
  • Materialized views for heavy aggregations: Tableau's "Show Me" charts and LOD calculations generate complex sub-queries. Pre-compute the expensive ones as Redshift materialized views. The view refresh is part of the ETL pipeline (post-Silver step in the Bronze/Silver/Serving architecture); Tableau queries the view, not the raw fact table.
  • Extracts for static dashboards: Tableau extracts (.hyper files) are a local in-memory columnar store. For dashboards that don't need real-time data and are queried by many users, an extract eliminates warehouse load entirely. Schedule refreshes overnight or hourly. Live connections are appropriate only when real-time data is genuinely required.

Schema design principles for Tableau:

  • Dimension tables are wide, fact tables are narrow: Tableau builds its "Data pane" from the schema. Dimensions should carry all the human-readable attributes (names, categories, hierarchies). Facts should carry only measures (counts, amounts) and foreign keys. A 200-column fact table is a design smell.
  • Consistent date granularity: Tableau's date drill-down (year → quarter → month → week → day) requires a date dimension with pre-computed hierarchy columns. Don't rely on Tableau to derive these — generate them at build time in the ETL.
  • Avoid Tableau calculated fields on large tables: Tableau row-level calculated fields run in the DB as SQL expressions on every row. A CASE WHEN that recodes 50 categories executes 50 comparisons per row, per query. Pre-compute it in the serving layer as a column.
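Pre-computing a recode at ETL time means applying the mapping once per row in the Silver-to-Serving step, instead of a 50-branch CASE WHEN on every query. A hedged Python sketch with made-up category names (in the real pipeline the mapping would come from a controlled vocabulary table, not a literal dict):

```python
# Hypothetical recode map for illustration only
RECODE = {"very_satisfied": "positive", "satisfied": "positive",
          "neutral": "neutral", "dissatisfied": "negative"}

def add_recoded_column(rows, source_col, target_col, mapping, default="unknown"):
    """Compute the recode once during ETL; Tableau then filters and groups
    on a plain pre-computed column instead of a per-row SQL expression."""
    for row in rows:
        row[target_col] = mapping.get(row[source_col], default)
    return rows

rows = [{"response": "satisfied"},
        {"response": "dissatisfied"},
        {"response": "n/a"}]
add_recoded_column(rows, "response", "sentiment", RECODE)
# rows now carry a precomputed 'sentiment' column for the serving table
```

The cost moves from query time (paid on every dashboard interaction, by every user) to build time (paid once per ETL run), which is exactly the trade the serving layer exists to make.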

Key Decisions

  • Star schema over wide flat table: Star schema adds a join, but the join is cheap (DISTKEY-aligned). The wide table looks simpler but produces unmaintainable ETL as source schemas evolve, and Tableau can't efficiently cross-reference data across categorical dimensions that aren't in the same row.
  • Serving layer separate from Silver: Tableau's schema requirements (star schema, pre-aggregated views) are not the same as the analytical schema needed by Python/ML workloads. Keep them separate. The Silver layer is the stable contract; serving schemas are consumer-specific projections on top of Silver.
  • Redshift Serverless over provisioned cluster for BI: Tableau dashboards are bursty — heavy during business hours, idle overnight. Serverless scales to zero when idle. A provisioned cluster runs 24/7 at full cost. For a BI-only workload, Serverless is the right default.
  • Limit live connections: Every Tableau user opening a dashboard that uses a live connection fires a warehouse query. For 50 users on the same dashboard, that's 50 simultaneous queries per interaction. Extracts eliminate this. Reserve live connections for dashboards where data freshness is the explicit requirement.
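A DDL sketch of the DISTKEY-aligned star schema described above, with ENCODE choices following the cardinality rule from the scaffold prompt. Table and column names are illustrative, not a real schema:

```sql
-- Dimension: wide, human-readable attributes.
CREATE TABLE dim_client (
    client_key   INT IDENTITY(1,1),
    client_name  VARCHAR(256) ENCODE zstd,     -- high cardinality
    client_tier  VARCHAR(32)  ENCODE bytedict, -- low cardinality
    PRIMARY KEY (client_key)
)
DISTKEY (client_key);

-- Fact: narrow, measures and foreign keys only.
CREATE TABLE fact_response (
    client_key     INT  ENCODE az64,
    response_date  DATE ENCODE az64,
    response_count INT  ENCODE az64
)
DISTKEY (client_key)     -- same DISTKEY as dim_client: the join
                         -- is co-located, no network shuffle
SORTKEY (response_date); -- Tableau date filters become
                         -- range-restricted scans
```

This is what "the join is cheap" means concretely: with both tables distributed on client_key, each node joins only its local slices.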

Where I've Applied This

Designed and built the serving layer for Tableau at HCE Group as part of the cloud data platform transformation. The Silver-to-Serving ETL produces star-schema tables in Redshift Serverless with DISTKEY/SORTKEY aligned to Tableau's query patterns, and materialized views for the most expensive aggregation paths. Dashboard load times dropped from 30–60 seconds on the legacy SQL Server schema to 2–5 seconds on the tuned Redshift serving layer. Extracts are used for all dashboards with >10 concurrent users.

How to Run This

Inputs you need before starting:

  • Tableau workbook(s) to optimize, or target dashboard design
  • Silver layer schema (what data is available, at what grain)
  • Concurrency expectations: how many users, how often, what times of day

Scaffold prompt (for an AI agent or a developer):

Given the following Silver layer tables: [list Silver tables with column names and data types], design a star schema serving layer for Tableau with the following requirements:

  • Identify the primary fact table and its grain (one row per what?)
  • Identify dimension tables for [client, date, context/survey type, etc.]
  • Write the Redshift CREATE TABLE statements for each, including: DISTKEY on the fact-to-dimension join column, SORTKEY on the date column of the fact table, ENCODE compression for each column based on cardinality (ZSTD for high-cardinality strings, BYTEDICT for low-cardinality)
  • Write the Glue/SQL ETL that populates fact and dimension tables from Silver, including: deduplication on natural key, dim_date population from a date range (generate all dates for the past 5 years + next 1 year), upsert logic using Redshift MERGE
  • Write the Redshift MATERIALIZED VIEW for the most expensive aggregation: total responses by client, context, and date_month, refreshed daily

Decision points requiring human review:

  1. Fact table grain — what is one row? Getting this wrong cascades into incorrect aggregations in Tableau. The grain should match the most detailed question the dashboards need to answer.
  2. Slowly changing dimensions — if client names or context labels change over time, do historical records need to reflect the old name (SCD Type 2) or the current name (SCD Type 1)? Tableau date range filters behave differently for each.
  3. Extract vs. live — identify which dashboards genuinely require real-time data. Most don't. Forcing a live connection on a static weekly report adds warehouse cost with no user benefit.
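For decision point 2, the SCD Type 1 case (overwrite with the current name) is what the scaffold prompt's MERGE step produces. A minimal sketch, assuming a hypothetical staging table stg_client and natural key client_natural_id:

```sql
-- SCD Type 1 upsert: historical records pick up the current
-- attribute values. Use Type 2 (versioned rows with effective
-- dates) instead if history must show the old names.
MERGE INTO dim_client
USING stg_client s
ON dim_client.client_natural_id = s.client_natural_id
WHEN MATCHED THEN
    UPDATE SET client_name = s.client_name,
               client_tier = s.client_tier
WHEN NOT MATCHED THEN
    INSERT (client_natural_id, client_name, client_tier)
    VALUES (s.client_natural_id, s.client_name, s.client_tier);
```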
Redshift Serverless · SQL · Tableau · AWS Glue (ETL to serving layer)

Let's Connect

These are the patterns I lead with. If your team is tackling one of these challenges, I'm always happy to connect and share insights or point you in the right direction.

LinkedIn