
The Orch8 Playbook

22 patterns ready to copy

Each pattern starts with the problem, explains how Orch8 solves it, walks through what happens step by step, and includes a ready-to-use sequence definition. No prior orchestration knowledge required.

10 everyday patterns · 12 advanced capabilities · copy-paste JSON

Everyday Problems

1

Never Lose a Payment to a Provider Outage

try_catch · retry · timeout

The Problem

Stripe goes down. Transactions fail and you lose revenue. You want Braintree as a backup — but you can't charge both providers at once, or the customer gets double-charged. You need sequential failover: try Stripe, and only if it definitively fails, try the next provider.

How Orch8 Solves It

Nested try_catch blocks attempt providers in order. Stripe runs first with retries. If all Stripe attempts fail, the catch block tries Braintree. If Braintree also fails, Adyen runs. An idempotency key on the order ID ensures the same charge is never submitted twice to the same provider — even after a crash mid-retry.

What happens, step by step

  1. Prepare — normalize the charge and attach an idempotency key derived from the order ID
  2. Try Stripe — up to 2 retries on transient errors (network timeout, 5xx); hard-stop on card declines
  3. Stripe failed? Catch and try Braintree with its own retry config
  4. Braintree failed? Catch and try Adyen as last resort
  5. Whichever provider succeeds records the transaction; if all three fail the instance goes to DLQ

Key takeaways

  • Never race payment providers — if two succeed before the race cancels them, the customer is double-charged.
  • try_catch gives you sequential failover: Stripe → Braintree → Adyen. Each provider is only attempted if the previous one definitively failed.
  • The idempotency_key on the order ID means retrying Stripe three times still counts as one charge attempt — the provider deduplicates for you.
  • retryable_errors limits retries to transient failures. Card declines are permanent — don't retry them, move straight to the next provider.
  • Each catch block logs the failure before trying the next provider, so you have a full trail of which providers were attempted.
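The failover described above can be sketched as a sequence definition. The block names (try_catch, retry) and the idempotency_key and retryable_errors fields come straight from this pattern; everything else — the blocks/try/catch wrapper shape and the handler names — is illustrative, so check the schema reference for your Orch8 version before copying:

```json
{
  "blocks": [
    { "id": "prepare", "handler": "normalize_charge" },
    {
      "type": "try_catch",
      "try": [{
        "id": "charge_stripe",
        "handler": "stripe_charge",
        "idempotency_key": "{{context.data.order_id}}",
        "retry": { "max_attempts": 3, "retryable_errors": ["network_timeout", "http_5xx"] }
      }],
      "catch": [
        { "id": "log_stripe_failure", "handler": "log_provider_failure" },
        {
          "type": "try_catch",
          "try": [{
            "id": "charge_braintree",
            "handler": "braintree_charge",
            "idempotency_key": "{{context.data.order_id}}",
            "retry": { "max_attempts": 2, "retryable_errors": ["network_timeout", "http_5xx"] }
          }],
          "catch": [{
            "id": "charge_adyen",
            "handler": "adyen_charge",
            "idempotency_key": "{{context.data.order_id}}"
          }]
        }
      ]
    },
    { "id": "record", "handler": "record_transaction" }
  ]
}
```

The nesting is the point: each catch only runs after every attempt above it has definitively failed, so providers are never raced.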
2

Reach Users on Every Channel Without Blocking

parallel · try_catch

The Problem

You need to notify a user via email, SMS, and push notification. If the email provider is slow, it shouldn't delay the SMS. If push fails, it shouldn't kill the whole notification. And each channel should have its own backup provider.

How Orch8 Solves It

A parallel block sends all three channels simultaneously. Each channel is wrapped in try_catch so failures are isolated — if push notifications fail, email and SMS still deliver.

What happens, step by step

  1. Email — try Amazon SES, fall back to SendGrid
  2. SMS — try Twilio, fall back to Vonage
  3. Push — try Firebase, log a warning on failure
  4. Record — save delivery results for all channels
  5. All three channels run at the same time. A failure in one never affects the others.

Key takeaways

  • parallel runs branches simultaneously. try_catch isolates failures per channel.
  • Each channel has its own primary + fallback provider with independent retry configuration.
  • Adjust timeouts per channel based on provider speed expectations.
  • The record step uses |null defaults — it always runs, even if a channel had no output.
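A sketch of the fan-out, assuming a branches array under parallel (handler names and exact field shapes are illustrative — only parallel, try_catch, and the |null default syntax are given by this pattern):

```json
{
  "blocks": [
    { "type": "parallel",
      "branches": [
        [{ "type": "try_catch",
           "try":   [{ "id": "email_ses", "handler": "ses_send" }],
           "catch": [{ "id": "email_sendgrid", "handler": "sendgrid_send" }] }],
        [{ "type": "try_catch",
           "try":   [{ "id": "sms_twilio", "handler": "twilio_send" }],
           "catch": [{ "id": "sms_vonage", "handler": "vonage_send" }] }],
        [{ "type": "try_catch",
           "try":   [{ "id": "push_firebase", "handler": "firebase_push" }],
           "catch": [{ "id": "push_warn", "handler": "log_warning" }] }]
      ] },
    { "id": "record", "handler": "save_delivery_results",
      "params": { "email": "{{steps.email_ses.output|null}}",
                  "sms":   "{{steps.sms_twilio.output|null}}",
                  "push":  "{{steps.push_firebase.output|null}}" } }
  ]
}
```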
3

Run Email Campaigns That Don't Get You Blacklisted

ab_splitrate_limitsend_windowdelayrouterretry

The Problem

You're running cold outreach at scale. You need to send emails only during business hours, respect per-mailbox daily limits, A/B test subject lines, wait between follow-ups, and stop the sequence if someone replies. One wrong move and your domain gets blacklisted.

How Orch8 Solves It

Send windows ensure delivery during business hours only. Rate limiting caps volume per mailbox. A/B split tests subject lines deterministically. Business-day delays with jitter make timing look natural. Routers check for replies between steps.

What happens, step by step

  1. Enrich — look up company info for personalization
  2. A/B test — half get subject A, half get subject B (deterministic hash, survives restarts)
  3. Send first email — only during weekdays 8am–5pm, respecting mailbox daily limits
  4. Wait 3 business days with ±2h jitter for natural timing
  5. Check for reply — if replied, update CRM and stop; if not, send follow-up
  6. Wait 5 more business days, check again, mark as no-reply if silence

Key takeaways

  • send_window prevents emails at odd hours. rate_limit_key caps daily volume per mailbox.
  • ab_split is deterministic — the same lead always gets the same variant, even after a restart.
  • business_days_only skips weekends. jitter makes timing look human.
  • Configure per-mailbox rate limits (e.g. 40/day) via the rate limits API, not in the sequence definition.
4

Let an AI Agent Think, Act, and Loop on Its Own

loop · router · retry · templates

The Problem

You want an AI agent that can reason about a task, pick the right tools (search, database, calculator), execute them, observe results, and keep looping until it has an answer. You also need a safety limit on iterations and retries if the LLM API flakes.

How Orch8 Solves It

A loop block repeats until the agent marks itself done or hits a max iteration limit. Inside the loop, the LLM picks a tool, a router dispatches to the right handler, and the result is fed back for the next iteration. If the engine crashes, completed iterations aren't re-executed.

What happens, step by step

  1. Initialize — set up empty conversation history and iteration counter
  2. Loop (up to 10 times): call the LLM with current history and tool list
  3. Router dispatches to the right tool (search, calculator, database) based on LLM output
  4. Tool result is appended to conversation history for the next iteration
  5. When the LLM returns a final answer (no tool call), mark done and exit the loop
  6. Save — persist the final answer to the session store

Key takeaways

  • loop repeats until done or max_iterations is hit. No runaway agents.
  • router dispatches to the right tool based on LLM output. The default branch handles final answers.
  • API keys live in context.config, never in the sequence definition — they're never logged.
  • Completed iterations are saved. A crash mid-loop doesn't re-execute finished iterations.
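A minimal sketch of the agent loop. loop, max_iterations, router, and its default branch are from this pattern; the value/routes routing syntax and the handler names are assumptions:

```json
{
  "type": "loop",
  "max_iterations": 10,
  "blocks": [
    { "id": "think", "handler": "llm_call",
      "params": { "history": "{{context.data.history}}" } },
    { "type": "router",
      "value": "{{steps.think.output.tool}}",
      "routes": {
        "search":     [{ "id": "run_search", "handler": "web_search" }],
        "calculator": [{ "id": "run_calc",   "handler": "calculator" }],
        "database":   [{ "id": "run_db",     "handler": "db_query" }]
      },
      "default": [{ "id": "finish", "handler": "mark_done" }] }
  ]
}
```

The default branch is where "no tool call" lands — the LLM's final answer exits the loop by marking the instance done.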
5

Recover Failed Payments Without Losing Customers

delay · router · human_input · cancellation_scope · sla_deadline · retry

The Problem

A subscription payment fails. You need to retry the charge on a schedule, send increasingly urgent emails, get human approval before suspending high-value accounts, and make sure the suspension step can't be accidentally cancelled mid-execution.

How Orch8 Solves It

Delays schedule retries. Routers branch on success/failure. Human input pauses the workflow for manual review. Cancellation scopes protect critical billing operations from interruption. SLA deadlines alert your team if resolution takes too long.

What happens, step by step

  1. Notify — send 'payment failed' email immediately
  2. Wait 3 days, retry the charge
  3. If paid — send confirmation, done
  4. If still failing — send urgent email, wait 5 more days, retry again (with a 7-day SLA deadline)
  5. If paid on second try — send confirmation, done
  6. If high-value account — pause for human review (48h timeout, then escalate to manager)
  7. If low-value or rejected — suspend account inside a cancellation scope so it can't be interrupted

Key takeaways

  • delay schedules retry attempts days apart. router branches on payment success/failure at each attempt.
  • wait_for_input pauses the workflow for review, with automatic escalation after 48 hours if nobody responds.
  • cancellation_scope protects critical steps (account suspension) from being interrupted by a signal.
  • deadline fires a Slack notification if resolution drags past 7 days — your team always knows.
6

Process Thousands of Items Without Writing Queue Code

for_each · sub_sequence · parallel · try_catch

The Problem

You have a batch of invoices (or images, records, orders) to process. Each one needs multiple steps — validate, enrich, transform, store. If one fails, the rest should keep going. You'd normally need a message queue, consumer workers, error handling per item, and progress tracking.

How Orch8 Solves It

for_each iterates over the collection. Each item spawns a sub_sequence — a complete child workflow with its own lifecycle. Failures are isolated per item via try_catch. The parent workflow waits for all children and generates a summary.

What happens, step by step

  1. Fetch — query all pending invoices in the batch
  2. Fan out — for each invoice, spawn a child workflow
  3. Child workflow: validate the invoice and look up vendor info in parallel
  4. Child workflow: apply accounting rules and post to the ledger
  5. Child failure: quarantine the invoice rather than crashing
  6. Summarize — generate a batch report after all items are processed

Key takeaways

  • for_each + sub_sequence = batch processing without building queue infrastructure.
  • Each item is a separate child workflow. One failure doesn't affect the others.
  • Sub-sequences can be independently versioned, tested, and reused by other workflows.
  • max_iterations: 500 caps batch size. For larger batches, paginate and run multiple instances.
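A sketch of the fan-out block. for_each, sub_sequence, and max_iterations: 500 come from this pattern; the items/item template bindings and the child-sequence name are assumptions:

```json
{
  "type": "for_each",
  "items": "{{steps.fetch.output.invoices}}",
  "max_iterations": 500,
  "blocks": [{
    "type": "sub_sequence",
    "sequence": "process_invoice",
    "version": 1,
    "params": { "invoice_id": "{{item.id}}" }
  }]
}
```

Keeping process_invoice as its own sequence is what makes it independently versionable and testable.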
7

Onboard Users Differently Based on Their Plan

router · parallel · templates · delay

The Problem

Free users just need an account and a welcome email. Pro users need feature flags enabled. Enterprise users need dedicated infrastructure, a customer success manager, SSO setup, and a kickoff meeting — all happening in parallel to save time. You don't want to maintain separate workflows for each plan.

How Orch8 Solves It

A single sequence handles all plans. A router branches on the plan type. Enterprise setup runs steps in parallel (infra + CSM + SSO at the same time). Template interpolation with defaults handles missing fields gracefully.

What happens, step by step

  1. Create account — all plans
  2. Send welcome email — template name varies by plan using {{context.data.plan}}
  3. Route by plan: Enterprise → provision dedicated infra + CSM + SSO in parallel, then kickoff
  4. Pro → provision shared-priority infra + enable advanced features
  5. Free → provision shared infra
  6. Send getting-started email 1 hour later — all plans

Key takeaways

  • One workflow handles all plan types. router branches on any condition.
  • parallel runs enterprise setup steps simultaneously — no waiting for each to finish in sequence.
  • Template defaults ({{context.data.preferred_region|us-east-1}}) handle missing fields without crashing.
  • Adding a new plan = adding one route block. The default catches anything unrecognized.
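A sketch of the plan router. The {{context.data.plan}} and {{context.data.preferred_region|us-east-1}} templates and the router's default branch come from this pattern; the routes map syntax and handler names are illustrative:

```json
{
  "type": "router",
  "value": "{{context.data.plan}}",
  "routes": {
    "enterprise": [
      { "type": "parallel",
        "branches": [
          [{ "id": "infra", "handler": "provision_dedicated",
             "params": { "region": "{{context.data.preferred_region|us-east-1}}" } }],
          [{ "id": "csm", "handler": "assign_csm" }],
          [{ "id": "sso", "handler": "setup_sso" }]
        ] },
      { "id": "kickoff", "handler": "schedule_kickoff" }
    ],
    "pro":  [{ "id": "pro_infra",  "handler": "provision_shared_priority" }],
    "free": [{ "id": "free_infra", "handler": "provision_shared" }]
  },
  "default": [{ "id": "fallback", "handler": "provision_shared" }]
}
```

Adding a new plan is one more key in routes; anything unrecognized falls through to default.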
8

Stop Hammering a Broken API

circuit_breaker · try_catch · router · retry

The Problem

You're calling a partner API that goes down occasionally. Without protection, your system keeps sending requests that timeout, wasting resources and making things worse. You need to fail fast when the API is down, serve cached data when possible, and recover automatically when it comes back.

How Orch8 Solves It

A circuit breaker tracks failures at the infrastructure level. After a threshold (e.g., 5 failures), it opens and short-circuits all requests — instant failure, no network call. After a cooldown, it lets one request through to test recovery. try_catch catches the failure and a router decides whether to use cached data or retry later.

What happens, step by step

  1. Try calling the partner API (with retries for transient errors)
  2. If successful — transform the response and continue
  3. If all retries fail or circuit breaker is open — catch the error
  4. If cached data is available, use it; otherwise schedule the request for retry later
  5. Finally — always record success/failure metrics, regardless of outcome

Key takeaways

  • Configure the breaker via API — not in the sequence: POST /circuit-breakers { handler: 'partner_api_request', failure_threshold: 5, cooldown_secs: 60 }
  • When the breaker opens, the step fails instantly — no network call, no timeout wait.
  • try_catch handles the failure. router picks the best fallback (cache or reschedule).
  • finally always runs — metrics are recorded regardless of success or failure.
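The breaker config from the takeaway above, written out as the JSON request body for POST /circuit-breakers:

```json
{
  "handler": "partner_api_request",
  "failure_threshold": 5,
  "cooldown_secs": 60
}
```

Because this lives in infrastructure config rather than the sequence definition, you can tune thresholds without redeploying the workflow.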
9

Turn Incoming Webhooks Into Multi-Step Pipelines

parallel · router · templates · retry

The Problem

When a new lead fills out your website form, you need to validate the data, verify their email, enrich with company info, score the lead, and route them to the right sales rep or nurture sequence. Today this is a tangle of Lambda functions, SQS queues, and glue code.

How Orch8 Solves It

One sequence definition replaces the entire pipeline. Enrichment steps run in parallel (email verification, company lookup, and intent detection at the same time). A router scores and routes leads: hot leads go straight to a sales rep with a Slack notification, warm leads enter a drip campaign, cold leads are stored for later.

What happens, step by step

  1. Validate — check the raw form data against a schema
  2. Normalize — clean up the email address
  3. Enrich (in parallel) — company data from Clearbit, email verification, intent detection — all at the same time
  4. Score — combine all signals into a lead score and tier
  5. Route — hot → AE assignment + Slack alert; warm → nurture drip; cold → stored in CRM

Key takeaways

  • Parallel enrichment saves time — three API calls run simultaneously instead of sequentially.
  • router replaces complex if/else chains. Conditions read like natural language.
  • Template defaults ({{...utm_source|direct}}) handle missing fields without crashing.
  • Point any webhook source at Orch8's trigger endpoint and this pipeline runs automatically.
10

Require Human Approval Before Critical Actions

human_input · cancellation_scope · router · retry

The Problem

Your deployment pipeline runs tests, generates a diff report, and should only deploy if a human approves. If nobody responds within 4 hours, it should escalate. Once approved, the deploy must not be cancellable mid-execution.

How Orch8 Solves It

wait_for_input pauses the workflow and waits for a human signal. The workflow stays paused (using zero resources) until someone responds via API. If nobody responds within the timeout, an escalation handler fires automatically. The deploy step is marked non-cancellable.

What happens, step by step

  1. Run CI — execute the test pipeline (up to 10 min, with retries for flaky infra)
  2. Generate diff report — summarize what changed and test results
  3. Request approval — send to reviewer, workflow pauses (costs nothing while waiting)
  4. Wait up to 4 hours — if no response, escalate to a backup reviewer
  5. If approved — deploy (non-cancellable), notify the team
  6. If rejected — notify the team with the rejection reason

Key takeaways

  • wait_for_input pauses the workflow at zero cost until a signal arrives. No polling, no heartbeats.
  • The reviewer responds via API: POST /instances/{id}/signal { signal_type: 'custom', payload: { decision: 'approved' } }
  • Escalation fires automatically if nobody responds within the timeout — no cron job needed.
  • cancellable: false prevents accidental interruption during the deploy step itself.
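A sketch of the approval gate. wait_for_input and cancellable: false are from this pattern; timeout_secs, on_timeout, and the handler names are assumptions:

```json
{
  "blocks": [
    { "id": "approval", "handler": "wait_for_input",
      "timeout_secs": 14400,
      "on_timeout": [{ "id": "escalate", "handler": "notify_backup_reviewer" }] },
    { "id": "deploy", "handler": "run_deploy", "cancellable": false },
    { "id": "notify", "handler": "notify_team" }
  ]
}
```

While approval is pending, the instance sits paused at zero cost; the signal body shown in the takeaways resumes it.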

Advanced Capabilities

These patterns showcase capabilities unique to Orch8 — features that eliminate entire categories of custom infrastructure code.

11

Workflows That Rewrite Themselves at Runtime

dynamic_injection · loop

The Problem

An AI agent needs to decide its own next steps based on what it learns. You can't pre-define every possible path because the agent discovers what tools to call at runtime. Traditional workflow engines require you to know the entire graph upfront.

How Orch8 Solves It

The self_modify built-in handler lets a running workflow inject new steps into itself. An LLM analyzes the task, returns a list of tool calls, and the workflow inserts those as actual execution steps. No external worker needed.

What happens, step by step

  1. LLM analyzes the task and returns a list of tool calls as JSON
  2. self_modify injects those tool calls as real workflow steps
  3. The engine picks them up on the next tick and executes them
  4. A final step synthesizes results after all injected steps complete
  5. If the engine crashes mid-execution, completed steps aren't re-run

Key takeaways

  • self_modify is a built-in handler — no external worker required.
  • position: 0 inserts before the next unexecuted block. Omit position to append at the end.
  • You can also inject blocks externally via POST /instances/{id}/inject-blocks — useful from AI orchestration layers.
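What an injection payload might look like — the same shape would apply to the self_modify handler and to POST /instances/{id}/inject-blocks. Only position (described in the takeaways) is given by this pattern; the blocks wrapper and handler names are assumptions:

```json
{
  "position": 0,
  "blocks": [
    { "id": "lookup_docs", "handler": "web_search",
      "params": { "query": "{{context.data.task}}" } },
    { "id": "check_db", "handler": "db_query",
      "params": { "table": "customers" } }
  ]
}
```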
12

Call Any gRPC Service as a Workflow Step

grpc · retry · circuit_breaker

The Problem

You have existing microservices with gRPC interfaces. You want to use them in workflows without writing adapter code or polling workers.

How Orch8 Solves It

Prefix any handler with grpc:// and the engine calls your service directly. Standard retry, timeout, and circuit breaker all apply. No worker polling needed.

What happens, step by step

  1. Set handler to grpc://<host>:<port>/<Service>.<Method>
  2. The engine sends step params as JSON, receives JSON response
  3. Mix gRPC steps with regular workers and built-in handlers in the same workflow
  4. Configure retry, timeout, and circuit breaker identically to any other step

Key takeaways

  • No adapter code needed. The grpc:// prefix is all that changes.
  • Different steps can target different gRPC services in the same workflow.
  • Circuit breaker works identically — configure via POST /circuit-breakers with the grpc:// handler name.
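A sketch of a gRPC step. The grpc://host:port/Service.Method handler format is from this pattern; the host, service, method, and surrounding field names are hypothetical:

```json
{
  "id": "score_lead",
  "handler": "grpc://ml-scoring.internal:50051/LeadScoring.Score",
  "params": { "lead_id": "{{context.data.lead_id}}" },
  "timeout_secs": 5,
  "retry": { "max_attempts": 3 }
}
```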
13

Start Workflows From External Events

triggers

The Problem

Stripe sends a 'payment failed' webhook. GitHub sends a 'pull request opened' event. You want these to automatically start workflows without writing glue code.

How Orch8 Solves It

Register a trigger, point the external webhook at it, and Orch8 creates workflow instances automatically. The webhook payload becomes context.data.

What happens, step by step

  1. Create a trigger with a slug, sequence name, and version
  2. Point the external service's webhook URL at POST /triggers/{slug}/fire
  3. When the event fires, Orch8 creates an instance with the payload as context.data
  4. Enable or disable triggers without deleting them
  5. Use idempotency_key in the trigger config to prevent duplicates from webhook retries

Key takeaways

  • The secret field enables HMAC webhook signature validation — Orch8 rejects unsigned events.
  • Combine with idempotency_key to prevent duplicate instances from webhook retries.
  • Any HTTP client can fire a trigger — Stripe, GitHub, Zapier, your own backend, a curl command.
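A hypothetical trigger registration body. The slug, sequence name, version, secret, and idempotency_key concepts are all from this pattern, but the exact field names and the payload template syntax are assumptions:

```json
{
  "slug": "stripe-payment-failed",
  "sequence_name": "payment_recovery",
  "version": 1,
  "secret": "your-shared-hmac-secret",
  "idempotency_key": "{{payload.event_id}}"
}
```

Keying idempotency on the provider's event ID means Stripe's webhook retries collapse into a single instance.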
14

A/B Test Inside Your Workflows

ab_split

The Problem

You want to test two email subject lines, two onboarding flows, or two provider strategies. The split must be deterministic (same user always gets the same variant) and survive crashes.

How Orch8 Solves It

The ab_split block uses a hash of the instance ID to deterministically route to a variant. Weights control distribution. The same instance always takes the same path, even after a crash or restart.

What happens, step by step

  1. Define ab_split with variants and weights
  2. Each variant has a name and a blocks array
  3. The engine hashes the instance ID to pick a variant deterministically
  4. The selected variant name is available in outputs for analytics: steps.split_id.output.selected_variant
  5. Weights are relative: 50/50 = equal split, 2/1 = 67%/33%

Key takeaways

  • Deterministic hash — the same user always gets the same variant, even after a restart or migration.
  • The selected variant name is available in step output for downstream analytics and attribution.
  • Use inside parallel or sequential blocks. Variants can be arbitrarily complex sub-workflows.
  • Weights are relative integers — 1/1 is 50/50, 2/1 is 67%/33%, 3/1/1 splits three ways.
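A sketch of an ab_split block for a subject-line test. The variants/name/blocks/weight structure is described in the steps above; the exact key spellings and handler names are illustrative:

```json
{
  "type": "ab_split",
  "id": "subject_test",
  "variants": [
    { "name": "subject_a", "weight": 1,
      "blocks": [{ "id": "send_a", "handler": "send_email",
                   "params": { "subject": "Quick question" } }] },
    { "name": "subject_b", "weight": 1,
      "blocks": [{ "id": "send_b", "handler": "send_email",
                   "params": { "subject": "Saw your launch" } }] }
  ]
}
```

Downstream steps can read the winner for analytics via steps.subject_test.output.selected_variant.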
15

Send Messages Only During Business Hours

send_window · delay

The Problem

You don't want sales emails landing at 3am. SMS notifications should arrive during working hours. Follow-up delays should skip weekends and holidays.

How Orch8 Solves It

send_window restricts when a step can execute. business_days_only delays skip weekends. Holiday lists skip specific dates. Jitter adds randomness so emails don't all fire at exactly 9:00am.

What happens, step by step

  1. Set send_window with start_hour, end_hour, and days array (0=Mon, 4=Fri)
  2. Add business_days_only: true to any delay to skip weekends
  3. Add a holidays array to skip specific dates
  4. Add jitter (milliseconds) to spread sends within a window naturally
  5. Steps ready outside the window are deferred until it opens — never dropped

Key takeaways

  • send_window and delay.business_days_only work independently — combine them for full control.
  • Per-instance timezone applies to both the window and delay calculations automatically.
  • Jitter of ±2h (7200000ms) means sends are spread naturally over a 4-hour window.
  • Steps never fire outside the window — they're deferred, not dropped or rescheduled manually.
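Putting the fields from the steps above together — start_hour, end_hour, days, business_days_only, holidays, and jitter are all named in this pattern; the delay duration syntax and handler name are assumptions:

```json
{
  "blocks": [
    { "id": "send_followup", "handler": "send_email",
      "send_window": { "start_hour": 8, "end_hour": 17,
                       "days": [0, 1, 2, 3, 4] } },
    { "type": "delay", "duration": "3d",
      "business_days_only": true,
      "holidays": ["2025-12-25", "2026-01-01"],
      "jitter": 7200000 }
  ]
}
```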
16

Warm Up New Email Accounts Automatically

warmup_ramp · rate_limit

The Problem

New email accounts need to start with low volume and gradually ramp up, or email providers will flag them as spam. Managing this manually is error-prone and takes time every week.

How Orch8 Solves It

Resource pools with warmup ramps. You set a starting cap, a target cap, and a ramp duration. The engine calculates the daily limit automatically. When the cap is hit, steps are deferred to the next day — not failed.

What happens, step by step

  1. Create a resource pool for your sender domain or IP
  2. Add mailboxes with warmup configuration (start cap, target cap, duration)
  3. Engine calculates the current daily limit based on elapsed warmup days
  4. When the cap is hit, steps are deferred to the next available slot
  5. After warmup_days, the resource operates at full daily_cap

Key takeaways

  • Linear ramp is calculated automatically — Day 7 of a 14-day ramp is exactly 50% of target capacity.
  • Combine with send_window so the warmup also respects business hours.
  • Steps deferred by warmup limits are rescheduled to the next available slot — never failed.
  • Add or remove resources from the pool without touching the sequence definition.
17

Add Logging and Monitoring Without Touching Each Step

interceptors

The Problem

You want to add audit logging, PII redaction, or failure alerts to every step in a workflow. Adding it to each step individually is tedious and easy to forget.

How Orch8 Solves It

Interceptors are lifecycle hooks defined once at the sequence level. They run before/after every step, on completion, on failure, or on signal receipt. A failing interceptor never fails the step itself.

What happens, step by step

  1. Define interceptors in the sequence definition (not in individual steps)
  2. before_step fires before every step — use for audit logging, auth checks
  3. after_step fires after every step — use for PII redaction from outputs
  4. on_complete fires once when the instance finishes successfully
  5. on_failure fires once when the instance fails after retries
  6. Interceptors don't block step execution — a failing interceptor is logged, not propagated

Key takeaways

  • before_step and after_step fire for every step automatically — once defined, nothing to forget.
  • on_complete and on_failure fire once at the end. Use for DWH sync, alert routing, cleanup.
  • Interceptors receive the same step context as regular handlers — full access to params and outputs.
  • A failing interceptor is logged but doesn't affect the step result or instance state.
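A sketch of sequence-level interceptors. The four hook names come from this pattern; the exact nesting and handler names are assumptions:

```json
{
  "name": "invoice_pipeline",
  "interceptors": {
    "before_step": { "handler": "audit_log" },
    "after_step":  { "handler": "redact_pii" },
    "on_complete": { "handler": "sync_to_dwh" },
    "on_failure":  { "handler": "page_oncall" }
  },
  "blocks": [
    { "id": "process", "handler": "process_invoice" }
  ]
}
```

Defined once here, the hooks apply to every step — including steps added in later versions of the sequence.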
18

Hide Secrets From Third-Party Plugins

context_access

The Problem

A workflow step calls a third-party service that shouldn't see your API keys, tenant configuration, or internal state.

How Orch8 Solves It

context_access restricts which sections of the execution context a handler can see. The engine strips restricted sections before sending data to the handler — they're not just hidden, they're absent.

What happens, step by step

  1. Set context_access on any step to control visibility per section
  2. data: true — allow the handler to see workflow data
  3. config: false — hide API keys and tenant config
  4. audit: false — hide the audit trail
  5. runtime: false — hide engine metadata (step counts, timestamps)
  6. The stripped sections are never serialized to the worker request

Key takeaways

  • context_access is set per step — different handlers can have different visibility.
  • The restricted sections are absent from the serialized request, not just redacted in logs.
  • Combine with encryption at rest (ORCH8_ENCRYPTION_KEY) for defense in depth.
  • Works identically for gRPC handlers, external REST workers, and built-in handlers.
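The four context sections and their flags are listed in the steps above; here they are on a step (the step shape and handler name are illustrative):

```json
{
  "id": "call_plugin",
  "handler": "third_party_enrichment",
  "context_access": {
    "data": true,
    "config": false,
    "audit": false,
    "runtime": false
  }
}
```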
19

Share State Across Multiple Workflows

sessions

The Problem

An onboarding flow has multiple workflows (email verification, profile setup, first action) that need to share state. You don't want each workflow to re-fetch everything from scratch.

How Orch8 Solves It

Sessions are named, tenant-scoped containers with shared JSON data. Multiple workflow instances can reference the same session. Sessions have optional TTL and lifecycle states.

What happens, step by step

  1. Create a session with a unique key and optional shared data
  2. Create instances bound to the session via session_id
  3. Instance A writes to context.data — readable by instance B in the same session
  4. Cancel, query, or close all instances in a session with one API call
  5. Session expires automatically at expires_at, blocking new instances

Key takeaways

  • Sessions are lightweight — they're just a named container. No overhead beyond the lookup.
  • Instance A writes context.data; instance B reads it via the session — no extra API calls.
  • Use expires_at to auto-expire onboarding sessions after a window (e.g. 30 days).
  • Cancel all in-progress flows for a user with one POST /sessions/{id}/cancel call.
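A hypothetical session-creation body. The unique key, shared data, and expires_at concepts are from this pattern; the field names themselves are assumptions:

```json
{
  "key": "onboarding:user_4821",
  "data": { "plan": "pro", "email_verified": false },
  "expires_at": "2026-04-01T00:00:00Z"
}
```

Any instance created with this session's session_id then reads and writes the same shared context.data.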
20

Route Work to Specialized Worker Pools

queue_routing

The Problem

ML inference needs GPU workers. Data processing needs high-memory workers. EU compliance requires EU-region workers. You can't have all workers compete for all tasks.

How Orch8 Solves It

Add queue_name to any step. Workers poll specific queues. Different queues can run on different hardware, regions, or scaling policies.

What happens, step by step

  1. Add queue_name to any step that needs a specialized worker
  2. Start workers that poll their specific queue
  3. Steps without queue_name go to the default queue
  4. Queues are implicit — they exist when first referenced, no setup needed
  5. Scale each queue's worker pool independently based on load

Key takeaways

  • Queues are implicit — no setup required. Add queue_name to a step and the queue exists.
  • Workers that poll without a queue name claim from the default queue only.
  • Run GPU workers on GPU machines, EU workers in EU regions — scale each pool independently.
  • Tasks claimed on the wrong queue never get double-dispatched — they wait for the right worker type.
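Queue routing is a one-line addition to a step. queue_name is from this pattern; the queue and handler names are hypothetical:

```json
{
  "id": "run_inference",
  "handler": "ml_inference",
  "queue_name": "gpu-us-east",
  "params": { "model": "lead-scorer-v2" }
}
```

Only workers polling gpu-us-east will claim this step; everything else in the workflow stays on the default queue.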
21

Update Live Workflows Without Restarting Them

hot_migration

The Problem

You deployed a new version of a workflow, but thousands of instances are mid-execution on v1. You want to move them to v2 without restarting from scratch.

How Orch8 Solves It

Hot migration rebinds a running instance to a new sequence version. Completed steps keep their saved outputs. The instance continues from where it left off using the new definition.

What happens, step by step

  1. Deploy the new sequence version via POST /sequences
  2. Deprecate the old version — prevents new instances, doesn't affect running ones
  3. Test migration on one instance: POST /sequences/migrate-instance
  4. Verify the instance proceeds correctly with the new definition
  5. Bulk migrate remaining instances

Key takeaways

  • Only non-terminal instances can be migrated — completed and failed ones are left alone.
  • Completed steps keep their memoized outputs. No re-execution of finished work.
  • Test on one instance before bulk migration. Verify outputs look correct first.
  • Deprecating v1 only blocks new instances — it never touches running ones.
22

Checkpoint Long-Running Workflows for Fast Recovery

checkpoints

The Problem

A workflow runs for days or weeks (a month-long campaign, a multi-day data migration). If something goes wrong, you don't want to replay from the beginning.

How Orch8 Solves It

Checkpoints are application-level snapshots you save at any point. They capture whatever custom state you need — pagination cursors, accumulated results, external system state. On recovery, restore from the latest checkpoint instead of replaying everything.

What happens, step by step

  1. Save a checkpoint after each significant batch or phase
  2. Store application-specific state: cursors, counts, external IDs
  3. On recovery, load the latest checkpoint and continue from there
  4. Prune old checkpoints to keep storage lean
  5. The engine's built-in crash recovery handles step-level state — checkpoints are for your application state

Key takeaways

  • Checkpoints complement the engine's built-in crash recovery — which handles step-level state automatically.
  • Use checkpoints for application state the engine doesn't track: pagination cursors, running totals, external IDs.
  • Save after each meaningful milestone (page of data, batch of items) — not every step.
  • Prune regularly. For a 14-day campaign, keeping 5 checkpoints is usually enough.