137Foundry Web & App Design @137-foundry - Tumblr Blog

Free and Open Source Tools for Building a Rate Limiter

We get asked fairly often what we actually reach for when building rate limiting into a client's API, versus what we'd recommend a small team stand up on their own without hiring anyone or bringing in outside help. Here's the honest list, all free or open source, roughly in the order we'd typically bring them into a new project.

1. Redis

Redis is the backbone of most distributed rate limiters we've built, and for good reason. Its atomic increment operations make it straightforward to safely track per-client request counts across multiple application servers without race conditions producing an incorrect count, and it's fast enough that the added latency from checking it on every request is rarely noticeable. If you're only going to add one piece of infrastructure specifically for rate limiting, this is usually it.

2. Nginx

Nginx has rate limiting built into its core module set, letting you enforce a coarse limit at the gateway layer before requests ever reach your application code. This matters more than it might sound like, since a flood of rejected requests handled at the gateway never competes with legitimate traffic for your application server's database connections or worker threads. We often layer this under a finer-grained, business-aware limit implemented in the application itself.

3. Envoy

For teams running a service mesh, or wanting more sophisticated traffic management than Nginx's built-in module offers, Envoy includes a dedicated rate limiting filter designed to work with an external rate limit service. That design decouples the limiting logic from the proxy itself and makes it easier to share consistent limits across multiple services rather than configuring each proxy independently.

4. Kong

Kong is an API gateway built specifically around plugin-based request handling, and its rate limiting plugin is one of the more commonly deployed pieces of its open source offering. If you're already running an API gateway for other reasons (authentication or request transformation, say), adding rate limiting through the same layer avoids standing up yet another separate piece of infrastructure just for this one concern.

5. k6

Building a rate limiter is only half the job; testing it properly under realistic bursty traffic is the other half, and it's the half teams skip most often. k6 supports scripting arbitrary request timing patterns rather than just constant-rate load, which makes it meaningfully better suited to testing rate limiter burst behavior than tools built primarily around steady-state load generation.

6. The IETF's RateLimit Header Fields Draft

Not a tool exactly, but worth knowing about: there's an ongoing IETF effort to standardize the response headers APIs use to communicate rate limit status (remaining capacity, reset time, and so on). Following a proposed standard rather than inventing your own header naming convention makes your API easier for client libraries and tooling built around the emerging convention to work with automatically.

7. Traefik

Traefik is another open source reverse proxy and load balancer, popular in container and Kubernetes-heavy environments, with rate limiting available as one of its built-in middleware options. If your infrastructure is already containerized and you're picking a proxy layer from scratch, Traefik's rate limiting configuration integrates naturally with the same dynamic service discovery it uses for routing, which can be simpler to manage than a static configuration file in environments where services are constantly scaling up and down.

8. HAProxy

HAProxy has been a standard load balancer for a long time, and its stick-table feature can be used to implement rate limiting by tracking request counts per client IP or other identifiers directly at the load balancer layer. It's a heavier lift to configure correctly than some of the more purpose-built API gateway options on this list, but it's battle-tested at serious scale and worth considering if your infrastructure already relies on it for load balancing and you'd rather not introduce a separate tool just for rate limiting.

How These Tools Actually Fit Together in a Real Stack

In practice, most of the rate limiting setups we build combine two or three of these rather than relying on just one. A typical pattern: Nginx or Traefik at the edge enforcing a coarse limit to absorb obvious floods before they reach application servers, Redis as the shared counter backing both the edge enforcement and any finer-grained application-level limits, and k6 in the CI pipeline running load tests against realistic bursty traffic patterns before any rate limiting change ships to production. Envoy, Kong, and HAProxy tend to enter the picture once an organization already has one of them in place for other reasons (load balancing or service mesh routing, for example), at which point it makes more sense to add rate limiting to existing infrastructure than to introduce a new tool solely for this purpose.

What We'd Actually Recommend for a Small Team

If you're a small team standing this up for the first time without dedicated infrastructure headcount, start with Redis for the shared counter and either application-level logic or Nginx's built-in module for enforcement, depending on whether you need business-aware limits (different limits per subscription tier, for instance) or just a flat technical ceiling. Envoy and Kong are worth reaching for once you're running enough distinct services that centralizing rate limiting logic outside individual applications starts paying for itself, which for most teams is later than they initially expect.
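As a rough illustration of the "Redis for the shared counter plus application-level logic" setup, here is a minimal fixed-window counter sketch. Fixed window is the simplest technique to show, not necessarily the one we'd ship. The tier names, limits, key format, and the `InMemoryCounterStore` stand-in are all assumptions made for the example; the stand-in exists only so the sketch runs without a server.

```python
import time


class InMemoryCounterStore:
    """Stand-in exposing the two Redis-style calls used below (incr, expire)."""

    def __init__(self):
        self._values = {}

    def incr(self, key):
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds):
        # A real Redis server would delete the key after `seconds`. The stand-in
        # skips that, because a new window already produces a new key name.
        return True


# Hypothetical per-tier limits (requests per window), for illustration only.
TIER_LIMITS = {"free": 3, "pro": 10}
WINDOW_SECONDS = 60


def allow_request(store, client_id, tier, now=None):
    """Fixed-window check: one increment per request, keyed by client and window."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    key = f"ratelimit:{client_id}:{window}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: ask the store to clean the key up later.
        store.expire(key, WINDOW_SECONDS * 2)
    return count <= TIER_LIMITS[tier]
```

With the redis-py client, a `redis.Redis()` instance could be passed as `store`, since it offers `incr` and `expire` calls of the same shape. A production version would usually make the increment and the expiry a single atomic step rather than two separate calls.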

None of these tools solve the actual design decisions on their own: choosing token bucket versus sliding window, sizing limits so legitimate bursts don't get needlessly rejected, deciding whether to enforce per-client, per-endpoint, or both, and deciding how failures in the shared store itself should be handled. We wrote up our full thinking on those design questions in How to Design a Rate Limiter That Doesn't Punish Legitimate Bursts, which pairs naturally with whichever tools from this list you end up standing up.
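To make the token bucket option concrete, here is a minimal single-process sketch. The class name, the `capacity` and `refill_per_second` parameters, and passing the clock in explicitly are all choices made for this example. It shows why a token bucket tolerates a legitimate burst: a full bucket lets several requests through back to back, then throttles to the refill rate.

```python
class TokenBucket:
    """Single-process token bucket; the caller supplies the current time in seconds."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket created with `capacity=5` and `refill_per_second=1.0` accepts five immediate requests, rejects the sixth, and accepts roughly one more per second after that.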

Where Managed Services Fit In

Everything above is self-hosted, which is deliberate since this list is specifically about free and open source options a team can run themselves. It's worth acknowledging that managed API gateway and rate limiting services exist too, and for teams without the operational appetite to run and monitor Redis or a proxy layer themselves, a managed option can trade cost for reduced operational burden. We tend to recommend starting with the open source stack described here for most clients, since the actual operational overhead of running Redis and a proxy is smaller than it's often assumed to be, but it's a legitimate tradeoff worth evaluating against your team's actual infrastructure capacity rather than assuming self-hosted is automatically the right call for every situation.

The Honest Caveat

Every tool on this list is free to use and genuinely solid, but "free" doesn't mean "zero effort to run correctly." Redis needs monitoring and a redundancy plan of its own, Envoy and Kong both have real learning curves before a team is confident configuring them correctly, and even Nginx's rate limiting module has enough configuration nuance (burst parameters, delay behavior) that a naive first attempt often behaves differently than intended. 137Foundry helps teams get past that first-attempt gap when the tooling is right but the configuration around it needs an experienced set of eyes on it.

#webdev #opensource #api design #backend #tools

How We Decide Between a Settings Page and a First-Run Wizard

A question that comes up on nearly every product project at 137Foundry: when a new feature needs a handful of user choices before it can work well, should those choices live in a first-run wizard, a settings page, or both? Getting this wrong in either direction has a real cost. A wizard that's too long loses users before they ever reach the product, and a setting buried in a settings page nobody visits on their own means most users never configure it at all, even when they'd genuinely benefit from doing so.

The test we actually use

The question we ask first isn't "how important is this choice," it's "does the product work meaningfully worse if the user never makes this choice at all." If the answer is no, a sensible default plus an optional settings-page entry is almost always the right call. If the answer is yes, meaning the feature is confusing or broken without the choice being made, it belongs in onboarding, not buried in settings where a meaningful share of users will simply never find it.

Why we default toward settings, not wizards

Our starting bias leans toward keeping choices out of onboarding rather than into it, and that's deliberate. Every additional step in a first-run wizard is a point where a user can abandon the product entirely before they've experienced any value from it. A setting that lives in the settings page, reachable whenever a user actually wants to change it, costs nothing in onboarding friction and loses nothing in capability. The bar for pulling a choice out of settings and into a wizard has to be genuinely high to justify that tradeoff.

Where this gets genuinely hard

The harder calls are features where the default behavior is fine for most users but actively wrong for a meaningful minority, and there's no way to know in advance which group a given user falls into. Notification frequency defaults are a common example: a sensible default works for most people, but a subset finds it either overwhelming or too sparse from day one, and by the time they'd think to go looking in settings, they may have already formed a negative impression of the product.

For cases like this, we've had good results with a middle option: a single, lightweight prompt shown once, early, but separate from a full wizard flow. Something like "Want fewer or more notifications than the default? Change it now or anytime in settings," with a clear skip option that respects the user's time if they don't care.

A pattern we actively avoid

The pattern we push back on hardest with clients is a long first-run wizard that front-loads every configurable choice in the product before the user has seen any actual value. This shows up more than it should because it feels thorough from a product-planning perspective (cover every decision up front), but it consistently performs worse in practice than getting a user to a working default state fast and letting them refine from there. A wizard is borrowing trust the user hasn't extended yet, and the debt comes due in the form of abandoned signups.

A case where we got the call wrong the first time

On one project, we initially put a workspace-naming and team-invite step into the first-run wizard for a collaboration tool, reasoning that a workspace without a name or invited teammates felt incomplete. Completion data told a different story: a meaningful share of users dropped off at exactly that step, not because naming a workspace is hard, but because deciding on team structure before you've even seen the product work is a bigger commitment than most new users are ready to make on a first visit.

We moved both steps out of the wizard, gave every new workspace an editable auto-generated name, and moved team invites into a settings-adjacent prompt that surfaces naturally the first time a user tries to do something that benefits from collaborators. Wizard completion improved measurably, and team invites, counterintuitively, went up rather than down, because users were inviting teammates once they'd already found value in the product rather than being asked to commit to it sight unseen.

Why defaults do more work than most teams give them credit for

A recurring theme across projects like this: the quality of your default matters more than whether a choice lives in onboarding or settings. A genuinely well-chosen default, one based on actual usage data rather than an engineer's best guess, makes the entire onboarding-versus-settings question lower stakes, because most users never need to touch the choice at all if the default already fits them. Conversely, a mediocre default raises the stakes on the placement decision considerably, since more users will need to find and change it to have a good experience, which puts pressure on discoverability regardless of where the setting lives.

This is part of why we push clients to invest in getting defaults right before spending much energy debating wizard placement. A great default with the setting buried in an advanced settings section often outperforms a mediocre default surfaced prominently in onboarding, simply because fewer users need to intervene at all.

What we look at after shipping either choice

For anything we put in a wizard, we track completion rate for each individual step, not just overall wizard completion, because a single confusing step can be dragging down the whole flow while looking fine in aggregate numbers. For anything left in settings with a default, we track whether users who'd clearly benefit from a non-default value ever actually find and change it, which tells us whether the setting needs a nudge, a smarter default, or genuinely belongs in onboarding after all.
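A small sketch of the per-step measurement, with made-up numbers: the step names and counts below are hypothetical, chosen only to show how a healthy-looking early step can hide a weak one further in.

```python
def step_completion_rates(steps):
    """steps: list of (name, users_entered, users_completed) in wizard order."""
    rates = {}
    for name, entered, completed in steps:
        # Guard against a step nobody reached.
        rates[name] = completed / entered if entered else 0.0
    return rates


def weakest_step(steps):
    """Return the name of the step with the lowest completion rate."""
    rates = step_completion_rates(steps)
    return min(rates, key=rates.get)


# Hypothetical funnel: the middle step loses far more users than its neighbors.
example_steps = [
    ("profile", 1000, 950),
    ("workspace", 950, 600),
    ("finish", 600, 590),
]
```

Calling `weakest_step(example_steps)` singles out the middle step, which an overall completion figure alone would not point to.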

The mistake that's easy to make in either direction

Putting too much in a wizard costs you signups. Putting too little in a wizard and leaving it all to settings costs you engaged users who never realize a better configuration exists for them, quietly getting a worse experience than they would have with thirty extra seconds of upfront choice. Neither mistake announces itself loudly. Both show up slowly, in metrics that are easy to attribute to something else if you're not specifically watching for this failure mode, which is exactly why we treat this as a decision worth revisiting periodically with real usage data rather than one made once at launch and never reconsidered.

We wrote more about how we approach settings pages once a product has enough options to need real structure (grouping, progressive disclosure, and search) in our full guide to designing settings screens that scale. The onboarding-versus-settings tradeoff described here is one piece of the broader interface work we do at 137Foundry for products going through this kind of growth.

For teams thinking through their own onboarding flow, the Nielsen Norman Group has solid research on onboarding friction and abandonment specifically, and Material Design documents progressive disclosure patterns that apply directly to deciding what belongs in a first-run flow versus a settings page reachable later. Both are worth reading before your team debates this tradeoff from intuition alone, since the research on this question is more settled than most onboarding discussions tend to assume going in.

#ux #product design #onboarding #webdev #ui design

Why We Redesigned a Client's Settings Page From 40 Toggles to 12

A client came to us at 137Foundry with a specific complaint: their settings page had become the most-abandoned screen in their product, based on their own analytics. Users opened it, spent a few seconds scrolling, and left without changing anything, even when support logs showed they clearly wanted to change something specific. The page had forty visible toggles on first load, no search, and category names that made sense only to the engineers who'd built each feature.

What forty toggles actually looked like

Every one of those forty settings was defensible in isolation. Each had shipped alongside a real feature, solved a real user need at the time, and had a reasonable engineering-facing label. The problem wasn't any individual setting. It was that showing all forty to every user, regardless of whether they'd ever touch thirty-five of them, turned a page that should take fifteen seconds into one that took minutes of scrolling and confused scanning before most users gave up.

The first thing we did: measure, don't guess

Before redesigning anything, we pulled interaction data on every one of the forty settings. Nine of them had essentially zero interaction across the entire user base over the previous six months. Another dozen were changed almost exclusively during initial account setup and never touched again afterward. That data reshaped the whole plan, because it told us which dozen or so settings actually needed to be visible by default and which could move behind progressive disclosure without meaningfully inconveniencing anyone.

Grouping by what users were trying to do

We threw out the client's existing category structure entirely, since it mapped to which internal team had built each feature rather than to anything a user would recognize. Regrouping around user intent (notifications, privacy, appearance, and account among them) collapsed what had felt like eleven scattered sections into five that actually mapped to how people think about the choices they're making.

Cutting the default view down to twelve

Of the forty total settings, twelve made the cut for the always-visible default view, chosen using the interaction data rather than gut feeling about which ones "felt important." The remaining twenty-eight didn't disappear. They moved behind a clearly labeled "advanced settings" section per category, still fully accessible, just not competing for attention with the handful of settings most users actually needed on a given visit.

The one thing we almost got wrong

Early in the project we nearly cut a setting that had near-zero interaction data, on the assumption that low usage meant low importance. It turned out to control a data-export option used by a small but vocal subset of the client's power users, people who exported their data monthly as part of their own workflow outside the product. The interaction number was low because so few users needed it, not because it wasn't important to the ones who did.

That near-miss changed how we read the rest of the interaction data for the project. Low usage on its own wasn't enough justification to hide or remove a setting. We started cross-referencing low-usage settings against support tickets and direct client feedback before making a final call, and found two more settings in the same category: rarely touched, but load-bearing for a specific group of users who'd be genuinely upset to lose easy access. Both stayed visible, just moved into a more clearly labeled section rather than getting buried behind advanced settings alongside truly low-stakes options.

How we handled the settings that were genuine duplicates

Part of the original forty turned out to be functional duplicates, two separate notification toggles that had been added by different feature teams roughly a year apart, each controlling essentially the same underlying behavior with a slightly different label. Merging these took more care than it might sound like, since we had to make sure a user's existing preference on either of the old settings carried over correctly to the merged one rather than silently reverting to a default. We wrote an explicit migration for this rather than trusting a generic merge, checking both old values and preferring whichever one indicated the more restrictive preference, since defaulting toward more privacy or fewer notifications is the safer assumption when the two saved values disagree.
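The merge rule can be sketched in a few lines. This is an illustration of the "prefer the more restrictive value" idea, not the migration we shipped: the function name, the use of `None` for "never set", and the fallback default are all assumptions.

```python
def merge_notification_pref(old_a, old_b, default=True):
    """Merge two legacy toggles into one.

    Each argument is True (notifications on), False (off), or None (the user
    never set it). If the saved values disagree, the more restrictive one (off)
    wins. `default` is a hypothetical fallback for users who set neither.
    """
    saved = [value for value in (old_a, old_b) if value is not None]
    if not saved:
        return default
    # all() is False as soon as either saved value is False, so "off" wins.
    return all(saved)
```

The important property is that a user who turned either old toggle off never ends up with notifications silently switched back on.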

What changed after it shipped

Within the first month, the settings-related support ticket volume dropped by a third, almost entirely from users who could now find and change a setting themselves instead of asking support where it lived. Time spent on the page before completing a change also dropped sharply, which the client's team read, correctly we think, as a sign the page had stopped feeling like a chore.

The part that surprised the client most

What genuinely surprised the client's product team wasn't that fewer visible settings helped; they'd expected that going in. It was how much of the improvement came from renaming things, not hiding them. Several of the twelve settings that stayed on the default view kept their exact same function but got new labels written in plain language instead of the internal terms engineering had originally picked. That alone, before any restructuring, accounted for a noticeable share of the drop in "where is the setting for X" tickets.

Rewriting labels turned out to be its own project

We mentioned this briefly above, but it's worth dwelling on because it took more of the project's total time than we expected going in. Roughly a third of the forty original labels were written in language that made sense to the engineer who built the feature and nobody else: things like "eager sync" or "extended retention mode," with no explanation of what a user should expect if they flipped the switch. Rewriting every one of those into plain, specific language (what actually changes, described the way you'd explain it out loud to someone who emailed support) took longer than the restructuring work itself, but the client's team told us afterward it was the change users commented on most in follow-up interviews.

What we'd tell any team looking at a similar page

Don't start with a visual redesign. Start with interaction data on what's actually being used, because it will very likely surprise you about which settings genuinely matter to your users versus which ones just accumulated because they were easy to add at the time. The Nielsen Norman Group has research on exactly this kind of progressive disclosure decision, and our own experience on this project matched what they've found: hiding the long tail behind a clear, discoverable control reduces cognitive load without actually taking capability away from anyone who wants it.

We wrote up the broader set of patterns from this and similar projects (grouping, progressive disclosure, search, and preference migrations) in our full guide to designing settings screens that scale. If your product has a settings page that's grown past the point where the current structure still makes sense, this is exactly the kind of work we do at 137Foundry.

For teams doing this kind of audit themselves, Material Design has a useful reference section on progressive disclosure patterns worth reading before you start moving settings around.

#ux #ui design #product design #webdev #settings

A CSV Export That Almost Cost a Client Real Money, and What We Changed

A few months back, we at 137Foundry got a support message from a client asking why their weekly revenue report looked about eight percent low. Nobody had touched the reporting logic in weeks. The numbers were just quietly wrong, and the reason turned out to be a CSV import we'd built for them. It wasn't a bug in the code exactly, but a decision our parser was making silently that nobody had told it to make.

What was actually happening

The client's payment processor exported daily transaction data as CSV. Our import script read the amount column, stripped the currency symbol, and parsed it as a float. It had worked fine in testing, and it had worked fine in production for months. What changed was that the processor started including a thousands separator in larger transaction amounts: "1,204.50" instead of "1204.50". Our parser's float conversion choked silently on the comma. It didn't raise an error; it just truncated the string at the first invalid character and parsed "1" instead of "1204.50".

Nobody saw an exception. The import completed successfully every single day. The number that showed up in the report was just wrong, quietly, for every transaction over a thousand dollars, for about three weeks before the discrepancy got big enough for someone on the client's side to notice something felt off.

Why this is the failure mode that actually scares us

A crash is annoying but honest. It tells you immediately that something needs attention. A parser that succeeds and produces a plausible-looking but wrong number is far more dangerous, because there's no signal telling anyone to go check. We'd tested the happy path thoroughly. We hadn't tested what happened when the source system's export format changed slightly, which turned out to be the actual failure mode that mattered.

What we changed after that

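The fix itself was small: an explicit strip of thousands separators before the float conversion, rather than trusting the parser to handle a comma-formatted number correctly. Worth noting for anyone reproducing this: Python's built-in float() raises ValueError on a string like "1,204.50" instead of truncating it, so silent truncation of this kind points to a more lenient conversion somewhere in the parsing path. But the bigger change was to how we build these imports generally, not just for this one client.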

We now write an explicit coercion function for every numeric or date column in any import that touches money, rather than trusting a generic parse-and-hope approach, and we log a warning any time a coercion produces a value meaningfully different from what a naive parse would have produced, so a format change like this one shows up as a log entry the same day it happens instead of a client noticing a discrepancy three weeks later.
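Here is a minimal sketch of what an explicit coercion function for a money column might look like. It is an illustration under stated assumptions, not our production code: the function name, the logger name, and the two accepted formats are invented for the example. It also uses `Decimal` rather than float, which is a substitution on our part, and it simplifies the warning rule to "warn whenever thousands separators appear."

```python
import logging
import re
from decimal import Decimal

logger = logging.getLogger("imports.coercion")  # hypothetical logger name

# Only two shapes are accepted: plain digits, or digits grouped in threes.
_PLAIN_AMOUNT = re.compile(r"-?\d+(\.\d+)?")
_GROUPED_AMOUNT = re.compile(r"-?\d{1,3}(,\d{3})+(\.\d+)?")


def coerce_amount(raw, currency_symbol="$"):
    """Convert one amount cell to Decimal, refusing anything unrecognized."""
    text = raw.strip().replace(currency_symbol, "", 1).strip()
    if _PLAIN_AMOUNT.fullmatch(text):
        return Decimal(text)
    if _GROUPED_AMOUNT.fullmatch(text):
        # The value is still usable, but the source format differs from the
        # plain form, so leave a same-day trace in the logs.
        logger.warning("amount %r contains thousands separators", raw)
        return Decimal(text.replace(",", ""))
    # Never guess: an unknown shape fails loudly instead of parsing a prefix.
    raise ValueError(f"unrecognized amount format: {raw!r}")
```

The design choice that matters is the last line: anything outside the known formats raises, so a new format can't be half-parsed into a plausible-looking number.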

The habit we added that catches this earlier now

We also added a lightweight sanity check that runs after every scheduled import: compare the day's total against a rolling seven-day average, and flag anything more than a set percentage off for a human to glance at before the numbers feed into any downstream report. It's a blunt instrument that wouldn't catch every possible bug, but it's caught two more format changes since we added it, both times within a day instead of weeks.
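The check itself is only a few lines. In this sketch the function name and the 25 percent default tolerance are placeholders; the right threshold depends on how much a client's daily totals normally swing.

```python
def is_total_suspicious(today_total, previous_totals, tolerance=0.25, window=7):
    """Flag today's total if it deviates from the rolling average by more than `tolerance`.

    previous_totals: earlier daily totals, oldest first. Only the last `window`
    entries are used. `tolerance` is a fraction (0.25 means 25 percent).
    """
    recent = previous_totals[-window:]
    if not recent:
        return False  # no history yet, nothing to compare against
    baseline = sum(recent) / len(recent)
    if baseline == 0:
        return today_total != 0
    return abs(today_total - baseline) / abs(baseline) > tolerance
```

A flagged day doesn't block the import in this sketch; it only returns `True` so the caller can decide how to alert a human.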

The uncomfortable part of this story

The uncomfortable truth is that the original code wasn't badly written by the standards we'd have judged it against at the time. It handled the format we tested against correctly. The gap was that we hadn't built in a way to notice when the source system's assumptions changed underneath us, and a silent partial parse is exactly the kind of bug that traditional testing doesn't catch, because the test fixtures reflect the format you already know about, not the format that shows up six months later.

How we explained it to the client

Part of the conversation that stuck with us was how we walked the client through what happened. We could have framed it as a minor rounding issue and moved on, but that undersells what actually occurred: for three weeks, a number they were making decisions based on was wrong by a meaningful margin, and neither we nor they had any way to know it until the discrepancy grew large enough to notice by eye. We showed them the exact line of code, the exact input that triggered it, and the exact date range affected, then walked through the specific changes we were making so the same category of bug couldn't recur silently. Being precise about what actually happened, instead of vague reassurance, is what turned an uncomfortable incident into a conversation that actually rebuilt trust.

Why we didn't just patch the one bug and move on

The narrow fix, stripping thousands separators before the float conversion, would have solved this specific instance in about ten minutes. We spent considerably longer than that because a narrow fix only addresses the symptom. The actual problem was structural: nothing in our import pipeline was watching for the case where a source system's format assumptions changed underneath an import that had been running correctly for months. Fixing the comma bug without fixing that structural gap would have left us exposed to the next format change, whatever shape it happened to take, with no earlier warning than we got this time.

What surprised us most in the aftermath

The part that actually surprised us wasn't the bug itself. Formats do drift over time, and we'd have said that was obvious if you'd asked us before this happened. What surprised us was how long a genuinely wrong number can sit in a live report without anyone catching it, simply because a number that's off by eight percent still looks like a plausible number. A number that's off by eight hundred percent gets caught immediately because it's obviously broken. A number that's subtly wrong is the dangerous kind precisely because it doesn't look wrong to anyone glancing at a dashboard.

Where this leaves us now

Every new import we build for a client now starts from the assumption that the source format will eventually change in some way we didn't anticipate, rather than treating the sample files we tested against as a permanent contract. That shift in default assumption, building for drift instead of building for the file in front of us today, has done more for the reliability of our data integration work than any single parsing library or validation rule we've added since. It's a small mindset change with an outsized payoff, and it's one we'd rather have learned from a caught bug and an honest conversation with a client than from a much worse outcome further down the line.

We wrote up the full set of patterns that came out of this and similar imports (explicit type coercion, malformed row handling, encoding detection) in our guide to CSV parsing code snippets. If your team is dealing with recurring file-based imports and wants a second set of eyes on where the silent failure modes might be hiding, that's exactly the kind of review 137Foundry does for clients regularly.

For anyone building similar coercion logic, pandas' documentation covers the dtype and converters options that would have caught this specific bug if we'd used them from the start, and Python's official docs have the full standard library reference for the string and numeric parsing functions involved.

#webdev #data #csv #engineering #backend

Our Checklist for Vetting a New CSV Export Before It Touches Production

Every time we at 137Foundry start building an import against a new client system's CSV export, we run the same short checklist before writing a single line of parsing code. It's saved us from more than one bug that would otherwise have shown up three weeks into production instead of during the first hour of building the thing.

Step 1: get three real sample files, not one

A single sample file tells you what a clean export looks like. It doesn't tell you what the export looks like on a day with unusual data, a day with zero transactions, or a day where someone manually edited the source spreadsheet before exporting. We ask for at least three files spanning different dates whenever possible, specifically because format inconsistencies between exports are one of the most common sources of import bugs we see, and a single sample file will never surface that.

Step 2: check the encoding and look for a BOM

We run a quick byte-level check on the raw file before opening it in anything. A UTF-8 byte-order mark at the start (the three bytes EF BB BF) signals UTF-8-with-BOM, which needs handling or your first column header ends up with an invisible character glued to it. Non-ASCII bytes that don't decode as valid UTF-8, with no BOM, usually mean Windows-1252 or a similar legacy encoding, common from older Windows-based export tools. We've been surprised more than once by an export tool that changed its default encoding between one client delivery and the next with zero notice.
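A byte-level check along these lines can be sketched with the standard library alone. The function name is made up for the example, the fallback to `cp1252` is a guess rather than a detection, and UTF-16 files are not handled here.

```python
import codecs


def sniff_encoding(raw_bytes):
    """Return a codec name to try for a raw CSV file, based on its bytes."""
    if raw_bytes.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the BOM on decode, so the first header stays clean.
        return "utf-8-sig"
    try:
        raw_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a legacy Windows encoding. This is a guess
        # and should be confirmed against the export tool's settings.
        return "cp1252"
```

Decoding with the returned codec, for example `raw_bytes.decode(sniff_encoding(raw_bytes))`, keeps the invisible BOM character out of the first column header.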

Step 3: count columns across every row, not just the header

We run a quick script that counts fields per row across the entire file and flags any row where the count doesn't match the header. Ragged rows are common enough in real exports that finding out about them before writing the parser, rather than discovering them as a runtime exception, saves real debugging time. It also tells us upfront whether malformed rows are rare enough to skip-and-log, or common enough that something's structurally wrong with the source export that's worth raising with the client directly.
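A script of that kind might look like the following sketch, which uses Python's standard `csv` module. The function name and the choice to ignore completely blank lines are assumptions for the example.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (line_number, field_count) for each row whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []  # empty file
    expected = len(header)
    ragged = []
    for row in reader:
        # Completely blank lines come back as empty lists; skip them here.
        if row and len(row) != expected:
            ragged.append((reader.line_num, len(row)))
    return ragged
```

The length of the returned list relative to the file size is what drives the skip-and-log versus raise-with-the-client decision.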

Step 4: inspect every column's actual value distribution

For each column, we check the distinct values or a sample of the range, not just the declared meaning of the column. A column named status that's supposed to have three known values sometimes has a fourth, undocumented one that only shows up in edge cases the source system's own team forgot about. Catching this during inspection is far cheaper than catching it as an unhandled case in production three months later.
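One way to do that inspection in Python is to tally every value per column and compare the result against the values you were told to expect. The sketch below assumes the file fits comfortably in memory as counts; the function name and the example status values are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, TextIO


def column_value_counts(handle: TextIO) -> Dict[str, Counter]:
    """Tally how often each distinct value appears in each column."""
    counts: Dict[str, Counter] = {}
    for row in csv.DictReader(handle):
        for column, value in row.items():
            if column is None:
                continue  # overflow fields from a ragged row; covered by step 3
            counts.setdefault(column, Counter())[value] += 1
    return counts
```

For a status column documented as open, closed, or pending, set(counts["status"]) - {"open", "closed", "pending"} gives you the undocumented extras directly.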

Step 5: identify which columns are money, dates, or IDs, and treat them differently from the rest

Any column representing money, a date, or an identifier gets flagged for explicit, hand-written coercion logic rather than generic type inference. This is the step we're least willing to skip, because it's the one that's caused us actual production incidents in the past when we didn't apply it carefully enough. A generic parser's best guess about a currency or date format is a reasonable starting point for exploratory analysis. It's not something we're willing to trust for a client's financial reporting without an explicit check.
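To make "explicit, hand-written coercion" concrete, here is a small Python sketch. The formats are assumptions for illustration only (dollar amounts with comma thousands separators, ISO dates); a real export needs whatever formats the inspection in step 4 actually turned up. The point is that anything outside the expected format raises instead of being guessed at.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def parse_money(text: str) -> Decimal:
    """Parse an amount like "$1,234.50" into an exact Decimal, or raise."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a recognised amount: {text!r}") from None
    if not amount.is_finite():
        # Decimal accepts "NaN" and "Infinity"; neither is a valid amount.
        raise ValueError(f"not a recognised amount: {text!r}")
    return amount


def parse_date(text: str) -> date:
    """Parse a strict YYYY-MM-DD date; any other format raises ValueError."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()
```

Using Decimal rather than float keeps cents exact, and the strict date format means an ambiguous value like 03/01/2024 fails loudly rather than being silently read as the wrong month.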

Step 6: decide, out loud, what happens to a bad row

Before writing the import, we write down, in a comment or a short doc, what should happen when a row is malformed: skip and log, fail the whole import, or attempt a partial recovery. Making this decision explicit before writing the code means the eventual behavior is a deliberate choice, not whatever the parsing library happened to default to.

Step 7: run the file through the actual parser you plan to ship, not just an inspection tool

Inspection tools tell you a lot, but the final check we do is running the real production parser against the sample files, end to end, before it ever touches a live database. This catches a category of issue the earlier steps sometimes miss: a library-specific quirk, a configuration flag we forgot to set, or an interaction between two edge cases (a malformed row that also happens to contain a currency symbol, say) that only shows up when the whole pipeline runs together rather than when each concern is checked in isolation.

What we do differently for a recurring export versus a one-time import

A one-time import gets the full checklist and then we move on. A recurring export (a daily or weekly file from the same source system) gets something extra: we save structural metadata from the first several successful runs (column names, inferred types, row count ranges) and compare each new file against that baseline automatically. A format change that would otherwise slip through unnoticed shows up as a diff against the baseline the same day it happens, instead of surfacing as a mystery weeks later when a downstream number looks slightly off.
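The baseline comparison can be as simple as the following Python sketch. The dictionary layout (columns, types, row_count_range, row_count) is a hypothetical shape chosen for this example, not a description of our actual tooling.

```python
from typing import Any, Dict, List


def diff_against_baseline(baseline: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Return human-readable differences between a new file and the baseline."""
    problems: List[str] = []
    base_cols, cur_cols = baseline["columns"], current["columns"]
    missing = [c for c in base_cols if c not in cur_cols]
    added = [c for c in cur_cols if c not in base_cols]
    if missing:
        problems.append(f"missing columns: {missing}")
    if added:
        problems.append(f"new columns: {added}")
    if not missing and not added and base_cols != cur_cols:
        problems.append("column order changed")
    # Only compare types for columns present in both files.
    for col in base_cols:
        if col in cur_cols and baseline["types"][col] != current["types"][col]:
            problems.append(
                f"type of {col!r} changed: {baseline['types'][col]} -> {current['types'][col]}"
            )
    low, high = baseline["row_count_range"]
    if not (low <= current["row_count"] <= high):
        problems.append(f"row count {current['row_count']} outside {low}-{high}")
    return problems
```

An empty list means the file matches the baseline. Anything else is worth a look before the import runs.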

A checklist item we added after getting burned once

We didn't always check for duplicate rows explicitly, and we added it after a client's export tool once retried a failed batch and appended the retried rows to the same file instead of overwriting it, doubling roughly a third of that day's transactions. Now, checking for exact duplicate rows, and near-duplicates that differ only in a timestamp field, is part of the standard routine, specifically because that failure mode cost us real debugging time the one time we didn't catch it up front.
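A duplicate check along those lines might look like the Python sketch below. It treats rows as dictionaries (as csv.DictReader produces) and takes an optional list of columns to ignore, which is how the near-duplicate case is handled here. The column name exported_at in the usage note is made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

RowKey = Tuple[Tuple[str, str], ...]


def find_duplicates(
    rows: Iterable[Dict[str, str]], ignore: Sequence[str] = ()
) -> Dict[RowKey, int]:
    """Map each repeated row (minus ignored columns) to how often it appears."""

    def key(row: Dict[str, str]) -> RowKey:
        # Sort so the key doesn't depend on column order within the dict.
        return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

    seen = Counter(key(row) for row in rows)
    return {k: n for k, n in seen.items() if n > 1}
```

Calling find_duplicates(rows) catches exact repeats. Calling find_duplicates(rows, ignore=("exported_at",)) catches retried rows that differ only in their timestamp.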

How long this checklist actually takes in practice

People sometimes assume a six- or seven-step checklist means a significant time investment before any real work starts. In practice, for a typical export, the whole routine takes somewhere between fifteen and thirty minutes, most of which is running a handful of command-line tools and glancing at their output rather than writing any custom code. That's a small cost against the alternative: discovering a structural problem three weeks into production instead of before the first line of parsing code was written. We've tried skipping steps under time pressure a handful of times over the years, and every single time it's cost us more debugging hours later than the checklist would have taken up front.

Why we bother with all these steps for what looks like a simple task

None of this is complicated engineering. It's closer to a pre-flight checklist than a technical challenge, and that's exactly the point. The bugs that CSV imports produce are rarely hard to fix once you've found them. They're expensive because they hide, succeeding quietly with wrong data instead of failing loudly, and a short inspection routine up front catches most of what would otherwise surface as a confusing support ticket weeks later.

We wrote up the parsing patterns themselves (handling malformed rows, encoding detection, type coercion, and streaming large files) in our full guide to CSV parsing code snippets. If your team is about to build an import against an unfamiliar export and wants a second set of eyes on the risky parts before they become production bugs, that's the kind of review work 137Foundry does regularly.

For structural validation specifically, csvlint.io automates a chunk of step 3 for us, and Python's official documentation has the full reference for the encoding and parsing functions this checklist assumes you'll eventually reach for.

#webdev #data #engineering #backend #checklist

5 real-time UX patterns we reach for when a WebSocket drops

Solid reconnection logic on the backend doesn't matter much if the interface doesn't tell users what's happening while it runs. These are five UX patterns we reach for at 137Foundry whenever we're building a live dashboard or collaborative feature, patterns that keep users oriented during the seconds or minutes a connection is quietly recovering in the background.

1. A visible connection-state indicator

Not a toast that flashes once and disappears, but a persistent, small indicator (a dot, a label, a subtle color change) somewhere consistent in the interface that reflects one of three states: connected, reconnecting, disconnected. Users glance at it the same way they glance at a wifi icon on their phone, quickly and without thinking, and it answers the question "can I trust what I'm looking at right now" before they even have to ask it out loud.

The Nielsen Norman Group's writing on system status visibility covers the broader usability principle this pattern is built on, and it applies just as directly to real-time connection state as it does to any other kind of background system status.

2. Timestamping stale data explicitly

If a dashboard stops updating, showing "last updated 2 minutes ago" next to the data is far more honest than showing the data with no context at all. Users can make their own judgment about whether two-minute-old data is fine for their purposes or not, but only if the interface actually tells them the data is old instead of presenting it as fresh.

3. Disabling actions that assume a live connection

If a feature depends on real-time acknowledgment (sending a chat message, submitting a collaborative edit, confirming a trade), disable the affected controls or clearly mark them as queued while disconnected, rather than letting users act into a void and wonder later whether anything actually happened on the other end.

4. A manual retry option after enough automatic attempts fail

Automatic reconnection with backoff is the right default, but after several failed attempts (our threshold is usually five or six), give the user a manual "try again" button instead of leaving them staring at an indefinite spinner with no way to intervene. Sometimes the underlying issue needs a page reload or a network switch on the user's end, and a manual control communicates that possibility instead of implying the app will eventually sort itself out on its own.
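The logic behind the indicator and the manual retry button is small enough to sketch. This version is in Python for consistency with our other snippets, though in a real product it would live in whatever language the client is written in. The names are hypothetical, and the threshold of five is one reading of the "five or six" above.

```python
from enum import Enum


class ConnectionState(Enum):
    CONNECTED = "connected"
    RECONNECTING = "reconnecting"
    DISCONNECTED = "disconnected"


# Assumption: give up on automatic retries after five consecutive failures.
MAX_AUTO_ATTEMPTS = 5


def connection_state(socket_open: bool, failed_attempts: int) -> ConnectionState:
    """Derive the state the indicator should show."""
    if socket_open:
        return ConnectionState.CONNECTED
    if failed_attempts < MAX_AUTO_ATTEMPTS:
        return ConnectionState.RECONNECTING
    return ConnectionState.DISCONNECTED


def show_manual_retry(state: ConnectionState) -> bool:
    """The "try again" button only appears once automatic retries have stopped."""
    return state is ConnectionState.DISCONNECTED
```

Deriving the state from two plain inputs, rather than setting it in scattered event handlers, keeps the indicator and the retry button from ever disagreeing with each other.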

5. Optimistic local state with a rollback path

For actions taken while reconnecting (queued messages, in-progress edits), show them in the UI immediately as pending, then reconcile once the connection returns and the server confirms or rejects them. Silently discarding a user's in-progress action because the network hiccuped is one of the fastest ways to lose someone's trust in a product, especially if it happens more than once.

A pattern we almost shipped and pulled back

Early on, we tried a version of the connection-state indicator that auto-hid itself after a few seconds once the connection recovered, on the theory that users didn't need to keep seeing a "reconnected" confirmation once things were back to normal. In practice, users who'd noticed the earlier "reconnecting" state and then glanced away came back to a UI that looked exactly like it had before anything happened, with no way to confirm whether the gap had actually been resolved or whether they were still looking at stale information. We ended up keeping a brief, deliberate "reconnected, you're up to date" confirmation state for a couple of seconds rather than snapping straight back to the normal connected indicator, specifically to close that loop for anyone who'd noticed the disruption in the first place.

The lesson generalized past this one feature: any state your UI shows during a failure needs a matching, equally visible state confirming resolution. An interface that announces trouble loudly and then recovers silently trains users to distrust the "everything's fine" state just as much as the "something's wrong" one, since they have no way to tell the difference between actual recovery and the indicator simply timing out. It's a small detail, but it's the kind of detail that separates a product that feels trustworthy under real-world network conditions from one that only feels trustworthy in a demo on office wifi.

How we decide which patterns a given feature actually needs

Not every real-time feature needs all five patterns at once. An internal analytics dashboard viewed by a handful of employees can probably get away with just the connection-state indicator and stale-data timestamping, since the cost of someone glancing at slightly outdated numbers for a minute is low. A collaborative editing feature or anything involving money or irreversible actions needs the full set, including optimistic state with a real rollback path, because the cost of a silently lost or duplicated action is much higher. We size the UX investment to the actual cost of getting it wrong for that specific feature, rather than applying a single fixed checklist to every real-time surface in a product regardless of stakes.

Where these patterns live in our own process

We keep this list as a literal design review checklist for any new real-time feature, reviewed before the first line of frontend code gets written, not retrofitted after launch once someone notices the UX feels confusing during an outage. Retrofitting is possible but always costs more, both in engineering time and in however much user trust got spent during the period the feature shipped without it.

What we measure to know if the patterns are actually working

We track a simple support-ticket tag for anything mentioning stale data, a frozen dashboard, or a lost action, and watch the rate of that tag over time as we roll these patterns out to a given feature. A dropping rate after we add the connection-state indicator or the optimistic-with-rollback pattern to a specific surface is a much better signal than any amount of internal confidence that the UX is now "handled." User-reported confusion is the ground truth here, not whether our own team finds the reconnecting state adequately communicated.

Making these patterns actually reliable

None of these patterns require the backend to be perfect. They require the frontend to be honest about what the backend currently knows, which is a genuinely different design goal than making the reconnection process invisible. Trying to hide reconnection entirely from the user is usually the wrong instinct: a brief, clearly communicated "reconnecting" state builds more trust than a UI that pretends nothing happened, especially once a user notices the data was stale for longer than the interface admitted.

Material Design's guidance on communicating offline states is worth a look if you want a broader design-system perspective on these same ideas applied outside the specific case of WebSockets, and it lines up closely with the patterns above even though it wasn't written with real-time dashboards specifically in mind.

We put these UX patterns together with the actual backoff, heartbeat, and message-replay mechanics that make the "reconnecting" state trustworthy rather than purely cosmetic, in our guide to WebSocket reconnection strategy, if you want the full technical pattern behind the UI states described here. More of our writing on real-time products and web engineering lives at 137foundry.com.

#webdev #ux #product design #websockets #frontend

Behind the scenes: how we handle WebSocket reconnection storms during our own deploys

Every time we at 137Foundry roll out a new backend version for a client's real-time dashboard, we cause the exact failure our own reconnection code is designed to survive. The deploy cycles connections. Every client connected to the instance being replaced gets disconnected within the same few seconds. If our backoff and jitter logic weren't doing their job properly, we'd be reproducing a self-inflicted outage every single release, which is not a great look when the whole point of the feature is reliability.

What actually happens during a rolling deploy

A rolling deploy takes backend instances out of rotation one at a time, drains their connections, and replaces them with new code. From the client's perspective, this looks identical to any other unexpected disconnect: the socket closes, no warning, no context about why. Every client connected to the instance being replaced gets dropped at roughly the same moment, which is a meaningfully different scenario than a single user's connection dropping in isolation.

If every one of those clients used a fixed retry delay, they'd all reconnect in the same instant, hit the next instance in rotation simultaneously, and potentially trigger the exact same cycle again if that instance is also mid-drain from the same deploy. We've seen this happen on other systems before we tightened our own jitter implementation, a self-inflicted retry storm caused entirely by deploy tooling instead of any actual outage in the underlying infrastructure.

The fix we actually ship

Our reconnect logic uses exponential backoff with roughly 25 percent jitter on every attempt, capped around 20 seconds. During a normal deploy, this spreads client reconnection attempts across a multi-second window instead of a single spike, which is the difference between a deploy nobody notices and a deploy that looks like a production incident to whoever's watching the dashboards at the time.
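For reference, the delay calculation can be sketched in a few lines of Python. The 25 percent jitter and the 20-second cap come from the description above. The one-second base, the doubling factor, and the function name are assumptions for the example.

```python
import random


def reconnect_delay(
    attempt: int,
    base: float = 1.0,      # assumption: first retry after about one second
    cap: float = 20.0,      # ceiling on any single wait, in seconds
    jitter: float = 0.25,   # plus or minus 25 percent of the nominal delay
    rng=random,
) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    nominal = min(cap, base * (2 ** attempt))
    spread = nominal * jitter
    # Randomise around the nominal delay so clients dropped together
    # don't all come back in the same instant; never exceed the cap.
    return min(cap, nominal + rng.uniform(-spread, spread))
```

With these defaults the first attempt waits somewhere between 0.75 and 1.25 seconds, and from the sixth attempt on every wait falls between 15 and 20 seconds.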

We also stagger our own rolling deploys deliberately, cycling instances with a delay between each one specifically so client reconnection windows don't overlap across multiple instances going down close together. It's a small operational habit that costs us a few extra minutes per deploy and saves us from ever having a routine deploy masquerade as a customer-facing incident.

The part that actually took the longest

The part that took longest to get right wasn't the backoff curve itself; that's a well-understood pattern with plenty of reference implementations. It was making sure our sequence-numbered replay buffer correctly backfilled whatever a client missed during the drain window, so a dashboard reconnecting after our own deploy shows the exact same data it would have shown if the deploy had never happened at all. Getting the replay window wrong in either direction (too short and you drop legitimate updates, too long and you risk replaying stale data as if it were current) took several iterations to tune correctly against our own production traffic patterns.
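To show the general shape of a sequence-numbered replay buffer, here is a deliberately simplified Python sketch. It bounds the window by message count rather than by age, which is an assumption made to keep the example short, and the class and method names are hypothetical rather than taken from our production code.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayBuffer:
    """Keeps the most recent messages so reconnecting clients can catch up."""

    def __init__(self, max_messages: int = 1000) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._messages: Deque[Tuple[int, str]] = deque(maxlen=max_messages)
        self._next_seq = 1

    def append(self, payload: str) -> int:
        """Store a message and return the sequence number assigned to it."""
        seq = self._next_seq
        self._messages.append((seq, payload))
        self._next_seq += 1
        return seq

    def replay_after(self, last_seen: int) -> Optional[List[Tuple[int, str]]]:
        """Messages newer than last_seen, or None if a full resync is needed."""
        if last_seen >= self._next_seq - 1:
            return []  # client is already up to date
        oldest = self._messages[0][0] if self._messages else self._next_seq
        if last_seen + 1 < oldest:
            return None  # the gap starts before the buffer does
        return [(seq, payload) for seq, payload in self._messages if seq > last_seen]
```

On reconnect the client sends the last sequence number it processed. A list back means it can apply those messages in order, and None means the gap is too old to backfill, so it should reload its state from scratch.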

We also learned the hard way that health checks matter here too. An instance that's draining connections but still reports healthy to the load balancer for a few extra seconds will keep receiving new connection attempts it's about to reject, which just adds more reconnection churn to an already busy moment. Coordinating the drain signal with the load balancer's health check removed a meaningful chunk of unnecessary reconnect attempts during every deploy.

The monitoring change that finally gave us confidence

For a while, we only found out our reconnection handling was working because nobody complained, which isn't a great signal on its own since it's just as consistent with nobody looking. What actually changed things was adding a specific metric: the count of active WebSocket connections per instance, graphed alongside our deploy markers. A healthy deploy shows a smooth, staggered drop and recovery in that graph as instances cycle one at a time. A deploy with a reconnection storm problem shows a sharp cliff followed by a slow, uneven recovery, and you can see it on the graph before a single customer notices anything on their end.

We also added an alert on the specific rate of new connection attempts per second, since a retry storm shows up there first, well before it shows up as a customer complaint. Catching it in that metric during a deploy window, rather than in a support ticket an hour later, is the difference between a two-minute non-event and an actual incident review.

Why this matters beyond our own infrastructure

If you're shipping real-time features and your own deploys are the thing triggering your reconnection storms, you're not doing anything wrong; it's an expected side effect of rolling deploys on any stateful, long-lived connection like a WebSocket. The fix isn't avoiding deploys or freezing your release cadence. It's making sure the reconnection path you built for network failures also holds up cleanly for the failures you cause on purpose as part of normal operations.

The general reference on rolling deployment strategies from the Kubernetes documentation covers the orchestration side of graceful connection draining if you're running on that kind of infrastructure, and it's worth reading even if you're not using Kubernetes directly, since most managed platforms implement some variant of the same drain-then-replace pattern.

What we'd tell a team just starting out

If you're setting this up for the first time, don't try to build the deploy dashboard and the staggered rollout and the jitter tuning all in the same sprint. Ship jitter first; it's the highest-leverage, lowest-effort change. Add the connection-count metric next, since it gives you visibility into whether the jitter is actually working before you invest further. Staggered rollout timing and a dedicated deploy dashboard are refinements worth adding once you've confirmed the basics are solid, not prerequisites to shipping reconnection logic that works.

A tool we added specifically for this

We eventually wired a small Grafana dashboard specifically for deploy windows, showing active connection count, new-connection rate, and resync-required rate side by side with our deploy markers on the same timeline. Having all three on one screen during a release turned "does this deploy look normal" from a gut-feel judgment into something anyone on the team could check in ten seconds, including someone who wasn't the original author of the reconnection code.

One more habit: deploying during genuinely quiet windows

None of the engineering above removes the value of a boring operational habit: we still schedule our own riskier real-time deploys during lower-traffic windows when we have the choice. Fewer concurrently connected clients means a smaller reconnection wave even in the worst case, and it gives us a bigger margin for error while we're still validating a new release candidate against production traffic. The jitter and staggered rollout handle the technical side of the problem; picking a quieter deploy window is the low-effort, high-value habit that reduces blast radius on top of it.

We wrote up the full pattern (heartbeat detection, jittered backoff, and the replay mechanism together) in our guide to building a WebSocket reconnection strategy that doesn't lose messages. If you want to see more of how 137Foundry approaches this kind of production reliability work, our homepage has the rest of our writing on real-time systems and web application engineering.

#webdev #programming #software engineering #websockets #engineering

What We Learned Auditing Loading States Across a Dozen Client Dashboards

Over the last year we've done loading state audits as part of maybe a dozen different engagements, usually tacked onto a broader front-end review rather than requested on their own. Nobody asks for a "loading state audit" specifically. But the pattern of what we find is consistent enough now that it's worth writing down.

How These Audits Usually Start

A client brings us in for something else entirely (a performance review, an accessibility pass, a general front-end health check), and loading states come up as a side finding almost every time, not because we go looking for them specifically but because they're one of the first things a systematic review surfaces. Once you're clicking through every major flow in an application looking for issues, loading states are impossible to miss, because you hit one on nearly every screen.

We started keeping informal notes after the third or fourth engagement where the same three issues showed up independently, on completely unrelated codebases, built by different teams, in different frameworks. That kind of consistency across unrelated projects usually means the root cause isn't a specific team's mistake. It's a gap in how loading states get taught and reviewed industry-wide.

The Same Three Problems, Almost Every Time

The first is a progress bar that isn't measuring anything real. Somewhere in the codebase, a developer decided a spinner "felt incomplete" and added a bar that fills based on elapsed time rather than actual completion. It looks more informative than a spinner and is actually less honest, because it's implying a measurement that doesn't exist. Users notice this faster than teams expect, usually within the first few uses, and it quietly erodes trust in every other progress indicator in the product afterward.

The second is a loading state with no timeout. If a request stalls, silently, without ever resolving or erroring, the spinner just keeps spinning. We've found this in production more times than we'd like to admit. The fix is small (a timeout that swaps to a retry state after 8 to 15 seconds) but it's almost never there by default.
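The timeout itself is a few lines in most languages. Here is the shape of it as a Python asyncio sketch, to stay consistent with our other snippets; a browser client would do the equivalent with its own timer or abort mechanism. The function name, the returned state labels, and the 10-second default (picked from inside the 8 to 15 second range above) are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Optional, Tuple, TypeVar

T = TypeVar("T")


async def load_with_timeout(
    request: Awaitable[T], timeout_s: float = 10.0
) -> Tuple[str, Optional[T]]:
    """Return ("loaded", result), or ("retry", None) if the request stalls."""
    try:
        # wait_for cancels the underlying request when the timeout expires.
        result = await asyncio.wait_for(request, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "retry", None
    return "loaded", result
```

The caller swaps the spinner for a retry state whenever the first element is "retry", so a stalled request can never leave the spinner running indefinitely.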

The third, and the one that surprises people most, is loading states that are completely invisible to screen reader users. A spinner is a purely visual signal. Without an aria-live region announcing when the load completes, a screen reader user gets total silence during the wait and no explicit confirmation that content actually appeared. MDN's documentation on ARIA live regions is the reference we point every team to for this, and WebAIM has good practical examples of applying it to real interfaces, not just spec text.

Why This Keeps Happening

Loading states get built under time pressure, usually as one of the last things wired up before a feature ships, and they get built once and never revisited unless something breaks visibly. Nobody files a bug report that says "your loading state doesn't announce to screen readers," because the people affected by that gap usually can't tell whether it's a bug or expected behavior. It just quietly fails for a subset of users who don't have a clear channel to report it.

The fix isn't more design time. It's a short checklist applied consistently: does the timing match what the system actually knows, is there a timeout, is the completion announced, does the skeleton shape match the real content closely enough to avoid a layout jump. Running through those four questions on any loading state we touch catches the majority of what we find in audits, and we've started handing this same short checklist to client teams directly so they can run it themselves between our engagements rather than waiting for the next audit to surface the same recurring issues again.

What Surprised Us Most

Going in, we expected the accessibility gap to be the rarest finding, something only teams with an existing accessibility practice would have thought about at all. It turned out to be almost universal in the other direction: across the dozen engagements, only one had any aria-live handling around loading state transitions, and that one had it because a specific accessibility audit had flagged it months earlier as a standalone issue.

That tells us the gap isn't really about team skill or care. It's that loading state accessibility sits at the intersection of two things nobody explicitly owns: it's not quite a design responsibility (designers think about the visual shimmer, not the screen reader announcement) and it's not quite treated as a core engineering responsibility either (engineers wire up the data fetching and consider the visual state "done" once it renders correctly). It falls into the gap between those two ownership boundaries more often than almost any other accessibility issue we find.

The fix we recommend is organizational as much as technical: bake the aria-live pattern into whatever shared component or hook handles loading states in the codebase, so it's automatic for every feature built on top of it rather than something each team has to remember to add individually.

The One That Was Hardest to Explain to Stakeholders

The fake progress bar finding is consistently the hardest one to get buy-in on fixing, because it usually tests well in isolation. Show a stakeholder a mockup with a smooth, steadily filling progress bar next to a plain spinner, and the progress bar reads as more polished, more informative, more "finished" as a piece of design. The problem only shows up over repeated real use, once a user has seen the bar jump or stall inconsistently enough times to stop trusting it, and that kind of trust erosion doesn't show up in a single mockup review.

The argument that eventually works is usually a concrete example from the client's own product: pulling up a session recording where a real user watched a fake progress bar behave inconsistently, then comparing it to how a plain, honest spinner with a status label would have handled the same wait. Seeing the actual user reaction (hesitation, repeated clicking, abandoning the flow) tends to land better than an abstract argument about honesty in interface design. Once a team has watched that footage once, the fake progress bar tends to get fixed without much further debate, and it usually becomes the example that gets pulled up the next time a similar shortcut is proposed elsewhere in the product. We've started keeping a short highlight reel of these moments across engagements, with permission, specifically because they do more to change a team's default habits than another slide of abstract UX guidance ever does.

"None of the loading state problems we find are hard to fix individually. What's hard is that they're spread across dozens of components, built by different people, at different times, with no shared checklist. That's a process gap more than a skill gap." - Dennis Traina, founder of 137Foundry

We put the full breakdown of spinners versus skeleton screens versus real progress bars, and when each one is actually the right call, into a longer guide here: designing loading states that don't feel like lying to users. If you want an outside pair of eyes on your own product's loading patterns, that's a normal part of the front-end work we take on at https://137foundry.com.

#ux #webdev #product design #accessibility #137foundry

Why We Replaced Spinners With Skeleton Screens on a Recent Client Build

A few months ago we shipped a dashboard rebuild for a client whose users were filing tickets that all said some version of the same thing: "the app feels slower than it used to." The odd part was that our own performance instrumentation said the opposite. Median load time had actually dropped by about 15% compared to the previous version.

We spent a week convinced the metrics were lying before we figured out the metrics were fine. The interface was lying.

Where We Started Looking

The first instinct on a project like this is to distrust the monitoring, so we spent the first couple of days re-verifying our own numbers before we trusted them. Real user monitoring, synthetic tests from three different regions, server-side timing logs, all of it agreed: things had gotten faster, not slower, on every dimension we were tracking.

That's a genuinely uncomfortable place to be as an engineering team, because "the users are wrong" is rarely the right conclusion, and "our metrics are lying" was demonstrably false once we'd checked it three different ways. The only remaining explanation was that we were measuring the wrong thing entirely, that the users' experience of speed and our instrumentation of speed had quietly diverged.

What Was Actually Happening

The old dashboard had a full-page spinner that appeared the instant you navigated to it and stayed until every widget on the page had finished loading, sometimes six or seven separate data fetches. The new version was objectively faster in aggregate, but it rendered its first content later relative to when the spinner appeared, because we'd restructured the data fetching to be more efficient in total but less front-loaded.

Users don't experience total time. They experience the interval between "I did something" and "I saw something happen." We'd made the first number better and the second number worse, and the second number was the one people actually felt.

The Fix Wasn't a Backend Change

We replaced the full-page spinner with a skeleton layout matching the dashboard's actual grid: placeholder cards in the exact positions the real widgets would occupy, each one resolving to real content independently as its data arrived, instead of waiting for all seven fetches to finish together.

This meant the page started looking "alive" within a few hundred milliseconds, even though the slowest widget on the page still took the same amount of time it always had. Nothing about the backend changed. What changed was that the interface stopped hiding its own progress from the user.

Nielsen Norman Group's research on response time thresholds was a useful gut check while we made this call. The window between one and ten seconds, where most of these individual widget loads fell, is exactly the range where interface design determines whether a wait feels managed or ignored.

What We'd Do Differently Next Time

The one thing we underestimated was how closely the skeleton shapes needed to match the real content. Our first pass used generic rectangular placeholders that didn't account for the fact that one widget rendered a chart and another rendered a short text summary. When the real content swapped in, the layout shifted more than we expected, which reintroduced a smaller version of the same "something feels off" complaint we were trying to fix.

The second pass matched the skeleton shapes to the actual rendered dimensions of each widget type, checked against web.dev's Cumulative Layout Shift guidance, and that fixed it. It's a detail that's easy to skip under deadline pressure and expensive to skip in practice.

How We Verified It Actually Worked

We didn't just ship the skeleton version and assume the problem was solved. We watched a handful of session recordings from before and after the change, specifically looking at what users did in the first two seconds after navigating to the dashboard. Before the change, a meaningful number of sessions showed a click on the browser refresh button within the first second, a strong signal that the blank spinner state read as "nothing is happening" rather than "loading."

After the change, that refresh-click pattern dropped close to zero. Nobody was refreshing a page that was visibly already rendering content, even partial, placeholder content. That single behavioral signal told us more about whether the fix worked than any aggregate timing metric did, because it captured exactly the thing we were trying to fix: the moment of uncertainty where a user decides whether to trust that something is happening.

Ticket volume mentioning "slow" or "frozen" dropped by roughly half over the following month, without a single additional backend change. The lesson generalized past this one project: for any product where users describe something as feeling slow despite metrics saying otherwise, the fix is very often in how the wait is communicated, not in the wait itself.

The Part That Took Longer Than Expected

Building the skeleton layout itself took about two days. Getting the skeleton shapes to actually match the real widget dimensions took closer to a week, because it meant going back through every widget type in the dashboard, measuring its real rendered dimensions across a range of realistic data (a chart widget with three data series looks different from one with twelve), and building a matching placeholder for each variant rather than one generic block reused everywhere.

That ratio surprised us. We'd budgeted the project assuming the skeleton implementation itself was the hard part. It turned out the shimmer animation and placeholder markup were trivial. The actual work was in the audit: figuring out what the real content shapes looked like across enough realistic scenarios that the skeleton didn't introduce its own layout shift once real data arrived. If you're estimating a similar project, budget the audit time separately from the implementation time. They're genuinely different kinds of work, and treating them as one estimate is how this kind of project ends up running over its original timeline without anyone being able to point to a specific reason why.

"The lesson that keeps repeating on projects like this is that perceived speed and actual speed are two different engineering problems, and most teams only budget time for one of them." - Dennis Traina, founder of 137Foundry

A few weeks after this project wrapped, a second client asked us to look at a strikingly similar problem on a completely different codebase: a mobile-first booking flow where the reported complaint was "the app hangs when you tap search." It had the same root cause (a blank screen during a genuinely brief wait) and the same fix (a skeleton preview of the results list shape). It's become one of the more reliable diagnostic shortcuts we reach for now: when the metrics and the complaints disagree, check what the user actually sees during the wait before assuming either one is wrong.

We wrote up the fuller framework, when to use a skeleton versus a spinner versus a real progress bar, in a longer guide if you want the details beyond this one project: designing loading states that don't feel like lying to users. And if your product has this exact "the metrics say fast but users say slow" gap, it's usually a loading state problem before it's anything else, which is the kind of front-end work we do a lot of at 137Foundry.

#ux #ui design #product design #webdev #137foundry

Why We Added Circuit Breakers to Every External API Call in Our Data Jobs

We hit the same wall on three different client projects in the same month: a scheduled data sync job that normally finished in under fifteen minutes suddenly took hours, and every time, the root cause traced back to the same thing. A third-party API the job depended on had a bad morning, and our retry logic, doing exactly what we told it to do, kept hammering it anyway.

The Pattern We Kept Seeing

Retries are great for the failure they're designed for: a single dropped connection, a load balancer blip, something that resolves itself in seconds. They're a bad fit for the failure we kept running into, which was sustained unavailability lasting anywhere from twenty minutes to a couple of hours. During that window, every retry attempt was wasted work, and across a few thousand records in a batch, the wasted time added up fast.

Worse, none of these outages showed up as an alert. The job hadn't technically failed. It was just very, very slow, quietly grinding through timeouts while nobody on the team knew anything was wrong until someone downstream asked why the numbers looked stale.

What We Changed

We now wrap every external API call in a circuit breaker, scoped to that one specific endpoint, not the whole job and not the whole API provider if it exposes multiple endpoints with different reliability profiles. The breaker tracks failures over a rolling window rather than counting strict consecutive failures, since a single dropped connection in an otherwise healthy run shouldn't be enough to trip it.

Once the failure rate crosses our threshold, the breaker opens. New calls fail immediately, no network request made, and whatever record was being processed gets written to a pending table instead of being retried into the void. A separate scheduled sweep picks that table back up once the breaker's half-open probe confirms the dependency has recovered.

The nightly sync job that used to take four hours during an outage now finishes in about two minutes, with the deferred records sitting in a queue waiting for the next successful sweep. An alert fires the moment the breaker opens, which means someone on our team knows about the outage in real time instead of finding out from a client the next morning.
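
The flow described above (rolling-window failure tracking, fail-fast while open, a half-open probe after a cooldown, and deferral of undelivered records) can be sketched in Python roughly like this. It is a minimal illustration under stated assumptions, not a production wrapper: the class and function names, the default numbers, and the in-memory `pending` list standing in for a durable table are all hypothetical.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    # Default numbers are illustrative assumptions; tune them per endpoint.
    def __init__(self, max_failures=5, window=20, cooldown_s=45.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, *args):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast, no network request made")
            self.state = "half_open"  # cooldown elapsed: let one probe through
        try:
            result = fn(*args)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.results.clear()  # probe succeeded: start from a clean window
            self.state = "closed"
        self.results.append(True)

    def _on_failure(self):
        self.results.append(False)
        failed_probe = self.state == "half_open"
        if failed_probe or self.results.count(False) >= self.max_failures:
            self.state = "open"
            self.opened_at = self.clock()


def sync_records(records, send, breaker, pending):
    """Send each record; anything that cannot be delivered is deferred."""
    for record in records:
        try:
            breaker.call(send, record)
        except Exception:
            pending.append(record)  # stand-in for a durable pending table
```

A real version would persist `pending` to a database table and emit an alert on the transition to open. Both are left out here to keep the state machine readable.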

The Part We Almost Skipped

Early on we built the breaker without the pending queue, purely to stop the wasted retries. That's better than nothing, but it just trades one problem for another: the job finishes fast, but the records that would have synced during the outage are gone unless you catch them somewhere. Pairing the breaker with a durable queue is what actually closes the loop, and it's the piece we'd tell anyone building this pattern not to skip.

Where We Landed on Thresholds

We don't use one universal threshold across every integration. A payment-adjacent API gets a tighter threshold, because we'd rather stop early than keep hitting something tied to money movement. A less critical data feed gets more room to fail before the breaker trips. The circuit breaker design pattern on Wikipedia is a good reference if your team hasn't implemented one before. It's a genuinely old idea borrowed from electrical engineering, and it holds up well in software.

We also log every state transition into our tracing setup. OpenTelemetry made this straightforward to wire in alongside metrics we already collect, and having a clean record of exactly when a dependency went down and came back has already been useful in a couple of vendor conversations where "it feels like they're down a lot" needed actual numbers behind it.

How This Changed Our Incident Reviews

Before the breaker, our postmortems for these outages were frustratingly vague. We'd know roughly when a job started running slow and roughly when it finished, but the actual outage window on the dependency's side was reconstructed from scattered timeout errors that didn't line up cleanly. Now the breaker's open and half-open transitions give us a precise timeline: exactly when we detected the dependency was unhealthy, exactly how long it stayed unavailable, and exactly when it confirmed recovery. That precision has made our retros with clients shorter and more useful, because we're not spending the first twenty minutes just agreeing on what happened.

It's also changed how we talk to the vendors on the other end of these APIs. A vague "your service felt down for a while last week" doesn't get much traction. "Your API returned failures for 43 minutes on Tuesday, here's the exact window, and this is the third time this quarter" is a very different conversation, and one that's actually moved the needle on getting a partner to take a reliability issue seriously.

What We'd Tell a Team Building This For the First Time

Start with the one dependency that's actually caused you pain, not every external call your jobs make. We made the mistake early on of trying to wrap everything at once, and ended up with a pile of breakers tuned on guesswork rather than real traffic patterns. Pick the integration that's generated the most support tickets or the most confused Slack threads about why a job ran long, wrap that one first, watch it for a couple of weeks, and let the thresholds come from what you actually observe rather than a default copied from a tutorial.

The other thing we'd say: build the pending queue at the same time as the breaker, not after. It's tempting to ship the breaker alone because it's the more interesting engineering problem, but a breaker without a queue just changes how you lose data, from "job takes forever" to "job finishes fast and quietly drops work." Neither is acceptable once you know the difference, and building both together from day one is barely more work than building the breaker alone.

Why We're Writing This Up

This isn't a novel pattern. It's decades old and well documented. What surprised us was how many of our own automation jobs didn't have it, relying entirely on retries and hoping outages would stay short. If you're running scheduled jobs against third-party APIs and you've ever had a run finish hours late with no alert firing, there's a decent chance you're in the same spot we were.

We wrote up the fuller version of how we approach this, thresholds, the pending queue design, and a walked-through example, over on 137foundry.com if you want the longer read. And if reliability work like this is something your team is trying to get ahead of, 137Foundry is the place we do this kind of work day to day.

None of this required exotic tooling. A small in-house wrapper class, a database table for the pending queue, and a scheduled sweep job were enough to get the whole pattern working across every client project where we've since rolled it out. The hard part was never the code. It was recognizing that the outages we kept treating as one-off bad luck were actually the same predictable failure mode showing up again and again, and finally building the thing that catches it instead of writing another postmortem about it.

#automation #data #integration #python #engineering

How We Tune Circuit Breaker Thresholds for Different Third-Party APIs

Once we started wrapping external API calls in circuit breakers across client projects, the question that came up constantly was: what numbers do we actually use? There's no single right answer, and copying a tutorial's default threshold is how you end up with a breaker that either never trips or trips constantly on noise. Here's the process we actually walk through.

Step 1: Look at How the API Fails, Not Just That It Fails

Before picking a threshold, we pull the last quarter of error logs for that specific integration and look at the failure pattern. Does it fail in short, isolated blips, a handful of scattered timeouts across weeks? Or does it fail in clusters, dozens of consecutive failures concentrated in short windows? These two patterns need very different thresholds. Isolated blips call for a loose threshold so the breaker doesn't trip on noise. Clustered failures call for a tighter one, since a cluster starting is a strong early signal that a real outage is underway.

Step 2: Weight the Threshold by What the Call Actually Does

We treat calls differently based on what happens if they fail silently for too long. An integration tied to payment status gets a tight threshold, two or three failures and the breaker opens, because we'd rather stop early and investigate than keep hitting something tied to money movement while it's unhealthy. A less critical data enrichment call, something that improves a report but doesn't block anything time-sensitive, gets more room, maybe eight or ten failures across a rolling window before we open the breaker.

Step 3: Use a Rolling Window, Not Strict Consecutive Failures

For anything that runs at meaningful volume, counting failures over a rolling window of the last twenty or so calls works better than requiring strict consecutive failures. A single dropped connection sandwiched between successful calls shouldn't be enough to trip a breaker meant to catch sustained outages. This one change alone cut our false-positive breaker trips significantly once we rolled it out across a few client integrations that had been using strict consecutive counting.

Step 4: Set the Cooldown Based on How the Dependency Actually Recovers

Cooldown length is the second knob, and it matters as much as the failure threshold. Too short, and the breaker flaps open and closed every few seconds under sustained load, generating alert noise without actually protecting anything. Too long, and you're sitting idle for minutes after an outage that already cleared. We start most integrations around thirty to sixty seconds and adjust from there based on how quickly that specific dependency has historically come back after past incidents.

Step 5: Let the Half-Open Probe Be Conservative

When the cooldown expires, we don't flood the dependency with a full batch of test calls to check recovery. A single successful test call, or a small handful, is enough to move the breaker back to closed. Sending too many probe calls at once against a dependency that's only partially recovered can trigger the same cascading problem the breaker was built to prevent in the first place.

Step 6: Revisit Thresholds After Every Incident

Any time a breaker's behavior surprises us (it opened too early, took too long to open, or flapped when it shouldn't have), we treat that as a signal to revisit the threshold, not just move on once the incident's resolved. Over time this turns threshold tuning into a living process tied to actual production behavior rather than a one-time setup task.

Where the Reference Material Helps

We don't pretend we invented any of this. The circuit breaker design pattern has been documented for a long time, and the AWS Builders' Library has several write-ups from teams that went through this exact tuning exercise at much larger scale than most of our client projects operate at. Reading how teams handling far more traffic approached the same rolling-window and cooldown tradeoffs saved us from re-learning some of those lessons the hard way.

A Real Example From a Client Integration

One integration worth describing in detail: a shipping carrier API used by an e-commerce client to pull tracking updates several times a day. Early on we set a strict consecutive-failure threshold of three, copied straight from a tutorial, and the breaker was tripping several times a week on nothing more than ordinary network noise. Every trip meant a Slack alert, and every alert meant someone had to check whether it was a real outage or just noise, which got tiring fast.

We switched to a rolling window (five failures out of the last twenty-five calls) and the false trips dropped to almost zero, while the breaker still caught the two genuine multi-hour outages that carrier had that quarter. The lesson wasn't "our first threshold was wrong," it was that we'd copied a number without looking at how that specific API actually failed in practice. Once we pulled the real error logs and matched the threshold to the actual failure pattern, the breaker started doing exactly what it was supposed to and nothing more.
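
To make the difference concrete, here is a small Python sketch that replays a sequence of call outcomes against both rules: three strict consecutive failures versus five failures in the last twenty-five calls. The function names and the two sample sequences are made up for illustration; they are not the carrier's real logs.

```python
from collections import deque


def trips_consecutive(outcomes, threshold):
    """Index of the call that trips a consecutive-failure rule, or None."""
    streak = 0
    for i, ok in enumerate(outcomes):
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return i
    return None


def trips_rolling(outcomes, max_failures, window):
    """Index of the call that trips a rolling-window rule, or None."""
    recent = deque(maxlen=window)  # oldest outcome falls off automatically
    for i, ok in enumerate(outcomes):
        recent.append(ok)
        if recent.count(False) >= max_failures:
            return i
    return None


# True = success, False = failure (hypothetical data).
short_blip = [True] * 10 + [False] * 3 + [True] * 30
real_outage = [True] * 10 + [False] * 10

if __name__ == "__main__":
    print(trips_consecutive(short_blip, 3), trips_rolling(short_blip, 5, 25))
    print(trips_consecutive(real_outage, 3), trips_rolling(real_outage, 5, 25))
```

On the short blip, the consecutive rule trips and the rolling rule does not. On the sustained outage, both trip, with the rolling rule only a couple of calls behind.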

What We Track to Know If a Threshold Needs Adjusting

We keep a simple dashboard per integration showing breaker trips per week, how long each open period lasted, and how many records ended up in the pending queue during each trip. A threshold that's producing multiple short trips a week with small queue depths is probably too sensitive and needs loosening. A threshold that never trips despite known incidents on the dependency's status page is probably too loose and needs tightening. Reviewing this dashboard monthly, rather than only after something goes wrong, has caught a couple of thresholds drifting out of alignment before they caused a real problem.
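
The two rules of thumb above can be written down as a tiny review helper. This is a hedged sketch: the `WeekStats` fields, the function name, and especially the cut-offs for what counts as a "short" trip or a "small" queue are assumptions to adjust per integration, not numbers from our dashboards.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    trips: int                  # breaker trips this week
    avg_open_seconds: float     # average length of each open period
    avg_pending_records: float  # average records deferred per trip
    known_incidents: int        # incidents on the dependency's status page


def review_hint(week, short_open_s=120.0, small_queue=10.0):
    """Return 'tighten', 'loosen', or 'keep' for one integration's week."""
    if week.trips == 0 and week.known_incidents > 0:
        return "tighten"  # real incidents happened and the breaker never tripped
    noisy = (
        week.trips >= 2
        and week.avg_open_seconds <= short_open_s
        and week.avg_pending_records <= small_queue
    )
    if noisy:
        return "loosen"  # several short trips with little deferred work
    return "keep"
```

The output is a prompt for a human review, not an automatic change to the threshold.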

Documenting the Reasoning, Not Just the Numbers

Every threshold we set gets a short note next to it in our config, not just the number but why we picked it: what failure pattern we saw, what the business impact of a false trip versus a missed outage looks like for that specific integration. Six months later, when someone new on the team asks why the payments integration has such an aggressive threshold compared to the weather feed, that note answers the question in ten seconds instead of requiring someone to reconstruct the reasoning from scratch.

The Numbers We Keep Coming Back To

If you want a starting point rather than a blank page: five failures out of the last twenty calls, a forty-five second cooldown, and a single test call for the half-open probe. That's not the right number for every integration, but it's a reasonable default to tune from rather than guessing at zero context, and it's close to where most of our integrations have landed after a few rounds of adjustment. Treat it as a first draft, not a final answer, and expect to revisit it at least once after the first real incident.
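
One way to keep those starting numbers and the reasoning note together is a small config object that refuses to exist without a reason. The sketch below is illustrative: `BreakerConfig`, its field names, and the two example entries are hypothetical, and the per-integration values are only loosely based on the ranges mentioned earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    reason: str                # why these numbers were chosen
    max_failures: int = 5      # failures within the window that open the breaker
    window: int = 20           # number of most recent calls considered
    cooldown_s: float = 45.0   # seconds to stay open before probing
    probe_calls: int = 1       # test calls allowed while half-open

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError("every threshold needs a note explaining why")
        if not 0 < self.max_failures <= self.window:
            raise ValueError("max_failures must be between 1 and window")


# Hypothetical entries showing the note living next to the numbers.
PAYMENTS = BreakerConfig(
    reason="Tied to money movement; stop early and investigate.",
    max_failures=3,
)
ENRICHMENT = BreakerConfig(
    reason="Improves a report but blocks nothing time-sensitive.",
    max_failures=10,
)
```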

137Foundry's engineering team has put together a fuller walkthrough of the whole pattern, including how we pair the breaker with a pending queue for deferred work, in a longer guide on adding circuit breakers to data automation jobs if you want to see the full picture rather than just the threshold-tuning piece.

#automation #engineering #api #data #python

What We Learned Migrating a Client's Design System to Tokens

A client came to us with a Figma file that had grown across three years and roughly a dozen contributors, and a codebase where the same colors and spacing values existed as hardcoded hex codes and pixel numbers scattered across hundreds of components. Nobody had done anything wrong exactly. The product had just grown faster than anyone had time to go back and clean up after. We were brought in to migrate the whole thing to a proper token structure, and a few things surprised us along the way.

The audit took longer than the migration

We expected the migration itself, writing the token file and wiring the build pipeline, to be the long pole. It wasn't. The audit, going through the existing Figma file and codebase to figure out which of the roughly 340 distinct color values in use were intentional variations and which were just drift, took nearly three weeks. The actual token migration took about ten days once the audit was done.

That ratio surprised us at the time but makes sense in hindsight. A migration script can mechanically replace a known hex code with a token reference. It cannot tell you whether two shades of blue that are visually almost identical were meant to be the same color or were meant to be subtly different, and getting that wrong in either direction either loses an intentional distinction or bakes an accidental one permanently into the new system.
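
A script can at least surface the pairs a human needs to look at. The Python sketch below flags colors that sit very close together in RGB space. Treat it as a rough assumption-laden helper: the function names and the distance cut-off are made up, it expects six-digit hex values, and plain RGB distance is only a crude proxy for how similar two colors look.

```python
from itertools import combinations


def hex_to_rgb(hex_color):
    # Expects six-digit values such as "#1a73e8".
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def near_duplicates(colors, max_distance=12.0):
    """Return (color_a, color_b, distance) for pairs that are suspiciously close."""
    unique = sorted({c.lower() for c in colors})
    pairs = []
    for a, b in combinations(unique, 2):
        squared = sum((x - y) ** 2 for x, y in zip(hex_to_rgb(a), hex_to_rgb(b)))
        distance = squared ** 0.5
        if distance <= max_distance:
            pairs.append((a, b, round(distance, 1)))
    return pairs
```

The output is a list of questions, not answers: whether each pair is drift or an intentional distinction still has to be decided by someone who knows the product.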

Going in, we'd told the client to expect roughly a month of combined audit and migration work based on similar projects we'd scoped before. We ended up closer to five and a half weeks, which the client was fine with once we explained why, but it's a useful data point for anyone scoping a similar project: pad the audit estimate more than feels necessary, because the audit is where the actual uncertainty lives, not the migration itself.

We found more actual bugs than expected

Going through every color usage surfaced things nobody had noticed. A form validation error state that was using a slightly different red than the rest of the error states in the product, not on purpose, just because whoever built that form eyedropped a color from a different screenshot at some point. A focus ring color that had drifted to be nearly invisible against one background but fine against others. None of these were things the client had filed as bugs. They were just accumulated small inconsistencies nobody had gone looking for until the audit forced the question.

This is a pattern worth expecting going in: a token migration is also, unavoidably, a design consistency audit. Budget time for the fixes that surface, not just the mechanical migration work.

The semantic tier is where the real conversations happened

Defining core tokens (the raw palette, spacing scale, and type scale) was mostly mechanical once we had the audited list of intentional values. The semantic tier is where we spent most of our design conversations with the client's team, because naming a token color-surface-elevated versus color-surface-secondary versus color-background-raised required actually agreeing on what those concepts meant in their product specifically, not just picking a name that sounded reasonable.

We used the W3C Design Tokens Community Group format as the shared reference point for these conversations, since having a documented external standard to point at made it easier to explain to non-technical stakeholders why the naming mattered as much as the values themselves.

The build pipeline was the easy part

Wiring Style Dictionary to generate CSS custom properties from the token source, and hooking that build into their existing CI pipeline, took about two days once the token structure itself was settled. This matches what we generally see: the tooling for the mechanical build step is mature and well documented. The hard, slow part of a token migration is almost always the human agreement about what the tokens should mean, not the code that generates output from them.

What we'd do differently on the next one

Looking back, we'd loop in someone from the client's marketing or brand team much earlier than we did. We treated this as a design-and-engineering project for most of its first two weeks, and the marketing team only got pulled in once we hit color decisions that turned out to be tied to campaign assets they owned. Their input didn't change our approach much once we had it, but getting it three weeks earlier would have saved a round of rework on a handful of tokens we'd already named and wired into the build.

If we ran this migration again, we would start the audit with a smaller, representative sample of the product rather than trying to catalog every color usage across the entire codebase before starting any token work. A sample audit surfaces the same categories of drift, informs the same naming decisions, and gets a working token structure in front of the client's team weeks earlier, at which point real usage against real components tends to surface issues faster than a comprehensive upfront audit does anyway.

We also underestimated how much the audit itself needed input from designers who weren't originally scheduled for the project, since some of the color decisions from years earlier only made sense in the context of a marketing campaign or a client request that predated most of the current team. Getting that institutional memory involved early would have saved us a few rounds of "why is this one different" that could have been answered in five minutes by the right person.

The one thing that almost derailed the rollout

About three weeks after the new token pipeline went live, a marketing landing page built on a separate stack started visibly clashing with the rest of the product, because it was still pulling brand colors from a hardcoded copy nobody had included in the audit's scope. It wasn't technically our fault, since the landing page lived outside the codebase we'd been engaged on, but it was the kind of gap that makes a whole migration look unfinished to the people signing off on it. We added "check for consuming surfaces outside the primary codebase" as a standing step in every audit we've run since, in roughly the same spirit as the MDN documentation on CSS custom properties, which covers scoping variable inheritance correctly across a whole site rather than just the primary app shell.

Where the client's system stands now

Roughly six months post-migration, the client's design and engineering teams describe their biggest win as being able to make a color or spacing change with confidence, knowing it will apply everywhere it's supposed to and nowhere it isn't, instead of hunting through the codebase for every hardcoded reference. That confidence is the actual point of a token migration. The technical scaffolding matters, but it's in service of that outcome, not the outcome itself.

We wrote up the full structure we generally recommend, the tiering approach and the governance model that keeps a token system from drifting again after launch, over at our guide to a design token pipeline that designers and developers both trust. If your team is sitting on a similar audit-shaped problem, 137Foundry has been through this migration enough times now to have a reasonably tight process for it.

#webdesign #ui #frontend #design systems #product design

Our Checklist for Auditing a Design System Before a Rebrand

Every time a client tells us a rebrand is coming, we run the same audit before touching a single color value. Rebrands are exactly the moment a design system's structure gets tested, and the teams that skip the audit are the ones who end up three weeks in discovering that half the "brand blue" usages in their product were never actually tied to a brand token in the first place. This is the checklist we run through, roughly in order.

We've run this audit often enough now that we can usually tell within the first hour of looking at a codebase roughly how painful the rebrand is going to be, just from how many of these checklist items turn up problems. A system that passes most of these cleanly is a genuinely fast rebrand. A system that fails half of them needs the audit findings folded into the project timeline before anyone commits to a launch date, or the rebrand slips in ways that are hard to explain to whoever is expecting it on schedule.

1. Find every hardcoded value first

Before anything else, we search the entire codebase for raw hex codes and raw pixel values that should be tokens but aren't. This always turns up more than the client expects, usually somewhere between a dozen and a few hundred, depending on how long the product has existed. Every one of these is a value the rebrand will miss unless someone manually finds and updates it, so cataloging them up front is what prevents a rebrand from shipping half-finished.
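
A first pass at that search can be a short script. The Python sketch below counts raw hex colors and pixel values across a source tree. The regular expressions, function names, and file suffix list are assumptions to adapt, and the hex pattern will produce some false positives (for example an ID selector that happens to spell a valid hex value), so treat the output as a list to review rather than a verdict.

```python
import re
from collections import Counter
from pathlib import Path

# 3-, 4-, 6-, or 8-digit hex colors such as #fff or #1a2b3c.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")
# Raw pixel values such as 12px or 0.5px.
PX_VALUE = re.compile(r"\b\d+(?:\.\d+)?px\b")


def find_hardcoded(text):
    """Return (hex_colors, pixel_values) found in one file's text."""
    hexes = [m.group(0).lower() for m in HEX_COLOR.finditer(text)]
    pixels = PX_VALUE.findall(text)
    return hexes, pixels


def scan_tree(root, suffixes=(".css", ".scss", ".jsx", ".tsx")):
    """Count every hardcoded value under `root`, most common first."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hexes, pixels = find_hardcoded(path.read_text(errors="ignore"))
            counts.update(hexes)
            counts.update(pixels)
    return counts.most_common()
```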

2. Separate "meant to be the brand color" from "coincidentally the same color"

This is the step people underestimate. Not every instance of the current brand blue is actually meant to track the brand color going forward. Some usages picked that blue because it was available, not because the component's meaning is tied to the brand. Confusing these two categories during a rebrand either changes things that shouldn't change or misses things that should. We go component by component and ask, for each blue usage, "if the brand color changes, should this change with it?"

3. Check the semantic layer against the W3C Design Tokens Community Group structure

If the client already has semantic tokens, we check whether they're named for role (color-brand-primary) or for appearance (color-blue). Appearance-named tokens are a red flag going into a rebrand, because a rename that should be a one-line change at the core tier turns into a search-and-replace across every usage of a now-misleading name.

We also check whether the semantic layer is actually consumed by most components, or whether it exists alongside a lot of components still referencing raw values directly. A semantic layer that only half the product actually uses gives a false sense of security going into a rebrand, since the audit findings look better than the rebrand rollout will actually feel once real components start needing manual fixes.
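
The naming check is simple enough to automate as a first filter. The sketch below flags token names that contain a color word. The word list and function name are assumptions, and it is meant to be run against semantic-tier names only, since appearance-based names are usually expected in a core palette.

```python
# Hypothetical word list; extend it to match the palette names in use.
APPEARANCE_WORDS = frozenset({
    "red", "orange", "yellow", "green", "teal", "blue",
    "purple", "pink", "brown", "gray", "grey", "black", "white",
})


def appearance_named(token_names, words=APPEARANCE_WORDS):
    """Return the token names that describe appearance rather than role."""
    flagged = []
    for name in token_names:
        parts = name.lower().replace("_", "-").split("-")
        if any(part in words for part in parts):
            flagged.append(name)
    return flagged
```

For example, `appearance_named(["color-brand-primary", "color-blue"])` returns `["color-blue"]`.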

4. Test the color change against accessibility contrast requirements early

A rebrand color that looked fine as a logo color can fail contrast requirements once it's used as button text or a link color against the product's existing background palette. We run the proposed new palette through a contrast checker like the one WebAIM publishes, against every background it'll actually appear on, before the rebrand ships, not after design review has already signed off on mockups that assumed it would just work.
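
For checking a whole palette at once, the contrast ratio defined in WCAG 2.x is straightforward to compute directly. The Python sketch below implements that published formula for six-digit hex colors; the function names and the palette-checking helper are hypothetical, and the 4.5:1 default is the WCAG AA minimum for normal-size text (large text has a lower minimum).

```python
def _linear(channel_8bit):
    # Convert an 8-bit sRGB channel to its linear value.
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def failing_pairs(foregrounds, backgrounds, minimum=4.5):
    """Return every (foreground, background) pair below the minimum ratio."""
    return [
        (fg, bg)
        for fg in foregrounds
        for bg in backgrounds
        if contrast_ratio(fg, bg) < minimum
    ]
```

Running `failing_pairs` with the proposed brand colors as foregrounds and the product's existing surface colors as backgrounds gives a concrete list to bring to design review.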

5. Confirm the build pipeline actually regenerates on token change

We ask to see the last few commits to the token source file and confirm each one triggered an automatic build that updated the generated CSS or platform output. If the pipeline requires someone to manually run a build script, that's the moment in the rebrand timeline most likely to get forgotten under deadline pressure, and it's worth fixing before the rebrand starts rather than discovering it mid-rollout.

6. Check for a second, undocumented source of truth

Almost every audit turns up a second place brand values live: a marketing site built on a different stack, an email template system, a PDF export tool, a partner-facing widget embedded on other companies' sites. These often pull colors from their own hardcoded copy rather than the product's token pipeline, and they get missed in a rebrand rollout more often than any single component inside the main product does.

7. Plan the rollout order, not just the target state

We sequence rebrand rollouts by blast radius: core token values first (behind a flag if the platform supports it), semantic tokens next, then component-level overrides last, since those are the most isolated and lowest-risk to adjust individually if something looks wrong. Shipping all three tiers simultaneously makes it harder to isolate where a visual regression came from if one shows up. The MDN reference on CSS custom properties is our go-to when we're explaining the mechanics of that staged rollout to a client's engineering team who hasn't worked with a token-driven theme swap before.

8. Sanity-check the new palette against accessibility contrast before anyone signs off

A rebrand color that reads well as a logo or a hero banner can fail contrast requirements the moment it becomes button text or a link color against existing backgrounds. We run every proposed new value through a contrast checker against the actual surfaces it'll sit on before it goes into a client presentation, not after design sign-off, because reopening an already-approved palette is a much harder conversation than catching the issue during the audit.

9. Check how many places the logo and wordmark live as static assets

This one is easy to miss because it isn't really a token problem, just adjacent to one. A rebrand almost always touches the logo, and static logo files tend to be scattered across a favicon, app store assets, email signatures, a PDF letterhead template, and social media profile images, none of which are wired into any token pipeline. We keep a running inventory of every static brand asset location during the audit specifically so the rebrand rollout plan accounts for them alongside the token-driven changes, rather than discovering three weeks post-launch that the app icon still shows the old mark.

What this checklist catches, in practice

Running this audit before a rebrand consistently surfaces the same categories of problem: hardcoded values that will silently miss the rebrand, semantic tokens named for appearance instead of role, a second undocumented source of brand values living outside the main product, and static brand assets nobody remembered lived outside the token system entirely. None of these are exotic findings. They're the predictable result of a design system that grew incrementally without anyone stepping back to check its structure, which describes most design systems that have been in production for more than a year or two.

We go deeper into the tiering and governance approach this checklist assumes in our full guide to building a token pipeline designers and developers both trust. If a rebrand is on your roadmap and you want a second set of eyes on whether your system is actually ready for it, 137Foundry runs this exact audit for clients before the rebrand work starts, not after something breaks.

#webdesign #ui #design systems #branding #frontend

We Just Retired a 15-Year-Old System. Here's What Almost Got Lost.

At 137Foundry, we recently wrapped a legacy modernization engagement for a client whose core operations system had been running, largely unchanged in its fundamentals, for fifteen years. It worked. It was also nearly impossible to hire for, since it ran on a stack most engineers under thirty-five had never touched, and every year that passed made the pool of people who understood it smaller.

Here's what almost slipped through, and what we did differently to catch it.

The rule nobody remembered writing

About three weeks into the audit phase, we found a conditional buried in a pricing calculation that applied a specific adjustment for orders shipped to one particular region. Nobody on the current team knew why. The commit that introduced it was over a decade old, with a message that just said "fix pricing bug." No ticket reference, no further explanation.

We eventually tracked down a former employee, now retired, through a mutual connection, who remembered the actual story: a regulatory requirement specific to that region that existed for about two years and was then repealed, but the code that implemented it was never removed because nobody wanted to touch pricing logic without being completely sure it was safe to change.

If we'd migrated without finding this, one of two things would have happened. Either we'd have carried forward a rule that no longer corresponded to any current regulation, silently overcharging or undercharging customers in that region for no legitimate reason, or we'd have dropped it during a code cleanup without knowing why it existed. Had the regulation still applied in some edge case we didn't know about, dropping it would have created a compliance gap nobody noticed until an audit.

Photo by Tima Miroshnichenko on Pexels

Why we built time for this into the project plan explicitly

This is the part of a legacy migration that's easiest to underbudget, because interviewing people and reading old commit history doesn't look like "real" engineering progress to a client watching a project timeline. We've learned to make the audit phase an explicit, separately budgeted line item rather than folding it into general "planning," specifically because that's the phase that gets compressed first under deadline pressure, and it's the phase where the actual risk lives.

What surprised us most

We expected to find undocumented business logic. We didn't expect the split between logic that was still actively relevant and logic that was dead weight, left over from rules that no longer applied to the business at all. Roughly a third of what we catalogued during the audit turned out to be safe to simply not carry forward, once we confirmed with the client that the underlying business condition no longer existed. That's not a bad outcome. A migration is a legitimate opportunity to deliberately retire dead logic instead of blindly reproducing everything, as long as the decision to drop something is made consciously rather than by accident.

The part the client's leadership didn't expect

Going into the engagement, the client's leadership expected the risky part of the project to be the technical rewrite itself, the part their engineering team was most nervous about. By the end, everyone involved agreed the actual risk had been in the parts that didn't look technical at all: finding the retired employee, confirming which of the thirty-some undocumented quirks we catalogued were still relevant, and deciding, with the client's business stakeholders rather than unilaterally, which pieces of dead logic were genuinely safe to leave behind.

That's a useful thing to set expectations around at the start of any similar project. The engineering work is usually the predictable part. The knowledge-recovery work is where the actual surprises live, and building schedule slack around it rather than around the coding phase tends to produce a much smoother project overall.

Where this fits into how we work

We treat legacy retirement as a knowledge-extraction project first and a technical migration second, because the technical part is genuinely the easier half once you actually know what you're building toward. The full framework we use for this kind of work covers the audit process, the prioritization, and the parallel-run verification period we run before any legacy system actually gets decommissioned.
