Embracing the fragility of the web empowers us to
build UIs capable of adapting to the functionality they can offer,
whilst still providing value to users. This article explores how
graceful degradation, defensive coding, observability, and a healthy
attitude towards failures better equip us before, during, and after an
error occurs.
Things on the web can break — the odds are stacked against us. Lots
can go wrong: a network request fails, a third-party library breaks, a
JavaScript feature is unsupported (assuming JavaScript is even
available), a CDN goes down, a user behaves unexpectedly (they
double-click a submit button), the list goes on.
Fortunately, we as engineers can avoid, or at least mitigate, the
impact of breakages in the web apps we build. This, however, requires a
conscious effort and a mindset shift towards thinking about unhappy
scenarios just as much as happy ones.
The User Experience (UX) doesn’t need to be all or nothing; it just needs to be usable.
This premise, known as graceful degradation, allows a system to continue
working when parts of it are dysfunctional, much like an electric bike
becomes a regular bike when its battery dies. If something fails, only
the functionality that depends on it should be impacted.
UIs should adapt to the functionality they can offer, whilst providing as much value to end-users as possible.
Why Be Resilient
Resilience is intrinsic to the web.
Browsers ignore invalid HTML tags and unsupported CSS properties.
This liberal attitude is known as Postel’s Law, which is conveyed
superbly by Jeremy Keith in Resilient Web Design:
“Even if there are errors in the HTML or CSS, the browser
will still attempt to process the information, skipping over any pieces
that it can’t parse.”
JavaScript is less forgiving. Resilience is extrinsic: we instruct
JavaScript what to do if something unexpected happens. If an API request
fails, the onus falls on us to catch the error and subsequently decide
what to do. And that decision directly impacts users.
Resilience builds trust with users. A buggy experience reflects poorly on the brand. According to Kim and Mauborgne, convenience (availability, ease of consumption)
is one of six characteristics associated with a successful brand, which
makes graceful degradation synonymous with brand perception.
A robust and reliable UX is a signal of quality and trustworthiness,
both of which feed into the brand. A user unable to perform a task
because something is broken will naturally face disappointment they
could associate with your brand.
Often system failures are chalked up as “corner cases”, things that
rarely happen. However, the web has many corners. Different browsers
running on different platforms and hardware, respecting our users’
preferences and browsing modes (Safari Reader, assistive technologies),
and being served to geo-locations with varying latency and intermittency
all increase the likelihood of something not working as intended.
Error Equality
Much like content on a webpage has hierarchy, failures (things going
wrong) also follow a pecking order. Not all errors are equal; some are
more important than others.
We can categorize errors by their impact. How does XYZ not working
prevent a user from achieving their goal? The answer generally mirrors
the content hierarchy.
For example, a dashboard overview of your bank account contains data
of varying importance. The total value of your balance is more important
than a notification prompting you to check in-app messages. The MoSCoW method of prioritization categorizes the former as a must-have, and the latter as a nice-to-have.
If primary information is unavailable (e.g. a network request fails) we
should be transparent and let users know, usually via an error message.
If secondary information is unavailable we can still provide the core
(must-have) experience whilst gracefully hiding the degraded component.
Categorization removes the 1-1 relationship between failures and error
messages in the UI. Otherwise, we risk bombarding users and cluttering
the UI with too many error messages. Guided by content hierarchy, we can
cherry-pick which failures are surfaced in the UI and which happen
unbeknownst to end-users.
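As a rough illustration of this categorization, the sketch below decides what the UI shows for each dashboard section depending on its priority. The section names, the `priority` field, and the `resolveSection` helper are all hypothetical and only serve the example.

```javascript
// Hypothetical priorities that mirror the content hierarchy.
const PRIORITY = { MUST_HAVE: "must-have", NICE_TO_HAVE: "nice-to-have" };

// Decide what the UI shows for one section, given the outcome of loading it.
function resolveSection({ name, priority, error, data }) {
  if (!error) {
    return { name, view: "content", data };
  }
  // Primary failures are surfaced to users; secondary failures are hidden.
  return priority === PRIORITY.MUST_HAVE
    ? { name, view: "error-message" }
    : { name, view: "hidden" };
}

// Hypothetical load results for a bank account dashboard.
const sections = [
  { name: "balance", priority: PRIORITY.MUST_HAVE, error: new Error("Request failed") },
  { name: "messages-prompt", priority: PRIORITY.NICE_TO_HAVE, error: new Error("Request failed") },
  { name: "transactions", priority: PRIORITY.MUST_HAVE, error: null, data: [] },
];

const views = sections.map(resolveSection);
```

Two sections failed here, yet only the failed balance request results in an error message; the failed messages prompt is simply not rendered.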
Prevention is Better than Cure
Medicine has an adage that prevention is better than cure.
Applied to the context of building resilient UIs, preventing an error
from happening in the first place is more desirable than needing to
recover from one. The best type of error is one that doesn’t happen.
It’s safe to assume never to make assumptions, especially when
consuming remote data, interacting with third-party libraries, or using
newer language features. Outages or unplanned API changes, alongside which
browsers users choose or must use, are outside of our control. Whilst we
cannot stop breakages outside our control from occurring, we can
protect ourselves against their (side) effects.
Taking a more defensive approach when writing code helps reduce
programmer errors arising from making assumptions. Pessimism over
optimism favours resilience. The code example below is too optimistic:
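A minimal sketch of such optimistic code, using a hypothetical `listLastFourDigits` helper (the function name and the sample data are assumptions for illustration):

```javascript
// Hypothetical helper: lists the last four digits of each debit card.
// Nothing here checks that `debitCards` is an array of card objects.
function listLastFourDigits(debitCards) {
  return debitCards.map((card) => card.lastFourDigits);
}

// Works on the happy path...
const happy = listLastFourDigits([{ lastFourDigits: "4242" }]);

// ...but throws a TypeError as soon as one assumption is wrong.
let failure = null;
try {
  listLastFourDigits(undefined); // e.g. the response contained no cards field
} catch (error) {
  failure = error;
}
```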
It assumes that debit cards exist, the endpoint returns an Array, the
array contains objects, and each object has a property named lastFourDigits.
The current implementation forces end-users to test our assumptions. It
would be safer, and more user friendly if these assumptions were
embedded in the code:
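A minimal sketch of a more defensive version, with each assumption written as an explicit check (`listLastFourDigits` is a hypothetical helper name, and returning an empty list is one possible choice among several):

```javascript
// Hypothetical defensive helper: every assumption becomes a check.
function listLastFourDigits(debitCards) {
  // The endpoint may return something other than an array, or nothing at all.
  if (!Array.isArray(debitCards)) {
    return [];
  }
  return debitCards
    // Skip entries that are not objects or lack a usable lastFourDigits value.
    .filter(
      (card) =>
        card !== null &&
        typeof card === "object" &&
        typeof card.lastFourDigits === "string"
    )
    .map((card) => card.lastFourDigits);
}

const fromBadResponse = listLastFourDigits(undefined);
const fromMixedData = listLastFourDigits([{ lastFourDigits: "4242" }, null, {}, "oops"]);
```

An empty list lets the UI render an empty state instead of crashing; whether that or an error message is appropriate depends on how important the data is.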
Using a third-party method without first checking the method is available is equally optimistic:
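A minimal sketch of that optimism, assuming a hypothetical `submitPayment` function that receives the `stripe` object and a placeholder `clientSecret` argument:

```javascript
// Hypothetical submit handler. `stripe` stands for the object the third-party
// script is expected to provide; `clientSecret` is a placeholder argument.
function submitPayment(stripe, clientSecret) {
  // No check that `stripe` exists or that handleCardPayment is callable.
  return stripe.handleCardPayment(clientSecret);
}

// If the third-party script failed to load, `stripe` is undefined and this throws.
let failure = null;
try {
  submitPayment(undefined, "placeholder-secret");
} catch (error) {
  failure = error;
}
```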
The code snippet above assumes that the stripe object exists, it has a
property named handleCardPayment, and that said property is a function.
It would be safer, and therefore more defensive, if these assumptions
were verified by us beforehand:
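A minimal sketch of those checks, again assuming a hypothetical `submitPayment` function and a placeholder `clientSecret` argument:

```javascript
// Hypothetical submit handler that verifies its assumptions first.
function submitPayment(stripe, clientSecret) {
  // Verify the object exists and that the property is callable before using it.
  if (stripe != null && typeof stripe.handleCardPayment === "function") {
    return stripe.handleCardPayment(clientSecret);
  }
  // Nothing was called; the caller can decide how to adapt the UI.
  return null;
}

const withoutStripe = submitPayment(undefined, "placeholder-secret");
const withStripe = submitPayment(
  { handleCardPayment: (secret) => `called with ${secret}` }, // fake object for illustration
  "placeholder-secret"
);
```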
Both examples check something is available before using it. Those familiar with feature detection may recognize this pattern:
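A minimal feature detection sketch for the Clipboard API. The `supportsClipboardWrite` name is hypothetical, and the `nav` parameter exists only so the check can be exercised outside a browser:

```javascript
// Returns true when the Clipboard API's writeText method is available.
// `nav` defaults to the global navigator when one exists.
function supportsClipboardWrite(
  nav = typeof navigator !== "undefined" ? navigator : undefined
) {
  return Boolean(
    nav && nav.clipboard && typeof nav.clipboard.writeText === "function"
  );
}

// In a browser this could drive the UI, for example:
// copyButton.hidden = !supportsClipboardWrite(); // `copyButton` is hypothetical
```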
Asking the browser whether it supports the Clipboard API before
attempting to cut, copy or paste is a simple yet effective example of
resilience. The UI can adapt ahead of time by hiding clipboard
functionality from unsupported browsers, or from users yet to grant
permission.
Preventing form resubmission in JavaScript alongside using aria-disabled="true"
is more usable and accessible than the disabled HTML attribute. Sandrina
Pereira explains why in great detail in Making Disabled Buttons More Inclusive.
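One way this could look is sketched below. The `guardSubmit` helper and the `submit` callback are hypothetical; `button` is any object exposing `setAttribute` and `removeAttribute`, such as a button element.

```javascript
// Hypothetical guard: ignores repeat activations while a submission is in flight.
// `submit` is a hypothetical function that returns a promise.
function guardSubmit(button, submit) {
  let inFlight = false;

  return function handleClick(event) {
    if (event && typeof event.preventDefault === "function") {
      event.preventDefault();
    }
    if (inFlight) {
      return null; // the second click of a double-click is ignored
    }
    inFlight = true;
    // aria-disabled conveys the state while the button stays focusable.
    button.setAttribute("aria-disabled", "true");

    return Promise.resolve()
      .then(submit)
      .finally(() => {
        // Re-enable the button whether the submission succeeded or failed.
        inFlight = false;
        button.removeAttribute("aria-disabled");
      });
  };
}
```

The returned promise rejects if `submit` fails, so the caller still decides how to respond to the error.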
Responding to Errors
Not all errors are preventable via defensive programming. This means
responding to operational errors (those occurring within correctly
written programs) falls on us.
Responding to an error can be modelled using a decision tree. We can either recover, fall back or acknowledge the error.
When facing an error, the first question should be, “can we recover?”
For example, does retrying a network request that failed for the first
time succeed on subsequent attempts? Intermittent micro-services,
unstable internet connections, or eventual consistency are all reasons
to try again. Data fetching libraries such as SWR offer this functionality for free.
Risk appetite and surrounding context influence what HTTP methods you
are comfortable retrying. At Nutmeg we retry failed reads (GET
requests), but not writes (POST, PUT, PATCH, DELETE). Making multiple
attempts to retrieve data (portfolio performance) is safer than making
multiple attempts to mutate it (resubmitting a form).
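A minimal sketch of such a policy, where only reads are retried. The `requestWithRetry` helper, its option names, and the linear backoff are assumptions for illustration; `request` is a hypothetical function that performs the call and returns a promise.

```javascript
// Methods considered safe to retry in this sketch: reads only.
const RETRYABLE_METHODS = new Set(["GET"]);

// `retries` is the number of extra attempts after the first one.
async function requestWithRetry(
  request,
  { method = "GET", retries = 2, delayMs = 200 } = {}
) {
  const attempts = RETRYABLE_METHODS.has(method.toUpperCase()) ? retries + 1 : 1;
  let lastError;

  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await request();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Wait a little longer before each new attempt (linear backoff).
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  // Every allowed attempt failed, so the caller can fall back or acknowledge.
  throw lastError;
}
```

A write passed with `method: "POST"` gets a single attempt and its error is rethrown immediately.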
The second question should be: If we cannot recover, can we provide a
fallback? For example, if an online card payment fails, can we offer an
alternative means of payment such as PayPal or Open Banking?
Fallbacks don’t always need to be so elaborate; they can be subtle. Copy
containing text dependent on remote data can fall back to less specific
text when the request fails.
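A subtle fallback of this kind might look like the sketch below, where the `welcomeMessage` helper and the `firstName` field are hypothetical:

```javascript
// Hypothetical copy helper: uses the remote value when present,
// and less specific text when the request failed or the field is missing.
function welcomeMessage(user) {
  const name =
    user && typeof user.firstName === "string" ? user.firstName.trim() : "";
  return name ? `Welcome back, ${name}` : "Welcome back";
}

const specific = welcomeMessage({ firstName: "Ada" });
const generic = welcomeMessage(undefined); // e.g. the profile request failed
```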
The third and final question should be: If we cannot recover or
fall back, how important is this failure (which relates to “Error
Equality”)? The UI should acknowledge primary errors by informing users
something went wrong, whilst providing actionable prompts such as
contacting customer support or linking to relevant support articles.
Observability
UIs adapting to something going wrong is not the end. There is another side to the same coin.
Engineers need visibility on the root cause behind a degraded
experience. Even errors not surfaced to end-users (secondary errors)
must propagate to engineers. Real-time error monitoring services such as
Sentry or Rollbar are invaluable tools for modern-day web development.
Most error monitoring providers capture all unhandled exceptions
automatically. Setup requires minimal engineering effort and quickly
pays dividends through a healthier production environment and an
improved MTTA (mean time to acknowledge).
The real power comes when explicitly logging errors ourselves. Whilst
this involves more upfront effort, it allows us to enrich logged errors
with more meaning and context, both of which aid troubleshooting. Where
possible, aim for error messages that are understandable to
non-technical members of the team.
Extending the earlier Stripe example with an else branch is the perfect contender for explicit error logging:
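A minimal sketch of that else branch. `submitPayment` and `clientSecret` are hypothetical names, and `logError` is a hypothetical stand-in for whatever function forwards errors to the monitoring service in use.

```javascript
// Hypothetical submit handler with explicit error logging.
function submitPayment(stripe, clientSecret, logError) {
  if (stripe != null && typeof stripe.handleCardPayment === "function") {
    return stripe.handleCardPayment(clientSecret);
  }
  // Log a message a non-technical colleague can follow, plus context for engineers.
  logError(
    new Error("Card payment unavailable: Stripe's handleCardPayment method is missing"),
    {
      stripeType: typeof stripe,
      handleCardPaymentType:
        stripe == null ? "n/a" : typeof stripe.handleCardPayment,
    }
  );
  return null;
}
```

The context object records which assumption failed, which saves a round of guesswork when the error shows up in the monitoring dashboard.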
Note: This defensive style needn’t be bound to form submission (at the
time of error); it can happen when a component first mounts (before the
error), giving us and the UI more time to adapt.
Observability helps pinpoint weaknesses in code and areas that can be
hardened. Once a weakness surfaces, look at whether and how it can be
hardened to prevent the same thing from happening again. Look at trends
and risk areas, such as third-party integrations, to identify what could
be wrapped in an operational feature flag (otherwise known as a kill switch).
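A kill switch can be as small as the sketch below. The `flags` object, the flag name, and both helper functions are hypothetical; treating a missing flag as "feature on" is an assumption of this sketch, not a rule.

```javascript
// Hypothetical flags object, e.g. loaded from remote configuration at startup.
// A feature is switched off only when its flag is explicitly `false`.
function isKilled(flags, feature) {
  return flags != null && typeof flags === "object" && flags[feature] === false;
}

// Offer PayPal unless its kill switch has been flipped.
function availablePaymentMethods(flags) {
  const methods = ["card"];
  if (!isKilled(flags, "paypal")) {
    methods.push("paypal");
  }
  return methods;
}

const duringOutage = availablePaymentMethods({ paypal: false });
const normally = availablePaymentMethods(undefined); // flags failed to load
```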
Users forewarned about something not working will be less frustrated
than those without warning. Knowing about road works ahead of time helps
manage expectations, allowing drivers to plan alternative routes. When
dealing with an outage (hopefully discovered by monitoring and not
reported by users) be transparent.
Retrospectives
It’s very tempting to gloss over errors.
However, they provide valuable learning opportunities for us and our
current or future colleagues. Removing the stigma from the inevitability
that things go wrong is crucial. In Black Box Thinking, this is described as:
“In highly complex organizations, success can happen only
when we confront our mistakes, learn from our own version of a black
box, and create a climate where it’s safe to fail.”
Being analytical helps prevent the same error from happening again, or
at least mitigates it. Much like black boxes in the aviation industry
record incidents, we should document errors. At the very least,
documentation from prior incidents helps reduce the MTTR (mean time to
repair) should the same error occur again.
Documentation, often in the form of RCA (root cause analysis) reports,
should be honest, discoverable, and include: what the issue was, its
impact, the technical details, how it was fixed, and actions that should
follow the incident.
Closing Thoughts
Accepting the fragility of the web is a necessary step towards
building resilient systems. A more reliable user experience is
synonymous with happy customers. Being equipped for the worst
(proactive) is better than putting out fires (reactive) from a business,
customer, and developer standpoint (fewer bugs!).
Things to remember:
- UIs should adapt to the functionality they can offer, whilst still providing value to users;
- Always think about what can go wrong (never make assumptions);
- Categorize errors based on their impact (not all errors are equal);
- Preventing errors is better than responding to them (code defensively);
- When facing an error, ask whether a recovery or fallback is available;
- User-facing error messages should provide actionable prompts;
- Engineers must have visibility on errors (use error monitoring services);
- Error messages for engineers/ colleagues should be meaningful and provide context;
- Learn from errors to help our future selves and others.