116 changes: 56 additions & 60 deletions README.md
Expand Up @@ -176,38 +176,53 @@ encrypted firmware storage. See `Nerves.Runtime.FwupOps.prevent_revert/0`.

### Assisted firmware validation and automatic revert

Nerves firmware updates protect against update corruption and power loss
midway into the update procedure. However, what happens if the firmware update
contains bad code that hangs the device or breaks something important like
networking? Some Nerves systems support tentative runs of new firmware and if
something goes wrong, they'll revert back.
Nerves firmware updates protect against update corruption and power loss midway
into the update procedure. However, what happens if the firmware update contains
bad code that hangs the device or breaks something important like networking?
Some Nerves systems support tentative runs of new firmware and if something goes
wrong, they'll revert back.

At a high level, this involves some additional code from the developer that
knows what constitutes "working". This could be "is it possible to connect to
the firmware update server within 5 minutes of boot?"

Here's the process:

1. New firmware is installed in the normal manner. The `Nerves.Runtime.KV`
variable `nerves_fw_validated` is set to 0. (The system's `fwup.conf` does
this.)
2. The system reboots like normal.
3. The device starts a five minute reboot timer (your code needs to do this if
you want to catch hangs or super-slow boots)
4. The application attempts to make a connection to the firmware update server.
5. On a good connection, the application sets `nerves_fw_validated` to 1 by
calling `Nerves.Runtime.validate_firmware/0` and cancels the reboot timer.
6. On error, the reboot timer failing, or a hardware watchdog timeout, the
system reboots. The bootloader reverts to the previous firmware.

Some Nerves systems support a KV variable called `nerves_fw_autovalidate`. The
intention of this variable was to make that system support scenarios that
require validate and ones that don't. If the system supports this variable then
you should make sure that it is set to 0 (either via a custom fwup.conf or via
the provisioning hooks for writing serial numbers to MicroSD cards). Support for
the `nerves_fw_autovalidate` variable will likely go away in the future as steps
are made to make automatic revert on bad firmware a default feature of Nerves
rather than an add-on.
knows what constitutes "working". `Nerves.Runtime` comes with a module,
`Nerves.Runtime.StartupGuard`, that handles this by waiting for all OTP
applications to start and then validates the new firmware.

To use `Nerves.Runtime.StartupGuard`, first check whether your Nerves system
automatically validates firmware after it is written successfully. All official
systems previously did this for simplicity, and we're in the process of changing
that. It's easy to check: update the device with firmware built from your
project and run `Nerves.Runtime.firmware_validation_status/0`. If the firmware
is validated and you don't have `Nerves.Runtime.StartupGuard` enabled, then your
system auto-validates. Otherwise, run `Nerves.Runtime.validate_firmware/0` to
validate it manually. To have `Nerves.Runtime.StartupGuard` validate the
firmware for you, add the following to your project's `target.exs` or
`config.exs`:

```elixir
config :nerves_runtime, startup_guard_enabled: true
```

Then add the following to your project's `rel/vm.args.eex`:

```text
## Require an initialization handshake within 10 minutes
-env HEART_INIT_TIMEOUT 600
```

Of course, there's much room for improvement. For example, if your Nerves device
connects to a firmware update server, the criteria for validating new firmware
could be connecting to that server.

Recommendations for this process are:

1. Allow for enough time when in a bad state to do remote debug if that's
possible. Rebooting immediately can limit diagnostic options when unexpected
things happen remotely.
2. Link the validation code to Nerves Heart. This can protect against failures
and hangs that occur before the validation process starts.
3. Keep the heart callback code as simple as possible since heart is very
unforgiving to errors, exceptions, and slow code.

One way to start is to copy/paste `Nerves.Runtime.StartupGuard` and modify it.
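As a starting point for the "connect to the firmware update server" criterion mentioned above, here is a minimal sketch. The module name `MyApp.FirmwareValidator`, its function names, and the retry parameters are hypothetical and not part of `nerves_runtime`. The check and the validation step are passed in as functions so that the retry logic can be exercised without a device.

```elixir
defmodule MyApp.FirmwareValidator do
  @moduledoc false
  # Hypothetical helper: retry a health check and validate on success.

  # Out of retries: report failure and let the caller decide what to do
  # (for example, reboot so that the bootloader reverts).
  def validate_when(_check_fun, _validate_fun, 0, _delay_ms), do: {:error, :gave_up}

  def validate_when(check_fun, validate_fun, retries, delay_ms) do
    if check_fun.() do
      # The check passed, so run the validation step and return its result
      validate_fun.()
    else
      Process.sleep(delay_ms)
      validate_when(check_fun, validate_fun, retries - 1, delay_ms)
    end
  end

  # Returns true if a TCP connection to host:port can be opened.
  def server_reachable?(host, port, timeout_ms \\ 5_000) do
    opts = [:binary, active: false]

    case :gen_tcp.connect(String.to_charlist(host), port, opts, timeout_ms) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end
end
```

On a device, this might be called with `fn -> MyApp.FirmwareValidator.server_reachable?(host, port) end` as the check and `&Nerves.Runtime.validate_firmware/0` as the validation step, where `host` and `port` are whatever your update server uses. Keeping the check separate from the heart callback follows the recommendation above to keep the callback itself as simple as possible.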

### U-Boot assisted automatic revert

Expand All @@ -232,23 +247,6 @@ environment variable to `"1"` to indicate that boot counting should start.
you call it to indicate that the firmware is ok, it will set `upgrade_available`
back to `"0"` and reset `"bootcount"`.

### Best effort automatic revert

Unfortunately, the bootloader for platforms like the Raspberry Pi makes it
difficult to implement the above mechanism. The following strategy cannot
protect against kernel and early boot issues, but it can still provide value:

1. Upgrade firmware the normal way. Record that the next boot will be the first
one in the application data partition.
2. On the reboot, if this is the first one, record that the boot happened and
revert the firmware with `reboot: false`. If this is not the first boot,
carry on.
3. When you're happy with the new firmware, revert the firmware again with
`reboot: false`. I.e., revert the revert. It is critical that `revert` is
only called once.

To make this handle hangs, you'll want to enable a hardware watchdog.

## Serial numbers

Finding the serial number of a device is both hardware specific and influenced
Expand Down Expand Up @@ -299,19 +297,18 @@ Task | Description

## Application environment

This section documents officially supported application environment keys.
This section documents officially supported application environment keys that
can be added to your `config.exs`, `target.exs`, or the like.

Most users shouldn't need to modify the application environment for
`nerves_runtime` except for unit testing. See the next section for testing.

Key | Default | Description
--------------- | ----------------------------------- | ------------
`:boardid_path` | `"/usr/bin/boardid"` | Path to the `boardid` binary for determining the device's serial number
`:devpath` | `/dev/rootdisk0` | The block device that firmware is stored on. `/dev/rootdisk0` is a symlink on Nerves to the real location, so this really shouldn't need to be changed.
`:fwup_env` | `%{}` | Additional environment variables to pass to `fwup`
`:fwup_path` | `"fwup"` | Path to the `fwup` binary for querying or modifying firmware status
`:kv_backend` | `Nerves.Runtime.KVBackend.UBootEnv` | The backing store for firmware slot and other low level key-value pairs. This is almost always a U-Boot environment block for Nerves
`:ops_fw_path` | `"/usr/share/fwup/ops.fw"` | Path to the `ops.fw` file for passing to `fwup` for firmware status tasks
Key | Default | Description
------------------------- | ----------------------------------- | ------------
`:boardid_path` | `"/usr/bin/boardid"` | Path to the `boardid` binary for determining the device's serial number (useful for unit tests)
`:devpath` | `/dev/rootdisk0` | The block device that firmware is stored on. `/dev/rootdisk0` is a symlink on Nerves to the real location, so this really shouldn't need to be changed. (useful for unit tests)
`:fwup_env` | `%{}` | Additional environment variables to pass to `fwup`. (useful for unit tests)
`:fwup_path` | `"fwup"` | Path to the `fwup` binary for querying or modifying firmware status. (useful for unit tests)
`:kv_backend` | `Nerves.Runtime.KVBackend.UBootEnv` | The backing store for firmware slot and other low level key-value pairs. This is almost always a U-Boot environment block for Nerves. (useful for unit tests)
`:ops_fw_path` | `"/usr/share/fwup/ops.fw"` | Path to the `ops.fw` file for passing to `fwup` for firmware status tasks. (useful for unit tests)
`:startup_guard_enabled` | `false` | Check that all OTP applications start up and then validate the firmware if needed. Reboot after 15 minutes if start up isn't successful.

## Using nerves_runtime in tests

Expand Down Expand Up @@ -357,4 +354,3 @@ All original source code in this project is licensed under Apache-2.0.

Additionally, this project follows the [REUSE recommendations](https://reuse.software)
and labels so that licensing and copyright are clear at the file level.

13 changes: 13 additions & 0 deletions lib/nerves_runtime.ex
Expand Up @@ -264,4 +264,17 @@ defmodule Nerves.Runtime do
    # it's come up so far.
    Function.identity(@mix_target)
  end

  @doc false
  @spec get_expected_started_apps() :: {:ok, [atom()]} | :error
  def get_expected_started_apps() do
    {:ok, [[boot]]} = :init.get_argument(:boot)
    contents = File.read!("#{boot}.boot")
    {:script, _name, instructions} = :erlang.binary_to_term(contents)

    apps = for {:apply, {:application, :start_boot, [app | _]}} <- instructions, do: app
    {:ok, apps}
  rescue
    _ -> :error
  end
end
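To illustrate how the comprehension in `get_expected_started_apps/0` picks applications out of a boot script, here is a small self-contained sketch. The module `BootScriptSketch` and the sample script contents are made up for demonstration. A real `.boot` file contains many more instructions, but the `{:apply, {:application, :start_boot, [app | _]}}` entries have the shape matched here.

```elixir
defmodule BootScriptSketch do
  @moduledoc false

  # Extract the applications that a boot script starts, in order.
  # `boot_binary` is the content of a `.boot` file: an Erlang term in
  # external term format shaped like `{:script, name, instructions}`.
  def started_apps(boot_binary) do
    {:script, _name, instructions} = :erlang.binary_to_term(boot_binary)

    # Instructions that don't match the pattern are silently skipped
    for {:apply, {:application, :start_boot, [app | _]}} <- instructions, do: app
  end

  # Build a tiny, made-up boot script for demonstration purposes.
  def sample_boot_binary() do
    script =
      {:script, {~c"my_app", ~c"0.1.0"},
       [
         {:progress, :applications_loaded},
         {:apply, {:application, :start_boot, [:kernel, :permanent]}},
         {:apply, {:application, :start_boot, [:stdlib, :permanent]}},
         {:apply, {:application, :start_boot, [:my_app, :permanent]}},
         {:progress, :started}
       ]}

    :erlang.term_to_binary(script)
  end
end

# Prints [:kernel, :stdlib, :my_app]
BootScriptSketch.sample_boot_binary()
|> BootScriptSketch.started_apps()
|> IO.inspect()
```

Because a generator pattern in `for` acts as a filter, entries such as `{:progress, _}` never raise a match error. They are simply left out of the result.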
7 changes: 6 additions & 1 deletion lib/nerves_runtime/application.ex
Expand Up @@ -12,6 +12,7 @@ defmodule Nerves.Runtime.Application do

  alias Nerves.Runtime.FwupOps
  alias Nerves.Runtime.KV
  alias Nerves.Runtime.StartupGuard

  require Logger

Expand All @@ -20,7 +21,11 @@ defmodule Nerves.Runtime.Application do
    load_services()

    options = Application.get_all_env(:nerves_runtime)
    children = [{FwupOps, options}, {KV, options} | target_children()]

    startup_guard_children =
      if options[:startup_guard_enabled], do: [{StartupGuard, options}], else: []

    children = [{FwupOps, options}, {KV, options}] ++ startup_guard_children ++ target_children()

    opts = [strategy: :one_for_one, name: Nerves.Runtime.Supervisor]
    Supervisor.start_link(children, opts)
Expand Down
191 changes: 191 additions & 0 deletions lib/nerves_runtime/startup_guard.ex
@@ -0,0 +1,191 @@
# SPDX-FileCopyrightText: 2025 Frank Hunleth
#
# SPDX-License-Identifier: Apache-2.0
defmodule Nerves.Runtime.StartupGuard do
@moduledoc """
Monitor system startup and validate firmware

This module provides an easy option for validating firmware for simple use
cases. Whether new firmware even needs to be validated on first boot is
determined by the Nerves system that you're using. When in doubt, an easy way
to tell is this: if you have to run `Nerves.Runtime.validate_firmware/0` every
time you upload new firmware, then your Nerves system requires validation. While
you may eventually want to check that networking or other things work before
validating, using this module should suffice in the meantime.

## Setup

Add the following to your project's `target.exs` or `config.exs`:

```elixir
config :nerves_runtime, startup_guard_enabled: true
```

Add the following to your project's `rel/vm.args.eex`:

```text
## Require an initialization handshake within 10 minutes
-env HEART_INIT_TIMEOUT 600
```

The discussion below explains more about the heart initialization handshake
timer.

## Discussion

Here's the high level summary:

1. New firmware is unvalidated on first boot. If it's not validated, the
next reboot runs the previous firmware again.
2. This module considers firmware good if the OTP release starts all
applications successfully. If this doesn't happen in 15 minutes, the
system reboots.
3. After application startup confirmation, the running firmware is
validated if this is the first boot by calling
`Nerves.Runtime.validate_firmware/0`.
4. `StartupGuard` stops running.

This sounds good, but broken firmware can also hang or not call the code that
gives up after 15 minutes.

Protecting against hung code eventually leads to making use of a hardware
watchdog. Most Nerves systems have one and integrate it with the Erlang
heart feature. The hardware watchdog is still a last resort, so other systems
can certainly try to gracefully reboot before the hardware watchdog kicks in.

This module registers with Erlang's heart. The
`Nerves.Runtime.Heart.init_complete/0` call is a Nerves extension to heart that
cancels the initialization handshake timer once the Erlang heart callback has
been set. This protects against hangs that occur before the callback is set and
against the code being skipped entirely.

Keep in mind that the heart callback is totally unforgiving to errors and
function calls taking too long. Making it too complicated can backfire and
cause inadvertent reboots. Rebooting too quickly on errors can impact your
ability to debug partial failures. If using this code as a template, try to
keep your code in a `Task` or change this to a `GenServer` or anything else
that can be supervised.
"""
  use Task, restart: :transient

  alias Nerves.Runtime.Heart

  require Logger

  @retry_delay :timer.seconds(10)
  @give_up_minutes 15
  @start_warning_minutes 2

  @doc false
  @spec start_link(keyword()) :: {:ok, pid()}
  def start_link(opts) do
    Task.start_link(__MODULE__, :run, [opts])
  end

  @doc false
  @spec run(keyword()) :: :ok
  def run(opts) do
    retry_delay = Keyword.get(opts, :retry_delay, @retry_delay)

    # Register with heart to bullet proof against hangs or other weirdness happening
    # in this code.
    :ok = :heart.set_callback(__MODULE__, :heart_check)
    Heart.init_complete()

    # Wait for all of the applications specified in the release to start.
    {:ok, expected_apps} =
      repeat_while(&Nerves.Runtime.get_expected_started_apps/0, :error, 10, retry_delay)

    repeat_until(fn -> all_applications_started?(expected_apps) end, 10, retry_delay)

    # Try getting the firmware validation status. If :unknown, hope.
    status = repeat_while(&Nerves.Runtime.firmware_validation_status/0, :unknown, 10, retry_delay)

    # Validate or not.
    if status == :unvalidated do
      Logger.info("Firmware not validated. Validating now...")
      :ok = Nerves.Runtime.validate_firmware()
      Logger.info("Firmware validated successfully")
    else
      Logger.info("Firmware valid and all applications started successfully")
    end

    # Stop the heart callback since all is good now
    :heart.clear_callback()
  end

  defp repeat_until(_fun, 0, _retry_delay) do
    raise RuntimeError, "Exceeded maximum retries"
  end

  defp repeat_until(fun, retries, retry_delay) do
    if !fun.() do
      Process.sleep(retry_delay)
      repeat_until(fun, retries - 1, retry_delay)
    end
  end

  defp repeat_while(_fun, _unwanted_result, 0, _retry_delay) do
    raise RuntimeError, "Exceeded maximum retries"
  end

  defp repeat_while(fun, unwanted_result, retries, retry_delay) do
    result = fun.()

    if result == unwanted_result do
      Process.sleep(retry_delay)
      repeat_while(fun, unwanted_result, retries - 1, retry_delay)
    else
      result
    end
  end

  @doc false
  @spec heart_check() :: :ok | :error
  def heart_check() do
    uptime_minutes = get_uptime_minutes()

    do_heart_check(uptime_minutes)
  end

  @doc false
  @spec do_heart_check(non_neg_integer()) :: :ok | :error
  def do_heart_check(uptime_minutes) do
    cond do
      uptime_minutes >= @give_up_minutes ->
        Logger.error("Took too long to validate firmware. Rebooting.")
        :error

      uptime_minutes < @start_warning_minutes ->
        :ok

      uptime_minutes != Process.get(:last_warning_minutes) ->
        Logger.warning(
          "Firmware not validated. Check logs. Rebooting in #{@give_up_minutes - uptime_minutes} minutes if unfixed."
        )

        Process.put(:last_warning_minutes, uptime_minutes)
        :ok

      true ->
        :ok
    end
  end

  defp get_uptime_minutes() do
    {total, _last_call} = :erlang.statistics(:wall_clock)
    div(total, 60_000)
Copilot AI, Jul 18, 2025:

[nitpick] The magic number 60_000 should be defined as a module attribute or constant to improve code clarity and maintainability.

Suggested change:
- div(total, 60_000)
+ div(total, @milliseconds_per_minute)

  end

  defp all_applications_started?(expected_apps) do
    actual_apps = for {app, _, _} <- Application.started_applications(), do: app

    unstarted_apps = expected_apps -- actual_apps

    if unstarted_apps != [] do
      Logger.warning("Waiting on the following applications to start: #{inspect(unstarted_apps)}")
      false
    else
      true
    end
  end
end
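The time thresholds in `do_heart_check/1` can be restated as a small standalone sketch. `HeartCheckSketch` and its return atoms (`:quiet`, `:warn`, `:reboot`) are hypothetical names used only to make the three phases explicit. The real callback returns `:ok` or `:error` and additionally rate-limits the warning to once per minute.

```elixir
defmodule HeartCheckSketch do
  @moduledoc false

  @give_up_minutes 15
  @start_warning_minutes 2

  # Minutes since the Erlang runtime started. Note that :wall_clock
  # measures time since the VM started, not since the kernel booted.
  def uptime_minutes() do
    {total_ms, _since_last_call_ms} = :erlang.statistics(:wall_clock)
    div(total_ms, 60_000)
  end

  # Too long without validation: tell heart to reboot
  def decide(minutes) when minutes >= @give_up_minutes, do: :reboot

  # Early boot: stay quiet while applications are still starting
  def decide(minutes) when minutes < @start_warning_minutes, do: :quiet

  # In between: warn that a reboot is coming
  def decide(_minutes), do: :warn
end

for minutes <- [0, 2, 14, 15] do
  IO.puts("#{minutes} min -> #{HeartCheckSketch.decide(minutes)}")
end
```

With the values above, the device stays quiet for the first 2 minutes, warns from minute 2 through minute 14, and reboots at minute 15.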