
Embedded Firmware Development: Why Simplicity Wins in Critical Systems

Lessons learned from developing critical embedded firmware on ARM Cortex-M microcontrollers. Why bare-metal approaches often outperform RTOS and Linux-based solutions in reliability, real-time performance, and maintainability.

In embedded systems development, there’s a persistent temptation to reach for powerful tools: RTOSes like FreeRTOS, Linux-based solutions, or heavyweight HAL abstractions. These tools promise to simplify development, but they come with hidden costs that can undermine the very characteristics that make embedded systems valuable—determinism, real-time performance, and lean resource usage.

Over years of developing critical firmware for ARM Cortex-M microcontrollers, I’ve learned a counterintuitive lesson: keeping it simple is almost always better. Bare-metal approaches often deliver superior reliability, performance, and maintainability compared to more “sophisticated” solutions.

As both an engineer who’s written this firmware and a technical leader responsible for team execution, I’ve seen how unnecessary complexity derails projects and creates technical debt. Great engineers get drawn to the latest tools and coolest technologies—but the newest isn’t always the best, and what’s interesting isn’t always what’s appropriate.

Here’s what I’ve learned about when to embrace simplicity, when complexity is justified, and how to lead teams to make the right architectural decisions.

Leadership Perspective: Protecting Your Team from Complexity

One of the most critical leadership skills in embedded development is recognizing and preventing unnecessary complexity. This isn’t just a technical decision—it’s about protecting your team, your project, and your organization from tech debt traps that look like innovation but end in failure.

The “Latest and Greatest” Trap

Great engineers are often their own worst enemy. They’re passionate about technology, eager to learn, and always aware of new tools and frameworks. This is what makes them great—but it’s also what can derail projects.

Common scenarios I’ve encountered:

Engineer: “We should use FreeRTOS for this project. It’s industry standard and I want to learn it.”

Translation: “I want to add this to my resume, regardless of whether the project needs it.”

Engineer: “Let’s use Rust instead of C. It’s memory-safe and the future of embedded.”

Translation: “I’m excited about new technology, but I haven’t considered that the rest of the team doesn’t know Rust, and we’d be pioneers debugging toolchain issues.”

Engineer: “We need a message bus architecture with pub/sub patterns.”

Translation: “I’m applying patterns from web services to a system with three cooperating modules.”

None of these engineers are wrong to be interested in these technologies. But interest doesn’t equal appropriateness.

The Leader’s Responsibility

As a technical leader, your job is to:

  1. Distinguish between genuine need and technical curiosity
  2. Protect the project from becoming a learning experiment
  3. Balance team growth with project success
  4. Set architectural guardrails that prevent complexity creep

This doesn’t mean stifling innovation or preventing growth. It means making conscious decisions about when complexity is justified.

Questions I Ask When Evaluating Technical Proposals

When an engineer proposes using an RTOS, a new language, or a complex framework, I ask:

1. What specific problem does this solve?

  • Not “what could it do” or “what might we need”
  • What actual, current problem does this address?
  • Can you quantify the benefit?

2. What simpler alternatives have you considered?

  • Did you try solving it with what we already have?
  • What’s the simplest solution that could possibly work?
  • Why is that insufficient?

3. What’s the total cost?

  • Not just lines of code, but:
    • Learning curve for the team
    • Debugging complexity
    • Maintenance burden
    • RAM/flash overhead
    • Integration complexity

4. What’s the reversibility?

  • Can we remove this later if it doesn’t work out?
  • Or are we locked in once we start down this path?

5. Who owns this?

  • If you leave the team, who maintains it?
  • Is the expertise transferable?

Red Flags in Technical Discussions

Over the years, I’ve learned to recognize warning signs:

Red Flag: “Everyone uses [technology X]”

  • Reality: Popularity doesn’t equal appropriateness for your use case
  • Response: “Show me three projects with similar constraints to ours where it worked well”

Red Flag: “It’s more elegant/modern/clean”

  • Reality: Aesthetic preferences don’t justify added complexity
  • Response: “How does that elegance translate to measurable project benefit?”

Red Flag: “We might need this flexibility later”

  • Reality: YAGNI (You Aren’t Gonna Need It) applies to embedded too
  • Response: “Let’s solve today’s problem today. We can refactor if that future arrives.”

Red Flag: “The alternative is writing boilerplate code”

  • Reality: Sometimes explicit is better than implicit
  • Response: “Let’s look at how much boilerplate we’re actually avoiding vs. the framework we’d pull in”

Red Flag: “I can handle the complexity”

  • Reality: You won’t be the only one touching this code
  • Response: “Can the most junior person on the team debug this at 2am?”

Setting Architectural Standards

As a leader, I establish clear default choices:

Our Team Standards (Example):

  • Default: Bare-metal C with CMSIS
  • RTOS allowed when: Three or more independent concurrent processes with proven state machine complexity
  • HAL allowed when: Rapid prototyping or complex peripheral (USB, Ethernet, graphics)
  • Dynamic allocation: Forbidden except with written justification and review
  • External libraries: Require architecture review before inclusion

These aren’t rigid rules—they’re defaults that require justification to override.

This framework:

  • Prevents ad-hoc complexity
  • Forces engineers to articulate why complexity is needed
  • Creates consistency across projects
  • Protects future maintainers

The Tech Debt Trap

Unnecessary complexity is the root of technical debt. I’ve seen projects fail or require complete rewrites because engineers chose complexity for the wrong reasons:

Case 1: The RTOS That Wasn’t Needed

  • Engineer added FreeRTOS “for structure”
  • Project had three simple state machines
  • Result: 6 months of debugging priority inversions and race conditions
  • Outcome: Rewritten in bare-metal in 3 weeks, never had issues again

Case 2: The Premature Abstraction

  • Engineer built HAL abstraction “for portability” across MCU families
  • Project targeted one specific chip
  • Result: Debugging required stepping through 4 layers of indirection
  • Outcome: Performance issues, increased flash usage, eventually ripped out

Case 3: The Framework Overkill

  • Engineer integrated graphics framework for simple status display
  • Framework was 200KB, project had 256KB flash total
  • Result: Constant battles with memory limits, features cut
  • Outcome: Rewrote with direct framebuffer drawing, recovered 180KB

Common pattern: Engineer saw a cool technology and found a way to justify it, rather than solving the actual problem with appropriate tools.

Balancing Team Growth and Project Success

The question: “How do we let engineers grow without turning projects into experiments?”

My approach:

1. Separate Learning from Delivery

  • Side projects and tech demos for exploration
  • Hackathons for trying new tools
  • Proof-of-concepts before committing to production

2. Controlled Introduction

  • New technology on non-critical features first
  • Pilot projects before organization-wide adoption
  • Post-mortems to evaluate if it was worth it

3. “Innovation Budget”

  • 10-15% of project work can be “new”
  • 85-90% should be proven, understood technology
  • Prevents both stagnation and chaos

4. Mentorship, Not Mandates

  • Explain why we’re choosing simplicity
  • Show the consequences of complexity from past projects
  • Help engineers understand total cost, not just coding

When to Override Simplicity

I’ve advocated strongly for simplicity throughout this article, but leadership also means recognizing when complexity is justified:

Valid reasons to accept complexity:

  1. Proven bottleneck: You've measured and confirmed the simple approach won't work
  2. Safety/Reliability requirement: Complexity adds genuine fault tolerance
  3. Regulatory mandate: Certification requires specific approaches
  4. Team capability: Team has deep expertise in the complex tool
  5. Strategic investment: Technology aligns with long-term architectural direction

The key: These are evidence-based decisions, not speculative ones.

Leading by Example

The most powerful thing you can do as a technical leader is write simple code yourself.

When I’m hands-on with firmware:

  • I use bare-metal approaches
  • I write direct register access
  • I avoid abstractions unless clearly justified
  • I document why I chose simple over complex

This sets the tone. Engineers see that:

  • Simplicity isn’t laziness—it’s discipline
  • Senior engineers don’t need frameworks to prove their expertise
  • Clean, direct code is valued more than clever complexity

The Hard Conversation

Sometimes you have to say no. This is uncomfortable, especially with talented engineers who are genuinely excited.

How I handle it:

“I appreciate your enthusiasm for [technology X]. It’s powerful and I understand why you want to use it. But for this project, I don’t see evidence that the benefits outweigh the costs. Here’s what I need to see to change my mind: [specific, measurable criteria].

In the meantime, let’s solve this problem with [simpler approach]. If we hit limitations, we can revisit. And I’d love to support you exploring [technology X] in [side project/hackathon/future pilot].”

What this does:

  • Acknowledges their interest (respect)
  • Requires evidence, not opinion (objectivity)
  • Offers path forward (compromise)
  • Provides alternative outlet (growth opportunity)

Most engineers appreciate this approach. They want to build great products, not just use cool tech. Framing it as “what’s best for the product” usually resonates.

Measuring Success

How do you know if you’re making good complexity decisions?

Positive indicators:

  • New engineers can be productive quickly (< 2 weeks)
  • Debugging sessions are measured in minutes, not hours
  • Code reviews focus on logic, not framework internals
  • Bug density is low and decreasing
  • Engineers can work across different projects easily

Warning signs:

  • “Only [person X] understands this module”
  • Bugs take days to reproduce and fix
  • Flash/RAM constantly at limits
  • Engineers reluctant to modify certain code
  • Turnover correlates with working on complex systems

The Ultimate Question

When evaluating any architectural decision, I ask:

“If our best engineer leaves tomorrow, can the team maintain this?”

If the answer is no, you’ve probably over-engineered.

If the answer is yes, you’ve built something sustainable.

That’s the mark of good technical leadership: building systems that outlive any individual contributor’s tenure.

The Complexity Trap

The Seductive Appeal of “Just Add an RTOS”

When facing a moderately complex embedded project, the conventional wisdom often sounds like this:

“Just use FreeRTOS. It’ll make multitasking easier, and it’s free!”

Or worse:

“This needs networking and a file system. Let’s just run Linux. Why fight it?”

These recommendations sound reasonable—until you consider the tradeoffs you’re implicitly accepting:

RTOS Hidden Costs:

  • Memory overhead: Task stacks, kernel heap, control structures (8-16KB minimum, often much more)
  • Timing variability: Scheduler latency, context switches, priority inversions
  • Complexity: Mutexes, semaphores, queues—each a potential deadlock or race condition
  • Debugging difficulty: Concurrent bugs are notoriously hard to reproduce and fix
  • Learning curve: Team needs to understand RTOS internals, not just your application

Linux Hidden Costs:

  • Resource requirements: Minimum 32MB RAM, substantial flash, faster processor
  • Non-deterministic timing: Kernel preemption, virtual memory, device drivers all introduce latency
  • Boot time: Seconds instead of milliseconds
  • Power consumption: Dramatically higher idle and active power draw
  • Complexity squared: Kernel configuration, device trees, root filesystem, bootloaders
  • Update/security burden: Kernel patches, CVEs, package management

For many embedded applications, these costs far outweigh the benefits.

When Bare-Metal is Better: The 90% Use Case

The vast majority of embedded projects fall into this category:

  • Single core processor with straightforward peripheral interaction
  • Predictable timing requirements measured in microseconds or milliseconds
  • Limited concurrency that can be handled with interrupt-driven state machines
  • Resource constraints where every kilobyte of RAM and flash matters
  • Reliability-critical applications where deterministic behavior is paramount
  • Fast boot requirements where milliseconds count
  • Low power applications where sleep modes and power efficiency are critical

If your project fits this profile, a bare-metal approach delivers:

  1. Deterministic real-time performance: Worst-case execution time (WCET) is predictable and bounded
  2. Minimal resource usage: No kernel overhead means more resources for your application
  3. Complete control: You know exactly what the processor is doing at all times
  4. Fast boot times: Initialize peripherals and go (1-100ms typical)
  5. Lower power consumption: Efficient use of sleep modes without kernel overhead
  6. Easier debugging: Sequential logic is easier to reason about than concurrent tasks
  7. Smaller attack surface: Less code means fewer potential vulnerabilities

Real-World Example: Event-Driven State Machine Architecture

Let me illustrate with a pattern I’ve used successfully across multiple critical firmware projects on ARM Cortex-M microcontrollers.

The Architecture

Core Concept: Event-driven execution with interrupt handlers and Wait-For-Interrupt (WFI) idle mode.

#include <stdbool.h>
#include <stdint.h>

// Event flags: set in interrupt handlers, consumed in the main loop
static volatile bool has_display_update;
static volatile bool has_sensor_reading;

int main(void)
{
    // Initialize system clock
    Clock_Init();
    
    // Initialize peripherals
    UART_Init();       // Debug output
    GPIO_Init();       // LEDs, control outputs
    Button_IRQ_Init(); // User input with debounced interrupt handling
    Watchdog_Init();   // Automotive-grade reliability
    
    // Initialize any hardware accelerators
    DMA_Init();        // DMA for efficient data transfers
    Display_Init();    // Display controller if present
    
    // Event-driven main loop
    while (1) {
        // Process any pending events
        if (has_display_update) {
            Update_Display();
            has_display_update = false;
        }
        
        if (has_sensor_reading) {
            Process_Sensor_Data();
            has_sensor_reading = false;
        }
        
        // Refresh watchdog
        Watchdog_Refresh();
        
        // Sleep until next interrupt (power efficient)
        __WFI();
    }
}

SysTick Handler (1ms periodic interrupt):

// Millisecond tick counter (volatile: shared between the ISR and the main loop)
static volatile uint32_t time_ms;

void SysTick_Handler(void)
{
    // Simple, predictable ISR
    time_ms++;
    
    // Update debounce logic
    Button_DebounceUpdate();
    
    // Update LED blinking
    Update_LEDs_Periodic();
    
    // Backup watchdog refresh
    if ((time_ms % 1000) == 0) {
        Watchdog_Refresh();
    }
}

Button Interrupt Handler (GPIO EXTI):

void Button_IRQ_Handler(void)
{
    // Read hardware state
    uint8_t button_state = (BUTTON_PORT->IDR & (1 << BUTTON_PIN)) ? 1 : 0;
    
    // Update debounce state machine
    // (Software debouncing for reliability)
    
    if (button_debounced && button_edge_detected) {
        // Signal main loop
        button_event_pending = true;
        button_event = BUTTON_PRESSED;
    }
    
    // Clear interrupt flag
    EXTI->PR = (1 << BUTTON_PIN);
}

What This Architecture Achieves

Deterministic Response Times:

  • Button press to LED change: < 100μs (interrupt latency + GPIO write)
  • Display update: < 20ms (with hardware acceleration)
  • Watchdog refresh: every main loop iteration, plus a 1-second backup from the SysTick handler

Resource Efficiency:

  • Flash usage: Minimal footprint (no OS overhead)
  • RAM usage: Efficient (only what you need)
  • Idle power: Very low (WFI sleep mode between events)
  • No dynamic memory allocation (malloc/free) - zero fragmentation risk

Maintainability:

  • Single-threaded logic - easy to reason about execution flow
  • No race conditions - interrupt handlers set flags, main loop processes
  • No deadlocks - no mutexes or semaphores needed
  • Straightforward debugging - logic analyzer shows exact timing

Reliability:

  • Independent watchdog catches firmware hangs
  • Input debouncing prevents false button triggers
  • Deterministic execution paths - no scheduler surprises

The Direct Register Access Advantage

Modern embedded development often relies on vendor HAL (Hardware Abstraction Layer) libraries. While convenient, HAL introduces unnecessary overhead and complexity for many applications.

HAL vs. Direct Register Access

HAL Approach (using STM32 HAL as an example):

// Initialize GPIO with HAL
GPIO_InitTypeDef GPIO_InitStruct = {0};
GPIO_InitStruct.Pin = GPIO_PIN_13;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
HAL_GPIO_Init(GPIOG, &GPIO_InitStruct);

// Toggle GPIO with HAL
HAL_GPIO_WritePin(GPIOG, GPIO_PIN_13, GPIO_PIN_SET);

Direct Register Approach (bare-metal):

// Initialize GPIO directly
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOGEN;  // Enable clock
GPIOG->MODER &= ~(3U << (13 * 2));     // Clear mode bits
GPIOG->MODER |= (1U << (13 * 2));      // Set as output
GPIOG->OSPEEDR |= (3U << (13 * 2));    // High speed

// Toggle GPIO directly
GPIOG->BSRR = (1 << 13);  // Set pin (1 cycle atomic operation)

Why Direct Access Wins

Performance:

  • HAL: Multiple function calls, parameter validation, overhead
  • Direct: Single instruction, predictable cycle count
  • Real-world impact: HAL GPIO toggle ~50-100 cycles; direct register ~1-3 cycles

Code Size:

  • HAL: Links entire HAL GPIO module (~5-10KB flash)
  • Direct: Only the exact register writes you use (~50-100 bytes)

Transparency:

  • HAL: Abstraction hides what’s actually happening
  • Direct: You see exactly what hardware is being touched

Debugging:

  • HAL: Step through layers of abstraction
  • Direct: Set breakpoint, read register, done

When HAL Makes Sense

I’m not advocating for never using HAL. It has legitimate use cases:

  • Rapid prototyping: Get something working quickly
  • Complex peripherals: USB, Ethernet, SD card—where low-level details are intricate
  • Cross-family portability: Switching between chip families from the same vendor
  • Large teams: Consistent API reduces onboarding time

But for production firmware where performance, size, and determinism matter, direct register access is often superior.

CMSIS: The Right Level of Abstraction

The ARM CMSIS (Cortex Microcontroller Software Interface Standard) provides the sweet spot: standardized peripheral access without heavyweight abstractions.

What CMSIS Provides:

  • Peripheral structs: Clean register access (e.g., GPIOG->BSRR)
  • Standard definitions: __WFI(), __disable_irq(), etc.
  • Core functions: NVIC configuration, SysTick, etc.
  • Zero overhead: Macros and inline functions compile to direct register access

What CMSIS Doesn’t Force On You:

  • Heavy abstraction layers
  • Callback frameworks
  • Configuration generators
  • Massive linked libraries

This gives you:

#include "stm32f4xx.h"  // CMSIS device header (STM32 example)

// Direct, readable register access
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOGEN;  // Enable GPIO clock
GPIOG->MODER |= (1U << (13 * 2));      // Set as output

// Standard ARM Cortex-M core functions (portable across vendors)
__disable_irq();            // Atomic section
NVIC_EnableIRQ(EXTI0_IRQn); // Enable interrupt
__WFI();                    // Wait for interrupt

Clean, efficient, portable across ARM Cortex-M devices, and no mystery about what’s happening at the hardware level.

When to Consider FreeRTOS

Despite my advocacy for simplicity, there are legitimate scenarios where FreeRTOS adds value:

Valid Use Cases for FreeRTOS

1. True Multi-Threading Needs

  • Multiple independent processes with different timing requirements
  • Example: Sensor data collection (10Hz) + network communication (async) + UI updates (30Hz)
  • Benefit: Task abstraction simplifies logical separation

2. Complex Synchronization

  • Multiple producers/consumers with shared resources
  • Example: Data pipeline with buffering between stages
  • Benefit: Queue and semaphore primitives reduce custom synchronization code (see the sketch after this list)

3. Dynamic Priority Management

  • Runtime priority changes based on system state
  • Example: Adaptive control systems with changing workload priorities
  • Benefit: RTOS scheduler handles priority preemption automatically

4. Team Scale and Modularity

  • Large teams where task isolation reduces coupling
  • Example: 10+ engineers working on different subsystems
  • Benefit: Tasks provide natural module boundaries
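
To make the synchronization case above concrete, here is a minimal sketch of the producer/consumer structure FreeRTOS enables. It uses the standard queue and task APIs; the task names, stack depths, priorities, and the Read_Sensor/Send_Over_Network helpers are illustrative assumptions, not code from a real project.

#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

static QueueHandle_t sensor_queue;

static void Sensor_Task(void *arg)
{
    (void)arg;
    for (;;) {
        uint16_t sample = Read_Sensor();                 // assumed application helper
        xQueueSend(sensor_queue, &sample, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(100));                  // 10 Hz acquisition
    }
}

static void Comms_Task(void *arg)
{
    (void)arg;
    uint16_t sample;
    for (;;) {
        if (xQueueReceive(sensor_queue, &sample, portMAX_DELAY) == pdPASS) {
            Send_Over_Network(sample);                   // assumed application helper
        }
    }
}

void Start_Application(void)
{
    sensor_queue = xQueueCreate(16, sizeof(uint16_t));   // 16-entry sample queue
    xTaskCreate(Sensor_Task, "sensor", 256, NULL, 2, NULL);
    xTaskCreate(Comms_Task,  "comms",  512, NULL, 1, NULL);
    vTaskStartScheduler();                               // never returns
}

Note how much machinery even this small example pulls in compared with a flag-and-main-loop design; that is exactly the overhead the rest of this article argues you should only accept when the concurrency is real.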

The Threshold Test

Ask yourself:

  • Can interrupts + state machines handle this? → Stay bare-metal
  • Is timing variability acceptable? → If no, stay bare-metal
  • Do I have RAM/flash headroom? → If no, stay bare-metal
  • Is the team comfortable with RTOS concepts? → If no, stay bare-metal

If you answer “yes” to all these, FreeRTOS might be justified. But even then, consider the simplest solution that could possibly work.

When to Consider Linux (Hint: Rarely for MCUs)

Linux on embedded systems has valid use cases—but typical ARM Cortex-M microcontrollers rarely fit them.

Valid Use Cases for Linux

When Linux Makes Sense:

  • Complex networking: Full TCP/IP stack, TLS, multiple protocols
  • File systems: Large storage with file management (SD cards, eMMC)
  • High-level languages: Running Python, Node.js, or similar
  • Rich ecosystems: Need existing Linux packages and tools
  • Development speed: Rapid iteration with familiar userspace tools

Minimum Hardware Reality:

  • RAM: 32MB minimum (realistically 128MB+ for comfortable margin)
  • Flash: 8MB+ for kernel, rootfs, applications
  • Processor: 400MHz+ ARM Cortex-A series
  • Power budget: Idle current in tens of milliamps
  • Boot time: Seconds, not milliseconds

Where This Fits:

  • STM32MP1 (Cortex-A7 + Cortex-M4)
  • Raspberry Pi Compute Module
  • i.MX6/7 series
  • Not MCU-class parts: Cortex-M microcontrollers (STM32F4/H7/L4, Nordic nRF) or similar small chips such as the ESP32 family

The Microcontroller Reality Check

A typical high-end Cortex-M MCU:

  • RAM: 256KB - 1MB
  • Flash: 512KB - 2MB
  • Clock: 120-600MHz ARM Cortex-M4/M7
  • Target use: Real-time control, sensor acquisition, motor control

These specs don’t support Linux. Period.

Trying to squeeze Linux onto a Cortex-M microcontroller is an academic exercise, not a production solution. If you need Linux capabilities, you need different hardware (Cortex-A processor or higher).

Practical Decision Framework

Here’s the framework I use when architecting embedded firmware:

Start Here (Default for MCU Projects)

Bare-Metal + Event-Driven Architecture

Characteristics:

  • Interrupt-driven peripheral handling
  • State machines for control logic
  • WFI for power efficiency
  • Direct register access for critical paths
  • CMSIS for standardization

When to Move Beyond:

  • Clear evidence that complexity justifies the overhead
  • Not “might need” or “could be useful”—actual requirements

Decision Tree

┌─────────────────────────────────┐
│  Start: Bare-Metal Event-Driven │
└─────────────┬───────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │ Need 3+ concurrent  │
    │ independent tasks?  │
    └─────┬─────────┬─────┘
         No        Yes
          │         │
          │         ▼
          │    ┌────────────────┐
          │    │ Can state      │
          │    │ machines + ISR │
          │    │ handle it?     │
          │    └──┬──────────┬──┘
          │      Yes         No
          │       │          │
          │       │          ▼
          │       │    ┌──────────────┐
          │       │    │ Consider     │
          │       │    │ FreeRTOS     │
          │       │    └──────────────┘
          │       │
          ▼       ▼
    ┌──────────────────┐
    │ Stay Bare-Metal  │
    └──────────────────┘

┌─────────────────────────────────┐
│  Need networking or file system?│
└─────────────┬───────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │ Simple needs?       │
    │ (HTTP, MQTT, FAT)   │
    └─────┬─────────┬─────┘
         Yes        No
          │         │
          ▼         ▼
    ┌─────────┐  ┌──────────────┐
    │ LwIP +  │  │ Need Linux   │
    │ FatFS   │  │ → Different  │
    │ on RTOS │  │   hardware   │
    └─────────┘  └──────────────┘

The 80/20 Rule

In my experience:

  • 80% of MCU projects → Bare-metal is optimal
  • 15% of MCU projects → FreeRTOS adds value
  • 5% of MCU projects → Need fundamentally different hardware (MPU/SoC, not MCU)

Real-World Reliability: The Watchdog Philosophy

Critical firmware needs reliability mechanisms that work regardless of your software architecture. One of the most important: the independent watchdog timer.

Hardware Watchdog Implementation

Example using STM32 Independent Watchdog:

void Watchdog_Init(void)
{
    // STM32 Independent Watchdog (IWDG)
    // Uses separate LSI oscillator - survives even if main clock fails
    
    IWDG->KR = 0x5555;  // Enable register access
    IWDG->PR = 0x04;    // Prescaler: 40kHz / 64 = 625Hz
    IWDG->RLR = 625;    // Reload: 625 / 625Hz = 1 second timeout
    IWDG->KR = 0xCCCC;  // Start watchdog
}

void Watchdog_Refresh(void)
{
    IWDG->KR = 0xAAAA;  // Reload counter
}

Note: Most MCUs (Nordic nRF, ESP32, TI MSP430, etc.) have similar independent watchdog peripherals with comparable configuration approaches.

Critical Design Decision: The watchdog timeout defines your maximum acceptable firmware hang time.

  • Too short (< 100ms): False resets from legitimate long operations
  • Too long (> 5s): System hangs go undetected for too long
  • Sweet spot (500ms - 2s): Catches real hangs without false positives
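
The arithmetic behind the reload value above is worth making explicit. A sketch, assuming the ~40kHz LSI and /64 prescaler from the init code (the LSI frequency varies by device and temperature, so treat the result as approximate):

// Illustrative timeout calculation for the IWDG configuration above
#define LSI_FREQ_HZ          40000U
#define IWDG_PRESCALER       64U
#define IWDG_TICK_HZ         (LSI_FREQ_HZ / IWDG_PRESCALER)                   // 625 Hz

#define WATCHDOG_TIMEOUT_MS  1000U
#define IWDG_RELOAD_VALUE    ((IWDG_TICK_HZ * WATCHDOG_TIMEOUT_MS) / 1000U)   // 625

// The IWDG reload register is 12 bits wide, so the value must fit in 0..4095
_Static_assert(IWDG_RELOAD_VALUE <= 0x0FFFU, "IWDG reload exceeds 12-bit range");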

The Refresh Pattern

Correct Pattern (main loop + backup):

int main(void) {
    Watchdog_Init();
    
    while (1) {
        // Process events
        Handle_Button_Events();
        Update_Display();
        Process_Sensors();
        
        // Refresh watchdog at end of main loop
        Watchdog_Refresh();
        
        __WFI();
    }
}

// Backup refresh in SysTick (every 1 second)
void SysTick_Handler(void) {
    if ((time_ms % 1000) == 0) {
        Watchdog_Refresh();  // Safety net
    }
}

Why Two Refresh Points?

  1. Main loop refresh: Proves main loop is running
  2. SysTick backup: Prevents reset if main loop blocks briefly on legitimate work

This dual approach catches real firmware hangs while tolerating brief legitimate blocking operations (e.g., flash writes, DMA transfers).

RTOS Complication

With FreeRTOS, watchdog management becomes more complex:

// Which task refreshes the watchdog?
// Option 1: Idle task (doesn't prove application is healthy)
// Option 2: Watchdog task (adds overhead, priority inversion risks)
// Option 3: Every task reports in (complex, coupling across tasks)

Bare-metal avoids this complexity entirely. One main loop, one clear refresh point, deterministic behavior.

Build System: Keep It Simple Here Too

Just as firmware benefits from simplicity, so do build systems.

CMake for Embedded: The Right Tool

Modern embedded development deserves modern build tools. Makefiles work but are brittle and hard to maintain. CMake provides structure without excessive complexity.

Minimal Embedded CMake:

cmake_minimum_required(VERSION 3.22)

# Cross-compiler is expected to come from an arm-none-eabi toolchain file
# passed on the command line via -DCMAKE_TOOLCHAIN_FILE=<file>
project(stm32_project C ASM)

set(CMAKE_C_STANDARD 11)
set(TARGET ${PROJECT_NAME})

# Device linker script (adjust the path and name for your MCU)
set(LINKER_SCRIPT ${CMAKE_SOURCE_DIR}/stm32f4_flash.ld)

# MCU Configuration
set(MCU_FLAGS
    -mcpu=cortex-m4
    -mthumb
    -mfpu=fpv4-sp-d16
    -mfloat-abi=hard
)

# Compiler flags
add_compile_options(
    ${MCU_FLAGS}
    -Wall
    -fdata-sections
    -ffunction-sections
)

# Linker flags (a plain option list avoids the semicolon-joining pitfall of
# building CMAKE_EXE_LINKER_FLAGS from a list variable)
add_link_options(
    ${MCU_FLAGS}
    -specs=nano.specs
    -T${LINKER_SCRIPT}
    -Wl,--gc-sections
)

# Sources
add_executable(${TARGET}.elf
    src/main.c
    src/clock_cfg.c
    src/gpio_cfg.c
    # ... more sources
)

# Generate binary
add_custom_command(TARGET ${TARGET}.elf POST_BUILD
    COMMAND arm-none-eabi-objcopy -O binary ${TARGET}.elf ${TARGET}.bin
)

What This Achieves:

  • Clean, readable configuration
  • IDE integration (CLion, VS Code)
  • Compile commands export for tooling
  • Cross-platform (Windows, Mac, Linux)
  • Easy to extend without becoming unmaintainable

What to Avoid:

  • Code generators (STM32CubeMX can help, but don’t let it own your codebase)
  • Complex meta-build systems
  • Vendor lock-in tools
  • Over-abstracted build frameworks

Memory Management: Static Allocation is Your Friend

Dynamic memory allocation (malloc/free) is a common source of bugs in embedded systems:

  • Fragmentation: Heap becomes unusable over time
  • Non-deterministic: Allocation time varies
  • Failure handling: What if malloc returns NULL?
  • Debugging: Memory leaks are hard to find

Static Allocation Strategy

// Bad: Dynamic allocation
void process_data(void) {
    uint8_t *buffer = malloc(1024);
    if (buffer == NULL) {
        // Now what?
    }
    // ... use buffer ...
    free(buffer);  // Easy to forget!
}

// Good: Static allocation
#define BUFFER_SIZE 1024
static uint8_t buffer[BUFFER_SIZE];

void process_data(void) {
    // Buffer always available
    // No allocation failure
    // No fragmentation
    // No leaks
}

Benefits:

  • Deterministic: Memory layout known at compile time
  • Reliable: No allocation failures at runtime
  • Debuggable: Memory map is fixed and visible
  • Efficient: No malloc/free overhead

Cost:

  • Must know maximum sizes at compile time
  • Can’t dynamically scale to varying workloads

For most embedded systems, especially safety-critical ones, this tradeoff heavily favors static allocation.
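
When payload sizes genuinely vary at runtime, a statically allocated ring buffer is a common middle ground: fixed memory, no heap, bounded behavior. A minimal sketch (the capacity and names are illustrative); with a single producer (an ISR) and a single consumer (the main loop) it needs no locking at all:

#include <stdint.h>
#include <stdbool.h>

#define FIFO_SIZE 256U   // capacity must be known at build time

static uint8_t           fifo_buf[FIFO_SIZE];
static volatile uint16_t fifo_head;   // write index (producer)
static volatile uint16_t fifo_tail;   // read index (consumer)

bool FIFO_Push(uint8_t byte)
{
    uint16_t next = (uint16_t)((fifo_head + 1U) % FIFO_SIZE);
    if (next == fifo_tail) {
        return false;                  // full: caller decides how to handle it
    }
    fifo_buf[fifo_head] = byte;
    fifo_head = next;
    return true;
}

bool FIFO_Pop(uint8_t *byte)
{
    if (fifo_head == fifo_tail) {
        return false;                  // empty
    }
    *byte = fifo_buf[fifo_tail];
    fifo_tail = (uint16_t)((fifo_tail + 1U) % FIFO_SIZE);
    return true;
}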

Power Efficiency: The WFI Advantage

Power consumption matters, even in non-battery applications. Lower power means:

  • Less heat generation
  • Smaller power supplies
  • Better EMI characteristics
  • Reduced operating costs

Wait-For-Interrupt (WFI) Pattern

while (1) {
    // Process any pending events
    if (event_pending) {
        Process_Event();
        event_pending = false;
    }
    
    // Sleep until next interrupt
    __WFI();
}

What Happens:

  • Core clock stops (CPU idles)
  • Peripherals continue running
  • Interrupts wake the processor
  • Resume execution after WFI
  • Microsecond-scale wake latency

Power Savings: Typical high-performance Cortex-M power consumption:

  • Run mode (high speed): 50-100mA
  • Sleep mode (WFI): 20-40mA
  • Deep sleep: 1-5mA (peripherals off)

In typical event-driven firmware, the processor spends 80-95% of time in WFI, cutting average power consumption dramatically.

Example: STM32F429 at 180MHz draws ~90mA running, ~35mA in WFI sleep mode.

RTOS Comparison

FreeRTOS supports idle task hooks and tickless modes, but:

  • More complex to configure correctly
  • Idle task still incurs scheduler overhead
  • Tickless mode requires careful tuning
  • Wake latency is less predictable

Bare-metal WFI is simpler and often more power-efficient.

Debugging: Simplicity Enables Better Visibility

One often-overlooked advantage of bare-metal firmware: it’s dramatically easier to debug.

Debug Strategies

1. Hardware Debug (SWD/JTAG)

// Set breakpoint in main loop
while (1) {
    Process_Events();  // <- Breakpoint here
    __WFI();
}

// Examine exact register state
// Step through assembly if needed
// No scheduler interference

2. Logic Analyzer/Oscilloscope

// Toggle GPIO to mark timing
GPIOG->BSRR = (1 << DEBUG_PIN);    // Set high
Critical_Function();                // Measure this
GPIOG->BSRR = (1 << (DEBUG_PIN + 16));  // Set low

// Logic analyzer shows exact timing

3. Serial Debug Output

void UART_Debug_Init(void) {
    // Minimal UART setup (STM32 example)
    RCC->APB2ENR |= RCC_APB2ENR_USART1EN;
    // ... configure TX pin ...
    USART1->BRR = UART_BRR_VALUE;
    USART1->CR1 = USART_CR1_TE | USART_CR1_UE;
}

// printf support (standard across ARM toolchains)
int _write(int file, char *ptr, int len) {
    for (int i = 0; i < len; i++) {
        while (!(USART1->SR & USART_SR_TXE));
        USART1->DR = ptr[i];
    }
    return len;
}

// Now use printf for debugging
printf("[STATE] Transition: STOP -> START\r\n");

This pattern works across virtually all ARM Cortex-M MCUs—just adjust register names for your specific chip.

RTOS Debugging Challenges:

  • Breakpoints can disturb task timing
  • Watchpoints may trigger on scheduler operations
  • Task context switches obscure execution flow
  • Race conditions are non-deterministic
  • printf from multiple tasks needs synchronization

Bare-metal execution is sequential and predictable, making debugging straightforward.

Testing: Simplicity Enables Determinism

Firmware testing is hard. Simplicity makes it less hard.

Unit Testing Strategy

// Module: button_debounce.c
#include <stdint.h>

#define DEBOUNCE_SAMPLES 5
#define GPIO_HIGH        1

typedef enum {
    BUTTON_RELEASED,
    BUTTON_PRESSED
} ButtonState;

static uint8_t stable_samples;

// Feed one raw sample per tick (e.g., from the SysTick handler)
void Button_Sample(uint8_t raw_level) {
    if (raw_level) {
        if (stable_samples < DEBOUNCE_SAMPLES) stable_samples++;
    } else {
        stable_samples = 0;
    }
}

ButtonState Button_GetState(void) {
    // Pressed only after DEBOUNCE_SAMPLES consecutive high samples
    return (stable_samples >= DEBOUNCE_SAMPLES) ? BUTTON_PRESSED : BUTTON_RELEASED;
}

// Test: test_button_debounce.c
#include <assert.h>

void test_debounce_press(void) {
    // Simulate button press sequence
    for (int i = 0; i < DEBOUNCE_SAMPLES; i++) {
        Button_Sample(GPIO_HIGH);
    }
    assert(Button_GetState() == BUTTON_PRESSED);
}

With bare-metal architecture:

  • Modules are independent: Easy to test in isolation
  • No hidden state: Behavior is deterministic
  • No threading: Tests are reproducible
  • Direct hardware access: Can mock at register level

Integration Testing

For hardware-dependent code:

// Hardware abstraction for testing
#ifdef UNIT_TEST
    #define GPIO_READ(port, pin)  mock_gpio_read(port, pin)
#else
    #define GPIO_READ(port, pin)  ((port)->IDR & (1 << (pin)))
#endif

This minimal abstraction enables:

  • Unit tests on host PC (fast iteration)
  • Integration tests on target hardware
  • Same code in both environments
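
On the host side, the mock can be nothing more than a scripted set of pin levels. A sketch, assuming the GPIO_READ macro above; mock_gpio_read and mock_gpio_set_level are hypothetical helpers compiled only into the UNIT_TEST build:

// test/mock_gpio.c - host build only (compiled with -DUNIT_TEST)
#include <stdint.h>

static uint32_t mock_levels;   // scripted pin levels, one bit per pin

void mock_gpio_set_level(uint32_t pin, uint32_t level)
{
    if (level) {
        mock_levels |= (1U << pin);
    } else {
        mock_levels &= ~(1U << pin);
    }
}

uint32_t mock_gpio_read(void *port, uint32_t pin)
{
    (void)port;                        // single-port sketch: the port is ignored
    return mock_levels & (1U << pin);
}

A host test drives mock_gpio_set_level() before calling the code under test, so the exact same GPIO_READ call path runs on the PC and on the target.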

Documentation: The Code IS the Documentation

Simple, direct code is self-documenting:

// This is clear without comments
void LED_RUN_On(void) {
    LED_RUN_PORT->BSRR = (1 << LED_RUN_PIN);
}

// This requires extensive comments
void HAL_GPIO_WritePin(GPIO_TypeDef *GPIOx, uint16_t GPIO_Pin, 
                       GPIO_PinState PinState) {
    // What is this doing to hardware?
    // How long does it take?
    // What are side effects?
}

Key Principle: Code should be readable and obvious. If you need extensive comments to explain what’s happening, the code is probably too complex.

Good documentation for embedded systems:

  • Block comments: Explain why, not what
  • Hardware references: Cite datasheet sections for register operations
  • Timing constraints: Document critical timing requirements
  • State machines: Diagram state transitions
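
As a small illustration of that style (the register access matches the GPIO examples earlier; the comment explains intent and points at the reference manual rather than restating the code):

// WHY: PG13 drives the status LED. BSRR is used instead of ODR so the write
// is a single atomic set operation and cannot race with updates to other
// pins on the same port (see the BSRR description in the GPIO chapter of the
// device reference manual).
GPIOG->BSRR = (1U << 13);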

Common Objections and Responses

“Bare-metal doesn’t scale!”

Response: Define “scale.”

  • More features? Modular architecture scales fine; clear module boundaries matter more than a scheduler
  • More developers? Clear module boundaries work regardless of RTOS
  • More processors? If you need multi-core, you need different hardware anyway
  • More complexity? Often a sign you need simpler architecture, not more tooling

“RTOS provides proven synchronization primitives!”

Response: True, but do you need them?

  • Interrupt flags + atomic operations handle most cases (sketched below)
  • Critical sections are just __disable_irq() / __enable_irq()
  • State machines eliminate most synchronization needs
  • If you truly need complex synchronization, maybe your architecture is wrong
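
A minimal sketch of that flag-and-critical-section pattern, using the CMSIS intrinsics mentioned above (the ISR name and the Read_ADC_Result/Process_Sample helpers are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

// Shared between ISR and main loop
static volatile bool     sample_ready;
static volatile uint32_t sample_value;

void ADC_IRQHandler(void)                 // illustrative ISR
{
    sample_value = Read_ADC_Result();     // assumed application helper
    sample_ready = true;
}

void Main_Loop_Step(void)
{
    if (sample_ready) {
        __disable_irq();                  // short critical section: copy out atomically
        uint32_t value = sample_value;
        sample_ready   = false;
        __enable_irq();

        Process_Sample(value);            // assumed application helper
    }
}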

“HAL provides portability across chip families!”

Response: At what cost?

  • Portability you don’t need is wasted overhead
  • Most projects target one specific MCU
  • When porting is needed, register differences are usually minor
  • CMSIS provides enough abstraction for most real-world portability needs within ARM Cortex-M

“What about safety certifications (IEC 61508, ISO 26262)?”

Response: Certification cares about process, not tools.

  • Bare-metal firmware can absolutely be certified
  • In fact, simpler code often certifies more easily (less to prove)
  • Certified RTOS helps, but isn’t required
  • Safety comes from design, testing, and review—not from using an RTOS

“This is premature optimization!”

Response: Simplicity isn’t optimization—it’s sound architecture.

  • Choosing bare-metal first is the simpler choice
  • Adding RTOS later (if needed) is possible
  • Removing RTOS later (if unnecessary) is painful
  • Start simple, add complexity only when justified

When I Was Wrong: Lessons from Adding Unnecessary Complexity

Early in my embedded career, I made the mistake of reaching for FreeRTOS on a motor control project. The requirements were straightforward:

  • Read encoder position (interrupt-driven)
  • Execute PID control loop (1kHz)
  • Update PWM outputs (hardware timer)
  • Communicate over CAN bus (async)

I thought: “Multiple timing domains, this needs an RTOS!”

What happened:

  • FreeRTOS added 12KB overhead (20% of available flash)
  • Task priorities created unexpected behavior (priority inversion on CAN)
  • Context switches added jitter to PID loop timing
  • Debugging became much harder (race condition in CAN TX queue)
  • Total development time increased by 3 weeks

After rewriting in bare-metal (sketched after this list):

  • Encoder: EXTI interrupt, update position variable
  • PID: TIM interrupt at 1kHz, pure computation
  • PWM: Hardware timer, no CPU involvement
  • CAN: TX interrupt, simple FIFO in ISR
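
A skeleton of that structure, with peripheral setup omitted. The handler and register names follow the STM32F4 CMSIS convention used elsewhere in this article, and Encoder_Direction, PID_Update, and PWM_SetDuty stand in for the application code; treat this as a sketch of the shape, not the original project's source.

#include <stdint.h>
#include <stdbool.h>
#include "stm32f4xx.h"     // CMSIS device header, as in the earlier examples

#define ENCODER_PIN 0      // illustrative EXTI line

// Assumed application helpers (not shown)
extern bool    Encoder_Direction(void);
extern int32_t PID_Update(int32_t error);
extern void    PWM_SetDuty(int32_t duty);

static volatile int32_t encoder_position;   // written in the encoder ISR
static volatile int32_t target_position;    // written by the CAN command handler

// Encoder edge ISR: update the position, nothing else
void EXTI0_IRQHandler(void)
{
    encoder_position += Encoder_Direction() ? 1 : -1;
    EXTI->PR = (1U << ENCODER_PIN);          // clear the pending flag
}

// 1 kHz control ISR: pure computation, fixed period, no scheduler in the path
void TIM2_IRQHandler(void)
{
    TIM2->SR &= ~TIM_SR_UIF;                 // clear the update interrupt flag
    int32_t error = target_position - encoder_position;
    PWM_SetDuty(PID_Update(error));          // PWM output itself runs in hardware
}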

Results:

  • Flash usage: 8KB total (down from 20KB)
  • PID loop jitter: < 1μs (was 10-50μs)
  • No race conditions
  • Debugging took minutes instead of hours
  • Total development time: 1 week

Lesson: The apparent complexity that suggested “need RTOS” was actually solved more simply with interrupts and a clear architecture.

The Path Forward: A Practical Approach

If you’re starting a new microcontroller project, here’s my recommended approach:

Phase 1: Minimal Viable Firmware (Day 1)

  1. Clock configuration: Get SYSCLK running at target speed
  2. Debug UART: Printf working, see what’s happening
  3. SysTick: 1ms interrupt for timing
  4. GPIO: Blink an LED (prove main loop runs)
  5. Watchdog: Independent watchdog configured

Goal: Bare-bones platform that boots, runs, and reports status.

Phase 2: Core Functionality (Days 2-5)

  1. Peripheral drivers: One peripheral at a time, direct register access
  2. Interrupt handlers: Keep short, set flags, return
  3. State machines: Main loop processes flags, executes logic
  4. Testing: Validate each peripheral independently

Goal: Core application functionality working reliably.

Phase 3: Optimization & Polish (Days 6-10)

  1. Power optimization: Add WFI, configure clocks for efficiency
  2. Timing validation: Logic analyzer proves timing requirements met
  3. Error handling: Robust handling of fault conditions
  4. Documentation: Register configurations, timing, rationale

Goal: Production-ready firmware.

Phase 4: Evaluation (Day 10+)

Ask honestly:

  • Is the architecture maintainable?
  • Are timing requirements met?
  • Is power consumption acceptable?
  • Are resources (flash/RAM) within budget?
  • Is debugging reasonable?

If yes to all: Ship it. You’re done.

If no: Identify the specific problem, then consider if additional tools (RTOS, etc.) would actually solve it. Often the answer is better architecture, not more tools.

Conclusion: The Wisdom of Simplicity

The embedded systems industry has a tendency to over-engineer solutions. We reach for powerful tools—RTOSes, Linux, heavyweight frameworks—because they promise to make development easier. Sometimes they deliver on that promise.

But more often, especially on resource-constrained microcontrollers, these tools introduce more problems than they solve:

  • Complexity: More to learn, more to debug, more to go wrong
  • Resource overhead: RAM and flash consumed by infrastructure, not application
  • Timing variability: Determinism lost to schedulers and kernel preemption
  • Debugging difficulty: Concurrent systems are fundamentally harder to understand
  • Power consumption: Idle overhead from background OS activity

Bare-metal firmware, built with a clean event-driven architecture, avoids all these issues. It’s:

  • Simpler: Sequential logic, clear execution flow
  • Leaner: Every byte of flash and RAM goes to your application
  • Faster: No scheduler overhead, direct hardware access
  • More deterministic: Predictable, bounded timing
  • Easier to debug: Straightforward cause-and-effect
  • More power-efficient: Efficient idle states without OS overhead

This doesn’t mean never use an RTOS. It means: start simple, and only add complexity when there’s clear, measurable justification.

For Engineers

The best firmware is the simplest firmware that meets requirements. Don’t let curiosity about new technology override sound architectural judgment. The coolest tech isn’t always the right tech.

Build your expertise on fundamentals—direct hardware control, interrupt-driven design, deterministic execution. These skills transfer across all embedded platforms and never go out of style.

For Technical Leaders

Your responsibility extends beyond individual technical decisions. You’re building sustainable systems and capable teams.

Protect your projects from unnecessary complexity. Guide engineers toward appropriate solutions, not just interesting ones. Set architectural standards that prevent tech debt before it starts.

And remember: the measure of good technical leadership isn’t the sophistication of your architecture—it’s whether your team can maintain it after you’re gone.

Final Thought

In my experience across dozens of embedded projects on various ARM Cortex-M platforms, the pattern is clear: bare-metal event-driven architecture wins in 80% of cases.

When you’re tempted to add complexity, ask yourself:

  • As an engineer: “Am I solving a real problem, or am I experimenting?”
  • As a leader: “Can my team maintain this, or am I creating a dependency?”

Trust in simplicity. Your future self, your teammates, and the next engineer who inherits your code will all thank you.


Working on embedded systems or considering architecture decisions for critical firmware? Connect with me on LinkedIn to share experiences and discuss approaches.