
Embedded Firmware Development: Why Simplicity Wins in Critical Systems

Lessons learned from developing critical embedded firmware on ARM Cortex-M microcontrollers. Why bare-metal approaches often outperform RTOS and Linux-based solutions in reliability, real-time performance, and maintainability.

In embedded systems development, there’s a persistent temptation to reach for powerful tools: RTOSes like FreeRTOS, Linux-based solutions, or heavyweight HAL abstractions. These tools promise to simplify development, but they come with hidden costs that can undermine the very characteristics that make embedded systems valuable—determinism, real-time performance, and lean resource usage.

Over years of developing critical firmware for ARM Cortex-M microcontrollers, I’ve learned a counterintuitive lesson: keeping it simple is almost always better. Bare-metal approaches often deliver superior reliability, performance, and maintainability compared to more “sophisticated” solutions.

As both an engineer who’s written this firmware and a technical leader responsible for team execution, I’ve seen how unnecessary complexity derails projects and creates technical debt. Great engineers get drawn to the latest tools and coolest technologies—but the newest isn’t always the best, and what’s interesting isn’t always what’s appropriate.

Here’s what I’ve learned about when to embrace simplicity, when complexity is justified, and how to lead teams to make the right architectural decisions.

Leadership Perspective: Protecting Your Team from Complexity

One of the most critical leadership skills in embedded development is recognizing and preventing unnecessary complexity. This isn’t just a technical decision—it’s about protecting your team, your project, and your organization from tech debt traps that look like innovation but end in failure.

The “Latest and Greatest” Trap

Great engineers are often their own worst enemy. They’re passionate about technology, eager to learn, and always aware of new tools and frameworks. This is what makes them great—but it’s also what can derail projects.

Common scenarios I’ve encountered:

Engineer: “We should use FreeRTOS for this project. It’s industry standard and I want to learn it.”

Translation: “I want to add this to my resume, regardless of whether the project needs it.”

Engineer: “Let’s use Rust instead of C. It’s memory-safe and the future of embedded.”

Translation: “I’m excited about new technology, but I haven’t considered that the rest of the team doesn’t know Rust, and we’d be pioneers debugging toolchain issues.”

Engineer: “We need a message bus architecture with pub/sub patterns.”

Translation: “I’m applying patterns from web services to a system with three cooperating modules.”

None of these engineers are wrong to be interested in these technologies. But interest doesn’t equal appropriateness.

The Leader’s Responsibility

As a technical leader, your job is to:

  1. Distinguish between genuine need and technical curiosity
  2. Protect the project from becoming a learning experiment
  3. Balance team growth with project success
  4. Set architectural guardrails that prevent complexity creep

This doesn’t mean stifling innovation or preventing growth. It means making conscious decisions about when complexity is justified.

Questions I Ask When Evaluating Technical Proposals

When an engineer proposes using an RTOS, a new language, or a complex framework, I ask:

1. What specific problem does this solve?

  • Not “what could it do” or “what might we need”
  • What actual, current problem does this address?
  • Can you quantify the benefit?

2. What simpler alternatives have you considered?

  • Did you try solving it with what we already have?
  • What’s the simplest solution that could possibly work?
  • Why is that insufficient?

3. What’s the total cost?

  • Not just lines of code, but:
    • Learning curve for the team
    • Debugging complexity
    • Maintenance burden
    • RAM/flash overhead
    • Integration complexity

4. What’s the reversibility?

  • Can we remove this later if it doesn’t work out?
  • Or are we locked in once we start down this path?

5. Who owns this?

  • If you leave the team, who maintains it?
  • Is the expertise transferable?

Red Flags in Technical Discussions

Over the years, I’ve learned to recognize warning signs:

Red Flag: “Everyone uses [technology X]”

  • Reality: Popularity doesn’t equal appropriateness for your use case
  • Response: “Show me three projects with similar constraints to ours where it worked well”

Red Flag: “It’s more elegant/modern/clean”

  • Reality: Aesthetic preferences don’t justify added complexity
  • Response: “How does that elegance translate to measurable project benefit?”

Red Flag: “We might need this flexibility later”

  • Reality: YAGNI (You Aren’t Gonna Need It) applies to embedded too
  • Response: “Let’s solve today’s problem today. We can refactor if that future arrives.”

Red Flag: “The alternative is writing boilerplate code”

  • Reality: Sometimes explicit is better than implicit
  • Response: “Let’s look at how much boilerplate we’re actually avoiding vs. the framework we’d pull in”

Red Flag: “I can handle the complexity”

  • Reality: You won’t be the only one touching this code
  • Response: “Can the most junior person on the team debug this at 2am?”

Setting Architectural Standards

As a leader, I establish clear default choices:

Our Team Standards (Example):

  • Default: Bare-metal C with CMSIS
  • RTOS allowed when: Three or more independent concurrent processes with proven state machine complexity
  • HAL allowed when: Rapid prototyping or complex peripheral (USB, Ethernet, graphics)
  • Dynamic allocation: Forbidden except with written justification and review
  • External libraries: Require architecture review before inclusion

These aren’t rigid rules—they’re defaults that require justification to override.

This framework:

  • Prevents ad-hoc complexity
  • Forces engineers to articulate why complexity is needed
  • Creates consistency across projects
  • Protects future maintainers

The Tech Debt Trap

Unnecessary complexity is the root of technical debt. I’ve seen projects fail or require complete rewrites because engineers chose complexity for the wrong reasons:

Case 1: The RTOS That Wasn’t Needed

  • Engineer added FreeRTOS “for structure”
  • Project had three simple state machines
  • Result: 6 months of debugging priority inversions and race conditions
  • Outcome: Rewritten in bare-metal in 3 weeks, never had issues again

Case 2: The Premature Abstraction

  • Engineer built HAL abstraction “for portability” across MCU families
  • Project targeted one specific chip
  • Result: Debugging required stepping through 4 layers of indirection
  • Outcome: Performance issues, increased flash usage, eventually ripped out

Case 3: The Framework Overkill

  • Engineer integrated graphics framework for simple status display
  • Framework was 200KB, project had 256KB flash total
  • Result: Constant battles with memory limits, features cut
  • Outcome: Rewrote with direct framebuffer drawing, recovered 180KB

Common pattern: Engineer saw a cool technology and found a way to justify it, rather than solving the actual problem with appropriate tools.

Balancing Team Growth and Project Success

The question: “How do we let engineers grow without turning projects into experiments?”

My approach:

1. Separate Learning from Delivery

  • Side projects and tech demos for exploration
  • Hackathons for trying new tools
  • Proof-of-concepts before committing to production

2. Controlled Introduction

  • New technology on non-critical features first
  • Pilot projects before organization-wide adoption
  • Post-mortems to evaluate if it was worth it

3. “Innovation Budget”

  • 10-15% of project work can be “new”
  • 85-90% should be proven, understood technology
  • Prevents both stagnation and chaos

4. Mentorship, Not Mandates

  • Explain why we’re choosing simplicity
  • Show the consequences of complexity from past projects
  • Help engineers understand total cost, not just coding

When to Override Simplicity

I’ve advocated strongly for simplicity throughout this article, but leadership also means recognizing when complexity is justified:

Valid reasons to accept complexity:

  1. Proven bottleneck: You've measured and confirmed the simple approach won't work
  2. Safety/Reliability requirement: Complexity adds genuine fault tolerance
  3. Regulatory mandate: Certification requires specific approaches
  4. Team capability: Team has deep expertise in the complex tool
  5. Strategic investment: Technology aligns with long-term architectural direction

The key: These are evidence-based decisions, not speculative ones.

Leading by Example

The most powerful thing you can do as a technical leader is write simple code yourself.

When I’m hands-on with firmware:

  • I use bare-metal approaches
  • I write direct register access
  • I avoid abstractions unless clearly justified
  • I document why I chose simple over complex

This sets the tone. Engineers see that:

  • Simplicity isn’t laziness—it’s discipline
  • Senior engineers don’t need frameworks to prove their expertise
  • Clean, direct code is valued more than clever complexity

The Hard Conversation

Sometimes you have to say no. This is uncomfortable, especially with talented engineers who are genuinely excited.

How I handle it:

“I appreciate your enthusiasm for [technology X]. It’s powerful and I understand why you want to use it. But for this project, I don’t see evidence that the benefits outweigh the costs. Here’s what I need to see to change my mind: [specific, measurable criteria].

In the meantime, let’s solve this problem with [simpler approach]. If we hit limitations, we can revisit. And I’d love to support you exploring [technology X] in [side project/hackathon/future pilot].”

What this does:

  • Acknowledges their interest (respect)
  • Requires evidence, not opinion (objectivity)
  • Offers path forward (compromise)
  • Provides alternative outlet (growth opportunity)

Most engineers appreciate this approach. They want to build great products, not just use cool tech. Framing it as “what’s best for the product” usually resonates.

Measuring Success

How do you know if you’re making good complexity decisions?

Positive indicators:

  • New engineers can be productive quickly (< 2 weeks)
  • Debugging sessions are measured in minutes, not hours
  • Code reviews focus on logic, not framework internals
  • Bug density is low and decreasing
  • Engineers can work across different projects easily

Warning signs:

  • “Only [person X] understands this module”
  • Bugs take days to reproduce and fix
  • Flash/RAM constantly at limits
  • Engineers reluctant to modify certain code
  • Turnover correlates with working on complex systems

The Ultimate Question

When evaluating any architectural decision, I ask:

“If our best engineer leaves tomorrow, can the team maintain this?”

If the answer is no, you’ve probably over-engineered.

If the answer is yes, you’ve built something sustainable.

That’s the mark of good technical leadership: building systems that outlive any individual contributor’s tenure.

The Complexity Trap

The Seductive Appeal of “Just Add an RTOS”

When facing a moderately complex embedded project, the conventional wisdom often sounds like this:

“Just use FreeRTOS. It’ll make multitasking easier, and it’s free!”

Or worse:

“This needs networking and a file system. Let’s just run Linux. Why fight it?”

These recommendations sound reasonable—until you consider the tradeoffs you’re implicitly accepting:

RTOS Hidden Costs:

  • Memory overhead: Task stacks, kernel heap, control structures (8-16KB minimum, often much more)
  • Timing variability: Scheduler latency, context switches, priority inversions
  • Complexity: Mutexes, semaphores, queues—each a potential deadlock or race condition
  • Debugging difficulty: Concurrent bugs are notoriously hard to reproduce and fix
  • Learning curve: Team needs to understand RTOS internals, not just your application

Linux Hidden Costs:

  • Resource requirements: Minimum 32MB RAM, substantial flash, faster processor
  • Non-deterministic timing: Kernel preemption, virtual memory, device drivers all introduce latency
  • Boot time: Seconds instead of milliseconds
  • Power consumption: Dramatically higher idle and active power draw
  • Complexity squared: Kernel configuration, device trees, root filesystem, bootloaders
  • Update/security burden: Kernel patches, CVEs, package management

For many embedded applications, these costs far outweigh the benefits.

When Bare-Metal is Better: The 90% Use Case

The vast majority of embedded projects fall into this category:

  • Single core processor with straightforward peripheral interaction
  • Predictable timing requirements measured in microseconds or milliseconds
  • Limited concurrency that can be handled with interrupt-driven state machines
  • Resource constraints where every kilobyte of RAM and flash matters
  • Reliability-critical applications where deterministic behavior is paramount
  • Fast boot requirements where milliseconds count
  • Low power applications where sleep modes and power efficiency are critical

If your project fits this profile, a bare-metal approach delivers:

  1. Deterministic real-time performance: Worst-case execution time (WCET) is predictable and bounded
  2. Minimal resource usage: No kernel overhead means more resources for your application
  3. Complete control: You know exactly what the processor is doing at all times
  4. Fast boot times: Initialize peripherals and go (1-100ms typical)
  5. Lower power consumption: Efficient use of sleep modes without kernel overhead
  6. Easier debugging: Sequential logic is easier to reason about than concurrent tasks
  7. Smaller attack surface: Less code means fewer potential vulnerabilities

Real-World Example: Event-Driven State Machine Architecture

Let me illustrate with a pattern I’ve used successfully across multiple critical firmware projects on ARM Cortex-M microcontrollers.

The Architecture

Core Concept: Event-driven execution with interrupt handlers and Wait-For-Interrupt (WFI) idle mode.

#include <stdbool.h>
#include <stdint.h>

// Event flags: set in interrupt handlers, consumed in the main loop
static volatile bool has_display_update;
static volatile bool has_sensor_reading;

int main(void)
{
    // Initialize system clock
    Clock_Init();
    
    // Initialize peripherals
    UART_Init();       // Debug output
    GPIO_Init();       // LEDs, control outputs
    Button_IRQ_Init(); // User input with debounced interrupt handling
    Watchdog_Init();   // Automotive-grade reliability
    
    // Initialize any hardware accelerators
    DMA_Init();        // DMA for efficient data transfers
    Display_Init();    // Display controller if present
    
    // Event-driven main loop
    while (1) {
        // Process any pending events
        if (has_display_update) {
            Update_Display();
            has_display_update = false;
        }
        
        if (has_sensor_reading) {
            Process_Sensor_Data();
            has_sensor_reading = false;
        }
        
        // Refresh watchdog
        Watchdog_Refresh();
        
        // Sleep until next interrupt (power efficient)
        __WFI();
    }
}

SysTick Handler (1ms periodic interrupt):

// Millisecond tick counter (volatile: shared between the ISR and the main loop)
static volatile uint32_t time_ms;

void SysTick_Handler(void)
{
    // Simple, predictable ISR
    time_ms++;
    
    // Update debounce logic
    Button_DebounceUpdate();
    
    // Update LED blinking
    Update_LEDs_Periodic();
    
    // Backup watchdog refresh
    if ((time_ms % 1000) == 0) {
        Watchdog_Refresh();
    }
}

Button Interrupt Handler (GPIO EXTI):

void Button_IRQ_Handler(void)
{
    // Read hardware state
    uint8_t button_state = (BUTTON_PORT->IDR & (1 << BUTTON_PIN)) ? 1 : 0;
    
    // Update debounce state machine
    // (Software debouncing for reliability)
    
    if (button_debounced && button_edge_detected) {
        // Signal main loop
        button_event_pending = true;
        button_event = BUTTON_PRESSED;
    }
    
    // Clear interrupt flag
    EXTI->PR = (1 << BUTTON_PIN);
}

What This Architecture Achieves

Deterministic Response Times:

  • Button press to LED change: < 100μs (interrupt latency + GPIO write)
  • Display update: < 20ms (with hardware acceleration)
  • Watchdog refresh: every main loop iteration, plus a 1-second backup from the SysTick handler

Resource Efficiency:

  • Flash usage: Minimal footprint (no OS overhead)
  • RAM usage: Efficient (only what you need)
  • Idle power: Very low (WFI sleep mode between events)
  • No dynamic memory allocation (malloc/free) - zero fragmentation risk

Maintainability:

  • Single-threaded logic - easy to reason about execution flow
  • No race conditions - interrupt handlers set flags, main loop processes
  • No deadlocks - no mutexes or semaphores needed
  • Straightforward debugging - logic analyzer shows exact timing

Reliability:

  • Independent watchdog catches firmware hangs
  • Input debouncing prevents false button triggers
  • Deterministic execution paths - no scheduler surprises

The Direct Register Access Advantage

Modern embedded development often relies on vendor HAL (Hardware Abstraction Layer) libraries. While convenient, HAL introduces unnecessary overhead and complexity for many applications.

HAL vs. Direct Register Access

HAL Approach (using STM32 HAL as an example):

// Initialize GPIO with HAL
GPIO_InitTypeDef GPIO_InitStruct = {0};
GPIO_InitStruct.Pin = GPIO_PIN_13;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
HAL_GPIO_Init(GPIOG, &GPIO_InitStruct);

// Toggle GPIO with HAL
HAL_GPIO_WritePin(GPIOG, GPIO_PIN_13, GPIO_PIN_SET);

Direct Register Approach (bare-metal):

// Initialize GPIO directly
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOGEN;  // Enable clock
GPIOG->MODER &= ~(3U << (13 * 2));     // Clear mode bits
GPIOG->MODER |= (1U << (13 * 2));      // Set as output
GPIOG->OSPEEDR |= (3U << (13 * 2));    // High speed

// Toggle GPIO directly
GPIOG->BSRR = (1 << 13);  // Set pin (1 cycle atomic operation)

Why Direct Access Wins

Performance:

  • HAL: Multiple function calls, parameter validation, overhead
  • Direct: Single instruction, predictable cycle count
  • Real-world impact: HAL GPIO toggle ~50-100 cycles; direct register ~1-3 cycles

Code Size:

  • HAL: Links entire HAL GPIO module (~5-10KB flash)
  • Direct: Only the exact register writes you use (~50-100 bytes)

Transparency:

  • HAL: Abstraction hides what’s actually happening
  • Direct: You see exactly what hardware is being touched

Debugging:

  • HAL: Step through layers of abstraction
  • Direct: Set breakpoint, read register, done

When HAL Makes Sense

I’m not advocating for never using HAL. It has legitimate use cases:

  • Rapid prototyping: Get something working quickly
  • Complex peripherals: USB, Ethernet, SD card—where low-level details are intricate
  • Cross-family portability: Switching between chip families from the same vendor
  • Large teams: Consistent API reduces onboarding time

But for production firmware where performance, size, and determinism matter, direct register access is often superior.

CMSIS: The Right Level of Abstraction

The ARM CMSIS (Cortex Microcontroller Software Interface Standard) provides the sweet spot: standardized peripheral access without heavyweight abstractions.

What CMSIS Provides:

  • Peripheral structs: Clean register access (e.g., GPIOG->BSRR)
  • Standard definitions: __WFI(), __disable_irq(), etc.
  • Core functions: NVIC configuration, SysTick, etc.
  • Zero overhead: Macros and inline functions compile to direct register access

What CMSIS Doesn’t Force On You:

  • Heavy abstraction layers
  • Callback frameworks
  • Configuration generators
  • Massive linked libraries

This gives you:

#include "stm32f4xx.h"  // CMSIS device header (STM32 example)

// Direct, readable register access
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOGEN;  // Enable GPIO clock
GPIOG->MODER |= (1U << (13 * 2));      // Set as output

// Standard ARM Cortex-M core functions (portable across vendors)
__disable_irq();            // Atomic section
NVIC_EnableIRQ(EXTI0_IRQn); // Enable interrupt
__WFI();                    // Wait for interrupt

Clean, efficient, portable across ARM Cortex-M devices, and no mystery about what’s happening at the hardware level.

When to Consider FreeRTOS

Despite my advocacy for simplicity, there are legitimate scenarios where FreeRTOS adds value:

Valid Use Cases for FreeRTOS

1. True Multi-Threading Needs

  • Multiple independent processes with different timing requirements
  • Example: Sensor data collection (10Hz) + network communication (async) + UI updates (30Hz)
  • Benefit: Task abstraction simplifies logical separation

2. Complex Synchronization

  • Multiple producers/consumers with shared resources
  • Example: Data pipeline with buffering between stages
  • Benefit: Queue and semaphore primitives reduce custom synchronization code (see the sketch after this list)

3. Dynamic Priority Management

  • Runtime priority changes based on system state
  • Example: Adaptive control systems with changing workload priorities
  • Benefit: RTOS scheduler handles priority preemption automatically

4. Team Scale and Modularity

  • Large teams where task isolation reduces coupling
  • Example: 10+ engineers working on different subsystems
  • Benefit: Tasks provide natural module boundaries
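
To make the synchronization case above concrete, here is a minimal sketch of the producer/consumer structure FreeRTOS enables. It uses the standard queue and task APIs; the task names, stack depths, priorities, and the Read_Sensor/Send_Over_Network helpers are illustrative assumptions, not code from a real project.

#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

static QueueHandle_t sensor_queue;

static void Sensor_Task(void *arg)
{
    (void)arg;
    for (;;) {
        uint16_t sample = Read_Sensor();                 // assumed application helper
        xQueueSend(sensor_queue, &sample, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(100));                  // 10 Hz acquisition
    }
}

static void Comms_Task(void *arg)
{
    (void)arg;
    uint16_t sample;
    for (;;) {
        if (xQueueReceive(sensor_queue, &sample, portMAX_DELAY) == pdPASS) {
            Send_Over_Network(sample);                   // assumed application helper
        }
    }
}

void Start_Application(void)
{
    sensor_queue = xQueueCreate(16, sizeof(uint16_t));   // 16-entry sample queue
    xTaskCreate(Sensor_Task, "sensor", 256, NULL, 2, NULL);
    xTaskCreate(Comms_Task,  "comms",  512, NULL, 1, NULL);
    vTaskStartScheduler();                               // never returns
}

Note how much machinery even this small example pulls in compared with a flag-and-main-loop design; that is exactly the overhead the rest of this article argues you should only accept when the concurrency is real.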

The Threshold Test

Ask yourself:

  • Can interrupts + state machines handle this? → Stay bare-metal
  • Is timing variability acceptable? → If no, stay bare-metal
  • Do I have RAM/flash headroom? → If no, stay bare-metal
  • Is the team comfortable with RTOS concepts? → If no, stay bare-metal

If you answer “yes” to all these, FreeRTOS might be justified. But even then, consider the simplest solution that could possibly work.

When to Consider Linux (Hint: Rarely for MCUs)

Linux on embedded systems has valid use cases—but typical ARM Cortex-M microcontrollers rarely fit them.

Valid Use Cases for Linux

When Linux Makes Sense:

  • Complex networking: Full TCP/IP stack, TLS, multiple protocols
  • File systems: Large storage with file management (SD cards, eMMC)
  • High-level languages: Running Python, Node.js, or similar
  • Rich ecosystems: Need existing Linux packages and tools
  • Development speed: Rapid iteration with familiar userspace tools

Minimum Hardware Reality:

  • RAM: 32MB minimum (realistically 128MB+ for comfortable margin)
  • Flash: 8MB+ for kernel, rootfs, applications
  • Processor: 400MHz+ ARM Cortex-A series
  • Power budget: Idle current in tens of milliamps
  • Boot time: Seconds, not milliseconds

Where This Fits:

  • STM32MP1 (Cortex-A7 + Cortex-M4)
  • Raspberry Pi Compute Module
  • i.MX6/7 series
  • Not MCU-class parts: Cortex-M microcontrollers (STM32F4/H7/L4, Nordic nRF) or similar small chips such as the ESP32 family

The Microcontroller Reality Check

A typical high-end Cortex-M MCU:

  • RAM: 256KB - 1MB
  • Flash: 512KB - 2MB
  • Clock: 120-600MHz ARM Cortex-M4/M7
  • Target use: Real-time control, sensor acquisition, motor control

These specs don’t support Linux. Period.

Trying to squeeze Linux onto a Cortex-M microcontroller is an academic exercise, not a production solution. If you need Linux capabilities, you need different hardware (Cortex-A processor or higher).

Practical Decision Framework

Here’s the framework I use when architecting embedded firmware:

Start Here (Default for MCU Projects)

Bare-Metal + Event-Driven Architecture

Characteristics:

  • Interrupt-driven peripheral handling
  • State machines for control logic
  • WFI for power efficiency
  • Direct register access for critical paths
  • CMSIS for standardization

When to Move Beyond:

  • Clear evidence that complexity justifies the overhead
  • Not “might need” or “could be useful”—actual requirements

Decision Tree

┌─────────────────────────────────┐
│  Start: Bare-Metal Event-Driven │
└─────────────┬───────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │ Need 3+ concurrent  │
    │ independent tasks?  │
    └─────┬─────────┬─────┘
         No        Yes
          │         │
          │         ▼
          │    ┌────────────────┐
          │    │ Can state      │
          │    │ machines + ISR │
          │    │ handle it?     │
          │    └──┬──────────┬──┘
          │      Yes         No
          │       │          │
          │       │          ▼
          │       │    ┌──────────────┐
          │       │    │ Consider     │
          │       │    │ FreeRTOS     │
          │       │    └──────────────┘
          │       │
          ▼       ▼
    ┌──────────────────┐
    │ Stay Bare-Metal  │
    └──────────────────┘

┌─────────────────────────────────┐
│  Need networking or file system?│
└─────────────┬───────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │ Simple needs?       │
    │ (HTTP, MQTT, FAT)   │
    └─────┬─────────┬─────┘
         Yes        No
          │         │
          ▼         ▼
    ┌─────────┐  ┌──────────────┐
    │ LwIP +  │  │ Need Linux   │
    │ FatFS   │  │ → Different  │
    │ on RTOS │  │   hardware   │
    └─────────┘  └──────────────┘

The 80/20 Rule

In my experience:

  • 80% of MCU projects → Bare-metal is optimal
  • 15% of MCU projects → FreeRTOS adds value
  • 5% of MCU projects → Need fundamentally different hardware (MPU/SoC, not MCU)

Real-World Reliability: The Watchdog Philosophy

Critical firmware needs reliability mechanisms that work regardless of your software architecture. One of the most important: the independent watchdog timer.

Hardware Watchdog Implementation

Example using STM32 Independent Watchdog:

void Watchdog_Init(void)
{
    // STM32 Independent Watchdog (IWDG)
    // Uses separate LSI oscillator - survives even if main clock fails
    
    IWDG->KR = 0x5555;  // Enable register access
    IWDG->PR = 0x04;    // Prescaler: 40kHz / 64 = 625Hz
    IWDG->RLR = 625;    // Reload: 625 / 625Hz = 1 second timeout
    IWDG->KR = 0xCCCC;  // Start watchdog
}

void Watchdog_Refresh(void)
{
    IWDG->KR = 0xAAAA;  // Reload counter
}

Note: Most MCUs (Nordic nRF, ESP32, TI MSP430, etc.) have similar independent watchdog peripherals with comparable configuration approaches.

Critical Design Decision: The watchdog timeout defines your maximum acceptable firmware hang time.

  • Too short (< 100ms): False resets from legitimate long operations
  • Too long (> 5s): System hangs go undetected for too long
  • Sweet spot (500ms - 2s): Catches real hangs without false positives
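
The arithmetic behind the reload value above is worth making explicit. A sketch, assuming the ~40kHz LSI and /64 prescaler from the init code (the LSI frequency varies by device and temperature, so treat the result as approximate):

// Illustrative timeout calculation for the IWDG configuration above
#define LSI_FREQ_HZ          40000U
#define IWDG_PRESCALER       64U
#define IWDG_TICK_HZ         (LSI_FREQ_HZ / IWDG_PRESCALER)                   // 625 Hz

#define WATCHDOG_TIMEOUT_MS  1000U
#define IWDG_RELOAD_VALUE    ((IWDG_TICK_HZ * WATCHDOG_TIMEOUT_MS) / 1000U)   // 625

// The IWDG reload register is 12 bits wide, so the value must fit in 0..4095
_Static_assert(IWDG_RELOAD_VALUE <= 0x0FFFU, "IWDG reload exceeds 12-bit range");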

The Refresh Pattern

Correct Pattern (main loop + backup):

int main(void) {
    Watchdog_Init();
    
    while (1) {
        // Process events
        Handle_Button_Events();
        Update_Display();
        Process_Sensors();
        
        // Refresh watchdog at end of main loop
        Watchdog_Refresh();
        
        __WFI();
    }
}

// Backup refresh in SysTick (every 1 second)
void SysTick_Handler(void) {
    if ((time_ms % 1000) == 0) {
        Watchdog_Refresh();  // Safety net
    }
}

Why Two Refresh Points?

  1. Main loop refresh: Proves main loop is running
  2. SysTick backup: Prevents reset if main loop blocks briefly on legitimate work

This dual approach catches real firmware hangs while tolerating brief legitimate blocking operations (e.g., flash writes, DMA transfers).

RTOS Complication

With FreeRTOS, watchdog management becomes more complex:

// Which task refreshes the watchdog?
// Option 1: Idle task (doesn't prove application is healthy)
// Option 2: Watchdog task (adds overhead, priority inversion risks)
// Option 3: Every task reports in (complex, coupling across tasks)

Bare-metal avoids this complexity entirely. One main loop, one clear refresh point, deterministic behavior.

Build System: Keep It Simple Here Too

Just as firmware benefits from simplicity, so do build systems.

CMake for Embedded: The Right Tool

Modern embedded development deserves modern build tools. Makefiles work but are brittle and hard to maintain. CMake provides structure without excessive complexity.

Minimal Embedded CMake:

cmake_minimum_required(VERSION 3.22)

# Cross-compiler is expected to come from an arm-none-eabi toolchain file
# passed on the command line via -DCMAKE_TOOLCHAIN_FILE=<file>
project(stm32_project C ASM)

set(CMAKE_C_STANDARD 11)
set(TARGET ${PROJECT_NAME})

# Device linker script (adjust the path and name for your MCU)
set(LINKER_SCRIPT ${CMAKE_SOURCE_DIR}/stm32f4_flash.ld)

# MCU Configuration
set(MCU_FLAGS
    -mcpu=cortex-m4
    -mthumb
    -mfpu=fpv4-sp-d16
    -mfloat-abi=hard
)

# Compiler flags
add_compile_options(
    ${MCU_FLAGS}
    -Wall
    -fdata-sections
    -ffunction-sections
)

# Linker flags (a plain option list avoids the semicolon-joining pitfall of
# building CMAKE_EXE_LINKER_FLAGS from a list variable)
add_link_options(
    ${MCU_FLAGS}
    -specs=nano.specs
    -T${LINKER_SCRIPT}
    -Wl,--gc-sections
)

# Sources
add_executable(${TARGET}.elf
    src/main.c
    src/clock_cfg.c
    src/gpio_cfg.c
    # ... more sources
)

# Generate binary
add_custom_command(TARGET ${TARGET}.elf POST_BUILD
    COMMAND arm-none-eabi-objcopy -O binary ${TARGET}.elf ${TARGET}.bin
)

What This Achieves:

  • Clean, readable configuration
  • IDE integration (CLion, VS Code)
  • Compile commands export for tooling
  • Cross-platform (Windows, Mac, Linux)
  • Easy to extend without becoming unmaintainable

What to Avoid:

  • Code generators (STM32CubeMX can help, but don’t let it own your codebase)
  • Complex meta-build systems
  • Vendor lock-in tools
  • Over-abstracted build frameworks

Memory Management: Static Allocation is Your Friend

Dynamic memory allocation (malloc/free) is a common source of bugs in embedded systems:

  • Fragmentation: Heap becomes unusable over time
  • Non-deterministic: Allocation time varies
  • Failure handling: What if malloc returns NULL?
  • Debugging: Memory leaks are hard to find

Static Allocation Strategy

// Bad: Dynamic allocation
void process_data(void) {
    uint8_t *buffer = malloc(1024);
    if (buffer == NULL) {
        // Now what?
    }
    // ... use buffer ...
    free(buffer);  // Easy to forget!
}

// Good: Static allocation
#define BUFFER_SIZE 1024
static uint8_t buffer[BUFFER_SIZE];

void process_data(void) {
    // Buffer always available
    // No allocation failure
    // No fragmentation
    // No leaks
}

Benefits:

  • Deterministic: Memory layout known at compile time
  • Reliable: No allocation failures at runtime
  • Debuggable: Memory map is fixed and visible
  • Efficient: No malloc/free overhead

Cost:

  • Must know maximum sizes at compile time
  • Can’t dynamically scale to varying workloads

For most embedded systems, especially safety-critical ones, this tradeoff heavily favors static allocation.
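
When payload sizes genuinely vary at runtime, a statically allocated ring buffer is a common middle ground: fixed memory, no heap, bounded behavior. A minimal sketch (the capacity and names are illustrative); with a single producer (an ISR) and a single consumer (the main loop) it needs no locking at all:

#include <stdint.h>
#include <stdbool.h>

#define FIFO_SIZE 256U   // capacity must be known at build time

static uint8_t           fifo_buf[FIFO_SIZE];
static volatile uint16_t fifo_head;   // write index (producer)
static volatile uint16_t fifo_tail;   // read index (consumer)

bool FIFO_Push(uint8_t byte)
{
    uint16_t next = (uint16_t)((fifo_head + 1U) % FIFO_SIZE);
    if (next == fifo_tail) {
        return false;                  // full: caller decides how to handle it
    }
    fifo_buf[fifo_head] = byte;
    fifo_head = next;
    return true;
}

bool FIFO_Pop(uint8_t *byte)
{
    if (fifo_head == fifo_tail) {
        return false;                  // empty
    }
    *byte = fifo_buf[fifo_tail];
    fifo_tail = (uint16_t)((fifo_tail + 1U) % FIFO_SIZE);
    return true;
}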

Power Efficiency: The WFI Advantage

Power consumption matters, even in non-battery applications. Lower power means:

  • Less heat generation
  • Smaller power supplies
  • Better EMI characteristics
  • Reduced operating costs

Wait-For-Interrupt (WFI) Pattern

while (1) {
    // Process any pending events
    if (event_pending) {
        Process_Event();
        event_pending = false;
    }
    
    // Sleep until next interrupt
    __WFI();
}

What Happens:

  • Core clock stops (CPU idles)
  • Peripherals continue running
  • Interrupts wake the processor
  • Resume execution after WFI
  • Microsecond-scale wake latency

Power Savings: Typical high-performance Cortex-M power consumption:

  • Run mode (high speed): 50-100mA
  • Sleep mode (WFI): 20-40mA
  • Deep sleep: 1-5mA (peripherals off)

In typical event-driven firmware, the processor spends 80-95% of time in WFI, cutting average power consumption dramatically.

Example: STM32F429 at 180MHz draws ~90mA running, ~35mA in WFI sleep mode.

RTOS Comparison

FreeRTOS supports idle task hooks and tickless modes, but:

  • More complex to configure correctly
  • Idle task still incurs scheduler overhead
  • Tickless mode requires careful tuning
  • Wake latency is less predictable

Bare-metal WFI is simpler and often more power-efficient.

Debugging: Simplicity Enables Better Visibility

One often-overlooked advantage of bare-metal firmware: it’s dramatically easier to debug.

Debug Strategies

1. Hardware Debug (SWD/JTAG)

// Set breakpoint in main loop
while (1) {
    Process_Events();  // <- Breakpoint here
    __WFI();
}

// Examine exact register state
// Step through assembly if needed
// No scheduler interference

2. Logic Analyzer/Oscilloscope

// Toggle GPIO to mark timing
GPIOG->BSRR = (1 << DEBUG_PIN);    // Set high
Critical_Function();                // Measure this
GPIOG->BSRR = (1 << (DEBUG_PIN + 16));  // Set low

// Logic analyzer shows exact timing

3. Serial Debug Output

void UART_Debug_Init(void) {
    // Minimal UART setup (STM32 example)
    RCC->APB2ENR |= RCC_APB2ENR_USART1EN;
    // ... configure TX pin ...
    USART1->BRR = UART_BRR_VALUE;
    USART1->CR1 = USART_CR1_TE | USART_CR1_UE;
}

// printf support (standard across ARM toolchains)
int _write(int file, char *ptr, int len) {
    for (int i = 0; i < len; i++) {
        while (!(USART1->SR & USART_SR_TXE));
        USART1->DR = ptr[i];
    }
    return len;
}

// Now use printf for debugging
printf("[STATE] Transition: STOP -> START\r\n");

This pattern works across virtually all ARM Cortex-M MCUs—just adjust register names for your specific chip.

RTOS Debugging Challenges:

  • Breakpoints can disturb task timing
  • Watchpoints may trigger on scheduler operations
  • Task context switches obscure execution flow
  • Race conditions are non-deterministic
  • printf from multiple tasks needs synchronization

Bare-metal execution is sequential and predictable, making debugging straightforward.

Testing: Simplicity Enables Determinism

Firmware testing is hard. Simplicity makes it less hard.

Unit Testing Strategy

// Module: button_debounce.c
#include <stdint.h>

#define DEBOUNCE_SAMPLES 5
#define GPIO_HIGH        1

typedef enum {
    BUTTON_RELEASED,
    BUTTON_PRESSED
} ButtonState;

static uint8_t stable_samples;

// Feed one raw sample per tick (e.g., from the SysTick handler)
void Button_Sample(uint8_t raw_level) {
    if (raw_level) {
        if (stable_samples < DEBOUNCE_SAMPLES) stable_samples++;
    } else {
        stable_samples = 0;
    }
}

ButtonState Button_GetState(void) {
    // Pressed only after DEBOUNCE_SAMPLES consecutive high samples
    return (stable_samples >= DEBOUNCE_SAMPLES) ? BUTTON_PRESSED : BUTTON_RELEASED;
}

// Test: test_button_debounce.c
#include <assert.h>

void test_debounce_press(void) {
    // Simulate button press sequence
    for (int i = 0; i < DEBOUNCE_SAMPLES; i++) {
        Button_Sample(GPIO_HIGH);
    }
    assert(Button_GetState() == BUTTON_PRESSED);
}

With bare-metal architecture:

  • Modules are independent: Easy to test in isolation
  • No hidden state: Behavior is deterministic
  • No threading: Tests are reproducible
  • Direct hardware access: Can mock at register level

Integration Testing

For hardware-dependent code:

// Hardware abstraction for testing
#ifdef UNIT_TEST
    #define GPIO_READ(port, pin)  mock_gpio_read(port, pin)
#else
    #define GPIO_READ(port, pin)  ((port)->IDR & (1 << (pin)))
#endif

This minimal abstraction enables:

  • Unit tests on host PC (fast iteration)
  • Integration tests on target hardware
  • Same code in both environments
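
On the host side, the mock can be nothing more than a scripted set of pin levels. A sketch, assuming the GPIO_READ macro above; mock_gpio_read and mock_gpio_set_level are hypothetical helpers compiled only into the UNIT_TEST build:

// test/mock_gpio.c - host build only (compiled with -DUNIT_TEST)
#include <stdint.h>

static uint32_t mock_levels;   // scripted pin levels, one bit per pin

void mock_gpio_set_level(uint32_t pin, uint32_t level)
{
    if (level) {
        mock_levels |= (1U << pin);
    } else {
        mock_levels &= ~(1U << pin);
    }
}

uint32_t mock_gpio_read(void *port, uint32_t pin)
{
    (void)port;                        // single-port sketch: the port is ignored
    return mock_levels & (1U << pin);
}

A host test drives mock_gpio_set_level() before calling the code under test, so the exact same GPIO_READ call path runs on the PC and on the target.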

Documentation: The Code IS the Documentation

Simple, direct code is self-documenting:

// This is clear without comments
void LED_RUN_On(void) {
    LED_RUN_PORT->BSRR = (1 << LED_RUN_PIN);
}

// This requires extensive comments
void HAL_GPIO_WritePin(GPIO_TypeDef *GPIOx, uint16_t GPIO_Pin, 
                       GPIO_PinState PinState) {
    // What is this doing to hardware?
    // How long does it take?
    // What are side effects?
}

Key Principle: Code should be readable and obvious. If you need extensive comments to explain what’s happening, the code is probably too complex.

Good documentation for embedded systems:

  • Block comments: Explain why, not what
  • Hardware references: Cite datasheet sections for register operations
  • Timing constraints: Document critical timing requirements
  • State machines: Diagram state transitions
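
As a small illustration of that style (the register access matches the GPIO examples earlier; the comment explains intent and points at the reference manual rather than restating the code):

// WHY: PG13 drives the status LED. BSRR is used instead of ODR so the write
// is a single atomic set operation and cannot race with updates to other
// pins on the same port (see the BSRR description in the GPIO chapter of the
// device reference manual).
GPIOG->BSRR = (1U << 13);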

Common Objections and Responses

“Bare-metal doesn’t scale!”

Response: Define “scale.”

  • More features? Modular architecture scales fine; clear module boundaries matter more than a scheduler
  • More developers? Clear module boundaries work regardless of RTOS
  • More processors? If you need multi-core, you need different hardware anyway
  • More complexity? Often a sign you need simpler architecture, not more tooling

“RTOS provides proven synchronization primitives!”

Response: True, but do you need them?

  • Interrupt flags + atomic operations handle most cases (sketched below)
  • Critical sections are just __disable_irq() / __enable_irq()
  • State machines eliminate most synchronization needs
  • If you truly need complex synchronization, maybe your architecture is wrong
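
A minimal sketch of that flag-and-critical-section pattern, using the CMSIS intrinsics mentioned above (the ISR name and the Read_ADC_Result/Process_Sample helpers are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

// Shared between ISR and main loop
static volatile bool     sample_ready;
static volatile uint32_t sample_value;

void ADC_IRQHandler(void)                 // illustrative ISR
{
    sample_value = Read_ADC_Result();     // assumed application helper
    sample_ready = true;
}

void Main_Loop_Step(void)
{
    if (sample_ready) {
        __disable_irq();                  // short critical section: copy out atomically
        uint32_t value = sample_value;
        sample_ready   = false;
        __enable_irq();

        Process_Sample(value);            // assumed application helper
    }
}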

“HAL provides portability across chip families!”

Response: At what cost?

  • Portability you don’t need is wasted overhead
  • Most projects target one specific MCU
  • When porting is needed, register differences are usually minor
  • CMSIS provides enough abstraction for most real-world portability needs within ARM Cortex-M

“What about safety certifications (IEC 61508, ISO 26262)?”

Response: Certification cares about process, not tools.

  • Bare-metal firmware can absolutely be certified
  • In fact, simpler code often certifies more easily (less to prove)
  • Certified RTOS helps, but isn’t required
  • Safety comes from design, testing, and review—not from using an RTOS

“This is premature optimization!”

Response: Simplicity isn’t optimization—it’s sound architecture.

  • Choosing bare-metal first is the simpler choice
  • Adding RTOS later (if needed) is possible
  • Removing RTOS later (if unnecessary) is painful
  • Start simple, add complexity only when justified

When I Was Wrong: Lessons from Adding Unnecessary Complexity

Early in my embedded career, I made the mistake of reaching for FreeRTOS on a motor control project. The requirements were straightforward:

  • Read encoder position (interrupt-driven)
  • Execute PID control loop (1kHz)
  • Update PWM outputs (hardware timer)
  • Communicate over CAN bus (async)

I thought: “Multiple timing domains, this needs an RTOS!”

What happened:

  • FreeRTOS added 12KB overhead (20% of available flash)
  • Task priorities created unexpected behavior (priority inversion on CAN)
  • Context switches added jitter to PID loop timing
  • Debugging became much harder (race condition in CAN TX queue)
  • Total development time increased by 3 weeks

After rewriting in bare-metal (sketched after this list):

  • Encoder: EXTI interrupt, update position variable
  • PID: TIM interrupt at 1kHz, pure computation
  • PWM: Hardware timer, no CPU involvement
  • CAN: TX interrupt, simple FIFO in ISR
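
A skeleton of that structure, with peripheral setup omitted. The handler and register names follow the STM32F4 CMSIS convention used elsewhere in this article, and Encoder_Direction, PID_Update, and PWM_SetDuty stand in for the application code; treat this as a sketch of the shape, not the original project's source.

#include <stdint.h>
#include <stdbool.h>
#include "stm32f4xx.h"     // CMSIS device header, as in the earlier examples

#define ENCODER_PIN 0      // illustrative EXTI line

// Assumed application helpers (not shown)
extern bool    Encoder_Direction(void);
extern int32_t PID_Update(int32_t error);
extern void    PWM_SetDuty(int32_t duty);

static volatile int32_t encoder_position;   // written in the encoder ISR
static volatile int32_t target_position;    // written by the CAN command handler

// Encoder edge ISR: update the position, nothing else
void EXTI0_IRQHandler(void)
{
    encoder_position += Encoder_Direction() ? 1 : -1;
    EXTI->PR = (1U << ENCODER_PIN);          // clear the pending flag
}

// 1 kHz control ISR: pure computation, fixed period, no scheduler in the path
void TIM2_IRQHandler(void)
{
    TIM2->SR &= ~TIM_SR_UIF;                 // clear the update interrupt flag
    int32_t error = target_position - encoder_position;
    PWM_SetDuty(PID_Update(error));          // PWM output itself runs in hardware
}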

Results:

  • Flash usage: 8KB total (down from 20KB)
  • PID loop jitter: < 1μs (was 10-50μs)
  • No race conditions
  • Debugging took minutes instead of hours
  • Total development time: 1 week

Lesson: The apparent complexity that suggested “need RTOS” was actually solved more simply with interrupts and a clear architecture.

The Path Forward: A Practical Approach

If you’re starting a new microcontroller project, here’s my recommended approach:

Phase 1: Minimal Viable Firmware (Day 1)

  1. Clock configuration: Get SYSCLK running at target speed
  2. Debug UART: Printf working, see what’s happening
  3. SysTick: 1ms interrupt for timing
  4. GPIO: Blink an LED (prove main loop runs)
  5. Watchdog: Independent watchdog configured

Goal: Bare-bones platform that boots, runs, and reports status.

Phase 2: Core Functionality (Days 2-5)

  1. Peripheral drivers: One peripheral at a time, direct register access
  2. Interrupt handlers: Keep short, set flags, return
  3. State machines: Main loop processes flags, executes logic
  4. Testing: Validate each peripheral independently

Goal: Core application functionality working reliably.

Phase 3: Optimization & Polish (Days 6-10)

  1. Power optimization: Add WFI, configure clocks for efficiency
  2. Timing validation: Logic analyzer proves timing requirements met
  3. Error handling: Robust handling of fault conditions
  4. Documentation: Register configurations, timing, rationale

Goal: Production-ready firmware.

Phase 4: Evaluation (Day 10+)

Ask honestly:

  • Is the architecture maintainable?
  • Are timing requirements met?
  • Is power consumption acceptable?
  • Are resources (flash/RAM) within budget?
  • Is debugging reasonable?

If yes to all: Ship it. You’re done.

If no: Identify the specific problem, then consider if additional tools (RTOS, etc.) would actually solve it. Often the answer is better architecture, not more tools.

Conclusion: The Wisdom of Simplicity

The embedded systems industry has a tendency to over-engineer solutions. We reach for powerful tools—RTOSes, Linux, heavyweight frameworks—because they promise to make development easier. Sometimes they deliver on that promise.

But more often, especially on resource-constrained microcontrollers, these tools introduce more problems than they solve:

  • Complexity: More to learn, more to debug, more to go wrong
  • Resource overhead: RAM and flash consumed by infrastructure, not application
  • Timing variability: Determinism lost to schedulers and kernel preemption
  • Debugging difficulty: Concurrent systems are fundamentally harder to understand
  • Power consumption: Idle overhead from background OS activity

Bare-metal firmware, built with a clean event-driven architecture, avoids all these issues. It’s:

  • Simpler: Sequential logic, clear execution flow
  • Leaner: Every byte of flash and RAM goes to your application
  • Faster: No scheduler overhead, direct hardware access
  • More deterministic: Predictable, bounded timing
  • Easier to debug: Straightforward cause-and-effect
  • More power-efficient: Efficient idle states without OS overhead

This doesn’t mean never use an RTOS. It means: start simple, and only add complexity when there’s clear, measurable justification.

For Engineers

The best firmware is the simplest firmware that meets requirements. Don’t let curiosity about new technology override sound architectural judgment. The coolest tech isn’t always the right tech.

Build your expertise on fundamentals—direct hardware control, interrupt-driven design, deterministic execution. These skills transfer across all embedded platforms and never go out of style.

For Technical Leaders

Your responsibility extends beyond individual technical decisions. You’re building sustainable systems and capable teams.

Protect your projects from unnecessary complexity. Guide engineers toward appropriate solutions, not just interesting ones. Set architectural standards that prevent tech debt before it starts.

And remember: the measure of good technical leadership isn’t the sophistication of your architecture—it’s whether your team can maintain it after you’re gone.

Final Thought

In my experience across dozens of embedded projects on various ARM Cortex-M platforms, the pattern is clear: bare-metal event-driven architecture wins in 80% of cases.

When you’re tempted to add complexity, ask yourself:

  • As an engineer: “Am I solving a real problem, or am I experimenting?”
  • As a leader: “Can my team maintain this, or am I creating a dependency?”

Trust in simplicity. Your future self, your teammates, and the next engineer who inherits your code will all thank you.


Working on embedded systems or considering architecture decisions for critical firmware? Connect with me on LinkedIn to share experiences and discuss approaches.