Embedded Headaches: Lessons from Building a Solar Charge Controller

I want to write the post I wish existed when I started this project. Most tutorials show you a working schematic and a clean control loop and leave out the three weeks of debugging between "designed it" and "it actually works reliably." So this is the other version — the one with the noise problems, the algorithm oscillation, the communication failures, and the accidental power-on incident I don't like thinking about.

Problem 1: ADC Noise and the Buck Converter from Hell

The first hardware I/O challenge was reading panel voltage and current cleanly. Sounds simple. You've got a 12-bit ADC on the microcontroller, a voltage divider for the panel voltage, a shunt resistor and op-amp for current sensing. Should give you millivolt-level resolution.

What I actually got was readings that jumped by 200-300mV peak-to-peak in a pattern that correlated exactly with the PWM switching frequency of the buck converter. The switching noise from the converter — operating at around 100kHz — was coupling directly into the ADC input through shared ground and through the power supply rails.

The fix was a two-layer approach. Hardware first: RC low-pass filters on both ADC inputs, with cutoff around 1kHz (well below the switching frequency but fast enough not to miss real voltage changes). Also added a ferrite bead on the microcontroller power rail, which helped reduce rail noise significantly. Then software: even after filtering, I wasn't trusting a single reading.

// Oversampling + averaging for ADC noise rejection
// Averaging 16 samples cuts uncorrelated noise by roughly 4x (sqrt(16)).
// Dividing by 16 returns a 12-bit value, so this reduces noise
// but does not by itself add resolution.

#define ADC_OVERSAMPLE_COUNT  16

uint16_t adc_read_averaged(uint8_t channel) {
    uint32_t sum = 0;

    // Allow ADC input to settle after channel switch
    adc_select_channel(channel);
    adc_read();  // discard first reading

    for (uint8_t i = 0; i < ADC_OVERSAMPLE_COUNT; i++) {
        sum += adc_read();
        __delay_us(10);  // small delay between samples
    }

    return (uint16_t)(sum / ADC_OVERSAMPLE_COUNT);
}

// For panel voltage measurement, we also apply a simple
// exponential moving average across control loop cycles
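static float panel_voltage_filtered = 0.0f;
static bool panel_voltage_seeded = false;
#define EMA_ALPHA  0.15f

void update_panel_voltage(void) {
    float raw = adc_to_volts(adc_read_averaged(CH_PANEL_VOLTAGE));
    if (!panel_voltage_seeded) {
        // Seed with the first reading so the filter doesn't ramp up from 0 V
        panel_voltage_filtered = raw;
        panel_voltage_seeded = true;
        return;
    }
    panel_voltage_filtered = EMA_ALPHA * raw
                           + (1.0f - EMA_ALPHA) * panel_voltage_filtered;
}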

The combination of hardware RC filtering, oversampling, and an exponential moving average in software got the noise down to an acceptable level — maybe 20-30mV peak-to-peak, which was fine for MPPT accuracy.
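One footnote on the averaging: dividing the 16-sample sum by 16 hands back a 12-bit number, so it buys noise reduction rather than resolution. If you want the extra bits, the usual oversample-and-decimate trick is to shift the sum right by only 2, which gives a 14-bit result from 16 samples. Here's a minimal sketch, assuming the noise is roughly white and at least about 1 LSB in amplitude. The names `decimate16` and `code14_to_mv` are made up for illustration, and the samples come in as an array so the logic can be checked off-target.

```c
#include <stdint.h>
#include <stddef.h>

#define OVERSAMPLE_COUNT 16u   /* 4^2 samples -> 2 extra bits */
#define EXTRA_BITS        2u

/* Sum 16 raw 12-bit samples and decimate to a 14-bit result.
 * Returns 0 if fewer than 16 samples are supplied (an error
 * convention chosen only for this sketch). */
uint16_t decimate16(const uint16_t *samples, size_t count) {
    if (samples == NULL || count < OVERSAMPLE_COUNT) {
        return 0;
    }
    uint32_t sum = 0;
    for (size_t i = 0; i < OVERSAMPLE_COUNT; i++) {
        sum += samples[i] & 0x0FFFu;   /* keep only the 12 ADC bits */
    }
    /* sum is at most 16 * 4095 = 65520. Shifting right by 2
     * leaves a 14-bit value with a full scale of 16380. */
    return (uint16_t)(sum >> EXTRA_BITS);
}

/* Convert a 14-bit decimated code to millivolts at the ADC pin. */
uint32_t code14_to_mv(uint16_t code14, uint32_t vref_mv) {
    /* Full scale of the decimated code is 4095 * 4 = 16380. */
    return ((uint32_t)code14 * vref_mv) / 16380u;
}
```

The extra bits are only as good as the noise that dithers the input, so treat them as a smoother reading, not as a calibrated 14-bit measurement.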

Problem 2: MPPT Algorithm Oscillation

The basic Perturb-and-Observe MPPT algorithm is straightforward: increase duty cycle, measure power, if power went up keep going, if it went down reverse direction. Works great in steady-state conditions with a stable irradiance. Becomes a mess when a cloud passes over the panel.

What happens under rapidly changing irradiance is that the algorithm is perturbing in one direction, the cloud simultaneously reduces available power, and the algorithm incorrectly attributes the power drop to its own perturbation direction and reverses. Then the cloud passes, power comes back up, and the algorithm reverses again. You end up oscillating around completely the wrong point, chasing shadows rather than tracking the actual maximum power point.
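To make the failure concrete, here's roughly what a fixed-step P&O update looks like. This is a minimal sketch, not the code from this controller: `pno_state_t`, `pno_update`, and the duty limits are illustrative names and values. Notice that the only input is the measured power, so the function has no way to tell whether a drop came from its own perturbation or from a cloud.

```c
#include <stdint.h>

/* State for a basic fixed-step Perturb-and-Observe tracker.
 * All names and limits here are illustrative. */
typedef struct {
    float duty;        /* current duty cycle, 0.0 to 1.0 */
    float step;        /* fixed perturbation per iteration */
    float last_power;  /* power seen on the previous iteration */
    int8_t direction;  /* +1 raises duty, -1 lowers it */
} pno_state_t;

#define PNO_DUTY_MIN 0.05f
#define PNO_DUTY_MAX 0.95f

/* One P&O iteration: if power fell since the last step, reverse
 * direction; then perturb the duty cycle by one fixed step. */
float pno_update(pno_state_t *s, float power) {
    if (power < s->last_power) {
        s->direction = (int8_t)(-s->direction);
    }
    s->last_power = power;

    s->duty += (float)s->direction * s->step;
    if (s->duty > PNO_DUTY_MAX) { s->duty = PNO_DUTY_MAX; }
    if (s->duty < PNO_DUTY_MIN) { s->duty = PNO_DUTY_MIN; }
    return s->duty;
}
```

Feed it a power reading that dropped because of shading and it reverses, even if the previous step was heading toward the true maximum power point.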

The fix that actually worked was a variable perturbation step size. Under stable conditions — irradiance changing slowly, power measurement consistent — use a smaller step for finer tracking. Under dynamic conditions — power changing faster than the perturbation can account for — either temporarily pause perturbation and wait for conditions to stabilize, or increase step size to get back to the right region faster.

I also increased the sampling rate of the outer control loop. At 10Hz, you're taking measurements 100ms apart, which is plenty of time for a passing cloud to mess up your measurements. At 50Hz, you're capturing the dynamics well enough to distinguish "power changed because I perturbed" from "power changed because of external irradiance change."

The MPPT algorithm literature is full of improvements to P&O that handle dynamic conditions better. InC (Incremental Conductance) is more mathematically principled but harder to tune. Variable step P&O with irradiance change detection is simpler and worked well enough for my application.
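Here's a sketch of the "pause when the change is too big to be mine" variant. It assumes a simple test: if the relative power change between two iterations exceeds a threshold that one perturbation step couldn't plausibly cause, treat it as an irradiance change and hold the duty cycle for that cycle. The struct, the function name, and the 5% threshold used in the usage note are assumptions for illustration; the right threshold depends on your panel and step size.

```c
#include <stdint.h>

typedef enum { MPPT_PERTURBED, MPPT_HELD } mppt_action_t;

typedef struct {
    float duty;           /* current duty cycle, 0.0 to 1.0 */
    float step;           /* small step used under stable conditions */
    float last_power;     /* power seen on the previous iteration */
    float disturb_ratio;  /* relative change treated as external */
    int8_t direction;     /* +1 raises duty, -1 lowers it */
} mppt_adaptive_t;

/* One iteration of P&O with irradiance-change detection.
 * A large jump in power is assumed to come from outside, so the
 * duty cycle is left alone and only the baseline is refreshed.
 * Note: with last_power == 0 (startup) the first call always
 * holds, which simply establishes the baseline. */
mppt_action_t mppt_adaptive_update(mppt_adaptive_t *s, float power) {
    float delta = power - s->last_power;
    float magnitude = (delta < 0.0f) ? -delta : delta;
    float limit = s->disturb_ratio * s->last_power;

    s->last_power = power;

    if (magnitude > limit) {
        return MPPT_HELD;            /* wait for conditions to settle */
    }
    if (delta < 0.0f) {
        s->direction = (int8_t)(-s->direction);
    }
    s->duty += (float)s->direction * s->step;
    if (s->duty > 0.95f) { s->duty = 0.95f; }
    if (s->duty < 0.05f) { s->duty = 0.05f; }
    return MPPT_PERTURBED;
}
```

With `disturb_ratio` set to 0.05, a drop from 100 W to 60 W in one cycle is ignored as a cloud, while a 0.3 W dip on a 60 W baseline is treated as feedback from the last step.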

Problem 3: I2C Failing Silently Under Load

I was using I2C to communicate between the main controller and a secondary board handling display and logging. It worked fine on the bench. Under real load conditions — buck converter switching, battery at low state of charge getting hit with high charge current — the I2C would occasionally fail silently.

By "silently" I mean: the transaction would complete from the master's perspective, the HAL function would return HAL_OK, but the data the secondary board received was garbage or partially transmitted. Took me an embarrassingly long time to figure out what was happening.

The root cause was voltage droop on the 3.3V rail when the battery was under heavy charge load. The I2C pull-up resistors were referenced to 3.3V, and when the rail drooped even briefly, the pull-up current dropped enough that the rise time on the I2C lines exceeded the spec — but just barely, not enough to cause a NACK, just enough to cause the receiving device to sample at the wrong time occasionally.
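The rise-time margin is easy to estimate. For a resistor pull-up, the time to go from 30% to 70% of the supply is about 0.8473 x R x C (that factor is ln(0.7/0.3)), and the I2C specification allows 1000 ns in standard mode and 300 ns in fast mode. Here's a rough sketch, assuming you know the pull-up value and can estimate the bus capacitance. The function names and example numbers are illustrative, not measurements from this board, and it doesn't model the rail droop itself, only how thin the margin can be before anything droops.

```c
#include <stdint.h>
#include <stdbool.h>

#define I2C_TR_MAX_STANDARD_NS 1000u  /* 100 kHz mode rise-time limit */
#define I2C_TR_MAX_FAST_NS      300u  /* 400 kHz mode rise-time limit */

/* Estimate the 30%-to-70% rise time of an RC pull-up in nanoseconds.
 * r_ohm * c_pf is in picoseconds; 0.8473 = ln(0.7 / 0.3). */
uint32_t i2c_rise_time_ns(uint32_t r_ohm, uint32_t c_pf) {
    uint64_t rc_ps = (uint64_t)r_ohm * (uint64_t)c_pf;
    uint64_t rise_ps = (rc_ps * 8473u) / 10000u;
    return (uint32_t)(rise_ps / 1000u);
}

/* True if the estimated rise time fits inside the given limit. */
bool i2c_rise_time_ok(uint32_t r_ohm, uint32_t c_pf, uint32_t limit_ns) {
    return i2c_rise_time_ns(r_ohm, c_pf) <= limit_ns;
}
```

For example, 4.7k pull-ups on a 200 pF bus come out just under 800 ns: legal in standard mode, but without much room for cable capacitance or a sagging rail.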

Two fixes: dedicated LDO regulator for the communication and MCU section (isolating it from the power stage), and proper I2C error recovery in the software:

// I2C communication with retry and bus recovery
#define I2C_MAX_RETRIES  3
#define I2C_TIMEOUT_MS   10

void i2c_bus_recover(void);  // defined below

bool i2c_send_data(uint8_t addr, uint8_t *data, uint8_t len) {
    for (uint8_t attempt = 0; attempt < I2C_MAX_RETRIES; attempt++) {
        HAL_StatusTypeDef result = HAL_I2C_Master_Transmit(
            &hi2c1, addr << 1, data, len, I2C_TIMEOUT_MS
        );

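        // HAL_OK only means every byte was ACKed, not that the payload is intact
        if (result == HAL_OK) {
            return true;
        }

        // Bus recovery: after a blocking call the handle state is back to READY,
        // so check the return code and the peripheral's BUSY flag instead
        if (result == HAL_BUSY || __HAL_I2C_GET_FLAG(&hi2c1, I2C_FLAG_BUSY)) {
            i2c_bus_recover();  // toggle SCL to unstick any held SDA
        }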

        HAL_Delay(5);  // brief pause before retry
    }

    // Log persistent failure for diagnostics
    error_flags |= ERR_I2C_COMMS;
    return false;
}

// Bus recovery: manually toggle SCL 9 times to release a stuck slave
void i2c_bus_recover(void) {
    HAL_I2C_DeInit(&hi2c1);
    // ... configure SCL/SDA as GPIO outputs, toggle SCL 9 times,
    // send STOP condition, reconfigure as I2C peripheral
    HAL_I2C_Init(&hi2c1);
}

The lesson: never assume HAL_OK means the data was received correctly, and never leave communication loops without explicit error recovery. In embedded systems, partial failures are common and need explicit handling.
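Retries only help when the failure is visible. For the silent case, the receiver needs a way to reject a corrupted payload, and the usual answer is a checksum on every frame. Here's a minimal sketch using CRC-8 with polynomial 0x07 and a `[length][payload][crc]` layout. The frame format and the function names are assumptions for illustration, not the protocol this project actually used.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <string.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0. */
uint8_t crc8(const uint8_t *data, size_t len) {
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (uint8_t bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Build [len][payload...][crc] into out. Returns the frame size,
 * or 0 if the output buffer is too small. */
size_t frame_build(const uint8_t *payload, uint8_t len,
                   uint8_t *out, size_t out_size) {
    size_t total = (size_t)len + 2u;
    if (out_size < total) {
        return 0;
    }
    out[0] = len;
    memcpy(&out[1], payload, len);
    out[total - 1u] = crc8(out, total - 1u);  /* covers length + payload */
    return total;
}

/* Receiver side: accept the frame only if length and CRC both match. */
bool frame_check(const uint8_t *frame, size_t frame_len) {
    if (frame_len < 2u || frame[0] != frame_len - 2u) {
        return false;
    }
    return crc8(frame, frame_len - 1u) == frame[frame_len - 1u];
}
```

The sender passes the built frame to the transmit routine, and the secondary board drops anything that fails `frame_check`. A single-byte CRC won't catch everything, but it turns most silent corruption into a detectable error.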

Problem 4: Real-Time Constraints and the Starved Main Loop

I started with a simple super-loop architecture: everything in the main loop, with timing managed by polling a millisecond counter. The control loop ran at whatever rate the main loop completed, which was variable.

This worked until I added UART logging. Debug logging via UART at 115200 baud, especially logging floating point values (slow on MCUs without an FPU, and I was using an STM32F103), was adding several milliseconds of latency per loop iteration. My "50Hz control loop" was actually running at 20-30Hz under heavy logging, and the timing variation was enough to degrade the MPPT tracking accuracy.
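The arithmetic behind that is worth doing once. Here's a small sketch, assuming 8N1 framing (10 bits on the wire per byte) and a blocking transmit that doesn't return until the last byte is out. The function name is illustrative.

```c
#include <stdint.h>

/* Time to push `bytes` through a UART at `baud`, in microseconds.
 * Assumes 8N1 framing: 1 start + 8 data + 1 stop = 10 bits per byte. */
uint32_t uart_tx_time_us(uint32_t bytes, uint32_t baud) {
    if (baud == 0u) {
        return 0u;  /* avoid dividing by zero on a bad configuration */
    }
    return (uint32_t)(((uint64_t)bytes * 10u * 1000000u) / baud);
}
```

At 115200 baud a single 64-character log line takes roughly 5.6 ms on the wire, which is more than a quarter of the 20 ms period of a 50Hz loop before any float formatting cost is counted.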

The fix was moving the control loop to a hardware timer interrupt. TIM2 configured to fire at exactly 50Hz, with the MPPT control algorithm running in the ISR. Main loop handles non-time-critical tasks: UART logging, display updates, I2C communication with secondary board, parameter storage to flash.

// TIM2 interrupt handler — the MPPT control loop
// Executes at exactly 50Hz regardless of main loop state
void TIM2_IRQHandler(void) {
    if (__HAL_TIM_GET_FLAG(&htim2, TIM_FLAG_UPDATE)) {
        __HAL_TIM_CLEAR_FLAG(&htim2, TIM_FLAG_UPDATE);

        // Sample the inputs first so all three readings are as close in time as possible
        panel_voltage = get_panel_voltage();
        panel_current = get_panel_current();
        battery_voltage = get_battery_voltage();

        // Compute power and run MPPT
        float panel_power = panel_voltage * panel_current;
        mppt_update(panel_voltage, panel_power);

        // Apply duty cycle to buck converter PWM
        set_pwm_duty(mppt_duty_cycle);

        // Set flag for main loop to handle logging
        // (control_loop_tick must be declared volatile so the main loop sees the change)
        control_loop_tick = true;
    }
}

Separating time-critical control from non-time-critical I/O was the right architecture from the start — I just had to learn that lesson the hard way.
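The other half of that split is getting data out of the ISR without the main loop reading a half-updated set of values. Here's one way to sketch it, assuming a single-core MCU where the ISR always runs to completion: the ISR bumps a sequence counter before and after writing, and the main loop retries its copy if the counter changed underneath it. All names here (`control_publish`, `control_take`, `control_sample_t`) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    float panel_voltage;
    float panel_current;
    float battery_voltage;
} control_sample_t;

/* Shared between ISR and main loop, so everything is volatile. */
static volatile uint32_t sample_seq;          /* odd while being written */
static volatile control_sample_t shared_sample;
static volatile bool control_loop_tick;

/* Called from the timer ISR after the measurements are taken. */
void control_publish(float v_panel, float i_panel, float v_batt) {
    sample_seq++;                              /* now odd: write begins */
    shared_sample.panel_voltage = v_panel;
    shared_sample.panel_current = i_panel;
    shared_sample.battery_voltage = v_batt;
    sample_seq++;                              /* even again: write done */
    control_loop_tick = true;
}

/* Called from the main loop. Returns true and fills *out when a new
 * sample is available; retries the copy if the ISR interrupted it. */
bool control_take(control_sample_t *out) {
    if (!control_loop_tick) {
        return false;
    }
    control_loop_tick = false;

    uint32_t before;
    uint32_t after;
    do {
        before = sample_seq;
        out->panel_voltage = shared_sample.panel_voltage;
        out->panel_current = shared_sample.panel_current;
        out->battery_voltage = shared_sample.battery_voltage;
        after = sample_seq;
    } while (before != after || (before & 1u) != 0u);
    return true;
}
```

The main loop then logs from its private copy at whatever pace the UART allows. Briefly disabling the timer interrupt around the copy is a simpler alternative if a few microseconds of added jitter is acceptable.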

Problem 5: Thermal Management Debugging

At duty cycles above about 70%, the main switching MOSFET was getting warm. Not alarmingly warm at first (maybe 45-50°C above ambient), but the trend was concerning, and under high-irradiance conditions it was climbing further.

I pulled out a thermal camera (one of the affordable smartphone-attachment ones — worth every penny for hardware debugging) and confirmed the MOSFET was the hotspot. The synchronous rectifier MOSFET was running much cooler, which suggested the issue was the high-side switch specifically.

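Went back to the datasheet and recalculated the losses properly. I had estimated RDS(on) at room temperature and at the low gate voltage my gate driver was delivering. What I hadn't accounted for: RDS(on) rises substantially with temperature. For a typical silicon MOSFET it is roughly 1.5 times the 25°C value at 100°C and approaches double near the maximum junction temperature, so check the normalized RDS(on) curve for your part. As the MOSFET gets warm, its resistance increases, which means more power dissipation, which means it gets warmer still. That positive feedback loop either settles at a hotter operating point than the room-temperature numbers predict or, with poor enough heatsinking, doesn't settle at all.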
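That feedback loop is easy to put numbers on. Here's a back-of-envelope sketch that iterates conduction loss and junction temperature until they agree. It assumes a linear temperature coefficient read off the datasheet's normalized RDS(on) curve and counts conduction loss only, with no switching loss. The struct, the function name, and the 175°C cutoff are illustrative assumptions, not values from this design.

```c
#include <stdbool.h>

typedef struct {
    float i_rms_a;           /* RMS current through the switch, amps */
    float rds_on_25_ohm;     /* RDS(on) at 25 C and your gate voltage */
    float tempco_per_c;      /* fractional RDS(on) rise per degree C */
    float theta_ja_c_per_w;  /* junction-to-ambient thermal resistance */
    float t_ambient_c;       /* ambient temperature, degrees C */
} thermal_model_t;

#define THERMAL_MAX_ITER   100
#define THERMAL_T_LIMIT_C  175.0f  /* assumed maximum junction temperature */

/* Iterate T -> RDS(on)(T) -> P -> T until it settles. Returns false if
 * the temperature passes the limit or never converges (thermal runaway). */
bool thermal_equilibrium(const thermal_model_t *m, float *t_junction_c) {
    float t = m->t_ambient_c;
    for (int n = 0; n < THERMAL_MAX_ITER; n++) {
        float rds = m->rds_on_25_ohm
                  * (1.0f + m->tempco_per_c * (t - 25.0f));
        float power_w = m->i_rms_a * m->i_rms_a * rds;
        float t_next = m->t_ambient_c + m->theta_ja_c_per_w * power_w;

        if (t_next > THERMAL_T_LIMIT_C) {
            return false;
        }
        float diff = t_next - t;
        if (diff < 0.0f) { diff = -diff; }
        t = t_next;
        if (diff < 0.01f) {
            *t_junction_c = t;
            return true;
        }
    }
    return false;
}
```

As an example with made-up but plausible inputs (10 A, 20 milliohms, 0.0067 per °C, 20°C/W, 25°C ambient), the room-temperature estimate says 65°C, while the iteration settles just under 80°C. Quadruple the thermal resistance and it never settles.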

The solution was a combination of a better gate driver that could deliver the full 10V gate voltage the MOSFET was rated for (I had been driving it at 5V from the MCU, which left the channel only partially enhanced and RDS(on) well above its rated figure), and a small heatsink pressed onto the MOSFET package. After both changes, temperature stabilized around 30°C above ambient at full load, which is acceptable for the application.

Problem 6: The Accidental Power-On Bug

This one still makes me wince. During early bring-up, I had a sequence bug in my GPIO initialization. At reset, nearly all of the STM32's GPIO pins come up as floating inputs, so whatever they're wired to sits at an undefined level until you configure the pin or a pull resistor defines it. I was initializing the MOSFET gate drive pin late in the startup sequence, after some peripheral initialization that happened to toggle other pins on the same GPIO port.

Due to how the STM32's GPIO port registers work, a read-modify-write on a shared port register (the output data register, or a configuration register written as a whole word) can change pins other than the one you meant to touch, depending on the sequence. The bit set/reset registers exist precisely to avoid this. On two or three occasions during development, the MOSFET gate pin would briefly pulse high during startup, for perhaps a few hundred microseconds, before I initialized it to the correct low state.

I discovered this with an oscilloscope, not by frying anything — I got lucky that the battery connected during those tests was current-limited. But the potential consequence was real: a brief uncontrolled MOSFET turn-on during startup, with an uncharged inductor, could cause a current spike that would stress both the MOSFET and the battery.

The fix was two-fold: initialize all power-stage GPIO pins to safe states (MOSFET gate LOW) as the absolute first thing in the startup code, before any other peripheral initialization. And add a hardware pull-down resistor on the gate drive line so that even if the MCU is in reset or the gate driver hasn't initialized, the MOSFET defaults to off.

GPIO initialization order in embedded systems is not just a code style issue — it's a safety issue. Anything connected to a power stage needs to default to safe state at power-on, both in software initialization order and in hardware. Belt and suspenders, always.
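Here's the software half as a sketch, assuming an STM32F1-style port: request the output latch low through the write-only bit-reset register first, and only then switch the pin from floating input to push-pull output, so the pin is never driven with a stale latch value. The `gpio_regs_t` struct below is a stand-in that mirrors the F1 register order so the logic compiles off-target. On real hardware you'd use the CMSIS `GPIOx` definitions and enable the port clock first. `gate_pin_make_safe` is a hypothetical name.

```c
#include <stdint.h>

/* Stand-in for the STM32F1 GPIO register block (same field order as
 * the CMSIS GPIO_TypeDef); used here only so the sketch is testable. */
typedef struct {
    volatile uint32_t CRL;   /* config for pins 0-7, 4 bits per pin */
    volatile uint32_t CRH;   /* config for pins 8-15, 4 bits per pin */
    volatile uint32_t IDR;
    volatile uint32_t ODR;
    volatile uint32_t BSRR;
    volatile uint32_t BRR;   /* write 1 to reset the matching ODR bit */
    volatile uint32_t LCKR;
} gpio_regs_t;

#define GPIO_CFG_OUT_PP_2MHZ 0x2u  /* CNF = 00 (push-pull), MODE = 10 */

/* Force a power-stage pin to a driven LOW state. Intended to run
 * before any other peripheral initialization touches the port. */
void gate_pin_make_safe(gpio_regs_t *port, uint8_t pin) {
    /* Step 1: clear the output latch with a single write. No
     * read-modify-write on ODR, so other pins are not disturbed. */
    port->BRR = (1u << pin);

    /* Step 2: switch this pin's 4-bit config field to output,
     * leaving the other seven fields in the register unchanged. */
    volatile uint32_t *cfg = (pin < 8u) ? &port->CRL : &port->CRH;
    uint32_t shift = (uint32_t)(pin & 7u) * 4u;
    uint32_t value = *cfg;
    value &= ~(0xFu << shift);
    value |= (GPIO_CFG_OUT_PP_2MHZ << shift);
    *cfg = value;
}
```

This only covers the window after the code starts running. While the MCU is held in reset or still starting up, the external pull-down is the only thing keeping the gate low.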

The finished controller works reliably now. It's tracking MPPT within a few percent of theoretical maximum even in dynamic conditions, the thermal management is solid, and the communication is stable. But the path from schematic to working hardware was considerably more involved than the tutorials suggested — which, come to think of it, is true of basically every embedded project I've ever worked on.