I have a habit of picking projects that sound straightforward and then discovering they aren’t. A few months ago I decided I wanted to actually understand sequence models, not just read about them. Reading papers gives you vocabulary. Building something gives you intuition. The two are not the same thing.

This post is what I wish someone had written before I started. It is long. It covers data plumbing, architecture choices, a probabilistic extension, an ablation study that humbled me, and the interpretability work that finally made the model feel like something I understood rather than something I had trained.

If you are learning deep learning and want to see how the pieces actually fit together on a real time-series problem, this is for you.

Why Electricity Prices

Electricity is a strange commodity. You cannot store it cheaply at scale, so supply has to match demand minute by minute. Prices in liberalised markets like Spain’s are set a day ahead through an auction. Generators bid, retailers buy, and a clearing price emerges for each of the next 24 hours.

Those prices have structure you can almost feel:

  • Diurnal: low overnight, two daily peaks (morning ramp, evening peak)
  • Weekly: weekends are cheaper than weekdays
  • Seasonal: winter heating, summer cooling, shoulder seasons quieter
  • Weather-driven: wind drives renewable supply, temperature drives demand
  • Spiky: occasional extreme events that no smooth model captures well

This is a wonderful playground. The signal is rich enough that simple baselines work surprisingly well, but messy enough that a naive deep model will embarrass you. That gap between “obvious baseline” and “actually good” is where you learn.

The Dataset

I used the Kaggle “Energy consumption, generation, prices and weather” collection. Two tables:

  • Energy table: 35,064 hourly rows for Spain, 29 columns covering generation by source, load, and prices
  • Weather table: 178,396 rows of meteorological observations across Valencia, Madrid, Bilbao, Barcelona, and Seville

Both span 1 January 2015 to 31 December 2018. The target is price actual, the realised hourly price in EUR/MWh.

Data Plumbing Is Most of the Work

Nobody tells you this loudly enough. On a real time-series problem, the modelling code is small. The plumbing around it is large, fragile, and the place where almost every silent bug lives.

Here is what I had to handle:

Uninformative columns. Two columns in the energy table were always empty (generation hydro pumped storage aggregated, forecast wind offshore eday ahead). Five generation types were always zero. Drop them.

Timezones. This is where most people get burned. Prices are quoted in local Spanish time, which means CET in winter and CEST in summer. If you parse timestamps as UTC and forget to convert, you silently shift every price by an hour in winter and two in summer. I parsed everything to UTC, then converted to Europe/Madrid, which handles DST transitions automatically.

DST itself. Spring-forward gives you a missing hour. Fall-back gives you a duplicate. I accepted the missing hour and resolved duplicates by keeping the first occurrence. Both are defensible; the important thing is to do something deliberate.
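
In pandas, the timezone and duplicate handling together look roughly like this. A sketch of the approach, not my verbatim code; the filename and column name are assumptions:

```python
import pandas as pd

# Parse everything as UTC first, then convert to market-local time.
# Europe/Madrid handles the CET/CEST transitions for us.
energy = pd.read_csv("energy_dataset.csv")        # assumed filename
ts = pd.to_datetime(energy["time"], utc=True)     # assumed column name
energy.index = ts.dt.tz_convert("Europe/Madrid")

# Resolve any duplicated timestamps (fall-back artefacts) deliberately:
# keep the first occurrence.
energy = energy[~energy.index.duplicated(keep="first")]
```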

Missing values. I forward-filled then backward-filled with a limit of three consecutive hours. Anything longer is suspicious enough to want to know about, not silently impute.

The whitespace bug. One city in the weather table was labelled " Barcelona" with a leading space. Five cities became six. An inner join on city name dropped a chunk of data and I did not notice until I checked row counts. Strip whitespace on string columns. Always.

Units. Temperatures came in Kelvin. I converted to Celsius mostly so my sanity checks made sense (“is 287 a reasonable temperature?” is a worse question than “is 14 a reasonable temperature?”).

Spatial aggregation. I averaged the five cities into one Spain-wide weather row per hour. This throws away spatial variation, but it keeps the feature count small and gives me something to compare against if I later try city-level features.
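
Continuing the sketch with the weather table (again, column names are illustrative, not verbatim from the dataset):

```python
# Fix the whitespace bug before any join or groupby on city name.
weather["city_name"] = weather["city_name"].str.strip()  # " Barcelona" -> "Barcelona"
weather["temp"] = weather["temp"] - 273.15               # Kelvin -> Celsius

# Collapse the five cities into one Spain-wide row per hour.
weather_es = weather.groupby(weather.index).mean(numeric_only=True)

# Fill short gaps only; longer runs stay NaN so I notice them.
energy = energy.ffill(limit=3).bfill(limit=3)
```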

The cleaned tables joined cleanly on the hourly index, giving 35,064 rows by 40 columns.

Calendar Features

I deliberately kept feature engineering minimal. The temptation on time-series problems is to add 50 features and watch your model overfit gloriously. I added the bare minimum:

  • hour_sin, hour_cos (period 24)
  • dow_sin, dow_cos (period 7)
  • month_sin, month_cos (period 12)
  • is_weekend binary flag
  • is_holiday binary flag for Spanish public holidays

The sin/cos encoding matters. If you encode hour as an integer 0-23, the model sees hour 23 and hour 0 as maximally distant when in reality they are adjacent. Sin/cos puts cyclical features on a unit circle so distance behaves correctly.
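
In code, the encoding is two lines per cyclical feature (a minimal sketch):

```python
import numpy as np

def cyclical(values: np.ndarray, period: int):
    """Map an integer cycle (hour, weekday, month) onto the unit circle."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

hour_sin, hour_cos = cyclical(np.arange(24), 24)  # hour 23 and hour 0 end up adjacent
```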

That is 8 calendar features plus 11 weather features, for 19 model inputs.

What I Was Not Allowed to Use

This is critical and easy to miss. The dataset contains columns like total load actual, generation wind onshore, and price day ahead. These are post-hoc realisations or operator forecasts that you would not have at prediction time. Using them is leakage. I excluded all actual load and generation columns, all forecast columns, and the operator’s own day-ahead price from the model’s feature set. They stayed in the dataframe purely for baseline evaluation.

This is the kind of thing you only catch by writing down, before you start, exactly what information would be available at the moment of prediction. Do this. It will save you from a model that scores beautifully and means nothing.

Splitting

Time-series splits are not random. You cannot shuffle. The future must come after the past or you are leaking information.

  • Train: 2015-01-01 00:00 to 2017-06-30 23:59 (CET)
  • Validation: 2017-07-01 00:00 to 2018-01-01 18:00
  • Test: 2018-01-01 19:00 to 2018-12-31 23:59

That gives 21,887 hours for training, 4,436 for validation, 8,741 for testing. Normalisation statistics (StandardScaler for features, mean/std for prices) come from the training set only. Touch the test set with anything during training and you have leakage.
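
A sketch of the split and the train-only normalisation; `df` and `feature_cols` are stand-ins for my actual frame and feature list:

```python
from sklearn.preprocessing import StandardScaler

# Chronological split: no shuffling, boundaries as listed above.
train = df.loc[:"2017-06-30 23:59"]
val   = df.loc["2017-07-01 00:00":"2018-01-01 18:00"]
test  = df.loc["2018-01-01 19:00":]

# Fit the scaler on training data only, then apply it everywhere.
scaler = StandardScaler().fit(train[feature_cols])
X_train = scaler.transform(train[feature_cols])
X_val   = scaler.transform(val[feature_cols])
X_test  = scaler.transform(test[feature_cols])
```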

I set torch.manual_seed(42) and lived with it. Reproducibility matters more than people pretend.

The Architecture

I wanted a model that could (a) digest a window of past prices and (b) condition on known future inputs (calendar, weather forecasts for the target day). The natural shape is an encoder-decoder.

Price History (336, 1) ── LSTM ──▶ hidden (128) ─┐
                                                  ├─ concat ─▶ MLP Decoder ─▶ 24 hourly forecasts
Weather + Calendar (24, 19) ── Linear+ReLU ──▶ context (128) ─┘

Three components:

1. Price encoder (LSTM). A two-layer LSTM with hidden size 128 reads the previous 14 days of hourly prices (336 values). I take the final hidden state as a summary vector. Why 14 days? Long enough to see two full weekly cycles, short enough that the LSTM does not have to remember things from a month ago.

2. Future context encoder. The 19 features for each of the 24 target hours give a 24 by 19 matrix. I flatten it to a 456-dimensional vector and project through a single linear layer with ReLU to get a 128-dim context vector. This is deliberately simple. The features are aligned by hour and the model can learn whatever interactions matter.

3. Decoder (MLP). Concatenate [hidden; context] into a 256-dim vector. Two linear layers (256 to 128, 128 to 24) with ReLU and dropout produce 24 hourly forecasts in a single shot.

Total trainable parameters: 293,656. Small enough to train on a free Colab T4 in minutes. There is a lesson here. You do not need a giant model to learn something interesting.
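
For concreteness, here is a sketch of the model in PyTorch. The sizes match the text; the class and variable names are mine:

```python
import torch
import torch.nn as nn

class PriceForecaster(nn.Module):
    """Sketch of the encoder-decoder above; sizes from the text, names are mine."""

    def __init__(self, n_features: int = 19, horizon: int = 24, hidden: int = 128):
        super().__init__()
        # 1. Price encoder: two-layer LSTM over 336 past hourly prices.
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden, num_layers=2,
                               batch_first=True, dropout=0.2)
        # 2. Future context: flatten (24, 19) -> 456, project to 128.
        self.context = nn.Sequential(
            nn.Flatten(),
            nn.Linear(horizon * n_features, hidden),
            nn.ReLU(),
        )
        # 3. Decoder MLP: [hidden; context] -> 24 hourly forecasts in one shot.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, horizon),
        )

    def forward(self, past_prices: torch.Tensor, future_feats: torch.Tensor):
        # past_prices: (batch, 336, 1); future_feats: (batch, 24, 19)
        _, (h_n, _) = self.encoder(past_prices)
        summary = h_n[-1]                        # final hidden state, last layer
        ctx = self.context(future_feats)
        return self.decoder(torch.cat([summary, ctx], dim=-1))
```

Instantiating this and summing p.numel() over the parameters reproduces the 293,656 figure, which is a cheap sanity check that the sketch matches the description.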

Why Not a Transformer?

Honest answer: because I was trying to learn LSTMs and the dataset is small. Transformers shine when you have a lot of data and long context. With 22k training hours and a 336-step input, an LSTM is a reasonable choice and trains quickly. I will reach for transformers when the problem demands it, not because they are fashionable.

Training

Standard supervised regression setup:

  • Loss: MSE on normalised prices
  • Optimiser: AdamW with weight decay 1e-4
  • Schedule: OneCycleLR with max LR 1e-3
  • Epochs: up to 50 with early stopping (patience 7 on validation loss)
  • Batch size: 32
  • Gradient clipping: norm 1.0 (LSTMs explode without it)
  • Dropout: 0.2 between LSTM layers, 0.3 in the decoder

A few things worth pulling out:

OneCycleLR is underrated. Warm up the learning rate, then anneal it down. It tends to find better minima than a constant LR, with less hyperparameter fiddling. If you have not tried it, try it.

Gradient clipping is not optional for LSTMs. I forgot it once. Loss went to NaN within two epochs. Clipping the gradient norm to 1.0 fixes this for free.

Early stopping works. The model converged around epoch 13 and I cut training at epoch 20. There is no virtue in training to a fixed epoch count when validation loss has plateaued.

Training took about 20 seconds on an Apple M-series GPU. That speed matters. Fast iteration is how you actually learn what the model does.
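
Put together, the core of the loop looks roughly like this. A sketch assuming the model class above; `train_loader`, `val_loader`, and `evaluate` are hypothetical helpers:

```python
import torch
import torch.nn.functional as F

model = PriceForecaster()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, epochs=50, steps_per_epoch=len(train_loader))

best_val, bad_epochs, patience = float("inf"), 0, 7
for epoch in range(50):
    model.train()
    for past, future, target in train_loader:
        opt.zero_grad()
        loss = F.mse_loss(model(past, future), target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # no NaNs, please
        opt.step()
        sched.step()

    val_loss = evaluate(model, val_loader)       # hypothetical helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # early stopping
            break
```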

Baselines, And Why You Need Them

Before you celebrate any deep learning result, you have to compare against baselines that a sceptical colleague would propose. I picked three:

  1. Day-ahead persistence: tomorrow’s price for hour h equals today’s price for hour h. Trivial. Often hard to beat on hourly data.
  2. Seasonal naive (weekly): tomorrow equals the same hour one week ago. Captures weekday/weekend structure.
  3. Operator forecast: the price day ahead column, which is the actual auction clearing price set the day before delivery.
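
The first two baselines are a couple of pandas shifts. A sketch, where `prices` stands in for the full hourly price actual series and `test_idx` for the test-period index from the split:

```python
# Shift the full series, then slice to the test period, so the first
# day/week of the test set still has a reference value.
pred_persist = prices.shift(24).loc[test_idx]      # same hour yesterday
pred_weekly  = prices.shift(24 * 7).loc[test_idx]  # same hour last week

mae_persist = (prices.loc[test_idx] - pred_persist).abs().mean()
mae_weekly  = (prices.loc[test_idx] - pred_weekly).abs().mean()
```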

Test-set MAE in EUR/MWh:

| Baseline | MAE (EUR/MWh) |
| --- | --- |
| Day-ahead persistence | 5.20 |
| Seasonal naive (weekly) | 6.21 |
| Operator forecast | 8.86 |

Persistence wins among the baselines. The operator forecast looks bad but is not really comparable; it is the clearing price set before delivery, while price actual includes real-time adjustments. Different quantities.

The headline number to beat is 5.20 EUR/MWh.

Main Results

The full LSTM model achieved a test-set MAE of 4.82 EUR/MWh, an MAPE of 8.41%, and an RMSE of 6.28. That is a 7.3% improvement over persistence.

Is 7.3% impressive? Honestly, it is modest. On a problem dominated by autocorrelation, persistence is a tough baseline. Beating it by single digits with a deep model is real progress, but it is not the kind of result that makes you write a press release. It is the kind of result that makes you ask “why isn’t the gap bigger, and what does that tell me about the problem?”

Spoiler: the ablations told me a lot.

Going Probabilistic

Point forecasts have a fundamental limitation: they tell you nothing about uncertainty. A prediction of 50 EUR/MWh that could plausibly be 30 or 70 is very different from one that is reliably 48 to 52. For anyone making decisions on these forecasts, the interval matters as much as the point.

The cleanest way to get intervals from a neural network is quantile regression. Instead of predicting one number per hour, predict three: the 10th, 50th, and 90th percentiles of the predictive distribution. The interval between Q10 and Q90 is your 80% prediction interval.

The architectural change is tiny. Replace the final layer (128 to 24) with (128 to 72), then reshape to (24, 3). The loss function does the heavy lifting. The pinball loss for quantile tau is:

$$ L_\tau(\hat{q}, y) = \begin{cases} \tau\,(y - \hat{q}), & y \geq \hat{q} \\ (1 - \tau)\,(\hat{q} - y), & y < \hat{q} \end{cases} $$

This is asymmetric on purpose. For tau = 0.9, under-predictions are penalised 9 times more than over-predictions, which pushes the predicted quantile up to where roughly 90% of observations fall below it. Average the loss across all three quantile levels and 24 hours and you have a single training objective.
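
A minimal implementation, with tensor shapes as in the text (the function name is mine):

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor,
                 taus=(0.1, 0.5, 0.9)) -> torch.Tensor:
    """Average pinball loss. pred: (batch, 24, 3); target: (batch, 24)."""
    t = torch.tensor(taus, device=pred.device)  # one tau per output quantile
    err = target.unsqueeze(-1) - pred           # y - q_hat, broadcast to (batch, 24, 3)
    # max(tau * err, (tau - 1) * err) reproduces the two cases above.
    return torch.maximum(t * err, (t - 1) * err).mean()
```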

I trained the quantile model with the same hyperparameters as the point model. The training loop auto-detects whether the output has more than one quantile and switches loss functions accordingly. The Q50 output achieved MAE 4.92, only marginally worse than the dedicated point model. So you do not really lose accuracy by adopting quantile regression; you gain uncertainty estimates.

Calibration

Here is where it got interesting. The 80% prediction interval (Q10 to Q90) covered the actual price only 71.2% of the time, well below the nominal 80%. Decomposing:

  • Q10 nominal coverage 0.10, observed 0.04
  • Q50 nominal coverage 0.50, observed 0.32
  • Q90 nominal coverage 0.90, observed 0.75

The intervals are systematically too narrow, with the lower tail worst affected. This makes physical sense. Sharp downward spikes happen when high renewable output coincides with low demand. They are rare, hard to predict, and the model learns to be conservative about them.

The fix is post-hoc recalibration (isotonic regression on the validation set) or training with more quantile levels for a smoother coverage curve. I left both as future work, but knowing the model is miscalibrated and in which direction is itself useful.
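
For reference, the coverage check itself is short. A sketch, where `q_pred` with shape (n, 24, 3) and `y` with shape (n, 24) stand in for the test-set quantile predictions and actuals:

```python
import numpy as np

# Fraction of actuals falling at or below each predicted quantile.
for i, tau in enumerate((0.1, 0.5, 0.9)):
    observed = float(np.mean(y <= q_pred[..., i]))
    print(f"Q{int(tau * 100)}: nominal {tau:.2f}, observed {observed:.2f}")

# Fraction of actuals inside the Q10-Q90 band (nominal 80%).
interval = float(np.mean((y >= q_pred[..., 0]) & (y <= q_pred[..., 2])))
print(f"80% interval coverage: {interval:.3f}")
```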

I considered MC Dropout and mixture density networks before settling on quantile regression. Both are perfectly good. Quantile regression won because it needs a single forward pass at inference and the loss directly optimises the quantities I care about. Engineering simplicity matters.

The Ablation Study That Humbled Me

This is the part of the project I learned most from. I ran three ablations to figure out which design choices actually mattered.

| Ablation | MAE | Delta vs full |
| --- | --- | --- |
| Full LSTM model | 4.82 | baseline |
| Remove weather features | 4.70 | better by 0.12 |
| Remove calendar features | 5.12 | worse by 0.30 |
| Replace LSTM with 2-layer MLP | 4.47 | better by 0.35 |

Two of these are uncomfortable.

Weather Features Hurt

Removing all 11 weather features made the model slightly better. This is the kind of result that makes you immediately suspect a bug. I checked. There is no bug.

The most likely explanation: the 14-day price history already encodes weather-driven demand variation through autocorrelation. The weather features as I structured them (Spain-wide hourly means) introduce noise that outweighs whatever marginal signal they add. Better feature engineering (per-city features, attention-weighted aggregation, weather forecasts at finer granularity) might recover the signal. As I had it, they were a net negative.

This is the kind of negative result you should always report. If I had only published the “winning” configuration, I would be hiding something genuinely informative.

Calendar Features Win

Removing all 8 calendar features degraded MAE by 6.2%. Calendar is the dominant feature group. This is unsurprising in retrospect; electricity demand is dominated by when people are awake, at work, and at home. The model needs hour-of-day and day-of-week to do its job.

Lesson: domain structure matters more than fancy modelling. Calendar features are nearly free and they are doing most of the work.

A Two-Layer MLP Beats the LSTM

This one stung. A simple MLP that flattens the entire input (336 hours of price history plus 24 by 19 future features) into one giant vector achieved MAE 4.47 against the LSTM’s 4.82.

I did not believe it. So I ran both models across three random seeds:

| Model | MAE (mean ± std) |
| --- | --- |
| 2-layer MLP | 4.45 ± 0.05 |
| LSTM | 4.65 ± 0.19 |

The MLP is consistently better and has lower variance. The LSTM’s sequential processing is not earning its keep on this problem.

There is a real lesson here. LSTMs help when you have variable-length inputs, when the sequential structure is essential to the task, or when you need to maintain state across very long contexts. For a fixed-length forecasting problem with a 336-step input, an MLP that sees everything at once can do at least as well, train faster, and be easier to reason about.

I kept the LSTM as the “main” model in my writeup because the assignment of architectures to problems was a hypothesis I wanted to test. If you are picking an architecture for a real production system, pick the simpler one when results are tied. Almost always.

Error Analysis

Beyond the headline MAE, I broke down errors along two axes.

By hour of day. Errors were elevated during morning ramp-up (hours 7-9) and evening peak (19-21). These are the volatility hours, when demand transitions and price elasticity is highest. The model handles flat overnight hours easily and struggles where the action is.

By season. Errors were highest in winter (DJF) and spring (MAM), lowest in summer (JJA). Winter has heating-driven volatility; spring has the transition between regimes. Summer is hot, predictable, and air-conditioner-driven.

Predicted-vs-actual scatter showed tight clustering around the identity line with mild fan-out at extreme prices. No catastrophic systematic bias. The model knows what it does not know, more or less.

Interpretability

A model you cannot interrogate is a model you should not trust. I did three complementary analyses.

Weight-Based Importance

The future context encoder has a weight matrix of shape (128, 456). I took the absolute value and aggregated across output neurons and hours for each of the 19 input features. The top weather features by aggregated weight were wind speed, 3-hour rain, and wind direction. Temperature features ranked lower.
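
A sketch of the computation, assuming the PriceForecaster sketch above (the flatten is hour-major, so the 456 columns fold back into 24 hours by 19 features):

```python
# Aggregate |W| of the context encoder's linear layer per input feature.
W = model.context[1].weight.detach().abs()           # (128, 456)
importance = W.reshape(128, 24, 19).sum(dim=(0, 1))  # one score per feature
print(importance.argsort(descending=True)[:5])       # top-5 feature indices
```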

This is physically sensible. Wind drives renewable generation, which drives supply-side price moves. Temperature drives demand, which is already implicit in the price history.

Gradient Saliency

I computed |dy_hat / dx_future| for each test sample and averaged by feature and target hour. The result is a heatmap of “how much does the prediction at hour h depend on weather feature f”.

Wind speed dominated, peaking at late-evening hours (21-23). This is exactly when solar generation drops off and wind becomes the residual renewable. The model learned this temporal sensitivity pattern without anyone telling it to. That moment, looking at the heatmap and seeing physics emerge from gradients, was the first time the model felt like something I had built rather than something I had configured.
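
The saliency computation itself is a loop over target hours. A sketch, again assuming the model sketch above; `past` and `future` are test batches:

```python
import torch

# |d y_hat[h] / d x_future| per target hour and feature.
future = future.clone().requires_grad_(True)
pred = model(past, future)                             # (batch, 24)
saliency = torch.zeros(pred.shape[1], future.shape[-1])
for h in range(pred.shape[1]):
    grad, = torch.autograd.grad(pred[:, h].sum(), future, retain_graph=True)
    saliency[h] = grad.abs().mean(dim=(0, 1))          # avg over batch and input hours
# `saliency` is the (target hour, feature) heatmap described above.
```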

Spike Analysis

I defined price spikes as hours above the 90th percentile and compared gradient magnitudes during spike hours against normal hours. Spike-hour gradients were 0.48 to 0.95 times their normal-hour values, i.e. consistently smaller. During extreme events the model relies more on price history (autocorrelation dominates during spike clusters) and less on weather inputs. Plausible. Spikes persist; weather is more useful at the margin during normal market conditions.

What I Would Do Differently

If I started over tomorrow, with everything I learned:

  1. Start with the MLP. Then ask whether the LSTM adds anything. I went the other way and the LSTM became sunk cost. Always start with the simplest baseline that could work.
  2. Build the test harness first. I wrote 62 automated tests across the pipeline. Most were retrofitted. The ones I wrote up front (timezone handling, data leakage checks, normalisation invariants) saved me hours. The ones I wrote later mostly confirmed bugs I had already found by other means.
  3. Treat feature engineering as a hyperparameter. I added 11 weather features as a block. The ablation showed they hurt. If I had built feature groups that I could toggle from the start, I would have caught this earlier.
  4. Train probabilistically from day one. The point model and the quantile model differ by one line of code. The quantile model gives you everything the point model does plus uncertainty. There is no good reason to start without it.
  5. Multi-seed everything. Single-seed comparisons lied to me about the LSTM-vs-MLP gap until I ran three seeds. Variance matters and you cannot estimate it from one run.

What This Project Taught Me About Deep Learning

A few things I now believe more strongly than I did before:

Baselines are sacred. Persistence beat my first three attempts. If your model cannot beat a one-line baseline, you do not have a model, you have a regression that ate too much electricity.

Data plumbing dominates. The total time I spent on architecture decisions was a fraction of the time I spent on timezones, missing values, leakage checks, and feature alignment. This ratio is not a sign of bad engineering. It is the job.

Simple beats complex more often than you think. A two-layer MLP beat a careful LSTM encoder-decoder. This is not an anti-deep-learning point. It is a “match the model to the problem” point. The problem had a fixed-length input and a clean structure. The MLP saw it directly. The LSTM had to compress and re-expand it.

Negative results are results. The weather features hurting performance is genuinely useful information. So is the under-coverage of the prediction intervals. Hide neither.

Interpretability earns trust. The moment the gradient saliency map showed wind speed dominating late-evening sensitivity, the model stopped being a black box. You should always be able to point at your model and say “here is why I believe it learned something real”.

The Stack

For anyone wanting to reproduce this kind of project, what I used:

  • Python 3.11, PyTorch for modelling
  • pandas for data plumbing, NumPy for numerics
  • scikit-learn for StandardScaler and metrics
  • Matplotlib for plotting
  • pytest for the test suite
  • A Mac with an M-series GPU for training; a Colab T4 would be equivalent

No specialised forecasting libraries. The point of the project was to understand the components, not to call library.fit(data).

If You Are Learning Deep Learning

Pick a problem with the following properties:

  1. A real dataset. Not MNIST. Something with messy timestamps, missing values, units that need converting. The plumbing is the lesson.
  2. A trivial baseline. Persistence, a moving average, the most-common-class predictor. You need to know whether your model is doing anything.
  3. A clear evaluation metric. Time-series splits, no leakage, train/val/test discipline. This protects you from yourself.
  4. A path to probabilistic output. Even if you start with point predictions, plan for uncertainty. The world is not deterministic.
  5. An interpretability angle. Pick something where you can sanity-check what the model learned against domain knowledge. This is how you build trust in your own work.

Electricity prices ticked all five for me. The next project might be different. The framework is the same.

The honest summary of this project is: I built a model that beat the strongest naive baseline by 7.3%, discovered that a simpler architecture beat my main one, and learned more from the ablation table than from the headline number. That is roughly what success in deep learning looks like when you are starting out. Modest gains, surprising negative results, and a much better intuition for the next problem.

Which is the whole point.