Load Forecasting Tutorial (part 1): Data Preparation

Miha Grabner
Data Scientist, Electric Power Research Institute Milan Vidmar

Creator of a LinkedIn group AI in Smart Grids. I also write blogs about Energy Data Science, take a look: . My main research field is Data Analytics and Machine...

  • May 20, 2020

Welcome to the first part of the blog series about Load Forecasting. In this series of tutorials, I will guide you through the whole load forecasting workflow, from preparing the data to building a machine learning model. I will share a lot of tips and tricks that I have found useful over time.


The tutorial consists of the following blogs:

  1. Data preparation

  2. Exploratory Data Analysis

  3. How to develop a benchmark model?

  4. How to evaluate model performance?

  5. How to improve a benchmark model?

  6. How to develop a Neural Network model?

A common rule of thumb says that about 80 % of the time in data science projects is spent on data preparation and only 20 % on machine learning modeling. I think that this depends heavily on the industry, but for the electrical energy industry I can confidently say that it is true. A lot of utilities around the world have still not implemented data processing systems. This means that you will have to gather the data first (usually from different data sources) and clean it before using it.

These are common anomalies found in energy datasets:

  1. Strange behavior of the load (e.g. due to a temporary supply reconfiguration)

  2. Outliers

  3. Missing values

  4. Duplicated timestamps

  5. Missing timestamps
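The last three items in this list are easy to check mechanically. Here is a minimal sketch with pandas, assuming an hourly series; the values and the name `load_kw` are made up for illustration, and the sketch only detects the problems without deciding how to fix them.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly load with one duplicated timestamp (01:00),
# one missing timestamp (03:00) and one missing value (02:00).
timestamps = pd.to_datetime([
    "2020-03-01 00:00", "2020-03-01 01:00", "2020-03-01 01:00",
    "2020-03-01 02:00", "2020-03-01 04:00", "2020-03-01 05:00",
])
load = pd.Series([410.0, 395.0, 395.0, np.nan, 380.0, 402.0],
                 index=timestamps, name="load_kw")

# Duplicated timestamps: index entries that repeat an earlier entry.
is_duplicate = load.index.duplicated(keep="first")
duplicated = load.index[is_duplicate]

# Keep the first occurrence so the index is unique before reindexing.
deduped = load[~is_duplicate]

# Missing timestamps: compare against the full expected hourly range.
expected = pd.date_range(deduped.index.min(), deduped.index.max(),
                         freq=pd.Timedelta(hours=1))
missing_timestamps = expected.difference(deduped.index)

# Missing values: timestamps that exist but carry no reading.
missing_values = deduped.index[deduped.isna().to_numpy()]

# Reindexing makes the gaps explicit as NaN rows; nothing is imputed here.
regular = deduped.reindex(expected)
```

Keeping the first of two duplicated rows is only one possible choice; if the duplicates carry different readings, you will have to look at the data source to decide which one is correct.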

The figure below shows an example of a supply reconfiguration between March 15 and March 23. There are various causes for this, such as the disconnection of a larger industrial consumer or the reconfiguration of a feeder supply (from one substation to another).

In the figure below you can see a spike (outlier) and an example of missing observations.
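A spike like this can be screened for with a simple robust rule. The sketch below is one option among many, not a general recipe: it flags points that lie far from a centred rolling median, measured in units of the median absolute deviation. The function name `flag_spikes`, the window of 5 and the threshold of 5 are assumptions chosen for illustration. Note that a centred window looks at later observations, so this is a rule for inspecting historical data, not for a live forecast.

```python
import numpy as np
import pandas as pd


def flag_spikes(load, window=5, threshold=5.0):
    """Flag points far from the centred rolling median (in MAD units)."""
    median = load.rolling(window, center=True, min_periods=1).median()
    deviation = (load - median).abs()
    mad = deviation.median()  # robust scale of the typical deviation
    if mad == 0 or np.isnan(mad):
        # No usable scale: flag nothing rather than everything.
        return pd.Series(False, index=load.index)
    return deviation > threshold * mad


# Hypothetical readings with a single spike at position 4.
readings = pd.Series([100.0, 102.0, 101.0, 103.0, 500.0,
                      102.0, 100.0, 101.0, 103.0, 102.0])
flags = flag_spikes(readings)
```

Missing readings stay unflagged because a comparison with NaN is False, so they have to be handled separately.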

I will not go into the details of how to handle all of the aforementioned anomalies, because it depends on a lot of factors and there is no general rule for it (I may write a detailed blog about outlier detection and missing value imputation later).

One thing that I have to emphasize is: Beware of data leakage.

Data leakage occurs when you use knowledge from the “future” (test set representing unseen data) to create the model. This will result in an outstanding model performance on a test set, whereas in a production environment the performance will be worse.

If you clean your whole dataset first and then apply machine learning, you are cheating! Why? Data processing has to be embedded in a pipeline together with machine learning algorithms. This pipeline has to be built on a train set and evaluated on a test set. It seems reasonable, but a lot of people make this mistake without even noticing it.
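To make the difference concrete, here is a minimal sketch, assuming a chronological split and a very simple preparation step (filling gaps with the mean). The class `MeanImputer` and the synthetic data are hypothetical; the point is only that the step is fitted on the train set and then applied unchanged to the test set.

```python
import numpy as np
import pandas as pd


class MeanImputer:
    """Minimal fit/transform preparation step (hypothetical helper)."""

    def fit(self, series):
        # Learn the fill value from the data passed in, and nothing else.
        self.fill_value_ = series.mean()
        return self

    def transform(self, series):
        return series.fillna(self.fill_value_)


# Synthetic hourly load with gaps; the last 48 hours sit at a higher level.
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=240, freq=pd.Timedelta(hours=1))
load = pd.Series(400 + rng.normal(0, 10, size=240), index=index)
load.iloc[192:] += 100
load.iloc[[10, 50, 200, 220]] = np.nan

# Chronological split: the last 48 hours play the role of unseen data.
train, test = load.iloc[:192], load.iloc[192:]

# Leaky: the fill value is computed on the whole series, test set included.
leaky_fill = load.mean()

# Leak-free: fit on the train set only, then apply to both sets.
imputer = MeanImputer().fit(train)
train_clean = imputer.transform(train)
test_clean = imputer.transform(test)
```

In this example the leaky fill value is pulled upwards by the test period, so the cleaned training data already "knows" something about the future. The same reasoning applies to scaling, outlier thresholds and any other step that learns something from the data.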

This can happen faster than you think, especially when you are working with people in the industry who are trying to help you but have no experience with building machine learning models.

Remember, people tend to overfit!

The forecasting workflow consists of the following steps:

  1. Gathering the data

  2. Data preparation

  3. Exploratory data analysis

  4. Feature Engineering

  5. Tuning Machine Learning algorithms

  6. Evaluating the performance

  7. Putting the model into the production environment


A tip that will make your life easier

First, focus on coding the whole workflow: gather and prepare the data, apply simple machine learning algorithms, and evaluate the performance (more about the general machine learning workflow here). This will give you a bigger picture of your problem, and you will be able to focus on whatever seems most important for improving the performance, whether this is data preparation, tuning machine learning algorithms, or something else.


At the beginning of your project use simple data preparation (such as dropping missing or anomalous observations) and basic algorithms – make predictions as soon as possible. Another very useful approach is to analyze train and validation errors to spot if something strange is going on.
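As a minimal sketch of such a first pass, the example below drops missing observations, predicts each hour with the load observed 24 hours earlier, and compares the errors on the train and validation periods. The synthetic data, the 24-hour lag and the choice of MAPE as the error measure are assumptions for illustration. The lag model has nothing to fit, so the "train error" here is simply its error on the training period.

```python
import numpy as np
import pandas as pd

# Synthetic hourly load for four weeks: daily cycle plus noise, a few gaps.
rng = np.random.default_rng(42)
index = pd.date_range("2020-01-01", periods=24 * 28,
                      freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
values = 500 + 100 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, len(index))
load = pd.Series(values, index=index, name="load")
load.iloc[[30, 31, 400]] = np.nan

# Basic algorithm: predict each hour with the load observed 24 hours earlier.
# shift(24) equals "24 hours earlier" only because the index has no missing timestamps.
frame = pd.DataFrame({"actual": load, "predicted": load.shift(24)})

# Simple preparation: drop rows where either value is missing.
frame = frame.dropna()

# Chronological split: first three weeks for training, last week for validation.
split = index[24 * 21]
train = frame[frame.index < split]
valid = frame[frame.index >= split]


def mape(part):
    """Mean absolute percentage error in percent."""
    error = (part["actual"] - part["predicted"]).abs() / part["actual"]
    return float(error.mean() * 100)


train_mape, valid_mape = mape(train), mape(valid)
```

If the two errors differ a lot, that is a hint to look at the data of the worse period before spending time on a more complex model.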


The main suggestion: do not spend too much time on any of the aforementioned steps at the beginning, until you find out which parts are most important. Then go back to the identified bottlenecks and improve what seems most important. Remember that machine learning modeling is an iterative process.



In this blog, I did not go into the details of data preparation. Instead, I wanted to point out a few topics that I find very important to understand in order to build a good data preparation pipeline.

The next blog is about Exploratory Data Analysis, which is a crucial topic, since it enables a better understanding of the data and consequently better feature engineering in machine learning modeling.


Original blog posted here. 

Matt Chester on May 20, 2020

Data preparation seems like the self-evident first step, but do you find that some people will rush past it and end up working with faulty data? Is this a problem in the utility industry in your opinion?
