How to efficiently pseudonymize smart meter data (without cross-tables)
image credit: Leon Maruša - Elektro Celje
- Oct 12, 2020 10:43 am GMT / Oct 10, 2020 3:06 pm GMT
Smart meter data can tell a lot about the consumer, especially if the measurements are recorded at a high enough resolution, such as 1 or 5 minutes. While such high resolution is still not widely used by DSOs, a 15-minute or 1-hour resolution is common practice. If you look at someone’s load profile at this resolution, you might not figure out what movie they are watching (something researchers have demonstrated on 2-second data [1], [2]), but it’s enough to quickly determine whether someone is home, whether they are sleeping, or whether they are charging their electric car.
Since smart meter data is used in many research projects to develop new analytical tools or new business services, it’s crucial to protect the user’s privacy. This is especially true if the smart meter data must be shared with parties outside the utility, such as research institutions, faculties, algorithm developers, etc. In the EU we have the GDPR (General Data Protection Regulation), which states that any data from which a person is identifiable is considered personal data and is owned by that person. In Slovenia, where I’m from, we have an additional energy act which states that 15-minute resolution load profiles that can be linked to an individual consumer are also considered personal data. This link can be anything from the consumer’s physical coordinates, address and postal code to the meter ID.
Personal data is something that shouldn’t be taken for granted and should be high on every company’s list of priorities. To put things into perspective, according to Enforcement Tracker (GDPR fines database link), the sum of all fines issued for not complying with GDPR in Europe reached an astonishing 490 million EUR by September 2020. The highest fine issued so far (although still not final) went to British Airways for insufficient technical and organizational measures to ensure information security: a grand total of 204 million EUR [3].
In order to protect the consumer’s personal data but still enable the use of smart meter measurements, we must remove enough information that consumers are not identifiable, while still tracking which information belongs to which consumer. Imagine this example: if you provided a 3rd party with only smart meter load profiles, with all identifiers removed, for the development of an analytical service, they would give you back the results (e.g. load forecasts, cluster membership, etc.), but you would no longer know which result belongs to which consumer. To overcome this, you must include one of the identifiers, pseudonymize it for the 3rd party, and keep the original identifier and its linkage to yourself. This is where cross-tables (sometimes called mapping tables) come into play: since you gave pseudonymized IDs to the 3rd party, you need to keep a list that links the pseudonymized IDs to the original ones, so you can later identify consumers in the results you receive. Figure 1 shows the procedure using cross-tables.
Figure 1: Pseudonymizing consumer's identification using a cross-table.
Cross-tables are quite common practice since they provide a simple solution for pseudonymization and de-pseudonymization of the data. In principle you only have to generate a substitute (preferably random) ID for each consumer, write it down next to the original ID, and keep that list safe. The substitute IDs can be generated in many ways, but the most common practice today is to generate a random value for each original ID using a random number generator or a UUID (universally unique identifier) generator. As with any (de)pseudonymization method, cross-tables have their pros and cons.
Pros:
- Quick and simple pseudonymization and de-pseudonymization of the data.
- Easy to understand and handle.
- No need for additional external information to de-pseudonymize the data.
Cons:
- Cross-tables are usually stored in some sort of file on a disk. They can easily be misplaced, sent to the wrong recipient, copied or accidentally shared on common disks.
- If hackers breach your file system, they can easily steal them.
- Each dataset in each project should have its own cross-table, even if different datasets cover the same consumers. Suppose you give out one pseudonymized dataset and then, in another project, give out a dataset with a different type of data but the same pseudonymized IDs. Someone can now link the two datasets through the shared pseudonymized IDs, and knowing that two distinct pieces of information belong to the same person is sometimes enough to make a consumer identifiable. This is also known as the “mosaic effect”.
- Managing different cross-tables can become messy when you have many projects open simultaneously. This is also where you’re most likely to share the wrong cross-table with your project colleagues.
In addition, creating a pseudonymized dataset together with a cross-table sometimes requires custom software. With smart meter data this is especially true when you’re handling large datasets that are beyond the limits of Excel or similar tools, or that require more RAM than your computer has available.
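The cross-table procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the software we use: the meter IDs, the readings and the function name build_cross_table are made up for the example, and a real cross-table would be written to a protected file or database rather than kept in memory.

```python
import uuid

def build_cross_table(meter_ids):
    """Map each original meter ID to a fresh random UUID (the pseudonym)."""
    return {meter_id: str(uuid.uuid4()) for meter_id in meter_ids}

# Hypothetical meter IDs and readings (kWh), for illustration only.
readings = [("341279", 0.42), ("341280", 1.10), ("341279", 0.38)]

cross_table = build_cross_table({meter_id for meter_id, _ in readings})

# Dataset handed to the 3rd party: pseudonyms only, no original IDs.
pseudonymized = [(cross_table[meter_id], kwh) for meter_id, kwh in readings]

# Reverse lookup kept inside the utility to de-pseudonymize returned results.
reverse_table = {pseudo: original for original, pseudo in cross_table.items()}
```

Everything hinges on cross_table and reverse_table staying inside the utility, which is exactly the file-handling burden listed among the cons.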
PSEUDONYMIZATION THROUGH HASH FUNCTIONS
A different approach that we’re currently testing at Elektro Celje (a Slovenian DSO) is pseudonymization of smart meter data without cross-tables, using hash functions. If you have never heard of a hash function before, it is basically a set of complex mathematical operations performed on variable-length input data to map it to fixed-length output data. A good cryptographic hash function meets three criteria: it is fast to compute, it minimizes duplication of output values (collisions), and it is one-way, meaning you can generate the output (the so-called hash) from your input, but it should be practically impossible to calculate the original input value back from the hash.
Figure 2: Hashing and breaking hashes.
Another good thing about hash functions is that they completely change the output hash even if only one character of the input ID is changed. An example of this, using SHA-256, can be seen in Figure 3.
Figure 3: Changing only one character in the input value completely changes the output hash.
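If you want to reproduce this effect yourself, a minimal Python sketch using the standard hashlib module could look like the one below. The two meter IDs are hypothetical and are not the values shown in Figure 3.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA-256 hash of a string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two hypothetical meter IDs that differ in a single character.
hash_a = sha256_hex("meter563145")
hash_b = sha256_hex("meter563146")

# Count how many of the 64 hex characters differ between the two hashes.
differing = sum(1 for a, b in zip(hash_a, hash_b) if a != b)
print(hash_a)
print(hash_b)
print(f"{differing} of {len(hash_a)} hex characters differ")
```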
As far as the security of individual hash functions goes, I wouldn’t dare to go into details, since I’m not an expert on cryptography, but the general rule is to use hash functions that have proven secure over the years and have not yet been broken. Examples are SHA-256, SHA-512, the SHA-3 family, etc. Some of these hash functions are also building blocks for securing bank transfers and data communications, and even for cryptocurrency and blockchain applications.
Since DSOs usually store their smart meter data in some type of database, the hashing can already be done at the database level when you’re exporting the data for a 3rd-party processor. An example of such an export in Microsoft SQL Server would be:
SELECT HASHBYTES('SHA2_512', Meter_ID) AS Pseudo_ID,
Here we’re using the HASHBYTES function, which takes the original meter ID and calculates its SHA-512 hash on the fly while querying the data. This means that the data we get out is already pseudonymized, so we don’t need any post-processing or special software. Similar functions exist in other databases as well: for example, MySQL has the SHA2() function, PostgreSQL offers the pgcrypto module, and so on. With this approach you also don’t need to keep a cross-table, since you can always go back to your database and recalculate the hashes from your original IDs. All you need to know is which hash function you used.
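To illustrate how results can be linked back without a stored cross-table, here is a minimal Python sketch. The meter IDs and the returned results are hypothetical. It also assumes that the IDs are plain ASCII text stored as VARCHAR, so that hashing their UTF-8 bytes gives the same value as the database would; for NVARCHAR columns the bytes would differ, so treat the match with HASHBYTES as an assumption to verify on your own system.

```python
import hashlib

def pseudo_id(meter_id):
    """SHA-512 of the meter ID as upper-case hex (assumes ASCII IDs)."""
    return hashlib.sha512(meter_id.encode("utf-8")).hexdigest().upper()

# Hypothetical original IDs, as they would come from the utility's database.
original_ids = ["341279", "341280", "341281"]

# Hypothetical results returned by the 3rd party, keyed by pseudonymized ID.
results = {pseudo_id("341280"): "cluster B", pseudo_id("341279"): "cluster A"}

# Rebuild the link on demand instead of storing a cross-table on disk.
lookup = {pseudo_id(meter_id): meter_id for meter_id in original_ids}
linked = {lookup[p]: value for p, value in results.items()}
```

The lookup dictionary plays the role of a cross-table, but it only exists while the script runs and can be recreated at any time from the original IDs.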
MAKING YOUR PSEUDONYMIZED IDs MORE SECURE
Now, there is a catch in using this approach. Let’s say your meter or consumer IDs consist only of numbers (e.g. 341279). Someone can simply calculate the hashes of all numbers from 0 to a few billion and compare them to the hashes in your pseudonymized table. Sure enough, the attacker will soon figure out the original IDs behind your data (and if he doesn’t know which hash function you used, he can simply try several of the common ones). And yes, even slightly more complex meter IDs like ‘meter563145’ are not impossible to guess with some clever dictionary attacks. The same phenomenon is well known in password cracking; that’s why you are told to use more than 8 characters, with upper-case and lower-case letters and special characters.
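The attack on numeric IDs is easy to sketch. The Python example below is only an illustration with a hypothetical ID and a search range limited to one million candidates so that it finishes in about a second; a real attacker would use far larger ranges and much faster tools.

```python
import hashlib

def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# What the attacker sees: a hash from the pseudonymized table
# (here produced from the hypothetical meter ID 341279).
leaked_hash = sha256_hex("341279")

def brute_force_numeric(target_hash, upper_bound):
    """Try every number below upper_bound until one hashes to target_hash."""
    for candidate in range(upper_bound):
        if sha256_hex(str(candidate)) == target_hash:
            return str(candidate)
    return None

recovered = brute_force_numeric(leaked_hash, 1_000_000)
print(recovered)  # prints 341279
```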
In order to secure your hashes, it’s recommended to use what cryptography calls a salt. A salt is a string, usually long and as random as possible, that is concatenated to your original ID before hashing. This way an attacker trying to de-pseudonymize your data has to guess not only your smart meter IDs but also the salt you added in order to compute the right hashes. An example of adding a salt in an SQL query would be:
DECLARE @Salt VARCHAR(MAX)
SET @Salt = '16f0!43be@865a#5d!4fcef6.f5fb_11ea'
SELECT HASHBYTES('SHA2_512', Meter_ID + @Salt) as Pseudo_ID,
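The same idea can be sketched in Python, for example to check an export or to recompute pseudonymized IDs outside the database. This is a minimal sketch under assumptions: the function name salted_pseudo_id and the meter ID are hypothetical, the IDs are assumed to be ASCII text, and the salt is generated with the standard secrets module rather than typed by hand.

```python
import hashlib
import secrets

def salted_pseudo_id(meter_id, salt):
    """SHA-512 over meter ID + salt, in the same order as the SQL query above."""
    return hashlib.sha512((meter_id + salt).encode("utf-8")).hexdigest().upper()

# A fresh random salt: 32 random bytes rendered as 64 hex characters.
salt = secrets.token_hex(32)

# The same hypothetical ID hashed without and with the salt.
plain = hashlib.sha512("341279".encode("utf-8")).hexdigest().upper()
salted = salted_pseudo_id("341279", salt)
```

Without knowing the salt, an attacker who hashes every number from 0 upwards will compute values like plain, none of which match salted.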
Adding salt to your hashes makes them more secure. There is a method of breaking hashes called a rainbow table attack, in which hashes have already been precomputed for, let’s say, every combination of 1-7 characters from the ASCII table. An attacker then doesn’t have to spend loads of computing power calculating the hashes he’s trying to guess; he only needs to compare the precomputed hashes from the rainbow table with the hashes in your pseudonymized data. Many rainbow tables exist for different character sets and different hash functions. You can find a few rainbow tables here (link); they are quite large, often several hundred GB in size. Adding long and complicated salts to your IDs makes a rainbow table attack much more difficult and your data more secure.
It is important to keep your salts safe and to use a different salt for each project. Now, this might seem similar to the messy keeping of individual cross-tables for each project discussed before, and you’re right: you still have to keep one piece of information per project. But I would argue it’s a lot easier and safer to keep a few salty strings in a password manager, locked under a master password only you know, than to keep loads of cross-table files on shared (or local) disks. In any case, this depends on your specific situation: some will prefer cross-tables, others will prefer hashing. The important thing is that whatever you choose, you do it right and with security in mind.
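Why a separate salt per project matters can be shown with a short Python sketch. The project names, salts and meter IDs below are hypothetical, and the salts are kept short only for readability.

```python
import hashlib

def salted_pseudo_id(meter_id, salt):
    return hashlib.sha512((meter_id + salt).encode("utf-8")).hexdigest()

# Hypothetical per-project salts; real ones would be long random strings
# kept in a password manager.
salts = {"project_a": "salt-for-project-a", "project_b": "salt-for-project-b"}
meter_ids = ["341279", "341280", "341281"]

export_a = {salted_pseudo_id(m, salts["project_a"]) for m in meter_ids}
export_b = {salted_pseudo_id(m, salts["project_b"]) for m in meter_ids}

# No pseudonymized ID appears in both exports, so the two datasets
# cannot be joined on it.
shared = export_a & export_b
print(len(shared))  # prints 0
```

Because the same meters get unrelated pseudonymized IDs in each export, a recipient holding both datasets cannot combine them through the ID column, which is the “mosaic effect” risk described for shared cross-tables.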
Of course, there is no such thing as 100 % data security; there will always be some way for someone outside your organization to de-pseudonymize your data. Even modern hashing algorithms that are considered safe today may not be safe tomorrow. For all we know, they could already have been broken. But the point remains: pick a pseudonymization strategy, use the latest proven and tested functions if you go for hashing, and do the best you can to protect your customers’ privacy.
Pseudonymizing meter data using salty hashing is something we’re currently testing at Elektro Celje in data exports for the EU projects we’re involved in. We would love to hear your opinion on this subject and your suggestions for improvements. Thank you.
[1] Wisniewski C., Smart meter hacking can disclose which TV shows and movies you watch. Available at: https://nakedsecurity.sophos.com/2012/01/08/28c3-smart-meter-hacking-can-disclose-which-tv-shows-and-movies-you-watch/
[2] 28c3: Smart Hacking for Privacy. Available on YouTube: https://youtu.be/YYe4SwQn2GE
[3] British Information Commissioner’s Office, Intention to fine British Airways £183.39m under GDPR for data breach. Available at: https://ico.org.uk/about-the-ico/news-and-events/news-and-blogs/2019/07/ico-announces-intention-to-fine-british-airways/