24th april 2023
duckdb & the general data protection regulation - a match made in heaven
in this post we will
i was recently made aware of spotify’s gdpr data download dashboard, most people just think of cookie banners but the right to portability and access of pii is really radical and makes some really interesting and, for want of a better word, 'confessional' work possible if you are a sufficient level of anorak.
duckdb is so great, there are a lot of hyped 'data' focused projects which i think are complete vaporware (have you tried airbyte?), but i really think duckdb is the future of local analytics and the ‘early to mid scale’ data infrastructure will be a graduation from excel / google sheets to something that either is, or is powered by, duckdb.
spotify send the listening history information over as a zip file of json array blobs. this is perfect for duckdb’s read_json function, we just feed it in a globbed descriptor of the files (endsong_*.json) and then tell duckdb about the structure and data type of the json array, in this specific case we end up with a query that looks like this.
there are some problems with this, the timestamps aren’t always in the right format so we’ve set them to strings and integers (as that is what they are in the raw data), some of the artist and track names are blank, and our nulls aren’t nully. i also want some extra fields such as how old i was when i listened to the song.
we then need to clean this up a bit, we want to write the transform to our extract and load, i wrote a relatively simple view which looks like this.
from this we can do all the trivial stuff.. which two songs bookend our listening history?
the freemasons remix of uninvited, which we first listened to at 2011-12-04 17:07:17 (i think i can actually remember my mum asking me to play this song?), and more recently, can we do this? by speakers corner quarter at 2023-04-19 22:10:30.
but there’s a lot more interesting work we can do! one of my favourite things spotify wrapped does is give you your top song that year - but i can’t remember this every year, and i’d like to know what it was historically.
to grab our top track, by year, by play time (actual listened minutes i feel is a better figure for what songs we like as opposed to play count) we can use this query.
|16||kanye west||all of the lights||1167844||1|
|17||alex metric||open your eyes - tim mason festival remix||3493505||1|
|18||siren||buckets of blood||461199||1|
|21||objekt||kern, vol. 3 - continuous mix||3078130||1|
|22||kaitlyn aurelia smith||a new day||1387916||1|
|24||mac-talla nan creag||ballad of glen nevis||12606009||1|
|27||big thief||simulation swarm||11097261||1|
this is recognisable to me, i’d actually been trying to remember the name of that siren song recently, too.
my overarching point is that gdpr, and organizations that are large enough to have proper compliance with it, have paved the way for a world in which the "personal dashboard" is more feasible than ever before. exploring information that these organizations have collected about you over the years can be incredibly rewarding. there is no need to constantly update a library or logbook; this information has been collected passively from your behavior over the years. by examining this data, you can learn a great deal about yourself and your habits. all it takes is a duck themed in-process analytics database and a bit of patience.
thanks for reading!