
3rd may 2023

gdpr and duckdb - journeys through space and time

thanks to the people who reached out after my last post - there were more of you than i thought there would be (literally one person). we especially extend a warm thanks and welcome to those of you who said you couldn't understand what i was on about but that it seemed cool.

a combination of duckdb releasing their new geospatial extension, a conversation with my friend josh in the pub, and a long ago activated toggle on a menu deep inside the google maps app made me think about using a workflow similar to the one from the spotify post to query my location history - and, even better, publicly exposing the results to any onlooker who cares to google me!

the first step is to download our location history data from google. this is trivial enough - they have a good portal (google takeout) for doing so here.

we get this in a big blob of json, and interestingly you can see how google has added features to it over time. initially we effectively just get the device the reading was taken on, the timestamp and the latitude and longitude. however, towards the end of our readings (from ~2019 onwards) we start to get all manner of interesting things such as whether the device is charging, or the estimated likelihood that the user is currently on a ferry.

i have written a simple (and not great, nor fast, nor 'idiomatic', before some nerd emails me) rust program that takes this json in and converts it to a format that is easier to work with; we then use a jq oneliner to convert this to a csv.

we then take our old friend duckdb and install the geospatial extension. taking the csv effectively unchanged, we load the data into duckdb. the query we use to do this looks something like this.
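(a minimal sketch rather than the exact query i ran - it assumes the csv has latitude, longitude and timestamp columns, and the table and file names are illustrative:)

```sql
INSTALL spatial;
LOAD spatial;

-- load the csv and build a geometry point per reading
-- (ST_Point takes x = longitude, y = latitude, matching geojson's coordinate order)
CREATE TABLE readings AS
SELECT
    latitude,
    longitude,
    timestamp,
    ST_Point(longitude, latitude) AS geom
FROM read_csv_auto('locations.csv');
```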

the result of this query is a table with a row for each reading (a reading is the latitude and longitude i was recorded at, and the timestamp that recording was taken at); in total there are over 170000 rows. it is worth saying here that this data is lossy: if my phone was off or i wasn't using google maps, it is likely we don't have a record of that location. as a mental model for how the data is collected, it is particularly rich if you were getting public transport to wherever you were travelling and had to check directions multiple times along the way.

similarly to the spotify post, there's now some relatively easy analysis we can run on this data. my life for the past few years has been a combination of working and living in london, working and living in edinburgh, working in london and living in fife, and working in london and living in edinburgh. due to this, i have travelled up and down the country quite a lot. we can use a query to work out our average latitude and longitude and then group by year to see how this has changed over time.
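(the query is simple enough - something like this sketch, against the readings table from above:)

```sql
-- average recorded position, grouped by year
SELECT
    year(timestamp) AS year,
    avg(latitude)  AS avg_latitude,
    avg(longitude) AS avg_longitude
FROM readings
GROUP BY year
ORDER BY year;
```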

[map: average recorded position per year, with markers for 2014, 2019, 2022 and 2023]

i moved to london about a fifth of the way through 2019 and you can see that from the map - the distance of the '2019' marker from london is about what you'd expect for spending four fifths of the year there, plus weekend trips and christmas at home.

as with everything, there are limits to how interesting data from a single source (where i cannot publish non-aggregated location values without doxxing myself) can be. analysis which joins multiple sources together produces far more interesting and novel insights. luckily, since most 'event' data has a timestamp (including this data source!) we can link across this common temporal dimension to produce a more interesting analysis!

to further revisit the previous post, i remember a couple of years ago spotify had a number of billboard adverts which referenced individual users' behaviours near those billboards.

i quite like the idea of doing something like this for an individual user (me), and i have spent enough time on trains that i have a varied travel (and listening) history on them. what if i built a map which showed, for each county, the top song by playtime that i listened to while in it?

to achieve this we need to take our location history and enrich it with the county that each latitude and longitude falls within. the first step is to download a geojson file which contains the boundaries of each county in the uk; this is available on github thanks to the hard work of martin chorley. then - and i am going to be frank here, we are on a journey into the completely undocumented - we use duckdb's geospatial extension to load the geojson file into our database.
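(this turns out to be a one-liner via ST_Read, which reads the file through gdal - the file name here is illustrative, and the columns you get back depend on the geojson's properties:)

```sql
-- read the county boundaries straight into a table
CREATE TABLE counties AS
SELECT * FROM ST_Read('counties.geojson');
```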

this results in us having a list of counties, with each county's name and the polygon representation (in lat/long coordinates) of its boundary.

after this, we take our location history and classify which county each point falls within using the st_within function. this is naturally computationally expensive work with little room for shortcuts, so it will take some time even on a fast computer.
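(a sketch, reusing the names from the earlier snippets - i'm assuming the county name surfaced from the geojson's properties as a county_name column, and that the geometry column came out as geom:)

```sql
-- tag each reading with the county whose polygon contains it
CREATE TABLE located_readings AS
SELECT
    r.latitude,
    r.longitude,
    r.timestamp,
    c.county_name
FROM readings r
JOIN counties c
  ON ST_Within(r.geom, c.geom);
```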

after this we have a record of the county that each measurement falls within, but we need to transform this into a record of the start time and end time of each county visit. this is done with a more complex query, sketched below; the transformation is visible in the tables that follow.
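(the shape of the query is the classic 'gaps and islands' pattern - a sketch, again assuming the located_readings table from above:)

```sql
WITH changes AS (
    SELECT
        timestamp,
        county_name,
        -- 1 whenever the county differs from the previous reading's county
        CASE WHEN county_name IS DISTINCT FROM
                  lag(county_name) OVER (ORDER BY timestamp)
             THEN 1 ELSE 0 END AS new_visit
    FROM located_readings
),
visits AS (
    SELECT
        timestamp,
        county_name,
        -- a running sum gives every contiguous run of readings its own visit id
        sum(new_visit) OVER (ORDER BY timestamp) AS visit_id
    FROM changes
)
SELECT
    min(timestamp) AS start_time,
    max(timestamp) AS end_time,
    county_name
FROM visits
GROUP BY visit_id, county_name
ORDER BY start_time;
```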

latitude    longitude   timestamp                county_name
51.5578667  -0.0566702  2019-08-03 20:40:26.477  hackney
51.5578667  -0.0566702  2019-08-03 20:42:26.719  hackney
51.5578667  -0.0566702  2019-08-03 20:44:27.393  hackney
51.5578667  -0.0566702  2019-08-03 20:46:32.986  hackney

this is a raw record of a location at a certain point in time, but what we actually want is 'windows' of time spent in counties, like below.

start_time               end_time                 county_name
2019-08-14 14:03:26.595  2019-08-14 14:05:28.031  city of london
2019-08-14 16:28:32.333  2019-08-14 16:30:34.493  tower hamlets
2019-08-14 16:37:43.401  2019-08-14 16:56:01.897  city of london
2019-08-14 16:57:36.188  2019-08-14 16:59:36.292  islington

we then do some extended analysis (grouping the song plays into counties based on that query, ranking the songs by playtime within each county, and selecting only the top song per county), and eventually we get to a table with a record of each county, the top song, and the artist that song is by.
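(a sketch of the join and ranking - it assumes the windows from the previous query were saved as a county_visits table, and that the spotify plays sit in a hypothetical plays table with track_name, artist_name, played_at and ms_played columns:)

```sql
WITH plays_in_counties AS (
    -- place each play inside the county visit window it falls into
    SELECT
        v.county_name,
        p.track_name,
        p.artist_name,
        p.ms_played
    FROM plays p
    JOIN county_visits v
      ON p.played_at BETWEEN v.start_time AND v.end_time
),
ranked AS (
    -- total playtime per song per county, ranked within each county
    SELECT
        county_name,
        track_name,
        artist_name,
        row_number() OVER (
            PARTITION BY county_name
            ORDER BY sum(ms_played) DESC
        ) AS playtime_rank
    FROM plays_in_counties
    GROUP BY county_name, track_name, artist_name
)
SELECT county_name, track_name, artist_name
FROM ranked
WHERE playtime_rank = 1;
```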

from this we can then generate a map. i've included only the central london section below, as the image is absolutely massive, but you can see the effect we achieve by interleaving these data sources.

[map labels, central london section:
metallica - enter sandman
heather leigh - gold teeth
varg²™ - red line ii (127 sätra c)
florist - moon begins
damien dubrovnik - arrow 2
kara-lis coverdale - grafts
blackhaine - saddleworth
mica levi - love
tim hecker - this life
björk - arisen my senses
source direct - black rose - original mix]

i have made a playlist with some of these songs included.

if there is anything to take from this, i suppose it is the frankly incredible level of data google stores about its users. is the accuracy of bus times in cities (this location data helps build the prediction models) so beneficial on a population scale that it is worth it? i am genuinely unsure to be honest, but it is definitely worth being conscious of. reading articles and books about large technology firms and their practices is one thing, but placing it in your personal context is very much another. i have another more involved, more collaborative (?) project in the works, so please keep an eye out if you are interested in using tooling like this against your own data.

thanks again to anyone who took the time to read this!