3rd may 2023
gdpr and duckdb - journeys through space and time
thanks to the people who reached out after my last post - there were more of you than i thought there would be (literally one person). i'd especially like to extend a warm thanks and welcome to those of you who said you couldn’t understand what i was on about but that it seemed cool.
a combination of duckdb releasing their new geospatial extension, a conversation with my friend josh in the pub, and a long-ago-activated toggle on a menu deep inside the google maps app made me think about using a workflow similar to the one from the spotify post to query my location history. even better, i could publicly expose this to any onlooker who cares to google me!
the first step is to download our location history data from google. this is trivial enough, and they have a good portal for doing so here.
we get this in a big blob of json, and interestingly you can see how google has added features to it over time. initially we effectively just get the device the reading was taken on, the timestamp and the latitude and longitude. however, towards the end of our readings (from ~2019 onwards) we start to get all manner of interesting things such as whether the device is charging, or the estimated likelihood that the user is currently on a ferry.
i have written a simple (and not great, nor fast, nor 'idiomatic', before some nerd emails me) rust program that takes this json in and converts it to a format that is easier to work with. we then use a jq one-liner to convert this to a csv.
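to give a rough idea of what this flattening step involves, here is a minimal python sketch (this is not the rust program or the jq one-liner, just an illustration of the same idea). it assumes the export has a top-level `locations` list whose entries carry `latitudeE7` and `longitudeE7` integers, plus either an older `timestampMs` string or a newer iso `timestamp` string. those field names, the function names and the sample readings are all assumptions for illustration.

```python
import csv
import io
import json
from datetime import datetime, timezone


def parse_timestamp(record):
    """Return a UTC datetime from either timestamp style (field names are assumed)."""
    if "timestamp" in record:
        # assumed newer style: ISO 8601 string such as "2022-03-01T09:30:00.000Z"
        return datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    # assumed older style: milliseconds since the unix epoch, stored as a string
    return datetime.fromtimestamp(int(record["timestampMs"]) / 1000, tz=timezone.utc)


def to_rows(export):
    """Flatten the export into dicts holding only timestamp, latitude and longitude."""
    rows = []
    for record in export.get("locations", []):
        if "latitudeE7" not in record or "longitudeE7" not in record:
            continue  # skip readings that carry no coordinates
        rows.append({
            "timestamp": parse_timestamp(record).strftime("%Y-%m-%d %H:%M:%S"),
            # E7 values are degrees multiplied by ten million
            "latitude": record["latitudeE7"] / 1e7,
            "longitude": record["longitudeE7"] / 1e7,
        })
    return rows


def to_csv(rows):
    """Serialise the flattened rows as csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["timestamp", "latitude", "longitude"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# made-up sample mixing the two assumed record styles, plus one unusable entry
sample = json.loads("""
{"locations": [
  {"timestampMs": "1388534400000", "latitudeE7": 515074000, "longitudeE7": -1278000},
  {"timestamp": "2022-03-01T09:30:00.000Z", "latitudeE7": 559533000,
   "longitudeE7": -31883000, "deviceTag": 1234},
  {"timestamp": "2022-03-01T09:31:00.000Z"}
]}
""")

rows = to_rows(sample)
csv_text = to_csv(rows)
print(csv_text)
```

dropping every field except the timestamp and the coordinates sidesteps the way the schema grows over time, at the cost of throwing away the more interesting later additions.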
we then take our old friend duckdb and install the geospatial extension. taking the csv effectively unchanged, we load the data into duckdb using a single query.
the result of this query is a table with a row for each reading (a reading is the latitude and longitude i was recorded at, and the timestamp that recording was taken at). in total there are over 170,000 rows. it is worth saying here that this data is lossy: if my phone was off or i wasn't using google maps, it is likely we don't have a record of that location. as a mental model of how the data is collected, it is particularly rich if you were getting public transport to wherever you were travelling and had to check directions multiple times along the way.
as with the spotify post, there's now some relatively easy analysis we can run on this data. my life for the past few years has been a combination of working and living in london, working and living in edinburgh, working in london and living in fife, and working in london and living in edinburgh. because of this, i have travelled up and down the country quite a lot. we can use a query to work out our average latitude and longitude, grouped by year, to see how this has changed over time.
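the same calculation can be sketched outside the database. the python below is a stand-in for the query rather than the query itself: it assumes each reading is a `(timestamp, latitude, longitude)` tuple with timestamps formatted like `2019-06-01 08:00:00`, and the function name and sample readings are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def yearly_centroid(readings):
    """Group readings by calendar year and average latitude and longitude separately."""
    by_year = defaultdict(list)
    for timestamp, lat, lon in readings:
        year = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").year
        by_year[year].append((lat, lon))
    # a plain arithmetic mean of degrees: a reasonable approximation over an area
    # the size of britain, but not a true centroid on a sphere
    return {
        year: (fmean(lat for lat, _ in points), fmean(lon for _, lon in points))
        for year, points in sorted(by_year.items())
    }


# made-up readings: one year spent around london, the next split with edinburgh
readings = [
    ("2019-06-01 08:00:00", 51.5, -0.1),
    ("2019-11-15 18:30:00", 51.5, -0.1),
    ("2020-02-03 09:00:00", 51.5, -0.1),
    ("2020-09-20 12:00:00", 55.9, -3.2),
]

centroids = yearly_centroid(readings)
for year, (lat, lon) in centroids.items():
    print(year, round(lat, 4), round(lon, 4))
```

because readings are denser on days with lots of direction-checking, an average like this is weighted towards where the phone was busiest, not strictly where the most time was spent.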