The Data Worker’s Manifesto
This article is a re-post of an article that first appeared on www.geo.ebp.ch.
I’m quite sure it’s not best practice to give one’s talk an unintelligible title. Nevertheless, that’s what I did, so let me explain what the different parts mean:
I chose “state of the union” as a fancy way of expressing that I’m directing my talk primarily at fellow geoinformation and data people.
With “data” we usually refer to raw observations of some phenomenon. We’ll discuss later how helpful that definition turns out to be.
“Enabling tech” would usually expand to “technology”, and the term denotes a technical development that makes novel applications possible in the first place. However, in the context of this talk it may be worthwhile to keep the second potential meaning of the stub “tech” – “technique” – in mind as well.
Finally, the ‽ is called an interrobang and nicely reflects the semantic ambivalence of combining ? and ! into one punctuation mark.
Sometime in the last decade, we as a society have moved from a situation where data was usually scarce to one where (many forms of) data are abundant. Where before, the first step of analysis was often one of interpolation between valuable data points, we now filter, subsample, and aggregate our data. Not all domains are the same in this respect, obviously. But I think the generalisation pretty much holds, as (often ill-applied) labels such as “big data” or “humongous data” indicate. (Well, the latter is obviously a joke; but think about why it works as such.)
Big drivers of this development are a) the Web and its numerous branches and platforms and b) smartphones, tablets, phablets and what have you, or more broadly speaking: embedded sensors, GPS loggers, tracking and fleet management systems, automotive sensors, wearables, ‘self-tracking’ or ‘quantified-self’ technology, networked hardware such as appliances (think Internet of Things) and the like.
In what follows I’m going to talk primarily about crowdsourced data. (In other contexts, crowdsourced (geographic) data is also called, e.g., Volunteered Geographic Information (VGI), a term fraught with problems, or User-Generated Content (UGC).) But some of the assertions also hold for data in general.
Crowdsourced data, i.e. data that:
– is gathered from many contributors,
– in a decentralised fashion,
– following (at best) informal rules and protocols,
– voluntarily, unknowingly or with incentives,
has some issues.
The large-scale advent of this crowdsourced data of course coincides with the development of the so-called Web 2.0 (in German also referred to as the ‘participation Web’), where anybody could not just be a consumer, but also (at least, in theory) a producer, or: a produser. Or so we were told.
But: crowdsourced data is biased
Assuming (somewhat simplifying) that the presence of people effects the build-up of infrastructure, in an ideal world this map would feature a uniform colour everywhere. However, there are regions where relative data density in OSM exceeds that of other regions by 3–4 orders of magnitude! Compare this to the density of placenames in the GeoNames Gazetteer!
Clearly, offering an “open platform” and encouraging participation is not enough to really level the playing field in user generation of content. In some regions people might not have the means (spare time, economic freedom, hardware, software, education, technical skills, access to stable (broadband) Internet, motivation) to participate, or they might, e.g., have reservations about this kind of project or the organisations behind it.
Spatially heterogeneous density is just one example of bias we find in crowdsourced data. Another one is termed user contribution bias, where a very small proportion of contributors (think Twitter users, Flickr photographers, Facebook posters, …) creates a large proportion of the data. Depending on the platform we see very lopsided distributions, with a few percent of users being behind a large share of the content. In his Master’s thesis, Timo Grossenbacher found that in his sample of Twitter, 7% of the users created 50% of the tweets. Despite all techno-optimism: clearly, not everyone is a produser and clearly not all contributors create equal amounts of content!
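To make user contribution bias tangible, here is a minimal sketch of how one might measure it. It assumes you have one author identifier per contributed item; the function name `smallest_user_share` and the toy sample are made up for illustration and are not taken from the thesis mentioned above.

```python
from collections import Counter


def smallest_user_share(author_per_item, content_share=0.5):
    """Fraction of users (most active first) needed to reach
    at least `content_share` of all contributed items."""
    counts = Counter(author_per_item)
    if not counts:
        raise ValueError("no contributions given")
    total = sum(counts.values())
    covered = 0
    # Walk through users from most to least active and accumulate their items.
    for rank, (_, n_items) in enumerate(counts.most_common(), start=1):
        covered += n_items
        if covered >= content_share * total:
            return rank / len(counts)
    return 1.0


# Made-up toy sample: 10 users, 40 items, one very active user "a".
sample = ["a"] * 20 + ["b"] * 8 + ["c"] * 5 + list("defghij")

print(smallest_user_share(sample))       # share of users behind 50% of items
print(smallest_user_share(sample, 0.8))  # share of users behind 80% of items
```

Reporting a number like this alongside any analysis of crowdsourced data gives the audience a first idea of how few contributors may be behind the patterns shown.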
Talking of different kinds of bias: OSM has also been found to be sexist, for example. OSM contributors (like in many crowdsourcing initiatives) tend to be young, male, technologically minded, with above-average education. Narrow groups of contributors may, inadvertently or consciously, favour their own interests in creating content.
OSM’s “bottom-up data model” (basically, the community discusses and decides what is mapped how) gives contributors allocative power, i.e. what most people (or the most industrious contributors?) adopt as their practice has good chances to evolve into community (best?) practice.
Further, some patterns in crowdsourced data may be very surprising.
One example this talk has already touched upon is user contribution bias, where a small group dominates the crowdsourcing activity. A more complicated example of surprising insights hidden in crowdsourced data is in the figure on the left. Remember that for Wikipedia, the self-declared repository for the sum of all human knowledge, it is well known that the spatial distribution of geocoded and “geocode-able” articles is strongly biased. A map I made with my colleagues at the OII shows that a part of Europe features as many Wikipedia articles as the rest of the world. (By the way, there is this interesting Wikipedia page that discusses all kinds of biases that affect Wikipedia.)
Now, as the figure shows, despite this known severe lack of content e.g. in the Middle East and North Africa (MENA), only about a third of edits that are made by contributors in that region are about articles in the same region. Surprisingly, a large proportion of MENA’s (in absolute terms low) editing activity is geared towards contributing to articles outside their own region, about phenomena in North America, Asia and Europe. If you expected, as many people do, that contributors edit mostly about phenomena in their immediate environment and that they tend to “fill in gaps” in content, this insight comes as a surprise.
Cultural, personal (education, careers, family relations, travel, tourism, …), linguistic, historical, colonial, political, and many more reasons may play into this.
The new abundance of data, the proliferation of open (government) data, APIs and the current popularity of information or data visualisation (infoviz/dataviz) as well as data-driven journalism (DDJ) have led to many more people and institutions obtaining, processing, analysing, visualising and disseminating data.
While this may be welcomed by data-inclined people in general, unfortunately it sometimes leads to people attaching false meaning to data or reading insights into data that the data does not support.
This example shows geocoded tweets in response to the release of a Beyoncé album. In my opinion, while technologically interesting, the visualisation has severe flaws in terms of (re)presentation, cartography and infoviz best practices. But even more importantly, it utterly fails to mention, e.g., that a) Twitter users are a highly biased, small subgroup of the general population, that b) the proportion of geocoded tweets is estimated to be in the low single-digit percent range (often, < 3% is indicated!), that c) user contribution bias is likely at play, that d) geolocation may be faulty, etc.
Finally, this figure shows the result of “ping[ing] all the devices on the internet” according to John Matherly of Shodan. This figure and story went viral; they appeared e.g. on Gizmodo, The Next Web, IFLScience!, and many more.
Turns out, if you dig a bit deeper, there are some rather important disclaimers: e.g. a very limited window during which the analysis was reportedly carried out and, more importantly, the fact that only devices addressed using IPv4 were pinged, not those using IPv6. You can read about these in this Reddit thread.
Turns out some countries in Asia that have recently invested heavily in broadband Internet infrastructure, and also large parts of Africa where the Internet is mainly used on mobile devices, use IPv6 and thus show up as black holes, or rather dark regions, on this “map of the Internet”.
Sadly, the relative lack of access to the Internet, of content and of netizens in Africa is a fact (cf. the OII Wikipedia analyses mentioned above). However, the situation, at least in terms of connected devices, is not as dire as this map would have you believe!
However, I think the very fact that the map played into this common narrative of unconnected, offline regions is an important factor in its massive proliferation (a.k.a. ‘going viral’). Unfortunately, it seems all this sharing happened without discussion of the data source, data collection method, processing steps, and important disclaimers about the data’s validity and legitimacy – and, let’s face it, with very little critical reception and reflection on the part of the audience, i.e. us.
The effects? – The original tweet has been retweeted more than 5,500 times! Go figure.
With these examples in mind, let’s turn to the classic Data-Information-Knowledge-Wisdom workflow or pyramid. In the DIKW mindset, data is composed of raw observations. Only structuring, pattern-detection, and asking the right questions turn data into information. Memorised, recalled and applied in a suitable context, information becomes knowledge. And finally, there’s the wisdom stage that is concerned with ‘why’ rather than ‘what’, ‘when’, ‘where’ and ‘how’ etc.
Well, turns out, one can argue rather well that ‘raw data’ does not, in fact, exist.
Data – and I would argue also crowdsourced data – is usually collected with an intent, an application in mind or, if not that, at least with a specific method, from a certain group of people, by a defined group of people, using a certain measuring device. Whether this happens implicitly or explicitly and willingly does not matter in this context. Clearly, however, these factors all potentially affect the applications the data can sensibly be used for.
So, there goes the title of my talk: ‘data’ may not actually be ‘raw’. And overly focussing on technology and missing out on the underlying technique can be dangerous!
Putting it bluntly: Unlike this car, data is never general-purpose.
For all these reasons, and because I care about our profession and about what is being done with data in society at large (think: data-driven journalism (or churnalism), evidence-based politics, etc.), I would like to propose:
The Data Worker’s Manifesto.
It consists of only a few, easily memorised principles:
1. Know your data!
Know the sources of your data, the collection methodology, the sample size and composition, consistency, pre-processing steps possibly carried out by others or by yourself and, more generally, the lineage, biases, quality issues, limitations, legitimate applications and use cases. Know all these very well. If you don’t, try to find out. If you can’t be sure, refrain from using the data.
2. Discuss data and how it’s being used.
The Internet and social media are wonderful things where thousands of links are shared. Every so often you may see an analysis with un(der)-documented input data or methodology.
Reflect critically on what others may share blindly. If you have questions: remember, the Web is a two-way street these days. Gently but firmly ask them and make your sharing of, and investment into, any analysis dependent on the answer.
3. Create and share metadata!
If you do data-based analyses and produce visualisations, always keep track of what you have done with the data: Did you apply filters? Remove (suspected) outliers? Subsample, downsample, disaggregate, aggregate, combine, split, join, clean, purge, merge, … the data? Document your steps and assumptions and share this metadata to give your collaborators and your audience insight into data provenance and your methodology, along with the results.
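As a minimal sketch of what keeping track could look like in practice, the snippet below records every processing step together with the number of records going in and coming out. The class `ProvenanceLog`, its fields and the sample readings are all hypothetical and only meant to illustrate the idea; real projects may prefer an established metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProvenanceLog:
    """Hypothetical helper that remembers what was done to a dataset."""
    source: str
    steps: list = field(default_factory=list)

    def apply(self, data, description, func):
        # Run the processing step and note its effect on the record count.
        result = func(data)
        self.steps.append(
            {"step": description, "rows_in": len(data), "rows_out": len(result)}
        )
        return result

    def to_json(self):
        # Serialise the log so it can be shared along with the results.
        return json.dumps(asdict(self), indent=2)


# Made-up readings with one missing value and one suspected outlier.
readings = [3.1, 2.9, 250.0, 3.4, None, 3.0]

log = ProvenanceLog(source="hypothetical sensor export")
cleaned = log.apply(readings, "drop missing values",
                    lambda xs: [x for x in xs if x is not None])
cleaned = log.apply(cleaned, "remove suspected outliers (> 100)",
                    lambda xs: [x for x in xs if x <= 100])

print(log.to_json())
```

The JSON produced at the end is exactly the kind of metadata that can travel with a map or chart, so that filters and outlier removal are not silently lost.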
If you share your insights as social media content (e.g. a map as a PNG file), I recommend burning the metadata into the result, i.e. putting the metadata somewhere in the content so that it’s hard to remove. Because said content will – at some point – be taken, proliferated, received and analysed out of context. Guaranteed.
3b is very similar to 3: Create and share metadata!
Seriously: I know metadata is uncool and not sexy at all to maintain. But nothing good comes from not doing it!
4. Experts are valuable.
While the “end of theory” has been proclaimed, I think the “report of [its] death has been greatly exaggerated”.
Being, or being in contact with, a domain specialist is still very valuable. Sometimes, especially for harder, i.e. more interesting, analyses, it’s indispensable. At the very least, expert knowledge may save you from doing something silly with data you don’t completely understand.
5. We’re in this together.
I feel we are all still coming to terms with the new opportunities the Web and some of the data-related developments I mentioned provide to us (let alone methodological and computational improvements and societal developments). It can be a bumpy, but in any case an exciting, ride, so let’s buckle up, meet and talk and share our experiences – but that’s obviously why all of you have come to this GeoBeer in the first place!
I feel that despite all these potential pitfalls we should perceive the abundant data, especially new data types such as crowdsourced and open government data, as huge opportunities!
I’m convinced that, with the right people and the right mindset, we can do great things, privately or politically, that have the potential to improve our respective environments ever so slightly.
I feel that Switzerland as a democratic and affluent country provides us with an especially friendly environment to get involved, in business, in research, and in societal goals.
Thank you all for your attention!