🔢 On AI and community-owned data

TL;DR – Imagine a world where communities nurture, own, govern and sell access to their data. Community-owned data could give us a better, more vibrant AI ecosystem – and reward those whose data makes AI models tick.


Can aerial images of roads (like the ones above), taken by drones and analysed through custom-built algorithms, tell you which roads need repair most urgently?

A few years ago, through the UK government-funded Frontier Tech Hub, we¹ set out to test whether this was possible.

Drive through rural Tanzania, and you’ll likely have a bumpy, uncomfortable journey. Dry, uneven roads criss-cross a country three times the size of the United Kingdom. And it’s not just your comfort at stake. In Tanzania, living more than 5km from a good quality road leads to a steep drop in average income, and isolation from markets, friends, and family.

The money to fix roads was there, from both donors and the government of Tanzania’s Roads Fund Board, but it was spent based on anecdote, politics, and happenstance. You might get lucky, if a Minister happened to drive down your road or your local politician had favour and connections. If not, your roads might languish in disrepair for years. Could a tech-enabled, automated workflow help prioritise road maintenance budgets more fairly?

Existing algorithms, trained on smooth roads in Europe and the US, had no value in Tanzania, where 95%+ of roads are dirt or partly paved. We needed to train an algorithm from the ground up, specifically for Tanzania. And for that, we needed to build a training dataset: aerial images of Tanzanian roads, hand-labelled as “very good”, “good”, “poor” and so on, based on eyewitness information about their condition.

This was many months’ work. It involved evaluating and stitching together photos from drones battered by wind and rain; driving hundreds of miles in cars fitted with sensors to monitor bumpiness; and manually assigning classification labels to photos². And it wasn’t just James, Bertrand and the University of Nottingham team. Local partners managed the drones, stored and transferred datasets, and supported the work at each step. And the UK government, through Tim and the Frontier Tech Hub, provided the funding to make it possible.

Hard work and smarts came together to create this asset: a first-of-its-kind labelled dataset that evaluated the condition of unpaved roads. A dataset representing the kinds of roads you find in most of the world.

Write to James, and he’ll share his dataset for free. It’s a publicly available counterbalance to data skewed towards the Global North.

This is a big win for representation. By proactively including unpaved roads in training datasets and AI models, the team extended the benefits of AI to Tanzania and other non-Western, low-income countries. And you can extend the template beyond roads: by diligently building a dataset that represents the under-represented, it’s possible to extend the benefits of AI to more people³.


Which is good – but is it enough?

I don’t think it is. To truly extend the benefits of AI, we need to move beyond representation, and towards participation and ownership.

What if more communities crafted datasets that articulated their world? Datasets that are crafted, not scraped, and capture the nuance and truth of a part of the world, from the vantage point of those who experienced it.

And what if, instead of this data being available for anyone to use, the community had the right to govern who could use these datasets, and how?

And lastly, what if, any time the data was used, the community could accrue some of the reward in return for their efforts, rather than just consuming the downstream benefits of more representative AI models?

More representative data leads to higher quality AI models⁴. The developers of these models (OpenAI, Anthropic, Google, etc.), and those who use them to build products, stand to benefit a lot from them.

Through participation and ownership, we have the chance to align this incentive with real, sustainable benefits for those whose data might contribute to these higher-quality models. Benefits in the form of governance – agency over which models use their data and how. And benefits in the form of financial rewards, for their part in improving AI performance.

There’s a group of people (in a part of the world you might not expect) paving the way towards this future.

The government of New Zealand is building a Māori Data Governance Model (MDGov) – putting data on the Māori – the indigenous people of New Zealand – in Māori hands. Concretely, this means:

  • A Māori Chief Data Steward, with the resource and mandate to represent Māori interests and perspectives when it comes to data that represents them.
  • Access to data on Māori to be based on trust and clear intentions. Rather than one-off consent, data access should be considered an ongoing relationship, with the ability for data on Māori to be repatriated to Māori hands through the Chief Data Steward.
  • The ability for Māori to lead and design their own ways of classifying the data. And for Māori (again, through the Chief Data Steward) to be the primary judges of data quality and accuracy, based on relevance to their experience as a group. For instance, Māori researchers and communities co-designed the Te Kupenga post-census survey of 5000 Māori adults, ensuring it captured subjective factors related to household and extended family wellbeing.

A model like this gestures to a world where datasets are nurtured and governed by those they represent. And while the MDGov is primarily about how the government of New Zealand should engage with data on Māori, their model shows us how any group of people could nurture, own and offer their data to enrich the AI models we use.


What about the other side of the marketplace? Who would buy these community-owned datasets?

I spend a lot of time working with edtech startups and products. When it comes to generative AI, one of the things that excites startups the most is how it might give teachers the means to create their own high-quality educational content: translating their deep understanding of their students and subject matter into quizzes, exercises, videos, stories, and so on.

How do we ensure teachers can use the best possible AI models? One way is to train those models on datasets that communities of teachers themselves have nurtured, reflective of what they believe is high-quality content that children engage with and learn from. These models might be worth billions of dollars. And if we can get community-owned data right, then teachers and their wisdom could be at the heart of it, and benefit financially from their labour.

That’s one example.

Here’s another.

I’m writing the first draft of this piece using Lex, a writing tool that uses AI to suggest edits, prompt thoughts, and review work. Right now, you can switch between the mainstream LLMs (GPT, Claude, and Gemini) in Lex.

What if, in the future, you could switch between customised versions of LLMs, trained on high-quality datasets that reflect the experiences and wisdom of different communities? Prompts from an AI model trained on data from experts in African history? Or Lord of the Rings enthusiasts? Or those with lived experience of the Pakistan-India partition?⁵

Or what if government officials could access AI models trained on data from groups traditionally excluded from the policymaking process? Philosophers. Environmental activists. Minority groups. Geographically isolated communities.

We’re moving to a world where popular LLMs are being tweaked and customised for different products and use cases. Let’s tweak them with data built from the ground up: data that is best-in-class because it is nurtured and owned by communities.


To summarise, I’m making the case for two new ways to think about data.

Today, we think of data as generated by and belonging to an individual, who can consent to giving it away. Tomorrow, we’ll think of data as crafted by, governed by, and belonging to communities.

Today, we think of data as scraped from the (open and freely accessible) internet. Tomorrow, we’ll think of access to data as negotiated with communities, who get a share of the reward from its use in training high-quality, representative AI models.

I genuinely believe community-owned data can be a ‘win’ for everyone. And it can avoid two of technology’s biggest and most common unintended consequences.

The first is inequality. Tech tends to deliver outsized benefits to a small, concentrated group who build, own and invest in it. But by turning a much bigger chunk of the world’s population into active AI producers, we can spread the value AI generates much more equally, rewarding the wisdom and sweat of those whose data makes AI models work in the first place.

The second is homogenisation. If we develop the language and incentives for community-owned data, to be applied in different ways to different models for different use cases, we’ll all live in a more vibrant AI ecosystem. Away from chatbots spitting out the same cookie-cutter text, and towards products and outputs that authentically represent the world’s diversity.

That’s an AI future I’m excited about.


¹  The team included Tim (a pioneering civil servant at the British High Commission in Tanzania), and James and Bertrand (machine learning specialists at the University of Nottingham). A truly mission-driven group.

² You can read about their efforts in this delightful series of blogs the team wrote while they were doing it.

³ That might be Black patients getting the right diagnosis from AI at the hospital, because of more racially representative training data. Or non-English speakers benefitting from non-English AI models (trained on non-English language data), in the same way we benefit from ChatGPT.

⁴ To give an example: according to a paper published by DeepMind, greater representation “has the clear instrumental value of generating more robust tests and thereby surfacing more blindspots leading to higher performing systems”.

⁵ To their credit, OpenAI is already moving in this direction with their “store for custom versions of ChatGPT”. It’s unclear whether creators of GPTs get fair compensation for their efforts.


🎬 Thanks to Alice Sholto-Douglas, Felicity Brand, Kavir Kaycee and Daniel Sisson for looking at drafts of this

🤔 Got thoughts? Don’t keep them to yourself. Email me on asad@asadrahman.io. Let’s figure this out together.

If you enjoyed this, subscribe to get pieces just like it straight to your inbox. One email, every so often (and nothing else).

A selection of roads from Zanzibar, in Tanzania. Photo credit: NLabs. Originally published in a blog by James Golding.