How Twitter got research right

22nd Nov 2021 | Source: The Verge

Platformer is an independent newsletter from Casey Newton that follows the intersection of Silicon Valley and democracy.

It has not been a happy time for researchers at big tech companies. Hired to help executives understand platforms’ shortcomings, research teams inevitably reveal inconvenient truths. Companies hire teams to build “responsible AI” but bristle when their employees discover algorithmic bias. They boast about the quality of their internal research but disavow it when it makes its way to the press. At Google, this story played out in the forced departure of ethical AI researcher Timnit Gebru and the subsequent fallout for her team. At Facebook, it led to Frances Haugen and the Facebook Files.

For these reasons, it’s always of note when a tech platform takes one of those unflattering findings and publishes it for the world to see. At the end of October, Twitter did just that. Here’s Dan Milmo in the Guardian:

Twitter has admitted it amplifies more tweets from right-wing politicians and news outlets than content from left-wing sources.

The social media platform examined tweets from elected officials in seven countries – the UK, US, Canada, France, Germany, Spain and Japan. It also studied whether political content from news organisations was amplified on Twitter, focusing primarily on US news sources such as Fox News, the New York Times and BuzzFeed. […]

The research found that in six out of seven countries, apart from Germany, tweets from right-wing politicians received more amplification from the algorithm than those from the left; right-leaning news organisations were more amplified than those on the left; and generally politicians’ tweets were more amplified by an algorithmic timeline than by the chronological timeline.

Twitter’s blog post on the subject was accompanied by a 27-page paper that further describes the study’s findings, research, and methodology. It wasn’t the first time this year that the company had volunteered empirical support for years-old, speculative criticism of its work. This summer, Twitter hosted an open competition to find bias in its photo-cropping algorithms. James Vincent described the results at The Verge:

The top-placed entry showed that Twitter’s cropping algorithm favors faces that are “slim, young, of light or warm skin color and smooth skin texture, and with stereotypically feminine facial traits.” The second and third-placed entries showed that the system was biased against people with white or grey hair, suggesting age discrimination, and favors English over Arabic script in images.

These results were not hidden in a closed chat group, never to be discussed. Instead, Rumman Chowdhury — who leads machine learning ethics and responsibility at Twitter — presented them publicly at DEF CON and praised participants for helping to illustrate the real-world effects of algorithmic bias. The winners were paid for their contributions.

On one hand, I don’t want to overstate Twitter’s bravery here. The results the company published, while opening it up to some criticisms, are nothing that is going to result in a full Congressional investigation. And the fact that the company is much smaller than Google or Facebook parent Meta, which both serve billions of people, means that anything found by its researchers is less likely to trigger a global firestorm.

At the same time, Twitter doesn’t have to do this kind of public-interest work. And in the long run, I do believe it will make the company stronger and more valuable. But it would be relatively easy for any company executive or board member to make a case against doing it.

For that reason, I’ve been eager to talk to the team responsible for it. This week, I met virtually with Chowdhury and Jutta Williams, product lead for Chowdhury’s team. (Inconveniently, as of October 28th, the Twitter team’s official name is Machine Learning Ethics, Transparency, and Accountability: META.) I wanted to know more about how Twitter is doing this work, how it has been received internally, and where it’s going next.

Here’s some of what I learned.

Twitter is betting that public participation will accelerate and improve its findings. One of the more unusual aspects of Twitter’s AI ethics research is that it is paying outside volunteer researchers to participate. Chowdhury was trained as an ethical hacker and observed that her friends working in cybersecurity are often able to protect systems more nimbly by creating financial incentives for people to help.

“Twitter was the first time that I was actually able to work at an organization that was visible and impactful enough to do this and also ambitious enough to fund it,” said Chowdhury, who joined the company a year ago when it acquired her AI risk management startup. “It’s hard to find that.”

It’s typically difficult to get good feedback from the public about algorithmic bias, Chowdhury told me. Often, only the loudest voices are addressed, while major problems are left to linger because affected groups don’t have contacts at platforms who can address them. Other times, issues are diffused across the population, and individual users may not feel the negative effects directly. (Privacy tends to be an issue like that.)

Twitter’s bias bounty helped the company build a system to solicit and implement that feedback, Chowdhury told me. The company has since announced it will stop cropping photos in previews after its algorithms were found to largely favor the young, white, and beautiful.

Responsible AI is hard in part because no one fully understands decisions made by algorithms. Ranking algorithms in social feeds are probabilistic: they show you things based on how likely you are to like, share, or comment on them. But there’s no one algorithm making that decision. It’s typically a mesh of multiple (sometimes dozens of) different models, each making guesses that are then weighted differently according to ever-shifting factors.
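To make the idea of a "mesh" of weighted models concrete, here is a minimal sketch in Python. It is not Twitter's system: the three toy models, the feature names, and the weights are all hypothetical, chosen only to show how blending several engagement guesses can reorder a feed when the weights shift.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tweet:
    tweet_id: str
    # Hypothetical precomputed engagement probabilities for one viewer.
    features: Dict[str, float]


# Each "model" here is a stand-in that reads one probability from the tweet.
# Real systems would run separate learned predictors instead.
Model = Callable[[Tweet], float]

models: Dict[str, Model] = {
    "like": lambda t: t.features["p_like"],
    "reply": lambda t: t.features["p_reply"],
    "share": lambda t: t.features["p_share"],
}


def blend(tweet: Tweet, weights: Dict[str, float]) -> float:
    """Weighted sum of every model's guess for this tweet."""
    return sum(weights[name] * model(tweet) for name, model in models.items())


def rank(tweets: List[Tweet], weights: Dict[str, float]) -> List[str]:
    """Return tweet ids ordered from highest to lowest blended score."""
    ordered = sorted(tweets, key=lambda t: blend(t, weights), reverse=True)
    return [t.tweet_id for t in ordered]


tweets = [
    Tweet("a", {"p_like": 0.7, "p_reply": 0.1, "p_share": 0.1}),
    Tweet("b", {"p_like": 0.2, "p_reply": 0.5, "p_share": 0.1}),
]

# Same tweets, same models; only the weights differ.
equal_order = rank(tweets, {"like": 1.0, "reply": 1.0, "share": 1.0})
reply_heavy_order = rank(tweets, {"like": 1.0, "reply": 3.0, "share": 1.0})
```

With equal weights the well-liked tweet "a" ranks first; tripling the weight on replies puts the reply-magnet "b" on top. Nothing in the code knows who wrote either tweet, which is part of why tracing an outcome back to a cause is hard.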

That’s a major reason why it’s so difficult to confidently build AI systems that are “responsible” — there is simply a lot of guesswork involved. Chowdhury pointed out the difference here between working on responsible AI and cybersecurity. In security, she said, it’s usually possible to unwind why the system is vulnerable, so long as you can discover where the attacker entered it. But in responsible AI, finding a problem often doesn’t tell you much about what created it.

That’s the case with the company’s research on amplifying right-wing voices, for example. Twitter is confident that the phenomenon is real but can only theorize as to the reasons behind it. It may be something in the algorithm. But it might also be user behavior: maybe right-wing politicians tend to tweet in a way that elicits more comments, for example, which then causes their tweets to be weighted more heavily by Twitter’s systems.

“There’s this law of unintended consequences to large systems,” said Williams, who previously worked at Google and Facebook. “It could be so many different things. How we’ve weighted algorithmic recommendation may be a part of it. But it wasn’t intended to be a consequence of political affiliation. So there’s so much research to be done.”

There’s no real consensus on what ranking algorithms “should” do. Even if Twitter does solve the mystery of what’s causing right-wing content to spread more widely, it won’t be clear what the company should do about it. What if, for example, the answer lies not in the algorithm but in the behavior of certain accounts? If right-wing politicians simply generate more comments than left-wing politicians, there may not be an obvious intervention for Twitter to make.

“I don’t think anybody wants us to be in the business of forcing some sort of social engineering of people’s voices,” Chowdhury told me. “But also, we all agree that we don’t want amplification of negative content or toxic content or unfair political bias. So these are all things that I would love for us to be unpacking.”

That conversation should be held publicly, she said.

Twitter thinks algorithms can be saved. One possible response to the idea that all our social feeds are unfathomably complex and cannot be explained by their creators is that we should shut them down and delete the code. Congress now regularly introduces bills that would make ranking algorithms illegal, or make platforms legally liable for what they recommend, or force platforms to let people opt out of them.

Twitter’s team, for one, believes that ranking has a future.

“The algorithm is something that can be saved,” Williams said. “The algorithm needs to be understood. And the inputs to the algorithm need to be something that everybody can manage and control.”

With any luck, Twitter will build just that kind of system.

Of course, the risk in writing a piece like this is that, in my experience, teams like this are fragile. One minute, leadership is pleased with its findings and enthusiastically hiring for it; the next, it’s withering by attrition amidst budget cuts or reorganized out of existence amidst personality conflicts or regulatory concerns. Twitter’s early success with META is promising, but META’s long-term future is not assured.

In the meantime, the work is likely to get harder. Twitter is now actively at work on a project to make its network decentralized, which could shield parts of the network from its own efforts to build the network more responsibly. Twitter CEO Jack Dorsey has also envisioned an “app store for social media algorithms,” giving users more choice around how their feeds are ranked.

It’s difficult enough to rank one feed responsibly — what it means to make a whole app store of algorithms “responsible” will be a much larger challenge.

“I’m not sure it’s feasible for us to jump right into a marketplace of algorithms,” Williams said. “But I do think it’s possible for our algorithm to understand signal that’s curated by you. So if there’s profanity in a tweet, as an example: how sensitive are you to that kind of language? Are there specific words that you would consider very, very profane and you don’t want to see? How do we give you controls for you to establish what your preferences are so that that signal can be used in any kind of recommendation?

“I think that there’s a third-party signal more than there is a third-party bunch of algorithms,” Williams said. “You have to be careful about what’s in an algorithm.”
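What Williams describes, a single ranking system that takes in signal curated by the user, might look something like the sketch below. Everything in it is an assumption made for illustration: the `Preferences` fields, the 0-to-1 profanity level (imagined as coming from some classifier), and the penalty formula are hypothetical and are not drawn from anything Twitter has published.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Preferences:
    # Hypothetical user-set controls.
    profanity_tolerance: float = 0.5  # 0.0 = penalize any profanity, 1.0 = never penalize
    blocked_words: Set[str] = field(default_factory=set)


def adjust_score(base_score: float, text: str, profanity_level: float,
                 prefs: Preferences) -> Optional[float]:
    """Return a preference-adjusted score, or None to hide the tweet entirely."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    if words & prefs.blocked_words:
        return None  # the user said they never want to see these words
    # Penalize only the profanity that exceeds the user's stated tolerance.
    excess = max(0.0, profanity_level - prefs.profanity_tolerance)
    return base_score * (1.0 - excess)


def rerank(candidates: List[Tuple[str, float, float]],
           prefs: Preferences) -> List[str]:
    """candidates holds (text, base_score, profanity_level) tuples."""
    kept = []
    for text, base_score, profanity_level in candidates:
        adjusted = adjust_score(base_score, text, profanity_level, prefs)
        if adjusted is not None:
            kept.append((adjusted, text))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]


candidates = [
    ("Well, darn it all", 0.9, 0.8),
    ("Lovely weather today", 0.7, 0.0),
    ("That heck of a game", 0.8, 0.3),
]

strict = Preferences(profanity_tolerance=0.0, blocked_words={"darn"})
relaxed = Preferences(profanity_tolerance=1.0)

strict_feed = rerank(candidates, strict)
relaxed_feed = rerank(candidates, relaxed)
```

The strict user never sees the first tweet and gets the mildest one on top, while the relaxed user sees all three in their original score order. The underlying ranker is the same in both cases; only the signal supplied by the user changes.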