that the problem needing fixing is the women themselves. Why can’t a woman be more like a man?

Rachael Tatman rubbishes the suggestion that the problem lies in women’s voices rather than the technology that doesn’t recognise them: studies have found that women have ‘significantly higher speech intelligibility’,27 perhaps because women tend to produce longer vowel sounds28 and tend to speak slightly more slowly than men.29 Meanwhile, men have ‘higher rates of disfluency, produce words with slightly shorter durations, and use more alternate (‘sloppy’) pronunciations’.30 With all this in mind, voice-recognition technology should, if anything, find it easier to recognise female rather than male voices – and indeed, Tatman writes that she has ‘trained classifiers on speech data from women and they worked just fine, thank you very much’.

Of course, the problem isn’t women’s voices. It’s our old friend, the gender data gap. Speech-recognition technology is trained on large databases of voice recordings, called corpora. And these corpora are dominated by recordings of male voices. As far as we can tell, anyway: most don’t provide a sex breakdown on the voices contained in their corpus, which in itself is a data gap of course.31 When Tatman looked into the sex ratio of speech corpora, only TIMIT (‘the single most popular speech corpus in the Linguistic Data Consortium’) provided data broken down by sex. It was 69% male. But contrary to what these findings imply, it is in fact possible to find recordings of women speaking: according to the data on its website, the British National Corpus (BNC)32 is sex-balanced.33

Voice corpora are not the only male-biased databases we’re using to produce what turn out to be male-biased algorithms. Text corpora (made up of a wide variety of texts from novels, to newspaper articles, to legal textbooks) are used to train translation software, CV-scanning software, and web search algorithms. And they are riddled with gendered data gaps. Searching the BNC34 (100 million words from a wide range of late twentieth-century texts) I found that female pronouns consistently appeared at around half the rate of male pronouns.35 The 520-million-word Corpus of Contemporary American English (COCA) also has a 2:1 male to female pronoun ratio despite including texts as recent as 2015.36 Algorithms trained on these gap-ridden corpora are being left with the impression that the world actually is dominated by men.
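The kind of pronoun count described above can be sketched in a few lines of Python. This is only an illustration, not the search actually run on the BNC or COCA: the pronoun lists, the function names and the sample sentence are all assumptions made for the example, and a real corpus query would need to handle tokenisation and ambiguous forms more carefully.

```python
import re
from collections import Counter

# Hypothetical pronoun lists chosen for illustration; the search terms
# used for the BNC and COCA figures are not given in the text.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def pronoun_counts(text):
    """Count male and female pronouns in a piece of text."""
    # Lowercase the text and split it into runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        if token in MALE_PRONOUNS:
            counts["male"] += 1
        elif token in FEMALE_PRONOUNS:
            counts["female"] += 1
    return counts


def male_to_female_ratio(text):
    """Return male pronouns per female pronoun, or None if there are no female pronouns."""
    counts = pronoun_counts(text)
    if counts["female"] == 0:
        return None
    return counts["male"] / counts["female"]


# Made-up sample sentence, not drawn from any corpus.
sample = "He said his report was late. She agreed, and he thanked her. He left."
print(pronoun_counts(sample))
print(male_to_female_ratio(sample))
```

Run over a whole corpus rather than one sentence, a ratio near 2 would correspond to the 2:1 male to female pattern reported for the BNC and COCA.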

Image datasets also seem to have a gender data gap problem: a 2017 analysis of two commonly used datasets containing ‘more than 100,000 images of complex scenes drawn from the web, labeled with descriptions’ found that images of men greatly outnumber images of women.37 A University of Washington study similarly found that women were under-represented on Google Images across the forty-five professions they tested, with CEO being the most divergent result: 27% of CEOs in the US are female, but women made up only 11% of the Google Image search results.38 Searching for ‘author’ also delivered an imbalanced result, with only 25% of the Google Image results for the term being female compared to 56% of actual US authors, and the study also found that, at least in the short term, this discrepancy did affect people’s views of a field’s gender proportions. For algorithms, of course, the impact will be more long term.

As well as under-representing women, these datasets are misrepresenting them. A 2017 analysis of common text corpora found that female names and words (‘woman’, ‘girl’, etc.) were more associated with family than career; it was the opposite for men.39 A 2016 analysis of a popular publicly available dataset based on Google News found that the top occupation linked to women was ‘homemaker’ and the top occupation linked to men was ‘Maestro’.40 Also included in the top ten gender-linked occupations were philosopher, socialite, captain, receptionist, architect and nanny – I’ll leave it to you to guess which were male and which were female. The 2017 image dataset analysis also found that the activities and objects included in the images showed a ‘significant’ gender bias.41 One of the researchers, Mark Yatskar, saw a future where a robot trained on these datasets who is unsure of what someone is doing in the kitchen ‘offers a man a beer and a woman help washing dishes’.42

These cultural stereotypes can be found in artificial intelligence (AI) technologies already in widespread use. For example, when Londa Schiebinger, a professor at Stanford University, used translation software to translate a newspaper interview with her from Spanish into English, both Google Translate and Systran repeatedly used male pronouns to refer to her, despite the presence of clearly gendered terms like ‘profesora’ (female professor).43 Google Translate will also convert Turkish sentences with gender-neutral pronouns into English stereotypes. ‘O bir doktor’ (which means ‘S/he is a doctor’) is translated into English as ‘He is a doctor’, while ‘O bir hemşire’ (which means ‘S/he is a nurse’) is rendered ‘She is a nurse’. Researchers have found the same behaviour for translations into English from Finnish, Estonian, Hungarian and Persian.

The good news is that we now have this data – but whether or not coders will use it to fix their male-biased algorithms remains to be seen. We have to hope that they will, because machines aren’t just reflecting our biases. Sometimes they are amplifying them – and by a significant amount. In the 2017 images study, pictures of cooking were over 33% more likely to involve women than men, but algorithms trained on this dataset connected pictures of kitchens with women 68% of the time. The paper also found that the higher the original bias, the stronger the amplification effect, which perhaps explains how the algorithm came to label a photo of a portly balding man standing in front of a stove as female. Kitchen > male pattern baldness.

James Zou, assistant professor of biomedical science at Stanford, explains why this matters. He gives the example of someone searching for ‘computer programmer’ on a program trained on a dataset that associates that term more closely with a man than a woman.44 The algorithm could deem a
