AI and the List of Dirty, Naughty … and Otherwise Bad Words

Comedian George Carlin had a list of Seven Words You Can’t Say on TV. Parts of the internet have a list of 402 banned words, plus one emoji, 🖕.

Slack uses the open source List of Dirty, Naughty, Obscene, and Otherwise Bad Words, found on GitHub, to help groom its search suggestions. Open source mapping project OpenStreetMap uses it to sanitize map edits. Google artificial intelligence researchers recently removed web pages containing any of the words from a dataset used to train a powerful new system for making sense of language.

LDNOOBW, as intimates know it, has been a low-profile utility for years but recently became more prominent. Blocklists try to bridge the gulf between the mechanical logic of software and the organic contradictions of human behavior and language. But such lists are inevitably imperfect and can spawn unintended consequences. Some AI researchers have criticized Google’s use of LDNOOBW as narrowing what its software knows about humanity. Another, similar open source list of “bad” words caused chat software Rocket.Chat to block attendees of an event called Queer in AI from using the word queer.

The initial List of Dirty, Naughty, Obscene, and Otherwise Bad Words was drawn up in 2012 by employees of stock photo site Shutterstock. Dan McCormick, who led the company’s engineering team, wanted a roll of the obscene or objectionable as a safeguard for the autocomplete feature of the site’s search box. He was happy for users to type whatever they wanted, but didn’t want the site to actively suggest terms people might be surprised to see pop up in an open office. “If someone types in B, you don’t want the first word that comes up to be boobs,” says McCormick, who left Shutterstock in 2015.
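The behavior McCormick describes, where users may type anything but the site never volunteers a listed term, can be sketched in a few lines of Python. This is only an illustration, not Shutterstock's code: the function `suggest`, the sample search terms, and the one-word blocklist are all assumptions made up for the example.

```python
def suggest(prefix, candidates, blocklist, limit=5):
    """Return up to `limit` candidates that start with `prefix`,
    skipping any candidate that appears on the blocklist."""
    prefix = prefix.lower()
    blocked = {term.lower() for term in blocklist}
    matches = [
        c for c in candidates
        if c.lower().startswith(prefix) and c.lower() not in blocked
    ]
    return matches[:limit]


# Hypothetical data: popular search terms and a one-entry blocklist.
CANDIDATES = ["boobs", "boats", "books", "bridges"]
BLOCKLIST = ["boobs"]

print(suggest("b", CANDIDATES, BLOCKLIST))
```

The filter touches only the suggestions. A query typed in full would still be searched, which matches the distinction McCormick draws between what users may type and what the site proposes.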

He and some coworkers took Carlin’s Seven Words, tapped the darkest corners of their brains, and used Google to learn sometimes bewildering slang for sexual acts. They posted their initial 342 entries to GitHub with a note inviting contributions and the suggestion that it could “spice up your next game of Scrabble :)”

Almost nine years later, LDNOOBW is longer and more influential than ever. Shutterstock employees continued curating their list of crudities after McCormick’s departure, with help from outside suggestions, eventually reaching 403 entries for English. The list won users outside the company, including at OpenStreetMap and Slack. There are versions of the list in more than two dozen other languages, including three entries for Klingon (QI’yaH!) and 37 for Esperanto. Shutterstock declined to comment on the list and claimed it is no longer a company project, although it still bears the company’s name and copyright assertion on GitHub.

Artificial intelligence researchers at Google recently won LDNOOBW new fame—and infamy. In 2019, company researchers reported using the list to filter the web pages included in a collection of billions of words scraped from the web called the Colossal Clean Crawled Corpus. The censored collection powered a recent Google project that created the largest language AI system the company has revealed, showing strong results on tasks such as reading comprehension questions or tagging sentences from movie reviews as positive or negative.

Similar projects have created software that generates astonishingly fluid text. But some AI researchers question Google’s use of LDNOOBW to filter its AI input, saying the filter blacked out a lot of knowledge. Striking out pages featuring obscenities, racial slurs, anatomical terms, or the word sex regardless of context would remove abusive forum postings, but also swaths of educational and medical material, news coverage about sexual politics, and information about Paridae songbirds. Google didn’t discuss that side effect in its research papers.
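A rough Python sketch shows why a context-blind filter has this side effect. It is not Google's pipeline, whose details the research papers would have to supply. The helper names, the sample pages, and the matching rule (whole words, case-insensitive) are assumptions chosen for illustration, and the blocklist holds only two entries mentioned in this article.

```python
import re


def build_pattern(blocklist):
    """Compile one regex that matches any listed term as a whole word or phrase."""
    # Longest terms first so multi-word phrases are tried before single words.
    escaped = sorted((re.escape(t) for t in blocklist), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_pages(pages, blocklist):
    """Keep only pages containing none of the listed terms, ignoring context."""
    if not blocklist:
        return list(pages)
    pattern = build_pattern(blocklist)
    return [page for page in pages if not pattern.search(page)]


# Hypothetical inputs for illustration only.
BLOCKLIST = ["sex", "gay sex"]
PAGES = [
    "Clinic guide: questions to ask your doctor about sex and contraception.",
    "News analysis: how lawmakers debated sex education this year.",
    "Recipe: how to bake sourdough bread at home.",
    "Visitor information for Sussex county parks.",
]

kept = filter_pages(PAGES, BLOCKLIST)
print(kept)
```

In this sketch the medical guide and the news analysis are dropped along with anything abusive, because the filter sees only that a listed word is present. The word-boundary rule spares "Sussex," but it cannot tell a health page from a hostile one.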

“Words on the list are many times used in very offensive ways but they can also be appropriate depending on context and your identity,” says William Agnew, a machine learning researcher at the University of Washington. He is a cofounder of the community group Queer in AI, whose web pages on improving diversity in the AI workforce would likely be excluded from Google’s AI primer for using the word sex. LDNOOBW appears to reflect historical patterns of disapproval of homosexual relationships, Agnew says, with entries including “gay sex” and “homoerotic.”

Author: showrunner