IBM Diversity in Faces (DiF) dataset of "annotations of one million publicly available face images." 1 The dataset was created in 2019 to address existing biases in overwhelmingly light-skinned and male-dominated facial datasets. IBM believed that the dataset "will encourage deeper researcher on this important topic and accelerate efforts towards creating more fair and accurate face recognition systems." 1
However, the dataset caused a fierce backlash after it became widely known through an article published on NBC News. IBM is now being sued in a class action lawsuit led by a photographer whose photos and biometrics were used without consent. He is seeking damages of $5,000 for each intentional violation of the Illinois Biometric Information Privacy Act, or $1,000 for each negligent violation, for everyone affected. The lawsuit aims to represent all Illinois citizens whose biometric data was used in the dataset.
Exposing.ai has made available a portion of the IBM DiF dataset metadata on https://exposing.ai/search. Use your Flickr username, NSID, photo URL, or #hashtag to check if your images were used.
The IBM DiF dataset originates from the YFCC100M dataset, a fountainhead of image data from which many derivative datasets were developed. YFCC100M provides a large collection of approximately 100 million photos that were labeled with Creative Commons licenses which were assumed to be valid for use in machine learning or facial recognition systems. The MegaFace dataset was also developed using this method. Both used a subset of images of the YFCC100M dataset specifically because they were licensed with Creative Commons. According to the authors of the IBM DiF dataset, they "proceeded with the download [of the face image] only if the license type was Creative Commons." 1 Images that did not use a Creative Commons license were not used. In effect, Creative Commons licenses increased the chances of face images appearing in publicly available face recognition dataset, many of which contain images of children, weddings, and family vacations. An analysis of the top image tags in IBM DiF (see charts below) shows that "party", "family", "wedding", and "friends" are among the top 10 most used tags in the dataset sample analyzed by Exposing.ai.
Following the publication of the NBC story in March 2019, then president of Creative Commons Ryan Merkley addressed the situation by explaining that even though some may disagree, "fair use allows all types of content to be used freely."
Creative Commons later published a FAQ that addresses the issue further, clarifying that images may be used in machine learning datasets as long as the license conditions are respected.
"Where a CC-licensed work is distributed as part of a database or dataset, and assuming copyright (or in the European Union, copyright or sui generis database rights) is triggered, then the license conditions must be respected. This means providing the required attribution information in a way that is reasonable under the circumstances. Our licenses allow for some flexibility, and in some cases that may be as simple as providing a link to the website where the relevant attribution information is provided. Visit our marking practices page for more information."
The IBM DiF dataset includes many images licensed with attribution requirements (see charts below). However, IBM does not provide any attribution links, nor any public credit for any of the images. The authors of photographs in the IBM DiF dataset are denied attribution for the original works in any meaningful, human readable way. This could amount to thousands or possibly hundreds of thousands of violations of the Creative Commons licensing requirement that have yet to be legally challenged.
If you reference or use any data from the Exposing.ai project, cite our original research as follows:
@online{Exposing.ai, author = {Harvey, Adam. LaPlace, Jules.}, title = {Exposing.ai}, year = 2021, url = {https://exposing.ai}, urldate = {2021-01-01} }
If you reference or use any data from IBM DiF cite the author's work:
@article{Whitelam2017IARPAJB, author = "Whitelam, Cameron and Taborsky, Emma and Blanton, Austin and Maze, Brianna and Adams, Jocelyn C. and Miller, Tim and Kalka, Nathan D. and Jain, Anil K. and Duncan, James A. and Allen, Kristen E and Cheney, Jordan and Grother, Patrick", title = "IARPA Janus Benchmark-B Face Dataset", journal = "2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)", year = "2017", pages = "592-600" }