60 of approximately 1 million face images collected by IBM from Flickr for use in the face recognition research. Faces are blurred to protect privacy. Visualization by Adam Harvey / Exposing.ai licensed under CC-BY-NC with original images licensed and attributed under Creative Commons CC-BY (attribution required, no commercial use).

IBM Diversity in Faces

IBM Diversity in Faces (DiF) dataset of "annotations of one million publicly available face images." 1 The dataset was created in 2019 to address existing biases in overwhelmingly light-skinned and male-dominated facial datasets. IBM believed that the dataset "will encourage deeper researcher on this important topic and accelerate efforts towards creating more fair and accurate face recognition systems." 1

However, the dataset caused a fierce backlash after it became widely known through an article published on NBC News. IBM is now being sued in a class action lawsuit led by a photographer whose photos and biometrics were used without consent. He is seeking damages of $5,000 for each intentional violation of the Illinois Biometric Information Privacy Act, or $1,000 for each negligent violation, for everyone affected. The lawsuit aims to represent all Illinois citizens whose biometric data was used in the dataset.

An Illinois resident initiated a class-action lawsuit against IBM for using his biometric information without consent in the IBM Diversity in Faces dataset. The lawsuit is still active.

Exposing.ai has made available a portion of the IBM DiF dataset metadata on https://exposing.ai/search. Use your Flickr username, NSID, photo URL, or #hashtag to check if your images were used.

Origins

The IBM DiF dataset originates from the YFCC100M dataset, a fountainhead of image data from which many derivative datasets were developed. YFCC100M provides a large collection of approximately 100 million photos that were labeled with Creative Commons licenses which were assumed to be valid for use in machine learning or facial recognition systems. The MegaFace dataset was also developed using this method. Both used a subset of images of the YFCC100M dataset specifically because they were licensed with Creative Commons. According to the authors of the IBM DiF dataset, they "proceeded with the download [of the face image] only if the license type was Creative Commons." 1 Images that did not use a Creative Commons license were not used. In effect, Creative Commons licenses increased the chances of face images appearing in publicly available face recognition dataset, many of which contain images of children, weddings, and family vacations. An analysis of the top image tags in IBM DiF (see charts below) shows that "party", "family", "wedding", and "friends" are among the top 10 most used tags in the dataset sample analyzed by Exposing.ai.

Following the publication of the NBC story in March 2019, then president of Creative Commons Ryan Merkley addressed the situation by explaining that even though some may disagree, "fair use allows all types of content to be used freely."

Attribution Denied

Creative Commons later published a FAQ that addresses the issue further, clarifying that images may be used in machine learning datasets as long as the license conditions are respected.

"Where a CC-licensed work is distributed as part of a database or dataset, and assuming copyright (or in the European Union, copyright or sui generis database rights) is triggered, then the license conditions must be respected. This means providing the required attribution information in a way that is reasonable under the circumstances. Our licenses allow for some flexibility, and in some cases that may be as simple as providing a link to the website where the relevant attribution information is provided. Visit our marking practices page for more information."

The IBM DiF dataset includes many images licensed with attribution requirements (see charts below). However, IBM does not provide any attribution links, nor any public credit for any of the images. The authors of photographs in the IBM DiF dataset are denied attribution for the original works in any meaningful, human readable way. This could amount to thousands or possibly hundreds of thousands of violations of the Creative Commons licensing requirement that have yet to be legally challenged.

Supplementary Images

IBM DiF Mosaic 1. Licensed under Creative Commons CC-BY with attribution

IBM DiF Mosaic 2. Licensed under Creative Commons CC-BY with attribution

IBM DiF Mosaic 3. Licensed under Creative Commons CC-BY with attribution

IBM DiF Mosaic 4. Licensed under Creative Commons CC-BY with attribution

IBM DiF Mosaic 5. Licensed under Creative Commons CC-BY with attribution

IBM DiF Mosaic 6. Licensed under Creative Commons CC-BY with attribution

IBM Diversity in Faces Attributes
Dataset Name	IBM DiF
Dataset Name Full	IBM Diversity in Faces
Total Images	1,070,000
Initial Purpose	Face recognition and cranio-facial analysis
Year Published	2019
Dataset Website	https://www.research.ibm.com/artificial-intelligence/trusted-ai/diversity-in-faces/

Photos from Flickr.com in IBM Diversity in Faces
Total Flickr Photos	1,070,000
Total Flickr Users	98,153
Active on Flickr.com*	98,153
Inactive/removed on Flickr.com*	971,847
API Data Accessed	October 2019
Included in YFC100M	98,154
Photos w/ Geo Data	29,582
Searchable on Exposing.ai	98,154

IBM DiF Copyright Distribution

IBM DiF Creative Commons license distribution | Download Data (CSV) | Download Chart (SVG)

IBM DiF Creative Commons License Distribution

IBM DiF Creative Commons license distribution | Download Data (CSV) | Download Chart (SVG)

IBM DiF Image Upload Year Distribution

IBM DiF Creative Commons license distribution | Download Data (CSV) | Download Chart (SVG)

Top 10 IBM DiF Image #Tags

Top 10 image #tags used in IBM DiF | Download Data (CSV) | Download Chart (SVG)

Top 10 Geocoded Cities IBM DiF

Top 10 cities for geocoded photos in IBM DiF | Download Data (CSV) | Download Chart (SVG)

Citing This Work

If you reference or use any data from the Exposing.ai project, cite our original research as follows:

@online{Exposing.ai,
  author = {Harvey, Adam},
  title = {Exposing.ai},
  year = 2021,
  url = {https://exposing.ai},
  urldate = {2021-01-01}
}

If you reference or use any data from IBM DiF cite the author's work:

@article{Whitelam2017IARPAJB,
    author = "Whitelam, Cameron and Taborsky, Emma and Blanton, Austin and Maze, Brianna and Adams, Jocelyn C. and Miller, Tim and Kalka, Nathan D. and Jain, Anil K. and Duncan, James A. and Allen, Kristen E and Cheney, Jordan and Grother, Patrick",
    title = "IARPA Janus Benchmark-B Face Dataset",
    journal = "2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",
    year = "2017",
    pages = "592-600"
}

References

1 a b cMichele Merler, et al. "Diversity in Faces". ArXiv abs/1901.10436. (2019):
2 Michele Merler, et al. "Facial Coding Scheme Reference 1 Craniofacial Distances". (2019):