Exposing.ai
VGG Face
Examples of Flickr images used in the VGG Face dataset that is used for training face recognition algorithms. Faces are blurred to protect privacy. Visualization by Adam Harvey / Exposing.ai licensed under CC-BY-NC with original images licensed and attributed under Creative Commons CC-BY (attribution required, no commercial use).
Examples of Flickr images used in the VGG Face dataset that is used for training face recognition algorithms. Faces are blurred to protect privacy. Visualization by Adam Harvey / Exposing.ai licensed under CC-BY-NC with original images licensed and attributed under Creative Commons CC-BY (attribution required, no commercial use).

VGG Face

VGG Face is a dataset of 2.6 million face images of 2,622 people that is used development face recognition technology. The development of VGG Face dataset was supported by United States Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. 1

The dataset is mostly comprised of celebrities, public figures, actors, and politicians whose names were chosen "by extracting males and females, ranked by popularity, from the Internet Movie Data Base (IMDB) celebrity list." 1 Information from IMDB about ethnicity, age, and kinship was also collected. Then names were then fed into Google Image Search, and 200 images were downloaded for each of the 5,000 preliminary identities. These initial 200 images were given to human annotators to determine which identities had sufficient imagery to be added to the VGG Face dataset. The updated list of names was deduplicated against the Labeled Faces in the Wild and YouTube Faces dataset, resulting in 2,622 identities. Apparently, at no point did any individual provide consent or information about how their biometric information was being used.

The VGG Face dataset was created to provide access to biometric data to researchers working on face recognition technologies. The authors cite the asymmetry between academic access to face data and the large datasets held private by Google and Facebook, whose datasets are orders of magnitude larger than any public dataset. But in making the VGG Face dataset "freely available to the research community", an assumption is made that celebrity biometrics are a free and public commodity. A full list of names can still be downloaded from the website http://www.robots.ox.ac.uk/~vgg/data/vgg_face/.

For example, a file called John_Oliver.txt includes dozens of URLs for images of the comedian John Oliver who coincidentally had railed against the dangers of face recognition in a June 15, 2020 episode of LastWeekTonight. Yet, John's face, along with 2,621 others have been influential in the development of face recognition tools. The primary justification provided in the author's research paper is that because Facebook and Google have hundreds of million of faces, academic researchers should also have access to similar mountains of data, including John Oliver's face. In considering their rationale it's important to also consider that development of the VGG Face dataset was funded by the Office of Director of National Intelligence and the Intelligence Advanced Research Projects Activity (IARPA), neither of whom are academic organizations.

Another interesting detail about VGG Face is that the licenses included in the VGG Face dataset claims the images are covered under a Creative Commons Attribution-NonCommercial license. However, earlier in the license it states that the VGG researchers do not provide images due to copyright restrictions. This is clarified when downloading the dataset and realizing that in fact many of the images are actually copyrighted photos downloaded from stock photo agencies. One example is this image of John Oliver and his wife taken at the White House and then later used in the VGG Face dataset to train face recognition algorithms http://image.shutterstock.com/display_pic_with_logo/9037/136659113/stock-photo-washington-april-john-oliver-and-wife-kate-norley-at-the-white-house-correspondents-dinner-136659113.jpg. The license file is available at http://www.robots.ox.ac.uk/~vgg/data/vgg_face/licence.txt and excerpted below:

Since the original images are subject to copyright, we do not make
them available directly. We instead provide URLs and associated face
detections from the dataset.

The images are covered under a Creative Commons
Attribution-NonCommercial 4.0 International license (Please read the
license terms here. http://creativecommons.org/licenses/by-nc/4.0/).
 Dataset comparison between the VGG Face dataset and larger, privatized, industrial datasets from the research paper "Deep Face Recognition" by Parkhi et al. that introduced the VGG Face dataset.
Dataset comparison between the VGG Face dataset and larger, privatized, industrial datasets from the research paper "Deep Face Recognition" by Parkhi et al. that introduced the VGG Face dataset.

Information Supply Chain

To help understand how VGG Face has been used around the world by commercial, military, and academic organizations; existing publicly available research citing VGG Face Dataset was collected, verified, and geocoded to show how AI training data has proliferated around the world. Click on the markers to reveal research projects at that location.

Citation data is collected using SemanticScholar.org then dataset usage verified and geolocated. Citations are used to provide an estimated overview of how and where images were used based on institutional affiliations. Thicker lines represent more citations. Please zoom in to see all institutions, as cities may have multiple points very close together.

VGG Face Copyright Distribution

Years
VGG Face Creative Commons license distribution | Download data

VGG Face Creative Commons License Distribution

Years
VGG Face Creative Commons license distribution | Download data

VGG Face Image Upload Year Distribution

Years
VGG Face Creative Commons license distribution | Download data

Top 10 VGG Face Image #Tags

Years
Top 10 image #tags used in VGG Face | Download data

Top 10 Geocoded Cities VGG Face

Years
Top 10 cities for geocoded photos in VGG Face | Download data

Citing This Work

If you reference or use any data from the Exposing.ai project, cite our original research as follows:

@online{Exposing.ai,
  author = {Harvey, Adam. LaPlace, Jules.},
  title = {Exposing.ai},
  year = 2021,
  url = {https://exposing.ai},
  urldate = {2021-01-01}
}

If you reference or use any data from VGG Face cite the author's work:

@inproceedings{Parkhi2015DeepFR,
    author = "Parkhi, Omkar M. and Vedaldi, Andrea and Zisserman, Andrew",
    title = "Deep Face Recognition",
    booktitle = "BMVC",
    year = "2015"
}

References

  • 1 abOmkar M. Parkhi, et al. "Deep Face Recognition". (2015):