VGG Face is a dataset of 2.6 million face images of 2,622 people that is used development face recognition technology. The development of VGG Face dataset was supported by United States Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. 1
The dataset is mostly comprised of celebrities, public figures, actors, and politicians whose names were chosen "by extracting males and females, ranked by popularity, from the Internet Movie Data Base (IMDB) celebrity list." 1 Information from IMDB about ethnicity, age, and kinship was also collected. Then names were then fed into Google Image Search, and 200 images were downloaded for each of the 5,000 preliminary identities. These initial 200 images were given to human annotators to determine which identities had sufficient imagery to be added to the VGG Face dataset. The updated list of names was deduplicated against the Labeled Faces in the Wild and YouTube Faces dataset, resulting in 2,622 identities. Apparently, at no point did any individual provide consent or information about how their biometric information was being used.
The VGG Face dataset was created to provide access to biometric data to researchers working on face recognition technologies. The authors cite the asymmetry between academic access to face data and the large datasets held private by Google and Facebook, whose datasets are orders of magnitude larger than any public dataset. But in making the VGG Face dataset "freely available to the research community", an assumption is made that celebrity biometrics are a free and public commodity. A full list of names can still be downloaded from the website http://www.robots.ox.ac.uk/~vgg/data/vgg_face/.
For example, a file called John_Oliver.txt
includes dozens of URLs for images of the comedian John Oliver who coincidentally had railed against the dangers of face recognition in a June 15, 2020 episode of LastWeekTonight. Yet, John's face, along with 2,621 others have been influential in the development of face recognition tools. The primary justification provided in the author's research paper is that because Facebook and Google have hundreds of million of faces, academic researchers should also have access to similar mountains of data, including John Oliver's face. In considering their rationale it's important to also consider that development of the VGG Face dataset was funded by the Office of Director of National Intelligence and the Intelligence Advanced Research Projects Activity (IARPA), neither of whom are academic organizations.
Another interesting detail about VGG Face is that the licenses included in the VGG Face dataset claims the images are covered under a Creative Commons Attribution-NonCommercial license. However, earlier in the license it states that the VGG researchers do not provide images due to copyright restrictions. This is clarified when downloading the dataset and realizing that in fact many of the images are actually copyrighted photos downloaded from stock photo agencies. One example is this image of John Oliver and his wife taken at the White House and then later used in the VGG Face dataset to train face recognition algorithms http://image.shutterstock.com/display_pic_with_logo/9037/136659113/stock-photo-washington-april-john-oliver-and-wife-kate-norley-at-the-white-house-correspondents-dinner-136659113.jpg. The license file is available at http://www.robots.ox.ac.uk/~vgg/data/vgg_face/licence.txt and excerpted below:
Since the original images are subject to copyright, we do not make them available directly. We instead provide URLs and associated face detections from the dataset. The images are covered under a Creative Commons Attribution-NonCommercial 4.0 International license (Please read the license terms here. http://creativecommons.org/licenses/by-nc/4.0/).
To help understand how VGG Face has been used around the world by commercial, military, and academic organizations; existing publicly available research citing VGG Face Dataset was collected, verified, and geocoded to show how AI training data has proliferated around the world. Click on the markers to reveal research projects at that location.
If you reference or use any data from the Exposing.ai project, cite our original research as follows:
@online{Exposing.ai, author = {Harvey, Adam. LaPlace, Jules.}, title = {Exposing.ai}, year = 2021, url = {https://exposing.ai}, urldate = {2021-01-01} }
If you reference or use any data from VGG Face cite the author's work:
@inproceedings{Parkhi2015DeepFR, author = "Parkhi, Omkar M. and Vedaldi, Andrea and Zisserman, Andrew", title = "Deep Face Recognition", booktitle = "BMVC", year = "2015" }