MegaFace is a large-scale public face recognition training dataset that serves as one of the most important benchmarks for commercial face recognition vendors. It includes 4,753,320 faces of 672,057 identities from 3,311,471 photos downloaded from 48,383 Flickr users' photo albums. All photos included a Creative Commons licenses, but most were not licensed for commercial use.
This analysis explores how the MegaFace face recognition dataset exploited the good intentions of Flickr users and the Creative Commons license system to advance facial recognition technologies around the world by companies including Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung 1, SenseTime, Sogou, Tencent, and Vision Semantics to name only a few. According to the press release from the University of Washington, "more than 300 research groups [were] working with MegaFace" as of 2016, including multiple law enforcement agencies.
To understand which licenses were applied to the images in the MegaFace dataset we analyzed the metadata for all 3,311,471 images from 48,383 Flickr accounts and found that 69% (2,284,369) of the images prohibited commercial use, while only 31% (1,027,102) allowed it. But all 3,311,471 images required some form of attribution, of which none was provided by the MegaFace dataset nor any of the research projects that used it. This would amount to 3,311,471 violations of Creative Commons licenses for each commercial use of the dataset if it were to be enforced.
Defining commercial use of training data is still a gray area. But the intent of the dataset is clear. According to the research paper introducing the dataset, the motivation for creating MegaFace was commercial in nature: "let's say one wishes to create an application that uses the best face recognition algorithm out there, how would they know which algorithm is better to implement or buy?" 1 In other words, how can commercial face recognition vendors prove their product is superior? Simple: they compete in open challenges, using the MegaFace dataset as a baseline for comparison with other algorithms, and then advertise the results.
According to BiometricUpdate.com, a news website for the biometrics industry, the MegaFace dataset has now become "one of the most reliable and popular frameworks of reference in assessing facial recognition performance, particularly on a massive scale". It frequently appears in press releases and promotional material for top facial recognition vendors.
The MegaFace dataset begins in 2004, the first year Flickr began offering free online photo sharing to Internet users. Since the beginning Flickr recommended and promoted Creative Commons (CC) licenses as a way to facilitate sharing and reposting images. Featured images on their homepage prominently displayed CC licensing, the majority of their licensing options were CC, and later they provided unlimited free hosting for images that used CC licenses. Their strategy worked. By 2010 Flickr had surpassed 100 million CC-licensed images.
Assuming that both creators and users understood the licensing agreement, this was a huge success. Photographers gained an audience, families could easily share images, publishers had access to free content, and the Internet was a better place.
But the assumptions around sharing were changing. Three years later, a group of researchers from Lawrence Livermore National Laboratory, Berkeley, Yahoo Labs, and In-Q-Tel's Lab41 realized this valuable resource could also be shared for science. Then in 2014 they released Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). At the time, and still today, it is the largest public multimedia collection ever released for machine learning, containing 99.2 million photos and 800,000 videos, along with their user-generated metadata. Their intention was to provide "publicly shareable and legally usable data that is flexible and rich enough to promote advancement [...] in achieving research growth and facilitating synergy within the research community". 3 Their strategy seemed to work, too.
In 2015 researchers at the University of Washington tapped YFCC100M to create the MegaFace face recognition dataset. All 4,753,320 annotated faces from 3,311,471 images in the MegaFace dataset were derived from the original YFC100M dataset. The only public dataset with a comparable number of images was Microsoft Research's MS-Celeb-1M dataset. Incidentally, MS-Celeb has since been withdrawn due to our joint investigation with the Financial Times.
In 2017, one year after the release of MegaFace, SenseTime Limited (CN) funded a new derivative dataset based on the original MegaFace dataset. Their new dataset, called MegaAge, was used to study facial age analysis. Then again in 2018, MegaFace was used to create another face dataset, called TinyFace, for the purpose of studying face recognition on low resolution imagery, such as CCTV. And yet again in 2019, the MegaFace dataset was used to create another face recognition dataset called DiveFace by a group called SensitiveNets from Madrid, which aims to "train unbiased and discrimination-aware face recognition algorithms".
Not only does MegaFace appear in an ever-growing list of research projects and derivative datasets funded and used by giant technology companies, it is also appears in patents. A 2018 patent from China called "Deep learning-based face recognition and face verification supervised learning method" (patent number CN108256450A) claims that "experimental data sets of the present invention comprises a largest face recognition database MegaFace Challenge" and that "The method of the present experiment only on MegaFace database to a data set of three gallery to test the proposed model of the present method." The figures included in the patent publication even include images from the MegaFace dataset.
According to reporting from the New York Times, the MegaFace dataset has been downloaded thousands of times by "companies and government agencies around the world", including "U.S. defense contractor Northrop Grumman; In-Q-Tel, the investment arm of the Central Intelligence Agency; ByteDance the parent company of the Chinese social media app TikTok; and the Chinese surveillance company Megvii."
Exposing.ai also found download requests from the Danish National Police Team NC3, Defence Science & Technology (DST) Group in Edinburgh, European Law Enforcement Agency (EUROPOL), Facebook, Google, Huawei, Hikvision, IARPA, Military Technical Academy of Bucharest, Microsoft, Russian law-enforcement and security agency contractor StilSoft.ru, SenseNets, SenseTime, Tencent, Turkish Police, and also by Hoan Ton-That the CEO of Clearview.ai among thousands more. The FOIA data, originally obtained by NYT reporter Kashmir Hill, is being reviewed and cleaned and will eventually posted on this site. The original document can also be downloaded here.
To help understand how MegaFace Dataset has been used around the world by commercial, military, and academic organizations; existing publicly available research citing MegaFace Dataset was collected, verified, and geocoded to show how AI training data has proliferated around the world. Click on the markers to reveal research projects at that location.
What did the MegaFace download include? The downloads provided by the University of Washington included 13 compressed files including a total 855GB of imagery, a JSON file with metadata, a version of tightly-cropped faces, and two compressed files with disjoint images. The 855GB of images included in the MegaFace dataset are entirely from Flickr and entirely derivatives of the YFCC100M dataset, which is governed by Creative Commons legal requirements.
According to Creative Commons, images must include the appropriate author credit and follow the commercial or non-commercial restrictions as labeled by the author. However, the MegaFace dataset does not provide any such metadata to credit the creators. The example below shows the extent of the data included for each photo. The field labeled box
corresponds to the location of the face detection box. The exp_bb
field corresponds to an expanded face detection box. The landmarks
field corresponds to a 68 facial landmarks describing facial features. The full_img_url
is simply the direct URL to the image on Flickr.com.
{ "box": { "top": ####, "right": ####, "left": ####, "bottom": #### }, "full_img_url": "https://farm6.staticflickr.com/####/##########_##########_o.jpg", "exp_bb": { "top": ####, "right": ####, "left": ####, "bottom": #### }, "landmarks": "[(#, #), (#, #), ..., (#, #)]" }
If you reference or use any data from the Exposing.ai project, cite our original research as follows:
@online{Exposing.ai, author = {Harvey, Adam. LaPlace, Jules.}, title = {Exposing.ai}, year = 2021, url = {https://exposing.ai}, urldate = {2021-01-01} }
If you reference or use any data from MegaFace cite the author's work:
@article{Nech2017LevelPF, author = "Nech, Aaron and Kemelmacher-Shlizerman, Ira", title = "Level Playing Field for Million Scale Face Recognition", journal = "2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)", year = "2017", pages = "3406-3415" }