A case study on publicly available facial recognition datasets for the Munich Security Conference's Transnational Security Report
National AI strategies often rely on transnational data sources to capitalize on recent advancements in deep learning and neural networks. Researchers benefiting from these transnational data flows can yield quick and significant gains across diverse sectors from health care to biometrics. But new challenges emerge when national AI strategies collide with national interests.
Our earlier research on the MS Celeb and Duke datasets published with the Financial Times revealed that several computer vision image datasets created by US companies and universities were unexpectedly also used for research by the National University of Defense Technology in China, along with top Chinese surveillance firms including SenseTime, SenseNets, CloudWalk, Hikvision, and Megvii/Face++ which have all been linked to oppressive surveillance in the Xinjiang region of China.
In this new research for the Munich Security Conference's Transnational Security Report we provide summary statistics about the origins and endpoints of facial recognition information supply chains. To make it more personal, we gathered additional data on the number of public photos from embassies that are currently being used in facial recognition training datasets.
Origins: In total, we found over 24 million non-cooperative, non-consensual photos in 30 publicly available face recognition and face analysis datasets. Of these 24 million images, over 15 million face images are from Internet search engines, over 5.8 million from Flickr.com, over 2.5 million from the Internet Movie Database (IMDb.com), and nearly 500,000 from CCTV footage. All 24 million images were collected without any explicit consent, a type of face image that researchers call "in the wild". Every image contains at least one face and many photos contain multiple faces. There are approximately 1 million unique identities across all 24 million images.
Out of the 24 million images analyzed, at least 8,428 embassy images were found in face recognition and facial analysis datasets. These images were found by cross-referencing Flickr IDs and URLs between datasets to locate 5,667 images in the MegaFace dataset, 389 images in the IBM Diversity in Faces datasets, and 2,372 images in the Who Goes There dataset. MegaFace is one of the most widely used publicly available face recognition datasets for academic, commercial, and defense-related research.
In total, these 8,428 images were found to be used in at least 42 countries with most citations originating in China and most images originating from US embassies. The images were found to be used in research projects with links to commercial and defense organization including Google, Microsoft, National University of Defense Technology in China, SenseTime, Tencent, Mitsubishi, ExpertSystems (Italy), Siren Solution (Ireland), and Paradigma Digital (Spain).
The embassy and consulate photos below were all found in either the MegaFace or IBM Diversity in Faces datasets. Consulates were only included if marked as "EMBASSY" by the U.S. Department of State’s Social Media Presence List. Photos below were chosen because of inclusion of an embassy logo. All photos originated on Flickr.com and were published with a Creative Commons license.
To make this analysis slightly more personal for Munich Security Conference readers, several photos from the US Consulate in Munich were located. Coincidentally, one of the images is from the Deutsch-amerikanischer Datenschutztag symposium (data protection day).
This brief research aims to shed light on the emerging politics of data. A photo is no longer just a photo when it can also be surveillance training data, and datasets can no longer be separated from the development of software when software is now built with data. "Our relationship to computers has changed", says Geoffrey Hinton, one of the founders of modern day neural networks and deep learning. "Instead of programming them, we now show them and they figure it out." 1. Data is a new kind of code.
As data becomes more political, national AI strategies might also want to include transnational dataset strategies.
Research and text: © Adam Harvey
The list of of embassies used for this analysis are from the U.S. Department of State’s Social Media Presence List combined with manual search results. In some cases, the official U.S. Dept. of State list describes consulates and missions as embassies. For example, the US Consulate Munich and the US Mission Canada are marked as "EMBASSY". Consulates and missions listed as embassies by the U.S. Dept. of State list are included in this analysis.
The Who Goes There dataset is used for ethnicity analysis and is included because ethnicity analysis can be used as part of facial recognition.
Citations are gathered from SemanticScholar.com.
If you reference this research project or use any data from the Exposing.ai project, cite our original research as follows:
@online{Exposing.ai, author = {Harvey, Adam}, title = {Exposing.ai}, year = 2021, url = {https://exposing.ai}, urldate = {2021-01-01} }