FaceScrub is a dataset of 141,130 faces of 695 public figures. The images were collected "from searches for public figures using a search engine." [1] According to the FaceScrub authors, they compiled a list of names from the Internet Movie Database (IMDb) and used an Internet search engine to download the images.
The dataset includes the full name and source URL for each image, providing insight into the websites that were scraped for biometric data. The domains include nytimes.com, timeinc.com, justjared.com, hawtcelebs.com, huffpost.com, vanityfair.com, and hundreds more. None of the images include any licensing information, and many of the URLs are no longer accessible. The 211 images from Flickr used in FaceScrub can be checked in the Exposing.ai search engine.
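Because each image record carries a source URL, the spread of scraped websites can be tallied with a few lines of code. The following is a minimal sketch, not the method used by Exposing.ai or by the FaceScrub authors. The function name `count_source_domains` and the sample URLs are hypothetical and chosen only for illustration. The sketch assumes the URLs have already been read from the dataset files into a list of strings.

```python
from collections import Counter
from urllib.parse import urlparse


def count_source_domains(urls):
    """Tally how many image URLs come from each host name."""
    counts = Counter()
    for url in urls:
        # netloc is the host part of the URL, e.g. "www.nytimes.com"
        host = urlparse(url.strip()).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same source
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip entries that are not parseable as absolute URLs
            counts[host] += 1
    return counts


# Hypothetical URLs for illustration only; not taken from the dataset.
sample_urls = [
    "http://www.nytimes.com/images/a.jpg",
    "https://justjared.com/photos/b.jpg",
    "http://www.nytimes.com/images/c.jpg",
]

domain_counts = count_source_domains(sample_urls)
for domain, n in domain_counts.most_common():
    print(domain, n)
```

Stripping the `www.` prefix is a simplification. Grouping subdomains under a registered domain would need a public suffix list, which this sketch does not attempt.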
This dataset is included in the analyses on Exposing.ai because it is highly popular with biometrics and surveillance researchers, including several research project citations from Microsoft and SenseTime.
To show how the FaceScrub dataset has been used by commercial, military, and academic organizations, existing publicly available research citing FaceScrub was collected, verified, and geocoded, revealing how AI training data has proliferated around the world.
If you reference or use any data from the Exposing.ai project, cite our original research as follows:
@online{Exposing.ai,
  author = {Harvey, Adam and LaPlace, Jules},
  title = {Exposing.ai},
  year = 2021,
  url = {https://exposing.ai},
  urldate = {2021-01-01}
}
If you reference or use any data from FaceScrub, cite the authors' work:
@article{Ng2014ADA,
  author = "Ng, Hongwei and Winkler, Stefan",
  title = "A data-driven approach to cleaning large face datasets",
  journal = "2014 IEEE International Conference on Image Processing (ICIP)",
  year = "2014",
  pages = "343--347"
}