Example images from the FaceScrub dataset. Faces are blurred to protect privacy. Visualization by Adam Harvey / Exposing.ai. All images licensed under CC-BY-NC with original images licensed and attributed under Creative Commons CC-BY (attribution required, no commercial use).

FaceScrub

FaceScrub is a dataset of 141,130 faces of 695 public figures. The images were collected "from searches for public figures using a search engine." 1 The authors of FaceScrub state that they compiled a list of names from the Internet Movie Database (IMDb) and used an Internet search engine to download images.

The dataset includes the full name and source URL for each image, providing insight into the websites that were scraped for biometric data. The domains include nytimes.com, timeinc.com, justjared.com, hawtcelebs.com, huffpost.com, vanityfair.com, and hundreds more. None of the images include any licensing information, and many of the URLs are no longer accessible. There are 211 images from Flickr used in FaceScrub that can be checked in the Exposing.ai search engine.
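As a rough illustration of how per-image source URLs can reveal which websites were scraped, the sketch below tallies URLs by domain. It is a minimal example under stated assumptions: the `sample_urls` list is hypothetical and not taken from FaceScrub, the `count_domains` helper is introduced here for illustration only, and the actual file layout of the dataset is not assumed.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Tally how many URLs come from each host, ignoring a leading 'www.'."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the string has no network location
        if host is None:
            continue  # skip malformed entries rather than guessing a domain
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Hypothetical example URLs (assumption: not real dataset entries).
sample_urls = [
    "http://www.nytimes.com/images/a.jpg",
    "https://nytimes.com/images/b.jpg",
    "http://www.justjared.com/photo/c.jpg",
    "not a url",
]

domain_counts = count_domains(sample_urls)
for domain, n in domain_counts.most_common():
    print(domain, n)
```

Grouping by host name rather than by full URL keeps the tally readable. A stricter analysis might also merge subdomains under their registered domain, which this sketch does not attempt.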

This dataset is included in the analyses on Exposing.ai because it is highly popular with biometrics and surveillance researchers and has been cited in several research projects from Microsoft and SenseTime.

Information Supply Chain

To help understand how the FaceScrub dataset has been used around the world by commercial, military, and academic organizations, existing publicly available research citing FaceScrub was collected, verified, and geocoded to show how AI training data has proliferated. Click on the markers to reveal research projects at that location.

Map legend: Academic, Commercial, Military / Government

Citation data is collected using SemanticScholar.org, then dataset usage is verified and geolocated. Citations are used to provide an estimated overview of how and where images were used, based on institutional affiliations. Thicker lines represent more citations. Please zoom in to see all institutions, as cities may have multiple points very close together.

FaceScrub Attributes
Dataset Name	FaceScrub
Dataset Name Full	FaceScrub
Total Images	105,830
Identities	530
Initial Purpose	Face recognition and detection
Year Published	2014
Dataset Website	http://vintage.winklerbros.net/facescrub.html

Photos from Flickr.com in FaceScrub
Total Flickr Photos	222
Total Flickr Users	130
Active on Flickr.com*	211
Inactive/removed on Flickr.com*	11
API Data Accessed	October 2019
Included in YFC100M	14
Photos w/ Geo Data	24
Searchable on Exposing.ai	222

FaceScrub Copyright Distribution

FaceScrub Creative Commons license distribution

FaceScrub Creative Commons License Distribution

FaceScrub Creative Commons license distribution

FaceScrub Image Upload Year Distribution

FaceScrub image upload year distribution

Top 10 FaceScrub Image #Tags

Top 10 image #tags used in FaceScrub

Top 10 Geocoded Cities FaceScrub

Top 10 cities for geocoded photos in FaceScrub

Citing This Work

If you reference or use any data from the Exposing.ai project, cite our original research as follows:

@online{Exposing.ai,
  author = {Harvey, Adam},
  title = {Exposing.ai},
  year = 2021,
  url = {https://exposing.ai},
  urldate = {2021-01-01}
}

If you reference or use any data from FaceScrub, cite the authors' work:

@inproceedings{Ng2014ADA,
    author = "Ng, Hongwei and Winkler, Stefan",
    title = "A data-driven approach to cleaning large face datasets",
    booktitle = "2014 IEEE International Conference on Image Processing (ICIP)",
    year = "2014",
    pages = "343--347"
}

References

1 Hongwei Ng, et al. "A data-driven approach to cleaning large face datasets". 2014 IEEE International Conference on Image Processing (ICIP). (2014): 343-347.