Exposing.ai
FaceScrub Dataset
Example images from the FaceScrub dataset. Faces are blurred to protect privacy. Visualization by Adam Harvey / Exposing.ai. All images licensed under CC-BY-NC with original images licensed and attributed under Creative Commons CC-BY (attribution required, no commercial use).

FaceScrub

FaceScrub is a dataset of 141,130 faces of 695 public figures. The images were collected "from searches for public figures using a search engine" [1]. According to the FaceScrub authors, they compiled a list of names from the Internet Movie Database (IMDb) and used an Internet search engine to download images.

The dataset includes the full name and source URL for each image, providing insight into the websites that were scraped for biometric data. The domains include nytimes.com, timeinc.com, justjared.com, hawtcelebs.com, huffpost.com, vanityfair.com, and hundreds more. None of the images include any licensing information, and many of the URLs are no longer accessible. The 211 images from Flickr used in FaceScrub can be checked in the Exposing.ai search engine.
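Because each image record carries a source URL, the list of scraped websites can be tallied directly from those URLs. The following is a minimal sketch, not the method used by Exposing.ai: the function names `source_host` and `count_source_domains` are hypothetical, the sample URLs are invented for illustration, and how the URLs are read out of the dataset files is left to the reader.

```python
from collections import Counter
from urllib.parse import urlsplit


def source_host(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def count_source_domains(urls):
    """Count how many image URLs point at each host (blank lines skipped)."""
    return Counter(source_host(u) for u in urls if u.strip())


# Hypothetical URLs for illustration only; not taken from the dataset.
sample_urls = [
    "http://www.nytimes.com/images/a.jpg",
    "https://nytimes.com/images/b.jpg",
    "http://www.justjared.com/photo/c.jpg",
]

domain_counts = count_source_domains(sample_urls)
for host, n in domain_counts.most_common():
    print(host, n)
```

This sketch only strips a leading "www." and does not merge other subdomains (for example, an image server on a separate subdomain would be counted on its own), so the counts are per host rather than per organization.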

This dataset is included in the analyses on Exposing.ai because it is highly popular with biometrics and surveillance researchers, and has been cited in several research projects from Microsoft and SenseTime.

Information Supply Chain

To help understand how the FaceScrub dataset has been used around the world by commercial, military, and academic organizations, existing publicly available research citing FaceScrub was collected, verified, and geocoded to show how AI training data has proliferated around the world. Click on the markers to reveal research projects at that location.

Citation data is collected using SemanticScholar.org; dataset usage is then verified and geolocated. Citations are used to provide an estimated overview of how and where images were used based on institutional affiliations. Thicker lines represent more citations. Please zoom in to see all institutions, as cities may have multiple points very close together.

FaceScrub Copyright Distribution

[Chart: FaceScrub copyright distribution (axis: Years)]

FaceScrub Creative Commons License Distribution

[Chart: FaceScrub Creative Commons license distribution (axis: Years)]

FaceScrub Image Upload Year Distribution

[Chart: FaceScrub image upload year distribution (axis: Years)]

Top 10 FaceScrub Image #Tags

[Chart: Top 10 image #tags used in FaceScrub]

Top 10 Geocoded Cities in FaceScrub

[Chart: Top 10 cities for geocoded photos in FaceScrub]

Citing This Work

If you reference or use any data from the Exposing.ai project, cite our original research as follows:

@online{Exposing.ai,
  author = {Harvey, Adam and LaPlace, Jules},
  title = {Exposing.ai},
  year = 2021,
  url = {https://exposing.ai},
  urldate = {2021-01-01}
}

If you reference or use any data from FaceScrub, cite the authors' work:

@inproceedings{Ng2014ADA,
    author = "Ng, Hongwei and Winkler, Stefan",
    title = "A data-driven approach to cleaning large face datasets",
    booktitle = "2014 IEEE International Conference on Image Processing (ICIP)",
    year = "2014",
    pages = "343--347"
}

References

  • [1] Hongwei Ng and Stefan Winkler. "A data-driven approach to cleaning large face datasets". 2014 IEEE International Conference on Image Processing (ICIP). (2014): 343-347.