Capturing Many Faces

R. Thora Bjornsdottir; Vít Třebický; Lisa DeBruine; Iris Holzleitner

1 Introduction

Faces are rich sources of information, playing a key role in human social perception. They strongly draw attention, serving as a primary means of person recognition and informing social inferences (e.g., Freeman & Johnson, 2016; Sutherland & Young, 2022; Young & Burton, 2017; Zebrowitz, 1997). Various subfields within psychology—including vision science, person perception, social cognition, affective science, cognitive neuroscience, behavioural science, and person recognition—address research questions involving faces and thus necessitate the use of face images as stimuli. Here, we introduce a novel face image database for research use. This database was collected as part of an international multi-lab collaboration and addresses various limitations of existing databases. Crucially, we take a transparent approach aligned with Open Science best practices and make openly available the reproducible protocol we followed to collect the images, enabling future expansion of the database.

1.1 Existing face databases

A wealth of face stimulus databases exists, created for various purposes. Many of these are available for research purposes (see, e.g., the face image meta-database of Workman & Chatterjee, 2021, which allows searching across many of them), giving researchers access to a broad variety of face stimuli well suited to questions in a given subfield. For example, there are many different databases in which photographed individuals (referred to as targets or models) display different facial expressions of emotion (e.g., Ebner et al., 2010; Schalk et al., 2011, of use in affective science), appear in varying lighting and angles (e.g., Burton et al., 2010; Gao et al., 2008, useful for person recognition research), or belong to various age or racial/ethnic groups (useful for person perception research). However, many studies tend to rely on a small number of those stimulus databases, which introduces potential problems. For example, 4107 published papers cite the Karolinska Directed Emotional Faces (Lundqvist et al., 2015), 4375 cite the NimStim Set of Facial Expressions (Tottenham et al., 2009), 2461 cite the Chicago Face Database (Ma et al., 2015), and 3219 cite the Radboud Faces Database (Langner et al., 2010). Although citation counts are not a direct measure of the use of these databases, they nonetheless indicate their popularity. This frequent (over)use of the same sets of stimuli can compromise generalizability when various research questions are tested on only a limited sample of face images from specific backgrounds (L. M. DeBruine et al., 2022; Yarkoni, 2022). Participants, especially those participating online, may also become familiar with frequently used stimuli, potentially introducing biases to their responses.

Of course, not all research involving face images uses stimuli from published databases. It is common for researchers to purposefully collect images themselves, but these stimulus sets are seldom openly shared (e.g., due to ethical limitations). Although this circumvents the issues with frequently used databases, it introduces its own set of problems. Chiefly, face images collected by researchers for their own research use are often insufficiently documented. Although the resulting images may be described and/or illustrated with an example, the process of image collection is rarely described in adequate detail. This is often the case even for databases that are openly available for research use. This lack of documentation of image acquisition (and processing) methods limits transparency, replication attempts (e.g., another researcher attempting to replicate a finding using new face images they collected themselves, rather than the images collected by the original researchers; see Třebický et al., 2024), and assessments of results’ validity (as undercharacterization of stimuli can preclude identification of issues like confounding factors). Furthermore, different research teams often employ protocols that can differ significantly from each other in a plethora of major and minor details, including hair (e.g., free hairstyle or pulled back; facial hair allowed or not), facial accessories, makeup, lighting conditions, and type of shirt worn. These often-unreported details further hinder the comparison between different studies.

In many databases (both publicly available and not), there are also limitations in the diversity of stimuli. For example, many include only images of individuals who self-identify as White or Western (Martinez, 2025). This reinforces the centering of Whiteness and Western culture in psychological research, importantly limiting the ecological validity and generalizability of conclusions (see also Henrich et al., 2010; Roberts & Mortenson, 2023). Recently created databases have sought to address this, specifically collecting images of multiracial/multi-ethnic and non-Western samples (e.g., Bastanfard et al., 2007; Chen et al., 2021; Courset et al., 2018; Meyers et al., 2024; Saribay et al., 2018; Trzewik et al., 2025), but more such work is needed to further diversify the pool of available face research stimuli. Many world regions and ancestries remain underrepresented (e.g., East African, First Nations, Australian Aboriginal people). This may be due, in part, to resource limitations: Equipment and setups for collecting images suitable for research use can be prohibitively expensive and complex, making database collection not feasible for everyone and everywhere, thereby limiting database diversity. Additionally, researchers with the necessary equipment may have access only to participant samples with limited diversity in the ancestry and cultural heritage that may shape visible facial features. Furthermore, to date, no extant database contains images of individuals from multiple world regions.

A final limitation of extant face image databases is their largely fixed state. That is, once collected, the number of stimuli available does not change. There are some exceptions to this, such as multiple waves of additions to the Chicago Face Database (Lakshmi et al., 2021; Ma et al., 2021). Continued additions to databases are not the norm, however. This not only constrains the number of stimuli in any given database, increasing the likelihood of the frequent reuse of any individual stimulus, but can also render some databases obsolete over time (e.g., due to very dated model appearance/styling).

1.2 Open Science considerations

Open and transparent methods are essential for fully understanding, evaluating, and replicating research. Although methods reporting has become more open and accessible with researchers sharing materials such as questionnaire wording, stimuli, and programming scripts, the reporting of face image database collection procedures often lacks crucial details for full transparency and reproducibility. This matters because differences in often-undocumented details such as focal length, for example, can importantly affect faces’ appearance, subsequently impacting perceptions of them (e.g., Třebický et al., 2016; see Třebický et al., 2024 for further discussion).

Big Team Science (i.e., multi-lab) endeavours and large-scale collaborations are now recognized as a vital part of improving science. Within psychology, multiple such initiatives have importantly contributed to addressing a variety of research questions, improving diversity and generalizability in the field (e.g., in the domain of face perception, Jones et al.’s (2021) Psychological Science Accelerator project). Such an approach has yet to be applied to stimulus database collection, but it represents an opportunity to address many of the limitations of face image databases raised.

1.3 The current work

Here, we sought to address the limitations of existing face image databases through a Big Team Science approach. This served as the first project of ManyFaces, an international consortium of face perception and recognition researchers formed in 2022.

To address transparency and reproducibility issues, we developed an openly available, reproducible protocol for face image collection. The protocol covers collecting multiple images (varying in standardization, viewing angle, and facial expression) of each target/model to benefit a wide array of possible research areas and questions related to face perception and recognition. Following this protocol, we collected images (Phase 1) in 20 different labs across the world, spanning 11 countries and five world regions (Europe, Latin America, North America, the Middle East, and Southeast Asia). This resulted in a diverse set of models and images, more than would be possible to collect in or by a single lab. Moreover, this database is not static but can be added to in the future by any interested researcher following the protocol. We also validated the database (Phase 2) by recruiting online perceivers to rate a subset of the images (i.e., front-facing images) to generate norming data for key social trait perceptions and emotion recognition. We make this image database and norming data available for future research.

Altogether, we introduce a new, diverse, openly available face stimulus set that can continue to grow in the future. We believe that this effort will broaden the generalizability of findings in face perception and recognition research.

2 Study/phase 1: Protocol development & stimulus collection

To set face perception and recognition Big Team Science in motion and enable conducting multi-site studies, a set of stimuli suitable for most areas (e.g., in terms of research question, geographic location) is needed. Therefore, we developed a protocol allowing us to collect such a database of face images. We developed the protocol with maximum usability and accessibility in mind, making it transparent and readable to non-experts, and designed for use with attainable (vs. highly specialized and expensive) equipment, with minimal setup requirements, and without requiring expertise in photography. Note that the protocol is for in-lab image collection to minimize external noise, to enable the collection of models’ self-report data, and to ensure the consensual use of models’ images (vs. scraping images from online sources or generating them). Photographing faces in a controlled lab environment thus strikes a balance in terms of ecological validity and ethical concerns (see Trzewik et al., 2025 for discussion of the value of lab-photographed vs. artificially generated and ambient face images).

2.1 Method

For both studies/phases, the University of Glasgow provided ethical approval. We obtained additional local ethical approval at collaborating institutions where required.

2.1.1 Protocol Development

The team curated a set of equipment and developed a protocol for collecting standardized, reproducible images of faces for research. Following a survey among the ManyFaces members¹, we constructed the protocol for the collection of a variety of images that would be useful for a broad variety of research questions. Specifically, we included images varying in their standardization of appearance (standardized and unstandardized, e.g., white t-shirt and hair pulled back vs. clothing and hair as worn by the participant/model on that day), viewing angle (frontal, profile, ¾), and facial expression (neutral, natural, angry, disgusted, fearful, happy, sad, surprised).

Prior to beginning data collection, the ManyFaces team pre-tested the protocol to ensure clarity of instructions and consistency of images collected across sites. We revised the protocol to address issues that arose (e.g., revised facial expression elicitation instructions, clarified camera settings). The final protocol can be found on the OSF (https://osf.io/qxwvz; see also https://osf.io/nkahx for all materials). In sum, the protocol detailed the following:

2.1.1.1 Image Types

The protocol defined the categories of images in terms of their standardisation of appearance, viewing angle, and facial expression.

Standardisation of Appearance. In the standardised images, models wore a white crew neck t-shirt, had their hair pulled back or covered, included no adornments/accessories (excepting those that could not be removed for cultural reasons), and wore minimal or no makeup (models were informed about this beforehand). In contrast, unstandardised images showed models in their own clothing, with their hair as they came into the lab, and with adornments, except for glasses or anything that obscured the face or neck.
Viewing Angle. Full frontal portraits showed models facing the camera, left and right profile portraits showed each side of models’ faces in profile, and left and right ¾ portraits showed models’ faces at a 45-degree angle from the axis of the camera.
Facial Expression. Neutral images were of models refraining from making any facial expression, natural images displayed a natural expression for the model, and each facial expression of emotion displayed the expression the model would make if they were feeling the specified emotion (models instructed e.g., “face the camera with the expression you would make if you were feeling happy”).

2.1.1.2 Image Prioritisation

The protocol specified which images to take (i.e., the combinations of appearance standardisation, viewing angle, and facial expression), in what order, and which were most crucial to collect (as voted by the ManyFaces members) if researchers were under time constraints². The order of image collection was as follows, with starred (*) images prioritised/required for each model:

Exposure calibration photo*
Identification & calibration photo*
Unstandardized - Natural - Full frontal portrait
Unstandardized - Neutral - Full frontal portrait*
Unstandardized - Neutral - Left and Right profile portraits
Unstandardized - Neutral - Left and Right ¾ portraits
Unstandardized - Happy - Full frontal portrait*
Unstandardized - Happy - Left and Right profile portraits
Unstandardized - Happy - Left and Right ¾ portraits
Unstandardized - All other expressions (in random or appropriate order) - Full frontal portrait
Standardized - Natural - Full frontal portrait
Standardized - Neutral - Full frontal portrait*
Standardized - Neutral - Left and Right profile portraits*
Standardized - Neutral - Left and Right ¾ portraits*
Standardized - Happy - Full frontal portrait*
Standardized - Happy - Left and Right profile portraits
Standardized - Happy - Left and Right ¾ portraits
Standardized - All other expressions (in random or appropriate order) - Full frontal portrait

Researchers with additional time could also capture the other expressions at different viewing angles (Left/Right profile, ¾) with either unstandardized or standardized appearance, as well as a video of standardized appearance to test out 3D image capture.

2.1.1.3 Equipment

All collaborating sites/labs used the same set of equipment. Images were captured using a Canon EOS 250D (also called Canon EOS Rebel SL3 in some regions) camera with a kit lens (Canon 18-55mm IS STM lens), and lit by an LED Ring light (Fovitec Bi-Colour LED 18” Ring Light) on a stand. For colour calibration, a Calibrite ColourChecker Classic card was used. The image collection spaces were to have a white background with a chair for models to sit in. Researchers provided white t-shirts for models to wear for standardised images.

2.1.1.4 Setup

The setup of the room used for image collection at each site required removing or minimising external light sources (e.g., using a windowless room, window blinds, turning off overhead lights) and colour spill. The protocol specified the distances at which to set up the chair for models to sit in relative to the white background and the lighting and camera rig. It also provided camera setup instructions, including the specified shooting and focusing modes, shutter speed, aperture, ISO, white balance, colour space, and file type.

2.1.1.5 Procedure

Finally, the protocol detailed the written informed consent process, how to prepare and position models, and how to take each kind of image listed in the image prioritisation section. This included the positioning of the model’s head for each viewing angle and facial expression elicitation instructions.

2.1.2 Stimulus Collection

Following the protocol, 20 labs (1 in Austria, 2 in Brazil, 2 in Canada, 2 in Germany, 1 in Israel, 3 in Malaysia, 1 in Mexico, 1 in the Netherlands, 1 in Serbia, 4 in the UK, and 2 in the US) collected images of an average of 10.6 models per site. As specified by the protocol, researchers collected multiple images of each model, within the time constraints of the study session.

Models furthermore completed a demographic questionnaire (https://osf.io/7w5du/files/q3p7d) on Experimentum, reporting their age, gender, ethnicity, height, and weight. They also reported whether or not they were wearing makeup (specifying what kinds, e.g., foundation, eye makeup, semi-permanent makeup) and whether they had ever experienced anything that could affect the shape of their face (e.g., broken nose, cosmetic surgery or injections, orthodontic work) and specified this if they were willing. Finally, they completed a debriefing questionnaire (https://osf.io/7w5du/files/26b8c) asking them about their experience of posing for the images (e.g., whether instructions were clear, whether any part of the process was uncomfortable).

2.2 Results

2.2.1 Images

A total of 211 models provided their images, following withdrawals and exclusions for poor image quality. Table 1 details the number of models for each kind of photo. Example images are shown in Figure 1. Images are available for research use and can be requested here: [link to be added]. Note that images cannot be shared without submitting a request and completing a research use agreement, in line with ethical constraints.

Table 1: The number of images available for each image type. The column std is for standardised images, with neutral clothing, no makeup, and hair pulled back, while unstd is for unstandardised images with natural hair and makeup.

emotion	std	unstd
neutral	205	177
anger	187
disgust	184
fear	175
happy	199
sad	183
surprised	184


Figure 1: Example images of each level of standardisation, angle, and expression

2.2.2 Model Demographics

For data cleaning, we first downloaded and reshaped the raw data from Experimentum. In the next step, we ensured that the models’ gender, race/ethnicity, and units of height and weight were consistently formatted across labs.

For gender and race/ethnicity, responses given in languages other than English were recoded into English (e.g., “mulher” to “female”, “preta” to “Black”). We then classified self-described race/ethnicity into one of seven categories, where possible: “White” (n = 89), “Black” (n = 5), “Asian” (n = 57), “Indigenous” (n = 0), “MENA” (Middle Eastern or North African; n = ), “Latine” (n = 10), or “Mixed” (n = 15). Descriptions that could not be clearly sorted into these categories were labelled “Ambiguous label” (n = ), and non-entries were recoded as NA (n = 7).

For height, we ensured all data were presented in centimeters, and for weight, we ensured that all data were presented in kilograms. Models could report height and weight in metric or imperial units, so we converted from imperial to metric where required. We also sanity-checked the reported units, assuming heights > 100 to be in centimeters and those < 100 to be in inches, and weights < 80 to be in kilograms. Any non-entries of height or weight for a model were recoded as NA.
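
The unit harmonisation described above can be sketched as follows. This is a minimal illustration in Python, not the authors' cleaning script. The function names, the explicit `unit` argument for weight, and the treatment of a height of exactly 100 as inches are assumptions made for the example; only the height threshold of 100 comes from the text.

```python
from typing import Optional

CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def height_to_cm(value: Optional[float]) -> Optional[float]:
    """Return a height in centimetres, or None (NA) for a non-entry.

    Heuristic from the text: values above 100 are read as centimetres,
    values below 100 as inches. A value of exactly 100 is not covered by
    the text; this sketch treats it as inches (assumption).
    """
    if value is None:
        return None
    if value > 100:
        return round(value, 1)
    return round(value * CM_PER_INCH, 1)


def weight_to_kg(value: Optional[float], unit: str) -> Optional[float]:
    """Return a weight in kilograms, or None (NA) for a non-entry.

    `unit` is a hypothetical field ("kg" or "lb") standing in for the
    unit the model reported.
    """
    if value is None:
        return None
    if unit == "kg":
        return round(value, 1)
    if unit == "lb":
        return round(value * KG_PER_POUND, 1)
    raise ValueError(f"unknown weight unit: {unit!r}")


# Example usage with made-up entries
heights = [height_to_cm(v) for v in (165, 71, None)]
weights = [weight_to_kg(v, u) for v, u in ((62.0, "kg"), (240, "lb"), (None, "kg"))]
```

Keeping non-entries as `None` rather than zero means they drop out of summary statistics instead of distorting them.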

Table 2: Summary of model demographic info

		Age (years)			Height (cm)			Weight (kg)
gender	n	min	max	mean	min	max	mean	min	max	mean
female	124	18.0	65.0	27.7	149.0	180.3	164.9	37.0	108.9	62.9
male	78	18.0	80.0	26.8	163.0	193.0	178.3	54.0	130.0	79.6
non-binary	7	19.0	37.0	24.7	165.0	201.0	175.0	59.0	110.0	74.1
NA	1	27.0	27.0	27.0	169.0	169.0	169.0	61.0	61.0	61.0
missing	1	30.0	30.0	30.0	167.0	167.0	167.0	60.0	60.0	60.0

3 Study/phase 2: Validation/norming data

We next obtained perceptions/ratings of a subset of the photos (front-facing) to validate the emotion expressions (their perceived emotion category and intensity) and collect norming data on central social perceptions, namely perceived attractiveness, dominance, trustworthiness, gender-typicality, memorability, and age. We chose these ratings due to their central importance in the person perception, face recognition, and emotion perception literatures (e.g., Oosterhof & Todorov, 2008). We preregistered this study on the OSF (https://osf.io/4d5v9).

3.1 Method

3.1.1 Image Processing

RAW images were processed using webmorphR (L. M. DeBruine et al., 2022), which facilitates scriptable processing of images using imagemagick (ImageMagick Studio LLC, 2024) (a full script of the processing steps is available at https://github.com/ManyFacesTeam/imgprep). Briefly,

Each face was delineated using the Face++ automatic face detection algorithm (see https://www.faceplusplus.com/) to generate a 106-point template.
All images were resized to 1000w by 1500h pixels to standardise size (two different RAW formats were used, which resulted in two different image sizes).
The median RGB colour value of a 100×100 pixel patch at the upper left corner was calculated to fill in any edges from alignment.
The image was repositioned and cropped (not rotated or resized) such that:
- the image size was 675w by 900h pixels
- point 71 (between the eyes) was relocated to position [.5w, .4h]
This aligned image was saved as a lossless PNG using imagemagick default settings (e.g., sRGB colour space).
A white balance correction was applied to the resulting images, calculated from the mean RGB values in the 25×25 pixel top-right corner patch (white background).³
These images were converted to JPEGs with a quality setting of 75 to reduce file size for stimulus display online.
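
The alignment step above amounts to simple geometry, sketched here in Python. This is an illustration only and not the webmorphR/imagemagick script the authors link to. The function names and the example landmark coordinates are hypothetical; the output size (675 by 900), the relative anchor position (.5w, .4h), and the 1000 by 1500 source size come from the text.

```python
def crop_box(anchor_x, anchor_y, out_w=675, out_h=900, rel_x=0.5, rel_y=0.4):
    """Return (left, top, right, bottom) of the crop window in source pixels.

    The window is placed so that the anchor landmark (point 71, between the
    eyes) ends up at (rel_x * out_w, rel_y * out_h) in the cropped image.
    The image is only translated and cropped, never rotated or resized.
    """
    left = round(anchor_x - rel_x * out_w)
    top = round(anchor_y - rel_y * out_h)
    return left, top, left + out_w, top + out_h


def padding_needed(box, img_w=1000, img_h=1500):
    """Return (left, top, right, bottom) pixels of the window outside the image.

    Any such region would be filled with the background colour, e.g. the
    median colour of the corner patch described in the text.
    """
    left, top, right, bottom = box
    return (max(0, -left), max(0, -top),
            max(0, right - img_w), max(0, bottom - img_h))


# Hypothetical landmark positions in a 1000 x 1500 image
centred = crop_box(497.5, 640)      # window lies fully inside the image
off_centre = crop_box(300.5, 200)   # window extends past the left and top edges
centred_pad = padding_needed(centred)
off_centre_pad = padding_needed(off_centre)
```

In the second example the window starts 37 pixels left of and 160 pixels above the image, which is the kind of edge region the background fill is meant to cover.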

Figure 2: Image processing steps: (A) the raw image, (B) the image automatically delineated and cropped to a standard alignment, (C) white balance correction.

3.1.2 Stimuli

The number of stimuli was determined by the number of models recruited across the 20 research labs and how many models posed for each image type. The maximum number of targets per image type was 205⁴, meaning each rater saw up to 205 stimuli. We obtained ratings of the front-facing standardised neutral (n = 205), unstandardised neutral (n = 177), standardised angry (n = 187), standardised disgusted (n = 184), standardised fearful (n = 175), standardised happy (n = 199), standardised sad (n = 183), and standardised surprised (n = 184) images.

Figure 3: An example of each type of stimulus, with the total number of stimuli of that type in parentheses: unstandardised (177), neutral (205), anger (187), disgust (184), fear (175), happy (199), sad (183), surprised (184).

3.1.3 Attention Check Stimuli

Additionally, we created stimuli for attention checks. These were white images with the same size and aspect ratio as the face stimuli that contained only a written instruction to choose a specific response (e.g., ‘Choose “fear”’ or ‘Choose “3”’).

3.1.4 Measures

3.1.4.1 Standardised Neutral Faces

3.1.4.1.1 Trait Ratings

We obtained ratings of faces’ attractiveness, dominance, trustworthiness, memorability, and gender-typicality (‘How attractive [dominant, trustworthy, memorable, gender-typical] does this person look?’). Ratings were on scales ranging from 1 (not at all) to 7 (very).

3.1.4.1.2 Demographic Impressions

We obtained ratings of faces’ perceived age (‘How old does this person look?’), with responses collected in 5-year ranges/brackets (i.e., 16-20, 21-25, …, 76-80, 81+).

3.1.4.2 Unstandardised Neutral Faces

3.1.4.2.1 Trait Ratings

We obtained ratings of faces’ attractiveness, dominance, and trustworthiness on scales ranging from 1 (not at all) to 7 (very).

3.1.4.3 Standardised Emotional Faces

3.1.4.3.1 Emotion Categorisation

We obtained impressions of the emotion each person was expressing (‘What emotion is this person expressing?’), with raters choosing one of: anger, disgust, fear, happiness, sadness, surprise, other. Here, raters categorized a counterbalanced mixture of expressions (from one of six counterbalanced conditions) rather than faces all showing the same expression. The 201 identities with emotion images were divided into six groups of up to 34 identities, and each counterbalanced condition showed a different emotion for each of the six groups, such that no identity was shown more than once to each rater. Since not all identities had all six emotions, the number of images in each counterbalanced condition ranged from 179 to 193.
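
One way to build such a counterbalancing scheme is a Latin square over groups and emotions, sketched below in Python. The round-robin group assignment, the rotation rule, and the identity labels are assumptions for illustration; the text specifies only that there were six groups, six conditions, and one emotion per group within each condition.

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def assign_groups(identities, n_groups=6):
    """Deal identities round-robin into n_groups groups of near-equal size."""
    return [identities[g::n_groups] for g in range(n_groups)]


def build_conditions(groups, emotions=EMOTIONS):
    """Latin square: in condition c, group g is shown with emotion (g + c) mod k.

    Each identity appears once per condition, and across the k conditions
    each identity is paired with every emotion exactly once.
    """
    k = len(emotions)
    conditions = []
    for c in range(k):
        trials = []
        for g, members in enumerate(groups):
            emotion = emotions[(g + c) % k]
            trials.extend((identity, emotion) for identity in members)
        conditions.append(trials)
    return conditions


# Hypothetical identity labels for the 201 identities mentioned in the text
identities = [f"id{n:03d}" for n in range(1, 202)]
groups = assign_groups(identities)
conditions = build_conditions(groups)
```

In this idealised sketch every condition contains all 201 identities. In the actual study the conditions held 179 to 193 images because some identities lacked images for some emotions; dropping the missing identity-emotion pairs from each condition would reproduce that.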

3.1.4.3.2 Emotion Intensity Ratings

We obtained ratings of how intensely faces expressed each intended emotion (‘How intensely is this person expressing anger [disgust, fear, happiness, sadness, surprise]?’) from 1 (not at all) to 7 (very). Here, each rater rated all faces showing a single emotion expression (e.g., all angry faces) and rated the intensity of the intended expression only (e.g., angry faces were rated only on anger intensity).

3.1.5 Procedure

We collected ratings via Experimentum (L. DeBruine et al., 2020); a structure file for the exact experimental setup is available on the OSF (https://osf.io/7w5du/files/zc2b3). After a brief introduction to the study and online informed consent, each participant (rater) was randomly allocated to one of the ratings (e.g., rating all 205 standardised neutral faces on how memorable they look). All available faces were displayed one at a time, in a randomised order for each rater. The question/prompt and response scale remained visible at the top of the screen, above the photo throughout the study. The study automatically progressed to the next trial once the rater responded by clicking on the response scale. There was no time limit to provide a response. We included seven attention checks embedded in the study, which directed raters to provide a specific response.

Following rating or categorising all stimuli, raters self-reported their gender, age, race/ethnicity, country of residence, and device type used for the study (desktop, laptop, tablet, mobile). They also completed an honesty/attention check question, asking them if they engaged with the study seriously, with assurance of payment regardless of response (choosing from ‘no, I was not really paying attention’ and ‘yes, I tried to give my authentic first impressions’).

We recruited fluent English-speaking raters through Prolific. We collected all data in May 2025⁵.

3.1.6 Participants

We aimed to collect 100 raters per rating condition to achieve stable averages and allow for exclusions (Hehman et al., 2025), totalling 2100 raters. Altogether, 2115 raters completed the study. See Results for exclusions and demographics of the final sample.

3.2 Results

3.2.1 Data Cleaning and Exclusions

In the raters’ demographic questionnaire, we standardized participants’ recording of their race/ethnicity similarly to the models’ by recoding their inputs into one of seven categories (“White”, “Black”, “Asian”, “Indigenous/Pacific Islander”, “MENA” (Middle Eastern or North African), “Latine”, or “Mixed”), or as “Ambiguous label” when this was not possible. Any non-entries were recoded as NA.

A total of 2115 raters completed 2158 rating or categorisation tasks. We found that some raters did not complete all trials in tasks, some raters completed more than one task, and some raters completed more than the maximum number of trials in a task (likely by restarting the study or bypassing the back button block). Therefore, we removed incomplete tasks from our data, retained raters’ first complete tasks, and filtered out duplicate trials by keeping only raters’ first ratings for a duplicated trial. After these exclusions, and before implementing the pre-registered plan for data exclusions, we had complete and clean data from 1936 raters.
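
The cleaning logic described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' pipeline: the row fields (`rater`, `task`, `stimulus`, `response`), the chronological ordering of rows, and the choice to remove duplicate trials before checking completeness are all hypothetical.

```python
def clean_ratings(rows, expected_trials):
    """Keep each rater's first complete task, with duplicate trials removed.

    rows: list of dicts with keys "rater", "task", "stimulus", "response",
          assumed to be in chronological order.
    expected_trials: dict mapping each task to its number of stimuli.
    """
    # 1. Keep only the first rating of any repeated (rater, task, stimulus).
    seen = set()
    deduped = []
    for row in rows:
        key = (row["rater"], row["task"], row["stimulus"])
        if key not in seen:
            seen.add(key)
            deduped.append(row)

    # 2. Identify complete tasks: all expected trials were answered.
    counts = {}
    for row in deduped:
        key = (row["rater"], row["task"])
        counts[key] = counts.get(key, 0) + 1
    complete = {key for key, n in counts.items() if n == expected_trials[key[1]]}

    # 3. For each rater, retain the first complete task they started.
    first_task = {}
    for row in deduped:
        if (row["rater"], row["task"]) in complete:
            first_task.setdefault(row["rater"], row["task"])

    return [r for r in deduped if first_task.get(r["rater"]) == r["task"]]


# Toy example: r1 repeats a trial and completes two tasks; r2 stops early.
rows = [
    {"rater": "r1", "task": "A", "stimulus": "s1", "response": 3},
    {"rater": "r1", "task": "A", "stimulus": "s2", "response": 4},
    {"rater": "r1", "task": "A", "stimulus": "s1", "response": 5},
    {"rater": "r1", "task": "B", "stimulus": "s1", "response": 2},
    {"rater": "r1", "task": "B", "stimulus": "s2", "response": 2},
    {"rater": "r2", "task": "A", "stimulus": "s1", "response": 1},
]
cleaned = clean_ratings(rows, {"A": 2, "B": 2})
```

Here only the first two rows survive: the repeated trial and the second task from r1 are dropped, as is the incomplete task from r2.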

Our pre-registered plan for data exclusions included removing raters who gave overly consistent responses, responded overly fast, self-reported not taking the study seriously when asked whether or not they completed the study authentically, or failed attention checks. In total, we excluded 49 raters for our pre-registered reasons for data exclusions. Table 3 shows the number of raters we excluded for each of our reasons.

Table 3: The number of raters excluded from analysis.

Reason for exclusion	N
Overly consistent	13
Overly fast	20
Failed honesty check	2
Failed attention checks	19
Total	49

We defined overly consistent raters as those who gave the same response on at least 90% of trials. During agreement analyses (see Agreement Indicators for the Ratings), we found that 4 additional raters gave only two unique responses across all trials, and we also excluded these as overly consistent. We defined overly fast raters as those whose median reaction time fell below the 1st percentile of the overall distribution of median reaction times (see Figure 4 for the distribution of median reaction times). For our attention checks, the threshold for inclusion was to accurately complete six or more of the seven attention checks (i.e., participants were excluded if they failed more than one of the checks). Lastly, participants were excluded based on a self-reported honesty check, i.e., if they reported not taking the study seriously. After these exclusions, we had 1887 raters in our sample (see Table 4 for the number of raters per task).
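
The four exclusion criteria can be expressed compactly in code. The Python sketch below is illustrative only: the data layout, the function names, and the linear-interpolation percentile are assumptions, and it does not cover the four additional raters excluded for giving only two unique responses. The thresholds (90% identical responses, 1st percentile of median reaction times, at least six attention checks passed) come from the text.

```python
from collections import Counter
from statistics import median


def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def exclusion_reasons(raters, consistency=0.90, min_checks_passed=6):
    """Return {rater_id: [reasons]} for raters meeting any exclusion criterion.

    raters maps a rater id to a dict with hypothetical keys:
    "responses" (list), "rts" (reaction times), "checks_passed" (int),
    and "honest" (bool, answer to the honesty question).
    """
    medians = {rid: median(r["rts"]) for rid, r in raters.items()}
    cutoff = percentile(list(medians.values()), 1)
    flagged = {}
    for rid, r in raters.items():
        reasons = []
        # Share of trials on which the rater gave their most common response
        top_share = Counter(r["responses"]).most_common(1)[0][1] / len(r["responses"])
        if top_share >= consistency:
            reasons.append("overly consistent")
        if medians[rid] < cutoff:
            reasons.append("overly fast")
        if not r["honest"]:
            reasons.append("failed honesty check")
        if r["checks_passed"] < min_checks_passed:
            reasons.append("failed attention checks")
        if reasons:
            flagged[rid] = reasons
    return flagged


varied = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
raters = {
    "a": {"responses": [4] * 10, "rts": [900] * 10, "checks_passed": 7, "honest": True},
    "b": {"responses": varied, "rts": [100] * 10, "checks_passed": 7, "honest": True},
    "c": {"responses": varied, "rts": [1000] * 10, "checks_passed": 5, "honest": True},
    "d": {"responses": varied, "rts": [1100] * 10, "checks_passed": 7, "honest": False},
    "e": {"responses": varied, "rts": [1200] * 10, "checks_passed": 6, "honest": True},
}
flagged = exclusion_reasons(raters)
```

Because a rater can meet more than one criterion, the function returns a list of reasons per rater rather than a single label.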

Figure 4: Median reaction time distribution

Table 4: Number of raters per task. The emotion categorisation task was split into six counterbalanced versions (emotion 1 to emotion 6).

Rating Task	N
attractive	84
dominant	91
trustworthy	88
attractive (unstd)	84
dominant (unstd)	94
trustworthy (unstd)	87
memorable	82
gender-typical	80
age	85
anger	84
disgust	91
fear	90
happiness	91
sadness	103
surprise	84
emotion 1	97
emotion 2	87
emotion 3	88
emotion 4	97
emotion 5	86
emotion 6	114

Anonymous data are available on GitHub (https://github.com/ManyFacesTeam/pilot-ratings).

3.2.2 Rater Demographics

We collected the following demographic data from our raters (N = 1652 provided data): age, gender, country of residence, race/ethnicity, and the device on which the ratings were completed. See Figure 5 and Tables 5-7 for demographic information.

Figure 5: Histogram of rater age, separated by gender

Table 5: Rater country of residence

Country of Residence	N
ZA	822
GB	254
US	221
PL	49
PT	44
CA	41
ES	34
IT	27
KE	26
GR	19
DE	18
HU	13
IN, MX	11
CL	10
FR	9
IE	6
BR, NL, SE	5
AU, HR	4
CZ, EE, LV	3
BE, CH, DK, IL, SI	2
AR, AT, CN, DZ, FI, KZ, MA, SK	1

Table 6: Rater race/ethnicity

Ethnicity	N
Black	928
White	580
Asian	69
Latine	34
Mixed	28
Ambiguous label	16
MENA	5
Indigenous / Pacific Islander	3
Missing	2

Table 7: Devices used by raters to complete the study

Device Used	N
laptop	743
desktop	465
phone	418
tablet	37
missing	2

3.2.3 Agreement Indicators for the Ratings

3.2.3.1 Standardized Trait Ratings

In the first step, we calculated intraclass correlation coefficients [ICC(2,k)] for ratings of standardized images. The number of raters ranged from 80 to 91 for each trait. The average reliability across raters was excellent for ratings of attractiveness, dominance, trustworthiness, and gender-typicality, but poorer for ratings of memorability (see Table 8).

Table 8: Intraclass correlation coefficients for ratings of standardized neutral faces

			95% CI
Rating	N	ICC	lower	upper
attractive	84	0.93	0.91	0.94
dominant	91	0.88	0.85	0.90
trustworthy	88	0.88	0.85	0.90
gender-typical	80	0.87	0.84	0.90
memorable	82	0.58	0.50	0.65

Next, we examined the number of raters required for trait ratings to reach stable levels of reliability, defined as ICC(2,k) values of 0.75 (see Figure 6). Attractiveness ratings stabilized with the fewest participants, reaching a median ICC(2,k) of 0.75 at approximately 20 raters. It was also the only trait to exceed a median ICC(2,k) of 0.90, which occurred with 60 raters. Ratings for dominance, trustworthiness, and gender-typicality reached a median ICC(2,k) of 0.75 with 40 raters.

(a) Dashed horizontal lines indicate ICC(2,k) of 0.75 and 0.90, respectively.
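
One way to obtain median ICC(2,k) values as a function of the number of raters is to repeatedly draw random subsets of raters and recompute the coefficient. The sketch below illustrates that idea only. The resampling scheme (drawing raters without replacement), the number of draws, and the function names are assumptions made for illustration, since this section does not specify the exact procedure behind Figure 6.

```python
import random
from statistics import mean, median


def icc2k(ratings):
    """ICC(2,k) for a complete images-by-raters matrix (list of rows)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in ratings)
    ss_cols = n * sum(
        (mean(row[j] for row in ratings) - grand) ** 2 for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)


def median_icc_by_rater_count(ratings, counts, n_draws=200, seed=1):
    """Median ICC(2,k) across random rater subsets of each requested size.

    Assumption: every size in `counts` is at least 2 and no larger than
    the number of raters (columns) in `ratings`.
    """
    rng = random.Random(seed)
    n_raters = len(ratings[0])
    result = {}
    for count in counts:
        values = []
        for _ in range(n_draws):
            cols = rng.sample(range(n_raters), count)  # without replacement
            subset = [[row[j] for j in cols] for row in ratings]
            values.append(icc2k(subset))
        result[count] = median(values)
    return result
```

The smallest rater count whose median reaches a chosen threshold (for example 0.75) can then be read off the returned dictionary.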

To further check inter-rater agreement, we used both Cronbach’s alpha (α) and McDonald’s omega total (ωt; see Table 9). For four of the five traits, both α and ωt were 0.90 or higher, indicating excellent inter-rater agreement. For the fifth trait (memorability), α was lower (0.66), suggesting reduced inter-rater agreement. ωt for memorability was substantially higher (0.87). Because ωt, unlike α, does not assume that all raters reflect the common judgment equally well, this gap suggests that raters varied in how consistently they judged memorability, a pattern consistent with the lower ICC observed for this trait.

Table 9: Cronbach’s alpha and McDonald’s omega total for ratings of standardized neutral faces.

Rating	N	alpha	omega
attractive	84	0.96	0.96
dominant	91	0.90	0.92
trustworthy	88	0.91	0.93
gender-typical	80	0.92	0.93
memorable	82	0.66	0.87
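
As a point of reference for the α values in Table 9, the sketch below computes Cronbach’s alpha with raters treated as "items" and images as observations. It assumes a complete images-by-raters matrix; the function name and input format are illustrative. ωt is not sketched here because it requires fitting a factor model.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a complete matrix: rows are images, columns are raters.

    Assumption: no missing values, at least two raters and two images.
    """
    k = len(ratings[0])  # number of raters ("items")

    # Variance of each rater's ratings across images.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(k)]

    # Variance across images of the summed ratings.
    total_var = variance([sum(row) for row in ratings])

    return k / (k - 1) * (1 - sum(rater_vars) / total_var)
```

Unlike ICC(2,k), this index ignores differences in raters’ mean levels, so it reflects consistency of the rank ordering of images rather than absolute agreement.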

3.2.3.2 Unstandardized Trait Ratings

As with standardized trait ratings, we calculated intraclass correlation coefficients for ratings of unstandardized images. The number of raters ranged from 84 to 94. Average inter-rater reliability was excellent for all three traits (see Table 10). This was further supported by Cronbach’s α ranging from 0.88 to 0.95 and McDonald’s ωt ranging from 0.90 to 0.96 (see Table 11).

Table 10: Agreement for ratings of unstandardized neutral faces

Rating	N	ICC	95% CI lower	95% CI upper
attractive (unstd)	84	0.92	0.91	0.94
dominant (unstd)	94	0.84	0.80	0.87
trustworthy (unstd)	87	0.86	0.83	0.89

Table 11: Cronbach’s alpha and McDonald’s omega total for ratings of unstandardized neutral faces.

Rating	N	alpha	omega
attractive (unstd)	84	0.95	0.96
dominant (unstd)	94	0.88	0.90
trustworthy (unstd)	87	0.90	0.92

3.2.3.3 Emotion Intensity Ratings

Next, we calculated intraclass correlation coefficients, Cronbach’s α, and McDonald’s ωt for emotion intensity ratings. The number of raters ranged from 84 to 103. The average reliability across raters was excellent for all emotions, with ICC(2,k) ranging from 0.95 to 0.98, Cronbach’s α ranging from 0.97 to 0.99, and McDonald’s ωt ranging from 0.97 to 0.99 (see Table 12 and Table 13).

Table 12: Intraclass correlation coefficients for ratings of emotional faces’ emotion intensity

Rating	N	ICC	95% CI lower	95% CI upper
anger	84	0.95	0.94	0.96
disgust	91	0.98	0.98	0.98
fear	90	0.96	0.95	0.97
happiness	91	0.98	0.98	0.99
sadness	103	0.95	0.94	0.96
surprise	84	0.98	0.98	0.98

Table 13: Cronbach’s alpha and McDonald’s omega total for ratings of emotional faces’ emotion intensity

Rating	N	alpha	omega
anger	84	0.97	0.97
disgust	91	0.98	0.99
fear	90	0.98	0.98
happiness	91	0.99	0.99
sadness	103	0.97	0.97
surprise	84	0.98	0.99

3.2.4 Points of Stability

To determine the number of raters required for stable ratings, we computed the point of stability (POS; Hehman et al., 2025), defined as the smallest sample size at which 95% of resampled means fall within a corridor of stability of ±0.5 points and do not again exceed it. As shown in Figures 7-9, the ratings of standardized images reached stability between 37 and 44 raters, the ratings of unstandardized images between 37 and 41 raters, and emotion intensity ratings of standardized emotional faces between 29 and 50 raters.

Figure 7: Points of stability for ratings of standardised neutral faces

Figure 8: Points of stability for ratings of unstandardised neutral faces

Figure 9: Points of stability for emotion intensity ratings of standardised emotional faces of each expression
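
The point-of-stability definition given above can be illustrated with a minimal sketch for the ratings of a single image. Several details here are assumptions made for illustration rather than a description of the procedure in Hehman et al. (2025) or of our analysis code: the corridor is centred on the full-sample mean, raters are resampled without replacement, and the function name and default number of resamples are hypothetical.

```python
import random
from statistics import mean


def point_of_stability(ratings, half_width=0.5, coverage=0.95,
                       n_resamples=500, seed=1):
    """Smallest number of raters from which resampled means stay in the corridor.

    ratings: list of numeric ratings for one image (one value per rater).
    """
    rng = random.Random(seed)
    target = mean(ratings)        # assumed centre of the corridor
    n_max = len(ratings)

    # stable[i] is True when sample size i + 1 meets the coverage criterion.
    stable = []
    for n in range(1, n_max + 1):
        hits = 0
        for _ in range(n_resamples):
            sample = rng.sample(ratings, n)   # without replacement (assumption)
            if abs(mean(sample) - target) <= half_width:
                hits += 1
        stable.append(hits / n_resamples >= coverage)

    # Walk back from the largest size: the POS is the smallest size after
    # which the criterion is never violated again.
    pos = n_max
    for n in range(n_max, 0, -1):
        if not stable[n - 1]:
            break
        pos = n
    return pos
```

Applied per image and then summarized across images, a procedure of this kind yields one point of stability per rating type; with sampling without replacement the criterion is always met at the full sample size, so the function always returns a value.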

3.2.5 Descriptive Statistics

In this section, we report the descriptive statistics for ratings of the neutral standardized images, the neutral unstandardized images, the emotion categorization task, the intensity of expressed emotions, and the ages of the models.

3.2.5.1 Standardized Neutral Trait Ratings

Trustworthiness, dominance, and memorability ratings were approximately normally distributed and showed similar patterns of central tendency and variance. In contrast, attractiveness ratings were somewhat right-skewed, reflected in a relatively lower mean, while gender-typicality ratings were left-skewed, with a relatively higher mean (see Table 14 for descriptive statistics and Figure 10 for rating distributions).

Table 14: Descriptive statistics for ratings of standardized neutral faces

Rating	N	M	SD
memorable	82	4.14	0.27
dominant	91	3.91	0.46
attractive	84	3.44	0.62
trustworthy	88	4.20	0.46
gender-typical	80	5.16	0.48

Figure 10: Histograms for ratings of standardized neutral faces

3.2.5.2 Unstandardized Neutral Trait Ratings

The ratings for dominance were approximately normally distributed, whereas the ratings for attractiveness were slightly skewed right and the ratings for trustworthiness were slightly skewed left, reflected in their relatively low and high mean ratings, respectively (see Table 15 for descriptive statistics and Figure 11 for rating distributions).

Table 15: Descriptive statistics for ratings of unstandardized neutral faces

Rating	N	M	SD
attractive (unstd)	84	3.47	0.61
trustworthy (unstd)	87	4.31	0.46
dominant (unstd)	94	3.87	0.41

Figure 11: Histograms for ratings of unstandardized neutral faces

3.2.5.3 Emotion Categorization and Intensity Ratings

Next, we explored raters’ emotion categorizations and the perceived intensity of the emotions expressed by our models. Raters’ categorizations generally aligned with the models’ intended expressions, with the greatest alignment observed for expressions of happiness and the least for fear (Table 16, Figure 12).

Table 16: Emotion categorisation proportions

	Rated Expression
Actual Model Expression	anger	disgust	fear	happiness	sadness	surprise	other
anger	0.31	0.12	0.05	0.05	0.20	0.07	0.20
disgust	0.13	0.38	0.07	0.09	0.15	0.07	0.12
fear	0.06	0.06	0.14	0.08	0.12	0.35	0.19
happiness	0.00	0.01	0.01	0.91	0.02	0.01	0.04
sadness	0.12	0.09	0.06	0.04	0.41	0.04	0.24
surprise	0.02	0.03	0.07	0.21	0.05	0.51	0.12

(a) Downward arrows indicate correct categorisations (categorisations matching intended model expression). A=anger, D=disgust, F=fear, H=happiness, S=sadness, U=surprise, O=other
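
To make the pattern in Table 16 easier to inspect, the following sketch reads the proportion of matching categorizations and the most frequent response for each intended expression directly from the table values. The variable and function names are illustrative; the numbers are copied from Table 16.

```python
# Column order of the "Rated Expression" categories in Table 16.
LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "other"]

# Rows: intended (actual) model expression; values: proportion of each response.
TABLE_16 = {
    "anger":     [0.31, 0.12, 0.05, 0.05, 0.20, 0.07, 0.20],
    "disgust":   [0.13, 0.38, 0.07, 0.09, 0.15, 0.07, 0.12],
    "fear":      [0.06, 0.06, 0.14, 0.08, 0.12, 0.35, 0.19],
    "happiness": [0.00, 0.01, 0.01, 0.91, 0.02, 0.01, 0.04],
    "sadness":   [0.12, 0.09, 0.06, 0.04, 0.41, 0.04, 0.24],
    "surprise":  [0.02, 0.03, 0.07, 0.21, 0.05, 0.51, 0.12],
}


def summarize(table, labels):
    """Return {intended: (matching proportion, most frequent response)}."""
    out = {}
    for intended, row in table.items():
        hit = row[labels.index(intended)]
        top_index = max(range(len(row)), key=row.__getitem__)
        out[intended] = (hit, labels[top_index])
    return out


summary = summarize(TABLE_16, LABELS)
for intended, (hit, top) in summary.items():
    print(f"{intended}: matching = {hit:.2f}, most frequent response = {top}")
```

Read this way, fear is the only intended expression for which the most frequent response was a different category (surprise), whereas for every other expression the intended category was also the most frequent response.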

Next, we calculated the average ratings of emotion intensity for each emotion (Table 17). These ratings were very similar, on average, between the different emotions except for happiness, which was rated with greater intensity than all other emotions. On average, these ratings were located around the middle of the 7-point scale. Figure 13 shows the distribution of ratings on the scale for each emotion.

Table 17: Descriptive statistics for emotion intensity ratings

Rating	N	M	SD
anger	84	3.41	0.82
disgust	91	3.64	1.12
fear	90	3.30	0.95
happiness	91	4.41	1.09
sadness	103	3.69	0.74
surprise	84	3.54	1.26

Figure 13: Histograms of emotion intensity ratings

3.2.5.4 Perceived Age

The distribution of raters’ perceptions of models’ ages is shown in Figure 14. The most frequently chosen age range across raters and models was 26-30 years. The correlation between models’ mode perceived age and actual age appears in Figure 15.

Figure 15: Correlation between models’ actual age and mode perceived age

4 Discussion

4.1 The value of the current work

This database provides a uniquely diverse face image stimulus set in terms of model nationality and image variety (appearance standardization, viewing angle, and facial expression). The diversity of images within the database makes it potentially useful to address a variety of research questions in multiple subfields of psychology (e.g., face recognition, emotion). It accomplishes this while being transparent and reproducible via the openly available protocol. Additionally, the database need not remain static: Other researchers can follow the protocol and add to the database, further increasing its sample size and diversity. This protocol can also serve as a template for researchers interested in collecting stimulus images, in terms of highlighting the kinds of details that need to be considered and documented.

The data associated with the images are also valuable. First, we collected models’ self-reported demographic information, which is useful for various kinds of research questions. The validation/norming data we collected also provide a useful starting point for researchers interested in using the images, for example enabling researchers to choose subsets of images most suited to address their research questions. These data also provide information about the minimum number of raters needed for reliable mean ratings for different judgments (see Hehman et al., 2025).

This work also provides proof of concept that various labs, all following the same protocol and using the same equipment, can take comparable images to form a coherent face database. This broadens opportunities for future image database collection: A database need not be collected entirely by one lab in one location.

4.2 Reflections on the Process & Limitations

The leadership team found that having a broad variety of areas of expertise and diverse research backgrounds within the broader team was invaluable in both developing the database and in troubleshooting issues. We moreover found that setting out expectations at the start of the project through a collaboration agreement was essential. However, we ran into various issues throughout the process, which are common to Big Team Science initiatives. For example, with so many labs and individual researchers involved, timelines for completion necessarily stretched. This made momentum difficult to maintain at times. Planning in generous buffer time for such large-scale projects and managing team members’ expectations around timelines is therefore essential. We also faced difficulty in finding an optimal way to communicate effectively with all team members, given varying preferences (e.g., via email vs. other team communication platforms). It is perhaps worth surveying members of a big team at project start about communication possibilities, as well as outlining more specific communication expectations in a collaboration agreement.

After Phase 1, we surveyed the ManyFaces research team for feedback on the process of collecting images using the protocol. Team members commonly expressed the desire for additional concise guidance in following the protocol, including an abbreviated protocol with numbered steps to follow during data collection and a video tutorial (in addition to the provided illustrative images) showing the necessary steps for researchers unfamiliar with camera equipment. These are requests that we can incorporate into future protocol updates. Doing so could both make the process easier for researchers and minimize errors and deviations from the protocol, ensuring greater consistency between sites.

The major issue raised by the research team was the difficulty of eliciting facial expressions as described in the protocol. The protocol took a straightforward instructive approach to emotion elicitation (e.g., asking participants to face the camera with the expression they would make if they were feeling happy), following consultation with emotion experts in ManyFaces and discussion of feasibility with the wider team. However, models self-reported difficulty with the facial expression posing, in line with researchers’ feedback. The emotion validation data also indicate that models struggled to pose the facial expressions of emotion. The data furthermore suggest that perceivers had difficulty identifying emotions from these posed facial expressions. In support of this idea, recent research shows that posed expressions do not appear genuine and reveals differing perceptions of posed and spontaneous expressions (Dawel et al., 2017, 2025). Future work, including possible updates to our own protocol, may therefore focus on developing a better framework to elicit facial expressions. This could include taking inspiration from databases of naturally-induced emotion (e.g., Miolla et al., 2023; Sneddon et al., 2012) and considering potential cultural differences and specific local needs.

The research team also raised several comments about the equipment that was and was not provided, specifically the flimsiness of the light stand and the missing standardized backdrop (due to shipping constraints, participating researchers were to source a white background, e.g., a wall, seamless fabric, or paper). It is worth highlighting that the images were captured with equipment selected under several practical constraints: budget, university-approved vendors, shipping logistics, interoperability, and ease of use for non-experts in varying conditions. The available project budget and administrative restrictions on approved vendors (their offerings and stock) for the University of Glasgow primarily limited the attainable camera, lens, and lighting setup. In general, we opted to create a setup that would not be prohibitively expensive to acquire, could be shipped in a single parcel, and would be compatible (substitutable) with equipment other researchers may have available. The last consideration was creating a setup with the fewest degrees of freedom, and thus less room for error. Therefore, we opted to use an LED ring light with a camera mounted inside the ring in landscape orientation on a single stand (vs. portrait orientation using a different kind of mount or separate stands for the camera and light). Ring lights have become ubiquitous and easy to acquire in recent years; they only need to be positioned square, level, and in front of the sitter. Other, more complex setups can be created; however, the price, mobility, and repeatability of such setups may represent major constraints to consistent and reliable image collection. It is therefore worth considering trade-offs between different kinds of equipment and setups in future work.

In addition to our choices of equipment, we also made certain pragmatic choices during image processing that future work could improve on. For example, we white-balanced the images rather than fully colour-correcting them, as we were able to create a fully reproducible scripted method for the former process, but not the latter. This was due to limitations such as varied colour checker chart placement and orientation in models’ photos and a lack of expertise on the research team to work around this in a reproducible way. We therefore opted to simply white-balance to keep the process entirely open and reproducible, rather than fully colour-correct using a manual and not fully reproducible method. The process of aligning and sizing the images was also driven by cropping needs. That is, due to the landscape orientation of the photos, there was limited vertical space that we could crop, constraining the possibilities for face alignment and image size. Future work could consider using alternative equipment setups to enable capturing images in portrait mode. Variation in faces’ size in the images, due to some deviations from the distances specified in the protocol, also affected the image processing steps. This could be addressed in future work by more clearly highlighting not only the key aspects of the protocol that should be kept constant, but also why these aspects should be constant (i.e., clarifying the reasoning to all team members).

Finally, we collected images of 211 models, but no single image type included images of all models. Rather, the maximum number of models represented in a single image type was 205 (standardized neutral, front-facing). This led to missing perceived age/ethnicity data for six models who had unstandardized but not standardized neutral images available.

4.3 Conclusion

Here, we introduce a new diverse face image database, which will be useful to researchers interested in a variety of questions related to social perception, person recognition, and vision science. We demonstrate that a cohesive database can be compiled across a variety of sites, opening doors for future additions to this database and the development of future multi-lab databases.

References

Bainbridge, W. A., Isola, P., & Oliva, A. (2013). The intrinsic memorability of face photographs. Journal of Experimental Psychology: General, 142(4), 1323–1334. https://doi.org/10.1037/a0033872

Bastanfard, A., Nik, M. A., & Dehshibi, M. M. (2007). Iranian Face Database with age, pose and expression. 2007 International Conference on Machine Vision, 50–55. https://doi.org/10.1109/ICMV.2007.4469272

Burton, A. M., White, D., & McNeill, A. (2010). The Glasgow Face Matching Test. Behavior Research Methods, 42(1), 286–291. https://doi.org/10.3758/BRM.42.1.286

Chen, J. M., Norman, J. B., & Nam, Y. (2021). Broadening the stimulus set: Introducing the American Multiracial Faces Database. Behavior Research Methods, 53(1), 371–389. https://doi.org/10.3758/s13428-020-01447-8

Cook, R., & Over, H. (2021). Why is the literature on first impressions so focused on White faces? Royal Society Open Science, 8(9), 211146. https://doi.org/10.1098/rsos.211146

Courset, R., Rougier, M., Palluel-Germain, R., Smeding, A., Jonte, J. M., Chauvin, A., & Muller, D. (2018). The Caucasian and North African French Faces (CaNAFF): A Face Database. International Review of Social Psychology, 31(1). https://doi.org/10.5334/irsp.179

Dawel, A., Krumhuber, E. G., & Palermo, R. (2025). Faking It Isn’t Making It: Research Needs Spontaneous and Naturalistic Facial Expressions. Affective Science. https://doi.org/10.1007/s42761-025-00320-1

Dawel, A., Wright, L., Irons, J., Dumbleton, R., Palermo, R., O’Kearney, R., & McKone, E. (2017). Perceived emotion genuineness: Normative ratings for popular facial expression stimuli and the development of perceived-as-genuine and perceived-as-fake sets. Behavior Research Methods, 49(4), 1539–1562. https://doi.org/10.3758/s13428-016-0813-2

DeBruine, L. M., Holzleitner, I. J., Tiddeman, B., & Jones, B. C. (2022). Reproducible Methods for Face Research. PsyArXiv. https://doi.org/10.31234/osf.io/j2754

DeBruine, L., Lai, R., Jones, B., Abdullah, R., & Mahrholz, G. (2020). Experimentum. Zenodo. https://doi.org/10.5281/zenodo.4010579

Ebner, N. C., Riediger, M., & Lindenberger, U. (2010). FACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behavior Research Methods, 42(1), 351–362. https://doi.org/10.3758/BRM.42.1.351

Freeman, J. B., & Johnson, K. L. (2016). More Than Meets the Eye: Split-Second Social Perception. Trends in Cognitive Sciences, 20(5), 362–374. https://doi.org/10.1016/j.tics.2016.03.003

Gao, W., Cao, B., Shan, S., Chen, X., Zhou, D., Zhang, X., & Zhao, D. (2008). The CAS-PEAL Large-Scale Chinese Face Database and Baseline Evaluations. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1), 149–161. https://doi.org/10.1109/TSMCA.2007.909557

Hehman, E., Xie, S. Y., Ofosu, E. K., & Nespoli, G. A. (2025). Assessing the Point at Which Averages Are Stable: A Tutorial in the Context of Impression Formation. Social Cognition, 43(5), 488–501. https://doi.org/10.1521/soco.2025.43.5.488

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61–83. https://doi.org/10.1017/S0140525X0999152X

ImageMagick Studio LLC. (2024). ImageMagick (Version 7.1.1) [Computer software]. https://imagemagick.org

Jack, R. E., Sun, W., Delis, I., Garrod, O. G. B., & Schyns, P. G. (2016). Four not six: Revealing culturally common facial expressions of emotion. Journal of Experimental Psychology: General, 145(6), 708–730. https://doi.org/10.1037/xge0000162

Jones, B. C., DeBruine, L. M., Flake, J. K., Liuzza, M. T., Antfolk, J., Arinze, N. C., Ndukaihe, I. L. G., Bloxsom, N. G., Lewis, S. C., Foroni, F., Willis, M. L., Cubillas, C. P., Vadillo, M. A., Turiegano, E., Gilead, M., Simchon, A., Saribay, S. A., Owsley, N. C., Jang, C., … Coles, N. A. (2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour, 5(1), 159–169. https://doi.org/10.1038/s41562-020-01007-2

Lakshmi, A., Wittenbrink, B., Correll, J., & Ma, D. S. (2021). The India Face Set: International and Cultural Boundaries Impact Face Impressions and Perceptions of Category Membership. Frontiers in Psychology, 12. https://www.frontiersin.org/articles/10.3389/fpsyg.2021.627678

Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H. J., Hawk, S. T., & Knippenberg, A. van. (2010). Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8), 1377–1388. https://doi.org/10.1080/02699930903485076

Lundqvist, D., Flykt, A., & Öhman, A. (2015). Karolinska Directed Emotional Faces. https://doi.org/10.1037/t27732-000

Ma, D. S., Correll, J., & Wittenbrink, B. (2015). The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4), 1122–1135. https://doi.org/10.3758/s13428-014-0532-5

Ma, D. S., Kantner, J., & Wittenbrink, B. (2021). Chicago Face Database: Multiracial expansion. Behavior Research Methods, 53(3), 1289–1300. https://doi.org/10.3758/s13428-020-01482-5

Martinez, J. E. (2025). Facecraft: Race Reification in Psychological Research With Faces. Perspectives on Psychological Science, 20(1), 182–194. https://doi.org/10.1177/17456916231194953

Meyers, C., Garay, M., & Pauker, K. (2024). Hawai‘i Face Database: A racially and ethnically diverse set of facial stimuli. https://doi.org/10.31234/osf.io/wde4b

Miolla, A., Cardaioli, M., & Scarpazza, C. (2023). Padova Emotional Dataset of Facial Expressions (PEDFE): A unique dataset of genuine and posed emotional facial expressions. Behavior Research Methods, 55(5), 2559–2574. https://doi.org/10.3758/s13428-022-01914-4

Oosterhof, N. N., & Todorov, A. (2008). The functional basis of face evaluation. Proceedings of the National Academy of Sciences, 105(32), 11087–11092. https://doi.org/10.1073/pnas.0805664105

Perrett, D. I. (2017). In Your Face: The new science of human attraction. Bloomsbury Publishing.

Roberts, S. O., & Mortenson, E. (2023). Challenging the White = Neutral Framework in Psychology. Perspectives on Psychological Science, 18(3), 597–606. https://doi.org/10.1177/17456916221077117

Saribay, S. A., Biten, A. F., Meral, E. O., Aldan, P., Třebický, V., & Kleisner, K. (2018). The Bogazici face database: Standardized photographs of Turkish faces with supporting materials. PLOS ONE, 13(2), e0192018. https://doi.org/10.1371/journal.pone.0192018

Schalk, J. van der, Hawk, S. T., Fischer, A. H., & Doosje, B. (2011). Moving faces, looking places: Validation of the Amsterdam Dynamic Facial Expression Set (ADFES). Emotion, 11(4), 907–920. https://doi.org/10.1037/a0023853

Sneddon, I., McRorie, M., McKeown, G., & Hanratty, J. (2012). The Belfast Induced Natural Emotion Database. IEEE Transactions on Affective Computing, 3(1), 32–41. https://doi.org/10.1109/T-AFFC.2011.26

Sutherland, C. A. M., Oldmeadow, J. A., Santos, I. M., Towler, J., Michael Burt, D., & Young, A. W. (2013). Social inferences from faces: Ambient images generate a three-dimensional model. Cognition, 127(1), 105–118. https://doi.org/10.1016/j.cognition.2012.12.001

Sutherland, C. A. M., & Young, A. W. (2022). Understanding trait impressions from faces. British Journal of Psychology, 113(4), 1056–1078. https://doi.org/10.1111/bjop.12583

Tottenham, N., Tanaka, J. W., Leon, A. C., McCarry, T., Nurse, M., Hare, T. A., Marcus, D. J., Westerlund, A., Casey, B., & Nelson, C. (2009). The NimStim set of facial expressions: Judgments from untrained research participants. Psychiatry Research, 168(3), 242–249. https://doi.org/10.1016/j.psychres.2008.05.006

Třebický, V., Fialová, J., Kleisner, K., & Havlíček, J. (2016). Focal Length Affects Depicted Shape and Perception of Facial Images. PLOS ONE, 11(2), e0149313. https://doi.org/10.1371/journal.pone.0149313

Třebický, V., Třebická Fialová, J., Bjornsdottir, R. T., & DeBruine, L. M. (2024). A Grim Image: Considerations for Methods of portrait photography in psychological science. OSF Preprints. https://doi.org/10.31219/osf.io/z4svb

Trzewik, M., Navon, M., Moran, T., Wardi, H., Langer, A., Hadad, B.-S., Sofer, C., & Reggev, N. (2025). The Israeli Face Database (IFD): A multi-ethnic database of faces with supporting social norming data. Behavior Research Methods, 57(7), 197. https://doi.org/10.3758/s13428-025-02723-1

Workman, C. I., & Chatterjee, A. (2021). The Face Image Meta-Database (fIMDb) & ChatLab Facial Anomaly Database (CFAD): Tools for research on face perception and social stigma. Methods in Psychology, 5, 100063. https://doi.org/10.1016/j.metip.2021.100063

Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685

Young, A. W., & Burton, A. M. (2017). Recognizing Faces. Current Directions in Psychological Science, 26(3), 212–217. https://doi.org/10.1177/0963721416688114

Zebrowitz, L. (1997). Reading Faces: Window To The Soul? Routledge.

Footnotes

A call to join the consortium was initially disseminated over social media, followed by an initial meeting with members to determine the general direction of the group and potential future projects. We then developed a webpage to enable more people to join and set up a network of communication to discuss and prioritise projects.
Pilot testing gave us the estimated/recommended amount of time for data collection, but some labs may have been able to devote less time due to resource or time constraints.
We did not fully colour-calibrate the images here (see Discussion) but future researchers may wish to do so.
The preregistration mis-stated the total number of models, rather than the maximum number of images, to be 205. The total number of models was 211.
This was slightly later than preregistered, due to pretesting and study setup delays. Note that we also planned to collect perceived race/ethnicity data, but decided against this, due to the regional specificity of race/ethnicity labels.