• Open access
  • Published: 02 June 2023

Measurement in STEM education research: a systematic literature review of trends in the psychometric evidence of scales

  • Danka Maric   ORCID: orcid.org/0000-0003-0488-9057 1 ,
  • Grant A. Fore   ORCID: orcid.org/0000-0002-5432-0726 1 ,
  • Samuel Cornelius Nyarko   ORCID: orcid.org/0000-0002-2434-5949 1 &
  • Pratibha Varma-Nelson   ORCID: orcid.org/0000-0003-2206-7874 1 , 2  

International Journal of STEM Education, volume 10, Article number: 39 (2023)

The objective of this systematic review is to identify characteristics, trends, and gaps in measurement in Science, Technology, Engineering, and Mathematics (STEM) education research.

We searched across several peer-reviewed sources, including a book, similar systematic reviews, conference proceedings, one online repository, and four databases that index the major STEM education research journals. We included empirical studies that reported on psychometric development of scales developed on college/university students for the context of post-secondary STEM education in the US. We excluded studies examining scales that ask about specific content knowledge or that contain fewer than three items. Results were synthesized using descriptive statistics.

Our final sample included N = 82 scales across N = 72 studies. Participants in the sampled studies were majority female and White, most scales were developed in an unspecified STEM/science or engineering context, and the most frequently measured construct was attitudes. Internal structure validity emerged as the most prominent validity evidence, with exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) being the most common. Reliability evidence was dominated by internal consistency evidence in the form of Cronbach’s alpha, with other forms being scarcely reported, if at all.

Limitations include only focusing on scales developed in the United States and in post-secondary contexts, limiting the scope of the systematic review. Our findings demonstrate that when developing scales for STEM education research, many types of psychometric properties, such as differential item functioning, test–retest reliability, and discriminant validity, are scarcely reported. Furthermore, many scales report only internal structure validity (EFA and/or CFA) and Cronbach’s alpha, which alone are not sufficient evidence. We encourage researchers to look towards the full spectrum of psychometric evidence both when choosing scales to use and when developing their own. While constructs such as attitudes and disciplines such as engineering were dominant in our sample, future work can fill the gaps by developing scales for underrepresented disciplines, such as the geosciences, and examining constructs such as engagement, self-efficacy, and perceived fit.

Measurement of students’ experiences and instructor interventions continues to be an important aspect of Science, Technology, Engineering and Mathematics (STEM) education. Given the enormous educational and research efforts devoted to understanding students’ development in STEM, advancing strategies that measure how well interventions and experiences work is imperative. Sound measurement strategies require access to measures, scales, and instruments with adequate psychometric evidence (Arjoon et al., 2013 ; Cruz et al., 2020 ; Decker & McGill, 2019 ; Margulieux et al., 2019 ).

Quantitative measurement instruments, such as surveys, are commonly used tools in, and an important aspect of, conducting discipline-based STEM education research (Arjoon et al., 2013; Decker & McGill, 2019; Knekta et al., 2019). Salmond (2008) emphasizes the importance of measures that “accurately reflect the phenomenon or construct of interest” (p. 28) when designing effective student experiences. Measurement of student experiences and teaching innovations, however, is a complex task, as there can be many elements and constructs within a single experience or innovation, which continues to challenge STEM researchers and educators (Kimberlin & Winterstein, 2008). An even larger issue is the lack of measurement instruments grounded within holistic STEM theoretical frameworks to guide empirical research that specifically targets STEM students (Wang & Lee, 2019). Similarly, researchers need to be aware of measurement instruments available to them and be able to make choices appropriate for their research needs, which can be challenging due to the inability to locate validated and reliable measures (Decker & McGill, 2019).

There is a gap in post-secondary STEM education research when it comes to measurement and psychometric evidence. For example, a review of the 20 top mathematics, science, and STEM education journals found that less than 2% of empirically based publications in the last 5 years were instrument validation studies (Sondergelt, 2020 ).

Appianing and Van Eck (2018) share that whereas researchers have developed valid and reliable instruments to measure students’ experiences in STEM, most of them focus on middle and high school students rather than on college students. Moreover, Hobson et al. (2014) have suggested that the construction and usage of rubrics that effectively assess specific skill development, such as collaboration, critical thinking, and communication in research, continues to be a problem in STEM education. In summary, without effective measurement instruments and strategies that assess the efficacy of interventions and students’ experiences, it is difficult to trace and document students’ progress. Given that much of the past research effort in education has focused on the K-12 level, and that there has been an increase in research interest in post-secondary STEM education research and publications (Li & Xiao, 2022; Li et al., 2022), this is especially needed at the post-secondary level.

A systematic review that compiles relevant scales in STEM post-secondary education and assesses available psychometric evidence is one strategy to mitigate the challenges above.

Scholars have also emphasized the importance of having a holistic understanding of measurement. This includes the statistical and theoretical underpinnings of validity, as well as the psychometric measures and dimensions that represent what is measured and evaluated, which are critical in the development, selection, and implementation of measurement instruments (Baker & Salas, 1992 ; Knekta et al., 2019 ). Likewise, there is a call to bring the design, testing, and dissemination of measurement instruments in STEM education to the forefront if researchers wish to have their quantitative results viewed as scientific by broader audiences (Sondergelt, 2020 ). Thus, in this study we provide insight into the measurement trends in survey instruments utilized in STEM education research and establish a resource that can be used by STEM education researchers to make informed decisions about measurement selections.

Some work has already been done to this end, with similar studies examining psychometric evidence for scales, measures, or instruments in chemistry (Arjoon et al., 2013), engineering (Cruz et al., 2020), and computer science (Decker & McGill, 2019; Margulieux et al., 2019) education research. These studies have examined and reported on the psychometric evidence and the various constructs being measured in their respective fields, as well as suggested professional development in measurement training for educators and researchers. For example, Arjoon et al. (2013) assert that, to bridge the gap between what is known about measurement and what the actual accepted standards are, there is a need for measurement education within the chemistry education community. However, to our knowledge, no such study has been conducted across all of STEM education research. Thus, in the present study we build upon past work by conducting a systematic review of scales created for STEM education research, the psychometric evidence available for them, and the constructs they are measuring.

The purpose of this systematic literature review is twofold. First, we aim to examine the measurement trends in survey instruments utilized in STEM education research. For example, we are interested in identifying which validated and reliable surveys are currently used in STEM education contexts, as well as what each instrument is measuring. Second, we intend for this paper not to be a repository of STEM education instruments per se, but a tool that intermediate and expert STEM education and Discipline-Based Education Research (DBER) scholars can use to make informed decisions about measurement when conducting their research and collaborating with others. In other words, we aim to produce a systematic literature review that critically examines the current measurement and psychometric development trends in STEM education research and, in doing so, illustrates areas where STEM education research instrumentation might be lacking, where additional psychometric evaluation may be needed (as well as what tests those should be), and what kinds of surveys still need to be created and evaluated for STEM education research purposes. Our goal is to advance the development of robust and sound measurement instruments used in the study of STEM education, teaching, and learning by helping researchers address some of the measurement challenges they currently face. We hope that by providing such a resource and pushing for advancements in measurement, we will contribute to the overall quality and advancement of the study of education, teaching, and learning within STEM post-secondary classrooms.

Theoretical framework

Our theoretical framework was informed by the 2014 edition of the Standards, jointly published by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). The Standards defines and outlines criteria for creating and evaluating educational and psychological tests, a definition that encompasses scales and inventories. The Standards also provides criteria for test use and applications in psychological, educational, workplace, and evaluative contexts. For the purposes of the present review, we used the Standards’ definitions, operationalizations, and criteria for psychometric evidence (reliability and validity). An overview of the theoretical framework that guided the formation of the coding framework and decision-making is displayed in Fig. 1. The definitions and operationalizations we used in the coding framework can be found below under psychometric evidence.

Figure 1. Theoretical framework of psychometric evidence

To define the term “scale”, we draw on the Standards’ definition for test, as it encompasses the evaluative devices of tests, inventories, and scales, thus defining a scale as “a device or procedure in which a sample of an examinee’s behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process” (AERA, APA, NCME, 2014, p. 2). We conceptualize validity as “the degree to which evidence and theory support the interpretations of test scores for proposed use of tests” (AERA, APA, NCME, 2014, p. 11) and reliability as “the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported” (AERA, APA, NCME, 2014, p. 33). We also corroborated the National Science Foundation’s (NSF) definition of STEM with information from a previous systematic review on STEM education (Gonzalez & Kuenzi, 2012; Martín-Páez et al., 2019) to define STEM education to include science (biology, chemistry, computer and information science, geosciences, earth science), technology, engineering, mathematics, as well as any combination of the above fields. Finally, we defined STEM education research as the multiple methodologies for exploring cross-disciplinary teaching, learning, and management strategies that increase student/public interest in STEM-related fields to enhance critical thinking and problem-solving abilities (Bybee, 2010).

For the present review, we utilized the method articulated by Hess and Fore (2018), which itself extends the method detailed by Borrego et al. (2014). Following Hess and Fore (2018), we began by identifying the need for a systematic review such as this, and then proceeded to define the scope and focus of the study with three research questions (see next section). Then, we initiated scoping, which is concerned with identifying the ways in which we were going to search for relevant literature.

We then proceeded to what we term abstract screening and then full-text screening , referred to as “cataloguing” by Hess and Fore ( 2018 ). Given the context of our systematic literature review and its demands, we determined that a slight shift was needed. In the context of the Hess and Fore study (i.e., engineering ethics education), the “cataloguing” step was concerned with the creation of the initial article database and the criteria by which inclusion and exclusion were determined. We also created an article database, but we changed it to a two-step screening process ( abstract screening and full-text screening ) in the present review. During these two phases, articles were screened against inclusion and exclusion criteria that were determined a priori , which are discussed in depth in screening below.

In Hess and Fore ( 2018 ), the “exploring” step was focused on crafting a coding structure and whittling down the article database further based upon the evolving coding parameters of the study in a deductive manner. However, our systematic review required a different approach.

Rather than using a deductive coding process, we created an a priori coding structure based on information outlined in the Standards (AERA, APA, and NCME, 2014). This approach allowed our article database to become specialized according to our study parameters. We only used emergent codes for categorizing the constructs being measured (see below under coding). Unlike Hess and Fore (2018), we did not further whittle down the article database in this phase; articles were simply coded for the psychometric evidence present, based on the coding structure. For this reason, we simply call this phase coding.

Next, during the checking phase, just as in the Hess and Fore ( 2018 ) review, we engaged in examining interrater reliability. Authors one, two, and three familiarized themselves with the coding structure and then we performed interrater reliability testing. Hess and Fore ( 2018 ) named their subsequent step “quantizing”; however, we decided to rename that step results , as we felt that this title better communicated the step’s purpose to report the descriptive statistics related to our coding efforts. The two final steps identified by Hess and Fore (i.e., “interpreting” and “narrating”) were merged into a step we simply titled discussion . In this step, we identified and interpreted our results before crafting an overview of all results.

As argued in the “ Purpose ” section above, there is a need for a broad systematic review of the literature on survey instruments across STEM fields. Seeking out and identifying a valid and reliable instrument for one’s project may be laborious and time consuming, especially for those who may be starting out as STEM education researchers or discipline-based education researchers. This study seeks to introduce current instrumentation trends across STEM fields to provide researchers with a tool to identify rigorously developed scales which may foster insight into where psychometric work still needs to be done. To accomplish this, we seek to address three research questions:

RQ1: What are the valid and reliable measures being reported for use in post-secondary STEM education research in the United States between the years 2000 and 2021?

RQ2: What are the common categories within which the measures can be organized?

RQ3: What is the psychometric evidence that has been reported for these STEM education measures?

We started with an initial list compiled by the first author as an internal resource for STEM education research in our institute. Building upon this, a literature search was conducted using both quality-controlled and secondary sources (see Cooper, 2010 ). Quality-controlled sources included one book (Catalano & Marino, 2020 ), similar systematic reviews (Arjoon et al., 2013 ; Cruz et al., 2020 ; Gao et al., 2020 ; Margulieux et al., 2019 ), conference proceedings (Decker & McGill, 2019 ; Verdugo-Castro et al., 2019 ), and the Physics Education Research Center (PERC) online repository of measures. Secondary sources included the Web of Science, Education Resources Information Center (ERIC), SCOPUS, and PsycINFO databases, which index the major STEM education research journals. These journals were identified in a previous systematic review examining STEM education research publication trends (Li et al., 2020 ).

Constraints and limiters

We used several constraints and limiters to narrow down the number of papers obtained in the literature search. First, given that reliability and validity are complementary, and that high reliability is needed for high validity (Knekta et al., 2019), we only included papers that reported on both validity and reliability. Second, because our work and expertise primarily revolve around STEM education in the United States, we were interested in measures used in STEM education research in the US. This decision was further informed by the fact that sometimes scores can differ between groups due to scale characteristics unrelated to the actual construct being measured, thus introducing measurement bias (AERA, APA, & NCME, 2014). Likewise, population differences in factors such as culture and language necessitate an examination of the degree to which a scale measures the same construct across groups (Schmitt & Kuljanin, 2008), which can be especially important when comparing constructs such as values, attitudes, and beliefs across groups (Milfont & Fischer, 2010; Van De Schoot, 2015). Thus, for the sake of simplicity and brevity, we only included studies conducted in the US and written in English. There were some exceptions, where samples from non-US countries were included alongside US samples; however, these papers reported examining measurement invariance between the US and non-US samples.

We further constrained the search to papers published between the years 2000 and 2021 (the present time of the search). Similar systematic reviews in chemistry, engineering, and interdisciplinary STEM education began their searches in the early 2000s (Arjoon et al., 2013 ; Cruz et al., 2020 ; Gao et al., 2020 ). Likewise, in the early 2000s there was an increased emphasis on institutional assessment of learning as well as the need for better assessment tools that reflect the goals and emphases of new courses and curricula being developed in STEM education (Hixson, 2013 ). Finally, the early 2000s is said to be when the term “STEM” was first used (Mohr-Schroeder et al., 2015 ; Li et al., 2020 ), which symbolically helps focus attention to STEM education efforts (Li et al., 2020 , 2022 ). Taken together, we decided that the year 2000 would be a reasonable starting point for the present review.

We finally constrained the search to papers published in peer-reviewed journals or conference proceedings by selecting the ‘peer-reviewed only’ option when searching databases. We also limited the search to studies that sampled college/university students who were 18 years or older, research settings in post-secondary institutions (2-year college, 4-year college, or university), and STEM courses (based on our conceptualization and operationalization of STEM education above).

Search terms

The first author derived the search terms for the literature search using the thesaurus in the ERIC database and in consultation with a university librarian. These search terms were created based upon the research questions and constraints and limiters outlined above. Terms were derived from four main constructs of interest—STEM education, higher education, measures, and psychometrics—although specific Boolean operators and searching strategies varied slightly depending on the database. For full search terms, limiters, and operators used, please see Tables S1, S2, S3, and S4 in Additional file 1 .

After the first author obtained the initial 603 studies from all sources and stored and managed them using an EndNote™ citation database, duplicates were deleted, and two rounds of screening and reduction were conducted against screening questions based on pre-determined inclusion and exclusion criteria. All screening questions could be answered with yes, no, or unsure. The unsure option was used only when enough information to answer the screening questions could not be obtained from the abstract; such papers then underwent full-text review.

Besides the constraints and limiters outlined above, we had some additional considerations when developing screening questions. We did not consider observation protocols, interview protocols, rubrics, or similar instruments, because their development adheres to a set of standards distinct from surveys and scales and would be outside the scope of the present study. We also excluded scales testing content knowledge, because they have limited opportunity for cross-disciplinary use due to their specificity. Finally, we omitted studies that included scales or subscales with fewer than three items, because using one- or two-item scales has been recognized as problematic (Eisinga et al., 2013). For example, in factor analysis, factors defined by one or two variables are considered unstable (Tabachnick & Fidell, 2014), and it has been argued that more items increase the likelihood of identifying the construct of interest (Eisinga et al., 2013).

Abstract screening

In the first round of screening, the first author reviewed just the abstracts and screened them for inclusion against the following five screening questions:

Does the study report the process of examining psychometric properties (i.e., evidence of validity/reliability) of a measure? (yes/no/unsure)

Does the study examine a measure meant to be used in a post-secondary setting (i.e., 4-year college, university, 2-year college)? (yes/no/unsure)

Does the study examine a quantitative measure (i.e., closed-response options such as Likert items)? (yes/no/unsure)

Are the participants in the study college/university students? (yes/no/unsure)

Has the measure been developed for a STEM education context? (yes/no/unsure)

The second round of screening included papers that had all screening questions marked with yes , and papers that had one or more screening questions marked with unsure . The most common reasons for exclusion in this round included studies not reporting the process of collecting psychometric evidence (e.g., Lock et al., 2013 ; Romine et al., 2017 ) and the measure not being developed for a STEM education context (e.g., Dixon, 2015 ; Zheng & Cook, 2012 ). A total of 114 papers were marked for inclusion and seven were marked as unsure.

Full-text screening

In the second and final round of screening, the first author obtained full-text manuscripts and any corresponding supplementary materials for the 121 studies left after abstract screening. During the full-text screening process, four additional papers were found in the references of other papers, increasing the total number to 125. These studies were screened against the following three screening questions:

Does the measure under study ask questions about specific content knowledge? (yes/no)

Is there evidence for both reliability and validity for the measure? (yes/no)

Do scales/subscales have at least three items in them? (yes/no)

To be included in coding, the first question had to be marked no , and the second and third questions marked yes . After coding, the first author further organized the sample by scale, because some papers reported developing several scales and some scales were developed across multiple papers. See the PRISMA diagram in Fig.  2 for further information on the screening and reduction process as well as final sample sizes.

Figure 2. PRISMA diagram. Four additional studies were found in the second round of screening and included in the full article review

Many of the articles that were excluded in this round appeared to meet our criteria at first glance, but ultimately did not. The most common reason was that scales or subscales had fewer than three items each (e.g., Brunhaver et al., 2018; Jackson, 2018). Furthermore, although limiters were used when conducting the literature search, it was not obvious that several studies were non-US studies, and these were ultimately excluded upon full-text review (e.g., Brodeur et al., 2015; Ibrahim et al., 2017). A few articles were also excluded because the authors did not report both reliability and validity evidence (e.g., Godwin et al., 2013; Hess et al., 2018). It is important to note that exclusion at any stage of the screening process does not mean a paper is of low quality. These papers simply did not meet our specific parameters. The final list of references of the articles included in the review can be found in Additional file 2, and further information on the scales can be found in Additional file 3 and Additional file 4.

Per recommendations for conducting systematic reviews and meta-analyses (Cooper, 2010 ), the first author created a coding framework to pull out sample information, descriptive information, and psychometric evidence for each scale. This was compiled into a codebook, which was shared with authors two and three, who served as the second and third coders, respectively. Each section of the coding framework is described below.

Sample information

The first author extracted sample sizes and characteristics of each sample used in scale development in each study. If several studies were reported in a single publication, the first author extracted sample characteristics for each study, when available. Specifically, for each scale, the first author coded the sample age (either the mean or age range), racial distribution (by percentage), and gender distribution (by percentage).

Descriptive information

The first author extracted the following descriptive information for each scale included in the review:

The number of items in the final scale.

The number of items in each subscale.

Whether the scale is a short form of a longer, previously developed scale.

The disciplinary context of the scale.

The construct or constructs the scale is measuring.

Scale response anchors.

The education level the scale is intended for.

The disciplinary context was coded in accordance with the predetermined definition of STEM education as outlined above in the theoretical framework section. The scale constructs were coded based upon the main constructs that were operationalized and defined by the authors of the scales. Thus, if a scale author stated that the scale was designed to measure chemistry attitudes, for example, then the construct was coded as “attitudes towards chemistry.” The scale constructs were further developed into broader categories through emergent codes, which is described below under checking .

Psychometric evidence

Following similar systematic reviews (e.g., Arjoon et al., 2013 ), we created a coding structure based upon the psychometric evidence outlined in the Standards (AERA, APA, & NCME, 2014 ) for validity and reliability. Specifically, we pulled out the types of validity and reliability evidence and used binary codes (yes/no) to mark them as either present or not present for each scale. Extra definitions for statistical techniques discussed below can be found in Additional file 5 .

Per the Standards (AERA, APA, & NCME, 2014), validity evidence coded in this review included test content validity, response process validity, internal structure validity, and relationships with other variables. Test content validity was defined as evaluations from expert judges. Response process evidence was defined as evaluating cognitive processes engaged in by subjects through cognitive interviews, documenting response times, or tracking eye movements. Internal structure evidence was defined as the extent to which the relationships among test items and components conform to the construct on which the proposed test score interpretations are based. This included exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and Differential Item Functioning (DIF). We also considered other statistical techniques not listed in the Standards, such as Rasch Analysis, Q-sort methodology, Item Response Theory (IRT), and Multidimensional IRT.
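
To illustrate what internal structure evidence in the form of an EFA typically involves, the following is a minimal sketch in Python using the factor_analyzer package. It is not the procedure used by any of the reviewed studies (which relied on various software); the data file, item columns, and the choice of three factors are assumptions for illustration only.

```python
# Minimal EFA sketch (illustrative only): suitability checks followed by extraction.
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo, calculate_bartlett_sphericity

items = pd.read_csv("survey_items.csv")  # hypothetical file: one column per Likert item

# Check whether the item correlation matrix is suitable for factoring
chi2, p = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)
print(f"Bartlett chi2 = {chi2:.2f}, p = {p:.3f}; overall KMO = {kmo_total:.2f}")

# Extract three factors with an oblique rotation and inspect the loadings
efa = FactorAnalyzer(n_factors=3, rotation="oblimin")
efa.fit(items)
print(pd.DataFrame(efa.loadings_, index=items.columns))
```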

Evidence based on relationships with other variables was defined as analyses of the relationship of test scores to variables external to the test. This included convergent validity, which examines whether scales are highly correlated with similar constructs, and discriminant validity, which examines whether scales are not correlated with dissimilar constructs. Test-criterion validity, which examines how accurately scores predict performance on a criterion (some attribute or outcome that is operationally distinct from the scale), is also included. Test-criterion validity can come in the form of predictive validity, in which criterion scores are collected at a later time, or concurrent validity, in which criterion scores are collected at the same time as scale scores. Finally, under evidence based on relationships, we also considered validity generalization via meta-analysis.
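
In practice, these three kinds of evidence are often examined as simple correlations between scale scores and external measures. The sketch below shows one hedged way this might look; the variable names (e.g., an established similar scale, a dissimilar scale, current GPA as a concurrent criterion) are hypothetical and not drawn from any reviewed study.

```python
# Illustrative correlations for convergent, discriminant, and concurrent evidence.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("scores.csv")  # hypothetical columns used below

# Convergent: the new scale should correlate strongly with a similar construct
r_conv, p_conv = pearsonr(df["new_scale"], df["established_similar_scale"])

# Discriminant: it should correlate weakly with a dissimilar construct
r_disc, p_disc = pearsonr(df["new_scale"], df["dissimilar_scale"])

# Concurrent test-criterion: relationship with a criterion measured at the same time
r_crit, p_crit = pearsonr(df["new_scale"], df["current_gpa"])

print(f"convergent r = {r_conv:.2f}, discriminant r = {r_disc:.2f}, criterion r = {r_crit:.2f}")
```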

Reliability

Per the Standards (AERA, APA, NCME, 2014), reliability evidence considered in this review includes alternate-form reliability, test–retest reliability, and internal consistency. Alternate-form reliability was defined as the examination of the relationship between the measure and a different but interchangeable form of the measure, usually in the form of a correlation. Test–retest reliability is the examination of the relationship between two or more administrations of the same measure, also typically reported as a correlation. Finally, internal consistency is the observed extent of agreement between different parts of one test, used to estimate reliability; it encompasses Cronbach’s alpha and split-half coefficients. We also considered coefficients not listed in the Standards, such as ordinal alpha and McDonald’s Omega.
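
For readers less familiar with Cronbach’s alpha, it is computed from the item variances and the variance of the total score: alpha = k/(k-1) * (1 - sum of item variances / variance of the summed score). The short sketch below demonstrates that definition on made-up data (a single latent trait plus noise); it is not data from any reviewed scale.

```python
# Cronbach's alpha from its definition, on simulated item responses.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: respondents in rows, items in columns."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                       # one underlying trait
items = latent + rng.normal(scale=0.8, size=(200, 5))    # five noisy indicators of it
print(round(cronbach_alpha(items), 2))                   # reasonably high for correlated items
```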

To check the trustworthiness of our data, we engaged in interrater reliability and data categorization between the first three authors.

Interrater reliability

We interpreted interrater reliability as a measure of the percentage of identically rated constructs between the three raters to ensure accuracy, precision, and reliability of coding behavior (Belur et al., 2021). O’Connor and Joffe (2020) suggest that it is prudent to double-code a randomly selected small amount of data (e.g., 10% of the sample) rather than double-code a sizable quantity of data. Thus, due to the low number of samples (i.e., 72), we randomly double-coded 10% (seven samples) to estimate intercoder reliability. Though rules of thumb typically fall within 10% sub-samples (O’Connor & Joffe, 2020), consistent with Armstrong et al. (2020), our intercoder reliability value increased as we double-coded more samples but reached saturation by the fifth to seventh sample.

As described earlier, the first author created an a priori rating scheme by compiling psychometric evidence for each scale. The first, second, and third authors applied the a priori coding structure to the same seven articles (9.72% of the total sample) and examined interrater reliability. The interrater agreement was 93.40% between the first and second authors and 73.40% between the first and third authors. The agreement between the second and third authors was 74.27%. These values are above the 70.00% agreement that Belur et al. (2018) consider acceptable for systematic reviews. The first three authors then discussed and resolved all disagreements until 100.00% agreement was achieved.
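
Percent agreement of this kind is simply the proportion of coding decisions on which two coders match. The toy sketch below shows the computation; the binary codes (psychometric criterion present/absent) are invented for illustration, not the codes used in this review.

```python
# Percent agreement between two coders on binary presence/absence codes (toy data).
import numpy as np

coder_a = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
coder_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

percent_agreement = (coder_a == coder_b).mean() * 100
print(f"{percent_agreement:.2f}% agreement")  # 80.00% for this toy example
```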

Categorization

We categorized constructs from each of the articles through emergent coding (see Drishko & Maschi, 2016). The first author organized the articles by scales to create initial categories of the 82 scales. The first three authors then engaged in an emergent coding process using the 72 articles and categorized the codes, creating 18 primary categories, 11 secondary categories, and one tertiary category for the scales. This allowed articles to be assigned two or, in one case, three categories. Primary categories are the main overall construct the scale is measuring. When a second (or, in one case, third) clear but less prominent construct was evident, it was coded into the secondary or tertiary category, respectively. For example, the article “Examining Science and Engineering Students’ Attitudes Toward Computer Science”, which measures both student interest and attitudes, was coded under attitudes as a primary category and under interest as a secondary category. We provide a list of the codes and their definitions in Table 1.

Out of the 82 scales in our sample, only 12 were short forms of longer scales. The average scale length was M  = 29.84 ( SD  = 29.86), although there was a wide range with the shortest being four items and the longest being 216 items. Out of the 82 scales in the sample, 62 reported containing subscales, with the median number of subscales per scale being three. The smallest number of items in a subscale was three items, while the largest number of items in a subscale was 30 items. Full information on the number of items in each subscale within each scale can be found in Additional file 4 . We found that the majority of the scales in our sample were intended for use on an undergraduate sample or developed with undergraduate students (68.29%). Full information on education level can be found in Table 2 . Very few scales were created for use with students in a 2-year setting (i.e., community college) and for graduate-level students.

The scales in our sample used response anchors that ranged from 3-point to 101-point, although the majority (42.90%) used 5-point Likert-type response anchors, followed by 6-point (21.40%) and 7-point (14.30%). Two instruments had response anchors that varied within subscales. The Full Participation Science and Engineering Accessibility (Jeannis et al., 2019) instrument used response anchors that ranged from 1 (strongly disagree) to 5 (strongly agree) for some sets of items, and response anchors that ranged from 1 (yes) to 3 (not present) for another set of items. Likewise, the Sustainable Engineering Survey (McCormick et al., 2015) used response anchors that ranged from 0 (no confidence) to 100 (very confident) for one subscale and anchors ranging from 0 (strongly disagree) to 5 (strongly agree) for other subscales.

The sample size used for the main statistical analyses (i.e., EFA, CFA, Cronbach’s alpha, etc.) amongst the studies varied. The largest sample size used was N  = 15,847, while the smallest was N  = 20. Sample sizes most frequently fell between the range of 100–300 participants (34%), followed by 301 to 500 participants (24%). Full information on sample size ranges is displayed in Fig.  3 .

Figure 3. Frequencies of sample size ranges. When papers reported on several studies (e.g., pilot study, main study) the sample sizes reported were counted separately. When papers reported samples from several populations (i.e., different universities), the samples were summed up and counted as one. Many papers reported sample sizes for expert judge evaluations and cognitive interviews, but we did not count those in this analysis

When available, we noted participant demographic information. Of those that reported participant age, most of the scales reported age means and ranges between 18 and 25, which is not surprising given our post-secondary focus. Only one scale reported a range between 18 and 63 and another reported a range between 20 and 31. A total of 34 of the 54 scales that reported a gender distribution had a majority female sample. Of the 34 scales with participant race and ethnicity available, 32 reported White as either the majority or the largest group in their sample. Two scales reported an African American or non-White majority sample, respectively, and one scale reported a Hispanic/Latinx majority in one of their samples.

Given the large timespan we were working with, we compared psychometric trends between scales created before 2010 and after 2010 to determine whether a larger analysis across time was needed. However, we could not do such an analysis for discriminant validity, predictive validity, other internal structure validity, alternate form reliability, and other internal consistencies, because there were so few datapoints. A series of Chi-square analyses showed that there were no statistically significant differences in occurrences of psychometric tests before 2010 compared to after 2010, except for the frequency of CFA. CFA was conducted more frequently post-2010 (56.9%) compared to pre-2010 (29.4%), χ2(1, N = 82) = 4.08, p = 0.04. We further conducted a more granular examination by comparing pre-2010, 2010–2015, and post-2015, which did not yield any statistically significant results. Given that we only had one statistically significant result, we determined further analyses across the full 20 years were not necessary.
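
As a hedged illustration of this kind of comparison, the sketch below runs a chi-square test of independence with scipy. The cell counts are an assumption chosen only to be consistent with the reported percentages (29.4% of pre-2010 and 56.9% of post-2010 scales reporting a CFA); they are not counts taken from the paper.

```python
# Sketch of a pre-/post-2010 comparison for CFA frequency (illustrative counts).
from scipy.stats import chi2_contingency

#            [CFA reported, CFA not reported]
observed = [[5, 12],    # pre-2010 scales (assumed n = 17)
            [37, 28]]   # post-2010 scales (assumed n = 65)

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")  # approx. chi2(1) = 4.08, p = .04
```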

Scale disciplines

In our descriptive analysis, we identified several scales for each of the traditional STEM disciplines, except for geosciences, which includes environmental and space science. Among the discipline-specific scales, most were designed for engineering, with biology being the least represented discipline. However, the largest proportion of scales (31.30%) were classified as unspecified STEM or science without specification to any discipline, and 7.20% of the scales were described as multidisciplinary (that is, more than one STEM discipline is specified). All scale disciplines can be found in Table 3.

Scale categories

In conversation with each other, the authors of this paper collaboratively assigned each instrument to a category denoting what the instrument was intended to measure, as it was not always obvious how a given measure should be categorized. The agreed upon categories of the instruments and the percentage of the overall sample that each represents are listed in Tables 4 and 5 . We settled on 18 primary categories. When instruments were too complex to confine into one category, we identified secondary and/or tertiary categories.

All primary categories can be found in Table 4 . The most common kinds of instruments found in our literature sample were concerned with measuring attitudes (17.10%), various non-technical skills (13.40%), motivation (11.00%), diversity (8.50%), and interest (7.30%). That said, there was a great deal of variation with no category taking on a clear and overwhelming majority. This is unsurprising considering the breadth of our sampling. The categories that occurred the least included affective outcomes, engagement, external climate, learning gains, and social support, all of which occurred only once.

Of the sampled instruments, 17 were assigned a secondary category and one was assigned both a secondary and tertiary category (see Table 5). We found great variation here too, with most categories occurring once. There was only one instance (“community engagement”) among the secondary and tertiary categories where the authors felt the need to add a category that was not also included as a primary category. There was no instrument in our sample that focused primarily on community engagement; however, there were four surveys that focused on community engagement as a context for the primary category of non-technical skills. These all came from the same paper reporting on the Engineering Projects in Community Service (EPICS) program (Tracy et al., 2005). Community engagement was also the most frequently occurring secondary category (4.90%), followed by cognitive outcomes, literacy, and motivation, all of which occurred twice.

Overall validity evidence

In our sample, the median total number of validity evidence types reported per scale was three, with a range of one to five types of validity reported per scale. Table 6 displays frequencies for validity evidence. The majority of the validity evidence reported in our sample was EFA and CFA for internal structure evidence, with 26 scales reporting evidence for both. A few scales reported other types of internal structure evidence, such as IRT and Rasch Analysis (see Table 7). The next most frequent type of validity evidence in our sample was test content validity in the form of expert judge evaluations, followed by concurrent test-criterion validity. Validity evidence used to build nomological networks, such as convergent and discriminant validity, was reported less frequently, with convergent validity being the more prominent of the two. Although concurrent validity was reported over a third of the time, evidence for test-criterion validity in the form of predictive validity was scarce. We did not see generalization through meta-analysis reported in our sample.

We also examined the number of scales that contain joint evidence for the different combinations of the major four types of validity that grounded our theoretical framework (test content, response process, internal structure, and examining relationships). Analyses were completed using crosstabulations in IBM SPSS (version 28). Full information can be found in Table 8 . The most frequent combination (45.12%) was reporting on internal structure validity along with the examination of relationships. The second most frequent combination was examining test content validity along with internal structure. All other combinations were less frequent with the combination of relationships and response process being the scarcest.
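
The joint-evidence counts above were tabulated in IBM SPSS; for readers working in other environments, a rough equivalent can be produced with a pandas crosstab. The column names and example rows below are hypothetical and shown only to illustrate the tabulation, not to reproduce the study data.

```python
# Cross-tabulating presence/absence of two validity types across scales (toy data).
import pandas as pd

scales = pd.DataFrame({
    "internal_structure": [1, 1, 0, 1, 0],
    "relationships":      [1, 0, 0, 1, 1],
})

print(pd.crosstab(scales["internal_structure"], scales["relationships"], margins=True))
```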

Validity evidence by construct

We further examined validity evidence by the categories representing the commonly measured constructs in our sample. For brevity, we only discuss the most frequently reported types of evidence, although all analyses are reported in Table 9. The one scale measuring affective outcomes; all the scales measuring engagement, learning gains, and literacy; and the majority (over 50%) of the scales measuring motivation, course perceptions, long-term outcomes, and non-technical skills reported obtaining expert judge evaluations.

Scales from all categories except for the one scale on affective outcomes reported an EFA. This includes all scales measuring anxiety, engagement, external climate, learning gains, literacy, long-term outcomes, self-efficacy, and social support and at least 75% of those examining belonging and integration, course perceptions, and motivation. At least half of scales from all other categories reported on an EFA, except those measuring interest, one-third of which contained an EFA. Similarly, CFA was reported for scales in all categories except for those measuring course perceptions and literacy. We found that all the scales measuring affective outcomes, engagement, external climate, learning gains, and social support, as well as most of the anxiety, diversity, interest, long-term outcomes, motivation, and self-efficacy scales contained evidence for CFA. We observed that anywhere between one-quarter and half of scales in all other categories displayed evidence from CFA. Comparatively, DIF and other types of internal structure validity were scarcely reported.

Few scales reported response process evidence, while reports on relationships with other variables varied. Convergent validity evidence was found among all scales measuring affective outcomes, anxiety, and learning gains. We also observed convergent validity evidence among most of the self-efficacy scales, half of the course perceptions scales, over a third of the attitude scales, and a third of the scales measuring interest. It was observed for at least a quarter of the scales measuring identity and belonging and integration, and for a few of the scales measuring non-technical skills and motivation. Test-criterion validity, mostly in the form of concurrent validity, was reported for all scales measuring self-efficacy and social support, and for most of the scales measuring anxiety, interest, and motivation. Half of the identity and belonging and integration scales, and one-third of the literacy scales, reported concurrent validity. Concurrent validity evidence was found for one-quarter of those measuring cognitive outcomes and course perceptions, and less than a quarter of the time for all other categories. Conversely, predictive and discriminant validity were found in very few categories.

Overall reliability evidence

In our sample, the median total number of types of reliability evidence reported per scale was one, with a range of one to three types of reliability evidence per scale. Table 10 displays frequencies for reliability evidence. The most frequently reported reliability evidence in our sample was internal consistency, with a large skew towards Cronbach’s alpha. Out of the other types of internal consistency reported, the most popular was McDonald’s omega, with ordinal alpha and the person separation variable being reported only once each (see Table 11 ). Apart from internal consistency, the second most frequent reliability evidence in our sample was test–retest reliability, although there is a considerable drop in occurrence compared to Cronbach’s alpha. Likewise, only one study reported alternate form reliability in our sample and no studies in our sample reported split-half reliability.

We also examined the frequencies of joint evidence reported for the different combinations of reliability evidence using crosstabulations in IBM SPSS (version 28). Split-half reliability was not included in this analysis as it was not present in our sample. Full information can be found in Table 12 . Given how much Cronbach’s alpha dominated the reliability evidence found in our sample and how infrequent other sources were, it is unsurprising that there was not much joint evidence found. The most frequent combination was Cronbach’s alpha reported with test–retest reliability. Other combinations only occurred once.

Reliability evidence by construct

Just as with validity evidence, we further broke down analyses for reliability by categories (see Table 13). Given the frequency of Cronbach’s alpha in our sample, it is unsurprising that most of the scales across categories reported conducting Cronbach’s alpha for internal consistency. It is reported for all the scales measuring affective outcomes, anxiety, belonging and integration, cognitive outcomes, course perceptions, external climate, engagement, learning gains, literacy, long-term outcomes, motivation, and self-efficacy, respectively. Likewise, the majority of the scales in all other categories reported Cronbach’s alpha for internal consistency. Other evidence for internal consistency was reported by almost half of the scales measuring self-efficacy, one third of the scales measuring literacy, and a quarter of the scales measuring belonging and integration as well as identity. Beyond a few of the scales measuring interest, motivation, and non-technical skills, no other scales in any other categories reported examining other types of internal consistency.

Test–retest reliability was reported for most of the scales measuring anxiety and one-quarter of the scales measuring cognitive outcomes as well as identity. Beyond that, test–retest reliability was reported for a few of the scales measuring attitudes, diversity, motivation, and self-efficacy. No other scales in any other categories contained test–retest reliability evidence. Only scales measuring anxiety contained alternate form reliability evidence.

The most frequently reported types of validity evidence in our sample were test content validity and internal structure validity. Specifically, evaluations from expert judges were reported as test content validity for nearly half of the scales and were present in most of the categories in our sample. Previous systematic reviews have similarly observed test content as a commonly reported type of validity evidence (Arjoon et al., 2013; Cruz et al., 2020; Decker & McGill, 2019). Although informative, scholars (e.g., Reeves et al., 2016) have argued that this type of validity evidence alone is not sufficient. Only two scales in our sample had evaluations from expert judges as their sole validity evidence.

For internal structure evidence, EFA and CFA were reported for over half and nearly half of the scales in our sample, respectively, and were well-represented in all but a few of the categories. However, other forms of internal structure validity, such as DIF, were much less prominent. Comparatively, a systematic review of chemistry education measures found internal structure validity evidence was reported in about half of the sample, with EFA being the most common and DIF completely lacking (Arjoon et al., 2013 ). While CFA is underutilized in chemistry education measures (Arjoon et al., 2013 ), it is reported more frequently across all STEM education research here, although our sample follows the trend of DIF being underreported. Similarly, other forms of internal structure validity were rare. While EFA and CFA can provide essential information about a scale’s internal structure, other types of internal structure validity evidence can be valuable or even more appropriate.

Compared to test content and internal structure validity, we found that response process evidence was much less present, with only 13 scales reporting cognitive interviews. This aligns with similar work, which found a dearth of response process validity (Arjoon et al., 2013 ; Cruz et al., 2020 ), with cognitive interviews reported for only four out of 20 scales in one review (Arjoon et al., 2013 ). We also observed that evidence for relationships between variables—convergent, discriminant, and test-criterion validity—were far less present, with convergent and concurrent validity being reported on much more frequently than their counterparts. In contrast, previous work finds that all but one of the chemistry education scales in their sample reported some form of relationship with other variables (Arjoon et al., 2013 ). Looking across all STEM education disciplines, there may need to be more work to collect evidence based on relationships and build nomological networks around the constructs being measured.

Internal consistency, namely, Cronbach’s alpha, was the most dominant reliability evidence and was prominent in all categories. All other forms of reliability evidence were reported far less frequently and were less represented across categories. This is unsurprising as others have reported similar observations in their reviews (Arjoon et al., 2013 ; Cruz et al., 2020 ; Decker & McGill, 2019 ), and as Cronbach’s alpha is mistakenly provided as the only evidence for validity in many biology education research papers (Knekta et al., 2019 ). In comparison, test–retest reliability was reported for less than ten percent of our sample, alternate form only once, and split-half reliability was not observed at all, aligning with previous work (Arjoon et al., 2013 ; Cruz et al., 2020 ; Decker & McGill, 2019 ).

Although several gaps were observed, several surveys in the sample contained more comprehensive evidence and drew from several sources, which gave them a higher chance of being robust when used in a research setting. For example, the Engineering Professional Responsibility Assessment Tool (Canney & Bielefeldt, 2016) reported the highest number of validity sources (five) and reported at least one piece of evidence from each of the four main categories in our theoretical framework. That said, this survey only reported one type of reliability evidence—ordinal alpha. The Engineering Skills Self-Efficacy Scale (Mamaril et al., 2016) also provided more comprehensive evidence, with five sources reported and three of the categories in the theoretical framework represented (test content, internal structure, and relationships). This scale also reported two forms of internal consistency—Cronbach’s alpha and McDonald’s omega.

Surveys that reported more comprehensive reliability evidence were rare, although the Abbreviated Math Anxiety Scale (Hopko et al., 2003 ) drew from the most sources in our sample (three)—alternate form, test–retest, and internal consistency. This scale also reported four sources of validity evidence from two categories (internal structure and relationships to other variables).

Implications for STEM education research

Psychometric development

Although a full discussion of psychometric evidence is beyond the scope of this review, validity is considered a unitary concept in contemporary theory (APA, AERA, & NCME, 2014; Reeves et al., 2016). This can be viewed as a continuum in which one’s confidence in a measure grows as accumulating evidence increases support for its intended interpretations. There were several scales that did not go beyond reporting an EFA for validity evidence, and while this is a good starting point, EFA alone is not enough evidence and should ideally be corroborated with other sources (Knekta et al., 2019). For example, following up with a CFA can expand upon EFA by confirming the underlying factor structure found in the previous analytical results (DeVellis, 2017). We also echo past work (Arjoon et al., 2013) in encouraging scale developers in STEM education to examine DIF, as it can provide valuable information about a scale’s multidimensionality and whether items function differently among distinct groups (APA, AERA, NCME, 2014; Arjoon et al., 2013). Likewise, we suggest considering other forms of internal structure validity when appropriate, such as Rasch Analysis, IRT, or Q-sort methodology, to name a few. For example, IRT can allow researchers to examine item-level characteristics, such as item difficulty and item discrimination, as compared to factor analyses.
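
One way a follow-up CFA might look in code is sketched below using the Python semopy package with lavaan-style model syntax; R's lavaan or Mplus are more common choices in practice, and this is only a sketch under assumptions. The two-factor model, factor names, and item names are hypothetical, not taken from any reviewed scale.

```python
# Hedged CFA sketch: confirming a two-factor structure suggested by a prior EFA.
import pandas as pd
import semopy

items = pd.read_csv("survey_items.csv")  # hypothetical item-level data

model_desc = """
Interest  =~ int1 + int2 + int3
Attitudes =~ att1 + att2 + att3
"""

model = semopy.Model(model_desc)
model.fit(items)
print(semopy.calc_stats(model))  # fit indices such as CFI, TLI, RMSEA
```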

Beyond internal structure, other forms of validity provide valuable information.

Specifically, response process evidence, such as cognitive interviews, can provide insight as to how participants are interpreting and reasoning through questions (APA, AERA, NCME, 2014 ; Arjoon et al., 2013 ), which can provide important qualitative data missing in other forms of validity. Likewise, building a nomological network by examining a scale’s relationships (or lack thereof) to other variables can illuminate how the scale fits in with a broader theoretical framework (Arjoon et al., 2013 ).

However, the median total of validity evidence sources amongst the scales in our sample was three. Furthermore, the majority of joint evidence reported was between internal structure and relationships and between internal structure and test content validity. Taken together, there was not a lot of breadth when it comes to the validity evidence that was examined. Although there is no “optimal number” of sources, drawing from multiple sources of evidence typically creates a more robust measure (APA, AERA, NCME, 2014 ). We recommend researchers carefully consider the goals of a measure and seek to examine a breadth of validity evidence and accumulate as much evidence as is needed and is feasible within their specific research contexts.

Reliability is also a fundamental issue in measurement that takes several different forms (DeVellis, 2017 ). However, we mostly observed evidence for internal consistency, which only provides evidence on the relationships between individual items. Alternate form evidence can demonstrate reliability by examining the relationship between the scale and an alternate scale, essentially replicating the scale (APA, AERA, NCME, 2014 ). Split-half reliability follows a similar logic to alternate form reliability by examining how two halves of a scale relate to each other (DeVellis, 2017 ). Test–retest reliability provides insight into a scale’s consistency over time (DeVellis, 2017 ). Put simply, distinct types of reliability evidence provide different information, have various sources of error, and certain sources of evidence may be preferable depending on the context and needs of the research (APA, AERA, NCME, 2014 ). Despite this, very few scales in our sample examined some combination of reliability evidence and the median total of reliability sources was one. Given that each of these techniques has strengths and weaknesses, we encourage researchers to diversify reliability evidence in STEM education research, so that different sources of evidence can complement each other.

Prominence of Cronbach’s alpha

Not only was Cronbach's alpha the most prominent form of internal consistency evidence, but it was also the only type of reliability evidence observed for 64 of the 82 scales in our sample. This is no surprise, as Cronbach's alpha is commonly associated with instrument reliability in science education research (Taber, 2018). Although a full discussion of Cronbach's alpha is beyond the scope of the present review, it has been argued that it is not an ideal estimate of internal consistency, because it is typically a lower bound for the actual reliability of a set of items (DeVellis, 2017; Sijtsma, 2009; Taber, 2018). Beyond that, Cronbach's alpha relies on assumptions that are rarely met, and violations of these assumptions can inflate the internal consistency estimate (see DeVellis, 2017 and Dunn et al., 2014). It has also been critiqued because the cutoffs (e.g., α = 0.70) for what constitutes good or acceptable internal consistency are arbitrary (Taber, 2018).
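
For reference, the short sketch below computes Cronbach's alpha directly from its standard definition on simulated item responses; it is illustrative only and does not reflect any scale in our sample.

```python
# Cronbach's alpha from its definition:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=300)
items = latent[:, None] + rng.normal(scale=1.0, size=(300, 8))  # 8 related items

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 3))
```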

Cronbach's alpha is also designed for continuous data, and it has been argued that social science measures are often not continuous, which can make Cronbach's alpha inappropriate to use. The majority of the scales in our sample used a 5-point response scale, most often in a Likert or semantic differential format (see DeVellis, 2017 for a discussion of response formats). Although the exact number of response options to include depends on a myriad of factors, these types of response formats are, strictly speaking, ordinal rather than continuous, because one cannot assume that the intervals between response options are equal (DeVellis, 2017). Scholars therefore argue that this can lead to inaccuracies in Cronbach's alpha and suggest ordinal alpha as an alternative (DeVellis, 2017).

We recommend that researchers critically engage with the use of Cronbach's alpha and not rely on it alone as evidence for internal consistency or overall reliability. Researchers have suggested additions and alternatives, such as bootstrapping a confidence interval around Cronbach's alpha to obtain a range of plausible values for internal consistency, or using McDonald's omega, to name a few (see DeVellis, 2017 and Dunn et al., 2014 for a full review). We suggest weighing the advantages and disadvantages of each of these methods and using whichever is most appropriate for the research context.
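
The sketch below illustrates two of these suggestions, a percentile bootstrap confidence interval around alpha and a single-factor McDonald's omega, under the same simulated-data assumptions as the previous sketch. The omega formula used is one common unidimensional formulation, and the factor_analyzer package is assumed.

```python
# Bootstrap CI around alpha and McDonald's omega (total) on simulated data.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

def cronbach_alpha(x):
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                            / x.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(3)
latent = rng.normal(size=300)
items = latent[:, None] + rng.normal(scale=1.0, size=(300, 8))

# Percentile bootstrap interval for alpha
boot = [cronbach_alpha(items[rng.integers(0, len(items), len(items))])
        for _ in range(2000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# McDonald's omega (total) for a unidimensional scale:
# (sum of loadings)^2 / [(sum of loadings)^2 + sum of uniquenesses]
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(pd.DataFrame(items))
lam = fa.loadings_.ravel()
omega = lam.sum() ** 2 / (lam.sum() ** 2 + fa.get_uniquenesses().sum())
print(f"alpha 95% CI: [{ci_low:.2f}, {ci_high:.2f}], omega: {omega:.2f}")
```

In a well-behaved unidimensional case such as this simulation, alpha and omega will typically be close; larger differences can signal unequal loadings or other assumption violations.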

Disciplinary trends

As identified above, several disciplines were represented in our sample, the most common being Unspecified STEM (31.3%), Engineering (25.3%), Chemistry (10.8%), and Mathematics (10.8%). Given the federal push for advancing and investing in STEM education in the US (Holdren et al., 2010; Olson & Riordan, 2012), it is unsurprising that unspecified STEM education instruments were the most common. Engineering lagging only slightly behind the unspecified STEM category was also foreseeable, as engineering education is a well-established discipline with a strong focus on discipline-based education research. However, we observed a distinct lack of scales in the geosciences, as well as very few scales from biology, computer science, and technology. As other disciplinary professionals further establish and/or expand their discipline-based education research efforts, we anticipate more validated instruments emerging from these fields.

Categorical trends

We observed a breadth of constructs being measured by the scales in our sample. Interestingly, several constructs were seldom measured, including but not limited to engagement, belonging and integration, anxiety, and self-efficacy. A review in computer science education found that a sizable portion of their sample measured what they deemed non-cognitive processes, including self-efficacy, anxiety, and sense of belonging, among others (Decker & McGill, 2019). Another similar review identified, across the 197 papers in their sample, 76 measures examining what they called experience measures, including motivation, self-efficacy, and engagement (Margulieux et al., 2019). Finally, a review on assessment in interdisciplinary STEM education found that "the affective domain", which includes awareness, attitudes, beliefs, motivation, interest, and perceptions of STEM careers, was the most frequent assessment target in their sample of papers (Gao et al., 2020). Aside from motivation, which was our second largest group among the primary categories, we noted many of these constructs only a few times in our sample. Thus, we recommend further work to develop scales measuring these constructs across the entirety of STEM education research.

Constructs that were observed more frequently included non-technical skills, constructs related to diversity, self-efficacy, and interest. We found that the majority of measures for non-technical skills came out of engineering. This is likely due, in part, to engineering education’s significant focus on training, workforce development, and the need for professionalism in industry. Given such foci, as well as the extent to which engineering education has been embraced as its own disciplinary field, it is unsurprising to encounter extensive work in engineering ethics (Hess & Fore, 2018 ), professionalism (Felder & Brent, 2003 ; Layton, 1986 ; Shuman, et al., 2005 ), and interpersonal/societal engagement (Hess et al., 2018 , 2021 ). With this in mind, we recommend other STEM disciplines consider examining these important professional skills.

Scales related to diversity, such as scales measuring racial/sex bias in STEM or stereotype threat susceptibility, also formed one of the larger groups in our sample. Given the significant disparities that exist in STEM education, as well as calls to action to address these disparities, close achievement gaps, and diversify the STEM workforce (Jones et al., 2018), this was foreseeable. This trend also aligns with a review of high-impact empirical studies in STEM education, which found that the most frequently published topic pertained to cultural, social, and gender issues in STEM education (Li et al., 2022). Although these scales comprise one of the larger groups in our sample, a reflection of how dispersed our categories were, there are only seven in total. Given that diversity issues affect all STEM fields and that being able to assess diverse students' experiences is an important aspect of addressing disparities and gaps, there is more work to be done in the development of these scales.

Similarly, self-efficacy and interest were observed five and six times, respectively. Although these were among the larger groups in the sample, these are objectively not large numbers. Interest and self-efficacy work with each other, as well as with other factors, to play an integral role in student motivation, which affects students' academic behaviors, achievements, and choices (Knekta et al., 2020; Mamaril et al., 2016). Given these far-reaching effects, more measures examining these constructs across all domains of STEM education are needed.

Although no one category constituted a clear majority, the construct of attitudes formed the largest group. This is unsurprising, as attitudes (among other non-cognitive constructs) are emphasized by many science educators as important for scientific literacy (Xu & Lewis, 2011). In our sample, attitudes encompassed a range of constructs. Some scales asked about students' beliefs on certain topics (e.g., Adams et al., 2006), others were more evaluative (e.g., Hoegh & Moskal, 2009), many reported generally examining attitudes (e.g., Cashin & Elmore, 2005), others assessed students' epistemologies and expectations (e.g., Wilcox & Lewandowski, 2016), and some focused on the cognitive and affective components of attitudes (e.g., Xu & Lewis, 2011). In social psychology, which has a rich history of attitudes research, an attitude is defined as "a psychological tendency that is expressed by evaluating a particular entity with some degree of favor or disfavor" (Eagly & Chaiken, 1993). Attitudes comprise cognitive (thoughts), affective (feelings), and behavioral (actions) responses to an object or event; they can be formed primarily or exclusively through any combination of these three processes, and any of them can serve as indicators of attitudes in measurement. Given the range we observed, we suggest that researchers measuring attitudes ground their scales in attitude theory and specifically define which aspects are being measured.

Practical implications for researchers

Our goal for the present systematic review is to serve STEM education researchers, DBER professionals, and professionals who work in STEM education centers and institutions across the United States. Whether readers are conducting research themselves or collaborating with STEM professionals to help them conduct STEM education research, we hope they may use this review as a foundation when making decisions about measurement. Several practical implications are discussed below.

First, researchers and professionals may use this review when deciding whether to create a new scale, use a pre-existing scale as is, or adapt one. In general, it is ideal to use a scale that already exists (DeVellis, 2017). The present review gives an overview of what is available, allowing researchers to determine whether they can use scales from our sample or adapt them. When multiple adequate pre-existing measures are available, it is important to consider the amount, variety, and quality of reported psychometric evidence. As psychometric development is an ongoing process, scales with a greater variety of quality psychometric evidence will be more robust and trustworthy. Although we appreciate that there will be occasions when only one scale is available, our goal is that this systematic review can inform decisions about which scales are most trustworthy and robust. Finally, it is important to remember to use the full scale rather than 'cherry-pick' questions when using pre-existing measures. One reason is that factor analysis is designed to examine the relationships among sets of survey items and whether subsets of items relate more strongly to each other than to other subsets (Knekta et al., 2019). Thus, if one uses only a select few items from a set, the validity evidence previously collected no longer applies, and re-validating the scale is recommended before its use.

When using pre-existing measures, one must also consider the sample and context within which the measure was developed. Sample demographics (e.g., race, gender) are an important factor to consider due to the potential for measurement bias. It is rare for measurement parameters to be perfectly equal across groups and measurement occasions (Van de Schoot et al., 2015), and thus there is always potential for measurement bias. For this reason, it is important to report sample demographics when developing a scale and to consider demographics when using a pre-existing scale. If the population one is sampling from is quite different from the one used to develop the scale, it may not be appropriate to use that measure without first examining measurement invariance.
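
As a rough illustration of this point, the sketch below fits the same one-factor CFA separately within two simulated demographic groups and compares the resulting loadings and fit. This is a coarse, informal check rather than a formal multi-group measurement invariance analysis; it assumes the semopy package, and all variable names and data are hypothetical.

```python
# Informal per-group CFA comparison on simulated data (not a formal
# measurement invariance test).
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(4)
n = 400
latent = rng.normal(size=n)
items = pd.DataFrame(
    {f"item{i}": latent + rng.normal(scale=1.0, size=n) for i in range(1, 5)})
items["gender"] = rng.choice(["woman", "man"], size=n)   # hypothetical grouping

desc = "F =~ item1 + item2 + item3 + item4"

for group, subsample in items.groupby("gender"):
    model = semopy.Model(desc)
    model.fit(subsample[["item1", "item2", "item3", "item4"]])
    print(group)
    print(model.inspect())            # compare loading estimates across groups
    print(semopy.calc_stats(model))   # compare fit indices across groups
```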

Due to the potential for measurement bias across different groups, it also may not always be appropriate to use pre-existing measures developed for other disciplines, or even measures developed for an unspecified STEM context, especially if one's research questions are discipline specific. However, one can always adapt a pre-existing measure rather than create a new one. Several scales in our sample were adapted and/or revalidated from scales designed for a different discipline or for all college students, such as the Academic Motivation Scale-Chemistry (Liu et al., 2017), the Civic-Minded Graduate Scale in Science and Engineering (Hess et al., 2021), and the Colorado Learning Attitudes about Science Survey (Adams et al., 2006). Researchers can look to these as examples when adapting or revalidating scales within their own disciplines. Depending on what is being measured, significant adaptations to measures from other disciplines may not be needed. To illustrate, when adapting an academic motivation scale, one may only need to change the discipline referenced in the scale items. However, if one seeks to examine non-technical skills (often also referred to as soft skills) specific to a single discipline, then significant changes may be necessary, requiring more involved adaptations and psychometric evaluations.

Similarly, older scales may need to be re-examined for use in today's context and society. One should carefully examine scale items and consider whether they fit the context in which they will be used. Sometimes items or words pertain to something that is no longer relevant or has become obsolete. This does not necessarily mean the scale or measure itself is of poor quality; one can update items and re-examine the psychometric evidence accordingly.

Finally, the present review aims to give researchers and professionals a sense of where the gaps are and to allow them to make more informed decisions about when to create a new scale in a developing field such as STEM education research. As we have emphasized, psychometric development is a complex and ongoing process; researchers may look to this review for a broad sense of the kinds of psychometric evidence that can be examined and the purposes each serves. However, we encourage corroborating this review with the resources we cite, such as the Standards (APA, AERA, NCME, 2014), DeVellis (2017), and Knekta et al. (2019), as well as other quality resources.

Limitations and future directions

Several limitations of this study are inherent in the inclusion criteria and sampling strategy, which examined only higher education STEM research in the United States. Although the sampled literature covered diverse STEM fields, restricting the review to a single country and to undergraduate and graduate education limits generalization to a broader population. We also excluded dissertations, theses, and non-peer-reviewed articles, which may limit our findings. We suggest that future studies examine measurement trends in survey instruments used in STEM education research across a wider range of countries and education levels, such as K-12, and include dissertations, theses, and non-peer-reviewed papers to extend our findings. Although we did not find it necessary to conduct analyses across time in our sample, as STEM education research grows and measurement further develops, especially as more disciplines become involved and more constructs are added, future research may examine trends over time once more data points exist. Finally, our sample contained uneven group sizes across categories and disciplines, which made comparative analysis difficult.

Summary of recommendations

The following recommendations were developed through a synthesis of the patterns we observed in our analyses and information from the Standards, the theoretical framework that informed this review. Although we hope that these recommendations serve our readers well in their own pursuits, it is important to note that they do not cover the full scope of psychometric development. Discussing the full nuance of the process is beyond the scope of this work, and we strongly encourage readers to engage with the Standards and other resources we cite (e.g., DeVellis, 2017), which have the space to provide more detailed discussion of these topics.

Measurement is fully dependent on the context of one's research, and decisions will be unique to each researcher. Before making any decisions, carefully consider the research context, questions, goals, and population.

It is typically preferable to use a pre-existing measure whenever possible (DeVellis, 2017). If one has found a scale that might be a good fit for one's research, we recommend:

Comparing the population and context in which the scale was developed to one's own. Are they similar? This will help determine how suitable the scale is and whether adaptations are needed.

Using scales in full, the way they were intended to be used when they were developed. One cannot 'cherry-pick' items. If items need to be removed because they are not relevant, then psychometric evidence must be collected again.

If one has determined that a new scale is needed, there are many ways to begin.

We recommend drawing on relevant theories, past research, similar scales, or qualitative data when creating a scale. All of these are good starting points, but it is up to the researcher to decide which is most appropriate.

When collecting psychometric evidence, consider what is needed as well as what is most feasible. This includes considering the size of the sample one is working with, the timeframe, the structure of the scale itself, and its intended use. Given the relationship between validity and reliability, we recommend collecting some form of validity evidence and some form of reliability evidence.

Because measurement invariance cannot be assumed across many group variables, we recommend collecting demographic data from the sample on which the scale was developed. This includes (but is not limited to) variables such as race, gender, class standing, and age.

Many of the scales in our sample rely solely on exploratory factor analysis and/or confirmatory factor analysis for validity evidence. Validity is the degree to which accumulating evidence supports the interpretation of a scale for an intended use (AERA, APA, and NCME, 2014). Adequate support typically involves multiple sources of evidence, but it does not always require all of the sources outlined in this review. We recommend considering what interpretations and intended uses one has for a scale and then deciding which sources will be most appropriate. For example, if one wishes to use a scale to predict student GPA, then predictive validity evidence would be needed; if one wishes to propose that a scale is suitable for a specific content domain, test content validity evidence would be prudent. In general, evidence should be collected to support all propositions that underlie a scale's intended interpretations (a brief sketch of relationship-based evidence follows these recommendations).

When collecting reliability evidence, consider what kinds of decisions will be informed by the scale's use, how reversible those decisions would be, and whether they can be corroborated with other sources of information (AERA, APA, and NCME, 2014). Although reliable and precise measurement is always important, these considerations will inform how modest or high the required degree of precision should be.

Most scales in our sample reported internal consistency evidence only, even though robust reliability evidence typically draws on multiple sources. Just as with validity evidence, we recommend collecting from as many sources of evidence as possible while taking into consideration the intended purposes of the scale. For example, if one is proposing that a scale's items are interrelated, then internal consistency is important to measure. If one is measuring an attribute that is not expected to change across an extended time period, test–retest reliability with scores collected across 2 days would be appropriate. Which sources of evidence, and how many, one wishes to draw upon will look different for each researcher.

When choosing statistical tests for validity and reliability evidence collection, we recommend not simply relying on what is most typically used. For example, many options exist for examining internal structure validity and internal consistency, yet many of the scales in this review rely on exploratory factor analysis and Cronbach's alpha, respectively. We recommend considering a broader spectrum of statistical analyses when collecting psychometric evidence.
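
As one concrete example of evidence based on relationships to other variables, the sketch below correlates simulated scale scores with a simulated GPA (the example given above) and with an unrelated variable as a crude discriminant check. It is purely illustrative and makes no claim about any scale in our sample.

```python
# A minimal sketch of relationship-based validity evidence on simulated data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
scale_score = rng.normal(size=250)                        # total scores on the scale
gpa = np.clip(3.0 + 0.2 * scale_score
              + rng.normal(scale=0.4, size=250), 0, 4)    # hypothetical outcome

r, p = pearsonr(scale_score, gpa)                         # predictive/concurrent evidence
unrelated = rng.normal(size=250)                          # a construct the scale should
r_disc, _ = pearsonr(scale_score, unrelated)              # not relate to (discriminant)
print(f"criterion r = {r:.2f} (p = {p:.3f}); discriminant r = {r_disc:.2f}")
```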

Through this systematic literature review, we have found that a great deal of quantitative instrumentation is being used by STEM education researchers and evaluators to measure the outcomes of STEM education activities and courses. That said, many published instruments lack thorough assessments of their validity and reliability. Of those instruments that have been held up to rigorous testing of validity and reliability, there is still more work that can be done, particularly regarding the use of approaches other than Cronbach's alpha to examine reliability. As STEM education researchers build up a canon of validated and reliable instruments measuring a variety of learning outcomes, the potential grows for creating a repository of STEM education surveys.

Moving forward, there is a need for instruments covering a greater diversity of learning outcomes and STEM fields, as well as for more rigorous and diversified psychometric development of these instruments. STEM education researchers, as mentioned above, may benefit from having more scales that measure engagement, sense of belonging, perceived fit, anxiety, and self-efficacy. It may also be worthwhile for STEM education researchers to examine the educational psychology literature (and related fields) to identify additional instruments that have rarely or never been used in STEM settings. Such an approach could open STEM education researchers up to a variety of validated and reliable instruments, allowing for more complex, sophisticated, and insightful studies and analyses.

Availability of data and materials

Supporting data is available in Additional file 6 and Additional file 7 .

Abbreviations

STEM: Science, technology, engineering, and mathematics

DBER: Discipline-based education research

AERA: American Educational Research Association

APA: American Psychological Association

NCME: National Council on Measurement in Education

NSF: National Science Foundation

PER: Physics education research center

ERIC: Education Resources Information Center

EFA: Exploratory factor analysis

CFA: Confirmatory factor analysis

DIF: Differential item functioning

IRT: Item response theory

EPICS: Engineering Projects in Community Service

Adams, W. K., Perkins, K. K., Podolefsky, N. S., Dubson, M., Finkelstein, N. D., & Wieman, C. E. (2006). New instrument for measuring student beliefs about physics and learning physics: The Colorado Learning Attitudes about Science Survey. Physical Review Special Topics-Physics Education Research, 2 (1), 010101.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (Eds.). (2014). Standards for educational and psychological testing . American Educational Research Association.

Appianing, J., & Van Eck, R. N. (2018). Development and validation of the Value-Expectancy STEM Assessment Scale for students in higher education. International Journal of STEM Education, 5 (1), 1–16.

Arjoon, J. A., Xu, X., & Lewis, J. E. (2013). Understanding the state of the art for measurement in chemistry education research: Examining the psychometric evidence. Journal of Chemical Education, 90 (5), 536–545.

Baker, D. P., & Salas, E. (1992). Principles for measuring teamwork skills. Human Factors, 34 (4), 469–475.

Belur, J., Tompson, L., Thornton, A., & Simon, M. (2021). Interrater reliability in systematic review methodology: Exploring variation in coder decision-making. Sociological Methods & Research, 50 (2), 837–865.

Borrego, M., Foster, M. J., & Froyd, J. E. (2014). Systematic literature reviews in engineering education and other developing interdisciplinary fields. Journal of Engineering Education, 103 (1), 45–76.

Brodeur, P., Larose, S., Tarabulsy, G., Feng, B., & Forget-Dubois, N. (2015). Development and construct validation of the mentor behavior scale. Mentoring & Tutoring: Partnership in Learning, 23 (1), 54–75.

Brunhaver, S. R., Bekki, J. M., Carberry, A. R., London, J. S., & McKenna, A. F. (2018). Development of the Engineering Student Entrepreneurial Mindset Assessment (ESEMA). Advances in Engineering Education, 7 (1), n1.

Bybee, R. W. (2010). What is STEM education? Science, 329 (5995), 996–996.

Canney, N. E., & Bielefeldt, A. R. (2016). Validity and reliability evidence of the engineering professional responsibility assessment tool. Journal of Engineering Education, 105 (3), 452–477.

Cashin, S. E., & Elmore, P. B. (2005). The Survey of Attitudes Toward Statistics scale: A construct validity study. Educational and Psychological Measurement, 65 (3), 509–524.

Catalano, A. J., & Marino, M. A. (2020). Measurements in evaluating science education: A compendium of instruments, scales, and tests. ProQuest Ebook Central https://ebookcentral-proquest-com.proxy.ulib.uits.iu.edu

Cooper, H. M. (2010). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). Sage Publications Inc.

Cruz, M. L., Saunders-Smits, G. N., & Groen, P. (2020). Evaluation of competency methods in engineering education: A systematic review. European Journal of Engineering Education, 45 (5), 729–757.

Decker, A., & McGill, M. M. (2019, February). A topical review of evaluation instruments for computing education. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education (pp. 558–564).

DeVellis, R. F. (2017). Scale development: Theory and applications . Sage publications.

Dixson, M. D. (2015). Measuring student engagement in the online course: The Online Student Engagement scale (OSE). Online Learning, 19 (4), n4.

Drishko, J. W., & Maschi, T. (2016). Content analysis . Oxford University Press.

Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105 (3), 399–412.

Eagly, A. H., & Chaiken, S. (1993). The psychology of attitudes . Harcourt Brace Jovanovich College Publishers.

Eisinga, R., Grotenhuis, M. T., & Pelzer, B. (2013). The reliability of a two-item scale: Pearson, Cronbach, or Spearman-Brown? International Journal of Public Health, 58 (4), 637–642.

Felder, R. M., & Brent, R. (2003). Designing and teaching courses to satisfy the ABET engineering criteria. Journal of Engineering Education, 92 (1), 7–25.

Gao, X., Li, P., Shen, J., & Sun, H. (2020). Reviewing assessment of student learning in interdisciplinary STEM education. International Journal of STEM Education . https://doi.org/10.1186/s40594-020-00225-4

Godwin, A., Potvin, G., & Hazari, Z. (2013). The development of critical engineering agency, identity, and the impact on engineering career choices. In 2013 ASEE Annual Conference & Exposition (pp. 23–1184).

Gonzalez, H. B., & Kuenzi, J. J. (2012). Science, technology, engineering, and mathematics (STEM) education: A primer. Congressional Research Service, Library of Congress.

Hess, J. L., Chase, A., Fore, G. A., & Sorge, B. (2018). Quantifying interpersonal tendencies of engineering and science students: A validation study. The International Journal of Engineering Education, 34 (6), 1754–1767.

Hess, J. L., & Fore, G. (2018). A systematic literature review of US engineering ethics interventions. Science and Engineering Ethics, 24 (2), 551–583.

Hess, J. L., Lin, A., Fore, G. A., Hahn, T., & Sorge, B. (2021). Testing the Civic-Minded Graduate Scale in science and engineering. International Journal of Engineering Education, 37 (1), 44–64.

Hixson, S. H. (2013). Trends in NSF-Supported Undergraduate Chemistry Education, 1992-2012. In Trajectories of Chemistry Education Innovation and Reform (pp. 11–27). American Chemical Society.

Hobson, C. J., Strupeck, D., Griffin, A., Szostek, J., & Rominger, A. S. (2014). Teaching MBA students teamwork and team leadership skills: An empirical evaluation of a classroom educational program. American Journal of Business Education (AJBE), 7 (3), 191–212.

Hoegh, A., & Moskal, B. M. (2009). Examining science and engineering students’ attitudes toward computer science. In 2009 39th IEEE Frontiers in Education Conference (pp. 1–6). IEEE.

Holdren, J., Lander, E., & Varmus, H. (2010). Prepare and inspire: K-12 science. technology, engineering and math (STEM) education for America’s Future . Executive Office of the President of the United States.

Hopko, D. R., Mahadevan, R., Bare, R. L., & Hunt, M. K. (2003). The abbreviated math anxiety scale (AMAS) construction, validity, and reliability. Assessment, 10 (2), 178–182.

Ibrahim, A., Aulls, M. W., & Shore, B. M. (2017). Teachers’ roles, students’ personalities, inquiry learning outcomes, and practices of science and engineering: The development and validation of the McGill attainment value for inquiry engagement survey in STEM disciplines. International Journal of Science and Mathematics Education, 15 (7), 1195–1215.

Jackson, C. R. (2018). Validating and adapting the motivated strategies for learning questionnaire (MSLQ) for STEM courses at an HBCU. Aera Open, 4 (4), 2332858418809346.

Jeannis, H., Goldberg, M., Seelman, K., Schmeler, M., & Cooper, R. A. (2019). Participation in science and engineering laboratories for students with physical disabilities: Survey development and psychometrics. Disability and Rehabilitation: Assistive Technology .

Jones, J., Williams, A., Whitaker, S., Yingling, S., Inkelas, K., & Gates, J. (2018). Call to action: Data, diversity, and STEM education. Change the Magazine of Higher Learning, 50 (2), 40–47.

Kimberlin, C. L., & Winterstein, A. G. (2008). Validity and reliability of measurement instruments used in research. American Journal of Health-System Pharmacy, 65 (23), 2276–2284.

Knekta, E., Runyon, C., & Eddy, S. (2019). One size doesn’t fit all: Using factor analysis to gather validity evidence when using surveys in your research. CBE Life Sciences Education, 18 (1), 1–17.

Knekta, E., Rowland, A. A., Corwin, L. A., & Eddy, S. (2020). Measuring university students’ interest in biology: Evaluation of an instrument targeting Hidi and Renninger’s individual interest. International Journal of STEM Education, 7 (1), 1–16.

Layton, E. T., Jr. (1986). The Revolt of the Engineers . Johns Hopkins University Press.

Li, Y., Wang, K., Xiao, Y., & Froyd, J. E. (2020). Research and trends in STEM education: A systematic review of journal publications. International Journal of STEM Education, 7 (1), 1–16.

Li, Y., & Xiao, Y. (2022). Authorship and topic trends in STEM education research. International Journal of STEM Education, 9 (1), 1–7.

Li, Y., Xiao, Y., Wang, K., Zhang, N., Pang, Y., Wang, R., Qi, C., Yuan, Z., Xu, J., Nite, S. B., & Star, J. R. (2022). A systematic review of high impact empirical studies in STEM education. International Journal of STEM Education, 9 (1), 72.

Liu, Y., Ferrell, B., Barbera, J., & Lewis, J. E. (2017). Development and evaluation of a chemistry-specific version of the academic motivation scale (AMS-Chemistry). Chemistry Education Research and Practice, 18 (1), 191–213.

Lock, R. M., Hazari, Z., & Potvin, G. (2013). Physics career intentions: The effect of physics identity, math identity, and gender. In AIP Conference Proceedings (Vol. 1513, No. 1, pp. 262–265). American Institute of Physics.

Mamaril, N. A., Usher, E. L., Li, C. R., Economy, D. R., & Kennedy, M. S. (2016). Measuring undergraduate students’ engineering self-efficacy: A validation study. Journal of Engineering Education, 105 (2), 366–395.

Margulieux, L., Ketenci, T. A., & Decker, A. (2019). Review of measurements used in computing education research and suggestions for increasing standardization. Computer Science Education, 29 (1), 49–78.

Martín-Páez, T., Aguilera, D., Perales-Palacios, F. J., & Vílchez-González, J. M. (2019). What are we talking about when we talk about STEM education? A review of literature. Science Education, 103 (4), 799–822.

McCormick, M., Bielefeldt, A. R., Swan, C. W., & Paterson, K. G. (2015). Assessing students’ motivation to engage in sustainable engineering. International Journal of Sustainability in Higher Education, 16 (2), 136–154.

Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3 (1), 111–130.

Mohr-Schroeder, M. J., Cavalcanti, M., & Blyman, K. (2015). STEM education: Understanding the changing landscape. In A practice-based model of STEM teaching (pp. 3–14). Brill.

O’Connor, C., & Joffe, H. (2020). Intercoder reliability in qualitative research: debates and practical guidelines. International Journal of Qualitative Methods, 19 , 1609406919899220.

Olson, S., & Riordan, D. G. (2012). Engage to excel: producing one million additional college graduates with degrees in science, technology, engineering, and mathematics. Report to the president. Executive Office of the President.

Reeves, T. D., Marbach-Ad, G., Miller, K. R., Ridgway, J., Gardner, G. E., Schussler, E. E., & Wischusen, E. W. (2016). A conceptual framework for graduate teaching assistant professional development evaluation and research. CBE Life Sciences Education, 15 (2), es2.

Romine, W. L., Walter, E. M., Bosse, E., & Todd, A. N. (2017). Understanding patterns of evolution acceptance—A new implementation of the Measure of Acceptance of the Theory of Evolution (MATE) with Midwestern university students. Journal of Research in Science Teaching, 54 (5), 642–671.

Salmond, S. S. (2008). Evaluating the reliability and validity of measurement instruments. Orthopaedic Nursing, 27 (1), 28–30.

Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18 (4), 210–222.

Shuman, L. J., Besterfield-Sacre, M., & McGourty, J. (2005). The ABET “professional skills”—Can they be taught? Can they be assessed? Journal of Engineering Education, 94 (1), 41–55.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74 (1), 107–120.

Sondergelt, T. A. (2020). Shifting sights on STEM education quantitative instrumentation development: The importance of moving validity evidence to the forefront rather than a footnote. School Science and Mathematics, 120 (5), 259–261.

Tabachnick, B. G. & Fidell, L.S. (2014). Using multivariate statistics. Pearson Education Limited.

Taber, K. S. (2018). The use of Cronbach’s alpha when developing and reporting research instruments in science education. Research in Science Education, 48 (6), 1273–1296.

Tracy, S., Immekus, J., Maller, S., & Oakes, W. (2005). Evaluating the outcomes of a service-learning based course in an engineering education program: Preliminary results of the assessment of the Engineering Projects in Community Service (EPICS). In 2005 ASEE Annual Conference & Exposition.

Van De Schoot, R., Schmidt, P., De Beuckelaer, A., Lek, K., & Zondervan-Zwijnenburg, M. (2015). Measurement invariance. Frontiers in Psychology, 6 , 1064.

Verdugo-Castro, S., García-Holgado, A., & Sánchez-Gómez, M. C. (2019). Analysis of instruments focused on gender gap in STEM education. In Proceedings of the Seventh International Conference on Technological Ecosystems for Enhancing Multiculturality (pp. 999–1006).

Wang, X., & Lee, S. Y. (2019). Investigating the psychometric properties of a new survey instrument measuring factors related to upward transfer in STEM fields. The Review of Higher Education, 42 (2), 339–384.

Wilcox, B. R., & Lewandowski, H. J. (2016). Students’ epistemologies about experimental physics: Validating the Colorado Learning Attitudes about Science Survey for experimental physics. Physical Review Physics Education Research, 12 (1), 010123.

Xu, X., & Lewis, J. E. (2011). Refinement of a chemistry attitude measure for college students. Journal of Chemical Education, 88 (5), 561–568.

Zheng, R., & Cook, A. (2012). Solving complex problems: A convergent approach to cognitive load measurement. British Journal of Educational Technology, 43 (2), 233–246.

Acknowledgements

We would like to acknowledge and thank Precious-Favour Nguemuto Taniform for her work in coding the education level of each scale, which was an important contribution in putting together the table in Additional file 4 .

This study was not supported by any internal or external funding sources.

Author information

Authors and Affiliations

STEM Education Innovation and Research Institute, Indiana University-Purdue University, 755 W. Michigan, UL1123, Indianapolis, IN, 46202, USA

Danka Maric, Grant A. Fore, Samuel Cornelius Nyarko & Pratibha Varma-Nelson

Department of Chemistry and Chemical Biology, Indiana University-Purdue University, 402 N. Blackford Street, LD 326, Indianapolis, IN, 46202, USA

Pratibha Varma-Nelson

Contributions

The first author was responsible for organizing the project team, creating the protocol and coding framework, conducting literature searches and managing collected studies, screening, coding, data analysis, and writing a substantial amount of the manuscript. The second author was responsible for providing feedback during protocol and codebook development, coding, and writing a substantial amount of the manuscript. The third author was responsible for coding and writing a substantial amount of the manuscript. The fourth author was responsible for helping to organize the project team, overseeing the progress of the project, and providing continuous consultation, feedback, and edits during manuscript preparation.

Corresponding author

Correspondence to Danka Maric .

Ethics declarations

Ethics approval and consent to participate

Because our study does not involve any animals, humans, human data, human tissue or plants, this section is not applicable.

Consent for publication

Because our manuscript does not contain any individual person’s data, this section is not applicable.

Competing interests

We affirm that we have no competing interests, financial or otherwise, related to the outcomes of this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

 Search terms, boolean phrases, and limiters used in databases.

Additional file 2.

Articles reviewed.

Additional file 3.

 Scale names, publications, year published, and journal/conference.

Additional file 4.

 Scale descriptive information.

Additional file 5.

 Definitions of statistical techniques.

Additional file 6.

 Primary dataset.

Additional file 7.

 Categories dataset.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Maric, D., Fore, G.A., Nyarko, S.C. et al. Measurement in STEM education research: a systematic literature review of trends in the psychometric evidence of scales. IJ STEM Ed 10 , 39 (2023). https://doi.org/10.1186/s40594-023-00430-x

Received : 17 October 2022

Accepted : 19 May 2023

Published : 02 June 2023

DOI : https://doi.org/10.1186/s40594-023-00430-x


literature review measurement instruments

Instruments to assess integrated care: a systematic review

Affiliations.

  • 1 Department of Integrated Healthcare, Bispebjerg University Hospital, Bispebjerg Bakke 23, DK-2400 Copenhagen, Denmark.
  • 2 Department of Respiratory Medicine, Hvidovre University Hospital, Kettegaard Allé 30, DK-2650 Hvidovre, Denmark.
  • PMID: 25337064
  • PMCID: PMC4203116

Introduction: Although several measurement instruments have been developed to measure the level of integrated health care delivery, no standardised, validated instrument exists covering all aspects of integrated care. The purpose of this review is to identify the instruments concerning how to measure the level of integration across health-care sectors and to assess and evaluate the organisational elements within the instruments identified.

Methods: An extensive, systematic literature review in PubMed, CINAHL, PsycINFO, Cochrane Library, Web of Science for the years 1980-2011. Selected abstracts were independently reviewed by two investigators.

Results: We identified 23 measurement instruments and, within these, eight organisational elements were found. No measurement instrument covered all organisational elements, but almost all studies include well-defined structural and process aspects and six include cultural aspects; 14 explicitly stated using a theoretical framework.

Conclusion and discussion: This review did not identify any measurement instrument covering all aspects of integrated care. Further, a lack of uniform use of the eight organisational elements across the studies was prevalent. It is uncertain whether development of a single 'all-inclusive' model for assessing integrated care is desirable. We emphasise the continuing need for validated instruments embedded in theoretical contexts.

Keywords: integrated care; measurement instruments; organisational elements; systematic literature review.

Advertisement

Advertisement

Health insurance literacy assessment tools: a systematic literature review

  • Review Article
  • Open access
  • Published: 03 August 2021
  • Volume 31 , pages 1137–1150, ( 2023 )

Cite this article

You have full access to this open access article

  • Ana Cecilia Quiroga Gutiérrez   ORCID: orcid.org/0000-0003-1649-6680 1  

3155 Accesses

3 Citations

1 Altmetric

Explore all metrics

This systematic literature review aimed to find and summarize the content and conceptual dimensions assessed by quantitative tools measuring Health Insurance Literacy (HIL) and related constructs.

Using a Peer Review of Electronic Search Strategy (PRESS) and the PRISMA guideline, a systematic literature review of studies found in ERIC, Econlit, PubMed, PsycInfo, CINAHL, and Google Scholar was performed in April 2019. Measures for which psychometric properties were evaluated were classified based on the Paez et al. conceptual model for HIL and further assessed using COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) Risk of Bias checklist and criteria for good measurement properties.

Out of 123 original tools, only 19 were tested for psychometric and measurement properties; 18 of these 19 measures were developed and used in the context of Medicare. Four of the found measures tested for psychometric properties evaluated all four domains of HIL according to Paez et al.’s conceptual model; the rest of the measures assessed three (3), two (8), or one domain (4) of HIL.

Most measurement tools for HIL and related constructs have been developed and used in the context of the USA health insurance system, primarily in Medicare, while there is a paucity of measurement tools for private health insurances and from other countries. Furthermore, there is a lack of conceptual consistency in the way HIL is measured. Standardization of HIL measurement is crucial to further understand the role and interactions of HIL in other insurance contexts.

Similar content being viewed by others

literature review measurement instruments

The Swiss Health Insurance Literacy Measure (HILM-CH): Measurement Properties and Cross-Cultural Validation

Tess L. C. Bardy

literature review measurement instruments

Association of Health Insurance Literacy with Health Care Utilization: a Systematic Review

Brian F. Yagi, Jamie E. Luster, … Renuka Tipirneni

literature review measurement instruments

The evolution of health literacy assessment tools: a systematic review

Sibel Vildan Altin, Isabelle Finke, … Stephanie Stock

Avoid common mistakes on your manuscript.

Introduction

The selection of a less-than-optimal plan for a specific individual, or the complete lack of coverage, can have severe financial and health consequences (Bhargava and Loewenstein 2015 ; Bhargava et al. 2015 , 2017 ; Flores et al. 2017 ). Simultaneously, health insurance represents one of the most complex and costly products individuals consume (Paez et al. 2014 ). A survey commissioned in 2008 by eHealth, Inc., an online marketplace for health insurance in the USA, found that most consumers lack a basic understanding of health insurance terms. Similarly, consumer testing in the USA found that low numeracy and confusion regarding cost-sharing terms can hinder optimal selection of a health plan (Quincy 2012a ). A more recent study using data collected from nonelderly adults in the June 2014 wave of the Health Reform Monitoring Survey (HRMS) found that both literacy and numeracy were more likely to be lower for uninsured adults compared to insured ones, especially for those with lower income and eligibility for subsidized coverage. The authors concluded that “gaps in literacy and numeracy among the uninsured will likely make navigating the health care system difficult” (Long et al. 2014 ). Another study from the USA, found a negative association between insurance comprehension and odds of choosing a plan that was at least $500 more expensive annually compared to the cheapest option, concluding that both health insurance comprehension and numeracy were critical skills in choosing a health insurance plan that provides adequate risk protection (Barnes et al. 2015 ).

In light of these findings, and the introduction of the Affordable Care Act in the USA, the concept of health insurance literacy (HIL) has gained increasing attention over the past decade.

In 2009, McCormack and colleagues developed the first framework and instrument to measure HIL by integrating insights from the fields of health literacy and financial literacy. The instrument was originally developed to measure HIL of older adults in the USA. The study found that, in line with previous research, older adults, and those with lower education and income had lower levels of HIL (McCormack et al. 2009 ). In 2011, a roundtable sponsored by the Consumers Union in the USA that included experts from academia, advocacy, health insurance, and private research firms defined HIL as “the degree to which individuals have the knowledge, ability, and confidence to find and evaluate information about health plans, select the best plan for their own (or their family’s) financial and health circumstances, and use the plan once enrolled” (Quincy 2012b ).

In 2014, the American Institute of Research in the USA developed a conceptual model for HIL and the Health Insurance Literacy Measure (HILM) (Paez et al. 2014 ). The model was developed using information collected from: a literature review, key informant interviews, and a stakeholder group discussion. The conceptual model consists of four domains of HIL: health insurance knowledge, information seeking, document literacy, and cognitive skills, with self-efficacy as an underlying domain. In contrast to the previously mentioned framework of McCormak and colleagues (McCormack et al. 2009 ), which focused on the Medicare population, Paez et al.’s conceptual model was developed and used in the context of private health insurance.

Even though research on HIL is still in its early stages, studies show that HIL may influence how individuals use and choose health insurance services (Loewenstein et al. 2013 ; Barnes et al. 2015 ), seek health insurance information (Tennyson 2011 ), and use health services (Edward et al. 2018 ; Tipirneni et al. 2018 ; James et al. 2018 ). So far, most research addressing HIL has taken place in the USA, with relatively little literature on the topic coming from other countries. Furthermore, early literature focused mainly on consumer’s understanding of their health insurance (Lambert 1980 ; Marquis 1983 ; Cafferata 1984 ; McCall et al. 1986 ; Garnick et al. 1993 ) resulting in the use of inconsistent terminology.

Given the still recent definition of HIL and inconsistency of its assessment at the international level, this review includes HIL and related constructs such as health insurance knowledge, understanding, familiarity, comprehension, and numeracy, with the goal of identifying assessment tools that cover the domains on Paez et al.’s conceptual model for HIL.

Specifically, this review aims to (1) identify and summarize the content of tools that aim to assess HIL and related constructs, (2) describe conceptual dimensions assessed by psychometrically tested measures and when possible, (3) briefly discuss the methodological quality of measurement development and psychometric properties the found tools.

After consulting with a librarian and performing a Peer Review of Electronic Search Strategy (PRESS)(McGowan et al. 2016 ), a systematic literature search was conducted in April 2019 to identify studies using quantitative and mixed methods tools aiming to assess HIL and related constructs. In order to identify measures assessing HIL and related constructs, search terms that were used included constructs that are described in the HIL frameworks previously introduced (McCormack et al. 2009 ; Paez et al. 2014 ). Examples of search terms include “health insurance literacy,” “health insurance,” “health plan,” “Medicare,” “knowledge,” “understand*,” or “instrument” (a full list of search terms and PRESS derived search strings used can be found on Online Resource 1 ). The following databases were searched: ERIC, Econlit, PubMed, PsycInfo, CINAHL, and Google Scholar. Screening of titles and abstracts, full texts; and data extraction were completed by two independent reviewers. Disagreements between reviewers were resolved by discussion until consensus was reached, and a third researcher was consulted when no consensus was attained.

The literature search was complemented through environmental scan and reference-harvesting, which involved reaching out to researchers for gray literature and finding relevant references from bibliographies of included studies, respectively. Screening, data extraction, and PRISMA diagram generation was completed using Distiller SR (DistillerSR n.a. ). Statistical software Stata 16 was used to synthesize information and generate figures (StataCorp LLC 2019 ).

The review is in accordance with the PRISMA reporting guidelines for systematic reviews (Moher et al. 2009 ). The PRISMA diagram (Fig.  1 ) illustrates the number of articles found through different sources, those eliminated due to deduplication or exclusion, and the final count of studies included in the review.

figure 1

PRISMA diagram

Inclusion and exclusion criteria

The main inclusion and exclusion criteria are summarized in Table 1 using the PICoS pneumonic (Population, phenomenon of Interest, Context, Study design) (Lockwood et al. 2015 ).

Studies with any target population were included, except for those limited to health professionals, health insurance experts, or students of disciplines such as medicine, pharmacy, and nursing.

To be included, studies had to describe in sufficient detail the measure that was used to assess HIL or a related construct. Sufficient description included, but was not limited to, whether the used measure was subjective (i.e., self-reported) or objective; number of items, HIL domains assessed by the measure according to the Paez. et al. conceptual model (Paez et al. 2014 ), and country. The review was limited to quantitative and mixed methods measures. Purely qualitative measures were excluded. The search was not restricted to any time period or country. Inclusion of publications was limited to those published in English.

Cross-sectional, cohort and case-control studies were included. Studies which described an intervention to improve HIL or related constructs were included only if baseline levels were assessed.

In the case of studies in which a measure was used or mentioned but was not thoroughly described, authors were contacted. Those studies for which no further or sufficient information could be obtained were excluded.

Data extraction

Found measurement tools were categorized according to characteristics, such as originality of the measure: if it was a previously developed measure or if it was a mixed measure, meaning the measure had been previously developed but was supplemented with additional items or items were adapted from previously used measures. Further characteristics included the number of items, whether the measure was quantitative or used a mixed method such as coding of answers to open questions; if measures were objective and/or subjective; design of the study in which the measurement tool was used, e.g., surveys, interviews, focus groups, etc.

Measures for which psychometric properties were evaluated, were categorized using Paez. et al.’s conceptual model (Paez et al. 2014 ), by domains and number of domains assessed. This conceptual model was selected for several reasons. First, it was developed with purchasers of private health insurance in mind, rather than being focused on Medicare, which is the case of the model from McCormack et al. (Tseng et al. 2009 ). Second, it focuses on domains and domain-specific tasks that make up the concept of HIL, rather than showcasing factors that may explain different levels of HIL (Vardell 2017 ). Finally, the conceptual model by Paez et al. ( 2014 ) is the most recent attempt at conceptualizing and operationalizing the concept of HIL. Measures for which insufficient information was provided regarding the assessment of measurement properties were not included in this section.

Psychometrically tested measures were further assessed for methodological quality using the COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) Risk of Bias checklist (Terwee et al. 2018 ; Mokkink et al. 2018 ); a modular tool which provides standards to evaluate the methodological quality of studies on outcome measures. The checklist allows the assessment of the methodological quality of measure development, including content validity, structural validity, internal consistency, reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness; for definitions of these measurement properties please refer to Online Resource 2 .

The methodological quality of the studies on outcome measures was rated as very good, adequate, doubtful, or inadequate quality; according to COSMIN standards, only when sufficient information was available to do so. Overall rating for methodological quality was given by selecting the lowest or “worst score counts” rating for each of the domains assessed per COSMIN guidelines. Because there is no established gold-standard to assess HIL, and none of the measures were tested for cross-cultural validity, criterion validity and cross-cultural validity were not assessed. The levels of evidence of the overall study quality for measurement properties was determined using the GRADE approach specified in the COSMIN manual. The quality of the evidence was determined using a four-scale grading: high, moderate, low, and very low. Quality of measurement properties was assessed according to the COSMIN criteria for good measurement properties when possible (Mokkink et al. 2018 ).

Given that the main focus of this systematic literature review is the content and conceptual comparison of the found psychometrically tested measures, the methodological quality of the studies and quality of measurement properties are only briefly discussed.

The PRISMA flow chart (Fig. 1 ) summarizes the results of the search process. The search yielded a total of 7700 publications, and 9 more were found through hand search, resulting in 7709 items. After deduplication, the abstracts of the remaining 5620 unique records were screened, of which 408 were included for full-text screening. After full-text screening, a total of 189 publications were selected for inclusion in the qualitative analysis and synthesis.

Out of these 189, 108 used original measures, 67 used previously developed instruments, and 14 combined previously developed measures with new items; 171 of the studies used quantitative methods while 18 used mixed methods, see for example De Gagne et al. ( 2015 ) and Desselle ( 2003 ).

The frequency of papers published on the topic of HIL and similar constructs has increased over time, especially since 2015 (Fig.  2 ). Approximately 80% of these publications (151 of 189) are from the USA, while the rest come from other countries (see Fig.  3 for all countries).

Fig. 2 Frequency of publications per year

Fig. 3 Frequency of publications per country

Characteristics of measures

One hundred twenty-three original measures for HIL or related constructs were found. Out of these, 109 were quantitative (88.6%). When categorized based on the number of domains assessed according to the Paez et al. conceptual model, approximately 66% (72) of the original quantitative measures assessed only one domain, 19.3% (21) measured two domains, 8.3% (9) three, and 6.4% (7) assessed all four domains. Out of the original quantitative measures assessing only one domain, most measures (68) evaluated knowledge.

The most used measures and questions were derived from the Medicare Current Beneficiary Survey (MCBS) (24), Health Insurance Literacy Measure (HILM) (13), Kaiser Family Foundation quiz (8), United States Household Health Reform Monitoring Survey (5), and S-TOFHLA for functional health literacy (5).

Psychometrically tested measures

Nineteen of the measures found were evaluated for psychometric properties. These assessed at least one HIL domain according to the Paez et al. ( 2014 ) conceptual model and presented information on development and measurement properties. Eighteen of these measures were developed and used to evaluate Medicare beneficiaries’ knowledge in the USA. The most recent psychometrically tested measure found, the Health Insurance Literacy Measure (HILM), is the only one attempting to assess consumers’ ability to select and use private health insurance.

A summary of the measures for which psychometric properties were evaluated can be found in Table 2 . Table 3 describes the HIL domains assessed with these measures according to the Paez et al. conceptual model (2014). Table 4 summarizes the evaluation of the measurement properties quality. Information on the methodological quality of measurement development, methodological quality of studies assessing measurement properties, and measurement properties of the found psychometrically tested measures can be found in Online Resources 3 , 4 , and 5 , respectively. Level of evidence was moderate for all the following measures, which are listed in Table 2 , as only one study of adequate quality is available for most of them. Quality of measurement development was deemed inadequate for all measures as there was no information on the measurement development, or it lacked the inclusion of a target population sample.

As measures developed in the context of Medicare, and specifically the MCBS, represent the most widely used in the found studies, these are discussed below in the chronological order in which they were used, followed by the assessment of the HILM.

1995–1998 MCBS measures

Six measures were created by using existing items from different waves of the MCBS from 1995 to 1998 (Bann et al. 2000 ). The aim of these measures was to evaluate the impact of the National Medicare Education Program (NMEP) project, an initiative to develop informational resources for Medicare.

The first two of these measures were single item questions that assessed the understandability of the Medicare program, called the “Medicare Understandability Question” and the “Global Know-All-Need-To-Know Question.”

The “Know-All-Need-To-Know Index” consisted of five questions included in two rounds of the MCBS, in 1996 and 1998. It assessed how much individuals felt they knew about different aspects of Medicare, Medigap, benefits, health expenses, and finding and choosing health providers.

An eight-item true/false quiz, referred to as the “Eight-item Quiz” consisted of items that were included in one round in 1998. Items aimed to assess knowledge on Medicare options and managed care plans through true/false/not sure statements.

Similarly, a “four-item quiz,” and a “three-item quiz” were also generated by including true/false MCBS items used in 1998.

Results on measurement properties for the measures described above were presented in two papers (Bann et al. 2000 ; Bonito et al. 2000 ), which showed that the methodological quality varied between very good and adequate for structural validity, internal consistency, reliability, measurement error, construct validity, and responsiveness. Internal consistency was rated as sufficient for the “Know-All-Need-To-Know Index” and the “Eight-item quiz”, and as insufficient for the “Four-item quiz” and the “Three-item quiz”. Content validity and structural validity were indeterminate for all measures included in this set. Hypothesis testing and responsiveness were sufficient for all measures.

Kansas City index and national evaluation index

The “Kansas City Index” and the “National Evaluation Index” were measures designed to evaluate the impact of Medicare information material on beneficiaries’ knowledge; all of the questions could be answered by consulting the Medicare & You handbook (Medicare and You n.a. ).

The “Kansas City Index” (Bonito et al. 2000 ; McCormack et al. 2002 ) is a 15-item measure that was used to evaluate the impact of different interventions such as the distribution of the Medicare & You handbook, and the Medicare & You bulletin on Medicare beneficiary knowledge in the Kansas City metropolitan area.

The “National Evaluation Index” (Bonito et al. 2000 ; Mccormack and Uhrig 2003 ) is a 22-item index that reflects Medicare related knowledge in seven different content areas: awareness of Medicare options, access to original Medicare, cost implications of insurance choices, coverage/benefits, plan rules/restrictions, availability of information, and beneficiary rights.

Methodological quality for the assessment of measurement properties was either very good or adequate for structural validity, internal consistency, reliability, construct validity, and responsiveness. Methodological quality for the assessment of measurement error was considered doubtful for both the “Kansas City Index” and the “National Evaluation Index”, since the time period between responses was approximately two years (Bonito et al. 2000 ), whereas it is recommended to measure over shorter intervals, especially to assess test-retest reliability (Terwee et al. 2007 ).

Regarding the quality of measurement properties, both the “Kansas City Index” and the “National Evaluation Index” met sufficient criteria for internal consistency, hypothesis testing, and responsiveness. Quality of content validity and structural validity was indeterminate.

2002 Questionnaire, knowledge quizzes, and health literacy quizzes

Uhrig et al. (Uhrig et al. 2002 ) developed a questionnaire composed of 99 questions, based on a recommendation from Bonito et al. (Bonito et al. 2000 ) to develop a knowledge index using Item Response Theory (IRT), which would allow measuring and tracking Medicare beneficiaries’ knowledge over time. Questions were cognitively tested and calibrated using IRT to develop six alternative forms of quizzes: three on Medicare knowledge and three on health literacy specific to Medicare beneficiaries, focused on insurance-related terminology and scenarios (Bann and McCormack 2005 ). These six measures were generated to demonstrate how the different items in the previously generated questionnaire could be used to create dynamic quizzes to be integrated into the MCBS, using different questions throughout the waves of the survey, yet providing comparable results for longitudinal studies and allowing quizzes to be updated when items became irrelevant or policy changed.

The development of the questionnaire items described by Uhrig et al. (Uhrig et al. 2002 ) was based on background research, a review of existing Medicare informational materials and knowledge surveys, and discussions with experts in the field. Questions were generated and selected to cover five knowledge areas: eligibility for and structure of Original Medicare, Medicare+Choice (an alternative model of Medicare), plan choices and health plan decision-making, information and assistance, and Medigap/employer-sponsored supplemental insurance. Four additional question categories were included: self-reported knowledge, health literacy, cognitive abilities, and other non-knowledge items. The questionnaire was pilot tested within the MCBS, and the resulting data were used to evaluate the psychometric properties of the items and to calibrate them using IRT. After completing item calibration, the authors generated the three alternative forms of “knowledge” quizzes and three alternative forms of “health literacy” quizzes (Bann and McCormack 2005 ). Items were selected by the authors so that each of the forms would contain at least one item from each of the identified content areas, items with similar total correct percentages, high slopes, a variety of difficulty levels, as well as items that seemed relevant from a policy standpoint. The three knowledge measures contained one Original Medicare section and one Medicare+Choice section, covering both factors in each of the forms. Similarly, the three “health literacy” measures contained sections for two different factors: a terminology factor and a reading comprehension factor.
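As background on the IRT terminology used above (slopes and difficulty levels), the following minimal sketch shows the two-parameter logistic (2PL) model on which such calibrations typically rely; the item parameters below are hypothetical and are not taken from the MCBS quizzes.

import numpy as np

def p_correct(theta, a, b):
    # 2PL model: probability of a correct response given ability theta,
    # item discrimination a ("slope"), and item difficulty b
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical calibrated items: (slope, difficulty)
items = {"easy_item": (1.4, -1.0), "hard_item": (1.3, 1.5)}
theta_grid = np.linspace(-3, 3, 7)  # grid of beneficiary ability levels
for name, (a, b) in items.items():
    print(name, np.round(p_correct(theta_grid, a, b), 2))

Assembling parallel quiz forms then amounts to selecting items whose difficulty values span this ability range while keeping the slopes comparable across forms.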

Methodological quality on measurement properties for the 99-item questionnaire was adequate for structural validity. Methodology for structural validity, construct validity, comparison with other instruments, and comparison between subgroups was adequate for all quizzes, while quality for internal consistency, reliability, and measurement error was very good.

With regard to measurement properties, content validity was inconsistent for the 99-item questionnaire, while it met the sufficient criteria for structural validity, hypothesis testing, and responsiveness. The “Knowledge” and “Health Literacy” measures, forms A, B, and C, met the sufficient criteria for hypothesis testing and responsiveness, while internal consistency was insufficient, and content validity and structural validity were indeterminate.

Perceived knowledge index (PKI) and seven-item quiz (2003)

In a paper published in 2003, Bann et al. used data from the 1998 and 1999 rounds of the MCBS to develop two different indices to measure knowledge of Medicare beneficiaries. The first of these indices is the “Perceived Knowledge Index” (PKI), a five-item measure constructed from questions included in two different rounds of the MCBS. The questions assessed how much beneficiaries felt they knew about different aspects of Medicare. The second index was a seven-item quiz, also made up of questions included in two different rounds of the MCBS, which used true/false questions to test objective knowledge of Medicare (Bann et al. 2003 ).

For both the PKI and the “Seven-item quiz”, the methodological quality of the analyses of structural validity, internal consistency, known-groups validity, and comparison between subgroups was very good, while the quality of the analyses of convergent validity and comparison with other instruments was adequate.

Both PKI and the “Seven-item quiz” met sufficient criteria for internal consistency, hypothesis testing, and responsiveness. Content validity was indeterminate.

HIL framework and health insurance literacy items (MCBS) (2009)

In 2009, McCormack et al. published a paper in which they described the development of a framework and measurement instrument for HIL in the context of Medicare (McCormack et al. 2009 ). The framework was built based on a literature review and additional key studies. It integrates consumer and health care system variables that would be associated with HIL and the navigation of the health care and health insurance systems. Items were developed to operationalize the framework and cognitively tested and eventually fielded in the MCBS national survey.

Methodological quality for the analysis of structural validity, internal consistency, known-groups validity, and comparison between subgroups was considered to be very good. Reliability, convergent validity, and comparison with other instruments had adequate methodological quality.

The measure met the sufficient measurement properties criteria for internal consistency, hypothesis testing, and responsiveness. Structural validity was insufficient, and content validity was indeterminate.

Health insurance literacy measure (HILM) (2014)

The Health Insurance Literacy Measure (HILM) is a self-assessment measure constructed based on formative research and stakeholder guidance. The conceptual basis of the measure is the Health Insurance Literacy conceptual model by Paez et al. (Paez et al. 2014 ), which was previously described.

The HILM was developed using a four-stage process. First a conceptual model for HIL was constructed. Once the conceptual model was finalized, two pools of items were created to operationalize its domains. Stage three consisted of cognitive testing of the items. Finally, the last stage involved field-testing of the HILM to develop scales and establish its validity.

When assessing the methodological quality of the study on measurement properties, comprehensibility for the HILM was rated as doubtful, while relevance was adequate. Structural validity, internal consistency, reliability, measurement error, known-groups validity, comparison between subgroups, and comparison before and after intervention were very good. Convergent validity and comparison with other instruments were inadequate.

Concerning measurement properties, evidence for content validity was found to be inconsistent. Structural validity was rated as insufficient, while the sufficient criteria for internal consistency, hypothesis testing, and responsiveness were met.

This review demonstrates the wide variety of instruments that have been used to assess HIL and related constructs. While the increasing body of evidence around HIL and related constructs, such as health insurance knowledge or understanding, has provided valuable insights into the topic, further steps are needed to improve the quality and value of gathered data and reduce waste in research (Ioannidis et al. 2014 ).

Approximately 80% of the studies found through this review were carried out in the USA. Reforms and expansion of social health programs in the USA, such as Medicare, have played an important role in the growing interest in HIL. This is reflected in the fact that most of the psychometrically evaluated instruments identified through this review have been developed and used in the context of Medicare, which involves a very specific population exclusive to the USA and its health insurance system. Only one of the psychometrically tested instruments, as well as its underlying conceptual model (Paez et al. 2014 ), was developed and used in the context of private health insurance. However, it was still exclusive to the USA health insurance system and its population.

Furthermore, the domains and aspects assessed by these measures focus mostly on knowledge, the only domain assessed by all found psychometrically tested measures. Of the 19 psychometrically tested measures found, four assessed all domains of HIL according to the conceptual model of Paez et al. ( 2014 ), while 12 of them evaluated only one or two domains. Most of these measures not only ignore important aspects associated with the way people navigate the health insurance system and make health insurance and healthcare decisions, but they also make it difficult to compare results across studies and populations.

None of the measures reviewed included a sample of the target population in the development phase, which is important to evaluate comprehensiveness. As a result, the quality of measurement development was rated inadequate. Similarly, while assessment of reliability and measurement error was mentioned by some of the studies, little information was available to determine the quality of measurement properties for each individual measure.

A valid and reliable instrument to assess HIL may not only help to accurately measure an individual’s or population’s HIL level but can also provide ways of identifying vulnerable groups and guide the development and implementation of effective interventions to facilitate health system navigation, health insurance utilization, and access. For example, a deeper understanding of HIL could inform health insurance design and choice architecture to facilitate optimal health insurance selection (Barnes et al. 2019 ).

Given that existing HIL measurement tools are context-specific, further steps may require adapting current definitions of HIL, as well as its conceptualization and operationalization, to specific health and health insurance systems. The McCormack et al. ( 2009 ) framework for HIL and the Paez et al. ( 2014 ) conceptual model for HIL might provide a foundation for measurement development and research in other contexts, but they should be adapted and the resulting instruments should be tested for validity.

In addition, there are currently not enough standardized measures that would allow the assessment of HIL across different contexts and health insurance systems. The translation and cultural validation of the HILM could represent a viable way to further research on specific health insurance systems. However, one of its limitations is that it is a self-administered and self-reported instrument, which may introduce respondent bias and provide a subjective perception of one’s own HIL rather than an objective assessment of it.

There are some limitations to this systematic literature review. First, the search was restricted to studies or papers published in English that used quantitative or mixed methods instruments to assess HIL and related constructs. Therefore, some relevant instruments may have been missed. Second, even though the COSMIN checklist and guideline for systematic reviews is a valuable tool for the evaluation and critical appraisal of outcome measures, it was originally developed to assess the methodological quality of health outcome measures, and may not be ideal for evaluating instruments assessing HIL.

Data availability

Data resulting from the systematic review can be requested from the corresponding author.

Code availability

Code for figure generation can be requested from the corresponding author.

Bann C, McCormack L (2005) Measuring knowledge and health literacy among medicare beneficiaries. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Reports/Research-Reports-Items/CMS062191.html

Bann C, Lissy K, Keller S, et al (2000) Analysis of medicare beneficiary baseline knowledge data from the medicare current beneficiary survey knowledge index technical note. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Reports/downloads/berkman_2003_7.pdf

Bann CM, Terrell SA, McCormack LA, Berkman ND (2003) Measuring beneficiary knowledge of the Medicare program: a psychometric analysis. Health Care Financ Rev 24:111–125

Barnes A, Hanoch Y, Rice T (2015) Determinants of coverage decisions in health insurance marketplaces: consumers’ decision-making abilities and the amount of information in their choice environment. Health Serv Res 50:58–80. https://doi.org/10.1111/1475-6773.12181

Barnes AJ, Karpman M, Long SK et al (2019) More intelligent designs: comparing the effectiveness of choice architectures in US health insurance marketplaces. Organ Behav Hum Decis Process. https://doi.org/10.1016/j.obhdp.2019.02.002

Bhargava S, Loewenstein G (2015) Choosing a health insurance plan: complexity and consequences. JAMA 314:2505–2506. https://doi.org/10.1001/jama.2015.15176

Bhargava S, Loewenstein G, Sydnor J (2015) Do individuals make sensible health insurance decisions? Evidence from a menu with dominated options. NBER Working Papers 21160, National Bureau of Economic Research

Bhargava S, Loewenstein G, Benartzi S (2017) The cost of poor health (plan choices) & prescriptions for reform. Behav Sci 3:12

Bonito A, Bann C, Kuo M, et al (2000) Analysis of baseline measures in the Medicare current beneficiary survey for use in monitoring the national medicare education program. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Reports/Downloads/Bonito_2000_7.pdf . Accessed 23 Apr 2020

Cafferata GL (1984) Knowledge of their health insurance coverage by the elderly. Med Care 22:835–847

De Gagne J, PhD D, Oh J et al (2015) A mixed methods study of health care experience among Asian Indians in the southeastern United States. J Transcult Nurs 26:354–364. https://doi.org/10.1177/1043659614526247

Desselle SP (2003) Consumers’ lack of awareness on issues pertaining to their prescription drug coverage. J Health Soc Policy 17:21–39

DistillerSR (n.a.) Evidence Partners, Ottawa, Canada. https://www.evidencepartners.com . Accessed April 2019-July 2020

Edward J, Morris S, Mataoui F et al (2018) The impact of health and health insurance literacy on access to care for Hispanic/Latino communities. Public Health Nurs 35:176–183. https://doi.org/10.1111/phn.12385

Flores G, Lin H, Walker C et al (2017) The health and healthcare impact of providing insurance coverage to uninsured children: A prospective observational study. BMC Public Health 17:Article number: 553. https://doi.org/10.1186/s12889-017-4363-z

Garnick D, Hendricks A, Thorpe K et al (1993) How well do Americans understand their health coverage? Health Aff 12:204–212. https://doi.org/10.1377/hlthaff.12.3.204

Ioannidis JPA, Greenland S, Hlatky MA et al (2014) Increasing value and reducing waste in research design, conduct, and analysis. Lancet 383:166–175. https://doi.org/10.1016/S0140-6736(13)62227-8

James TG, Sullivan MK, Dumeny L, et al (2018) Health insurance literacy and health service utilization among college students. J Am Coll Heal 68:200–206. https://doi.org/10.1080/07448481.2018.1538151

Lambert ZV (1980) Elderly consumers’ knowledge related to Medigap protection needs. J Consum Aff 14:434–451. https://doi.org/10.1111/j.1745-6606.1980.tb00680.x

Lockwood C, Munn Z, Porritt K (2015) Qualitative research synthesis: methodological guidance for systematic reviewers utilizing meta-aggregation. JBI Evid Implement 13:179–187. https://doi.org/10.1097/XEB.0000000000000062

Loewenstein G, Friedman JY, McGill B et al (2013) Consumers’ misunderstanding of health insurance. J Health Econ 32:850–862. https://doi.org/10.1016/j.jhealeco.2013.04.004

Long S, Shartzer A, Polity M (2014) Low Levels of Self-Reported Literacy and Numeracy Create Barriers to Obtaining and Using Health Insurance Coverage. http://apps.urban.org/features/hrms/briefs/Low-Levels-of-Self-Reported-Literacy-and-Numeracy.html . Accessed 11 Feb 2019

Marquis MS (1983) Consumers’ knowledge about their health insurance coverage. Health Care Financ Rev 5:65–80

McCall N, Rice T, Sangl J (1986) Consumer knowledge of Medicare and supplemental health insurance benefits. Health Serv Res 20:633–657

McCormack L, Uhrig J (2003) How does beneficiary knowledge of the Medicare program vary by type of insurance? Med Care 41:972–978

McCormack LA et al (2002) Health insurance knowledge among Medicare beneficiaries. Health Serv Res 37:43–63

McCormack L, Bann C, Uhrig J et al (2009) Health insurance literacy of older adults. J Consum Aff 43:223–248. https://doi.org/10.1111/j.1745-6606.2009.01138.x

McGowan J, Sampson M, Salzwedel DM et al (2016) PRESS peer review of electronic search strategies: 2015 guideline statement. J Clin Epidemiol 75:40–46. https://doi.org/10.1016/j.jclinepi.2016.01.021

Medicare & You | Medicare (n.a.) https://www.medicare.gov/medicare-and-you . Accessed 8 Oct 2020

Moher D, Liberati A, Tetzlaff J et al (2009) Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 6:e1000097. https://doi.org/10.1371/journal.pmed.1000097

Mokkink LB, Prinsen C, Patrick D et al (2018) COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res 27:1147–1157. https://doi.org/10.1007/s11136-018-1798-3

Paez KA, Mallery CJ, Noel H et al (2014) Development of the health insurance literacy measure (HILM): conceptualizing and measuring consumer ability to choose and use private health insurance. J Health Commun 19:225–239. https://doi.org/10.1080/10810730.2014.936568

Quincy L (2012a) What’s behind the door: consumer difficulties selecting health plans. In: Consumers union. https://consumersunion.org/research/whats-behind-the-door-consumer-difficulties-selecting-health-plans/ . Accessed 17 Jul 2018

Quincy L (2012b) Measuring health insurance literacy: a call to action. https://consumersunion.org/pub/Health_Insurance_Literacy_Roundtable_rpt.pdf . Accessed 17 Jul 2018

StataCorp LLC (2019) Stata statistical software: release 16. StataCorp LLC, College Station

Tennyson S (2011) Consumers’ insurance literacy: evidence from survey data. Finan Serv Rev 20:165–179

Terwee CB, Bot SDM, de Boer MR et al (2007) Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 60:34–42. https://doi.org/10.1016/j.jclinepi.2006.03.012

Terwee CB, Prinsen CAC, Chiarotto A et al (2018) COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res 27:1159–1170. https://doi.org/10.1007/s11136-018-1829-0

Tipirneni R, Politi MC, Kullgren JT et al (2018) Association between health insurance literacy and avoidance of health care services owing to cost. JAMA Netw Open 1:e184796. https://doi.org/10.1001/jamanetworkopen.2018.4796

Tseng C-W, Dudley RA, Brook RH et al (2009) Elderly patients’ knowledge of drug benefit caps and communication with providers about exceeding caps. J Am Geriatr Soc 57:848–854. https://doi.org/10.1111/j.1532-5415.2009.02244.x

Uhrig J, Squire C, McCormack L, et al (2002) Questionnaire Development Final Report 110

Vardell EJ (2017) Health insurance literacy: how people understand and make health insurance purchase decisions. Ph.D., the University of North Carolina at Chapel Hill

Acknowledgements

The author would like to thank in particular Prof. Dr. Stefan Boes, Dr. Sarah Mantwill, Aljosha Benjamin Hwang, Daniella Majakari, and Tess Bardy for their contribution to this review.

Open Access funding provided by Universität Luzern.

Author information

Authors and Affiliations

Department of Health Sciences and Medicine, Center for Health, Policy and Economics, University of Lucerne, and Swiss Learning Health System, Frohburgstrasse 3, PO Box 4466, CH-6002 Lucerne, Switzerland

Ana Cecilia Quiroga Gutiérrez

Contributions

Ana Cecilia Quiroga Gutierrez contributed to the study conception and design. Material preparation, screening of titles and abstracts, screening of full texts, data collection, synthesis, and analysis were performed by Ana Cecilia Quiroga Gutierrez. A second independent reviewer (non-author), Daniella Majakari, performed screening of titles and abstracts as well as full texts. A third independent reviewer (non-author), Tess Bardy, performed data collection. Peer review of the search strategy was conducted by Aljoscha Benjamin Hwang (non-author). The work was critically revised by Sarah Mantwill (non-author) and Prof. Dr. Stefan Boes (non-author).

Corresponding author

Correspondence to Ana Cecilia Quiroga Gutiérrez .

Ethics declarations

Conflict of interest

The author has no relevant financial or non-financial interests to disclose.

Ethics approval

Not applicable. Given that this study is a systematic literature review, it involved no human subjects or animals.

Consent to participate

Not applicable.

Consent for publication

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Online Resources 1–5 (DOC files: 91 kb, 32 kb, 123 kb, 137 kb, 117 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Quiroga Gutiérrez, A.C. Health insurance literacy assessment tools: a systematic literature review. J Public Health (Berl.) 31 , 1137–1150 (2023). https://doi.org/10.1007/s10389-021-01634-7

Download citation

Received : 10 February 2021

Accepted : 30 June 2021

Published : 03 August 2021

Issue Date : July 2023

DOI : https://doi.org/10.1007/s10389-021-01634-7

  • Health insurance literacy
  • Health insurance
  • Health insurance decision making
  • Health insurance education
  • Health literacy
  • Open access
  • Published: 31 October 2023

Instruments for measuring nursing research competence: a COSMIN-based scoping review

  • Yuting Xia 1   na1 ,
  • Hui Huang 2   na1 ,
  • Xirongguli Halili 1 ,
  • Siyuan Tang 1 &
  • Qirong Chen 1 , 3  

BMC Nursing volume  22 , Article number:  410 ( 2023 ) Cite this article

1016 Accesses

Metrics details

The aim of this scoping review was to evaluate and summarise the measurement properties of nursing research competence instruments and provide a summary overview of the use of nursing research competence instruments.

An increasing number of nursing research competence instruments have been developed. However, a systematic review and evaluation of nursing research competence instruments is lacking.

This scoping review was conducted following the Joanna Briggs Institute updated methodology for scoping reviews and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews checklist. Reviewers searched for articles in eight English databases and two Chinese databases between April 1st, 2022, and April 30th, 2022. An updated literature search was conducted between March 1st and March 4th, 2023. Literature screening and data extraction were conducted independently by two reviewers, and a third reviewer was involved when consensus was needed. The COnsensus-based Standards for the selection of health Measurement Instruments methodology was used to evaluate the methodological quality and measurement properties of the nursing research competence instruments.

Ten studies involving eight nursing research competence instruments were included. None of the existing instruments has been evaluated for all measurement properties. A total of 177 empirical studies have utilized a nursing research competence instrument with tested measurement properties.

The ‘Self-evaluated Nursing Research Capacity Questionnaire (refined)’ was identified as the most appropriate of the existing nursing research competence instruments. However, further studies of the measurement properties of the existing nursing research competence instruments are needed.

Implications for nursing policy

This study could guide the selection of appropriate nursing research competence instruments which could help to evaluate the nursing research competence of nurses and inform the development of intervention plans to enhance nursing research competence.

Peer Review reports

Introduction

Nursing research competence (NRC) refers to the individual nurse’s ability to conduct nursing research activities [ 1 , 2 ]. Evidence-based nursing has developed rapidly in recent years, and the importance of evidence-based nursing in improving clinical nursing quality has been confirmed by many researchers [ 3 , 4 , 5 ]. However, there is currently a lack of relevant available evidence focusing on clinical problems, so it is necessary for some nurses with nursing research competence to conduct original research on clinical practice in order to generate relevant available evidence and promote evidence-based nursing practices [ 6 ]. Specifically, enhancing the NRC of nurses holds significant importance for the advancement of high-quality clinical nursing research. For clinical nurses who are inclined towards research, possessing strong NRC can motivate them to address clinical issues scientifically, apply evidence-based practices, and contribute to bridging the gap between theory and practical application [ 7 ]. Because nursing students are future nursing researchers and nurses, improving their NRC has a positive effect on the future development of nursing [ 8 , 9 ]. Using NRC instruments is necessary to evaluate the NRC of nursing staff and the effectiveness of interventions [ 8 , 10 ].

Measuring the NRC of nursing staff is important for research, education, and management purposes. Research has shown that clinical nurses are the end users and producers of nursing research, and nurses with research competence can promote the development of the nursing discipline [ 11 ]. The prerequisite for improving nurses' research competence is to clarify the current situation and influencing factors of nurses' research competence, which provides a precise theoretical basis for formulating intervention plans to improve nursing staff's research competence [ 11 ]. An important way to clarify the current state of NRC and its associated factors is to use precise NRC instruments to measure it. Such instruments can provide evidence for building effective intervention strategies in research, and for evaluating teaching quality and promoting the development of courses or training programs in education [ 9 ]. In addition, using NRC instruments to measure the NRC of nurses could help nursing managers identify which nurses have good research competence, assist in organizing and conducting research projects, and cultivate research-oriented nurses in a targeted manner [ 10 , 12 ]. Therefore, it is important to evaluate the measurement properties and the application of existing NRC instruments. This could aid in selecting the most appropriate instrument and in revising and/or developing higher-quality instruments. COSMIN (Consensus-based Standards for the Selection of Health Measurement Instruments) is a consensus-based standard for the selection of health measurement instruments, which can be used to evaluate the methodological quality and measurement properties of measurement instruments and to provide recommendations for instrument selection [ 13 ]. This study evaluated all measurement properties of the NRC instruments based on the COSMIN methodology; more detailed steps of the COSMIN methodology are described in the ' Methods ' section.

Literature review

Recently, many NRC instruments have been developed, such as the Self-evaluated Nursing Research Capacity Questionnaire for nursing staff by Liu [ 14 ], later refined by Pan [ 15 ], the Research Competence Scale for nursing students by Qiu [ 9 ], and the Scientific Research Competency Scale for nursing professionals at the undergraduate and graduate levels by Pinar Duru [ 16 ]. However, researchers are often unsure about how to choose an instrument that accurately measures the NRC of the target population. The selection of an instrument directly affects the accuracy and credibility of empirical research results.

Research performed with outcome measurement instruments of poor or unknown quality constitutes a waste of resources and is unethical [ 13 ]. Selecting a measurement instrument with good reliability and validity is crucial to accurately evaluate NRC. While numerous instruments are available for measuring NRC [ 9 , 15 , 16 , 17 ], to our knowledge there is still a lack of comprehensive evaluation and research to guide the selection and development of NRC instruments [ 8 ]. Therefore, the purpose of this scoping review is to identify, evaluate, compare, and summarize the current NRC instruments and their usage, to provide guidance for researchers in selecting appropriate NRC instruments and developing new ones in the future.

This scoping review could answer the following questions [ 8 ]: (1) Which NRC instruments have been developed and how they were used in related studies? (2) Were there any well-validated and reliable instruments for measuring NRC? (3) If there were more than one well-validated and reliable instrument for measuring NRC, were there circumstances under which certain instruments were more appropriate for measuring NRC than the other instruments? (4) What were the differences between NRC instruments designed for different groups (e.g., clinical nurses, nursing students)? and (5) What were potential directions for the future development and improvement of NRC instruments?

(1) To identify, evaluate, compare, and summarize the validated instruments developed to measure nursing research competence.

(2) To provide an overview of the use of all NRC instruments.

Protocol and registration

This scoping review was conducted following: (1) the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) guidance [ 13 ], (2) Joanna Briggs Institute (JBI) updated methodology for scoping review [ 18 ] and was reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews checklist (PRISMA-ScR checklist) [ 19 ]. A protocol for this scoping review had been published [ 8 ] and registered on the Open Science Framework (osf.io/ksh43).

Search strategy

Reviewers searched for articles in eight English databases, including the Cochrane Library, Cumulative Index to Nursing and Allied Health Literature (CINAHL), Excerpta Medica Database (EMBASE), PubMed, PsycINFO, Scopus, Education Resource Information Center (ERIC), and ProQuest Dissertations & Theses Global, as well as two Chinese databases, namely the China National Knowledge Infrastructure (CNKI) and WANGFANG DATA, between April 1st, 2022, and April 30th, 2022. An updated literature search was conducted between March 1st and March 4th, 2023, covering the literature published between April 1st, 2022, and March 1st, 2023. Our search methodology was guided by the COSMIN guideline and encompassed three primary components: (1) the target demographic (e.g., nurses, nursing students), (2) the focal concept (e.g., research, competence), and (3) the measurement attributes (e.g., internal consistency, content validity, among others). Detailed search strategies for each database can be found in Tables S 4 -S 13 in the supplementary material.
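As a purely illustrative sketch (the terms below are hypothetical; the actual strategies are those in Tables S4–S13), the following snippet shows how the three search blocks described above can be combined into a single Boolean query.

# Illustrative only: hypothetical terms for the three search blocks
population  = ["nurse*", "nursing student*"]
construct   = ["research competenc*", "research capacity", "research capabilit*"]
measurement = ["psychometr*", "validit*", "reliabilit*", "instrument*"]

def or_block(terms):
    # Join the synonyms of one block with OR and wrap them in parentheses
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The final query ANDs the population, concept, and measurement blocks together
query = " AND ".join(or_block(block) for block in (population, construct, measurement))
print(query)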

Eligibility criteria

The scoping review aimed to (1) summarize the instruments developed to measure NRC and (2) provide an overview of their use [ 8 ]. The inclusion criteria were as follows: (1) the instrument aims to measure NRC; (2) the study targeted nursing personnel (e.g., nurses, nursing students, nursing teachers, etc.); (3) the study concerned NRC instruments; (4) the aim of the study was the evaluation of one or more measurement properties, the development of an NRC instrument, or the evaluation of the interpretability of an NRC instrument; (5) the study was published between 1999 and 2023 (this time frame was chosen because most research related to NRC was published after 1999, and our search was conducted between April 1st, 2022, and April 30th, 2022, with an updated search between March 1st and March 4th, 2023, covering the literature published between April 1st, 2022, and March 1st, 2023); and (6) the full text of the study was available.

Study screening

All studies were exported to an EndNote X9 library (Clarivate Analytics, USA), and duplicates were removed using its deduplication function. Two reviewers (YX and HH) independently screened the titles and abstracts, followed by assessment of the full texts of potentially eligible articles. Disagreements between the two reviewers were resolved by a third reviewer (QC). Any articles that were not available online or through author contact were excluded, and the references of the included studies were also screened using the same process.

Data extraction

Two reviewers (YX and HH) independently extracted data using the extraction forms (Tables S 1 -S 4 ) provided in the published protocol of this scoping review [ 20 ]. A third reviewer (QC) reviewed the results, and any disagreements were resolved by discussion.

For all eligible studies of objective (1), we extracted information on the development and verification of the instruments and on their measurement properties. However, none of the self-designed scales provided details about the development of the NRC instruments or psychometric testing; these scales were therefore not evaluated using the COSMIN methodology and their data were not extracted in this part of the study. The extracted data are shown in Table S 1 , Table S 2 , and Table 1 .

For all eligible studies of objective (2), we extracted the information including author, year, location, study aim, design (intervention), participants, sample size, the instrument of NRC used, and results related to NRC. The information is shown in Table S 3 in the supplementary file.

Quality appraisal and data synthesis

Two reviewers (YX and HH) appraised the quality of the studies, with a third reviewer (QC) resolving any disagreement. First, the content validity (instrument development and content validity) was considered the most important section to determine whether the instrument items were suitable for the construct of interest and target population. Next, evaluating the internal structure (structural validity, internal consistency, and cross-cultural validity) was crucial to understand how the items were combined into a scale or subscale. Finally, the remaining measurement properties (reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness) were also taken into account [ 13 ]. Based on the COSMIN methodology, the studies for objective (1) were evaluated through the following three sections.

Evaluation of methodological quality

The COSMIN Risk of Bias Checklist was used to evaluate the risk of bias of 10 measurement properties (including content validity, structural validity, internal consistency, cross-cultural validity, reliability, measurement error, criterion validity, construct validity hypothesis testing, and responsiveness) [ 13 ]. The COSMIN Risk of Bias Checklist has 116 items, each with five options: “very good”, “adequate”, “doubtful”, “inadequate”, and “not applicable”. The overall rating of the quality of each study on every measurement property was determined using the lowest rating among the items [ 13 ] (Table 1 ).

Evaluation the quality of measurement properties

The methodological quality ratings of instrument development and the reviewers' ratings were used to evaluate content validity against the 10 criteria for good content validity, scoring each measure as "sufficient ( +)", "insufficient (-)", or "inconsistent ( ±)" [ 13 ]. The overall rating for a measure was determined by the ratings for relevance, comprehensiveness, and comprehensibility, with inconsistent ratings being scored as ( ±) [ 13 ] (Table 2 ). The results for the other measurement properties (including structural validity, internal consistency, cross-cultural validity, reliability, measurement error, criterion validity, construct validity hypothesis testing, and responsiveness) were evaluated against the updated criteria for good measurement properties and were rated as "sufficient ( +)", "insufficient (-)", or "indeterminate (?)" [ 13 ] (Table 2 ). The overall rating was based on the synthesized results, which were generated from the measurement properties reported in each single study.

Grading of the evidence

The modified GRADE approach was used to rate the quality of evidence, based on the number and quality of available studies, their results, reviewer ratings, and consistency of results. The overall quality was graded as "High", "Moderate", "Low", or "Very low" [ 13 ]. Evidence quality was further downgraded based on the presence of risk of bias, inconsistency, and indirectness [ 13 ] (Table 2 ).
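As a purely illustrative sketch of this grading logic (the actual GRADE judgments are qualitative and can downgrade evidence by more than one level per concern), starting at "High" and stepping down once per serious concern could look like this:

# Illustrative only: simplified downgrading from "High" by one level per serious concern
LEVELS = ["High", "Moderate", "Low", "Very low"]

def grade_quality(serious_concerns):
    # serious_concerns: e.g., ["risk of bias", "inconsistency"]
    return LEVELS[min(len(serious_concerns), len(LEVELS) - 1)]

print(grade_quality(["risk of bias", "inconsistency"]))  # -> "Low"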

Studies that only used the NRC instrument as a variable without testing its properties would not be evaluated, but their characteristics would be extracted.

Recommendation

Instruments were categorized using COSMIN guidelines into three groups: (A) instruments with evidence for sufficient content validity (any level) AND at least low-quality evidence for sufficient internal consistency; (B) instruments not categorized as A or C; and (C) instruments with high-quality evidence for an insufficient measurement property [ 13 ].

Instruments categorized as (A) can be recommended for widespread use. Instruments categorized as (B) have the potential to be recommended for use, but further research is needed to assess their quality. Instruments categorized as (C) should not be recommended for use.
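The following minimal sketch expresses this categorization rule in code; the boolean inputs are hypothetical summaries of the evidence and the function name is ours, not part of COSMIN.

def cosmin_category(sufficient_content_validity,
                    low_quality_sufficient_internal_consistency,
                    high_quality_insufficient_property):
    # Sketch of the three COSMIN recommendation categories described above
    if sufficient_content_validity and low_quality_sufficient_internal_consistency:
        return "A"  # can be recommended for widespread use
    if high_quality_insufficient_property:
        return "C"  # should not be recommended for use
    return "B"      # potential, but further validation studies are needed

# Example: sufficient content validity but internal consistency still indeterminate
print(cosmin_category(True, False, False))  # -> "B"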

Search results

A total of 3265 articles were retrieved, 920 duplicates were removed, and 454 were screened for eligibility. From these, 10 studies on NRC instrument development and psychometric properties, 177 empirical studies using a psychometric tested NRC instrument, and 23 empirical studies using a self-designed NRC questionnaire (without describing the development or/and the psychometric testing) were identified (Fig.  1 ).

Fig. 1 PRISMA flow diagram for this scoping review

Study characteristics

Tables S 1 and S 2 present the characteristics of the eligible NRC instruments and study populations for objective (1). Six original instruments [ 9 , 14 , 16 , 21 , 22 , 23 ], two modified instruments [ 10 , 15 , 17 ], and one study testing the psychometric properties of an existing NRC instrument [ 24 ] are featured in these tables. However, among the ten articles, two articles (one dissertation and one published in a peer-reviewed journal) were published by the same author and described the same instrument [ 10 , 15 ]. Therefore, we only extracted and evaluated data from the dissertation for this instrument [ 15 ]. Self-designed scales without a description of their development or psychometric testing were not included in the quality appraisal.

Table S 3 shows an overview of all eligible studies for objective (2), along with the NRC instruments that were identified and the number of studies that utilized each specific instrument. The NRC instrument ⑦ adapted by Pan was the most commonly used instrument, with a frequency of 127 [ 15 ]. The NRC instrument ⑤ developed by Liu was used 38 times to measure the NRC of nursing staff [ 14 ]. The NRC instrument ⑧ was used seven times, and the NRC instrument ③ was used twice. The NRC instruments ② and ⑥ were each used only once to measure NRC. However, the NRC instruments ① and ④ have not been used. Self-designed NRC instruments without validation were used in 23 studies.

Results of the NRC instrument evaluation

The results of the methodological quality assessment are shown in Table 1 . Among the nine studies included, none evaluated all measurement properties [ 9 , 14 , 15 , 16 , 17 , 21 , 22 , 23 , 24 ]. Cross-cultural validity, measurement error, criterion validity, and responsiveness were not evaluated in any of the studies.

Content validity for five NRC instruments ( ① ② ③ ④ ⑥ ) was rated as 'inconsistent ( ±)', while that of NRC instruments ⑤ ⑦ ⑧ was rated as 'sufficient ( +)'. Structural validity was rated as 'sufficient ( +)' for two NRC instruments ( ③ ⑦ ) and as 'indeterminate (?)' for four NRC instruments ( ④ ⑤ ⑥ ⑧ ). Internal consistency was rated as 'sufficient ( +)' for three NRC instruments ( ② ④ ⑥ ), as 'insufficient (-)' for two NRC instruments ( ① ③ ), and as 'indeterminate (?)' for three NRC instruments ( ⑤ ⑦ ⑧ ). Reliability was rated as 'indeterminate (?)' for seven NRC instruments ( ① ③ ④ ⑤ ⑥ ⑦ ⑧ ). Hypotheses testing for construct validity was rated as 'sufficient ( +)' for NRC instruments ④ and ⑧ . More details are shown in Table 2 .

As presented in Table 2 , the content validity evidence for NRC instruments ⑤ ⑦ ⑧ was rated as 'moderate', while that of the other NRC instruments ( ① ② ③ ④ ⑥ ) was rated as 'very low'. The quality of the evidence for structural validity of six NRC instruments ( ③ ④ ⑤ ⑥ ⑦ ⑧ ) was rated as 'high'. The evidence quality for internal consistency was rated as 'moderate' for NRC instruments ① and ② , as 'low' for NRC instrument ③ , and as 'high' for NRC instruments ④ ⑤ ⑥ ⑦ ⑧ . The evidence quality for hypotheses testing for construct validity was rated as 'low' for NRC instruments ④ and ⑧ .

Recommended NRC instruments

Based on the evaluation results, three NRC instruments ( ⑤ ⑦ ⑧ ) were rated as 'sufficient ( +)' for content validity, but their internal consistency was rated as 'indeterminate (?)' (Table 2 ). Thus, they were recommended for use under category B. The other NRC instruments ( ① ② ③ ④ ⑥ ) had evidence of 'inconsistent ( ±)' content validity and lacked high-quality evidence indicating that their content validity was 'insufficient' (Table 2 ). Therefore, these instruments ( ① ② ③ ④ ⑥ ) were also recommended for use under category B (Table 2 ). As all NRC instruments ( ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ) were recommended for use under category B, they have the potential to be recommended, but further validation studies are needed [ 13 ].

The overview of the usage of all NRC instruments

Most studies on NRC instruments were conducted in China (197/200, 98.5%). Cross-sectional studies were the predominant design (147/200, 73.5%), followed by before-after studies (34/200, 17%), randomized controlled trials (18/200, 9%), and one quasi-experimental study (1/200, 0.5%). All of the studies (100%) were published after 2009, with most targeting either nurses (121/200, 60.5%) or nursing students (66/200, 33%). Further details on objective (2) can be found in Table S3.

This scoping review evaluated eight NRC instruments using the COSMIN checklist, but none of them has been assessed for all measurement properties. Among the existing eight NRC instruments, NRC instrument ⑦ is the most widely used, although to date it has only been used by Chinese scholars. This may be because NRC instrument ⑦ was developed by the Chinese scholar Pan et al. In addition, NRC instrument ⑦ was developed in 2011 and was one of the earliest NRC instruments developed in China.

Lack of reference to the target population during development was an important shortcoming in the development of NRC instruments. The items in NRC instruments should be both relevant and comprehensive for the construct being measured, as well as comprehensible for the study population. These elements are crucial for ensuring content validity, which in turn underpins an instrument's psychometric properties, and they require cognitive interviews with the target population [ 13 , 20 ]. However, only two NRC instruments ( ⑦ ⑧ ) conducted cognitive interviews with the target population during development, and the published articles lacked details of the cognitive interview process. Additionally, three studies ( ⑤ ⑦ ⑧ ) asked the target population about the relevance, comprehensiveness, and comprehensibility of the instrument's content, while experts were consulted about the relevance and comprehensiveness of the instruments in all three studies. Comprehensive details of the cognitive interview process are necessary to evaluate content validity, but such details were missing from the published articles. This may be because the COSMIN guideline was published in 2018, while most (75%) of the NRC instruments in this review were developed prior to 2018 [ 25 ].

Confirmatory factor analysis (CFA) and exploratory factor analysis (EFA) were performed on six NRC instruments ( ③ ④ ⑤ ⑥ ⑦ ⑧ ), with two instruments ( ③ ⑦ ) reporting CFI values of 0.98 and 0.97, respectively. These NRC instruments ( ③ ④ ⑤ ⑥ ⑦ ⑧ ) are capable of reliably capturing the hypothesized theoretical structure and its distinctive features [ 26 ]. In other words, these NRC instruments ( ③ ④ ⑤ ⑥ ⑦ ⑧ ) can adequately represent the theoretical concept of nursing research competence.
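For readers less familiar with the comparative fit index (CFI) cited above, it compares the misfit of the hypothesized model (M) with that of a baseline independence model (B) and is commonly defined as

\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_B - df_B,\ 0)}

so values close to 1 (conventionally above roughly 0.95) indicate that the factor model reproduces the observed item covariances well; the 0.98 and 0.97 reported for instruments ③ and ⑦ fall comfortably in that range. This formula is given only as a general reminder, not as a description of the analyses in the reviewed studies.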

Most studies focused on internal consistency, which reflects the correlation of items in the NRC instrument or subscales. However, some NRC instruments ① ② ④ ⑤ ⑥ and ⑧ did not meet the criterion for sufficient structural validity [ 13 ]. Therefore, reviewers should evaluate the structural validity before assessing internal consistency and provide detailed information in the future. In addition, NRC instrument ③ only reported Cronbach's alpha for the total instrument, whereas in future studies, reliability analysis should be conducted to evaluate Cronbach's alpha for each dimension of NRC instrument ③ . It is worth noting that the Cronbach's alpha values of three subscales in NRC instrument ⑦ were below 0.70 (0.68, 0.68, 0.66, respectively). The value of Cronbach's alpha is influenced by factors such as the number of items, item interrelatedness, and dimensionality [ 27 ]. The low Cronbach's alpha value suggests that heterogeneity exists between some items of the instrument and that these items should be revised or removed. One straightforward method is to calculate the item-total score correlation and eliminate items with low correlations [ 27 , 28 ]. Therefore, additional studies are necessary to enhance and assess the internal consistency of NRC instrument ⑦ . Moreover, the sample sizes of studies assessing the internal consistency of NRC instruments ① and ② were below 100, resulting in downgrading of the quality of evidence on internal consistency. Consequently, a larger target population is required to further evaluate the internal consistency of these two NRC instruments ( ① ② ).
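As a concrete illustration of the two quantities discussed above, the following minimal sketch (with made-up data; the function names are ours) computes Cronbach's alpha and the corrected item-total correlations that can flag candidate items for revision or removal.

import numpy as np

def cronbach_alpha(items):
    # items: (n_respondents, k_items) matrix of item scores
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

def corrected_item_total(items):
    # Correlation of each item with the sum of the remaining items
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

# Hypothetical responses of 5 participants to a 4-item subscale
scores = np.array([[4, 5, 4, 2],
                   [3, 3, 4, 5],
                   [2, 2, 3, 3],
                   [5, 4, 5, 1],
                   [3, 4, 3, 4]])
print(round(cronbach_alpha(scores), 2))
print(np.round(corrected_item_total(scores), 2))  # a low value flags a candidate item for revision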

The prerequisite for widespread use of NRC instruments is to ensure reliability, minimal measurement error, and sensitivity to change. Except for 'The Nurse Research Questionnaire among Nurse Clinicians ② ', all NRC instruments have been evaluated for reliability. However, the ICC values were not clearly documented in the literature, so their reliability cannot be considered satisfactory. Although reliability and measurement error are interrelated measurement properties [ 25 ], no NRC instrument has been evaluated for measurement error. Measurement error refers to the systematic and random error in a respondent's score that is not attributable to true changes in the construct being measured [ 27 ]. The credibility of the results obtained from the NRC instruments evaluated in this study may therefore be compromised by the lack of measurement error assessment, and future research should address this issue by evaluating the measurement error of NRC instruments. Moreover, none of the NRC instruments were tested for responsiveness, which may be attributed to the lack of longitudinal validation, including intervention studies [ 29 ]. Although NRC instruments have been utilized in some intervention studies to evaluate outcomes, the minimal important change (MIC) and the change scores of a stable (sub)group were not calculated [ 13 ]. Future research should include more longitudinal or intervention studies that employ NRC instruments to assess their reliability and responsiveness [ 25 ].
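For context on the ICC values mentioned above, test-retest reliability of instrument scores is often reported as a two-way random-effects, absolute-agreement, single-measurement intraclass correlation coefficient, commonly written ICC(2,1). Under that model,

\mathrm{ICC}(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \tfrac{k}{n}(MS_C - MS_E)}

where MS_R, MS_C, and MS_E are the mean squares for subjects, measurement occasions (or raters), and error from a two-way ANOVA, k is the number of measurements, and n is the number of subjects. This is offered only as a reminder of one common computation, not as the formula used in the reviewed studies.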

Criterion validity of NRC instruments ④ and ⑧ was reported using author-defined gold standards in the respective articles. However, we question the appropriateness of using 'The Anxiety Scale Towards Research and the Attention Scale Towards Scientific Research' and the 'General Self-Efficacy Scale' as gold standards, as they are unlikely to be ideal criterion measures for NRC instruments ④ and ⑧, respectively [13]. According to the COSMIN guideline, the criterion validity reported for NRC instruments ④ and ⑧ is more appropriately considered convergent validity, which concerns hypotheses about the relationship between the target instrument and other relevant measurement instruments [30]. We therefore opted not to evaluate criterion validity for NRC instruments ④ and ⑧ and instead assessed hypotheses testing for construct validity (specifically, convergent validity). The difficulty of identifying a suitable 'gold standard' for NRC may stem from the lack of an objective index of NRC. To address this, we recommend developing objective evaluation indicators for NRC, which could eventually yield a 'gold standard' instrument; a reliable gold standard would in turn aid the development and validation of more user-friendly and efficient NRC instruments.

Hypotheses testing for construct validity concerns the relationships between scores on the instrument of interest and scores on other instruments measuring similar constructs (convergent validity) or dissimilar constructs (discriminant validity), or differences in instrument scores between subgroups of people (known-groups validity) [31]. The study on NRC instrument ④ hypothesized that individuals with high research competence would hold more positive attitudes towards scientific research and experience less research anxiety [16]. Although the study on NRC instrument ⑧ did not state an explicit hypothesis, the observed positive correlation between NRC and general self-efficacy can be used to draw conclusions about its construct validity [13, 17]. Both studies thus formulated hypotheses with expected directions of effect. To accurately represent the underlying theoretical structure of nursing research competence, such hypotheses should also specify the expected magnitude of the correlations or differences [31].
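To illustrate hypotheses testing for convergent validity as defined above, the sketch below checks an a priori hypothesis about the direction and minimum magnitude of a correlation between a hypothetical NRC total score and a related construct, using scipy. The 0.50 threshold and the simulated data are illustrative assumptions, not prescriptions from the reviewed studies.

```python
# Checking a directional, magnitude-specific convergent-validity hypothesis on synthetic data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
nrc_score = rng.normal(60, 10, 150)                       # hypothetical NRC total scores
self_efficacy = 0.5 * nrc_score + rng.normal(0, 8, 150)   # simulated related construct

r, p = pearsonr(nrc_score, self_efficacy)
# A priori hypothesis: positive correlation of at least moderate size.
hypothesis_met = (r > 0) and (abs(r) >= 0.50)
print(f"r = {r:.2f} (p = {p:.3g}); hypothesis supported: {hypothesis_met}")
```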

No studies evaluated the cross-cultural validity of the NRC instruments. Cross-cultural validity refers to the degree to which the performance of items in a translated or culturally adapted instrument adequately reflects the performance of the items in the original version [13, 31]. It is important for ensuring that a measurement instrument measures what it is intended to measure across different target populations [32], and it should be evaluated across different groups, cultures, and languages [13, 31, 33]. We therefore recommend evaluating the cross-cultural validity of NRC instruments across groups such as clinical nurses and nursing students, as well as across cultures and languages, to ensure that they perform consistently in different contexts.

This study recommended all NRC instruments as category B. The lack of specific information on the evaluation of content and construct validity may have influenced this rating; although all NRC instruments can be recommended for use, further studies are needed to confirm their reliability. It is also important to note that COSMIN's stringent 'worst score counts' method bases the rating of each measurement property on the lowest-scoring checklist item, which can result in lower ratings for instruments reported with insufficient information [25, 34, 35]. Interpretability and feasibility were not evaluated in this study, so future research should assess these properties.
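For readers unfamiliar with COSMIN's 'worst score counts' rule mentioned above, the minimal sketch below shows the aggregation logic; the rating labels and example values are hypothetical.

```python
# "Worst score counts": a property's rating is the lowest rating across its checklist items.
ORDER = {"inadequate": 0, "doubtful": 1, "adequate": 2, "very good": 3}

def property_rating(item_ratings):
    # Return the item rating with the lowest rank in ORDER.
    return min(item_ratings, key=lambda rating: ORDER[rating])

print(property_rating(["very good", "adequate", "doubtful"]))  # -> "doubtful"
```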

We also observed that, even though the developers of these NRC instruments specified the target population (nurses or nursing students) and provided clear definitions of NRC, their definitions of NRC for nurses and for nursing students did not differ substantially. Furthermore, only minimal discrepancies exist among the NRC instruments developed for distinct populations (with the exception of attitudes towards nursing research, problem-finding competence, research design competence, and paper-writing competence). Further research should therefore investigate whether NRC needs to be distinguished between different populations (nurses and nursing students). If the two populations do require different conceptualizations of NRC, its precise meaning and scope must be delineated for each group; conversely, if there is no difference, the NRC instruments can be used across populations without restricting the target group.

All the NRC instruments were recommended for use under category B. Considering that the COSMIN guidelines allow category B instruments to be used while further evidence is gathered, and given the current widespread use of NRC instruments, we do not recommend developing new NRC instruments; instead, the existing category B instruments should be optimized as far as possible. For example, the content validity evaluations of NRC instruments ①②③④⑥ did not involve nurses and/or nursing students, which is the main reason these instruments cannot be recommended as category A. Additional interviews with nurses and/or nursing students could therefore be conducted, and the items of NRC instruments ①②③④⑥ compared with the interview results, to establish which instruments have very good content validity. Furthermore, the evaluation results suggest that the Cronbach's alpha values of some dimensions of NRC instruments ⑤⑦⑧ are below 0.70; large-scale studies are needed to validate these dimensions comprehensively. If the Cronbach's alpha values for these dimensions still fail to reach the desired threshold in a large-scale study, revision of the dimensions and items of the existing NRC instruments should be considered. More broadly, future researchers are encouraged to develop new measurement instruments guided by the COSMIN framework. The development process should incorporate qualitative interviews with the target population, focusing on the relevance, comprehensiveness, and comprehensibility that underpin content validity. Subsequent extensive validation of internal consistency in a sizable sample of the target population is pivotal, so that instruments can be categorized as A and recommended for practical use.

Summarizing the usage of all NRC instruments, we found that nurses and nursing students are currently the main focus of research using these instruments and that more than 50% of the studies were cross-sectional. This provides a theoretical basis for nursing researchers to understand the current NRC of nurses and nursing students and to develop targeted intervention plans to improve it. It is worth noting that, although randomized controlled trials and before–after studies have been conducted, few had large sample sizes, and longitudinal evaluation of the effectiveness of NRC interventions for nurses and/or nursing students is lacking. In addition, almost all research was conducted in China, probably because the majority (87.5%) of NRC instruments were first developed by Chinese researchers. In the future, nursing researchers from different countries should therefore refine existing NRC instruments, select the appropriate instrument for their specific context and cultural background, and conduct cross-cultural testing to clarify the NRC of nursing staff in different countries and to provide a basis for formulating intervention measures.

Strengths and limitations

The study has three strengths: (1) it followed the COSMIN guideline and JBI methodology and was reported according to the PRISMA-ScR checklist; (2) it comprehensively searched and retrieved relevant literature from both English and Chinese databases; and (3) it evaluated the methodological quality of the included studies and instruments according to the COSMIN guideline.

Limitations of the study include the exclusion of NRC instruments published in languages other than English and Chinese and the possibility of missing relevant literature not indexed in the selected databases. In addition, the NRC instruments in this scoping review were designed for nurses and/or nursing students rather than patients, so we substituted nurses/nursing students for patients throughout the evaluation. The COSMIN guideline states that it can be used as guidance for reviews of non-PROMs, but it does not specify how to modify steps 5–7 (evaluating content validity; internal structure, i.e., structural validity, internal consistency, and cross-cultural validity; and the remaining measurement properties of reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness) for non-PROMs.

The study recommends NRC instrument ⑦ as the most suitable among existing instruments but calls for further research on the measurement properties of NRC instruments, especially cross-cultural validity, measurement error, and criterion validity. Researchers should also evaluate and report the interpretability and feasibility of NRC instruments and explore the development of more reliable and feasible instruments for different nursing populations based on a unified concept of nursing research competence.

Implications for clinical practice

This study evaluated the measurement properties of NRC instruments and provides recommendations for selecting appropriate instruments. Valid and reliable NRC instruments can accurately evaluate nurses' NRC in clinical settings and provide evidence for intervention plans to improve their competence.

Availability of data and materials

Not applicable.

Chen Q, et al. Research capacity in nursing: a concept analysis based on a scoping review. BMJ Open. 2019;9(11): e032356.


Leung K, Trevena L, Waters D. Systematic review of instruments for measuring nurses’ knowledge, skills and attitudes for evidence-based practice. J Adv Nurs. 2014;70(10):2181–95.


Hu, Y., et al., Research competence of community nurses in Shanghai: A cross-sectional study. J Nurs Manag, 2022.

Segrott J, McIvor M, Green B. Challenges and strategies in developing nursing research capacity: a review of the literature. Int J Nurs Stud. 2006;43(5):637–51.

O’Byrne L, Smith S. Models to enhance research capacity and capability in clinical nurses: A narrative review. J Clin Nurs. 2011;20(9–10):1365–71.

Alqahtani N, et al. Nurses’ evidence-based practice knowledge, attitudes and implementation: A cross-sectional study. J Clin Nurs. 2020;29(1–2):274–83.

Pearson A, Field J, Jordan Z. Evidence-Based Clinical Practice in Nursing and Health Care: Assimilating research, experience and expertise. 2009. https://doi.org/10.1002/9781444316544 .

Chen Q, et al. Instruments for measuring nursing research competence: a protocol for a scoping review. BMJ Open. 2021;11(2):e042325.

Qiu C, et al. Development and psychometric testing of the Research Competency Scale for Nursing Students: An instrument design study. Nurse Educ Today. 2019;79:198–203.

Pan Y, Cheng J. Revise of scientific research ability self-evaluation rating scales of nursing staff. Nurs Res. 2011;25(13):1205–8 (China).


Chen Q, et al. Relationship between critical thinking disposition and research competence among clinical nurses: A cross-sectional study. J Clin Nurs. 2020;29(7–8):1332–40.

Staffileno BA, Carlson E. Providing direct care nurses research and evidence-based practice information: an essential component of nursing leadership. J Nurs Manag. 2010;18(1):84–9.

Prinsen CAC, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147–57.


Liu R. Study on the reliability and validity of nursing staff's scientific research ability self-assessment scale. Chin J Pract Nurs. 2004;(09):8–10. (China)

Pan Y. Revise of scientific research ability self-evaluation rating scales of nursing staff (Master's thesis). Shanxi Medical University. 2011 (China).

Duru P, Örsal Ö. Development of the Scientific Research Competency Scale for nurses. J Res Nurs. 2021;26(7):684–700.

Yin H, Yin A, Zhang X, et al. Development and reliability and validity of the scale for self- evaluating the scientific research ability of nursing staff. Chin J Pract Nurs. 2016;32(08):630–7 (China).

Peters MDJ, et al. Updated methodological guidance for the conduct of scoping reviews. JBI Evid Synth. 2020;18(10):2119–26.

Tricco AC, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169(7):467–73.

Terwee CB, et al. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651–7.

Arthur D, Wong FK. The effects of the “learning by proposing to do” approach on Hong Kong nursing students’ research orientation, attitude toward research, knowledge, and research skill. Nurse Educ Today. 2000;20(8):662–71.


Gething L, et al. Fostering nursing research among nurse clinicians in an Australian area health service. J Contin Educ Nurs. 2001;32(5):228–37.

Wu H, Song C, Dai H, et al. Development and reliability and validity of the scale for evaluating the scientific research ability of nursing staff. Chin J Modern Nurs. 2016;22(10):1367–71 (China).

Chu Y, Cheng J, Han F, et al. The research on self-evaluated of research competence scale. Chin J Med Sci Res Manage. 2013;26(04):285–9 (China).

Paramanandam VS, et al. Self-reported questionnaires for lymphoedema: a systematic review of measurement properties using COSMIN framework. Acta Oncol. 2021;60(3):379–91.

Ong CW, et al. A systematic review and psychometric evaluation of self-report measures for hoarding disorder. J Affect Disord. 2021;290:136–48.

Tian L, Cao X, Feng X. Evaluation of psychometric properties of needs assessment tools in cancer patients: A systematic literature review. PLoS ONE. 2019;14(1):e0210242.

Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ. 2011;2:53–5.

Cheng Q, et al. Needs assessment instruments for family caregivers of cancer patients receiving palliative care: a systematic review. Support Care Cancer. 2022;30(10):8441–53.

Chen W, Peng J, Shen L, et al. Introduction to the COSMIN method: a systematic review of patient-reported outcomes measurement tools. Journal of Nurses Training. 2021;36(8):699–703. https://doi.org/10.16821/j.cnki.hsjx.2021.08.005 (China).

Lee EH, Kang EH, Kang HJ. Evaluation of Studies on the Measurement Properties of Self-Reported Instruments. Asian Nurs Res (Korean Soc Nurs Sci). 2020;14(5):267–76.


Vet HD, et al. Measurement in Medicine: A Practical Guide. Cambridge University Press; 2011. https://doi.org/10.1017/CBO9780511996214 .

Mokkink LB, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27(5):1171–9.

Speyer R, et al. Measurement properties of self-report questionnaires on health-related quality of life and functional health status in dysphonia: a systematic review using the COSMIN taxonomy. Qual Life Res. 2019;28(2):283–96.

Crudgington H, et al. Epilepsy-specific patient-reported outcome measures of children’s health-related quality of life: A systematic review of measurement properties. Epilepsia. 2020;61(2):230–48.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 72104250) and the Natural Science Foundation of Hunan Province (No.2022JJ40642).

Author information

Yuting Xia and Hui Huang contributed equally to this work and share first authorship.

Authors and Affiliations

Xiangya School of Nursing, Central South University, 172 Tongzipo Road, Changsha, 410000, Hunan, China

Yuting Xia, Xirongguli Halili, Siyuan Tang & Qirong Chen

The Third Xiangya Hospital, Central South University, Changsha, China

Xiangya Research Center of Evidence-Based Healthcare, Central South University, Changsha, China

Qirong Chen


Contributions

Study design: YX, HH, QC; literature search: YX, XH; quality appraisal: YX, HH, QC; data extraction: YX, QC; study supervision: QC, ST; manuscript drafting: YX; critical revisions for important intellectual content: YX, QC, HH.

Corresponding author

Correspondence to Qirong Chen .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

The characteristics of eligible NRC instruments. Table S2. The characteristics of study populations involved in the development and validation of eligible NRC instruments. Table S3. An overview of the uses of all the NRC instruments. Table S4. Search strategy for Pubmed. Table S5. Search strategy for Embase. Table S6. Search strategy for Scopus. Table S7. Search strategy for Cochrane. Table S8. Search strategy for CINAHL. Table S9. Search strategy for PsycINFO. Table S10. Search strategy for ERIC. Table S11. Search strategy for ProQuest. Table S12. Search strategy for Wanfang. Table S13. Search strategy for CNKI.


About this article

Cite this article.

Xia, Y., Huang, H., Halili, X. et al. Instruments for measuring nursing research competence: a COSMIN-based scoping review. BMC Nurs 22 , 410 (2023). https://doi.org/10.1186/s12912-023-01572-7


Received : 03 July 2023

Accepted : 20 October 2023

Published : 31 October 2023

DOI : https://doi.org/10.1186/s12912-023-01572-7


  • Instruments
  • Nursing research competence
  • Scoping review
  • COnsensus-based Standards for the selection of health Measurement Instruments



Researching for a literature review: Test Instruments


Test Instruments

Reference material.

Handbook of Tests and Measurement in Education and the Social Sciences


Identifying or using test instruments, such as surveys or questionnaires, may be a necessary part of methodology. The following databases may help identify tests; finding the entire test may require additional steps or costs.

ERIC [Education database]

Keyword search for topic (uncheck Search words in thesaurus if necessary). From the results list, select All Limit Options. On the next screen, under Publication Type, scroll to and select Tests/Questionnaires.


ETS Test Collection

Index of published instruments, including non-commercially available tests.

Click the tab "Find a Test," search for your topic, click on a test that is of interest, and look at the "Availability" field of the test. If one of the following statements is present, the instrument is available in our library (see a reference librarian for assistance, or Ask Us!):

  • Tests in Microfiche (BGSU owns 1975 to present)
  • ERIC Document # (available through the ERIC database)
  • Citation to a journal article (check BGSU Libraries Catalog for journal availability)

Mental Measurements Yearbook with Tests in Print

The Mental Measurements Yearbook contains information about and reviews of English-language standardized tests back to 1978. It is cross-searchable with Tests in Print , which contains information about all currently in-print and commercially available tests, including what they measure, where to find them, and reviews.

Searchable together, Mental Measurements Yearbook with Tests in Print allows users to identify tests and view their availability or order information at the same time.

PsycINFO [Psychology database]

Keyword search for topic and click Show Limit Options, select Tests and Measures.

OR: Keyword search for topic, then review results to see if they provide details about the instrument used to conduct the study.


  • Last Updated: Feb 8, 2024 9:56 AM
  • URL: https://libguides.bgsu.edu/litreview


BMJ Open. 2018;8(7).

Tools to assess the measurement properties of quality of life instruments: a meta-review protocol

Sonia Lorente

1 Department of Psychobiology and Methodology of Health Science, Universitat Autònoma de Barcelona, Bellaterra, Spain

2 Pediatric Area, PNP, Hospital de Terrassa, Consorci Sanitari de Terrassa, Terrassa, Spain

Jaume Vives

Carme Viladrich, Josep-Maria Losilla

Introduction

Using specific tools to assess the measurement properties of health status instruments is recommended both to standardise the review process and to improve the methodological quality of systematic reviews. However, depending on the measurement standards on which these tools are developed, the approach to appraise the measurement properties of instruments may vary. For this reason, the present meta-review aims to: (1) identify systematic reviews assessing the measurement properties of instruments evaluating health-related quality of life (HRQoL); (2) identify the tools applied to assess the measurement properties of HRQoL instruments; (3) describe the characteristics of the tools applied to assess the measurement properties of HRQoL instruments; (4) identify the measurement standards on which these tools were developed or conform to and (5) compare the similarities and differences among the identified measurement standards.

Methods and analysis

A systematic review will be conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols Guidelines. Electronic search will be carried out on bibliographic databases, including PubMed, Cumulative Index to Nursing and Allied Health Literature, Psychological Information, SCOPUS, Web of Science, COSMIN database and ProQuest Dissertations & Theses Global, being limited by time (2008–2018) and language (English). Descriptive analyses of different aspects of tools applied to evaluate the measurement properties of HRQoL instruments will be presented; the different measurement standards will be described and some recommendations about the methodological and research applications will be made.

Ethics and dissemination

Ethical approval is not necessary for systematic review protocols. The results will be disseminated by its publication in a peer-reviewed journal and presented at a relevant conference.

PROSPERO registration number

CRD42017065232

Strengths and limitations of this study

  • The search strategy has been designed to be comprehensive, following the Peer Review of Electronic Search Strategies guidelines and including filters for finding studies on measurement properties of measurement instruments.
  • The systematic review protocol is developed using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses for Protocols guidelines.
  • Inclusion of studies published in English only may lead to language bias.

Systematic reviews of measurement properties critically appraise and compare the content and measurement properties of all instruments measuring a certain construct of interest in a specific study population. 1 High quality systematic reviews can provide a comprehensive overview of the measurement properties of patient-reported outcome measures and support evidence-based recommendations in the selection of the most suitable health status instrument for a given purpose (ie, research or clinical practice). 2 To be confident that the design, conduct, analysis and interpretation of the review results and conclusions are adequate, the methodological quality of systematic reviews should be appraised. 1

Because of this, different authors have evaluated systematic reviews assessing the measurement properties of health status instruments, such as Mokkink et al 1 and Terwee et al. 3 In both cases, the authors examine the search strategy, data extraction (two or more reviewers), data synthesis and whether the measurement properties of health status instruments were assessed using specific tools, which are recommended both to standardise the review process and to improve the methodological quality of systematic reviews. 3 However, depending on the measurement standards on which these tools were developed, the approach to analysing the measurement properties of instruments may vary. Given this, the present meta-review aims to discuss the methodological, research and practical applications of these tools in systematic reviews that assess the measurement properties of instruments evaluating quality of life within the context of health and disease, that is, health-related quality of life (HRQoL) instruments. 4

  • To identify systematic reviews assessing the measurement properties of HRQoL instruments.
  • To identify the main tools applied to assess the measurement properties of HRQoL instruments.
  • To describe the most relevant characteristics of the tools applied to assess the measurement properties of HRQoL instruments (validity, reliability, feasibility, etc).
  • To identify the measurement standards on which these tools were developed or conform to.
  • To compare the similarities and differences among the identified measurement standards.

Study design

Where applicable, the present meta-review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analysis Protocols guidelines. 5

Search strategy

A systematic review will be performed in PubMed, US National Library of Medicine, by National Center for Biotechnology Information; Cumulative Index to Nursing and Allied Health Literature by EBSCOhost; Psychological Information by APA PsycNET; SCOPUS by Elsevier; Web of Science CORE by Thomson Reuters and COSMIN database by COSMIN Initiative ( http://www.cosmin.nl/ ). In addition, ProQuest Dissertations & Theses Global will be used for searching grey literature, and search alerts will be set in all databases. The search strategy will follow the Peer Review of Electronic Search Strategies guidelines recommendations 6 7 and will consist of three filters composed of search terms for: (1) systematic review methodology; (2) HRQoL instruments and (3) measurement properties. The latter filter was developed by the VU University Medical Center for finding studies on measurement properties of measurement instruments. 8 All filters will be adapted for each database. The systematic search will be performed in July 2018, limited by time and language (English) ( table 1 shows the string of terms in PubMed).
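As a rough illustration of how the three filters described above can be combined into a single boolean query, the snippet below joins placeholder filter strings with AND; the terms shown are invented stand-ins, not the PRESS-reviewed search strings reported in table 1.

```python
# Combining three illustrative search filters into one boolean query string.
systematic_review_filter = '("systematic review"[tiab] OR "meta-analysis"[tiab])'
hrqol_filter = '("quality of life"[tiab] OR HRQoL[tiab])'
measurement_properties_filter = '(psychometr*[tiab] OR validity[tiab] OR reliability[tiab])'

query = " AND ".join(
    [systematic_review_filter, hrqol_filter, measurement_properties_filter]
)
print(query)
```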

Search strings for PubMed

Inclusion criteria

We will limit our search to studies published between 2008 and 2018.

Systematic reviews aiming to report or to assess the measurement properties of instruments evaluating the quality of life within the context of health and disease, namely HRQoL instruments, 4 including all studies examining at least two or more measurement properties of a HRQoL instrument. Systematic reviews were required to include the full results report and detailed information about the instruments used to assess the measurement properties.

Setting and participants

We will include the whole range of ages (new borns, toddlers, children, teenagers, young adults, middle age adults and elderly people), in any healthcare setting.

Condition or domain being studied

High-quality health status and quality of life instruments are essential for obtaining accurate diagnoses and for assessing the efficacy or effectiveness of specific healthcare interventions. Evaluating and improving quality of life is also considered a public health priority, 4 which is why the present meta-review focuses on systematic reviews that appraised the measurement properties of HRQoL instruments.

To study the characteristics of tools assessing the measurement properties of HRQoL instruments in systematic reviews and to compare the measurement standards on which these tools were developed or conform to, with examples found in Viladrich and Doval 9 : attributes and criteria to assess health status and quality of life instruments, 10 11 the standards for educational and psychological measurement 12 13 or the health status measures in economic evaluation. 14 15

Primary outcomes

Identification of the main specific tools applied to assess the measurement properties of HRQoL instruments and comparison of their most relevant characteristics. Identification and comparison of the measurement standards on which these tools were developed. Appraisal of how authors of the systematic reviews include the assessment of the quality of the HRQoL instruments in their results and how they use this evaluation to come to an overall conclusion regarding the quality of each instrument.

Instruments

We will include tools aiming to assess the quality of measurement properties of HRQoL instruments.

Study screening

References identified by the search strategy will be entered into Mendeley bibliographic software, and duplicates will be removed. Titles and abstracts will be screened independently by two reviewers. When decisions are unable to be made from title and abstract alone, the full paper will be retrieved. Full text inclusion criteria will be checked independently by two reviewers. Discrepancies during the process will be resolved through discussion (with a third reviewer where necessary).

Data extraction

Extracted information of each selected systematic review and meta-analysis will include: general information (author, year, country of origin and papers, theoretical/conceptual framework); tools applied to assess the measurement properties of HRQoL instruments (title, purpose/use, number of items, response categories, criteria to assess the measurement properties on specific measurement standard, ease and usefulness of interpretation, level of expertise required for scoring and interpreting and time required to completion); reporting of the measurement properties assessed and use of the results from the evaluation of the measurement properties to come to an overall conclusion regarding the quality of each HRQoL instruments. Authors of eligible studies will be contacted to provide missing or additional data if necessary.

Strategy for data analysis

We will initially categorise the tools applied to assess the measurement properties of the HRQoL instruments according to the measurement standards on which they were developed or to which they conform. Next, we will detail the most relevant characteristics of these tools according to their measurement standards and conceptual frameworks.

Strategy for data synthesis

Descriptive analyses of different aspects of the identified tools applied to evaluate the measurement properties of HRQoL instruments will be performed. The extracted information related to these tools will be reported in a table to facilitate their comparison, and recommendations about the methodological, practical and research applications of each tool will be made.

Patient and public involvement

No patient or public involvement.

To date, there are no meta-reviews of tools assessing the measurement properties of HRQoL instruments or of the different measurement standards on which these tools were developed. The findings of this work will be useful, first, to compare the minimum criteria and attributes recommended for assessing the measurement properties of HRQoL instruments; second, to establish the most relevant differences and similarities among both the measurement standards and the tools for assessing measurement properties; and finally, to discuss the methodological, research and practical applications of these tools in systematic reviews. This information will facilitate and improve the work of researchers and clinicians who conduct systematic reviews of the measurement properties of HRQoL instruments.

Supplementary Material

Contributors: All authors meet the criteria recommended by the International Committee of Medical Journal Editors, ICMJE. All authors made substantial contributions to conception and design, piloted the inclusion criteria and provided direction of the data extraction and analysis. SL: drafted the article. JV, CV and J-ML: critically revised the draft for important intellectual content. All authors agreed on the final version.

Funding: This work was supported by the Grant PSI2014-52962-P, Spanish Ministry of Economy and Competitiveness.

Competing interests: None declared.

Patient consent: Not required.

Provenance and peer review: Not commissioned; externally peer reviewed.

  • Open access
  • Published: 21 March 2024

Expert review of the science underlying nature-based climate solutions

  • B. Buma   ORCID: orcid.org/0000-0003-2402-7737 1 , 2   na1 ,
  • D. R. Gordon   ORCID: orcid.org/0000-0001-6398-2345 1 , 3   na1 ,
  • K. M. Kleisner 1 ,
  • A. Bartuska 1 , 4 ,
  • A. Bidlack 5 ,
  • R. DeFries   ORCID: orcid.org/0000-0002-3332-4621 6 ,
  • P. Ellis   ORCID: orcid.org/0000-0001-7933-8298 7 ,
  • P. Friedlingstein   ORCID: orcid.org/0000-0003-3309-4739 8 , 9 ,
  • S. Metzger 10   nAff15   nAff16 ,
  • G. Morgan 11 ,
  • K. Novick   ORCID: orcid.org/0000-0002-8431-0879 12 ,
  • J. N. Sanchirico 13 ,
  • J. R. Collins   ORCID: orcid.org/0000-0002-5705-9682 1 , 14 ,
  • A. J. Eagle   ORCID: orcid.org/0000-0003-0841-2379 1 ,
  • R. Fujita 1 ,
  • E. Holst 1 ,
  • J. M. Lavallee   ORCID: orcid.org/0000-0002-3028-7087 1 ,
  • R. N. Lubowski 1   nAff17 ,
  • C. Melikov 1   nAff18 ,
  • L. A. Moore   ORCID: orcid.org/0000-0003-0239-6080 1   nAff19 ,
  • E. E. Oldfield   ORCID: orcid.org/0000-0002-6181-1267 1 ,
  • J. Paltseva 1   nAff20 ,
  • A. M. Raffeld   ORCID: orcid.org/0000-0002-5036-6460 1 ,
  • N. A. Randazzo 1   nAff21   nAff22 ,
  • C. Schneider 1 ,
  • N. Uludere Aragon 1   nAff23 &
  • S. P. Hamburg 1  

Nature Climate Change ( 2024 ) Cite this article

7120 Accesses

43 Altmetric

Metrics details

  • Climate-change ecology
  • Climate-change mitigation
  • Environmental impact

Viable nature-based climate solutions (NbCS) are needed to achieve climate goals expressed in international agreements like the Paris Accord. Many NbCS pathways have strong scientific foundations and can deliver meaningful climate benefits but effective mitigation is undermined by pathways with less scientific certainty. Here we couple an extensive literature review with an expert elicitation on 43 pathways and find that at present the most used pathways, such as tropical forest conservation, have a solid scientific basis for mitigation. However, the experts suggested that some pathways, many with carbon credit eligibility and market activity, remain uncertain in terms of their climate mitigation efficacy. Sources of uncertainty include incomplete GHG measurement and accounting. We recommend focusing on resolving those uncertainties before broadly scaling implementation of those pathways in quantitative emission or sequestration mitigation plans. If appropriate, those pathways should be supported for their cobenefits, such as biodiversity and food security.


Nature-based climate solutions (NbCS) are conservation, restoration and improved management strategies (pathways) in natural and working ecosystems with the primary motivation to mitigate GHG emissions and remove CO 2 from the atmosphere 1 (similar to ecosystem-based mitigation 2 ). GHG mitigation through ecosystem stewardship is integral to meeting global climate goals, with the greatest benefit coming from near-term maximization of emission reductions, followed by CO 2 removal 3 . Many countries (for example, Indonesia, China and Colombia) use NbCS to demonstrate progress toward national climate commitments.

The scope of NbCS is narrower than that of nature-based solutions (NbS) which include interventions that prioritize non-climate benefits alongside climate (for example, biodiversity, food provisioning and water quality improvement) 4 . In many cases, GHG mitigation is considered a cobenefit that results from NbS actions focused on these other challenges 2 . In contrast, NbCS are broader than natural climate solutions, which are primarily focused on climate mitigation through conservation, restoration and improved land management, generally not moving ecosystems beyond their unmodified structure, function or composition 5 . NbCS may involve moving systems beyond their original function, for example by cultivating macroalgae in water deeper than their natural habitat.

The promise of NbCS has generated a proliferation of interest in using them in GHG mitigation plans 6 , 7 ; 104 of the 168 signatories to the Paris Accord included nature-based actions as part of their mitigation plans 8 . Success in long-term GHG management requires an accurate accounting of inputs and outputs to the atmosphere at scale, so NbCS credits must have robust, comprehensive and transparent scientific underpinnings 9 . Given the urgency of the climate problem, our goal is to identify NbCS pathways with a sufficient scientific foundation to provide broad confidence in their potential GHG mitigation impact, provide resources for confident implementation and identify priority research areas in more uncertain pathways. Evaluating implementation of mitigation projects is beyond our scope; this effort focuses on understanding the underlying science. The purpose is not evaluating any specific carbon crediting protocol or implementation framework but rather the current state of scientific understanding necessary to provide confidence in any NbCS.

In service of this goal, we first investigated nine biomes (boreal forests, coastal marine (salt marsh, mangrove, seagrass and coral reef), freshwater wetlands, grasslands, open ocean (large marine animal and mesopelagic zone biomass, seabed), peatlands, shrublands, temperate forests and tropical forests) and three cultivation types (agroforestry, croplands and macroalgae aquaculture); these were chosen because of their identified potential scale of global impact. In this context, impact is assessed as net GHG mitigation: the CO 2 sequestered or emissions reduced, for example, discounted by understood simultaneous emissions of other GHG (as when N 2 O is released simultaneously with carbon sequestration in cropland soils). From there, we identified 43 NbCS pathways which have been formally implemented (with or without market action) or informally proposed. We estimated the scale of mitigation impact for each pathway on the basis of this literature and, as a proxy measure of NbCS implementation, determined eligibility and activity under existing carbon crediting protocols. Eligibility means that the pathway is addressed by an existing GHG mitigation protocol; market activity means that credits are actively being bought under those eligibility requirements. We considered pathways across a spectrum from protection to improved management to restoration to manipulated systems, but some boundaries were necessary. We excluded primarily abiotically driven pathways (for example, ocean alkalinity enhancement) or where major land use or land-use trade-offs exist (for example, afforestation) 10 , 11 , 12 . Of the 43 pathways, 79% are at present eligible for carbon crediting (sometimes under several methodologies) and at least 65% of those have been implemented (Supplementary Table 1 ). This review was then appraised by 30 independent scholars (at least three per pathway; a complete review synthesis is given in the Supplementary Data ).
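As a back-of-envelope illustration of the net GHG mitigation accounting described above (CO2 removals discounted by simultaneous non-CO2 emissions, as in the cropland soils example), the sketch below converts a hypothetical soil carbon gain and an accompanying N2O flux into net CO2-equivalents. All flux values are invented, and the GWP value is an approximate IPCC AR6 figure used only for illustration.

```python
# Hypothetical net GHG mitigation in tCO2e per hectare per year.
C_TO_CO2 = 44.0 / 12.0        # molecular-weight conversion from carbon to CO2
GWP100_N2O = 273.0            # approximate IPCC AR6 100-year GWP for N2O

soil_c_gain_tC_per_ha = 0.5          # assumed cropland soil carbon gain
n2o_emitted_tN2O_per_ha = 0.002      # assumed simultaneous N2O emission

net_tCO2e_per_ha = (soil_c_gain_tC_per_ha * C_TO_CO2
                    - n2o_emitted_tN2O_per_ha * GWP100_N2O)
print(f"Net mitigation ~ {net_tCO2e_per_ha:.2f} tCO2e per hectare per year")
```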

Consolidation of a broad body of scientific knowledge, with inherent variance, requires expert judgement. We used an expert elicitation process 13 , 14 , 15 with ten experts to place each proposed NbCS pathway into one of three readiness categories following their own assessment of the scientific literature, categorized by general sources of potential uncertainty: category 1, sufficient scientific basis to support a high-quality carbon accounting system or to support the development of such a system today; category 2, a >25% chance that focused research and reasonable funding would support development of high-quality carbon accounting (that is, move to category 1) within 5 years; or category 3, a <25% chance of development of high-quality carbon accounting within 5 years (for example, due to measurement challenges, unconstrained leakage, external factors which constrain viability).

If an expert ranked a pathway as category 2, they were also asked to rank general research needs to resolve: leakage/displacement (spillover to other areas), measuring, reporting and verification (the ability to quantify all salient stocks and fluxes), basic mechanisms of action (fundamental science), durability (ability to predict or compensate for uncertainty in timescale of effectiveness due to disturbances, climate change, human activity or other factors), geographic uncertainty (place-to-place variation), scaling potential (ability to estimate impact) and setting of a baseline (ability to estimate additionality over non-action; a counterfactual). To avoid biasing towards a particular a priori framework for evaluation of the scientific literature, reviewers could use their own framework for evaluating the NbCS literature about potential climate impact and so could choose to ignore or add relevant categorizations as well. Any pathway in category 1 would not need fundamental research for implementation; research gaps were considered too extensive for useful guidance on reducing uncertainty in category 3 pathways. Estimates of the global scale of likely potential impact (PgCO 2 e yr −1 ) and cobenefits were also collected from expert elicitors. See Methods and Supplementary Information for the survey instrument.
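The sketch below illustrates, on synthetic responses rather than the study data, the kind of per-pathway aggregation used to summarize the elicitation: a mean readiness category and a 20th–80th percentile spread across ten experts. The pathway names and response values are placeholders.

```python
# Aggregating hypothetical expert readiness categorizations (1-3) per pathway.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
pathways = ["tropical forest conservation", "seagrass restoration", "biochar amendment"]
responses = pd.DataFrame({p: rng.integers(1, 4, size=10) for p in pathways})  # 10 experts

summary = pd.DataFrame({
    "mean_category": responses.mean(),
    "p20": responses.quantile(0.20),
    "p80": responses.quantile(0.80),
}).round(2)
print(summary)
```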

Four pathways with the highest current carbon market activity and high mitigation potential (tropical and temperate forest conservation and reforestation; Table 1 and Supplementary Data ), were consistently rated as high-confidence pathways in the expert elicitation survey. Other NbCS pathways, especially in the forestry sector, were rated relatively strongly by the experts for both confidence in scientific basis and scale of potential impact, with some spread across the experts (upper right quadrant, Fig. 1 ). Conversely, 13 pathways were consistently marked by experts as currently highly uncertain/low confidence (median score across experts: 2.5–3.0) and placed in category 3 (for example, cropland microbial amendments and coral reef restoration; Supplementary Tables 1 and 2 ). For the full review, including crediting protocols currently used, literature estimates of scale and details of sub-pathways, see Supplementary Data .

Fig. 1

Pathways in the upper right quadrant have both high confidence in the scientific foundations and the largest potential scale of global impact; pathways in the lower left have the lowest confidence in our present scientific body of knowledge and an estimated smaller potential scale of impact. Designations of carbon credit eligibility under existing protocols and market activity at the present time are noted. Grassland enhanced mineral weathering (EMW) is not shown (mean category rating 2.9) as no scale of impact was estimated. See Supplementary Table 1 for specific pathway data. Bars represent 20th to 80th percentiles of individual estimates, if there was variability in estimates. A small amount of random noise was added to avoid overlap.

The experts assessed 26 pathways as having average confidence scores between 1.5 and 2.4, suggesting the potential for near-term resolution of uncertainties. This categorization arose from either consensus amongst experts on the uncertain potential (for example, boreal forest reforestation consistently rated category 2, with primary concerns about durability) or because experts disagreed, with some ranking category 1 and others category 3 (for example, pasture management). We note that where expert disagreement exists (seen as the spread of responses in Fig. 1 and Supplementary Table 1 ; also see Data availability for link to original data), this suggests caution against overconfidence in statements about these pathways. These results also suggest that confidence may be increased by targeted research on the identified sources of uncertainty (Supplementary Table 3 ).

Sources of uncertainty

Durability and baseline-setting were rated as high sources of uncertainty across all pathways ranked as category 2 by the experts (mean ratings of 3.6 and 3.4 out of 5, respectively; Supplementary Table 3 ). Understanding of mechanisms and geographic spread had the lowest uncertainty ratings (2.1 and 2.3, respectively), showing confidence in the basic science. Different subsets of pathways had different prioritizations, however, suggesting different research needs: forest-centric pathways were most uncertain in their durability and additionality (3.8 and 3.4, respectively), suggesting concerns about long-term climate and disturbance trajectories. Agricultural and grassland systems, however, had higher uncertainty in measurement methods and additionality (3.9 and 3.5, respectively). Although there were concerns about durability from some experts (for example, due to sea-level rise), some coastal blue carbon pathways such as mangrove restoration (mean category ranking: 1.7; 20th to 80th percentile 1.0–2.0) have higher confidence than others (for example, seagrass restoration: mean category ranking 2.8; 20th to 80th percentile 2.6–3.0), which are relatively poorly constrained in terms of net radiative forcing potential despite a potentially large carbon impact (seagrass median: 1.60 PgCO 2 e yr −1 ; see Supplementary Data for more scientific literature estimates).

Scale of impact

For those pathways with lower categorization by the expert elicitation (category 2 or 3) at the present time, scale of global impact is a potential heuristic for prioritizing further research. High variability, often two orders of magnitude, was evident in the mean estimated potential PgCO 2 e yr −1 impacts for the different pathways (Fig. 1 and Supplementary Table 2 ) and the review of the literature found even larger ranges produced by individual studies (Supplementary Data ). A probable cause of this wide range was different constraints on the estimated potential, with some studies focusing on potential maximum impact and others on more constrained realizable impacts. Only avoided loss of tropical forest and cropland biochar amendment were consistently estimated as having the likely potential to mitigate >2 PgCO 2 e yr −1 , although biochar was considered more uncertain by experts due to other factors germane to its overall viability as a climate solution, averaging a categorization of 2.2. The next four highest potential impact pathways, ranging from 1.6 to 1.7 PgCO 2 e yr −1 , spanned the spectrum from high readiness (temperate forest restoration) to moderate (cropland conversion from annual to perennial vegetation and grassland restoration) to low (seagrass restoration, with main uncertainties around scale of potential impact and durability).

There was high variability in the elicitors’ estimated potential scale of impact, even in pathways with strong support, such as tropical forest avoided loss (20th to 80th percentile confidence interval: 1–8 PgCO 2 e yr −1 ), again emphasizing the importance of consistent definitions and constraints on how NbCS are measured, evaluated and then used in broad-scale climate change mitigation planning and budgeting. Generally, as pathway readiness decreased (moving from category 1 to 3), the elicitors’ estimates of GHG mitigation potential also decreased (Supplementary Fig. 1 ). Note that individual studies from the scientific literature may have higher or lower estimates (Supplementary Data ).

Expert elicitation meta-analyses suggest that 6–12 responses are sufficient for a robust and stable quantification of responses 15 . We tested that assumption via a Monte Carlo-based sensitivity assessment in which further samples were randomly drawn from the observed distribution of responses. Readiness categorizations by the ten experts were robust to this test: the mean difference between the original and the bootstrapped data was 0.02 (s.d. = 0.05), with a mean absolute difference of 0.06 (s.d. = 0.06). The maximum difference in readiness categorization means across all pathways was 0.20 (s.d. = 0.20) (Supplementary Table 2 ). The full dataset of responses is available online (see ʻData availabilityʼ).
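The following is a minimal sketch, under stated assumptions, of the bootstrap-style sensitivity check described above: resample the ten observed categorizations for a pathway with replacement many times and compare the resampled means with the original mean. The ratings shown are hypothetical, and this is not the authors' exact Monte Carlo procedure.

```python
# Bootstrap-style sensitivity check on hypothetical readiness ratings for one pathway.
import numpy as np

rng = np.random.default_rng(5)
observed = np.array([1, 1, 2, 2, 2, 2, 3, 2, 1, 2])   # hypothetical ratings from ten experts

boot_means = np.array([
    rng.choice(observed, size=observed.size, replace=True).mean()
    for _ in range(10_000)
])
diff = boot_means.mean() - observed.mean()
print(f"original mean = {observed.mean():.2f}, "
      f"mean bootstrap difference = {diff:+.3f} (sd = {boot_means.std(ddof=1):.3f})")
```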

These results highlight opportunities to accelerate implementation of NbCS in well-supported pathways and identify critical research needs in others (Fig. 1 ). We suggest focusing future efforts on resolving identified uncertainties for pathways at the intersection between moderate average readiness (for example, mean categorizations between ~1.5 and 2.0) and high potential impact (for example, median >0.5 PgCO 2 e yr −1 ; Supplementary Table 1 ): agroforestry, improved tropical and temperate forest management, tropical and boreal peatlands avoided loss and peatland restoration. Many, although not all, experts identified durability and baseline/additionality as key concerns to resolve in those systems; research explicitly targeted at those specific uncertainties (Supplementary Table 3 ) could rapidly improve confidence in those pathways.

We recommend a secondary research focus on the lower ranked (mean category 2.0 to 3.0) pathways with estimated potential impacts >1 PgCO 2 e yr −1 (Supplementary Fig. 2 ). For these pathways, explicit, quantitative incorporation into broad-scale GHG management plans will require further focus on systems-level carbon/GHG understandings to inspire confidence at all stages of action and/or identifying locations likely to support durable GHG mitigation, for example ref. 16 . Examples of this group include avoided loss and degradation of boreal forests (for example, fire, pests and pathogens and albedo 16 ) and effective mesopelagic fishery management, which some individual studies estimate would avoid future reductions of the currently sequestered 1.5–2.0 PgC yr −1 (refs. 17 , 18 ). These pathways may turn out to have higher or lower potential than the expert review suggests, on the basis of individual studies (Supplementary Data ) but strong support will require further, independent verification of that potential.

We note that category 3 rankings by expert elicitation do not necessarily imply non-viability but simply that much more research is needed to confidently incorporate actions into quantitative GHG mitigation plans. We found an unsurprising trend of lower readiness categorization with lower pathway familiarity (Supplementary Fig. 3 ). This correlation may result from two, non-exclusive potential causes: (1) lower elicitor expertise in some pathways (inevitable, although the panel was explicitly chosen for global perspectives, connections and diverse specialties) and (2) an actual lack of scientific evidence in the literature, which leads to that self-reported lack of familiarity, a common finding in the literature review (Supplementary Data ). Both explanations suggest a need to better consolidate, develop and disseminate the science in each pathway for global utility and recognition.

Our focus on GHG-related benefits in no way diminishes the substantial conservation, environmental and social cobenefits of these pathways (Supplementary Table 4 ), which often exceed their perceived climate benefits 1 , 19 , 20 , 21 . Where experts found climate impacts to remain highly uncertain but other NbS benefits are clear (for example, biodiversity and water quality; Supplementary Table 4 ), other incentives or financing mechanisms independent of carbon crediting should be pursued. While the goals here directly relate to using NbCS as a reliably quantifiable part of global climate action planning and thus strong GHG-related scientific foundations, non-climate NbS projects may provide climate benefits that are less well constrained (and thus less useful from a GHG budgeting standpoint) but also valuable. Potential trade-offs, if any, between ecosystem services and management actions, such as biodiversity and positive GHG outcomes, should be explored to ensure the best realization of desired goals 2 .

Finally, our focus in this study was on broad-scale NbCS potential in quantitative mitigation planning because of the principal and necessary role of NbCS in overall global warming targets. We recognize the range of project conditions that may increase, or decrease, the rigour of any pathway outside the global-scale focus here. We did not specifically evaluate the large and increasing number of crediting concepts (by pathway: Supplementary Data ), focusing rather on the underlying scientific body of knowledge within those pathways. Some broad pathways may have better defined sub-pathways within them, with a smaller potential scale of impact but potentially lower uncertainty (for example, macroalgae harvest cycling). Poorly enacted NbCS actions and/or crediting methodologies at project scales may result in loss of benefits even from high-ranking pathways 22 , 23 , 24 and attention to implementation should be paramount. Conversely, strong, careful project-scale methodologies may make lower readiness pathways beneficial for a given site.

Viable NbCS are vital to global climate change mitigation but NbCS pathways that lack strong scientific underpinnings threaten global accounting by potentially overestimating future climate benefits and eroding public trust in rigorous natural solutions. Both the review of the scientific literature and the expert elicitation survey identified high potential ready-to-implement pathways (for example, tropical reforestation), reinforcing present use of NbCS in planning.

However, uncertainty remains about the quantifiable GHG mitigation of some active and nascent NbCS pathways. On the basis of the expert elicitation survey and review of the scientific literature, we are concerned that large-scale implementation of less scientifically well-founded NbCS pathways in mitigation plans may undermine net GHG budget planning; those pathways require more study, and life-cycle analyses that integrate system-level emissions into calculated totals, before they can be confidently promoted at broad scales. The expert elicitation judgements suggest a precautionary approach to scaling lower confidence pathways until the scientific foundations are strengthened, especially for NbCS pathways with insufficient measurement and monitoring 10,24,25 or poorly understood or measured net GHG mitigation potentials 16,26,27,28. While the need to implement more NbCS pathways for reducing GHG emissions and removing carbon from the atmosphere is urgent, advancing the implementation of poorly quantified pathways (in relation to their GHG mitigation efficacy) could give the false impression that they can balance ongoing fossil emissions, thereby undermining overall support for more viable NbCS pathways. Explicitly targeting research to resolve these uncertainties in the baseline science could greatly bolster confidence in the less-established NbCS pathways, benefiting efforts to reduce GHG concentrations 29.

The results of this study should inform both market-based mechanisms and non-market approaches to NbCS pathway management. Research and action that elucidates and advances pathways to ensure a solid scientific basis will provide confidence in the foundation for successfully implementing NbCS as a core component of global GHG management.

NbCS pathway selection

We synthesized scientific publications for nine biomes (boreal forests, coastal blue carbon, freshwater wetlands, grasslands, open ocean blue carbon, peatlands, shrublands, temperate forests and tropical forests) and three cultivation types (agroforestry, croplands and macroalgae aquaculture) (hereafter, systems) and the different pathways through which they may be able to remove carbon or reduce GHG emissions. Shrublands and grasslands were considered as independent ecosystems; nonetheless, we acknowledge that there is overlap in the numbers presented here because shrublands are often included with grasslands 5 , 30 , 31 , 32 , 33 .

The 12 systems were chosen because they have each been identified as having potential for emissions reductions or carbon removal at globally relevant scales. Within these systems, we identified 43 pathways that either have formally established carbon credit protocols or protocols informally proposed for review (non-carbon-associated credits were not evaluated). We obtained data on carbon crediting protocols from international, national and regional organizations and registries, such as Verra, American Carbon Registry, Climate Action Reserve, Gold Standard, Clean Development Mechanism, FAO and Nori. We also obtained data from the Voluntary Registry Offsets Database developed by the Berkeley Carbon Trading Project and Carbon Direct company 34. While we found evidence of additional Chinese carbon crediting protocols, we were not able to review these because of limited publicly available information. To maintain clarity and avoid misrepresentation, we used the language as written in each protocol. A full list of the organizations and registries for each system can be found in the Supplementary Data.

Literature searches and synthesis

We reviewed scientific literature and reviews (for example, IPCC special reports) to identify studies reporting data on carbon stocks, GHG dynamics and sequestration potential of each system. Peer-reviewed studies and meta-analyses were identified on Scopus, Web of Science and Google Scholar using simple queries combining the specific practice or pathway names or synonyms (for example, no-tillage, soil amendments, reduced stocking rates, improved forest management, avoided forest conversion and degradation, avoided mangrove conversion and degradation) and the following search terms: ‘carbon storage’, ‘carbon stocks’, ‘carbon sequestration’, ‘carbon sequestration potential’, ‘additional carbon storage’, ‘carbon dynamics’, ‘areal extent’ or ‘global’.
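As a concrete illustration of this search strategy, the sketch below builds example boolean query strings in R (the language used for the study's analysis). The pathway and carbon terms are taken from the text above, but the exact query syntax submitted to Scopus, Web of Science and Google Scholar is an assumption, not the authors' published search strings.

```r
# Illustrative only: combine pathway names with the carbon/extent search terms
# listed above into simple boolean queries. The authors' exact query strings
# are not published, so treat this as a sketch.
pathway_terms <- c("no-tillage", "soil amendments", "improved forest management",
                   "avoided mangrove conversion and degradation")
carbon_terms  <- c("carbon storage", "carbon stocks", "carbon sequestration",
                   "carbon sequestration potential", "additional carbon storage",
                   "carbon dynamics", "areal extent", "global")

queries <- vapply(pathway_terms, function(p) {
  paste0('"', p, '" AND ("', paste(carbon_terms, collapse = '" OR "'), '")')
}, character(1))

queries[1]  # inspect the first generated query string
```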

The full literature review was conducted between January and October 2021. We solicited an independent, external review of the syntheses (from at least three external reviewers per natural or working system; see p. 2 of the Supplementary Data) as a second check against missing key papers or misinterpretation of data. This review was largely completed in March 2022, and data from additional relevant citations were added through October 2022 as they were discovered. For a complete list of all literature cited, see pp. 217–249 of the Supplementary Data.

Candidate papers were included if their results/data could be applied to the following central questions:

How much carbon is stored (globally) at present in the system (total and on average per hectare) and what is the confidence?

At the global level, is the system a carbon source or sink at this time? What is the business-as-usual projection for its carbon dynamics?

Is it possible, through active management, to either increase net carbon sequestration in the system or prevent carbon emissions from that system? (Note that other GHG emissions and forcings were included here as well.)

What is the range of estimates for how much extra carbon could be sequestered globally?

How much confidence do we have in present methods to detect any net increase in carbon sequestration in a system, or any net change in the areal extent of that system?

From each paper, quantitative estimates for the above questions were extracted for each pathway, including any descriptive information/metadata necessary to understand the estimate. In addition, information on sample size, sampling scheme, geographic coverage, timeline of study, timeline of projections (if applicable) and specific study contexts (for example, wind-break agroforestry) was recorded.

We also tracked where the literature identified trade-offs between carbon sequestered or CO2 emissions reduced and emissions of other GHGs (for example, N2O or methane), relevant to questions three and five above. For example, wetland restoration can increase CO2 uptake from the atmosphere but can also increase methane and N2O emissions. Experts were asked to consider the uncertainty in assessing net GHG mitigation as they categorized the NbCS pathways.

Inclusion of each pathway in mitigation protocols and the specific carbon registries involved were also identified. These results are reported (grouped or individually as appropriate) in the Supplementary Data, organized by the central questions and including textual information for interpretation. The data and protocol summaries for each of the 12 systems were each reviewed by at least three scientists and revised accordingly.

These summaries were provided to the expert elicitation group as optional background information.

Unit conversions

Since this synthesis draws on literature from several sources that use different methods and units, all carbon measurements were standardized to the International System of Units (SI). When referring to total stocks for each system, numbers are reported in SI units of elemental carbon (that is, PgC). When referring to mitigation potential, elemental carbon was converted to CO2 by multiplying by 3.67, the rounded molar mass ratio of CO2 to C. Differences in methodology, such as soil sampling depth, make it difficult to standardize across studies; where applicable, the specific measurement used to develop each stock estimate is reported.
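For clarity, a minimal sketch of this conversion convention in R; the function name is ours, and 3.67 is the rounded molar mass ratio of CO2 to C (44.01/12.01).

```r
# Convert a stock or flux expressed as elemental carbon (PgC) to CO2 equivalents
# (PgCO2e), using the rounded molar mass ratio described above.
pgc_to_pgco2e <- function(pg_c, ratio = 3.67) pg_c * ratio

pgc_to_pgco2e(c(1.5, 2.0))  # 1.5-2.0 PgC yr-1 corresponds to ~5.5-7.3 PgCO2e yr-1
```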

Expert elicitation process

To assess the conclusions from the initial review process described above, we conducted an expert elicitation survey to consolidate and add further, independent assessments to the original literature review. The survey design followed best practice recommendations 14, with a focus on participant selection, explicit definition of uncertainty, minimization of cognitive and overconfidence biases and clarity of focus. Research on expert elicitation suggests that 6–12 responses are sufficient for a stable quantification of responses 15. We identified >40 potential experts via a broad survey of leading academics, science-oriented NGOs and government agency publications and products. These individuals have published on several NbCS pathways or could represent larger research efforts spanning the NbCS under consideration. Careful attention was paid to the gender and sectoral breakdown of respondents to ensure equitable representation. Of the invitees, ten completed the full elicitation effort. Experts were offered compensation for their time.

Implementation of the expert elicitation process followed the IDEA protocol 15. Briefly, after a short introductory interview, the survey was sent to the participants. Results were anonymized and standardized (methods below), and a meeting was held with the entire group to discuss the initial results and calibrate understanding of the questions. The purpose of this meeting was not to develop consensus on a singular answer but to ensure that all questions were being considered in the same way (for example, clarifying any potentially confusing language and discussing any questions that emerged as part of the process). The experts then revisited their initial rankings to provide final, anonymous rankings, which were compiled in the same way. These final rankings, which may be the same as or differ from the initial rankings (the latter were discarded), are the results presented here.

Survey questions

The expert elicitation survey comprised five questions for each pathway. The data were collected via Google Forms and collated anonymously at the level of pathways, with each respondent contributing one datapoint per pathway. The experts also reported their familiarity (or the familiarity of the organization whose work they were representing) with each pathway and the other cobenefits of the pathways.

The initial question asked experts to rank each NbCS pathway by category, from one to three.

Category 1 was defined as a pathway with sufficient scientific knowledge to support a high-quality carbon accounting system today (for example, one meeting the scientific criteria identified by the WWF-EDF-Oeko Institut and ICAO TAB) or to support the development of such a system today. The intended interpretation is that sufficient science is available for quantifying and verifying net GHG mitigation. Note that experts were not required to reference any given 'high-quality' crediting framework; frameworks were provided only as examples. In other words, the evaluation was not intended to rank a given framework (for example, ref. 35) but rather expert confidence in the fundamental scientific understanding that underpins the potential for carbon accounting overall. To this end, no categorization of uncertainty was required (reviewers could skip categorizations they felt were not necessary) and space was available for individual reviewers to add new categories (if they felt a category was missing or needed). Uncertainties at the category 1 level are deemed 'acceptable': they do not preclude accounting now, although more research may further substantiate high-quality credits.

Category 2 pathways have a good chance (>25%) that, with more research and within the next 5 years, the pathway could be developed into a high-quality pathway for carbon accounting and as a nature-based climate solution pathway. For these pathways, further understanding is needed of factors such as baseline processes, long-term stability, unconstrained fluxes and possible leakage before labelling as category 1, but the expert is confident that this information can be developed, in 5 years or less, with more work. The >25% chance threshold and 5-year timeframe were determined a priori to identify pathways that experts judged as having the potential to contribute to the Paris Accord 2030 goal. Other thresholds (for example, longer timeframes) could have been chosen, which would affect the relative distribution of pathways between categories 2 and 3 (for example, a longer allowed timeframe could move some pathways from category 3 into category 2 for some reviewers). We emphasize that a category 3 ranking does not necessarily denote a non-valuable approach, only that the required research will take longer than the timeframe set here.

Category 3 responses denoted pathways that the expert thought had little chance (<25%) that, with more research and within the next 5 years, the pathway could be developed into a suitable pathway for management as a natural solutions pathway, either because present evidence already suggests that GHG reduction is unlikely to be viable, because co-emissions or other biophysical feedbacks may offset those gains, or because understanding of key factors is lacking and unlikely to be developed within the next 5 years. Notably, the last case does not mean that the NbCS pathway is not valid or viable in the long term, simply that physical and biological understanding is probably not established enough to enable scientifically rigorous and valid NbCS activity in the near term.

The second question asked the experts to identify research gaps associated with those that they ranked as category 2 pathways to determine focal areas for further research. The experts were asked to rank concerns about durability (ability to predict or compensate for uncertainty in timescale of effectiveness due to disturbances, climate change, human activity or other factors), geographic uncertainty (place-to-place variation), leakage or displacement (spillover of activities to other areas), measuring, reporting and verification (MRV, referring to the ability to quantify all salient stocks and fluxes to fully assess climate impacts), basic mechanisms of action (fundamental science), scaling potential (ability to estimate potential growth) and setting of a baseline (ability to reasonably quantify additionality over non-action, a counterfactual). Respondents could also enter a different category if desired. For complete definitions of these categories, see the survey instrument ( Supplementary Information ). This question was not asked if the expert ranked the pathway as category 1, as those were deemed acceptable, or for category 3, respecting the substantial uncertainty in that rating. Note that responses were individual and so the same NbCS pathway could receive (for example) several individual category 1 rankings, which would indicate reasonable confidence from those experts, and several category 2 rankings from others, which would indicate that those reviewers have lingering concerns about the scientific basis, along with their rankings of the remaining key uncertainties in those pathways. These are important considerations, as they reflect the diversity of opinions and research priorities; individual responses are publicly available (anonymized: https://doi.org/10.5281/zenodo.7859146 ).

The third question involved quantification of the potential for moving from category 2 to 1 explicitly. Following ref. 14, the respondents first reported the lowest plausible value for the likelihood of movement (representing the lower end of a 95% confidence interval), then the highest plausible value and then their best guess for the median/most likely probability. They were also asked for the odds that their chosen interval contained the true value, which was used to scale responses to standard 80% credible intervals and limit overconfidence bias 13,15. This question was not asked if the expert ranked the pathway as category 3, respecting the substantial uncertainty in that rating.
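The exact rescaling formula is not reproduced here, so the sketch below shows one common linear extrapolation used with this four-step elicitation format in the IDEA-protocol literature (for example, following refs. 13–15). It applies equally to the third and fourth questions; the function and argument names are ours and should be read as an assumption about the transformation rather than the authors' code.

```r
# Rescale an elicited interval with a self-reported confidence level to a
# standard 80% credible interval by linear extrapolation around the best guess.
rescale_interval <- function(lower, upper, best, stated_conf, target_conf = 0.80) {
  stretch <- target_conf / stated_conf   # >1 widens the interval, <1 narrows it
  c(lower = best - (best - lower) * stretch,
    best  = best,
    upper = best + (upper - best) * stretch)
}

# Example: an expert reports 10-60% (best guess 30%) and is 90% sure the true
# value lies in that range; the interval is narrowed slightly to an 80% CI.
rescale_interval(lower = 0.10, upper = 0.60, best = 0.30, stated_conf = 0.90)
```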

The fourth question involved the scale of potential impact from the NbCS, given the range of uncertainties associated with effectiveness, area of applicability and other factors. The question followed the same pattern as the third, first asking about lowest, then highest, then best estimate for potential scale of impact (in PgCO2e yr−1). Experts were again asked to express their confidence in their own range, which was used to scale to a standard 80% credible interval. This estimate represents a consolidation of the best-available science by the reviewers. For a complete review including individual studies and their respective findings, see the Supplementary Data. This question was not asked if the expert ranked the pathway as category 3, respecting the substantial uncertainty in that rating.

Final results

After collection of the final survey responses, results were anonymized and compiled by pathway. For overall visualization and discussion purposes, responses were combined into a mean and a 20th to 80th percentile range. The strength of the expert elicitation process lies in the collection of several independent assessments; the different responses represent real differences in how experts interpret and synthesize the data, which can have meaningful impacts on decision-making by different individuals and organizations (for example, those that are more optimistic or pessimistic about any given pathway). Therefore, individual anonymous responses were retained by pathway to show the diversity of responses for any given pathway. The experts surveyed, despite their broad range of expertise, ranked themselves as less familiar with category 3 pathways than with category 1 or 2 pathways (linear regression, P < 0.001, F(2, 394) = 59.6); this could reflect a lack of appropriate experts (although all principal fields were represented) or simply limited data in those areas.
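A minimal sketch of this summarization, assuming a long-format table of responses with hypothetical column names (pathway, expert, category, familiarity); synthetic data stand in here for the real anonymized responses archived on Zenodo.

```r
library(dplyr)

# Synthetic stand-in for the anonymized responses (one row per expert x pathway);
# the column names are assumptions, not the published data structure.
set.seed(1)
responses <- expand.grid(pathway = paste0("pathway_", 1:43), expert = 1:10) %>%
  mutate(category    = sample(1:3, n(), replace = TRUE),
         familiarity = sample(1:5, n(), replace = TRUE))

# Mean and 20th-80th percentile range of category rankings, by pathway
summary_by_pathway <- responses %>%
  group_by(pathway) %>%
  summarise(mean_category = mean(category),
            p20 = quantile(category, 0.20),
            p80 = quantile(category, 0.80),
            .groups = "drop")

# Familiarity versus assigned category; treating category as a factor gives an
# F test on 2 and n - 3 degrees of freedom, analogous to the test reported above.
anova(lm(familiarity ~ factor(category), data = responses))
```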

Sensitivity

To check for robustness against sample size variation, we conducted a Monte Carlo sensitivity analysis of the data on each pathway, generating responses for a further ten hypothetical experts. Briefly, the extra samples were randomly drawn from the observed category ranking mean and standard deviation for each individual pathway and appended to the original list; values <1 or >3 were truncated to those bounds. This analysis resulted in only minor differences in the mean categorization across all pathways: the mean difference between the original and the bootstrapped data was 0.02 (s.d. = 0.05), with a mean absolute difference of 0.06 (s.d. = 0.06). The maximum difference in means across all pathways was 0.20 (s.d. = 0.20) (Supplementary Table 2). These results suggest that the response values are stable to additional responses.
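A sketch of this sensitivity check for a single pathway, with our own function name and example values; the published Zenodo code is the authoritative implementation.

```r
# Append ten hypothetical expert responses drawn from the observed mean and s.d.
# of a pathway's category rankings, truncating draws to the valid range [1, 3].
set.seed(42)
augment_pathway <- function(observed, n_extra = 10) {
  extra <- rnorm(n_extra, mean = mean(observed), sd = sd(observed))
  extra <- pmin(pmax(extra, 1), 3)
  c(observed, extra)
}

observed  <- c(1, 2, 2, 3, 2, 1, 2, 2, 3, 2)  # example rankings for one pathway
augmented <- augment_pathway(observed)
mean(augmented) - mean(observed)              # shift in mean categorization
```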

All processing was done in R 36 , with packages including fmsb 37 and forcats 38 .

Data availability

Anonymized expert elicitation responses are available on Zenodo 39 : https://doi.org/10.5281/zenodo.7859146 .

Code availability

R code for the analysis is available on Zenodo 39: https://doi.org/10.5281/zenodo.7859146.

Novick, K. A. et al. Informing nature‐based climate solutions for the United States with the best‐available science. Glob. Change Biol. 28 , 3778–3794 (2022).

Cohen-Shacham, E., Walters, G., Janzen, C. & Maginnis, S. (eds) Nature-based Solutions to Address Global Societal Challenges (IUCN, 2016).

IPCC Climate Change 2021: The Physical Science Basis (eds Masson-Delmotte, V. et al.) (Cambridge Univ. Press, 2021).

Seddon, N. et al. Understanding the value and limits of nature-based solutions to climate change and other global challenges. Philos. Trans. R. Soc. B 375 , 20190120 (2020).

Griscom, B. W. et al. Natural climate solutions. Proc. Natl Acad. Sci. USA 114 , 11645–11650 (2017).

Blaufelder, C., Levy, C., Mannion, P. & Pinner, D. A. Blueprint for Scaling Voluntary Carbon Markets to Meet the Climate Challenge (McKinsey & Company, 2021).

Arcusa, S. & Sprenkle-Hyppolite, S. Snapshot of the carbon dioxide removal certification and standards ecosystem (2021–2022). Clim. Policy 22 , 1319–1332 (2022).

Seddon, N. et al. Global recognition of the importance of nature-based solutions to the impacts of climate change. Glob. Sustain. 3, e15 (2020).

Anderegg, W. R. Gambling with the climate: how risky of a bet are natural climate solutions? AGU Adv. 2 , e2021AV000490 (2021).

Gattuso, J. P. et al. Ocean solutions to address climate change and its effects on marine ecosystems. Front. Mar. Sci. 5, 337 (2018).

Bach, L. T., Gill, S. J., Rickaby, R. E., Gore, S. & Renforth, P. CO2 removal with enhanced weathering and ocean alkalinity enhancement: potential risks and co-benefits for marine pelagic ecosystems. Front. Clim. 1, 7 (2019).

Doelman, J. C. et al. Afforestation for climate change mitigation: potentials, risks and trade‐offs. Glob. Change Biol. 26 , 1576–1591 (2019).

Speirs-Bridge, A. et al. Reducing overconfidence in the interval judgments of experts. Risk Anal. 30 , 512–523 (2010).

Morgan, M. G. Use (and abuse) of expert elicitation in support of decision making for public policy. Proc. Natl Acad. Sci. USA 111 , 7176–7184 (2014).

Hemming, V., Burgman, M. A., Hanea, A. M., McBride, M. F. & Wintle, B. C. A practical guide to structured expert elicitation using the IDEA protocol. Methods Ecol. Evol. 9 , 169–180 (2018).

Anderegg, W. R. et al. Climate-driven risks to the climate mitigation potential of forests. Science 368 , eaaz7005 (2020).

Boyd, P. W., Claustre, H., Levy, M., Siegel, D. A. & Weber, T. Multi-faceted particle pumps drive carbon sequestration in the ocean. Nature 568 , 327–335 (2019).

Saba, G. K. et al. Toward a better understanding of fish-based contribution to ocean carbon flux. Limnol. Oceanogr. 66 , 1639–1664 (2021).

Seddon, N., Turner, B., Berry, P., Chausson, A. & Girardin, C. A. Grounding nature-based climate solutions in sound biodiversity science. Nat. Clim. Change 9 , 84–87 (2019).

Soto-Navarro, C. et al. Mapping co-benefits for carbon storage and biodiversity to inform conservation policy and action. Philos. Trans. R. Soc. B 375 , 20190128 (2020).

Schulte, I., Eggers, J., Nielsen, J. Ø. & Fuss, S. What influences the implementation of natural climate solutions? A systematic map and review of the evidence. Environ. Res. Lett. 17, 013002 (2022).

West, T. A., Börner, J., Sills, E. O. & Kontoleon, A. Overstated carbon emission reductions from voluntary REDD+ projects in the Brazilian Amazon. Proc. Natl Acad. Sci. USA 117 , 24188–24194 (2020).

Di Sacco, A. et al. Ten golden rules for reforestation to optimize carbon sequestration, biodiversity recovery and livelihood benefits. Glob. Change Biol. 27 , 1328–1348 (2021).

López-Vallejo, M. in Towards an Emissions Trading System in Mexico: Rationale, Design and Connections with the Global Climate Agenda (ed. Lucatello, S.) 191–221 (Springer, 2022).

Oldfield, E. E. et al. Realizing the potential of agricultural soil carbon sequestration requires more effective accounting. Science 375 , 1222–1225 (2022).

Burkholz, C., Garcias-Bonet, N. & Duarte, C. M. Warming enhances carbon dioxide and methane fluxes from Red Sea seagrass ( Halophila stipulacea ) sediments. Biogeosciences 17 , 1717–1730 (2020).

Guenet, B. et al. Can N2O emissions offset the benefits from soil organic carbon storage? Glob. Change Biol. 27, 237–256 (2021).

Rosentreter, J. A., Al‐Haj, A. N., Fulweiler, R. W. & Williamson, P. Methane and nitrous oxide emissions complicate coastal blue carbon assessments. Glob. Biogeochem. Cycles 35, e2020GB006858 (2021).

Schwartzman, S. et al. Environmental integrity of emissions reductions depends on scale and systemic changes, not sector of origin. Environ. Res. Lett. 16, 091001 (2021).

Crop and Livestock Products Database (FAO, 2022); https://www.fao.org/faostat/en/#data/QCL

Fargione, J. E. et al. Natural climate solutions for the United States. Sci. Adv. 4 , eaat1869 (2018).

Meyer, S. E. Is climate change mitigation the best use of desert shrublands? Nat. Resour. Environ. Issues 17 , 2 (2011).

Lorenz, K. & Lal, R. Carbon Sequestration in Agricultural Ecosystems (Springer Cham, 2018).

Haya, B., So, I. & Elias, M. The Voluntary Registry Offsets Database (Univ. California, 2021); https://gspp.berkeley.edu/faculty-and-impact/centers/cepp/projects/berkeley-carbon-trading-project/offsets-database

Core Carbon Principles; CCP Attributes; Assessment Framework for Programs; and Assessment Procedure (ICVCM, 2023); https://icvcm.org/the-core-carbon-principles/

R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022).

Nakazawa, M. fmsb: Functions for medical statistics book with some demographic data. R package version 0.7.4 https://CRAN.R-project.org/package=fmsb (2022).

Wickham, H. forcats: Tools for working with categorical variables (factors). R package version 0.5.2 https://CRAN.R-project.org/package=forcats (2022).

Buma, B. Nature-based climate solutions: expert elicitation data and analysis code. Zenodo https://doi.org/10.5281/zenodo.7859146 (2023).

Acknowledgements

This research was supported through gifts to the Environmental Defense Fund from the Bezos Earth Fund, King Philanthropies and Arcadia, a charitable fund of L. Rausing and P. Baldwin. We thank J. Rudek for help assembling the review and 30 experts who reviewed some or all of those data and protocol summaries (Supplementary Data ). S.M. was supported by a cooperative agreement between the National Science Foundation and Battelle that sponsors the National Ecological Observatory Network programme.

Author information

Present address: Department of Atmospheric and Oceanic Sciences, University of Wisconsin-Madison, Madison, WI, USA

Present address: AtmoFacts, Longmont, CO, USA

R. N. Lubowski

Present address: Lombard Odier Investment Managers, New York, NY, USA

Present address: Ecological Carbon Offset Partners LLC, dba EP Carbon, Minneapolis, MN, USA

L. A. Moore

Present address: , San Francisco, CA, USA

J. Paltseva

Present address: ART, Arlington, VA, USA

N. A. Randazzo

Present address: NASA/GSFC, Greenbelt, MD, USA

Present address: University of Maryland, College Park, MD, USA

N. Uludere Aragon

Present address: Numerical Terradynamic Simulation Group, University of Montana, Missoula, MT, USA

These authors contributed equally: B. Buma, D. R. Gordon.

Authors and Affiliations

Environmental Defense Fund, New York, NY, USA

B. Buma, D. R. Gordon, K. M. Kleisner, A. Bartuska, J. R. Collins, A. J. Eagle, R. Fujita, E. Holst, J. M. Lavallee, R. N. Lubowski, C. Melikov, L. A. Moore, E. E. Oldfield, J. Paltseva, A. M. Raffeld, N. A. Randazzo, C. Schneider, N. Uludere Aragon & S. P. Hamburg

Department of Integrative Biology, University of Colorado, Denver, CO, USA

Department of Biology, University of Florida, Gainesville, FL, USA

D. R. Gordon

Resources for the Future, Washington, DC, USA

A. Bartuska

International Arctic Research Center, University of Alaska, Fairbanks, AK, USA

Department of Ecology Evolution and Environmental Biology and the Climate School, Columbia University, New York, NY, USA

The Nature Conservancy, Arlington, VA, USA

Faculty of Environment, Science and Economy, University of Exeter, Exeter, UK

P. Friedlingstein

Laboratoire de Météorologie Dynamique/Institut Pierre-Simon Laplace, CNRS, Ecole Normale Supérieure/Université PSL, Sorbonne Université, Ecole Polytechnique, Palaiseau, France

National Ecological Observatory Network, Battelle, Boulder, CO, USA

Department of Engineering and Public Policy, Carnegie Mellon University, Pittsburgh, PA, USA

O’Neill School of Public and Environmental Affairs, Indiana University, Bloomington, IN, USA

Department of Environmental Science and Policy, University of California, Davis, CA, USA

J. N. Sanchirico

Department of Marine Chemistry & Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, MA, USA

J. R. Collins

Contributions

D.R.G. and B.B. conceived of and executed the study design. D.R.G., K.M.K., J.R.C., A.J.E., R.F., E.H., J.M.L., R.N.L., C.M., L.A.M., E.E.O., J.P., A.M.R., N.A.R., C.S. and N.U.A. coordinated and conducted the literature review. G.M. and B.B. primarily designed the survey. A. Bartuska, A. Bidlack, B.B., J.N.S., K.N., P.E., P.F., R.D. and S.M. contributed to the elicitation. B.B. conducted the analysis and coding. S.P.H. coordinated funding. B.B. and D.R.G. were primary writers; all authors were invited to contribute to the initial drafting.

Corresponding author

Correspondence to B. Buma .

Ethics declarations

Competing interests.

The authors declare no competing interests. In the interest of full transparency, we note that while B.B., D.R.G., K.M.K., A.B., J.R.C., A.J.E., R.F., E.H., J.M.L., R.N.L., C.M., L.A.M., E.E.O., J.P., A.M.R., N.A.R., C.S., N.U.A., S.P.H. and P.E. are employed by organizations that have taken positions on specific NbCS frameworks or carbon crediting pathways (not the focus of this work), none have financial or other competing interest in any of the pathways and all relied on independent science in their contributions to the work.

Peer review

Peer review information.

Nature Climate Change thanks Camila Donatti, Connor Nolan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary Tables 1–4, Figs. 1–3 and survey instrument.

Supplementary Data

Literature review and list of reviewers.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Buma, B., Gordon, D.R., Kleisner, K.M. et al. Expert review of the science underlying nature-based climate solutions. Nat. Clim. Chang. (2024). https://doi.org/10.1038/s41558-024-01960-0

Received : 24 April 2023

Accepted : 20 February 2024

Published : 21 March 2024

DOI : https://doi.org/10.1038/s41558-024-01960-0
