
  • Perspective
  • Open access
  • Published: 14 April 2020

Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs)

  • Jennifer C. Goldsack 1 ,
  • Andrea Coravos   ORCID: orcid.org/0000-0001-5379-3540 1 , 2 , 3 ,
  • Jessie P. Bakker 1 , 4 ,
  • Brinnae Bent 5 ,
  • Ariel V. Dowling   ORCID: orcid.org/0000-0002-7889-4978 6 ,
  • Cheryl Fitzer-Attas 7 ,
  • Alan Godfrey   ORCID: orcid.org/0000-0003-4049-9291 8 ,
  • Job G. Godino 9 ,
  • Ninad Gujar   ORCID: orcid.org/0000-0001-7901-308X 10 , 11 ,
  • Elena Izmailova 1 , 12 ,
  • Christine Manta 1 , 2 ,
  • Barry Peterson 13 ,
  • Benjamin Vandendriessche   ORCID: orcid.org/0000-0003-0672-0327 14 , 15 ,
  • William A. Wood 16 ,
  • Ke Will Wang 5 &
  • Jessilyn Dunn   ORCID: orcid.org/0000-0002-3241-8183 5 , 17  

npj Digital Medicine volume 3, Article number: 55 (2020)


Subjects: Research data, Scientific community

Digital medicine is an interdisciplinary field, drawing together stakeholders with expertize in engineering, manufacturing, clinical science, data science, biostatistics, regulatory science, ethics, patient advocacy, and healthcare policy, to name a few. Although this diversity is undoubtedly valuable, it can lead to confusion regarding terminology and best practices. There are many instances, as we detail in this paper, where a single term is used by different groups to mean different things, as well as cases where multiple terms are used to describe essentially the same concept. Our intent is to clarify core terminology and best practices for the evaluation of Biometric Monitoring Technologies (BioMeTs), without unnecessarily introducing new terms. We focus on the evaluation of BioMeTs as fit-for-purpose for use in clinical trials. However, our intent is for this framework to be instructional to all users of digital measurement tools, regardless of setting or intended use. We propose and describe a three-component framework intended to provide a foundational evaluation framework for BioMeTs. This framework includes (1) verification, (2) analytical validation, and (3) clinical validation. We aim for this common vocabulary to enable more effective communication and collaboration, generate a common and meaningful evidence base for BioMeTs, and improve the accessibility of the digital medicine field.


Introduction

Digital medicine describes a field concerned with the use of technologies as tools for measurement and intervention in the service of human health. Digital medicine products are driven by high-quality hardware, firmware, and software that support the practice of medicine broadly, including treatment, intervention, and disease prevention, as well as health monitoring and promotion for individuals and across populations 1 .

Isolated silos of knowledge exist within the engineering, technology, data science, regulatory, and clinical communities that are critical to the development and appropriate deployment of digital medicine products. Currently, terminology, approaches, and evidentiary standards are not aligned across these communities, slowing the advancement of digital medicine for improved health, healthcare, and health economics. Consensus approaches are needed to evaluate the quality of digital medicine products, including their clinical utility, cybersecurity risks, user experience, and data rights and governance for ‘digital specimen’ collection 2 .

In this work, we refer to a specific type of digital medicine product that we call Biometric Monitoring Technologies, or BioMeTs. BioMeTs are connected digital medicine products that process data captured by mobile sensors using algorithms to generate measures of behavioral and/or physiological function. This includes novel measures and indices of characteristics for which we may not yet understand the underlying biological processes. BioMeTs, like other digital medicine products, should be characterized by a body of evidence to support their quality, safety, and effectiveness 3 . However, the rapid rise in the development of and demand for BioMeTs to support the practice of medicine has left in its wake a knowledge gap regarding how to develop and evaluate this body of evidence systematically 4 . If not addressed, there is potential for misinterpretation of data resulting in misleading clinical trials and possibly patient harm.

What are the necessary steps to determine whether a metric derived from a BioMeT is trustworthy, and by extension, whether that BioMeT is fit-for-purpose? We begin by exploring and adapting applicable concepts from other standards in related fields. Digital medicine is an interdisciplinary and rapidly evolving field. The Biomarkers, EndpointS, and other Tools (B.E.S.T) framework emphasizes that “effective, unambiguous communication is essential for efficient translation of promising scientific discoveries into approved medical products” 5 . Siloed and non-standardized practices will slow down innovation and impede collaboration across domains.

In this manuscript, we develop an evaluation framework for BioMeTs intended for healthcare applications. This framework includes verification, analytical validation, and clinical validation (V3). We propose definitions intended to bridge disciplinary divides and describe how these processes provide foundational evidence demonstrating the quality and clinical utility of BioMeTs as digital medicine products.

Language matters and should be used intentionally

Establishing a common language to describe evaluation standards for BioMeTs is critical to streamline trustworthy product development and regulatory oversight. In this paper, we avoid using the term “device” because we anticipate that there is a potential regulatory context for the V3 framework. We want to avoid confounding the V3 terminology with existing FDA Terms of Art (e.g., “device”). Instead, we intentionally discuss digital medicine products, and specifically BioMeTs. We refer the reader to Coravos et al. for more background on regulatory considerations 3 . In addition, in this manuscript we use the term “algorithm” to describe a range of data manipulation processes embedded in firmware and software, including but not limited to signal processing, data compression and decompression, artificial intelligence, and machine learning.

We also avoid using the term “feasibility study.” These studies can be designed to evaluate any of a number of performance questions, so “feasibility study” in isolation is a meaningless term. We use the term “gold standard” in quotations because it often refers to entrenched and commonly used measurement standards that are considered sub-optimal. “Gold standards” should be considered as nothing more than the best available measurement per consensus, against which the accuracy of other measurements of similar purposes may be judged 6 .

In this paper, we use the term “data supply chain” to describe data flow and data provenance for information generated from hardware, sensors, software, and algorithms.

Two terms, verification and validation, have been used for decades to describe critical components of successful quality management systems. The ISO 9000 family of quality management system standards, first published in 1987, has specific standards and definitions related to design verification and validation 7 . These ISO 9000 standards are generic and can be applied to any type of organization; as such, many industries have adapted these standards to their specific needs. For example, ISO 13485 specifies quality management system requirements related to design verification and validation for organizations that provide medical devices and related services 8 .

In the most basic sense, a BioMeT combines software and hardware for medical or health applications. The software, hardware, and regulatory parent industries have long histories of verification and validation as part of their quality management systems. Software and hardware verification and validation are guided by the IEEE Standard for System, Software, and Hardware Verification and Validation (IEEE 1012-2016), which lays out specific requirements that must be met in order to comply with the standard 9 . The FDA also describes verification and validation processes required for software and hardware products that are submitted for their approval 10 , 11 .

Traditional validation for software and hardware products confirms that the end product accurately measures what it claims to measure. However, BioMeT-derived measures from digital tools must also be clinically useful to a defined population. As such, we have split validation into analytical validation and clinical validation, similar to the framework used in the development of wet biomarkers and described in the BEST (Biomarkers, EndpointS, and other Tools) resource developed by the FDA-NIH Biomarkers working group 5 .

The three-component V3 framework is novel and intentionally combines well established practices from both software and clinical development. The definitions for V3 were derived from guidance documents and from historical and current frameworks ranging from 2002 to 2018. Each document referenced focuses on the particular audience for its associated organization(s), including system developers and suppliers, pharmaceutical industry sponsors, and regulators (Table 1 ). The contexts of the definitions provided for V3 vary greatly, highlighting that language and processes are often generated and used within disciplinary silos. Although some commonalities exist, the comparisons are confusing at best (Supplementary Table 1 ). These communities also lack a standard language to describe the data supply chain for information generated from the hardware, sensors, software, and algorithms.

Given (1) the historical context for the terms verification and validation in software and hardware standards, regulations, and guidances, and (2) the separated concepts of analytical and clinical validation in wet biomarkers development, this paper seeks to adapt existing terminology and evaluation frameworks for use in BioMeTs. In this new era of digital medicine, we suggest a broad interdisciplinary approach and a common lexicon containing consensus definitions across disciplines for these important terms.

Moving from current siloed practices to one universal best practice

Evaluation of BioMeTs should be a multi-step process that includes relevant expertize at each stage, as well as interdisciplinary collaboration throughout. We propose V3, a three-component framework for the evaluation of BioMeTs in digital medicine (Fig. 1 ):

Verification of BioMeTs entails a systematic evaluation by hardware manufacturers. At this step, sample-level sensor outputs are evaluated. This stage occurs computationally in silico and at the bench in vitro.

Analytical validation occurs at the intersection of engineering and clinical expertize. This step translates the evaluation procedure for BioMeTs from the bench to in vivo. Data processing algorithms that convert sample-level sensor measurements into physiological metrics are evaluated. This step is usually performed by the entity that created the algorithm, either the vendor or the clinical trial sponsor.

Clinical validation is typically performed by a clinical trial sponsor to facilitate the development of a new medical product 12 . The goal of clinical validation is to demonstrate that the BioMeT acceptably identifies, measures, or predicts the clinical, biological, physical, functional state, or experience in the defined context of use (which includes the definition of the population). This step is generally performed on cohorts of patients with and without the phenotype of interest.

Figure 1: The stages of V3 for a BioMeT.

V3 must be conducted as part of a comprehensive BioMeT evaluation. However, although V3 processes are foundational, they are not the only evaluation steps. The concept we propose here is analogous to FDA’s Bioanalytical Method Validation Guidance for Industry 13 , which captures key elements necessary for successful validation of pharmacokinetic and wet laboratory biomarkers in the context of drug development clinical trials, although there are some fundamental differences due to the nature of the data collection tools and methods.

Clinical utility, which evaluates whether using the BioMeT will lead to improved health outcomes or provide useful information about diagnosis, treatment, management, or prevention of a disease, is also necessary to determine fit-for-purpose 5 . To evaluate the clinical utility of a digital tool, the range of potential benefits and risks to individuals and populations must be considered, along with the relevance and usefulness of the digital product to individuals (e.g., adherence to using the technology, user experience, and battery life). Clinical utility is typically evaluated by a process of usability and user experience testing. A BioMeT may perform well under V3, but is useless if it cannot be used appropriately by the target population in the anticipated setting. However, usability and user experience are outside the scope of the proposed V3 framework. Other criteria, such as cost, accessibility, compatibility, burden and ease of use, failure rates, and manufacturers’ terms of use and/or customer service, are also critical to determining fit-for-purpose. These are described in more detail by the Clinical Trials Transformation Initiative (CTTI) 14 .

How does V3 for BioMeTs fit within the current regulatory landscape?

In the United States, regulators evaluate the claim(s) a manufacturer makes for a product, rather than the product’s capabilities. In other words, a product may be categorized as a regulated “device” or “non-device” purely through a change in the manufacturer’s description of the product with no change to its functionality (e.g., no change to the hardware, firmware, or software).

The setting in which a BioMeT is used can also shift the regulatory framework. For instance, a wearable used in a clinical trial to support a drug application (e.g., to digitally collect an endpoint like heart rate) would not necessarily be considered a “device”. However, the exact same product sold in the post-market setting with a claim to diagnose a condition like atrial fibrillation would be a device under the current paradigm.

Recognizing recent shifts in the technology landscape, the US Congress passed the 21st Century Cures Act (Cures Act) 15 , which was signed into law on 13 December 2016 and amended the definition of “device” in the Food, Drug and Cosmetic Act to include software-based products. As a result, the FDA has been generating new guidance documents, updating policies, and considering better approaches to regulate software-driven products 16 . One novel approach has been to decouple the system into separate hardware and software components. For instance, the International Medical Device Regulators Forum defined ‘Software as a Medical Device (SaMD)’ as software that performs independently of medical device hardware and that is intended to be used for medical purposes 17 . Importantly, this regulatory construct means that software (including algorithms) that lacks a hardware component can be considered a “device” and thus be regulated by the FDA. For example, in 2018 two mobile applications that use either electrocardiogram (ECG) or photoplethysmography data to generate “Irregular Rhythm Notifications” were granted De Novo clearance by the FDA 18 , 19 .

Verification

The verification process evaluates the capture and transference of a sensor-generated signal into collected data. Verification demonstrates that a sensor technology meets a set of design specifications, ensuring that (A) the sensors it contains are capturing analog data appropriately, and (B) the firmware that modifies the captured data is generating appropriate output data. In lay terms, the process of verification protects against the risk of ‘garbage in, garbage out’ when making digital measurements of behavioral or physiologic functions. BioMeTs include sensors that sample a physical construct; for example, acceleration, voltage, capacitance, or light. Verification is a bench evaluation that demonstrates that sensor technologies are capturing data with a minimum defined accuracy and precision when compared against a ground-truth reference standard, consistently over time (intra-sensor comparison) and uniformly across multiple sensors (inter-sensor comparison). The choice of reference standard depends on the physical construct captured. For example, verification of an accelerometer would involve placing the sensor on a shaking bench with known acceleration, and using these data to calculate accuracy, precision, consistency, and uniformity. In all of these processes, the evaluation criteria and thresholds should be defined prior to initiating the evaluation tests in order to determine whether the pre-specified acceptance criteria have been met.
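
As an illustration of this bench comparison, the sketch below checks simulated sample-level accelerometer output against a known shaker-bench reference profile and evaluates it against pre-specified acceptance criteria. The function name, thresholds, and simulated data are hypothetical and are not drawn from this paper; they simply show the shape of such a verification check.

```python
# Minimal sketch of a bench verification check for an accelerometer, assuming a
# shaker bench that produces a known reference acceleration profile. Thresholds,
# names, and data are illustrative only.
import numpy as np

def verify_accelerometer(sensor_g, reference_g,
                         max_mean_error_g=0.02, max_std_error_g=0.01):
    """Compare sample-level sensor output (in g) against the bench reference."""
    error = sensor_g - reference_g
    mean_abs_error = float(np.mean(np.abs(error)))   # accuracy
    std_error = float(np.std(error))                 # precision
    return {
        "mean_abs_error_g": mean_abs_error,
        "std_error_g": std_error,
        # Acceptance criteria are pre-specified before any testing begins
        "passes": mean_abs_error <= max_mean_error_g and std_error <= max_std_error_g,
    }

# One simulated bench run at 50 Hz; repeating this over time and across units would
# support the intra- and inter-sensor comparisons described above.
reference = np.sin(np.linspace(0, 10 * np.pi, 500))        # known bench profile (g)
sensor = reference + np.random.normal(0, 0.005, size=500)  # simulated sensor output
print(verify_accelerometer(sensor, reference))
```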

The data supply chain

All digital measurements reported by BioMeTs are derived through a data supply chain, which includes hardware, firmware, and software components. For example, the accelerometer is a basic micro-electro-mechanical system frequently found in BioMeTs. Mechanical motion of a damped mass or cantilever in the accelerometer generates physical displacement information that can be translated through a series of data manipulations into a daily step count metric (Fig. 2 ; Supplementary Table 2 ). Each of these steps along the data supply chain has to be verified before the resulting measurement can be validated in a given population under specified conditions.

Figure 2: Acceleration results in physical motion of the equivalent of a spring and proof mass, which in turn results in changes of electrical properties that can be captured by electrical property sensors. Electrical signals are then converted from analog to digital signals and stored and transmitted via the microprocessor on a wristband or mobile device. Through Bluetooth Low Energy (BLE), data are then processed and compressed multiple times for transmission and storage through mobile devices or cloud storage. This figure summarizes the steps of data collection and manipulation into a daily step count metric and illustrates that “raw” data could refer to different stages of the data collection and manipulation process and have different meanings. For more details of the data types and technologies involved in each step, please refer to Supplementary Table 2. Two arrows are highlighted with asterisks, signifying steps in the data supply chain where the “raw data dilemma” usually occurs. What we define as “sample-level data” are the primary and processed digital signals marked by asterisks.

The term “raw data” is often used to describe data existing in an early stage of the data supply chain. Because the data supply chains vary across BioMeTs, the definition of “raw” is often inconsistent across different technologies. Here, we define the term sample-level data as a construct that holds clear and consistent meaning across all BioMeTs. All sensors output data at the sample level (for example, a 50 Hz accelerometer signal or a 250 Hz ECG signal); these data are sometimes accessible to all users and sometimes only accessible to the sensor manufacturers. We refer to this sensor output as sample-level data and recommend that these data be reported in International System of Units (SI). Although signal processing methods may have been applied to these data (e.g., downsampling, filtering, interpolation, smoothing), the data are still considered “raw” because they are a direct representation of the original analog signal produced by the sensor. These are the data that must undergo verification. Unfortunately, these sample-level data are often inaccessible to third parties using those technologies. This may be owing to limitations on storage space or battery life during transmission of high frequency data, or it may be due to the risk of a third party reverse-engineering proprietary algorithms developed by the BioMeT manufacturer. In these situations, only the BioMeT manufacturer can complete verification of the sample-level data.
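
To make the distinction concrete, the sketch below walks one hop along a hypothetical data supply chain: verified 50 Hz sample-level accelerometer data are filtered and peak-detected to produce a derived step count, the kind of physiologically meaningful metric that is the subject of analytical validation. The filter band, peak thresholds, and function names are assumptions for illustration, not a published algorithm.

```python
# Illustrative sketch of sample-level data (50 Hz accelerometer, SI-traceable units)
# being converted by an algorithm into a derived metric (step count).
# Filter settings and thresholds are hypothetical.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FS = 50  # sample-level data rate in Hz

def steps_from_samples(accel_xyz_g):
    """accel_xyz_g: (n_samples, 3) array of sample-level acceleration in g."""
    magnitude = np.linalg.norm(accel_xyz_g, axis=1) - 1.0      # remove the gravity offset
    b, a = butter(2, [0.5, 3.0], btype="bandpass", fs=FS)      # typical walking cadence band
    filtered = filtfilt(b, a, magnitude)
    peaks, _ = find_peaks(filtered, height=0.05, distance=FS // 3)
    return len(peaks)                                          # derived metric: step count

# Purely synthetic 10 s "walk" at roughly 2 steps per second
t = np.arange(0, 10, 1 / FS)
accel = np.column_stack([0.1 * np.sin(2 * np.pi * 2 * t),
                         np.zeros_like(t),
                         1.0 + 0.2 * np.sin(2 * np.pi * 2 * t)])
print(steps_from_samples(accel))  # expect roughly 20 detected steps
```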

In summary, verification occurs at the bench prior to validation of the BioMeT in human subjects. Verified sample-level data generated from the sensor technology become the input data for algorithms that process those data into physiologically meaningful metrics (described further in analytical validation, below). Therefore, verification serves as a critical quality control step in the data supply chain to ensure that the sample-level data meet pre-specified acceptance criteria before the data are used further.

Table 2 summarizes the process of verification.

How can we reconcile the process of verifying sensor technologies in digital medicine with approaches more familiar to other disciplines?

In both engineering and medicine, the goal of verification is to document that a specific device performs to intended specifications, but the details of the process vary with the sensor technology 20 . Verification itself is not defined by a fixed standard applied across all tools—rather, it is a declaration of performance with respect to a pre-specified performance goal. That performance goal is usually established by the manufacturer based on the intended use of the technology or by community standards for more common technologies, and can be precisely defined in terms that are easily testable. For example, an accelerometer’s intended performance circumstances may include the range of accelerations for which the accuracy exceeds 95% as well as the environmental and contextual conditions (e.g., temperature, humidity, battery level) for which the technology’s performance remains within that accuracy threshold. BioMeT community verification standards are typically covered by the IEC 60601 series of technical standards for the safety and essential performance of medical electrical equipment 21 . The series consists of collateral (IEC 60601-1-X) and particular (IEC 60601-2-X) standards. The latter define verification requirements for specific sensor signals. For instance, IEC 60601-2-26 specifies verification requirements for amplifier and signal quality properties for electroencephalography (EEG) sensors. IEC 60601-2-40 specifies similar criteria for electromyography sensors, IEC 60601-2-25 for ECG sensors, and IEC 60601-2-47 even focuses on requirements for ambulatory ECG sensors. Beyond these biopotential signals, specific standards do not exist for other commonly used sensor signals in BioMeTs (e.g., inertial, bioimpedance, and optical), leaving the definition of the verification criteria up to the manufacturer and regulatory authorities.

One challenge with establishing standard performance metrics is that performance requirements can vary by use case, and therefore the same technology performance may be sufficient for one scenario but not for another. For example, heart rate accuracy is critical for detection of atrial fibrillation in high-risk patients, but is less critical for longitudinal resting heart rate monitoring in healthy young athletes. The verification process, therefore, must include the intended use for designating appropriate thresholding criteria.

Verification serves as the initial step in a process in which data collected from further studies using the sensor technology are used to continue development of rational standards for use, uncover any unexpected sources of error, and optimize performance of BioMeTs.

Who is responsible for verification?

Verification of BioMeTs is generally performed by the manufacturer through bench-top testing. Verification tests require access to the individual hardware components and the firmware used to process the sample-level data, both of which may be proprietary; as such, in some cases it may be impractical to expect anyone other than the technology manufacturer to complete verification. Indeed, many clinical investigators utilizing the technology will not have the resources or expertize required to perform such evaluations. However, it is likely the clinical investigators who will need to define the parameters of verification that would allow a determination of whether the sensor is, indeed, fit for a particular purpose.

Technology manufacturers should provide researchers and clinical users of their tools with timely and detailed verification documentation that is easily understandable to non-technologists. This documentation should be similar to the data sheets provided for hardware components, such as individual sensors that comprise the BioMeT. The documentation of BioMeTs should include three sections: performance specifications for the integrated hardware, output data specifications, and software system tests.

Performance specifications for the integrated hardware will mimic the performance specifications for individual hardware components but the testing must be completed on the full hardware system in situ. As an example, take a simple step counting BioMeT consisting of an accelerometer sensor and associated hardware to display the current daily step count on a small screen. Verification tests for integrated hardware performance specifications could include power management (expected battery life under a variety of conditions), fatigue testing (expected lifespan of the hardware under typical and extreme use), and/or electrical conductance (expected electrical current through the BioMeT).

Output data specifications should describe the accuracy of the sample-level data produced by the BioMeT’s sensors that will be used as input to the processing algorithms to produce the processed data. These verification tests usually consist of bench-top tests. These tests are necessary even if sample-level data are passed directly to the algorithms because, at a minimum, an analog to digital conversion of the sensor data may occur within the BioMeT. In the previous example of a simple step counting BioMeT, there is only one algorithm output metric: step counts. The sample-level data that are used as an input into that algorithm are the measurements that come from the on-board accelerometer as measured in SI units. The output data specifications should detail the accuracy of the accelerometer data in each axis (e.g., ± 0.02 g) as determined through bench-top testing of the full system, not just the accelerometer sensor.

Software system tests should indicate that the entire system, including the software that generates the sample-level data, is functioning as intended, even under unusual circumstances of use. The results of the system tests do not need to be described in exhaustive detail in the documentation; instead, a high-level description of the software system tests should be included for general knowledge. For the step counter, this could include testing to ensure that the current step count is always displayed on the screen and is incremented within 1 s of a step being detected. An unusual situation would be to test what happens when the number of steps is so great that the size of the displayed digits exceeds the size of the screen (e.g., 100,000 steps per day or more). Other system tests could include what happens when the software detects an error within the system, such as a sensor malfunction.
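
A hedged sketch of what such software system tests might look like for this hypothetical step counter is given below, using Python's unittest module. The StepCounterSystem class, its display cap, and its error flag are stand-ins invented for illustration; they are not part of any real product's API.

```python
# Minimal sketch of software system tests for the hypothetical step-counter BioMeT
# described above. The StepCounterSystem class is an invented stand-in for the
# integrated software under test; the display-cap behaviour is an assumption.
import unittest

class StepCounterSystem:
    """Stand-in for the integrated step-counter software under test."""
    def __init__(self):
        self.steps = 0
        self.display = "0"
        self.error_flag = False

    def register_step(self):
        self.steps += 1
        # The display must always fit the screen; here we cap at five digits
        self.display = str(self.steps) if self.steps < 100_000 else "99999+"

    def sensor_malfunction(self):
        self.error_flag = True

class SystemTests(unittest.TestCase):
    def test_display_increments_with_each_step(self):
        system = StepCounterSystem()
        system.register_step()
        self.assertEqual(system.display, "1")

    def test_display_handles_very_large_counts(self):
        system = StepCounterSystem()
        for _ in range(100_000):
            system.register_step()
        self.assertEqual(system.display, "99999+")  # screen overflow handled gracefully

    def test_error_flag_set_on_sensor_malfunction(self):
        system = StepCounterSystem()
        system.sensor_malfunction()
        self.assertTrue(system.error_flag)

if __name__ == "__main__":
    unittest.main()
```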

Overall, the verification documentation for a BioMeT should give the clinical user enough information to use the BioMeT exactly as it was designed.

What is the regulatory oversight of verification?

Regulation of verification testing in medical devices is currently overseen by the FDA in the US and the various Notified Bodies that conduct conformity assessments for CE marking in the EU 22 . These entities require specific verification testing before a medical device can receive clearance or approval. However, many BioMeTs are not required to go through the regulatory clearance/approval process, so independent verification standards for BioMeTs need to be developed.

There is a need for “verification standards” for BioMeTs that parallel the quality standards used to evaluate components of pharmaceuticals. In drug development, the United States Pharmacopeia 23 is a non-profit organization that develops public guidelines for drug quality in collaboration with regulatory agencies, industry partners, and academia. An analogous organization for BioMeTs would be responsible for creating and updating guidelines and standards for verification testing. At present, there are multiple working groups within larger organizations that are focused on developing these verification standards for specific subsets of BioMeTs. Two examples of these working groups are the IEEE-WAMIII (Wearables and Medical IoT Interoperability & Intelligence) and the Consumer Technology Association’s Health and Fitness Technology Division. Such groups should collaborate to develop unified standards for verification that can be used by the regulatory bodies for oversight.

Table 3 describes the application of verification in practice.

Analytical validation

Analytical validation involves evaluation of a BioMeT for generating physiological and behavioral metrics. This involves evaluation of the processed data and requires testing with human subjects 24 . After verified sample-level data have been generated by a BioMeT, algorithms are applied to these data in order to create behaviorally or physiologically meaningful metrics, such as estimated sleep time, oxygen saturation, heart rate variability, or gait velocity.

This process begins at the point at which verified output data (sample-level data) become the input for algorithmic processing. Therefore, the first step of analytical validation requires a defined data capture protocol and a specified test subject population. For example, to develop an algorithm for gait velocity using data captured from a verified inertial measurement unit (IMU), it is necessary to specify (1) where the technology is worn (e.g., on the waist at the lumbar spine, ankle, or dominant wrist) and the orientation of the sensor, and (2) the study participant population (e.g., healthy adults aged 18–64, or patients with a diagnosis of multiple sclerosis aged 5–18) 25 , 26 . In this example, the analytical validation consists of evaluating the performance of the gait velocity algorithm on verified IMU data captured in accordance with the specific study protocol and in the particular study population of healthy adults aged 18–64.
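
The sketch below illustrates, under stated assumptions, how such an analytical validation comparison might be summarized: algorithm-derived gait velocity from verified IMU data is compared against an optical motion capture reference within the pre-specified population. The column names, age filter, and accuracy threshold are hypothetical and would in practice be defined in the study protocol.

```python
# Hedged sketch of summarizing an analytical validation study: algorithm output
# versus a reference standard, restricted to the pre-specified population.
# Column names and the 0.1 m/s threshold are illustrative assumptions.
import pandas as pd

def analytical_validation_summary(df, accuracy_threshold_mps=0.1):
    """df columns: 'age', 'algorithm_mps', 'reference_mps' (motion capture)."""
    cohort = df[(df["age"] >= 18) & (df["age"] <= 64)]       # defined study population
    error = cohort["algorithm_mps"] - cohort["reference_mps"]
    return {
        "n": int(len(cohort)),
        "mean_abs_error_mps": float(error.abs().mean()),
        "bias_mps": float(error.mean()),
        "meets_prespecified_threshold": bool(error.abs().mean() <= accuracy_threshold_mps),
    }

# Illustrative use with made-up records
data = pd.DataFrame({
    "age": [25, 40, 63, 70],
    "algorithm_mps": [1.21, 1.05, 0.92, 0.80],
    "reference_mps": [1.25, 1.02, 0.95, 0.88],
})
print(analytical_validation_summary(data))   # the 70-year-old falls outside this cohort
```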

During the process of analytical validation, the metric produced by the algorithm must be evaluated against an appropriate reference standard. Sleep onset/wake, for example, should be validated against polysomnography; oxygen saturation against arterial blood samples; heart rate variability against electrocardiography; and biomechanics such as gait dynamics against motion capture systems. It is important to remember that there can be multiple reference standards for a single metric, and not all reference standards are based on sensors. For example, a commonly used reference standard for respiratory rate is a manual measurement: a nurse observes and counts a study participant’s chest rises over a defined period of time. Manual reference standards are necessary when it is infeasible or impractical to use a sensor-based standard; step counts, for example, are typically validated using manual step counting rather than an instrumented walkway or instrumented shoes because it is more practical to have a human observer manually count the subject’s steps during a long walk test. In general, however, manual measurements are not the best choice for reference standards as they are the most prone to user error; they should only be used when no other reference standards are suitable or feasible.

It would be counterproductive to recommend a single threshold of accuracy for analytical validation of a BioMeT metric versus a reference standard, as not all reference standards are of equal quality. First, not all reference standards are completely objective. For example, polysomnography signals are collected via sensors but may be manually scored by a trained technologist to generate sleep variables. Second, ostensibly objective reference standards like optical motion capture systems may have substantial operator bias that increases the variability of the final measurements 27 . Finally, in some cases a “gold standard” reference may not be clearly defined. For example, Godfrey et al. noted that validating metrics produced by a gait algorithm based on body-worn inertial sensors against a traditional laboratory reference standard, an instrumented pressure sensor gait mat, revealed poor agreement for variability and asymmetry estimates of left/right step data. In this case, a gait mat is a poor choice of reference standard for evaluating body-worn sensors due to fundamental differences in measurement methods between the pressure and inertial sensor modalities 28 . Therefore, we recommend caution in the choice of reference standards for analytical validation studies. Most importantly, it is critical to understand how the selected reference standard measures and interprets the desired metric in order to undertake appropriate analytical validation procedures.

Best practices should be followed when choosing a reference standard for analytical validation of a BioMeT. The most rigorous and quantitative reference standards should be agreed upon and documented by guidance documents and consensus statements from governance and professional organizations. These are the reference standards that should be selected in order to avoid poor methodological approaches. Low-quality reference standards have the potential to introduce error as they may only produce an estimate of the desired metric. For example, a sleep diary contains the subject’s recollection of their sleep onset/wake time, which might vary considerably from the actual sleep onset/wake. Similarly, the process of back-validation, where analytical validation of a next generation BioMeT is evaluated against the previous generation, will also introduce error that can quickly compound if this process is repeated over multiple generations.

Table 4 summarizes the process of analytical validation.

How can we reconcile analytical validation of BioMeT-generated measures in digital medicine with more familiar approaches from other disciplines?

BioMeTs come in a wide variety of form factors and levels of complexity. Despite this variation, the goals and challenges of generating evidence of analytical validity are common across many tools and are similar to those of non-digital tools. For example, assessing the analytical validity of heart rate variability (HRV) from a commercial chest strap and of gait velocity from a wrist-worn accelerometer both require the use of reference standards, testing protocols, and statistical analyses that are widely accepted by subject matter experts. These elements have been a part of analytical validation within engineering and health-related disciplines for many years. However, questions of their relevance to BioMeTs of ever-increasing novelty can arise, particularly when the reference standards, testing protocols, and statistical analyses are poorly defined, non-intuitive, or are not disclosed at all.

In some instances, a BioMeT may be attempting to replace a less-robust clinical measurement tool that provides only measurement estimates (e.g., patient diaries). When it is not possible to robustly establish analytical validation due to the novelty of the data type generated from a BioMeT (i.e., no reference standard exists), the need for evidence of clinical validity and utility increases. In contrast, the primary element required to demonstrate clinical validity (discussed below) is a reproducible association with a clinical outcome of interest. Methodological approaches to establishing associations are diverse, and the most appropriate methods are dependent on the target population and context of clinical care.

Who is responsible for analytical validation?

Analytical validation focuses on the performance of the algorithm and its ability to measure, detect, or predict the presence or absence of a phenotype or health state and must involve assessment of the BioMeT on human participants. As such, the entity that is developing the algorithm is responsible for analytical validation. Ideally, analytical validation would benefit from collaboration between the engineering team responsible for developing the sensor technology, data scientists/analysts/statisticians, physiologists or behavioral scientists, and the clinical teams responsible for testing in human participants from which the data are captured and the algorithm is derived. These multi-disciplinary teams might all sit within a single organization or may be split between a technology manufacturer and an analytics company, academic organization, and/or medical product manufacturer.

Commercial technology manufacturers often focus on developing generic algorithms with broad applications to a wide variety of subject populations in order to market their products to the widest possible consumer base. These algorithms (step count, walking speed, heart rate and heart rate variability, falls, sleep, muscle activation, etc.) could be applied to subjects with a variety of health conditions and under a variety of circumstances. However, commercial technology manufacturers may only conduct analytical validation for their algorithms using a small cohort of healthy subjects in a controlled laboratory setting. The manufacturer may or may not document the results of these studies in order to demonstrate the analytical validation of all the algorithms in their product. Sponsors of new medical products (drugs, biologics, or devices) choosing to use commercial technology will typically need to conduct their own analytical (and then clinical) validation.

When sponsors of new medical products (drugs, biologics, or devices) want to use BioMeTs to assess safety or efficacy of a new medical product for regulatory approval, they necessarily focus on developing specific algorithms with narrow applications that are targeted to their exact patient population of interest (e.g., Parkinson’s disease, multiple sclerosis, Duchenne’s muscular dystrophy). Through their clinical trial populations, sponsors generally have access to large data sets of patients with the specific health condition of interest from which to develop their algorithms. The trial sponsors may include a BioMeT prospectively as an exploratory measure in a clinical trial (both early and late stage) and use the collected data to develop the algorithm. There may be no available reference standards for these targeted algorithms; as a result, the sponsor may use other data collected during the clinical trial as the surrogate reference standards for the algorithms.

The sponsor should thoroughly document the analytical validation of the algorithms and is required to submit these results to regulatory bodies such as FDA or EMA. However, owing to the sensitivity of data collected during a clinical trial, these results may never be published or may be published years after the clinical trial has concluded. To demonstrate the efficacy of the BioMeT, we recommend that sponsors publish the results of analytical validation as soon as possible.

Table 5 describes the application of analytical validation in practice.

Clinical validation

Clinical validation is the process that evaluates whether the BioMeT acceptably identifies, measures, or predicts a meaningful clinical, biological, physical, functional state, or experience in the specified context of use. Meaningful interpretation of results requires an understanding of the level of accuracy, precision, and reliability necessary for a tool to be useful in a specific clinical research setting.

Clinical validation is intended to take a measurement that has undergone verification and analytical validation and evaluate whether it can answer a specific clinical question. This may involve assessment or prognosis of a certain clinical condition. Clinical validation should always be tailored to a specific context of use. The goal of clinical validation is to evaluate the association between a BioMeT-derived measurement and a clinical condition. The process of clinical validation also ensures the absence of systematic biases and can uncover BioMeT limitations, such as an improper dynamic range for addressing a particular question. For example, clinical validation could be established through a study assessing the relationship between ambulatory BP monitoring and all-cause and cardiovascular mortality 29 .
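
As a minimal illustration of evaluating an association between a BioMeT-derived measure and a clinical condition, the sketch below fits a logistic regression on synthetic data with scikit-learn. This is only one of many possible analysis choices (the cited ambulatory blood pressure example, for instance, would more naturally use survival methods), and all variable names and data here are invented.

```python
# Minimal, illustrative sketch of testing an association between a BioMeT-derived
# measure and a binary clinical condition. Data and modelling choice are assumptions;
# a real clinical validation study would pre-specify its design and analysis plan.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
biomet_measure = rng.normal(size=500).reshape(-1, 1)          # e.g., a nightly summary metric
condition = (biomet_measure[:, 0] + rng.normal(size=500) > 0.5).astype(int)

model = LogisticRegression().fit(biomet_measure, condition)
auc = roc_auc_score(condition, model.predict_proba(biomet_measure)[:, 1])
print(f"In-sample AUC for the association: {auc:.2f}")        # illustrative only
```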

Developing a standardized framework for clinical validation is challenging because of the highly variable nature of the questions asked in clinical validation studies. However, we can adapt solutions from the FDA Guidance on patient reported outcomes 30 or the CTTI recommendations and resources for novel endpoint development 31 . Some of these concepts, such as defining meaningful change to interpret treatment response and the ability to detect clinically meaningful change, could be leveraged more extensively for the purposes of clinical validation of BioMeTs.

Clinical experts, regulators, and psychometricians who are experienced with the development of clinical measurement tools are intimately familiar with the process of clinical validation. The work that these experts do does not change when the tool is digital.

Table 6 summarizes the process of clinical validation.

How can we reconcile clinical validation of sensor-generated measures in digital medicine with more familiar approaches from other disciplines?

Clinical validation is a process that is largely unique to the development of tests, tools, or measurements either as medical products themselves, or to support safety and/or efficacy claims during the development of new medical products, or new applications of existing medical products. Technology manufacturers who are not yet experienced in the clinical field may be unfamiliar with this final step in the development of a BioMeT. Equally, clinical experts with significant experience developing traditional clinical tests, tools, and measurement instruments may not realize that this process does not vary when developing and evaluating a BioMeT.

Who is responsible for clinical validation?

Clinical validation is conducted by clinical teams planning to use, or promote the use of, the BioMeT in a certain patient population for a specific purpose. In practice, sponsors of new medical products (drugs, biologics, or devices) or clinical researchers will be the primary entities conducting clinical validation. If the digital tool is being used to support a labeling claim in the development of a new medical product, or a new application of an existing medical product, then the sponsor of the necessary clinical trials will be required to conduct clinical validation of any BioMeTs they use to make labeling claims.

In circumstances where the sponsor has completed analytical validation of an algorithm for a specific and narrow patient population, it may be possible to reuse some of the patient data that informed analytical validation to complete clinical validation. Clinical trials (both early and late stage) generate large data sets of patient health data that have traditionally been used to demonstrate clinical validity of biomarkers or surrogate endpoints 5 . This same process still applies when evaluating BioMeTs. We recommend using caution to avoid overestimating the utility of a digital endpoint if the same data set is used for both analytical and clinical validation. Documentation of clinical validation for BioMeTs should follow the same processes and requirements of clinical validation of traditional tests, tools, and measurement instruments 32 .

Table 7 describes the application of clinical validation in practice.

What is the regulatory oversight of the analytical and clinical validation processes?

The pathways for regulatory oversight of the validation processes will vary with the claims that the manufacturer of the BioMeT makes. For BioMeTs on regulatory pathways that require clearance or approval as a medical device, the centers within regulatory bodies responsible for these devices have regulatory oversight. These pathways are described in detail in Digital Medicine: A Primer on Measurement 3 .

For BioMeTs being used to support safety and efficacy claims of other medical products, there are a number of different options. In the United States, there is a pathway to “qualify” a digital tool outside of an individual drug development program 32 . Other pathways are specific to the medical product of interest. Decisions about the best approach to developing and/or using a BioMeT in clinical trials and the preferred approaches for analytical validation should be made with input from regulatory agencies. CTTI has developed a quick reference guide for engaging with the FDA in these conversations 33 .

Real-world examples of V3 processes

Table 8 describes the application of V3 processes for five use cases, including both commercial and medical BioMeTs.

The V3 framework in practice

There are a number of considerations that transcend the processes of verification, analytical validation, and clinical validation in the development of BioMeTs.

Do these processes replace existing GxP processes?

No. Good ‘x’ practices (or GxP) are guidelines that apply to a particular field. For example, ‘x’ may be manufacturing (GMP) or laboratory (GLP). Good practice guidelines apply to products in regulated fields (e.g., pharmaceuticals and medical devices) and are intended to ensure that these products are safe and meet their intended use by complying with strict quality standards throughout the entire process of production. V3 processes should be applied to all BioMeTs used in digital medicine. Digital tools that are also cleared or approved as medical devices must also comply with applicable GxP guidelines.

Emphasizing the importance of a study protocol during V3 evaluation

It is important to develop clear study protocols and reports prior to embarking on V3 exercises. For verification, documentation should stipulate the requirements/acceptance criteria, testing steps, procedures, and timelines, and should record the experimental results with appropriate conclusions. Both analytical validation and clinical validation processes are subject to regulations applicable to human experimentation. Clinical study protocols require approval by an IRB/EC and regulatory agencies, as applicable.

For all V3 processes, keeping appropriate test/study protocols and reporting the results is critical as it serves multiple purposes: defining the objectives of the experiment, aligning all stakeholders involved, complying with applicable regulations, and providing tools for determining compliance. In addition, protocols and study reports are key tools for documenting scientific evidence needed to draw inferences on whether a technology is fit-for-purpose for the intended use and context of use.

Considering upgrades to firmware and/or software

The requirements for V3 are determined by the intended use of the BioMeT. Therefore, if the hardware or software are changed, new verification and/or analytical validation studies are needed to provide updated documentation for the end user (e.g., the study sponsor using the BioMeT as a drug development tool). Fortunately, changes in hardware and firmware often have no negative effects on the sample-level data, but the manufacturer still needs to demonstrate that this is the case, and also whether there is “backwards compatibility” with earlier models. This is important because if an engineering improvement in BioMeT firmware or hardware makes the new data incompatible with data collected from earlier versions, this “improvement” could be disastrous for longitudinal studies and meta-analyses.

Software updates that include changes to the algorithm processing the sample-level data require analytical validation to be repeated. However, if the hardware and firmware are unchanged, it is not necessary to repeat verification, and analytical validation can be conducted using pre-existing sample-level data.

There can be misperceptions about the implications of firmware and software updates, such as whether or not they trigger new reviews from regulators like the FDA. For instance, software manufacturers are able, and are encouraged by the FDA, to patch known security vulnerabilities 34 . Notably, software manufacturers, and not the FDA, are responsible for validation of software changes after the patch has been deployed 34 .

Extending BioMeTs to new populations

If the BioMeT itself has not changed, it is not necessary to repeat existing verification studies. However, whether existing validation data can be generalized to a different patient population or clinical setting is a matter for scientific judgment and may require additional analytical validation and clinical validation studies. For example, consider an algorithm that processes data from a hip-worn accelerometer to generate the number of steps per day and that was originally developed using data collected from healthy college athletes. There may be published data demonstrating that the algorithm performs well when tested on similar populations, such as people who are slightly older or those who are generally fit and active. However, it is unlikely that the algorithm will generate an accurate step count if applied to a person suffering from peripheral neuropathy or a gait disorder. Thus, it would be incorrect to assume that, because analytical validation testing has demonstrated good performance in one scenario, the algorithm is validated for use in all scenarios.

Extending V3 concepts to multimodal and composite digital measures

V3 processes extend to multimodal data and composite digital measures. Multimodal describes data captured from two or more unique measurement methods. For example, a combination of accelerometer and gyroscope data can be used to detect falls and sit-to-stand transitions 35 , 36 . Digital tools relying on multimodal data should have evidence of verification available for each sensor, and evidence of analytical validation and clinical validation for the measure itself. Composite digital measures combine several individual measures, often derived from different sensors, to reach a single interpretive readout. For example, combining digital assessments of heart rate, sleep and heart rate variability can render a composite measure of depression 37 . Another example may combine accelerometer, GPS, keyboard and voice data from a smartphone to give a composite measure of cognition 38 . In these cases, verification of all contributing sensors is required along with validation of both the individual measures and the combined composite measure.

How much validation is “enough”?

It can be difficult to decide whether an analytical validation study has achieved its goal of determining that an algorithm correctly captures the behavioral or physiological measure of interest. If there is a clear and objective reference standard, then a numerical accuracy threshold can be set a priori, and the algorithm can be said to be sufficiently well validated if the results of the testing meet or exceed the threshold. A numerical accuracy threshold should be chosen based on the expected accuracy of the reference standard, combined with a literature review of relevant research and comparable validation studies that indicate what would be clinically meaningful accuracy. For example, heart rate has a clear reference standard (multi-lead ECG) and there are many published analytical validation studies describing the accuracy of various heart rate measurement devices 39 .
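
As a small, hedged illustration of an a priori accuracy criterion, the sketch below checks a BioMeT-derived heart rate series against a multi-lead ECG reference using mean absolute percentage error. The 10% threshold and the values are invented for illustration; an actual threshold would come from the reference standard's expected accuracy and the literature, as described above.

```python
# Illustrative a priori accuracy check: BioMeT heart rate versus an ECG reference.
# The 10% MAPE criterion and the values below are assumptions, not recommendations.
import numpy as np

def meets_a_priori_threshold(biomet_bpm, ecg_bpm, mape_threshold_pct=10.0):
    mape = float(np.mean(np.abs(biomet_bpm - ecg_bpm) / ecg_bpm) * 100)
    return mape, mape <= mape_threshold_pct

ecg = np.array([62.0, 75.0, 88.0, 101.0])      # reference heart rate (bpm)
biomet = np.array([60.0, 77.0, 90.0, 99.0])    # BioMeT-derived heart rate (bpm)
print(meets_a_priori_threshold(biomet, ecg))   # roughly 2.5% error, threshold met
```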

When evaluating a novel metric where there is no clear reference standard, analytical validation becomes a more challenging task. In such cases, the first step is to determine what level of accuracy is necessary to be clinically meaningful in the expected user population. This can be determined by a literature review of previously published research combined with consultations of key opinion leaders in the field. Once an approximate accuracy threshold has been established, the best available reference standard should be chosen. The reference standard is often the measurement method used in clinical practice, and should be chosen based on the literature and in consultation with key opinion leaders. Then the analytical validation study can be completed. It is noteworthy that the absence of a clear reference standard necessarily requires the integration of elements of analytical and clinical validation to appropriately evaluate the measure. An example of this type of study is the measurement of tremor in patients with Parkinson’s disease. Tremor is usually assessed by visual observation of the patient, which is not a clear reference standard. In one study, a BioMeT’s measurement of Percent of Time that Tremor is Present in Parkinson’s patients was assessed against visual observation to generate an accuracy score 40 .

In general, it is not possible to set a blanket threshold for all types of statistical assessments of clinical validation, as these will differ depending on the clinical measurement, patient population, and context of use. For example, a BioMeT that is highly sensitive in detecting a disease may be valuable for the purposes of screening owing to the low false-negative rate, whereas a BioMeT that is highly specific may be of value for the purpose of diagnosis owing to the low false-positive rate. In addition, determining that the endpoint generated by the BioMeT is clinically valid and of importance to understanding the functional status or quality of life of the target population is critical. This process relies on examining the totality of evidence related to the endpoint in question, and using that information to make a scientific judgment as to whether the endpoint is an appropriate measurement or diagnostic marker. For clinical validation, the best practice would be to publish all available testing and results (including the protocols), which will allow future users to choose the most appropriate BioMeT for their specific purpose (fit-for-purpose).
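
The screening-versus-diagnosis trade-off above comes down to sensitivity and specificity, sketched below from a confusion matrix of BioMeT classifications against a clinical reference. The counts are invented purely for illustration.

```python
# Minimal sketch of the sensitivity/specificity trade-off described above.
# The confusion-matrix counts are invented for illustration.
from typing import Tuple

def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    sensitivity = tp / (tp + fn)   # a low false-negative rate favours screening
    specificity = tn / (tn + fp)   # a low false-positive rate favours diagnosis
    return sensitivity, specificity

# e.g., 90 true positives, 40 false positives, 10 false negatives, 860 true negatives
print(sensitivity_specificity(90, 40, 10, 860))   # -> (0.90, approximately 0.96)
```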

Figure 3 summarizes the application of the V3 process in the real world.

Figure 3: The V3 process in practice.

Statistical considerations in V3

Error can stem from a wide array of sources when employing BioMeTs. The development and implementation of a robust V3 protocol, and subsequent BioMeT deployment and use in accordance with that protocol, will minimize error resulting from differences between expected and actual accuracy as well as between intended and actual use. There is a wide range of statistical analyses used to evaluate BioMeTs for their coherence with reference standards and their clinical power, a full treatment of which is beyond the scope of this paper. Provision of raw data, whenever possible, helps to address transparency and independent evaluation of technologies by allowing independent investigation of, for example, data variance and its impact on BioMeT reliability. In addition, it is important to consider the limits of agreement when using different devices to quantify the same biomarker at different timepoints or in different cohorts.
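
One common way to examine such agreement is a Bland-Altman style limits-of-agreement calculation, sketched below under the assumption of paired measurements from two devices (or from a device and a reference). The specific statistic and any acceptance limits should be pre-specified in the V3 protocol; the function name and values here are illustrative.

```python
# Hedged sketch of a Bland-Altman style limits-of-agreement calculation for paired
# measurements of the same biomarker from two devices. Values are illustrative.
import numpy as np

def limits_of_agreement(device_a, device_b):
    diff = np.asarray(device_a) - np.asarray(device_b)
    bias = float(np.mean(diff))
    sd = float(np.std(diff, ddof=1))
    return {"bias": bias, "lower_loa": bias - 1.96 * sd, "upper_loa": bias + 1.96 * sd}

hr_device_a = [61.0, 70.0, 82.0, 95.0]   # e.g., heart rate from device A (bpm)
hr_device_b = [60.0, 72.0, 80.0, 96.0]   # paired heart rate from device B (bpm)
print(limits_of_agreement(hr_device_a, hr_device_b))
```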

Future directions

Digital medicine is an interdisciplinary field, drawing together stakeholders with expertize in engineering, manufacturing, clinical science, data science, biostatistics, regulatory science, ethics, patient advocacy, and healthcare policy, to name a few. Although this diversity is undoubtedly valuable, it can lead to confusion regarding terminology and best practices in this nascent field. There are many instances, as we detail in this paper, where a single term is used by different groups to mean different things, as well as cases where multiple terms are used to describe what is essentially the same concept. Our intent is to clarify the core terminology and best practices for the evaluation of BioMeTs for use in clinical trials of new medical products, without unnecessarily introducing new terms. We aim for this common vocabulary to enable more effective communication and collaboration while improving the accessibility of the field to new adopters.

Figure 4 summarizes the role of the different disciplinary experts in the V3 process.

figure 4

V3 processes are typically conducted by experts across disciplines and domains.

V3 processes for traditional medical devices are generally well established but BioMeTs introduce new considerations 41 . For instance, SaMDs do not rely on specific hardware or sensors. The process of verification enables the use of SaMDs on verified data from any suitable sensor technology. In addition, some vendors sell “black box” algorithms or combined sensor/algorithm pairings. Establishing clear definitions and evidentiary expectations for the V3 processes will support collaborators seeking to evaluate the output of a “black box” sensor technology and/or measurement tool. Although the focus of this manuscript is on the use of BioMeTs in regulated trials of new medical products, our intent is for this framework to be instructional to all users of digital measurement tools, regardless of setting or intended use. Informing treatment decisions or care management based on a digital measure should not be subject to different scrutiny. Our goal in advancing this unifying V3 evaluation framework is to standardize the way high-quality digital measures of health are developed and implemented broadly. Evidence to support a determination of ‘fit-for-purpose’ and build trust in a digital measure should be uniform. A lack of V3 evaluation will have severe consequences (see Table 9 for illustrative examples) if algorithms fail to run according to predetermined specifications or if BioMeTs fail to perform according to their intended purpose.

Adopting streamlined methods for transparent reporting of V3 methodologies could lead to more ubiquitous deployment of low-cost technologies to better assess and monitor people outside of the clinic setting. This, in turn, can help healthcare professionals better diagnose, treat, and manage their patients, while promoting individualized approaches to medicine. Transparency will overcome “black box” technology development and evaluation approaches, ensuring that BioMeTs are used appropriately with the robust capture of data regardless of environment and context.

The proposed V3 process for BioMeTs describes an evidence base to drive the appropriate adoption of fit-for-purpose digital measurement technologies. In this document, we propose this three-pronged framework using historic and current contexts to define the key terms in this process. As a next step, we strongly encourage a re-initiation of the FDA B.E.S.T. working group to consider these definitions, refine them, and add them to the working compendium BEST framework 42 . We also encourage groups like the IEEE to consider these ontologies and provide feedback and guidance on the next steps required to adopt a common language and model for digital tools. We also recognize that technological developments will move faster than any regulatory or standards body can keep up with, so we encourage the practitioners in the digital era of medicine, including data scientists, engineers, clinicians and more, to continue to build upon this work. Professional societies like The Digital Medicine Society (DiMe) aim to become a collaborative hub for innovation in this area. Our hope is that the V3 framework and definitions continue to evolve to reflect the technologies that they serve. Our team will aim for annual updates to the framework as it exists herein. Once a common BioMeT evaluation paradigm is agreed upon, we will be able to develop technologies deserving of the trust we place in them (Boxes 1 – 3 ).

Box 1: Key takeaways

The term “clinically validated” is often found in marketing literature for digital medicine tools but, currently, its meaning is not clear. A standardized framework is needed to bring meaning to this term.

Biometric Monitoring Technologies (BioMeTs) are connected digital medicine tools that process data captured by mobile sensors using algorithms to generate measures of behavioral and/or physiological function.

The rapid rise in the demand for and development of digital medicine products, and specifically BioMeTs, to support the practice of medicine has left in its wake a body of new technologies with no systematic, evidence-based evaluation framework.

BioMeTs should be characterized by a body of evidence to support their quality, safety, and efficacy. Users of these technologies should recognize that verification and validation processes are critical to support a technology as fit-for-purpose. Without a supporting body of evidence, data can be misinterpreted. In the context of clinical trials, this can result in misleading study conclusions and possibly patient harm.

The evaluation framework for BioMeTs should encompass both the product’s components (e.g., hardware, firmware, and software, including algorithms) and the intended use of the product. Existing frameworks for new biotechnologies are not sufficiently adaptable, but they can provide meaningful insight for developing new evaluation frameworks for BioMeTs.

We propose and describe a three-component framework intended to provide a foundational evaluation of BioMeTs. This framework includes (1) verification, (2) analytical validation, and (3) clinical validation.

The V3 processes are foundational to determining whether a digital medicine tool is fit-for-purpose. The usefulness and utility of a tool can only be evaluated after gaining evidence and assurance that the underlying data and predictions are “valid” to answer a given question.

Adopting streamlined methods for transparent reporting of V3 processes will help overcome “black box” technology development and evaluation approaches, ensuring that BioMeTs are used appropriately with the robust capture of data.

Box 2: Reality check—analytical validation in practice

The process of conducting analytical validation as we describe it here is not always what happens in practice. Often, algorithms are developed by technology manufacturers, are considered proprietary, and are not disclosed for testing. Sponsors of new medical products who want to use one of these tools to evaluate the safety or efficacy of a new product may therefore not have access to the algorithms. However, access to the algorithm itself is not necessary for the purposes of analytical validation, as long as the investigator is able to access the input data (sample-level data or processed data, depending on the algorithm) along with the software containing the algorithm in order to generate the endpoint/s of interest. Regardless of which party performs analytical validation, sponsors opting to use a particular BioMeT are responsible for their trial data integrity and for communicating documentation of all stages of the V3 processes to regulators. Where IP issues prohibit sponsors from completing analytical validation independently, they must have the means to assess the analytical validation of the tools upon which their trial success depends.

Box 3: Sample-level and processed data

Sample-level data are used as input to algorithms that convert those data to a second type of reported data that is not a direct representation of the original analog signal. We refer to these data as processed data because they are the result of processing operations applied to the original sample-level data. For example, ‘heart rate’ and ‘step count per minute’ are two processed data types that can be obtained from sample-level data (e.g., 250 Hz ECG or 50 Hz accelerometer data, respectively).

In both cases, the processed data are not a direct representation of the original analog signal measured by the sensor; instead, an algorithm was applied to produce the new type of data. These processed data are almost always available to third parties and exist at a lower frequency than the sample-level data. For sensor technologies that restrict access to the sample-level data, the processed data are the first accessible data set from the device.

The distinction between sample-level and processed data is important because the evaluation processes differ. Following the V3 framework, sample-level data should be evaluated at the verification stage, and processed data should be evaluated at the analytical validation stage. The evaluation processes differ because the processed data have been transformed from their original form.
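As a concrete (and deliberately simplistic) illustration of this distinction, the sketch below converts simulated 50 Hz accelerometer sample-level data into a processed ‘step count per minute’ value using basic peak detection; the signal, threshold, and peak-counting heuristic are assumptions for illustration, not a validated algorithm.

```python
# Minimal sketch: deriving processed data (steps per minute) from
# sample-level data (50 Hz accelerometer magnitude). The signal and the
# threshold are simulated/assumed; this is not a validated step algorithm.
import numpy as np
from scipy.signal import find_peaks

fs = 50                        # sample-level data rate, Hz
t = np.arange(0, 60, 1 / fs)   # one minute of data
walking_cadence_hz = 1.8       # hypothetical cadence (~108 steps/min)
accel_mag = (1.0 + 0.3 * np.sin(2 * np.pi * walking_cadence_hz * t)
             + 0.05 * np.random.default_rng(1).normal(size=t.size))

# Toy heuristic: each peak in the magnitude signal is treated as one step.
peaks, _ = find_peaks(accel_mag, height=1.1, distance=int(0.3 * fs))
steps_per_minute = len(peaks)  # processed data: one value per minute

print(f"Sample-level points: {accel_mag.size}, processed value: {steps_per_minute} steps/min")
```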

Goldsack, J. Laying the Foundation: Defining Digital Medicine. Medium (2019). Available at: https://medium.com/digital-medicine-society-dime/laying-the-foundation-defining-digital-medicine-49ab7b6ab6ef . (Accessed 18 Sept 2019).

Perakslis, E. & Coravos, A. Is health-care data the new blood? Lancet Digital Health 1 , e8–e9 (2019).


Coravos, A. et al. Digital medicine: a primer on measurement. Digit Biomark. 3 , 31–71 (2019).

Dunn, J., Runge, R. & Snyder, M. Wearables and the medical revolution. Per. Med. 15 , 429–448 (2018).


FDA-NIH Biomarker Working Group. BEST (Biomarkers, EndpointS, and other Tools) Resource . (Food and Drug Administration (US), 2016).

Versi, E. ‘Gold standard’ is an appropriate term. BMJ 305 , 187 (1992).

International Organization for Standardization. ISO 9001:2015. Available at: http://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/20/62085.html . (Accessed 18 Sept 2019).

International Organization for Standardization & International Electrotechnical Commission. ISO 13485:2016, Medical devices — Quality management systems — Requirements for regulatory purposes . (2016).

IEEE Computer Society. IEEE Standard for System, Software, and Hardware Verification and Validation. IEEE Std 1012-2016 (Revision of IEEE Std 1012-2012/ Incorporates IEEE Std 1012-2016/Cor1-2017) 1–260 (2017). https://doi.org/10.1109/IEEESTD.2017.8055462 .

U.S. Department Of Health and Human Services, U.S. Food and Drug Administration, Center for Devices and Radiological Health & Center for Biologics Evaluation and Research. General Principles of Software Validation; Final Guidance for Industry and FDA Staff, 47 (2002).

U.S. Food and Drug Administration. CFR - Code of Federal Regulations Title 21. Available at: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=820.30 . (Accessed 18 Sept 2019).

Center for Drug Evaluation and Research. Drug Development Tool Qualification Programs. FDA (2019). Available at: http://www.fda.gov/drugs/development-approval-process-drugs/drug-development-tool-qualification-programs . (Accessed 18 Sept 2019).

U.S. Food and Drug Administration. Bioanalytical Method Validation Guidance for Industry. Available at: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/bioanalytical-method-validation-guidance-industry . (Accessed 7 Mar 2020).

Clinical Trials Transformation Initiative. Framework of Specifications to Consider During Mobile Technology Selection (2018).

H.R. 34, 114th Congress. 21st Century Cures Act (2016).

U.S. Food and Drug Administration. Digital Health Innovation Action Plan. (2017). https://www.fda.gov/media/106331/download .

IMDRF SaMD Working Group. Software as a Medical Device (SaMD): Key definitions (2017).

Krueger, A. C. Regulation of photoplethysmograph analysis software for over-the-counter use. U.S. Food & Drug Administration (2018).

Krueger, A. C. Regulation of electrocardiograph software for over-the-counter use. U.S. Food & Drug Administration (2018).

Bignardi, G. E. Validation and verification of automated urine particle analysers. J. Clin. Pathol. 70 , 94–101 (2017).

International Electrotechnical Commission. Available at: https://www.iec.ch/ . (Accessed 18 Sept 2019).

Margaine, C. The Notified Body’s Role in Medical Device Certification. Available at: https://lne-america.com/certification/ce-marking-gain-market-access-to-europe/notified-body . (Accessed 18 Sept 2019).

USP (The United States Pharmacopeial Convention). Available at: https://www.usp.org/ . (Accessed 18 Sept 2019).

Witt, D. R., Kellogg, R. A., Snyder, M. P. & Dunn, J. Windows into human health through wearables data analytics. Curr. Opin. Biomed. Eng. 9 , 28–46 (2019).

McCamley, J., Donati, M., Grimpampi, E. & Mazzà, C. An enhanced estimate of initial contact and final contact instants of time using lower trunk inertial sensor data. Gait Posture 36 , 316–318 (2012).

Trojaniello, D., Cereatti, A. & Della Croce, U. Accuracy, sensitivity and robustness of five different methods for the estimation of gait temporal parameters using a single inertial sensor mounted on the lower trunk. Gait Posture 40 , 487–492 (2014).

Hutchinson, L. et al. Operator bias errors are reduced using standing marker alignment device for repeated visit studies. J. Biomech. Eng. 140 , 041001 (2018).

Godfrey, A., Del Din, S., Barry, G., Mathers, J. C. & Rochester, L. Instrumenting gait with an accelerometer: a system and algorithm examination. Med. Eng. Phys. 37 , 400–407 (2015).

Banegas, J. R. et al. Relationship between clinic and ambulatory blood-pressure measurements and mortality. N. Engl. J. Med. 378 , 1509–1520 (2018).

U.S. Department of Health and Human Services, U.S. Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER) & Center for Devices and Radiological Health (CDRH). Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. (2009).

Clinical Trials Transformation Initiative. CTTI Recommendations: Developing Novel Endpoints Generated by Mobile Technology for Use in Clinical Trials. (2017).

U.S. Department of Health and Human Services, U.S. Food and Drug Administration, Center for Drug Evaluation and Research (CDER) & Center for Biologics Evaluation and Research (CBER). Biomarker Qualification: Evidentiary Framework Guidance for Industry and FDA Staff. (2018).

Clinical Trials Transformation Initiative. Quick Reference Guide to Processes for Interacting with the US Food and Drug Administration (FDA) regarding Novel Endpoint Development. (2017).

U.S. Food and Drug Administration. FDA Fact Sheet: The FDA’S Role in Medical Device Cybersecurity, Dispelling Myths and Understanding Facts.

Huynh, Q. T., Nguyen, U. D., Irazabal, L. B., Ghassemian, N. & Tran, B. Q. Optimization of an accelerometer and gyroscope-based fall detection algorithm. J. Sens. (2015). https://doi.org/10.1155/2015/452078 .

Pham, M. H. et al. Validation of a lower back “wearable”-based sit-to-stand and stand-to-sit algorithm for patients with Parkinson’s disease and older adults in a home-like environment. Front. Neurol. 9 , 652 (2018).

Kovalchick, C. et al. Can composite digital monitoring biomarkers come of age? A framework for utilization. J. Clin. Transl. Sci. 1 , 373–380 (2017).

Insel, T. R. Digital phenotyping: technology for a new science of behavior. JAMA 318 , 1215–1216 (2017).

Wang, R. et al. Accuracy of wrist-worn heart rate monitors. JAMA Cardiol. 2 , 104 (2017).

Braybrook, M. et al. An ambulatory tremor score for Parkinson’s disease. J. Parkinsons Dis. 6 , 723–731 (2016).

Panescu, D. Medical device development. in 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society 5591–5594 (2009). https://doi.org/10.1109/IEMBS.2009.5333490 .

Office of the Commissioner. FDA In Brief: FDA seeks public feedback on biomarker and study endpoint glossary. FDA (2019).

IEEE Standard for System, Software, and Hardware Verification and Validation. IEEE Std 1012-2016 (Revision of IEEE Std 1012-2012/ Incorporates IEEE Std 1012-2016/Cor1-2017) 1–260 (2017). https://doi.org/10.1109/IEEESTD.2017.8055462 .

National Academies of Sciences, Engineering, and Medicine. An Evidence Framework for Genetic Testing . (The National Academies Press, 2017). https://doi.org/10.17226/24632 .

Giles, D., Draper, N. & Neil, W. Validity of the Polar V800 heart rate monitor to measure RR intervals at rest. Eur. J. Appl. Physiol. 116 , 563–571 (2016).

Heart rate variability: standards of measurement, physiological interpretation and clinical use. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. Circulation 93 , 1043–1065 (1996).

Hernando, D., Garatachea, N., Almeida, R., Casajús, J. A. & Bailón, R. Validation of the heart rate monitor Polar RS800 for heart rate variability analysis during exercise. J. Strength Cond. Res. 32 , 716 (2018).

Frasch, M. G. et al. Can a heart rate variability biomarker identify the presence of autism spectrum disorder in eight year old children? arXiv:1808.08306 [q-bio] (2018).

Karpman, C., LeBrasseur, N. K., DePew, Z. S., Novotny, P. J. & Benzo, R. P. Measuring gait speed in the out-patient clinic: methodology and feasibility. Respir. Care 59 , 531–537 (2014).

Fortune, E., Lugade, V., Morrow, M. & Kaufman, K. Validity of using tri-axial accelerometers to measure human movement – Part II: step counts at a wide range of gait velocities. Med. Eng. Phys. 36 , 659–669 (2014).

König, A. et al. Objective measurement of gait parameters in healthy and cognitively impaired elderly using the dual-task paradigm. Aging Clin. Exp. Res 29 , 1181–1189 (2017).

U.S. Department of Health and Human Services et al. Guidance for Industry and FDA Staff: Class II Special Controls Guidance Document: Arrhythmia Detector and Alarm. (2003).

Apple Inc. Using Apple Watch for Arrhythmia Detection. (2018).

Parvinian, B., Scully, C., Wiyor, H., Kumar, A. & Weininger, S. Regulatory considerations for physiological closed-loop controlled medical devices used for automated critical care: food and drug administration workshop discussion topics. Anesth. Analg. 126 , 1916–1925 (2018).

Allen, N. & Gupta, A. Current Diabetes Technology: Striving for the Artificial Pancreas. Diagnostics (Basel) 9 , 31 (2019).

The 670G System - P160017. Available at: http://wayback.archive-it.org/7993/20170111141252/http://www.fda.gov/MedicalDevices/ProductsandMedicalProcedures/DeviceApprovalsandClearances/Recently-ApprovedDevices/ucm522764.htm . (Accessed 19 Sept 2019).

Watanabe, N. et al. Development and validation of a novel cuff-less blood pressure monitoring device. JACC Basic Transl. Sci. 2 , 631–642 (2017).


IEEE Standard for Wearable Cuffless Blood Pressure Measuring Devices. IEEE Std 1708-2014 1–38 (2014). https://doi.org/10.1109/IEEESTD.2014.6882122 .

International Organization for Standardization & International Electrotechnical Commission. ANSI/AAMI/ISO 81060-2:2013: Non-invasive sphygmomanometers — Part 2: Clinical investigation of automated measurement type.

IEEE standard for software verification and validation . (Institute of Electrical and Electronics Engineers, 1998).

Kourtis, L. C., Regele, O. B., Wright, J. M. & Jones, G. B. Digital biomarkers for Alzheimer’s disease: the mobile/wearable devices opportunity. Npj Digit. Med. 2 , 1–9 (2019).


Acknowledgements

We are grateful for input from Geoffrey S. Ginsburg MD, PhD on the language and processes used to evaluate the evidence base supporting the development and use of genetic tests to improve patient care and treatment. We are grateful for the support of many additional members of the Digital Medicine Society (DiMe) for providing expertize and insights on particular topics during the development of this work.

Author information

Authors and affiliations.

Digital Medicine Society (DiMe), Boston, MA, USA

Jennifer C. Goldsack, Andrea Coravos, Jessie P. Bakker, Elena Izmailova & Christine Manta

Elektra Labs, Boston, MA, USA

Andrea Coravos & Christine Manta

Harvard-MIT Center for Regulatory Science, Boston, MA, USA

Andrea Coravos

Philips, Monroeville, PA, USA

Jessie P. Bakker

Biomedical Engineering Department, Duke University, Durham, NC, USA

Brinnae Bent, Ke Will Wang & Jessilyn Dunn

Takeda Pharmaceuticals, Cambridge, MA, USA

Ariel V. Dowling

ClinMed LLC, Dayton, NJ, USA

Cheryl Fitzer-Attas

Computer and Information Sciences Department, Northumbria University, Newcastle-upon-Tyne, UK

Alan Godfrey

Center for Wireless and Population Health Systems, University of California, San Diego, CA, USA

Job G. Godino

Samsung Neurologica, Danvers, MA, USA

Ninad Gujar

Curis Advisors, Cambridge, MA, USA

Koneksa Health, New York, USA

Elena Izmailova

Independent Consultant, Charlotte, NC, USA

Barry Peterson

Byteflies, Antwerp, Belgium

Benjamin Vandendriessche

Department of Electrical, Computer and Systems Engineering, Case Western Reserve University, Cleveland, OH, USA

Department of Medicine, University of North Carolina at Chapel Hill; Lineberger Comprehensive Cancer Center, Chapel Hill, NC, USA

William A. Wood

Department of Biostatistics & Bioinformatics, Duke University, Durham, NC, USA

Jessilyn Dunn


Contributions

Conceptualization: J.G., A.C., J.D. Analysis and writing: all authors made substantial contributions to the conception or design of the work, participated in drafting and revisions, and provided final approval of the version to be published.

Corresponding author

Correspondence to Jessilyn Dunn .

Ethics declarations

Competing interests.

This collaborative manuscript was developed as part of research initiatives led by the Digital Medicine Society (DiMe). All authors are members of DiMe. DiMe is a Massachusetts non-profit corporation with 501(c)(3) application pending. J.C.G. is a part-time employee of HealthMode, Inc. A.C. is founder of Elektra Labs. E.I. is an executive of Koneksa Health. J.P.B. is a full-time employee of Philips.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Goldsack, J.C., Coravos, A., Bakker, J.P. et al. Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). npj Digit. Med. 3 , 55 (2020). https://doi.org/10.1038/s41746-020-0260-4

Download citation

Received : 22 September 2019

Accepted : 12 March 2020

Published : 14 April 2020

DOI : https://doi.org/10.1038/s41746-020-0260-4



  • Open access
  • Published: 01 September 2021

Perspectives in modeling and model validation during analytical quality by design chromatographic method evaluation: a case study

  • Yongzhi Dong   ORCID: orcid.org/0000-0003-4268-5952 1 ,
  • Zhimin Liu 1 ,
  • Charles Li 1 ,
  • Emily Pinter 1 ,
  • Alan Potts 1 ,
  • Tanya Tadey 1 &
  • William Weiser 1  

AAPS Open volume  7 , Article number:  3 ( 2021 ) Cite this article

4157 Accesses

1 Citations

Metrics details

Design of experiments (DOE)-based analytical quality by design (AQbD) method evaluation, development, and validation is gaining momentum and has the potential to create robust chromatographic methods through deeper understanding and control of variability. In this paper, a case study is used to explore the pros, cons, and pitfalls of using various chromatographic responses as modeling targets during a DOE-based AQbD approach. The case study involves evaluation of a reverse phase gradient HPLC method by a modified circumscribed central composite (CCC) response surface DOE.

Solid models were produced for most responses and their validation was assessed with graphical and numeric statistics as well as chromatographic mechanistic understanding. The five most relevant responses with valid models were selected for multiple responses method optimization and the final optimized method was chosen based on the Method Operable Design Region (MODR). The final method has a much larger MODR than the original method and is thus more robust.

This study showcases how to use AQbD to gain deep method understanding and make informed decisions on method suitability. Discoveries and discussions in this case study may contribute to continuous improvement of AQbD chromatography practices in the pharmaceutical industry.

Introduction

Drug development using a quality by design (QbD) approach is an essential part of the Pharmaceutical cGMP Initiative for the twenty-first century (FDA Pharmaceutical cGMPs For The 21st Century — A Risk-Based Approach. 2004 ) established by the FDA. This initiative seeks to address unmet patient needs, the unsustainable rise of healthcare costs, and the reluctance to adopt new technology in pharmaceutical development and manufacturing. These issues were partly the result of rigid legacy regulations that made continuous improvement of previously approved drugs both challenging and costly. The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) embraced this initiative and began issuing QbD-relevant quality guidelines in 2005. The final versions of ICH Q8–Q12 [(ICH Q8 (R2) 2009 ) (ICH Q9 2005 ) (ICH Q10 2008 ) (ICH Q11 2012 ) (ICH Q12 2019 )] have been adopted by all ICH members. The in-progress version of ICH Q14 (ICH Q14 2018 ) will offer AQbD guidelines for analytical procedures and promote the use of QbD principles to achieve a greater understanding and control of testing methods and a reduction of result variability.

Product development using a QbD approach emphasizes understanding of product and process variability, as well as control of process variability. It relies on analytical methods to measure, understand, and control the critical quality attributes (CQA) of raw materials and intermediates to optimize critical process parameters and realize the Quality Target Product Profile (ICH Q8 (R2) 2009 ). Nevertheless, part of the variability reported by an analytical test can originate from the variability of the analytical measurement itself. This can be seen from Eq. 1 .
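Equation 1 did not carry over into this version of the text; based on the description in the next paragraph, a plausible reconstruction in terms of variances (the exact symbols in the original may differ) is:

$$\sigma^{2}_{\text{reported}} \;=\; \sigma^{2}_{\text{product}} \;+\; \sigma^{2}_{\text{measurement}} \tag{1}$$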

The reported variability is the sum of the intrinsic product variability and the extrinsic analytical measurement variability (NIST/SEMATECH e-Handbook of statistical methods 2012a , 2012b , 2012c , 2012d ). The measurement variability can be minimized by applying QbD principles, concepts, and tools during method development to assure that the quality and reliability of the analytical method meet the target measurement uncertainty (TMU) (EURACHEM/CITAC 2015 ). High-quality analytical data accurately reflect product CQAs and thus enable robust, informed decisions regarding drug development, manufacturing, and quality control.

ICH Q14 introduces the AQbD concepts, using a rational, systematic, and holistic approach to build quality into analytical methods. The Method Operable Design Region (MODR) (Borman et al. 2007 ) is a multidimensional space based on the critical method parameters and settings that provide suitable method performance. The AQbD approach begins with a predefined analytical target profile (ATP) (Schweitzer et al. 2010 ), which defines the method’s intended purpose and drives analytical technique selection and all other method development activities. This involves understanding of the method and control of the method variability based on sound science and risk management. It is generally agreed that systematic AQbD method development should include the following six consecutive steps (Tang 2011 ):

1. ATP determination

2. Analytical technique selection

3. Method risk assessment

4. MODR establishment

5. Method control strategy

6. Continuous method improvement through a life cycle approach

A multivariate MODR allows freedom to make method changes and maintain the method validation (Chatterjee S 2012 ). Changing method conditions within an approved MODR does not impact the results and offers an advantage for continuous improvement without submission of supplemental regulatory documentation. Establishment of the MODR is facilitated by multivariate design of experiments (DOE) (Volta e Sousa et al. 2021 ). Typically, three types of DOE may be involved in AQbD. The screening DOE further consolidates the potential critical method parameters determined from the risk assessment. The optimization DOE builds mathematical models and selects the appropriate critical method parameter settings to reach to the target mean responses. Finally, the robustness DOE further narrows down the critical method parameter settings to establish the MODR, within which the target mean responses are consistently realized. Based on this AQbD framework, it is very clear DOE models are essential to understanding and controlling method variability to build robustness into analytical methods. Although there have been extensive case studies published regarding AQbD (Grangeia et al. 2020 ), systematic and in-depth discussion of the fundamental AQbD modeling is still largely unexplored. Methodical evaluation of the pros, cons, and pitfalls of using various chromatographic responses as modeling targets is even more rare (Debrus et al. 2013 ) (Orlandini et al. 2013 ) (Bezerraa et al. 2019 ). The purpose of this case study is to investigate relevant topics such as data analysis and modeling principles, statistical and scientific validation of DOE models, method robustness evaluation and optimization by Monte Carlo simulation (Chatterjee S 2012 ), multiple responses method optimization (Leardi 2009 ), and MODR establishment. Discoveries and discussions in this case study may contribute to continuous improvement of chromatographic AQbD practices in the pharmaceutical industry.

Methods/experimental

Materials and methods.

C111229929-C, a third-generation novel synthetic tetracycline-class antibiotic currently in a phase 1 clinical trial, was provided by KBP Biosciences. A reverse phase HPLC purity and impurities method was also provided for evaluation and optimization using AQbD. The method was developed using a one factor at a time (OFAT) approach and used a Waters XBridge C18 column (4.6 × 150 mm, 3.5 μm) and a UV detector. Mobile phase A was composed of ammonium acetate/ethylenediaminetetraacetic acid (EDTA) buffer at pH 8.8, and mobile phase B was composed of 70:30 (v/v) acetonitrile/EDTA buffer at pH 8.5. Existing data from forced degradation and 24-month stability studies demonstrated that the method was capable of separating all six specified impurities/degradants with ≥ 1.5 resolution.

A 1.0 mg/mL C111229929-C solution was prepared by dissolving the aged C111229929-C stability sample in 10 mM HCl in methanol and used as the method evaluation sample. An Agilent 1290 UPLC equipped with a DAD detector was used. In-house 18.2 MΩ Milli-Q water was used for solution preparations. All other reagents were of ACS equivalent or higher grade. Waters Empower® 3 was used as the chromatographic data system. Fusion QbD v9.9.0 software was used for DOE design, data analysis, modeling, Monte Carlo simulation, multiple-response mean optimization, and robustness optimization. Empower® 3 and Fusion QbD were fully integrated and validated.

A method risk assessment was performed through review of the literature and existing validation and stability data to establish priorities for method inputs and responses. Based on the risk assessment, four method parameters with the highest risk priority numbers were selected as critical method parameters. Method evaluation and optimization was performed by a modified circumscribed central composite (CCC) response surface DOE design with five levels per parameter, for a total of 30 runs. The modifications were the extra duplicate replications at three factorial points. In addition to triplicate replications at the center point, the modified design had a total of nine replicates. See Table 1 for the detailed design matrix. A full quadratic model for the four-factor five-level CCC design has a total of fourteen potential terms. They include four main linear terms (A, B, C, D), four quadratic terms (A 2 , B 2 , C 2 , D 2 ), and six two-way interaction terms (A*B, A*C, A*D, B*C, B*D, and C*D).
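As an illustration of the kind of design matrix summarized in Table 1, the sketch below generates coded points for a four-factor circumscribed central composite design (factorial, star, and center points) with NumPy; the star distance, the replicate choices, and the omission of the extra factorial replicates described above are assumptions for illustration and do not reproduce the exact 30-run matrix used in this study.

```python
# Minimal sketch: coded design points for a four-factor circumscribed
# central composite (CCC) design. The star distance alpha = 2.0 (rotatable
# for four factors) and the replicate choices are illustrative; they do not
# reproduce the study's exact modified 30-run design in Table 1.
import itertools
import numpy as np

n_factors = 4
alpha = 2.0   # rotatable star distance for 4 factors: (2**4) ** 0.25 = 2.0

# 2^4 factorial (cube) points at coded levels -1/+1
factorial_pts = np.array(list(itertools.product([-1, 1], repeat=n_factors)), dtype=float)

# 2 * 4 axial (star) points at coded levels -alpha/+alpha
star_pts = np.vstack([alpha * np.eye(n_factors), -alpha * np.eye(n_factors)])

# Center point replicated three times, as in the study
center_pts = np.zeros((3, n_factors))

design = np.vstack([factorial_pts, star_pts, center_pts])
print(design.shape)   # (27, 4) before any extra factorial replicates are added
```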

Pre-runs executed at selected star (extreme) points verified that all expected analytes eluted within the 35 min run time. This mitigated the risk of any non-eluting peaks during the full DOE study, as a single unusable run may raise questions regarding the validity of the entire study. Based on the pre-runs, the concentration of the stock EDTA solution was decreased four-fold to mitigate inaccurate in-line mixing of mobile phase B caused by low volumes of a high concentration stock. The final ranges and levels for each of the four selected method parameters are also listed in Table 1 .

Each unique DOE run in Table 1 is a different method. As there were 25 unique running conditions, there were 25 different methods in this DOE study. The G-Efficiency and the average predicted variance (NIST/SEMATECH e-Handbook of statistical methods 2012a , 2012b , 2012c , 2012d ) (Myers and Montgomery 1995 ) of the design were 86.8% and 10.6%, respectively, meeting their respective design goals of ≥ 50% and ≤ 25%. Some of the major advantages of this modified CCC design include the following:

Established quadratic effects

Robust models that minimize effects of potential missing data

Good coverage of the design space by including the interior design points

Low predictive variances of the design points

Low model term coefficient estimation errors

The design also allows for implementation of a sequential approach, where trials from previously conducted factorial experiments can be augmented to form the CCC design. When there is little understanding about the method and critical method parameters, such as when developing a new method from scratch, direct application of an optimizing CCC design is generally not recommended. However, there was sufficient previous knowledge regarding this specific method, justifying the direct approach.

DOE data analysis and modeling principles

DOE software is one of the most important tools to facilitate efficient and effective AQbD chromatographic method development, validation, and transfer. Fusion QbD software was employed for DOE design and data analysis. Mathematical modeling of the physicochemical chromatographic separation process is essential for DOE to develop robust chromatographic methods through three phases: chemistry screening, mean optimization, and robustness optimization. The primary method parameters affecting separation (e.g., column packing, mobile phase pH, mobile phase organic modifier) are statistically determined with models during chemistry screening. The secondary method parameters affecting separation (e.g., column temperature, flow rate, gradient slope settings) are optimized during mean optimization using models to identify the method most capable of reaching all selected method response goals on average. During robustness optimization, robustness models for selected method responses are created with Monte Carlo simulation and used to further optimize method parameters such that all method responses consistently reach their goals, as reflected by a process capability value of ≥ 1.33, which is the established standard for a robust process (NIST/SEMATECH e-Handbook of statistical methods 2012a , 2012b , 2012c , 2012d ).

Models are critical to the AQbD approach and must be validated both statistically and scientifically. Statistical validation is performed using various statistical tests such as residual randomness and normality (NIST/SEMATECH e-Handbook of statistical methods 2012a , 2012b , 2012c , 2012d ), regression R-squared, adjusted regression R-squared, and prediction R-squared. Scientific validation is achieved by checking the terms in a statistical model against the relevant established scientific principles, which is described as mechanistic understanding in the relevant literature (ICH Q8 (R2) 2009 ).

Fusion uses data transformation analysis to decide whether data transformation is necessary before modeling, and then uses analysis of variance (ANOVA) and regression to generate method response models. ANOVA provides an objective, statistical rationale for each consecutive modeling decision. Model residual plots are fundamental tools for validating the final method response models. When a model fits the DOE data well, the response residuals should be randomly and normally distributed, without any defined structure. A valid method response model provides the deepest understanding of how a method response, such as resolution, is affected by critical method parameters.
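For readers who want to reproduce this kind of analysis outside Fusion QbD, the sketch below fits a quadratic response-surface model to a simulated two-factor subset (flow rate and final percentage of strong solvent) and inspects the ANOVA table and R-squared values; the data, factor names, and effect sizes are invented for illustration and do not represent this study's results.

```python
# Minimal sketch: quadratic response-surface regression with ANOVA,
# using simulated data for two of the four factors. Factor names and
# effect sizes are hypothetical; Fusion QbD's exact workflow differs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 30
df = pd.DataFrame({
    "flow": rng.uniform(0.7, 1.1, n),    # flow rate, mL/min
    "pct_b": rng.uniform(31, 39, n),     # final % strong solvent
})
# Simulated response: retention time with linear, quadratic, and interaction effects
df["rt"] = (30 - 8 * df["flow"] - 0.5 * df["pct_b"]
            + 0.02 * df["pct_b"] ** 2 + 0.3 * df["flow"] * df["pct_b"]
            + rng.normal(0, 0.2, n))

model = smf.ols("rt ~ flow + pct_b + I(flow**2) + I(pct_b**2) + flow:pct_b", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # term-level ANOVA
print(f"R-squared: {model.rsquared:.3f}, adjusted: {model.rsquared_adj:.3f}")
```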

Since Fusion relies on models for chemistry screening, mean optimization, and robustness optimization, it is critical to holistically evaluate each method response model against all relevant model regression statistics to assure model validity before multiple-response method optimization. Inappropriate models will lead to poor prediction and non-robust methods. This paper describes the holistic evaluation approach used to develop a robust chromatographic method with Fusion QbD.

Representative chromatogram under nominal conditions

Careful planning and pre-runs executed at select star points allowed for successful execution of the DOE, with all expected peaks eluting within the run time for all 30 runs. A representative chromatogram at the nominal conditions is shown in Fig. 1. The API peak (C111229929-C) and the Epimer peak (C111229929-C-epimer) can be seen, as well as seven minor impurity peaks, among which impurity 2 and impurity 3 elute at 8.90 and 10.51 min, respectively. The inset shows the full-scale chromatogram.

figure 1

A representative chromatogram under nominal conditions

Results for statistical validation of the DOE models

ANOVA and regression data analysis revealed many DOE models for various peak responses. The major numeric regression statistics of the peak response models are summarized in Table 2 .

MSR (mean square regression), MSR adjusted, and MS-LOF (mean square lack of fit) are major numeric statistics for validating a DOE model. A model is statistically significant when the MSR ≥ the MSR significance threshold, which is the 0.0500 probability value for statistical significance. The lack of fit of a model is not statistically significant when the MS-LOF ≤ the MS-LOF significance threshold, which is also the 0.0500 probability value for statistical significance. The MSR adjusted statistic is the MSR adjusted with the number of terms in the model to assure a new term improves the model fit more than expected by chance alone. For a valid model, the MSR adjusted is always smaller than the MSR and the difference is usually very small, unless too many terms are used in the model or the sample size is too small.
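The MSR and MS-LOF statistics reported by Fusion QbD are software-specific summaries, but the underlying idea of comparing lack of fit against pure error from replicated runs is standard. The sketch below shows a generic lack-of-fit F-test; the column names, the fitted values, and the parameter count are assumptions for illustration, and this is the textbook test rather than Fusion QbD's proprietary statistics.

```python
# Minimal sketch: classical lack-of-fit F-test using pure error from
# replicated design points. Column names, fitted values, and the number of
# model parameters are assumed inputs; not Fusion QbD's MSR/MS-LOF statistics.
import numpy as np
import pandas as pd
from scipy import stats

def lack_of_fit_test(df: pd.DataFrame, factors: list, response: str,
                     fitted: np.ndarray, n_params: int):
    """Return (F statistic, p-value) for lack of fit versus pure error."""
    resid = df[response].to_numpy() - fitted
    sse = float(np.sum(resid ** 2))                                   # total residual SS
    grouped = df.groupby(factors)[response]
    ss_pe = float(((df[response] - grouped.transform("mean")) ** 2).sum())  # pure error SS
    m, n = grouped.ngroups, len(df)                                   # distinct points, total runs
    df_pe, df_lof = n - m, m - n_params
    ms_pe = ss_pe / df_pe
    ms_lof = (sse - ss_pe) / df_lof
    f_stat = ms_lof / ms_pe
    return f_stat, stats.f.sf(f_stat, df_lof, df_pe)
```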

Model Term Ranking Pareto Charts for scientific validation of DOE models

DOE models are calculated from standardized variable level settings. Scientific validation of a DOE model through mechanistic understanding can be challenging when data transformation before modeling ostensibly inverts the positive and negative nature of the model term effect. To overcome this challenge, Model Term Ranking Pareto Charts that provide the detailed effects of each term in a model were employed. See Fig. 2 for details.

figure 2

Model Term Ranking Pareto Charts. Top row, from left to right: API area (default), Epimer area (default), API plate count. Middle row, from left to right: API RT, Epimer RT, impurity 2 RT. Bottom row, from left to right: impurity 3 RT, # of peaks, # of peaks with ≥ 1.5 resolution

The chart presents all terms of a model in descending order (left to right) based on the absolute magnitude of their effects. The primary y-axis (model term effect) gives the absolute magnitude of individual model terms, while the secondary y-axis (cumulative percentage) gives the cumulative relative percentage effects of all model terms. Blue bars correspond to terms with a positive effect, while gray bars correspond to those with a negative effect. The Model Term Ranking Pareto Charts for all models are summarized in Fig. 2, except for the two “customer” peak area models with a single term and the two Cpk models.
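A chart of this kind can be reproduced with any plotting library; the sketch below builds a simple model-term Pareto chart from hypothetical signed effects, coloring positive and negative effects differently and overlaying the cumulative percentage.

```python
# Minimal sketch: Pareto chart of model term effects with a cumulative
# percentage line. Term names and effect values are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

terms = ["B", "A", "B^2", "C", "A*B"]
effects = np.array([-2.4, -1.6, 0.9, 0.5, -0.3])   # signed standardized effects

order = np.argsort(-np.abs(effects))               # descending by magnitude
terms = [terms[i] for i in order]
effects = effects[order]
cum_pct = 100 * np.cumsum(np.abs(effects)) / np.abs(effects).sum()

fig, ax1 = plt.subplots()
colors = ["tab:blue" if e > 0 else "tab:gray" for e in effects]
ax1.bar(terms, np.abs(effects), color=colors)
ax1.set_ylabel("Model term effect (absolute)")
ax2 = ax1.twinx()
ax2.plot(terms, cum_pct, marker="o", color="tab:red")
ax2.set_ylabel("Cumulative percentage")
plt.title("Model term ranking Pareto chart (illustrative)")
plt.show()
```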

AQbD relies on models for efficient and effective chemistry screening, mean optimization, and robustness optimization of chromatographic methods. It is critical to “validate” the models both statistically and scientifically, as inappropriate models may lead to impractical methods. As such, this section will discuss statistical and scientific validation of the DOE models. After the models were fully validated for all selected individual method responses, the method MODR was substantiated by balancing and compromising among the most important method responses.

Statistical validation of the DOE models

As shown in Table 2, the MSR values ranged from 0.7928 to 0.9999. All MSR values were much higher than their respective MSR thresholds, which ranged from 0.0006 to 0.0711, indicating that all models were statistically significant and explained the corresponding chromatographic response data. The MSR adjusted values were all smaller than their respective MSR values, and the difference between the two was always very small (the largest difference was 0.0195, for the API plate count model), indicating that there was no overfitting of the models. There was a slight lack of fit for the two customer models due to very low pure errors, and the MS-LOF could not be calculated for the two Cpk models because the Monte Carlo simulation gives essentially zero pure error. For all other models, the MS-LOF ≤ the MS-LOF significance threshold, indicating that the lack of fit was not statistically significant.

In addition to the above numeric statistical validation, various model residual plots were employed for graphical statistical model validation. The parameter–residual plots and the run number-residual plots for all models showed no defined structure, indicating random residual distribution. The normal probability plots showed all residual points lay in a nearly straight line for each single model, indicating normal residual distribution for all models. The randomly and normally distributed residuals provided the primary graphical statistical validation of the DOE models. See Fig. 3 for representative residuals plots for the “# of Peaks” model.

figure 3

Representative residual plots for the “# of Peaks” model. Upper: run number–residual plot; lower: residual normal probability plot
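Equivalent diagnostics can be generated with SciPy and Matplotlib; the sketch below plots residuals against run order and a normal probability plot for a hypothetical residual vector (in practice, the residuals of the fitted model would be used).

```python
# Minimal sketch: run-order residual plot and normal probability plot.
# Residuals here are simulated; in practice use the fitted model's residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(0, 0.2, size=30)      # hypothetical model residuals
run_order = np.arange(1, residuals.size + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(run_order, residuals)
ax1.axhline(0, linestyle="--", color="gray")
ax1.set_xlabel("Run number")
ax1.set_ylabel("Residual")

stats.probplot(residuals, dist="norm", plot=ax2)   # normal probability plot
ax2.set_title("Normal probability plot")
plt.tight_layout()
plt.show()
```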

Scientific validation of the DOE models

With all models statistically validated, the discussions below will focus on scientific validation of the models by mechanistic understanding.

Peak area models for API and Epimer peaks

Peak areas and peak heights have been used for chromatographic quantification. However, peak area was chosen as the preferred approach as it is less sensitive to peak distortions such as broadening, fronting, and tailing, which can cause significant variation in analyte quantitation. To use peak area to reliably quantify the analyte within the MODR of a robust chromatographic method, the peak area must remain stable with consistent analyte injections.

Peak area models can be critical to method development and validation with a multivariate DOE approach. Solid peak area models were revealed for the API and Epimer peaks in this study. See the “API (default)” and “Epimer (default)” rows in Table 2 for the detailed model regression statistics, Fig. 2 for the Model Term Ranking Pareto Charts, and Eqs. 2 and 3 below for the detailed models.

Although a full quadratic model for the four-factor five-level CCC design has a total of fourteen potential terms, multivariate regression analyses revealed that only two of the fourteen terms are statistically significant for both the API and Epimer peak area models. In addition, the flow rate and flow rate-squared terms are identical for the two models, indicating that the other three parameters (final percentage of strong solvent, oven temperature, and EDTA concentration) have no significant effect on peak area for either peak.

Oven temperature and EDTA concentration have negligible effect on peak area and thus were not significant terms in the peak area models. The percentage of strong solvent was also not a significant term in the peak area models even though it did appear to influence peak height almost as much as flow rate, but not the peak area, as seen in Fig. 4 . It was hypothesized that the two flow rate terms in the model consisted of a strong negative first order term and a weak positive second order term, but more investigation was needed.

figure 4

Effects of final percentage of strong solvent and flow rate on the API peak area and peak height: run 15 (black) = 31%/0.9 mL/min; run 11 (red) = 33%/1.0 mL/min; run 19 (blue) = 35%/0.9 mL/min; run 9 (green) = 37%/1.0 mL/min; run 16 (purple) = 39%/0.9 mL/min

Peak purity and peak integration are the primary factors affecting peak area. Partial or total peak overlap (resolution < 1.5) due to analyte co-elution can impact the peak purity resulting in inaccurate integration of both peaks. Peak integration may also be affected by unstable baseline and/or peak fronting and tailing due to uncertainty in determining peak start and end points. In this DOE study, the API and Epimer peaks were consistently well-resolved (resolution ≥ 2.0) and were also significantly higher than the limit of quantitation, contributing to the strong peak area models. In contrast, no appropriate peak area models could be developed for other impurity peaks as they were either not properly resolved or were too close to the limit of quantitation. For peaks with resolution ≤ 1.0 there will likely never be an area model with reliable predictivity as the peak area cannot be consistently and accurately measured.

The importance of a mechanistic understanding of the DOE models for AQbD has been extensively discussed. The API and Epimer peak area models were very similar in that they both contained a strong negative first order flow rate term and a weak positive second order flow rate term.

The strong negative first-order term can be explained by the exposure time of the analyte molecules to the detector. The UV detector used in the LC method is non-destructive and concentration sensitive. Analyte molecules send signals to the detector when exposed to UV light while flowing through the fixed-length detection window in a band. As the molecules are not degraded by the UV light, the slower the flow rate, the longer the analyte molecules are exposed to the UV light, allowing for increased signal to the detector and thus increased analyte peak area. Simple linear regression of the peak area against inverse flow rate confirmed that both the API and Epimer peak areas were proportional to the inverse flow rate, with R-squared values ≥ 0.99 (data not included).
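The confirmation described above amounts to a simple linear regression of peak area on inverse flow rate; a sketch with hypothetical values is shown below.

```python
# Minimal sketch: linear regression of peak area against inverse flow rate.
# Flow rates and areas are hypothetical illustration values.
import numpy as np

flow = np.array([0.7, 0.8, 0.9, 1.0, 1.1])           # mL/min
area = np.array([143.0, 125.5, 111.2, 100.3, 91.0])  # arbitrary area units, ~ k / flow

slope, intercept = np.polyfit(1 / flow, area, deg=1)
pred = slope / flow + intercept
r2 = 1 - np.sum((area - pred) ** 2) / np.sum((area - area.mean()) ** 2)
print(f"area ~ {slope:.1f} * (1/flow) + {intercept:.1f},  R-squared = {r2:.3f}")
```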

As there was no obvious mechanistic explanation of the weak positive second order term in the models, more investigation was needed. Multivariate DOE customer models were pursued. The acquired customer models, listed in Eqs. 4 and 5 , used inverse flow rate “1/A” in place of the flow rate “A” for all pertinent terms among the fourteen terms. The major model regression statistics of the customer models are summarized in the “API (customer)” and “Epimer (customer)” rows in Table 2 . Both customer models contain a single inverse flow rate term, confirming the negative effect of flow rate on peak area for both peaks. The customer models in Eqs. 4 and 5 provide more intuitive understanding of the flow rate effects on peak area than the “default” models in Eqs. 2 and 3 . The weak positive second order flow rate term in Eqs. 2 and 3 contributes less than 15% effect to the peak area and is very challenging to explain mechanistically. This kind of model term replacing technique may be of general value when using DOE to explore and discover new scientific theory, including new chromatographic theory.

Additionally, the peak area models in Eqs. 2 – 5 revealed that the pump flow rate must be very consistent among all injections during a quantitative chromatographic sequence. Currently, the best-in-industry flow rate precision for a binary UPLC pump is “< 0.05% RSD or < 0.01 min SD” (Thermo Fisher Scientific, Vanquish Pump Specification. 2021 ).

API peak plate count model

Column plate count is potentially useful in DOE modeling as it is a key parameter used in all modes of chromatography for measuring and controlling column efficiency to assure separation of the analytes. The equation for plate count (N) is shown below. It is calculated using the peak retention time (t_r) and the peak width at half height (w_1/2) to mitigate any baseline effects and provide a more reliable response for modeling-based QbD chromatographic method development.
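The plate count equation itself did not carry over into this version; the standard half-height formula, which appears to be the one intended here, is:

$$N = 5.54\left(\frac{t_r}{w_{1/2}}\right)^{2}$$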

The plate count model for the API peak can be seen in Eq. 6. It was developed by reducing the full set of fourteen candidate terms. The major model quality attributes are summarized in Table 2.

The flow rate was not a critical factor in the plate count model. This seemingly goes against the Van Deemter equation (van Deemter et al. 1956 ), which states that flow rate directly affects column plate height and thus plate count. However, the missing flow rate term can be rationalized by the LC column that was used. According to the Van Deemter equation, plate height for the 150 × 4.6 mm, 3.5 μm column will remain flat at a minimum level within the 0.7–1.1 mL/min flow rate range used in this DOE study (Altiero 2018 ). As plate count is inversely proportional to the plate height, it will also remain flat at a maximal level within the 0.7–1.1 mL/min flow rate range.

The most dominant parameter in the API plate count model was the final percentage of strong solvent. Its two terms, B and B², provided more than 60% of the positive effect on the plate count response (see the Model Term Ranking Pareto Chart in Fig. 2) and can be readily explained by the inverse relationship between plate count and peak width when the gradient slope is increased.

Retention time models

Retention time (RT) and peak width are the primary attributes of a chromatographic peak. They are used to calculate secondary attributes such as resolution, plate count, and tailing. These peak attributes together define the overall quality of the separation and, subsequently, the quantification of the analytes. RT is determined using all data points on a peak and is thus a more reliable measurand than peak width, which uses only some data points on a peak. As such, peak width cannot be measured with the same level of accuracy as RT, especially for minor peaks, owing to uncertainty in determining the peak start, end, and height. Consequently, RT is the most reliably measured peak attribute.

The reliability of the RT measurement was confirmed in this DOE study. As listed in Table 2, well-fitted RT models were acquired for the major API and Epimer peaks as well as the minor impurity 2 and impurity 3 peaks. The retention time models are listed in Eqs. 7–10 (note: reciprocal-square transformation of the Epimer and impurity 2 RT data, and reciprocal transformation of the impurity 3 RT data, before modeling inverted the positive and negative nature of the model term effects in Eqs. 8–10; see the Model Term Ranking Pareto Charts in Fig. 2 for the actual effects). The four models shared three common terms: flow rate, final percentage of strong solvent, and the square of the final percentage of strong solvent. These three terms contributed more than 90% of the effect in all four RT models. Furthermore, in all four models the flow rate and final percentage of strong solvent terms consistently produced a negative effect on RT, whereas the square of the final percentage of strong solvent term consistently produced a positive effect. While the scientific rationale for the negative effects of the first two terms is well established, the rationale for the positive effect of the third term lies beyond the scope of this study.

As RT is typically the most reliably measured peak response, it produces the most reliable models. One potential shortcoming of RT modeling-based method optimization is that the resolution of two neighboring peaks is affected not only by retention time, but also by peak width and peak shape, such as peak fronting and tailing.

Peak number models

A representative analytical sample is critical when using DOE within AQbD to develop a chromatographic method capable of resolving all potential related substances. Chromatograms of a forced degradation sample acquired across a multivariate DOE may contain many minor peaks, which may elute in different orders across the different runs of the study, making tracking of the individual peaks nearly impossible. One way to solve this problem is to focus on the number of peaks observed, instead of tracking individual peaks. Furthermore, to avoid an impractical method with too many partially resolved peaks, the number of peaks with ≥ 1.5 resolution can be an alternative response for modeling.

Excellent models were acquired for both the number of peaks response and the number of peaks with ≥ 1.5 resolution response in this DOE study. See Table 2 for the major model statistics, Fig. 2 for the Model Term Ranking Pareto Charts, and Eqs. 11 and 12 for the detailed models (note: reciprocal-square data transformation before modeling reversed the positive and negative nature of the model term effects in Eqs. 11–12; see the Model Term Ranking Pareto Charts in Fig. 2 for the actual effects). Of the fourteen terms, only four were statistically significant for the peak number model and only three were statistically significant for the resolved peak number model. Additionally, it is notable that the two models share three common terms (final percentage of strong solvent ( B ), flow rate ( A ), and oven temperature ( C )), and the order of impact for the three terms is maintained as ( B ) > ( A ) > ( C ), as seen in the Model Term Ranking Pareto Chart. The models indicated that, within the evaluated ranges, the final percentage of strong solvent and flow rate have negative effects on the overall separation, while column temperature has a positive effect. These observations align well with chromatographic scientific principles.

Challenges and solutions to peak resolution modeling

No appropriate model was found for the API peak resolution response in this study, possibly because of very high pure experimental error (34.2%) based on the replicate runs. With this elevated level of resolution measurement error, only large effects of the experimental variables would be discernible from an analysis of the resolution data. There are several potential sources of the high pure experimental error: (1) error in determining the resolution value in each DOE run, especially with small peak sizes or tailing of the reference impurity peaks; (2) the use of different reference peaks to calculate the resolution when the elution order shifts between DOE runs; and (3) insufficient column re-equilibration between different conditions (note: mention of column equilibration is hypothetical here and serves only to stress the importance of column conditioning during DOE in general; because Fusion QbD automatically inserts conditioning runs into the DOE sequence where needed, this was not an issue in this case study). The respective solutions to these challenges are: (1) when reference materials are available, prepare a synthetic method-development sample containing each analyte at a concentration at least ten times the limit of quantitation; (2) keep the analyte concentrations in the synthetic sample at distinguishably different levels so that peaks can be tracked by size; and (3) allow enough time for the column to re-equilibrate between different conditions.

Method robustness evaluation and optimization by Monte Carlo simulation

The robustness of a method is a measure of its capacity to remain unaffected by small but deliberate variations in method parameters; it provides an indication of the method's reliability during normal usage. Traditionally, robustness has been demonstrated for critical method responses by running system suitability checks in which selected method parameters are changed one factor at a time (OFAT). In comparison, the AQbD approach quantifies method robustness with process robustness indices, such as Cp and Cpk, through a multivariate robustness DOE in which critical method parameters are varied systematically and simultaneously. Process robustness indices are standard statistical process control metrics widely used to quantify and evaluate process and product variation. In this AQbD case study, method capability indices were calculated to compare the variability of a chromatographic method response against its specification limits. The comparison is made by forming the ratio between the spread of the response specification and the spread of the response values, measured as six times the standard deviation of the response. The spread of the response values is acquired through tens of thousands of virtual Monte Carlo simulation runs of the corresponding response model, with all critical method parameters varied around their set points randomly and simultaneously according to specified distributions. A method with a process capability of ≥1.33 is considered robust, as it will fail to meet the response specifications only 63 times out of a million runs and is thus capable of providing much more reliable measurements for informed decisions on drug development, manufacturing, and quality control. Because of its intrinsic advantages over the OFAT approach, multivariate DOE robustness evaluation is recommended in place of the OFAT approach in the latest regulatory guidance (FDA Guidance for industry: analytical procedures and methods validation for drugs and biologics, 2015).
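The following sketch illustrates the Monte Carlo capability calculation in miniature: a hypothetical response model is evaluated over many virtual runs with the method parameters drawn from normal distributions centered on their set points, and a one-sided Cpk is formed from the simulated spread and an assumed specification limit. The model coefficients, parameter distributions, and limit are placeholders and do not reproduce the Fusion QbD simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                   # number of virtual Monte Carlo runs

# Hypothetical response model: number of peaks with Rs >= 1.5 as a function of
# flow rate (A), final % strong solvent (B), and oven temperature (C).
def n_resolved(A, B, C):
    return 30.0 - 8.0 * A - 0.35 * B + 0.15 * C

# Method parameters varied randomly around their set points (mean, std).
A = rng.normal(0.78, 0.01, n)                 # pump flow rate, mL/min
B = rng.normal(34.2, 0.30, n)                 # final % strong solvent
C = rng.normal(30.8, 0.50, n)                 # oven temperature, deg C

y = n_resolved(A, B, C)

# One-sided lower specification: at least 15 resolved peaks (illustrative limit).
LSL = 15.0
cpk = (y.mean() - LSL) / (3.0 * y.std(ddof=1))
print(f"simulated mean = {y.mean():.2f}, std = {y.std(ddof=1):.3f}, Cpk = {cpk:.2f}")
```

A simulated Cpk comfortably above 1.33 indicates that the response would meet the robustness threshold cited above despite routine drift in the method parameters.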

In this DOE study, solid Cpk models were produced for the “API Plate Count” and “Number of Peaks ≥ 1.5 USP Resolution” responses. See Table 2 for the detailed model regression statistics.

Multiple responses method optimization

Once models have been established for the selected individual method responses, overall method evaluation and optimization can be performed. This is usually accomplished by balancing trade-offs among multiple method responses. Three principles should be followed when selecting the method responses to include in the final optimization: (1) the selected response is critical to achieving the goal (see Table 4); (2) a response is included only when its model is of sufficiently high quality to meet the validation goals; and (3) the total number of responses included is kept to a minimum.

Following these three principles, five method responses were selected for the overall method evaluation and optimization. The best-overall-answer search identified a new optimized method with the four critical method parameters set to the values listed in Table 3. The cumulative desirability across the five method response goals reached the maximum value of 1.0, and the desirability for each individual goal also reached the maximum value of 1.0, as listed in Table 4.
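The desirability-based search can be illustrated with a small sketch: each response is mapped onto a 0–1 desirability scale and the geometric mean of the individual desirabilities is maximized over a grid of candidate parameter settings. The two response models, their desirability ranges, and the grid are hypothetical; the study itself used the Fusion QbD best-overall-answer search across five responses.

```python
import numpy as np

# Hypothetical response models over two method parameters (flow rate A, final %B).
def n_peaks(A, B):            # total peak count (maximize)
    return 32.0 - 6.0 * A - 0.30 * B

def api_plate_count(A, B):    # plate count (maximize)
    return 9000.0 - 2500.0 * A + 40.0 * B

def desirability(value, low, high):
    """Linear Derringer-type desirability: 0 at or below 'low', 1 at or above 'high'."""
    return float(np.clip((value - low) / (high - low), 0.0, 1.0))

best = (None, -1.0)
for A in np.linspace(0.6, 1.0, 41):           # grid search over the studied ranges
    for B in np.linspace(30.0, 40.0, 41):
        d1 = desirability(n_peaks(A, B), 18.0, 22.0)
        d2 = desirability(api_plate_count(A, B), 6000.0, 8000.0)
        overall = (d1 * d2) ** 0.5            # geometric mean of individual desirabilities
        if overall > best[1]:
            best = ((round(A, 3), round(B, 2)), overall)

print("best (A, B):", best[0], "cumulative desirability:", round(best[1], 3))
```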

Method Operable Design Region (MODR)

The critical method parameter settings in Table 3 define a single method that simultaneously fulfills all five targeted method goals listed in Table 4 to the best extent possible. However, the actual operational values of the four critical parameters may drift around their set points during routine method execution. Based on the models, contour plots can be created for each method response to reveal how the response value changes as the method parameters drift. Furthermore, overlaying the contour plots of all selected method responses reveals the MODR, as shown in Figs. 4, 5, and 6. Note that for each response, a single unique color is used to shade the region of the graph where the response fails its criterion; thus, the criteria for all responses are met in the unshaded area.
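In code, an overlay of this kind amounts to evaluating each response model on a grid of parameter values and flagging the points that fail any acceptance criterion; the unflagged points approximate the MODR. The sketch below uses two hypothetical response models and acceptance limits purely for illustration.

```python
import numpy as np

# Hypothetical response models over flow rate (A) and final % strong solvent (B).
def n_peaks(A, B):
    return 32.0 - 6.0 * A - 0.30 * B

def n_resolved(A, B):
    return 32.0 - 8.0 * A - 0.35 * B

A, B = np.meshgrid(np.linspace(0.6, 1.0, 200), np.linspace(30.0, 40.0, 200))

# A grid point belongs to the MODR only if every response meets its criterion.
passes = (n_peaks(A, B) >= 18.0) & (n_resolved(A, B) >= 15.0)

frac = passes.mean()
print(f"{frac:.1%} of the studied (flow rate, %B) region satisfies all criteria")
# With matplotlib available, plt.contourf(A, B, passes) would shade the failing region,
# leaving the MODR unshaded, analogous to Figs. 5-7.
```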

Figure 5.

Trellis overlay graph shows how the size of the MODR (unshaded area) changes as the four method parameters change

Figure 6.

Single overlay graph shows the original as-is method at point T is not robust (pump flow rate = 0.90 mL/min; final % strong solvent = 35%; oven temperature = 30 °C; EDTA concentration = 0.50 mM)

The Trellis overlay graph in Fig. 5 reveals the MODR from the perspective of all four critical method parameters: flow rate and final percentage of strong solvent vary continuously, while oven temperature and EDTA additive concentration were each set at three levels. Figure 5 clearly demonstrates how the size of the MODR changes with the four method parameters. The single overlay graph in Fig. 6 shows that the original as-is method (represented by the center point T) is on the edge of failure for two method responses, number of peaks (red) and number of peaks ≥1.5 resolution (blue), indicating that the original method is not robust. Conversely, point T in the single overlay graph in Fig. 7 lies at the center of a relatively large unshaded area, indicating that this method is much more robust than the original.

Figure 7.

Single overlay graph shows a much more robust method at point T (pump flow rate = 0.78 mL/min; final % strong solvent = 34.2%; oven temperature = 30.8 °C; EDTA concentration = 0.42 mM)

Through the collaboration of regulatory authorities and industry, AQbD has become the new paradigm for developing robust chromatographic methods in the pharmaceutical industry. It uses a systematic approach to understand and control variability and to build robustness into chromatographic methods. This ensures that analytical results remain close to the product's true value and meet the target measurement uncertainty, thus enabling informed decisions on drug development, manufacturing, and quality control.

Multivariate DOE modeling plays an essential role in AQbD and has the potential to elevate chromatographic methods to a robustness level rarely achievable via the traditional OFAT approach. However, as demonstrated in this case study, chromatography science remained the foundation for prioritizing method inputs and responses for the most appropriate DOE design and modeling, and it provided further scientific validation of the statistically validated DOE models. Once the models were fully validated for all selected individual method responses, the MODR was established by balancing trade-offs among the most important method responses.

Developing a MODR is critical for laboratories that transfer in externally sourced chromatographic methods. In this case study, method evaluation using AQbD produced objective data that enabled a deeper understanding of method variability, on the basis of which a more robust method with a much larger MODR was proposed. This in-depth understanding of method variability also paved the way for a much more effective method control strategy. Treating method development and validation as a multivariate, data-driven exercise led to better and more informed decisions regarding the suitability of the method.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

AQbD: Analytical quality by design
DOE: Design of experiments
CCC: Circumscribed central composite
MODR: Method Operable Design Region
QbD: Quality by design
ICH: The International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use
CQA: Critical quality attributes
TMU: Target measurement uncertainty
ATP: Analytical target profile
ANOVA: Analysis of variance
EDTA: Ethylenediaminetetraacetic acid
MSR: Mean square regression
MSLOF: Mean square lack of fit
OFAT: One factor at a time

References

Altiero P (2018) Why they matter: an introduction to chromatography equations. Slide 21. https://www.agilent.com/cs/library/eseminars/public/Agilent_Webinar_Why_They_Matter_An_Intro_Chromatography_Equations_Nov262018.pdf. Accessed 13 May 2021

Bezerra MA, Ferreira SLC, Novaes CG, dos Santos AMP, Valasques GS, da Mata Cerqueira UMF, et al (2019) Simultaneous optimization of multiple responses and its application in analytical chemistry – a review. Talanta 194:941–959

Borman P, Chatfield M, Nethercote P, Thompson D, Truman K (2007) The application of quality by design to analytical methods. Pharm Technol 31(12):142–152

Chatterjee S (2012) Design space considerations. AAPS Annual Meeting (CMC Lead for QbD, ONDQA/CDER/FDA)

Debrus B, Guillarme D, Rudaz S (2013) Improved quality-by-design compliant methodology for method development in reversed-phase liquid chromatography. J Pharm Biomed Anal 84:215–223

EURACHEM/CITAC (2015) Setting and using target uncertainty in chemical measurement

FDA (2004) Pharmaceutical cGMPs for the 21st century — a risk-based approach

FDA (2015) Guidance for industry: analytical procedures and methods validation for drugs and biologics

Grangeia HB, Silva C, Simões SP, Reis MS (2020) Quality by design in pharmaceutical manufacturing: a systematic review of current status, challenges and future perspectives. Eur J Pharm Biopharm 147:19–37

ICH Q8(R2) (2009) Pharmaceutical development

ICH Q9 (2005) Quality risk management

ICH Q10 (2008) Pharmaceutical quality system

ICH Q11 (2012) Development and manufacturing of drug substances (chemical entities and biotechnological/biological entities)

ICH Q12 (2019) Technical and regulatory considerations for pharmaceutical product lifecycle management

ICH Q14 (2018) Analytical procedure development and revision of Q2(R1) analytical validation — final concept paper

Leardi R (2009) Experimental design in chemistry: a tutorial. Anal Chim Acta 652(1–2):161–172

Myers RH, Montgomery DC (1995) Response surface methodology: process and product optimization using designed experiments, 2nd edn. John Wiley & Sons, New York, pp 366–404

NIST/SEMATECH e-Handbook of statistical methods (2012a) http://www.itl.nist.gov/div898/handbook/ppc/section1/ppc133.htm. Accessed 13 May 2021

NIST/SEMATECH e-Handbook of statistical methods (2012b) https://www.itl.nist.gov/div898/handbook/pmc/section1/pmc16.htm. Accessed 13 May 2021

NIST/SEMATECH e-Handbook of statistical methods (2012c) https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd44.htm. Accessed 13 May 2021

NIST/SEMATECH e-Handbook of statistical methods (2012d) https://www.itl.nist.gov/div898/handbook/pri/section5/pri52.htm. Accessed 13 May 2021

Orlandini S, Pinzauti S, Furlanetto S (2013) Application of quality by design to the development of analytical separation methods. Anal Bioanal Chem 2:443–450

Schweitzer M, Pohl M, Hanna-Brown M, Nethercote P, Borman P, Hansen P, Smith K, et al (2010) Implications and opportunities for applying QbD principles to analytical measurements. Pharm Technol 34(2):52–59

Tang YB (2011) Quality by design approaches to analytical methods — FDA perspective. AAPS, Washington, DC (FDA/CDER/ONDQA)

Thermo Fisher Scientific (2021) Vanquish pump specification. https://assets.thermofisher.com/TFS-Assets/CMD/Specification-Sheets/ps-73056-vanquish-pumps-ps73056-en.pdf. Accessed 22 May 2021

van Deemter JJ, Zuiderweg FJ, Klinkenberg A (1956) Longitudinal diffusion and resistance to mass transfer as causes of non-ideality in chromatography

Volta e Sousa L, Gonçalves R, Menezes JC, Ramos A (2021) Analytical method lifecycle management in pharmaceutical industry: a review. AAPS PharmSciTech 22(3):128–141. https://doi.org/10.1208/s12249-021-01960-9


Acknowledgements

The authors would like to thank KBP Biosciences for reviewing and giving permission to publish this case study. They would also like to thank Thermo Fisher Scientific and S Matrix for the Fusion QbD software, Lynette Bueno Perez for solution preparations, Dr. Michael Goedecke for statistical review, and both Barry Gujral and Francis Vazquez for their overall support.

Funding

Not applicable; the authors contributed case studies based on existing company knowledge and experience.

Author information

Authors and Affiliations

Thermo Fisher Scientific Inc., Durham, NC, USA

Yongzhi Dong, Zhimin Liu, Charles Li, Emily Pinter, Alan Potts, Tanya Tadey & William Weiser


Contributions

YD designed the study and performed the data analysis. ZL was the primary scientist who executed the study. CL, EP, AP, TT, and WW contributed ideas and information to the study and reviewed and approved the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yongzhi Dong .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Dong, Y., Liu, Z., Li, C. et al. Perspectives in modeling and model validation during analytical quality by design chromatographic method evaluation: a case study. AAPS Open 7 , 3 (2021). https://doi.org/10.1186/s41120-021-00037-y


Received : 27 May 2021

Accepted : 29 July 2021

Published : 01 September 2021

DOI : https://doi.org/10.1186/s41120-021-00037-y


Keywords

  • Statistical model validation
  • Scientific model validation
  • Multiple responses optimization


A case study for in-house method validation of gas chromatography technique using class-1 calibration gas mixtures for greenhouse gases monitoring

  • Published: 15 September 2023
  • Volume 28, pages 209–220 (2023)


  • Komal 1 , 2 ,
  • Daya Soni 1 , 2 &
  • Shankar G. Aggarwal 1 , 2  



In this study, analytical method validation was performed for the measurement of carbon dioxide/nitrogen (CO2/N2) and methane/nitrogen (CH4/N2) calibration gas mixtures using gas chromatography with a flame ionization detector (GC-FID). Class-1 calibration gas mixtures (CGMs) of CO2 (500 µmol mol⁻¹ to 1100 µmol mol⁻¹) and CH4 (2 µmol mol⁻¹ to 130 µmol mol⁻¹) used in the validation were prepared gravimetrically following ISO 6142-1. All prepared gas mixtures have an expanded uncertainty of 1 % at a coverage factor (k) of 2, corresponding to 95 % confidence. The parameters chosen for this case study were selectivity, accuracy, precision, linearity, limit of detection (LOD), limit of quantification (LOQ), robustness, stability, and uncertainty, and a suitable statistical approach was applied to assess each parameter. The results indicate that GC-FID is selective for CO2 and CH4. The CGM measurements show good repeatability and reproducibility, with a percentage relative deviation < 1 %. Good linear behaviour was observed for the CH4 and CO2 CGMs on the basis of least-squares regression, with R² values of 0.9995 and 1, respectively. The LOD and LOQ for CH4 were calculated as 0.47 and 1.59 µmol mol⁻¹, respectively, from the signal-to-noise ratio at the lowest concentration of 2.9 µmol mol⁻¹. The in-house validated GC-FID method using CGMs for the measurement of greenhouse gases (CO2 and CH4) was found to be precise, accurate, and fit for purpose.
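The core calculations summarized in this abstract (least-squares linearity and S/N-based detection limits) can be sketched as follows. The calibration data are placeholders, and the convention of scaling the lowest standard to S/N = 3 for LOD and S/N = 10 for LOQ is one common approach, not necessarily the authors' exact procedure.

```python
import numpy as np

# Hypothetical CH4 calibration data: concentration (umol/mol) vs. mean peak area.
conc = np.array([2.9, 10.0, 40.0, 80.0, 130.0])
area = np.array([145.0, 498.0, 2010.0, 4025.0, 6540.0])

# Least-squares linearity check.
slope, intercept = np.polyfit(conc, area, 1)
pred = slope * conc + intercept
r2 = 1.0 - np.sum((area - pred) ** 2) / np.sum((area - area.mean()) ** 2)

# LOD/LOQ from the signal-to-noise ratio observed at the lowest standard.
sn_lowest = 18.0                     # assumed S/N at 2.9 umol/mol
lod = 3.0 * conc[0] / sn_lowest      # concentration giving S/N ~ 3
loq = 10.0 * conc[0] / sn_lowest     # concentration giving S/N ~ 10

print(f"slope = {slope:.2f}, R^2 = {r2:.5f}")
print(f"LOD ~ {lod:.2f} umol/mol, LOQ ~ {loq:.2f} umol/mol")
```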


Data availability

The data supporting the findings of this study are available from the corresponding author on request.

References

Dai A (2011) Drought under global warming: a review. Wiley Interdiscip Rev Clim Change 2(1):45–65. https://doi.org/10.1002/WCC.81


Emanuel K (2011) Global warming effects on US hurricane damage. Weather Clim Soc 3(4):261–268. https://doi.org/10.1175/WCAS-D-11-00007.1

Perera F (2018) Pollution from Fossil-Fuel Combustion is the Leading Environmental Threat to Global Pediatric Health and Equity: Solutions Exist. Int J Environ Res Public Health. https://doi.org/10.3390/IJERPH15010016


Smith ZA (2017) The environmental policy paradox. https://doi.org/10.4324/9781315623641

Yusuf RO, Noor ZZ, Abba AH, Hassan MAA, Din MFM (2012) Methane emission by sectors: a comprehensive review of emission sources and mitigation methods. Renew Sustain Energy Rev 16(7):5059–5070. https://doi.org/10.1016/J.RSER.2012.04.008


Crosson ER (2008) A cavity ring-down analyzer for measuring atmospheric levels of methane, carbon dioxide, and water vapor. Appl Phys B Lasers Opt. https://doi.org/10.1007/S00340-008-3135-Y

Gavrilov NM, Makarova MV, Poberovskii AV, Timofeyev YM (2014) Comparisons of CH 4 ground-based FTIR measurements near Saint Petersburg with GOSAT observations. Atmos Meas Tech 7(4):1003–1010. https://doi.org/10.5194/AMT-7-1003-2014

Kamiński M, Kartanowicz R, Jastrzȩbski D, Kamiński MM (2003) Determination of carbon monoxide, methane and carbon dioxide in refinery hydrogen gases and air by gas chromatography. J Chromatogr A 989(2):277–283. https://doi.org/10.1016/S0021-9673(03)00032-3


Lodge JP (2018) Determination of O 2 , N 2 , CO, CO 2 , and CH 4 (Gas Chromatographic Method). Methods Air Sampl Anal. https://doi.org/10.1201/9780203747407-51

Van Der Laan S, Neubert REM, Meijer HAJ (2009) Atmospheric measurement techniques a single gas chromatograph for accurate atmospheric mixing ratio measurements of CO 2 , CH 4 , N 2 O, SF 6 and CO. Atmos Meas Tech 2:549–559

Weiss RF (1981) Determinations of carbon dioxide and methane by dual catalyst flame ionization chromatography and nitrous oxide by electron capture chromatography. J Chromatogr Sci 19(12):611–616. https://doi.org/10.1093/CHROMSCI/19.12.611

Hilborn JC, Monkman JL (1975) Gas chromatographic analysis of calibration gas mixtures. Sci Total Environ 4(1):97–106. https://doi.org/10.1016/0048-9697(75)90017-0

International conference on harmonisation of technical requirements ICH harmonised tripartite, Guidelines for validation of analytical procedure (1994)

EURACHEM: The fitness for purpose of analytical methods (2014)

Taverniers I, De Loose M, Van Bockstaele E (2004) Trends in quality in the analytical laboratory. II. Analytical method validation and quality assurance. TrAC Trends Anal Chem 23(8):535–552. https://doi.org/10.1016/J.TRAC.2004.04.001

Thompson M, Ellison SLR, Wood R (2002) Resulting from the symposium on harmonization of quality assurance systems for analytical laboratories. Pure Appl Chem 74(5):4–5. https://doi.org/10.1351/pac200274050835

Wenclawiak B, Hadjicostas E (2010) Validation of analytical methods - To be fit for the purpose. Quality Assur Anal Chem Train Teach. https://doi.org/10.1007/978-3-642-13609-2_11/COVER

ISO 6142-1 (2015) Gas analysis—preparation of calibration gas mixtures — Part 1: Gravimetric method for Class I mixtures

CCQM (2019) Mise en pratique - mole - Appendix 2 - SI Brochure

ISO/IEC 17025 (2017) General requirements for the competence of testing and calibration laboratories

Komal, Soni D, Kumari P, Gazal, Singh K, Aggarwal SG (2022) A practical approach of measurement uncertainty evaluation for gravimetrically prepared binary component calibration gas mixture. Mapan J Metrol Soc India, 37(3): 653–664. https://doi.org/10.1007/s12647-022-00600-2

JCGM 100 (2008) Evaluation of measurement data—Guide to the expression of uncertainty in measurement. International Organization for Standardization Geneva

Hong K, Kim BM, Kil Bae H, Lee S, Tshilongo J, Mogale D, Seemane P, Mphamo T, Kadir HA, Ahmad MF, Hidaya N, Nasir A, Baharom N, Soni D, Singh K, Bhat S, Aggarwal SG, Johri P, Kiryong H (2020) Final report international comparison APMP.QM-S7.1 Methane in nitrogen at 2000 μmol/mol. Metrologia. https://doi.org/10.1088/0026-1394/52/1A/08013

Lee J, Lim J, Moon D, Aggarwal SG, Johri P, Soni D, Hui L, Ming KF, Sinweeruthai R, Rattanasombat S, Zuas O, Budiman H, Mulyana MR, Alexandrov V (2021) Final report for supplementary comparison APMP.QM-S15: carbon dioxide in nitrogen at 1000 µmol/mol. Metrologia. https://doi.org/10.1088/0026-1394/58/1A/08014

Ribani M (2004) validation for chromatographic and electrophoretic methods. Quim Nova 27(5):771–780. https://doi.org/10.1590/S0100-40422004000500017

Persson B (2001) The use of selectivity in analytical chemistry. Trends Anla Chem 20(10):526–532. https://doi.org/10.1016/S0165-9936(01)00093-0

Freeman RR, Kukla D (1986) The Role of Selectivity in Gas Chromatography. J Chromatogr Sci 24(9):392–395. https://doi.org/10.1093/CHROMSCI/24.9.392

Foley JP (1991) Resolution equations for column chromatography. Analyst 116(12):1275–1279. https://doi.org/10.1039/AN9911601275

Juradao JM (2017) Some practical considerations for linearity assessment of calibration curves as function of concentration levels according to the fitness for purpose approach. Talanta. https://doi.org/10.1016/j.talanta.2017.05.049

Dorschel CA, Ekmanis JL, Oberholtzer JE, Vincent Warren F, Bidlingmeyer BA (1989) LC detectors: evaluation and practical implications of linearity. Anal Chem 61(17):951A-968A. https://doi.org/10.1021/AC00192A719

Roddam AW (2005) Statistics for the Quality Control Chemistry Laboratory. J R Stat Soc A Stat Soc 168(2):464–464. https://doi.org/10.1111/J.1467-985X.2005.358_13.X

Andrade JM, Gómez-Carracedo MP (2013) Notes on the use of Mandel’s test to check for nonlinearity in laboratory calibrations. Anal Methods 5(5):1145–1149. https://doi.org/10.1039/c2ay26400

Miller JN (1991) Basic statistical methods for analytical chemistry. Part 2. Calibration and regression methods. A review. The Anal 116(1):3–14. https://doi.org/10.1039/AN9911600003

Raposo F (2016) Evaluation of analytical calibration based on least-squares linear regression for instrumental techniques: a tutorial review. TrAC Trends Anal Chem 77:167–185. https://doi.org/10.1016/j.trac.2015.12.006

Rodríguez LC, Campaña AMG, Linares CJ, Ceba MR (1993) Estimation of performance characteristics of an analytical method using the data set of the calibration experiment. Anal Lett 26(6):1243–1258. https://doi.org/10.1080/00032719308019900

Montgomery D, Peck EA, Vining GG (2006) Introduction to linear regression analysis, 4th edn. John Wiley & Sons, New Jersey


Mactaggart DL, Farwell SO (1992) Analytical use of linear regression. Part I: regression procedures for calibration and quantitation. J AOAC Int 75(4):594–608. https://doi.org/10.1093/JAOAC/75.4.594

ISO 5725-1 (2023) Accuracy (trueness and precision) of measurement methods and results—Part 1

ISO 6143 (2001) Gas Analysis—Comparison methods for determining and checking the composition of calibration gas mixture

González AG, Herrador MÁ, Asuero AG (2010) Intra-laboratory assessment of method accuracy (trueness and precision) by using validation standards. Talanta 82(5):1995–1998. https://doi.org/10.1016/J.TALANTA.2010.07.071


Albert R, Horwitz W (1997) A heuristic derivation of the horwitz curve. Anal Chem 69(4):789–790. https://doi.org/10.1021/AC9608376

Horwitz W, Albert R (2006) The Horwitz ratio (HorRat): a useful index of method performance with respect to precision. J AOAC Int 89(4):1095–1109. https://doi.org/10.1093/jaoac/89.4.1095

Shrivastava A, Gupta V (2011) Methods for the determination of limit of detection and limit of quantitation of the analytical methods. Chron Young Sci 2(1):21. https://doi.org/10.4103/2229-5186.79345

Desimoni E, Brunetti B (2015) About estimating the limit of detection by the signal to noise approach. Pharm Anal Acta. https://doi.org/10.4172/2153-2435.1000355

ISO/IEC Guide 99 (2007) International vocabulary of metrology- Basic and general concepts and associated terms (VIM)

Konieczka P, Namieśnik J (2010) Estimating uncertainty in analytical procedures based on chromatographic techniques. J Chromatogr A 1217(6):882–891. https://doi.org/10.1016/j.chroma.2009.03.078

Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Int Biom Soc 2:110–114. https://doi.org/10.2307/3002019

Welch BL (1947) The generalisation of ‘students’ problem when several different population variances are involved. Biom Bull. https://doi.org/10.1093/biomet/34.1-2.28

ISO Guide 35 (2017) Reference materials—Guidance for characterization and assessment of homogeneity and stability


Acknowledgements

The author, Komal, is thankful to the Council of Scientific and Industrial Research (CSIR) for providing a fellowship under the CSIR-SRF scheme (P-81-101). The authors are thankful to the Director, CSIR-NPL, for supporting the gas metrology work, and further extend their thanks to the Head of the ESBM Division and the Gas Metrology group members for their help and support.

Author information

Authors and Affiliations

CSIR-National Physical Laboratory, Dr. KS Krishnan Marg, New Delhi, 110012, India

Komal, Daya Soni & Shankar G. Aggarwal

Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India


Contributions

Ms. Komal performed all the experimental analysis, data curation, and manuscript writing. Dr. Daya Soni supervised the experimental work, manuscript preparation, and data processing. Dr. Shankar G. Aggarwal reviewed the manuscript.

Corresponding author

Correspondence to Daya Soni .

Ethics declarations

Conflict of interest.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Komal, Soni, D. & Aggarwal, S.G. A case study for in-house method validation of gas chromatography technique using class-1 calibration gas mixtures for greenhouse gases monitoring. Accred Qual Assur 28 , 209–220 (2023). https://doi.org/10.1007/s00769-023-01552-z


Received : 01 May 2023

Accepted : 22 July 2023

Published : 15 September 2023

Issue Date : October 2023

DOI : https://doi.org/10.1007/s00769-023-01552-z


Keywords

  • Greenhouse gases
  • Method validation
  • Calibration gas mixture


Columns | March/April 2022

Dementia Insights: The Validation Method for Dementia Care

If you validate someone, you accept them where they are and where they’re not. If you accept them, then they can accept themselves. —Naomi Feil

Vicki de Klerk-Rubin, RN, MBA; and Daniel C. Potts, MD

Physicians and Caregiving

Both as trainees and in practice, physicians receive relatively little education on interacting with and managing the health care of elderly individuals living with cognitive impairment and often find themselves inadequately prepared. 1 Nurses, social workers, gerontologists, health care administrators, and allied health care professionals are more likely than physicians to enroll in the advanced clinical dementia practice educational programs that have been developed. 2

Medical education, being rather insular, includes little to no training from the fields of nursing, social work, gerontology, or the allied health professions. As a result, the theory, methodology, and practice of caregiving are not part of the educational experience of most physicians. Disease-focused, evidence-based paradigms, which are the primary pathway to learning and proficiency across traditional medicine, may undervalue the clinical experience of physicians and nonphysician health care professionals, and thereby the art of healing, marginalizing qualities of caregiving that could foster patient well-being. 3

Dementia Care

Physicians’ resulting limited acumen in the more human aspects of dementia care becomes more problematic considering that cures and effective treatments for Alzheimer disease (AD) and other dementias remain elusive. This, coupled with increasing rates of burnout and career dissatisfaction, 4 contributes to situations in which neurologists may, for reasons of self-preservation, unintentionally avoid engaging with dementia patients, thereby missing opportunities to help them live as well as possible and to provide adequate support for their caregivers.

Atul Gawande, physician author of Being Mortal , 5 writes that, for a clinician, “nothing is more threatening to who you think you are than a patient with a problem you cannot solve.” For this reason, Gawande notes that, often, “medicine fails the people it is supposed to help.”⁵ Acknowledging that solutions are lacking for clinical problems is difficult for physicians and places them in the often-unfamiliar territory of vulnerability. Yet such a state can clear a space for presence, defined in this context by Kleinman as “the intensity of interacting with another human being that animates being there for, and with, that person, a calling forward or a stepping toward the other.” 3 Presence, asserts Kleinman, “is built out of listening intensely, indicating that the person and their story matter,” the ongoing practice of which, he posits, “sustains clinical work over the long and difficult journey of a career in medicine,” and defines the essence of caregiving. 3

In the case of persons living with dementia, presence both requires and increases empathy, as well as an understanding of the complex ways dementia manifests in patients’ lives. Required skills include communicating effectively with persons whose language may have diminished and grasping the essential elements of personhood and life story. With those skills, it becomes possible to find the meaning in the often misunderstood responses of people with dementia to the inner and outer challenges they confront.

The Validation Method

Development of the Validation Method

Such a seasoned, thoughtful, nuanced approach requires access to the wisdom, experience and knowledge of a master teacher. Naomi Feil, iconic founder of the Validation Method (Box), is just such a teacher. Physicians, in general, and neurologists, in particular, should become familiar with the principles and practice of the Validation Method as a means to improve the quality of care they are able to provide elders living with dementia as well as support for their caregivers.

Box. Development of The Validation Method by Naomi Feil, MSW

case study method validation

After graduating with an MSW from Columbia University, Feil returned to the Montefiore home where she had grown up and began working with the special-services residents who were then labeled “senile psychotic” or diagnosed as having organic brain damage. Discontented with reality orientation, remotivation, and other methods that were in vogue in the 1960s and 70s, Feil began experimenting. In 1972, she reported her research and experience at the 25th annual meeting of the Gerontological Society of America. Those promising results were based on her developing new group work with disoriented residents, documented in the film The Tuesday Group .

With her husband, Edward, a documentary film producer, Feil created the classic 1978 film, Looking for Yesterday , which chronicled how the Validation Method worked in practice. A film created for a national television audience, When Generations Meet , showed Feil teaching beginning validation principles and techniques to teenagers who then engaged both oriented and disoriented nursing home residents in conversation. Feil was asked to show these films and present her work, first at the Ohio Department of Aging, and thereafter, nationwide. Often, she could be seen carrying the 16 mm film cans to conferences and spoke from her experience working with the residents.

By 1982, Feil had developed the Validation Method to the point of publishing her first book, Validation: the Feil Method . 6 Also in 1982, the Validation Training Institute (VTI) was founded and became a registered nonprofit organization in 1983. Structure was needed in order to keep the method integral and to avoid ‘wild growth.’ Authorized Validation Organizations (AVOs) became the training centers and certification processes were implemented.

In 1988-89, beginning in the Netherlands and Austria, Feil began doing workshop tours throughout Europe. Her books were translated into almost all European languages. AVOs were formed in Austria, Belgium, France, Germany, the Netherlands, and Switzerland. Since the early 1990s, certification levels have been revised, pedagogically responsible testing has been implemented, and a team of experts oversees all developments of the method and how it is taught. There are over 450 certified Validation Method teachers in the world, 23 AVOs in 13 countries, and over 7,000 people certified in what is now called the Validation Method. VTI found the word ‘therapy’ overly focused on healing and in Europe, there was resistance to using that word. The team of Validation experts agreed that a better description of what Naomi Feil created was, the Validation Method.

Evidence for the Validation Method

Throughout the 1980s, Feil’s presentations of the Validation Method and its positive impacts gained attention in the US and Validation Therapy was becoming more widely known. During that period, research by Alprin, Peoples, Babins, Fritz, Dietch, and Sharp demonstrated that validation in elders with dementia increased communication and positive affect, reduced aggressive behavior, and lowered use of psychotropic medications. 7-14 Additionally, caregivers were noted to feel more capable of handling difficult situations and experience more pleasure in their work.

In more recent research, the Validation Method combined with sensorial reminiscence found significant improvements for behavioral disturbances compared to controls in one study,¹⁵ and another found that Validation was associated with decreased agitation, apathy, irritability, and night-time disturbance.¹⁶

Ongoing research on the Validation Method adds to an emerging body of evidence that eye contact and physical touch, 2 elements of engagement that build trust in caring relationships, and essential components of Validation, have neurochemical effects that are conducive to neuroplasticity and resiliency.¹⁷ A new international research project at sites in The Netherlands, Australia, Israel, and Germany will measure dopamine, cortisol and serotonin before and after Validation sessions, to further elucidate the biological underpinnings of the Method’s observed effects.

Principles of the Validation Method

The Validation Method principles have essentially remained the same since Feil wrote them in 1982:

1. All very elderly people are unique and worthwhile.

2. Disoriented elders should be accepted as they are; we should not try to change them.

3. Listening with empathy builds trust, reduces anxiety, and restores dignity.

4. Painful feelings that are expressed, acknowledged, and validated by a trusted listener will diminish, whereas painful feelings that are ignored or suppressed will gain in strength.

5. There is a reason behind the behavior of very elderly adults with cognitive losses.

6. Basic human needs underlie behavior of disoriented very elderly people and include:

  • to resolve unfinished issues, in order to die in peace;
  • to live in peace;
  • to restore a sense of equilibrium when eyesight, hearing, mobility, and memory fail;
  • to make sense out of an unbearable reality by finding a place that feels comfortable, where one feels in order or in harmony and where relationships are familiar;
  • to be recognized for status, identity, and self-worth;
  • to be useful and productive;
  • to be listened to and respected;
  • to express feelings and be heard;
  • to be loved and to belong;
  • to have human contact;
  • to be nurtured and feel safe and secure, rather than immobilized and restrained;
  • to have sensory stimulation: tactile, visual, auditory, olfactory, gustatory, as well as sexual expression; and
  • to reduce pain and discomfort.

In order to satisfy their needs, people are drawn to the past or are pushed from the present to resolve, retreat, relieve, relive, and express.

7. Early learned behaviors return when verbal ability and recent memory fail.

8. Personal symbols used by disoriented elderly are people or things in the present that represent people, things, or concepts from the past that are laden with emotion.

9. Disoriented elders live on several levels of awareness, often at the same time.

10. When the 5 senses fail, disoriented elderly people stimulate and use their ‘inner senses,’ seeing with their ‘mind’s eye’ and hearing sounds from the past.

11. Events, emotions, colors, sounds, smells, tastes, and images create emotions that trigger similar emotions experienced in the past. Elders react in the present as they did in the past.

If these 11 principles are embodied, contact and communication with elders has been shown to become easier and more joyful for the caregiver and the person receiving care. Persons who are living with cognitive decline gain benefit through this kind of loving contact.

Mindfulness

Mindfulness is inherent to the Validation Method and can build the proper framework for implementing the Validation Method most effectively. Mindfulness has been defined by Jon Kabat-Zinn as “the awareness that emerges through paying attention on purpose, in the present moment, and nonjudgmentally to the unfolding of experience moment by moment.” 18 When the principles of the Validation Method are applied, attention is focused nonjudgmentally on another person, bringing them fully and deeply into the present moment. By acknowledging another person in this way, a trusting relationship is built.

Putting the Validation Method to Use

As shown in the Figure and demonstrated in the Case Studies, the Validation Method can be seen as a tripod of theory, techniques, and attitude, with each leg dependent on the others. If only the basic attitude is used, there will be good contact but little communication. If techniques are used without the validating attitude, they are sterile and often ineffective. Likewise, if attitude and techniques are used without goals, the method often goes wrong. Only when all 3 elements are in place can contact and communication flow easily.

Figure. The Validation Method requires a basic nonjudgmental, empathetic, honest, and respectful attitude in combination with theoretical understanding and application of specific techniques.


CASE STUDY 1. The Validation Method in the Context of Pain and Dementia

GM, age 85, is in clinic complaining of shoulder and arm pain. They look very sad, almost anguished. Serious structural issues have been ruled out with an x-ray. Before ordering further testing, the physician carefully observes their patient and respectfully makes eye contact.

Dr: How are you feeling right now?

GM: Oh doctor, I have such pain. (The doctor takes a moment to match the facial expression of GM and modulates their voice to the same pitch. The physician is sitting at eye-level and making eye contact with a typical, social distance between themself and GM.)

Dr: You look very sad. (The doctor says the emotion, with emotion)

GM: I am sad.

Dr: What makes you sad right now? (Asks an open question)

GM: I miss my spouse. I get so lonely. And my children don’t call me very often. I have to call them.

Dr: Where does it hurt the most? (Asks an open question and emphasizes the extreme)

GM: Here. (pointing to their heart)

Dr: It hurts around your heart? (Rephrases with empathy)

GM: Sometimes there and sometimes in my neck.

Dr: I see. You’re all alone now and you have pain in your heart and neck. Is that it? (Rephrases with empathy)

GM: Exactly. You understand.

(There is now a trusting bond between doctor and patient.)

CASE STUDY 2. Memory Assessment With the Validation Method

Mr J is a 70-year-old veteran who came in because his daughter says he’s losing his memory and becoming ‘demented.’ His doctor observes a slim, athletic, older man. Mr J’s face reflects irritation, and he sits down angrily in the clinic exam room. His history reveals relatively good health outside of age-related aches and pains.

(The doctor takes a deep breath and centers.)

Dr: Good morning Mr J. How are you doing today? (Asks an open question)

Mr J: I’m fine. I don’t know why I’m here.

Dr: Your daughter tells me that you’re having some memory issues.

Mr J: She doesn’t know what she’s talking about .

Dr: Well, what do you think is going on? (Asks an open question)

Mr J: She’s worried that I’m getting too old. I’m strong as a horse!

Dr: Have you always been so strong? (Asks a question emphasizing the extreme)

Mr J: I was in the army since my 18th birthday. The army keeps you strong!

Dr: And now? (Asks an open question)

Mr J: Well, you know how it is. I need glasses now to read.

Dr: Have you noticed other changes?

Mr J: Actually, yes. I have to write things down or I forget them. But that’s normal at my age, right? (His tone of voice softens)

Dr: Yes, it’s normal to lose some short-term memory as you age but I’d like to measure that. Would you be willing to do a few tests with me?

Mr J: It’s always better to know the lay of the land and then make the battle plan.

Dr: Great. Here’s what I’d like to do next…

Practitioners of the Validation Method do not judge or correct false attributions or embellished memories; these are regarded as expressions of needs and feelings to be validated. Such validation enhances the individual’s sense of identity and acceptance. If an elder says they are hungry just after eating a large breakfast, ‘hunger’ is understood to represent something other than a physical need for food. A practitioner might ask, “Where do you feel the hunger? Is it a pang or an ache? What would fill that empty space?”

The following techniques should be learned as prerequisites for effective and empathetic communication.

Centering. Clear away thoughts and feelings that interfere with making contact with others. Centering helps one find moments of peace, gather energy, and set aside one’s own emotions and those absorbed from others.

Observation and Calibration. Exquisitely observe others, picking up clues about their current emotional state (ie, taking their “emotional temperature”). The next step to finding empathy is to match or calibrate yourself to these observations.

Respectful Tone of Voice. By using an adult-to-adult vocal tone, respect is given, helping to build a trusting relationship.

Respectful Eye Contact. Staying at eye level, approaching from the front or slightly diagonally (if that is most appropriate), and using as much direct eye contact as the other person needs in the moment also conveys respect and builds trust. 16

Appropriate Distancing. It is important to discover an appropriate distance because getting too close can create anger or fear, whereas staying too far away can block good communication. Finding the appropriate spot builds a warm connection.

The other verbal and nonverbal validation techniques are drawn from many different sources and are used like herbs when cooking. Recognizing and using the most appropriate techniques for each individual at the moment of contact is the skill set developed in Validation Method training courses.

Conclusion and Additional Resources

The principles and techniques of the Validation Method can help physicians and other health care professionals provide better care to elderly persons who are living with dementia, and also give physicians a helpful resource that may be shared with caregivers and families. Additionally, making use of the Validation Method’s attitudes, theories, and techniques may improve relationships with elderly individuals who are living with dementia and their care partners, thereby bolstering physicians’ satisfaction with their work and helping to combat burnout.

For those interested in learning more about the Validation Method or who are seeking educational or training opportunities, the following resources are available.

  • Validation Training Institute (VTI)
  • VTI YouTube channel
  • Interview with Naomi Feil by neurologist, Daniel C. Potts

1. Lee L, Weston WW, Hillier L, Archibald D, Lee J. Improving family medicine resident training in dementia care: An experiential learning opportunity in primary care collaborative memory clinics. Gerontol Geriatr Educ . 2020;41(4):447-462.

2. Online Certificate in Advanced Clinical Dementia Practice. University of Michigan School of Social Work. Accessed March 1, 2022. https://ssw.umich.edu/offices/continuing-education/certificate-courses/clinical-dementia.

3. Kleinman A. Presence. Lancet . 2017;389(10088):2466-2467. doi: 10.1016/S0140-6736(17)31620-3

4. Busis NA, Shanafelt TD, Keran CM, et al. Burnout, career satisfaction, and well-being among US neurologists in 2016. Neurology. 2017; 88(8):797-808.

5. Gawande A. Being Mortal . Metropolitan Books; 2014.

6. Feil N, de Klerk-Rubin V. Validation: The Feil Method 3rd ed . Edward Feil Productions; 2015.

7. Alprin SI, Feil N. Study to determine results of implementing validation therapy. Unpublished Manuscript. Cleveland State University; 1980.

8. Peoples M. Validation Therapy Versus Reality Orientation as Treatment for the Institutionalized Disoriented Elderly. Thesis. University of Akron. 1982.

9. Babins L. Group Approaches With the Disoriented Elderly: Reality Orientation and Validation Therapies. Master’s Thesis. McGill University; 1985.

10. Babins L. Conceptual analysis of validation therapy. Int J Aging Hum Dev . 1988;26(3):161-168.

11. Babins LH, Dillion JP, Merovitz S. The effects of validation therapy on disoriented elderly. Activities, Adaptation & Aging . 2010;12(1/2):73-86. doi:10.1300/J016v12n01_06.

12. Fritz P. The Language of Resolution Among the Old-Old: The Effects of Validation Therapy on Two Levels of Cognitive Confusion. Presented at Annual Meeting of the Speech Communication Association. November 12-16, 1986. Chicago, IL.

13. Dietch J, Hewett L, Jones S. Adverse effects of reality orientation. J Am Geriatr Soc . 1989;37(10): 974-976.

14. Sharp C. Validation Therapy: An Australian Evaluation. Final Report for Monitoring and Evaluation of the Validation Therapy Plus Programme. For South Port Community Nursing Home. Prepared by P.E.R.S.O.N.A.L. Research and Evaluation Consultancy Pty Ltd (1989)

15. Deponte A, Missan R. Effectiveness of validation therapy (VT) in group: preliminary results. Arch Gerontol Geriatr . 2007;44(2):113-117.

16. Tondi L, Ribani L, Bottazzi M, Viscomi G, Vulcano V. Validation therapy (VT) in nursing home: a case-control study. Arch Gerontol Geriatr . 2007;44 Suppl 1:407-411.

17. Kerr F, Wiechula R, Feo R, Schultz T, Kitson A. Neurophysiology of human touch and eye gaze in therapeutic relationships and healing: a scoping review. JBI Database System Rev Implement Rep . 2019;17(2):209-247. doi:10.11124/JBISRIR-2017-003549

18. Kabat-Zinn J. Mindfulness-based interventions in context: past, present, and future. Clin Psych: Sci Pract . 2003;10(2):144-156. doi:10.1093/clipsy.bpg016

VdKR and DCP report no disclosures


Vicki de Klerk-Rubin, RN, MBA

Executive Director The Validation Training Institute The Hague, Netherlands


Daniel C. Potts, MD

Attending Neurologist, Tuscaloosa VA Medical Center Adjunct Faculty, The University of Alabama Birmingham UAB School of Medicine University of South Alabama College of Medicine Tuscaloosa, AL


National Academies Press: OpenBook

Incorporating Reliability Performance Measures into the Transportation Planning and Programming Processes (2013)

Chapter 4: Validation Case Studies


This chapter describes the approach used for selecting and conducting case studies that support the guide and the technical reference. From the state of the practice survey, 13 MPOs, 19 DOTs, and two transportation authorities expressed interest in participating in a more detailed case study. In addition, respondents identified planning products they expect to be working on in the next year, which is a key element of the proposed approach.

Approach

The case studies for the SHRP 2 L05 project are unique in that they are validation case studies and not best-practices case studies. Because the case studies validate reliability processes, they cover areas that may not necessarily be fully incorporating reliability into their planning and operations processes. In general, case study agencies have at least begun to think about reliability but have not fully incorporated it into planning. The case studies provide an opportunity to test the methods presented in the guide and technical reference for incorporating reliability.

Project L05 identified a specific planning task (e.g., prioritizing projects, identifying reliability deficiencies and needs) and assisted the case study site by collecting and analyzing data. The project team and case study participants worked together to accomplish a specific desired outcome; the lessons learned were incorporated into the guide and technical reference.

The validation case studies revolved around one or two of the following major planning and programming products:

• State and metropolitan Long-Range Transportation Plans (LRTP), which include a range of approaches, especially for states;
• Congestion management processes (CMP);
• Corridor, area, modal, and other similar studies that examine one portion of the transportation system;
• State Transportation Improvement Programs (STIP) or MPO Transportation Improvement Programs (TIP);
• State or regional efforts to plan for operations generally or to plan for special events, extreme weather, and other similar efforts;
• Project development processes (i.e., design);
• Environmental review;
• Project construction and work zone planning; and
• System operations and management.

Selection Criteria

The team defined multiple criteria for selecting validation case study sites and selected sites meeting as many of the following criteria as possible.

• Understanding of reliability. While the guide is intended to be applied at MPOs and DOTs at different levels of sophistication facing different levels of reliability problems, it is important that validation take place at agencies that have some conceptual understanding of reliability and face real reliability challenges. It is also useful to work with agencies that have been considering operations within the planning process, even if not explicitly measuring and tracking operations-oriented performance measures. All levels of sophistication are addressed, adapting the four levels of sophistication identified in Chapter 3 (i.e., the sections on leaders and innovators, unrealized opportunities, planning ahead, and in need of a primer).
• Area size. Because of the need to have broad applicability, it is important to study metropolitan areas ranging in population size. The case studies include medium and large metropolitan areas because they are the most likely to have experienced travel time reliability problems.
• Agency type. It is important to study both state DOTs and MPOs. This ensures that both perspectives are accounted for in the guide and technical reference.

• Work product. Each case study validation site was organized around specific planning products or processes. Case study sites were selected in such a way that most work products and processes are accounted for.
• Geographic coverage. The case studies draw from agencies across the United States representing various geographies and land-use development patterns. Case studies should include regions with dense urban areas and regions with more dispersed, suburban-style development. These areas are likely to have unique issues and unique solutions.
• Willingness to participate. Willingness to participate is important with any case study effort.

Case Studies

The team identified validation case study locations from a combination of research team experience and the findings of the state of the practice survey described in Chapter 3. This chapter includes summaries of each case study developed in this research effort. Full write-ups of each case study can be found in the technical reference. Key findings from the case study results are referenced throughout the guide and technical reference and are summarized in Table 4.1.

Knoxville Region Transportation Planning Organization

The primary objective of the case study is to develop a process for estimating reliability performance measures and identifying reliability deficiencies based on traffic flow and incident duration data, as well as estimating the impacts of operations projects for the Knoxville Regional Transportation Planning Organization (TPO). The TPO has begun to carry out the update of the Long-Range Transportation Plan (LRTP) for the region and is undertaking Planning for Operations. This case study documents the incorporation of reliability into the agency's transportation planning process. The case study also provides validation for the following steps in the guide:

• Measuring and tracking reliability;
• Incorporating reliability in policy statements; and
• Incorporating reliability measures into program and project investment decisions.

The case study was successful in establishing an initial framework for an ongoing reliability performance monitoring system. It demonstrated how various reliability performance indices and incident duration can be calculated using archived traffic volume, speed, and incident data from a regional ITS freeway management system. This is a critical first step in identifying reliability deficiencies on freeway segments and potential traffic operations strategies for improving reliability on these segments.

It also demonstrated how agencies can formulate travel time reliability and incident duration goals and set specific targets for their region based on reliability and incident duration analysis results. These can be incorporated as criteria in the long-range transportation plan development process as well as in operations planning.

Finally, the case study showed how agencies can use sketch-planning methods and the data-poor reliability prediction equations from SHRP 2 L03 to assess the reliability benefits of operations strategies within a regional ITS architecture and then build a roster of operations projects for inclusion in the LRTP.
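The report does not reproduce the TPO's calculations here, so the following is only a minimal sketch of how commonly used travel time reliability indices (travel time index, planning time index, buffer index) could be computed from archived per-segment travel times. The segment data and free-flow travel time in the example are invented.

```python
import numpy as np

def reliability_indices(travel_times, free_flow_time):
    """Common travel time reliability indices for one freeway segment.

    travel_times   : observed travel times (minutes), e.g. 5-minute archive records
    free_flow_time : free-flow travel time for the same segment (minutes)
    """
    tt = np.asarray(travel_times, dtype=float)
    mean_tt = tt.mean()
    p95_tt = np.percentile(tt, 95)
    return {
        "travel_time_index": mean_tt / free_flow_time,    # mean vs. free flow
        "planning_time_index": p95_tt / free_flow_time,   # 95th percentile vs. free flow
        "buffer_index": (p95_tt - mean_tt) / mean_tt,     # extra buffer beyond the mean
    }

# Invented example: a segment with a 2.0-minute free-flow travel time
observed = [2.1, 2.3, 2.0, 2.2, 4.8, 2.4, 2.1, 6.0, 2.2, 2.3]
print(reliability_indices(observed, free_flow_time=2.0))
```

Incident duration measures would be computed analogously from the archived incident records rather than from travel times.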
Florida Department of Transportation

The objective of the case study is to document Florida DOT's efforts to incorporate travel time reliability into their planning and programming process. Florida has developed reliability measures for both planning (system focused) and operations (corridor focused). These measures are being incorporated into Florida DOT's short-range decision support tool, the Strategic Investment Tool (SIT), which is used to prioritize projects for inclusion in the State Transportation Improvement Program (STIP). The planning office has also developed modeling techniques for predicting the impact of projects on travel time reliability. In addition, both offices are very interested in the economic value of projects and return on investment of operations improvements. The case study documents these activities and provides validation for the following steps in the guide:

• Measuring and tracking reliability;
• Incorporating reliability in policy statements; and
• Incorporating reliability measures into program and project investment decisions.

The Florida DOT case study revealed that incorporating reliability (specifically operations projects) into the programming process is a challenging process for most state DOTs. It requires locating a specific funding category to cover operations improvements, although statutory requirements may limit the types of projects that can be funded with existing funding categories. Two basic funding models could be considered: (1) allocating separate funding for operations projects; or (2) allocating a portion of existing capacity funding for operations projects. This has important implications for the SHRP 2 L05 project, as it appears that many states would benefit from guidance on determining eligibility of funding operations improvements under specific silos or funding categories or making the required policy changes to set up a dedicated funding mechanism.
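The report notes that FDOT's Strategic Investment Tool is used to prioritize projects for the STIP but does not describe its scoring logic here. The sketch below is therefore only a generic, hypothetical illustration of how a reliability measure could enter a weighted project-scoring step; the criteria, weights, scores, and project names are all invented and do not represent the actual SIT.

```python
# Hypothetical weighted scoring with a reliability criterion (not the actual SIT logic).
weights = {"safety": 0.30, "mobility": 0.30, "reliability": 0.25, "cost_effectiveness": 0.15}

projects = {
    # Criterion scores normalised to 0-1; all values are invented for illustration.
    "Ramp metering package": {"safety": 0.6, "mobility": 0.7, "reliability": 0.9, "cost_effectiveness": 0.8},
    "General-purpose lane addition": {"safety": 0.5, "mobility": 0.8, "reliability": 0.4, "cost_effectiveness": 0.3},
}

def composite_score(scores):
    """Weighted sum of criterion scores."""
    return sum(weights[c] * scores[c] for c in weights)

for name in sorted(projects, key=lambda p: composite_score(projects[p]), reverse=True):
    print(f"{name}: {composite_score(projects[name]):.2f}")
```

The point of such an illustration is simply that, once reliability has a defined measure and weight, operations projects that score well on reliability can compete directly with capacity projects in the same ranking.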

Table 4.1. Key Findings and Lessons from Validation Case Studies

Colorado DOT
Objectives: Conduct a before-and-after analysis and benefits study of a pilot traffic operations project being conducted by Colorado DOT in Denver. One of the key themes of SHRP 2 L05 and other efforts is an attempt to mainstream operations planning within the broader planning process. This validation case study identifies methods to better achieve that objective.
Key findings/lessons: Documents the process for conducting an arterial before-and-after analysis with emphasis on travel time reliability; benefits of operations strategies in improving travel time reliability; steps to incorporating reliability performance measures into the LRTP at CDOT. The findings validate the operations planning phase of the planning process.
References: Technical reference: Section 2.2, Appendix D; Technical reference: Chapter 6, Appendix B, Appendix C; Technical reference: na

Florida DOT
Objectives: Document FDOT's efforts to incorporate travel time reliability into their planning and programming process, including incorporating reliability into their short-range decision support tool (Strategic Investment Tool) and modeling techniques for predicting the impact of projects on reliability.
Key findings/lessons: Incorporating reliability into the programming process is a challenge due to lack of specific funding categories and challenges due to statutory requirements regarding the types of projects that can be funded. The case study documented many success factors for incorporating reliability into the planning and programming process. The findings validate the programming phase of the planning process.
References: Guide: Chapter 3; Technical reference: Chapters 2, 3

Knoxville, TN MPO
Objectives: Demonstrate how reliability can be incorporated into the ITS/operations element of the region's upcoming LRTP and assist MPO staff in incorporating reliability performance measures in plan development, project identification, and project prioritization processes.
Key findings/lessons: Developed a reliability objective for inclusion in the Congestion Management Process; calculated reliability performance measures along freeways and incident-prone locations; developed a method for incorporating reliability into the project selection process. The findings validate tools for quantifying travel time reliability using somewhat less sophisticated modeling and other tools.
References: Technical reference: na; Guide: Chapter 3; Technical reference: Chapter 5, Appendix D; Technical reference: Chapters 3, 5

LAMTA (Los Angeles)
Objectives: Document the development of an arterial performance monitoring system, which will be used to prioritize arterial operations projects for funding.
Key findings/lessons: Recommends an approach for using alternative data sources to support an arterial performance monitoring system. Preliminary findings suggest that multi-modal reliability measures can be calculated from alternative data sources, although data source consistency is critical.
References: Technical reference: Chapter 2, Appendix D

NCTCOG (Dallas-Fort Worth)
Objectives: Identify best practices on how other MPOs are incorporating reliability into their Congestion Management Process and provide recommendations on how NCTCOG can incorporate reliability into their planning process.
Key findings/lessons: Only a limited number of MPOs have incorporated reliability into their CMP. Success factors include having robust amounts and sources of traffic data, using corridor-level measures and effective reporting graphics, defining reliability in a way that can be easily understood by multiple audiences, and having a performance measurement working group consisting of agency staff, technical and policy board members, local stakeholders, and the public.
References: Technical reference: Chapters 2, 5, Appendix D

SEMCOG (Detroit)
Objectives: Identify reliability performance measures for assessing highway operations and develop a method for incorporating reliability into SEMCOG's performance-based program trade-off process.
Key findings/lessons: Reliability can be incorporated in the trade-off analysis process and will likely impact the results of the prioritization process; the use of representative corridors can be effective in conducting a regional analysis; assessments of reliability can be conducted even in situations with limited data availability. The findings validate incorporation of reliability into a program-level trade-off analysis.
References: Guide: Chapter 5; Technical reference: Chapters 5, 6, Appendix C

Washington State DOT
Objectives: Incorporate reliability into identifying deficiencies and investments in a corridor.
Key findings/lessons: Establishes a methodology for examining reliability deficiencies for WSDOT corridor studies.
References: Guide: Chapter 2; Technical reference: Chapter 3

Note: na = not applicable.

However, because different state DOTs have different programming priorities and processes, it may be difficult to identify a good decision-making model for the long term.

The case study validated the following success factors for incorporating reliability into the planning and programming process:

• Reliability needs to be specifically addressed in the vision, mission, and goals of a plan. These policy statements define the long-term direction of an agency and provide the foundation on which to select reliability performance measures and make the right choices and trade-offs when setting funding levels and selecting projects.
• Reliability needs to be a well-defined measure with supporting data. Well-defined reliability performance measures define an important, but often overlooked, aspect of customer needs. The measures help to support the development of policy language and are critical to making reasoned choices and balanced trade-offs.
• Reliability needs to be used to estimate and predict transportation needs and deficiencies, including the development and analysis of project and scenario alternatives. Estimating reliability deficiencies using well-defined measures helps to define the size and source of the reliability problem and can be used to inform policy makers about how the reliability of the system has been changing over time and how it is expected to change in the future. The maps, charts, and figures provide critical background when making choices and trade-offs.
• Reliability needs to be used in program-level trade-offs. Bringing reliability into the discussion brings clarity to the issue of balancing operations and capacity funding. Without the consideration of reliability, the trade-off nearly always tilts toward capacity projects.
• Reliability needs to be an integral component of priority setting and decision making at the project level. Incorporating reliability into project prioritization and programming brings clarity to the issue of choosing the appropriate balance of operations and capacity strategies.

State DOTs would benefit from a maturity model that defines various levels of organizational capability with respect to these success factors. State DOTs could use the maturity model as a tool for (1) assessing where they stand with respect to incorporating reliability into all components of the planning and programming process; (2) assisting them in understanding common concepts related to the process; and (3) assisting them in identifying next steps to achieve success toward an ultimate goal state. The maturity model should be a living document that is continually refined based on agency capabilities.

Los Angeles Metropolitan Transit Authority

The objective of the Los Angeles (LA) County Arterial Performance Monitoring case study is to develop the preliminary framework for an arterial performance monitoring system, which is being developed by the Los Angeles Metropolitan Transit Authority (LAMTA) as an improved mechanism for prioritizing arterial operations projects for funding. As part of the 2009 Long-Range Transportation Plan (LRTP), LAMTA continues to focus on improving arterial traffic flow through the implementation of transportation system management (TSM) projects, including intelligent transportation systems (ITS), coordinated signal timing, and bus signal priority. Historically, LAMTA has programmed over $30 million per year to meet regional and subregional needs for projects of this nature.
Due to a number of financial constraints, the 2009 LRTP Strategic Plan calls for a 50% reduction in TSM funding over the next 30 years. LAMTA holds annual solicitations for agencies in LA County to apply for funding to improve arterial operations. LAMTA's current process for prioritizing arterial operations projects involves conducting before-and-after evaluations. Data are collected using floating car surveys and spot counts. It is currently a reactive approach in response to incidents and complaints received from the traveling public. The basis for this approach is local-level evaluation using optimization.

This case study documented the development of a preliminary framework for an arterial performance monitoring system. The case study results show that arterial reliability measures require robust data sets that provide sufficient data points on each roadway of interest during all times of interest. Although it is possible to calculate arterial reliability measures from a variety of multimodal data sources, there is a challenge in collecting large enough samples both spatially and temporally. Data source consistency is critical.

Southeast Michigan Council of Governments

The Southeast Michigan Council of Governments (SEMCOG) is the metropolitan planning organization (MPO) for the Detroit region. As in many regions, the identified need for infrastructure improvements greatly outweighs the available funding levels, so a logical and effective process is needed to assist SEMCOG in setting program funding levels. SEMCOG developed such a process while preparing their 2035 regional transportation plan (RTP) that allows them to trade off among several program areas, including pavement, bridge, highway capacity, safety, transit, and non-motorized modes. This case study updates that process to assess funding levels required for SEMCOG's roadway operations program by assessing total delay, including nonrecurring delay, the main cause of unreliable travel.

The case study provides validation for the "Incorporating Reliability into Program and Project Investment Decisions" step in the guide. The comparison of the benefits estimated both with and without considering reliability shows several interesting results. Key findings include the following. As expected, when nonrecurring delay is considered in the analysis, the overall delay estimates are much greater (with the baseline delay more than doubling from 2.4 to 6.8 hours of delay per 1,000 vehicle miles traveled [VMT]). Investments in roadway operations strategies were shown to yield a much greater impact on total hours of delay, particularly at the lower investment levels. Small investments in these strategies result in a steep curve of reducing delay levels. Similar to the analysis that does not consider reliability, there is a declining utility to higher investment levels, and increased investment brings about lower incremental improvement for each dollar spent.

In addition to the actual analysis results, several lessons were learned throughout the case study.

• Reliability can be relatively easily incorporated in the trade-off analysis process. Consideration of reliability will likely have an impact on the results of the prioritization process.
• The use of representative corridors can be effective in conducting a regional analysis within reasonable budget and schedule requirements.
• Even in situations with limited data availability, assessments of reliability can be performed efficiently, providing much needed consideration of these factors within the overall assessment of trade-offs regarding investment priorities.

The analysis approach in this case study represents a first step in the overall incorporation of reliability performance measures in the investment prioritization process. Improvements and enhancements to this process may include the following.

• Application of nonrecurring congestion measurement within the analysis of highway capacity improvements to make the comparison of capacity and operations improvements more equitable (e.g., capture the reliability benefits of increasing capacity).
• Inclusion of a greater variety of representative corridors in the analysis.
• Development of automated routines to allow the estimation of incident-related delay and total delay (recurring and nonrecurring) within the travel demand model itself, thus allowing the more detailed regional assessment of these measures.
• Separating the various roadway operations improvements within the analysis to allow each strategy to be analyzed individually.

Colorado Department of Transportation (DOT) and Denver Regional Council of Governments

This case study establishes baseline conditions for a pilot corridor and lays the groundwork for conducting a before-and-after analysis to assess benefits of operations strategies using an arterial performance monitoring system. It documents the steps to planning and funding an operations project intended to improve travel time reliability. Finally, the case study documents the Colorado Department of Transportation's (CDOT) efforts in selecting and incorporating operations (including reliability) performance measures into their long-range planning process. This case study provides validation for the following steps in the guide: (1) measuring and tracking reliability and (2) incorporating reliability measures into program and project investment decisions.

The pilot project on Hampden Avenue in Denver proved that reliability data can be calculated with a small amount of equipment (in this case three Bluetooth readers) over a relatively short period of time (two months). The use of this portable detection and monitoring system indicates to other agencies that corridor reliability studies and benefits analyses of operations improvements can be conducted inexpensively. CDOT is actively pursuing collection of reliability data. The purchase of Navteq data statewide and the portable detection and monitoring system have both proven to be valuable assets in obtaining reliability data.

CDOT's experience in their LRTP update process indicates that reliability data can provide transportation agencies with opportunities to enhance several steps within the statewide transportation plan development process, including

• Assessing program or strategy performance toward meeting mobility goals and objectives;
• Determining needs-based investment levels for corridors;
• Determining and evaluating the strategies that are best suited to improve travel in a corridor;
• Selecting and prioritizing projects for inclusion in the STIP; and
• Providing detailed data used in the design of specific projects.

CDOT modified their previous LRTP and STIP development processes to incorporate a process that is performance-driven and needs-based for this plan update cycle. They determined that reliability was one of the most important factors in both evaluating system and project performance and assessing corridor needs. Developing plans based on performance data provides decision makers, taxpayers, and users with assurances that implemented projects will meet performance goals, will be a high priority based on performance, and will provide users with specific benefits. Continuous monitoring of corridor and network performance will provide decision makers, taxpayers, and users with quantifiable information on both specific projects and on the sum of all improvements made to the corridor or network. Performance data, including reliability, provides accountability for investments to decision makers, taxpayers, and users. Performance data also enables calculations of specific benefits and benefit-cost ratios that allow easy comparison with more traditional transportation improvements, such as capacity addition.

Washington State Department of Transportation

The objective of this case study is to identify reliability deficiencies along a key segment of the Interstate 5 (I-5) corridor near the Joint Base Lewis-McChord military base and apply sketch-planning methods to assess the impacts of implementing a package of reliability mitigation strategies within the corridor. The case study provides validation for the "Evaluating Reliability Needs and Deficiencies" and "Incorporating Reliability Measures into Program and Project Investment Decisions" steps in the guide.

The case study was successful in demonstrating how agencies can use sketch-planning methods to assess the reliability impacts for a package of operations strategies within a corridor and then advance these projects into the region's long-range transportation plan. The case study demonstrated:

• The process for collecting data and selecting appropriate analytical techniques from among several available options.
• How to divide the entire corridor into subsections. This allowed the analysis to be completed in a timely and resource-conscious manner without washing out the differences in performance that would have likely occurred if the corridor was treated as a whole.
• How to identify reliability deficiencies in a corridor using reliability thresholds (see the sketch following this list).
• How a relatively low-cost set of operations investments can improve travel time reliability in a corridor.
• How agencies can apply sketch-planning methods using travel-demand-model data and the SHRP 2 L03 data-poor reliability prediction equations within a spreadsheet environment.
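The thresholds WSDOT used are not given in this summary, so the sketch below only illustrates the general idea of a threshold-based deficiency screen; the subsection names, planning time index values, and cutoff are hypothetical.

```python
# Hypothetical threshold screen: flag corridor subsections whose planning time
# index (PTI) exceeds a chosen cutoff. All values below are invented.
subsection_pti = {
    "Subsection A": 2.10,
    "Subsection B": 1.65,
    "Subsection C": 1.30,
}

PTI_THRESHOLD = 1.50  # illustrative cutoff; an agency would set its own

deficient = {name: pti for name, pti in subsection_pti.items() if pti > PTI_THRESHOLD}
for name, pti in sorted(deficient.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Reliability-deficient: {name} (PTI = {pti:.2f})")
```

Dividing the corridor into subsections, as the case study did, keeps a localized deficiency on one subsection from being washed out by better performance elsewhere in the corridor.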

TRB’s second Strategic Highway Research Program (SHRP 2) Report S2-L05-RW-1: Incorporating Reliability Performance Measures into the Transportation Planning and Programming Processes reviews domestic and international literature describing current research and practical use of travel-time reliability in transportation planning; summarizes results from a survey of state departments of transportation and metropolitan planning organizations about the current state-of-the-practice of using travel-time reliability in transportation planning; summarizes case studies of agencies that are incorporating reliability into their transportation planning processes; summarizes travel-time reliability performance measures, strategies for improving travel-time reliability, and tools for measuring the impacts of strategies on travel-time reliability; and describes the framework for incorporating reliability performance into the transportation planning process.

The Final Report is designed to accompany the Technical Reference that provides a “how-to” guide for technical staff to select and calculate the appropriate performance measures to support the development of key planning products and a Guide designed to help planning, programming, and operations managers apply the concept of travel-time reliability to balance investment in programs and projects.

SHRP 2 Reliability Project L05 has developed a series of case studies that highlight examples of agencies that have incorporated reliability into their transportation planning processes as well as three reliability assessment spreadsheet tools related to the case studies.




Mixed methods instrument validation: Evaluation procedures for practitioners developed from the validation of the Swiss Instrument for Evaluating Interprofessional Collaboration

Jean Anthony Grand-Guillaume-Perrenoud
1 Bern University of Applied Sciences, School of Health Professions, Bern, Switzerland

Franziska Geese
2 Academic-Practice-Partnership of Bern University of Applied Sciences and Insel Gruppe, Bern University Hospital, Bern, Switzerland

Katja Uhlmann
Angela Blasimann
Felicitas L. Wagner
3 University of Bern, Institute for Medical Education, Department for Assessment and Evaluation, Bern, Switzerland

Florian B. Neubauer
Sören Huwendiek
Sabine Hahn
Kai-Uwe Schmitt

Associated Data

Data sets analysed in the current study are available from the corresponding author on reasonable request.

Background

Quantitative and qualitative procedures are necessary components of instrument development and assessment. However, validation studies conventionally emphasise quantitative assessments while neglecting qualitative procedures. Applying both methods in a mixed methods design provides additional insights into instrument quality and more rigorous validity evidence. Drawing from an extensive review of the methodological and applied validation literature on mixed methods, we showcase our use of mixed methods for validation, applying the quality criteria of congruence, convergence, and credibility to data collected with an instrument measuring interprofessional collaboration in the context of Swiss healthcare, the Swiss Instrument for Evaluating Interprofessional Collaboration.

Methods

We employ a convergent parallel mixed methods design to analyse quantitative and qualitative questionnaire data. Data were collected from staff, supervisors, and patients of a university hospital and regional hospitals in the German- and Italian-speaking regions of Switzerland. We compare quantitative ratings and qualitative comments to evaluate the quality criteria of congruence, convergence, and credibility, which together form part of an instrument's construct validity evidence.

Results

Questionnaires from 435 staff, 133 supervisors, and 189 patients were collected. Analysis of congruence provides potential explanations for why respondents' comments are off topic. Convergence between quantitative ratings and qualitative comments can be interpreted as an indication of convergent validity. Credibility provides a summary evaluation of instrument quality. Together, these quality criteria provide evidence that questions were understood as intended, contribute to construct validity, and point to potential item quality issues.

Conclusions

Mixed methods provide alternative means of collecting construct validity evidence. Our suggested procedures can be easily applied to empirical data and allow the congruence, convergence, and credibility of questionnaire items to be evaluated. The described procedures provide an efficient means of enhancing the rigour of an instrument and can be used alone or in conjunction with traditional quantitative psychometric approaches.

Background

Questionnaire development comprises procedures that are both qualitative and quantitative. For instance, generating items to represent a construct involves qualitative processes. These include a literature review and conducting expert interviews or focus groups to extract relevant dimensions and develop items that capture them [ 1 , 2 ]. In a further qualitative process, developed items are judged by experts as to whether they capture all aspects of a dimension and are well understood by prospective respondents [ 3 ]. This is sometimes supplemented by a quantitative assessment of whether the items are relevant and understandable, as in the example of the Content Validity Index [ 4 – 6 ]. When an initial draft of the instrument has been developed, a qualitative cognitive pre-test is advised [ 7 ]. Quantitative procedures then come into play as the battery of items is tested against statistical criteria, such as Cronbach's alpha coefficient to demonstrate internal consistency [ 8 ] or bivariate correlation coefficients to demonstrate construct-related and criterion-related validity [ 9 , 10 ]. Despite the fact that both qualitative and quantitative procedures are involved in instrument development [ 11 , 12 ], quantitative methods may be overemphasised [ 13 ] and qualitative methods neglected [ 14 ].
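Since the paragraph above mentions the Content Validity Index and Cronbach's alpha, a brief sketch of how these are typically computed may be useful. The ratings shown are invented, and the CVI rule used here (the proportion of experts rating an item 3 or 4 on a 4-point relevance scale) is one common convention rather than a detail taken from this study.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents x n_items) array of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def item_cvi(relevance_ratings):
    """Item-level CVI: share of experts rating the item 3 or 4 on a 4-point scale."""
    ratings = np.asarray(relevance_ratings)
    return float(np.mean(ratings >= 3))

# Invented data: 6 respondents x 4 items on a 5-point agreement scale
scale_scores = np.array([
    [4, 4, 3, 4],
    [3, 3, 3, 2],
    [5, 4, 4, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
    [3, 3, 4, 3],
])
print(round(cronbach_alpha(scale_scores), 2))
print(item_cvi([4, 3, 4, 4, 2, 3]))  # ratings from 6 hypothetical experts
```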

These circumstances contribute to the perception that instrument development is bound to its methodological tradition, wherein only quantitative approaches are appropriate for developing quantitative instruments [ 15 ]. Neglected qualitative methods represent untapped potential for validation and open up new means of collecting evidence of construct validity [ 16 ]. Given that qualitative and quantitative procedures are part of instrument development and assessment, we propose that their mix can provide additional insights into instrument quality that go beyond the contributions of a mono-method alone [ 15 ]. Specifically, we propose that a mixed methods (MM) approach to instrument validation (IV) will enrich the process with more rigorous validity evidence [ 17 ].

We develop procedures and illustrate the potential of MM analysis for IV using an instrument measuring interprofessional collaboration (IPC) in the Swiss healthcare context, called the Swiss Instrument for Evaluating Interprofessional Collaboration (SIPEI). IPC in healthcare is understood as the joint efforts of workers from different healthcare professions to provide high quality comprehensive care to patients, families, and communities across settings [ 18 ]. The importance of IPC has been recognised by the World Health Organization (WHO) since the 1970s, with research showing that IPC may have a positive impact on patient satisfaction, length of hospital stay, and access to healthcare services [ 19 ]. It may also increase the flow of information between professions [ 20 ] and workplace satisfaction of health professionals [ 21 , 22 ].

In the following, we describe our study’s contribution to the instrument validation literature. This is followed by theoretical frameworks for IV and exemplar studies, which we will use to derive validation criteria. Our study demonstrates the utility of MM in IV using sample items of SIPEI to illustrate. We begin by reviewing the literature on MM validation frameworks and validation studies that use quantitative and qualitative methods. We derive criteria and procedures applicable for our IV, given the data collected and the time constraints imposed by our project. Our procedures provide researchers constrained by time, budget, and limited data with a means of enriching an IV through MM.

Theoretical frameworks for mixed methods validation

Mixing multiple quantitative methods for IV can be traced as far back as Campbell and Fiske’s [ 23 ] seminal paper using multitrait-multimethod analysis, which some methodologists view as having formalised the use of multiple methods for validation [ 15 , 24 , 25 ] and even as laying the groundwork for MM research [ 26 ]. Multitrait-multimethod analysis, however, does not include any qualitative assessment. With the advent of MM, an overarching approach to instrument development and validation became available that combines quantitative and qualitative methods.

Among the theoretical developments, Dellinger and Leech [ 16 ] proposed a unified validation framework (VF) which provides guidance for construct validation by suggesting elements of validity evidence to consider within a MM framework. The authors review existing terminology on validity from the quantitative, qualitative and MM literature and suggest four new quality criteria which can provide information on the validity of a study. Among the criteria, they introduce the concept of a ‘foundational element,’ which refers to researchers’ understanding of a construct or phenomenon. Second, their concept of ‘inferential consistency’ refers to the degree to which a study’s findings agree with previous research. Third, citing Messick [ 27 ], they introduce a utility/historical element, which uses past utilization of an instrument as indication of construct validity. Fourth, the authors propose a ‘consequential element’, wherein an instrument’s or study findings’ socially acceptable use is regarded as evidence of ‘consequential validity’.

A second framework, proposed by Onwuegbuzie et al. [ 15 ], is a meta-framework that prescribes the use of validation procedures. It consists of a 10-phase process called “Instrument Development and Construct Validation” (IDCV) to optimise quantitative instrument development. Using the different types of validity as starting point (e.g., structural validity, convergent validity, etc.), the authors propose corresponding ‘crossover analyses’, which supplement the traditional analyses associated with various types of validity. Crossover analyses use qualitative methods to analyse quantitative data, and quantitative methods to analyse qualitative data. The framework contains separate quantitative and qualitative analysis phases, but also phases where both methods are combined in crossover analysis. One crossover analysis phase is qualitative-dominant, and another phase is quantitative-dominant. These procedures are designed to enhance instrument fidelity, which encompasses an instrument’s appropriateness or utility.

Another notable framework was proposed by Adcock and Collier [ 28 ] and applied in a MM instrument validation [ 17 ]. Adcock and Collier [ 29 ] discussed the lack of shared standards for quantitative and qualitative research. They proposed a shared framework for establishing validity that uses quantitative and qualitative methods. It distinguishes four levels a researcher progresses through when developing an instrument and defines tasks between levels that lead a researcher to transition between levels. The starting point, Level 1, is the background concept. The task of conceptualization leads to Level 2, the systematised concept, which is derived from a literature review, usually culminating in an explicit definition of the concept being researched. The task of operationalization leads from the systematised concept to Level 3, the indicators. Finally, the task of giving scores to responses leads to Level 4, to scores for each respondent. The framework focuses on a criterion dubbed ‘measurement validity’, which addresses the relationship between the systematised concept and the observations gathered using the instrument. Measurement validity deals with Levels 2–4, i.e., the systematised concept and measured scores. When initial instrument testing has taken place, a revision can be undertaken by working backward through the levels and making refinements. Adcock and Collier [ 28 ] distinguished between three types of validation, merging certain types of validation into one category: 1) content validation, 2) content/discriminant validation, and 3) nomological/construct validation and argued that all three forms could be validated using quantitative and qualitative methods.

The presented frameworks are, to our knowledge, the only frameworks to explicitly propose MM for IV [ 15 , 16 ] or to have been applied in a MM IV [ 28 ]. They vary in the degree to which they specify procedures and the degree to which quantitative and qualitative methods are mixed. Dellinger and Leech’s [ 16 ] contribution aimed to guide thinking about validity within quantitative, qualitative and MM traditions and compiled a catalogue of quality concepts related to validity within the three traditions. However, it does not suggest specific validation procedures. Onwuegbuzie et al. [ 15 ] provided a 10-phase process for instrument development and validation and suggested specific procedures for handling quantitative and qualitative data analysis on their own and in mixed, crossover analyses as part of a 10-phase process. The elaboration of each phase and its application using an actual example of instrument development [ 29 ] bridges the gap between the abstractions of the methodological literature and the hands-on procedures of empirical validation literature. Adcock and Collier [ 28 ] developed a four-level framework for instrument development that uses quantitative and qualitative methods separately but does not explicitly combine them to enable deeper insights that transcend separate mono-method analyses. Of the frameworks described, Dellinger and Leech’s [ 16 ] is the most abstract and least prescriptive, while Onwuegbuzie et al. [ 15 ] and Adcock and Collier [ 28 ] provide more explicitly practice-oriented frameworks from which specific procedures are more easily derived.

We next present examples of how multiple methods have been applied in validation studies and propose a typology. Our overview demonstrates the application of multiple methods with varying forms of mixing. Some studies apply MM frameworks developed explicitly for validation purposes. Other studies apply separate quantitative and qualitative mono-methods within the same validation study. This literature informed our validation and can provide other instrument developers with practical analytic examples, which can be varied depending on the time, budget, and data available as well as other project constraints [ 30 ].

Overview of studies applying multiple methods

We propose a typology of multiple method validation studies based on how the methods are applied. Exemplar studies for each type are presented. We apply the term “multiple methods” as an overarching term of multi-method studies which encompasses MM. We also classify as “multiple methods” any study that applies multiple quantitative or qualitative strands within the same study or combines the use of quantitative and qualitative methods within the same study, without mixing data, analyses, and results. This inclusiveness ensures that even studies that might not be “sufficiently mixed” or have the philosophical grounding of MM can be considered for their potential contribution to IV. This is useful, as what constitutes MM has been defined in different ways by leaders in the field and has been part of the MM discourse [ 24 ]. Some of these leaders have recognised the inconsistencies between various definitions of MM [ 31 ] and have expressed support to continue the discussion on MM’s evolving definition [ 31 , 32 ].

Multiple-method validation studies can be grouped into three categories: 1) studies that explicitly apply one of the MM frameworks specifically for validation, 2) studies that apply a general-purpose MM design within a validation study (e.g. convergent parallel design, explanatory sequential design) [ 33 ], 3) studies that apply quantitative and qualitative methods within the same study but do not mix them. We classify an approach as MM when the study contains quantitative and qualitative analyses, integrates the data and findings to enhance breadth and depth of understanding [ 24 ], and is guided by a philosophical stance/worldview [ 34 ]. Otherwise, we classify a study as multiple methods.

Studies that apply a mixed methods framework specifically for validation

We did not find a study that used Dellinger and Leech’s [ 16 ] Validation Framework (VF) in instrument development or validation. However, we found a literature review based on the VF. Hales [ 35 ] applied the VF to criticise studies guided by culturally responsive teaching and critical race theory. Qualitative, quantitative, and MM elements of the studies were reviewed, and VF criteria were applied to evaluate their quality.

An application of Onwuegbuzie et al.’s [ 15 ] 10-phase IDCV process can be found in Koskey et al.’s [ 36 ] validation of the Transformative Experience Questionnaire. In this study, the quantitative component using Rasch models provided evidence for content-related and construct-related validity. The qualitative component used cognitive interviews to uncover potential issues with the survey format, item wording, and response scale. The validation procedures that are applied and the validity evidence collected are embedded and described within the 10-phase IDCV process.

Studies that apply a general-purpose mixed methods design within a validation study

Enosh et al.'s [ 37 ] development, testing, and validation of the Client Violence Questionnaire applies a sequential MM design. The questionnaire is designed to measure client violence experienced by social workers. The development and validation process has four stages. The first stage comprises semi-structured qualitative interviews to discover forms of client violence, followed by three stages as suggested by Schwab [ 38 ], which correspond to common procedures in quantitative instrument development: a stage in which single items are formulated, another stage combining the items into a scale, and a final stage in which a psychometric assessment is conducted. This resulted in a 14-item self-report instrument measuring the frequency with which social workers encounter four types of client violence. Enosh et al. [ 37 ] argue that the addition of a qualitative component as a distinct stage, together with the more traditional components of quantitative instrument development, contributed to the fidelity, appropriateness, and utility of the instrument [ 39 ].

Luyt’s [ 17 ] validation of an instrument measuring male attitude norms expanded upon Adcock and Collier’s [ 28 ] framework and applied it in a convergent parallel design. His modified framework described a cyclical process of instrument design that alternated between measurement development, validation, and revision, using MM to achieve its objectives. While Adcock and Collier’s [ 28 ] framework describes qualitative and quantitative methods that can be used in parallel to collect the same type of validity evidence, e.g. for content validation or convergent validation, they do not explicitly propose a mix of data and findings. Luyt’s [ 17 ] validation approach, however, performed an explicit method mix and grounded its procedures within the philosophical foundations that characterise MM [ 25 , 34 , 40 ].

Studies that apply quantitative and qualitative methods but do not mix them

An objective similar to the study of Enosh et al. [ 37 ] was pursued by Waldrip and Fisher [ 41 ] in developing and validating the Cultural Learning Environment Questionnaire, wherein a qualitative component was used to enrich quantitative psychometric procedures. The instrument's purpose was to measure culturally sensitive factors that affect learning environments. After quantitative analyses, a qualitative component provided further evidence of construct validity. Students were asked about their perceptions of the instrument in qualitative interviews, including how they interpreted the scales of constructs and items. The students' statements were then examined to determine whether they corresponded to the authors' intentions. Although this study combined qualitative and quantitative components, it lacks the statement of a philosophical stance or worldview to indicate from which ontological, epistemological, and axiological perspective the study is to be understood [ 34 ]. It also lacks the statement of a specific MM design. More importantly, the qualitative data are not directly compared to any quantitative data. Rather, the qualitative data are compared with other qualitative data.

A further example of multiple methods without mixing is a study by Groenvold et al. [ 42 ] which re-examined the validity of a validated quality of life questionnaire (EORTC QLQ-C30) developed for cancer patients in cancer clinical trials. This study explored whether quantitative questionnaire and qualitative interview responses were consistent. The quantitative questionnaire was administered to breast and gynaecological cancer patients one hour prior to the qualitative interview. Raters listened to audio-taped recordings of the interviews and filled in the most appropriate responses into the quantitative questionnaire based on the interview responses. Afterwards, the two groups of questionnaire responses were quantitatively analysed. It was argued that consistency in responses would provide evidence that the questions were being understood as intended by the instrument developers. In addition, raters also wrote notes on any issues with the respondents' understanding of the questions, based on the interviews. This provided information about why a patient might indicate not experiencing shortness of breath when, in fact, she did. This patient's rationale was that the shortness of breath was due to being overweight, rather than being due to cancer. Interview comments gave raters insight into which questions might cause misunderstanding and discrepant answers, even when the questions were understood properly. This study showed two types of multiple method use. First, there was a quantification of interview data by transforming interview responses into quantitative questionnaire responses, which were compared quantitatively with self-administered questionnaire responses. Second, qualitative notes were taken by the raters which provided information on why response discrepancies might have resulted. As the two data sources resulted in quantitative ratings and there was no qualitative analysis by means of, e.g., content or thematic analysis, we do not regard this study as using a MM approach. In addition, the philosophical stance/worldview is not elaborated. The aforementioned frameworks are summarised in Table 1.

Table 1. Theoretical frameworks and applications of multi-method instrument validation

Note. QN: quantitative, QL: qualitative

A purposeful selection of validation methods and criteria

Our review of MM frameworks for validation and exemplar studies of multi and MM approaches to validation suggests criteria and methods that might be employed in a validation study. Considering numerous frameworks and approaches, however, also increases the complexity of an IV. As Bamberger et al. [ 30 ] noted, evaluation is often tied to constraints which involve budget, time, data, and politics. Practical limitations substantially shape which kinds of data are feasible to collect and which analyses can be conducted. These circumstances coincide with researchers’ desire to make full use of the data available for instrument enhancement. This calls for an approach that is closely oriented toward specific validation objectives and draws only upon criteria and methods necessary to achieve them. Under time and data constraints, Bamberger et al. [ 30 ] suggest that a MM approach can help in elaborating the information in data and confirming findings. MM can also help in obtaining different perspectives by combining analyses from a small number of cases.

In our validation of the Swiss Instrument for Evaluating Interprofessional Collaboration (SIPEI) [ 43 ], we had short data collection periods and few hospitals and clinics from which data could be collected. These are circumstances we believe to be common in health research studies. With time, data, and the objective of further optimizing the instrument in mind, a feasible approach to strengthening the validation can entail adding an open-ended question to each question/item containing a quantitative rating scale. This provided our validation study with supplementary information that could be compared with the quantitative data. If the statements from both data sources converged [ 44 ], it would provide additional evidence that the instrument was measuring what it was intended to measure [ 41 , 42 ]. As in our own study, researchers validating an instrument can also take field notes during data collection, which can be tapped to provide supplementary information.

The procedures we propose share similarities with cognitive interviewing. Both attempt to elicit information on whether the items were understood as intended. Cognitive interviews gather information on how a respondent interpreted an item, how they constructed their answer, which difficulties they had in answering, and any other information that might provide insight into how the respondent came to provide their answer [ 45 ]. Two forms of verbal report methods used in cognitive interviewing are think-aloud and verbal probing [ 46 ]. In the think-aloud method, the respondent is asked to explain what they are thinking while answering questionnaire items. Think-aloud was part of the initial testing of the newly developed items of SIPEI [ 47 ]. In verbal probing, additional questions are asked to gain further insights into the respondent's thinking [ 46 ]. In this paper, we propose procedures with comparable objectives. The main difference is that, in our MM validation procedures, we draw inferences from analyses of a respondent's quantitative and qualitative questionnaire answers. This has the advantage of being more scalable to large samples and minimising the additional time required to collect and analyse data. As a complement to cognitive interviewing, our procedures have the benefit of detecting issues with question design that might have been missed in the smaller sample cognitive interviews. The procedures we developed and cognitive interviewing can both be situated within the Messick validity framework [ 27 , 48 ], which applies generally to instrument validation and is independent of any MM validation frameworks. Viewed from the perspective of Messick's framework, our procedures should aim to ascertain whether the questions were understood as intended [ 49 ] in order to minimise unwanted variability and provide response process evidence [ 48 ]. We elaborate on the elements we take from the literature review in the methods section "Validation Criteria and Procedures."

The Swiss Instrument for Evaluating Interprofessional Collaboration (SIPEI)

We illustrate the potential of MM analysis for IV using the data collected in our validation of the Swiss Instrument for Evaluating Interprofessional Collaboration (German: “Schweizerisches InterProfessionalitäts-Evaluations-Instrumentarium», SIPEI) [ 47 ]. SIPEI is an instrument consisting of three questionnaires, each available in German, French, and Italian. A specific questionnaire was developed to collect data from patients, staff, and supervisors, respectively, to account for different perspectives on IPC. Intended for use within healthcare institutions, it is designed to be setting-agnostic and applicable independent of the specific healthcare unit, department, or institution. Questions are asked in four domains: 1) actual interprofessional collaboration (items denoted by the prefix IPC and PIPC in the patient questionnaire), 2) interprofessional organization (items denoted by the prefix IPO), 3) interprofessional education (items denoted by the prefix IPE), and 4) impact of interprofessional collaboration (items denoted by the prefix IPC_IMP). Details of SIPEI, its theoretical foundation and development, are described elsewhere [ 50 , 51 ]. For IV, all closed-ended questions have an associated open-ended question to provide comments. The prompt to elicit comments read “Please enter your comment here:”, which was placed to the right of the quantitative response and above a text box, in which the respondent could enter his/her comments. Placing a comment was not mandatory. The employee and supervisor questionnaires each take approximately 20 min to complete. The patient questionnaire can be completed in approximately 10 min [ 43 ].
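For later analyses it can help to group items by domain using the prefixes described above. The sketch below shows one way to do this; the example item code is a placeholder, not an actual SIPEI item identifier.

```python
# Map item-code prefixes to the four SIPEI domains described above.
DOMAIN_BY_PREFIX = {
    "IPC_IMP": "impact of interprofessional collaboration",
    "PIPC": "actual interprofessional collaboration (patient questionnaire)",
    "IPC": "actual interprofessional collaboration",
    "IPO": "interprofessional organization",
    "IPE": "interprofessional education",
}

def domain_of(item_code: str) -> str:
    """Return the domain for an item code such as 'IPO3' (placeholder code)."""
    # Check longer prefixes first so 'IPC_IMP1' is not classified under 'IPC'.
    for prefix in sorted(DOMAIN_BY_PREFIX, key=len, reverse=True):
        if item_code.startswith(prefix):
            return DOMAIN_BY_PREFIX[prefix]
    return "unknown"

print(domain_of("IPC_IMP2"))  # -> impact of interprofessional collaboration
```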

Applying mixed methods in instrument validation

Several mixed methods designs can be applied in instrument validation. For instance, when quantitative and qualitative data are collected at the same time (i.e., in parallel), a convergent parallel mixed methods design can be applied. We employed a convergent parallel mixed methods design to validate and optimise SIPEI.

We employ a convergent parallel mixed methods design [ 17 , 33 ] to analyse quantitative and qualitative questionnaire data collected using SIPEI. In this mixed methods design, quantitative and qualitative data are collected at the same time (i.e., in parallel) for the purpose of testing whether the data converge. Data were collected from staff, supervisors, and patients of a university hospital and regional hospitals in the German and Italian speaking regions of Switzerland. The data are used to test procedures which can be applied to open-ended questions in conjunction with quantitative ratings in a mixed analysis. We also test procedures which can be applied to qualitative open-ended questions on their own. The triangulated data allow evidence of construct validity to be collected as indicated by the criteria of congruence, convergence, and credibility. Our research is informed by a post-positivist philosophical stance/worldview [ 34 ], which is defined by a belief in an objective reality that is only imperfectly knowable and subject to researchers’ values and judgments.

Validation criteria and procedures

With the suggestions from the MM literature and the limitations of our study context to guide our decisions, we lay out a purposeful selection of validation criteria and procedures. They contain the following elements that are commonly found in MM research:

  • Citing the MM design employed [ 25 , 34 ]
  • Stating the underlying philosophical stance/worldview [ 25 , 34 ]
  • Providing a legitimation/rationale for the use of mixed methods [ 16 , 44 ]

In addition, we take elements from our review of the theoretical and empirical MM validation literature, wherein instrument development is seen as a process of continuous improvement [ 15 , 16 ], a cyclical process [ 17 , 28 ], and where the overarching goal of validation is to establish evidence of construct validity [ 16 , 27 , 52 ]. Specifically, we included criteria that could be tested on our data and would indicate that the questions were being understood as intended, providing evidence of construct validity:

  • Congruence [ 16 ] between question/item content and responses in open-ended questions
  • Convergence between quantitative and qualitative data [ 44 ], specifically the agreement between quantitative ratings and qualitative questions, following Waldrip and Fisher [ 41 ] and Groenvold et al. [ 42 ]
  • Credibility [ 16 ], based on the type of response in open-ended questions, inferences drawn from comparing quantitative and qualitative responses, and field notes from patient questionnaire data collection

The criteria we propose, their data requirements, and the associated analyses, as well as their advantages and disadvantages, are summarised in Table 2.

Quality criteria and associated analyses

Analytic procedures

Several MM analyses and one qualitative mono-method analysis were conducted to provide evidence of congruence, convergence, and credibility. All comment fields with content were coded for analysis [ 53 ] by one researcher and reviewed by two other researchers. Inconsistencies in coding were resolved through discussion.
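As an illustration of how this review step can be supported, the short sketch below flags items on which a reviewer's code differs from the first coder's so that they can be taken into the discussion. This is our own illustration rather than the study's procedure; the code assignments shown are hypothetical, using the congruence categories described in the following sections.

# Illustration only (not the study's procedure): flag items on which the
# reviewer's code differs from the first coder's so they can be discussed.
first_coder = {"IPC1": "not applicable", "IPC5": "congruent", "IPO2": "incongruent"}
reviewer = {"IPC1": "not applicable", "IPC5": "unclear", "IPO2": "incongruent"}

disagreements = {
    item: (first_coder[item], reviewer[item])
    for item in first_coder
    if first_coder[item] != reviewer[item]
}

for item, (code_1, code_2) in disagreements.items():
    print(f"{item}: coder 1 = '{code_1}', reviewer = '{code_2}' -> discuss")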

We begin with a descriptive analysis of the sample, followed by results organised by the criteria selected to establish evidence of construct validity. We selected illustrative items and comments to demonstrate our analyses. Results from the staff and supervisor questionnaires are presented together, as their items are comparable; results from the patient questionnaires are presented separately. We conclude with a summary of suggested adaptations to SIPEI.

A total of 1340 staff and supervisors were invited to participate; 435 staff and 133 supervisors took part, corresponding to a response rate of 42.4%. In addition, 189 patients participated in the survey. Table 3 summarises participant characteristics by hospital, profession, and language.

Participant characteristics

Congruence

Evidence for congruence was collected by testing the match between question/item content and the comments written in the associated open-ended response field.

In this analysis (Table 4), we judged whether comments were congruent (on-topic; 15 respondents, 39% of the sample), incongruent (off-topic; 7 respondents, 18%), unclear (3 respondents, 8%), or not applicable (13 respondents, 34%). Comments were judged not applicable when they indicated that the respondent could not make a substantive judgment.

Staff and supervisor item QN-QL comparison

The presented SIPEI items are paraphrased
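The percentages reported above follow directly from the category counts over the coded comments; the short sketch below (our illustration) reproduces the reported figures from the counts, 38 comments in total.

# Tally of congruence judgments, reproducing the percentages reported in the text.
from collections import Counter

judgments = Counter(congruent=15, incongruent=7, unclear=3, not_applicable=13)
total = sum(judgments.values())  # 38 coded comments

for category, count in judgments.items():
    print(f"{category}: {count} ({100 * count / total:.0f}%)")
# congruent: 15 (39%), incongruent: 7 (18%), unclear: 3 (8%), not_applicable: 13 (34%)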

The analysis of the comments provided potential explanations for why some respondents' answers were on-topic, off-topic, or neither (not applicable), pointing to potential issues with a question. Off-topic remarks, and remarks that were neither on- nor off-topic, may indicate that a respondent answered the question differently than intended by the questionnaire designers. They may also indicate that the question cannot be answered by the respondent or is not relevant to the respondent.

For instance, one comment indicated that the respondent could not answer the question because it was unclear (Table 4, Comment C1). This comment was classified "not applicable" because it was neither on- nor off-topic.

Several off-topic comments seemed to indicate that the question was not being answered as intended and that the quantitative rating might not actually be a response to the question being asked. The reasons why remarks are off topic might not always be apparent.

In one off-topic comment, the respondent remarked that he/she saw different issues that should be asked about instead of the question being asked (C6). Another respondent commented off-topic about seldom finding understanding on the part of the doctor when they disagreed on the treatment (C5), although the item was about interprofessional team members knowing other team members' responsibilities regarding treatment. A further comment referred to having "few meetings between doctors and nurses" (C7), although the question was about whether there were suitable rooms for interprofessional meetings. One off-topic comment suggested that a computer could be used to communicate with other professionals (C8), even though the question was about whether office spaces made it easy for interprofessional teams to exchange information. Finally, one comment suggested that the opportunity to adapt the treatment plan "in good time" was often missed (C3), although the question dealt with whether relevant decisions were jointly made in interprofessional teams.

Despite being on-topic, one comment expressed an inability to answer the question on interprofessional collaboration (C4) because there was no interprofessional team in the respondent's area of work. Another on-topic remark provided a good indication of why the quantitative rating was "cannot judge" when asked about the percentage of treatment plans jointly developed by more than two professions: this respondent indicated that very few treatment plans were developed together, despite several professional groups working together.

Convergence

Evidence for convergence was collected by checking for agreement between quantitative ratings and comments (Table 4). When quantitative ratings and their associated comments converge, it provides an indication of convergent validity. In our data, however, the determination of convergence or divergence was only possible in a few cases. The majority of cases were judged "not applicable," meaning that the criterion of convergence could not be applied. Many of the cases were judged as neutral, i.e., neither convergent nor divergent.

Convergence between quantitative ratings and their associated comments was found in only 5 of 38 questions/items (13%) (Table 4). Even fewer comments were divergent: only 2 (5%) (e.g., Staff IPO2). For 2 comments (5%), it was unclear whether they converged with the quantitative responses (e.g., Staff IPO4), because the comment was either off-topic or could not be clearly associated with the question being asked. Ten comments (26%) were judged as neutral (e.g., Staff IPC1, IPC5) because they indicated either that the respondent could not adequately answer the question or that he/she could not properly understand it.

The majority of comments (19, 50%) were judged as not applicable (e.g., Staff IPC3) because no answer was checked in the quantitative rating or the quantitative response was "cannot judge". Other comments were judged not applicable because they were off-topic (e.g., Staff IPC6) and thus could not be related to the quantitative responses, or because they indicated that respondents did not answer the quantitative questions as intended (e.g., Staff IPO2).
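The decision rules described in this section can be summarised compactly. The sketch below is our simplified reading of those rules, not the study's coding manual, and it collapses some distinctions (for example, between "not applicable" and "unclear" cases) for brevity.

# Simplified reading of the convergence rules described in the text
# (not the study's coding manual).
def convergence_category(rating, comment_code, agrees_with_rating=None):
    """rating: the closed-ended answer, or None if left blank;
    comment_code: the congruence judgment of the comment;
    agrees_with_rating: researcher judgment where a comparison is possible."""
    if rating is None or rating == "cannot judge":
        return "not applicable"   # nothing to compare the comment with
    if comment_code != "congruent":
        return "not applicable"   # off-topic or unclear comments cannot be compared
    if agrees_with_rating is None:
        return "neutral"          # comment neither supports nor contradicts the rating
    return "convergent" if agrees_with_rating else "divergent"

print(convergence_category("cannot judge", "congruent"))          # not applicable
print(convergence_category("somewhat agree", "congruent", True))  # convergent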

Credibility

Credibility was examined in three different tests. We established one type of evidence for credibility by classifying the responses given in the open-ended fields into the response types "clarifying statements," "disconfirming statements," "comprehension difficulty," and "cannot judge". In a further examination of credibility, we compared quantitative ratings and open-ended responses to infer whether questions were understood as intended. In a final test of credibility, for the patient questionnaires, observations made during data collection and during visual inspection of the questionnaires were written down in field notes. These field notes were used to ascertain whether questions were properly understood.

We present credibility evidence in three analyses: the analysis of response types, inferences drawn from comparing quantitative (QN) and qualitative (QL) responses, and inferences drawn from the field notes taken during patient data collection.

Response type

The analysis of comments by response type attempts to determine what the respondent is trying to convey. The comments can be classified into four types: clarifying statements, disconfirming statements, statements that express difficulty making a judgment (cannot judge), and statements that express a lack of clarity of the question (comprehension difficulty). Clarifying statements may support the credibility of the quantitative rating by giving an indication of why a quantitative rating was chosen, whereas disconfirming statements may give reason to doubt the quantitative rating. When respondents are unable to judge a question or express difficulty understanding it, this may indicate the need to re-evaluate the question's wording or to provide additional information.

The largest group of comments (13, 34%) were clarifying statements about the responses given in the quantitative rating. For instance, on the question regarding whether there are suitable rooms for interprofessional meetings (Staff, IPO3), one respondent marked the checkbox that he/she "somewhat agrees" and commented that there were "few meetings between doctors and nurses" (C7). In another example, one respondent noted that entries were "not always read by everyone" (C10), regarding whether the "electronic patient record system(s) optimally support(s) collaboration" (Staff, IPO6).

Only 3 comments (8%) provided disconfirming statements, wherein the quantitative rating indicated "cannot judge" but the comment showed that the respondent had in fact made a judgment. For instance, on the question about the percentage of patients for whom a treatment plan is jointly developed by staff of more than two different professions (Staff, IPC2), a respondent commented that "there are very few [cases] where you develop something TOGETHER," demonstrating that the respondent could in fact make a judgement, despite indicating otherwise in the quantitative rating (C2).

Eleven comments (29%) expressed that the respondent could not judge, for instance commenting that the question was "difficult to assess accurately" (C11).

A further 11 comments (29%) expressed comprehension difficulty. For instance, on the question about the percentage of patients for whom a treatment plan is jointly developed by employees of more than two different professions (Staff, IPC1), one respondent commented that he/she "can't say because the question is not very clear (…)" (C1).

QN-QL inference

Drawing inferences by comparing quantitative ratings and qualitative comments can support credibility by providing explanations for why the respondent answered the way he/she did. This analysis can inform how to potentially improve the wording of an item. It can also provide information to support substantive theorising and can even indicate whether the content domain is not adequately captured by the items.

In 3 cases (8%), qualitative comments indicated a discrepancy between the quantitative response and what the question intended to ask (Table 4). For instance, on the question of whether there are enough team meetings for joint discussions (Staff, IPO2), one respondent marked that he/she "fully disagrees," although the respondent's qualitative comment indicated that there are three interprofessional discussions per patient (C6). The respondent went on to comment that "the problem is not the frequency" but the "timing and content." This indicates that the quantitative judgment provided was not in terms of frequency, despite the question asking specifically about frequency.

In another example, regarding whether the electronic patient record system optimally supports collaboration (Staff, IPO6), a respondent marked the checkbox "mostly agree." However, in his/her comment the same respondent noted that the "entries are not always read by everyone involved due to lack of time or knowledge" (C10). The comment suggests that the systems themselves were adequate but that the limiting factor was whether staff had the time and knowledge to read the entries. This indicates that the response did not relate perfectly to the question being asked.

Finally, one respondent answering whether team members know their area of responsibility in patient treatment (Supervisor, IPC6) marked the checkbox "somewhat agree." This respondent went on to comment that there was a "discrepancy between 'knowing something' and 'orienting oneself to it / sticking to it'" (C12). This statement appears to clarify why he/she only "somewhat agrees" and may indicate that he/she was answering the question in terms of whether team members "orient themselves to" or "stick to" their responsibilities. The comment thus indicates that the question may not have been answered as originally intended.

Inferences from patient questionnaires and field notes

We drew qualitative inferences from the patient questionnaires and field notes, focusing on whether respondents had understood the questions as intended, evaluated through the criteria of congruence and credibility. Specifically, we noted whether questions and comments were congruent, i.e., on- or off-topic. We also drew on the field notes to assess whether it could be credibly established that questions had been properly understood. We included all 262 patient comments across 7 items in our analysis and below present two items with particularly illustrative comments (Table 5).

Item PIPC1 asks whether the team members who looked after the patient treated each other with respect. Field observations indicated that one patient had commented that he/she could only see how the staff interacted with each other in the room, but not elsewhere. The field notes further indicated that it was likely difficult for patients to see any interactions outside of the patient's room. The notes also showed that some patients misunderstood the question as enquiring about how the staff treated them. An off-topic remark such as "they explain too little to me as a patient" is an example of a lack of congruence between question and comment. One patient commented that he/she "cannot judge how these people treat each other," which illustrates the difficulty anticipated in the field notes.

PIPC6 is an optional open-ended question that asks the patient what was particularly good about the collaboration between the people looking after him/her. Most comments were off-topic and were variations of statements that "all is well" or expressed an evaluation of the patients' treatment by staff. Some comments were unclear as to who or what was being evaluated, for instance a remark about "the humor that could be felt." The field notes commented that, due to their brevity, comments were sometimes unclear as to who was being referred to.

Deriving instrument adaptations

We based our suggestions for instrument adaptations on the findings from the MM analyses and the qualitative mono-method analysis. Focusing on the criteria of congruence, convergence, and credibility, we explored to what extent adaptations to the existing items were warranted. Three kinds of adaptations to SIPEI were introduced based on the findings: 1) adding a definition, 2) emphasising certain words within a question by underlining them, and 3) reversing the response scale. We present the adaptations proposed for SIPEI by questionnaire and item. A list of suggested adaptations to the items is presented in Table 6.

Adaptations to SIPEI

Staff questionnaire

A definition of the term "treatment plan" should be added to items IPC1 and IPC2, as comments indicated that its meaning was unclear. In IPC11, the words "in an appreciative manner" should be underlined for emphasis, as we found that this aspect was often overlooked.

Supervisor questionnaire

No changes are suggested for the supervisor questionnaire.

Patient questionnaire

The comments made on item PIPC1 indicate that not all patients understood it as intended. The question was often interpreted as asking how the professionals treat the patient, rather than how the professionals treat each other. Thus, the words "each other" should be underlined to emphasise to whom the question relates. To reduce response-set bias, we suggest reversing the response scale such that the negative response options are presented first.

Our study results illustrate the utility of MM for validating a quantitative instrument. These methods provide additional sources of construct validity evidence. We draw upon elements from MM frameworks specifically developed for IV as well as on empirical validation studies using multiple methods and MM. We consolidate our methodological review into three criteria: congruence, convergence, and credibility, with which specific aspects of our data can be evaluated. We add to the instrument validation literature by demonstrating procedures that can be applied to qualitative open-ended questions on their own and in mixed analysis with quantitative ratings. These procedures can serve both as a stand-alone means of collecting evidence of construct validity and as a complement to traditional psychometric evaluation.

Translating frameworks and validation studies into practical methods

Applying elements from MM frameworks in a validation study requires that their high level of abstraction be translated into criteria and procedures that can be applied to data.

We were guided by three validation frameworks in particular. Dellinger and Leech's [ 16 ] framework proposes construct validity as an overarching framework encompassing all types of validity evidence, in accordance with Messick [ 27 ]. This suggests multiple paths to construct validation, which can involve quantitative, qualitative, and mixed methods approaches. Using their validation framework can guide thinking on validation and provides a set of criteria that can guide validation practice. Onwuegbuzie et al.'s [ 15 ] framework proposes specific procedures, which helps to bridge the gap between methodology and validation practice. Adcock and Collier [ 28 ] provide an additional multi-method framework that elaborates the conceptual levels and tasks involved in instrument development. Within these three validation frameworks, however, guidance is often abstract and lacks the vital link between quality criteria and specific mixed analytic procedures.

Validation studies using multiple methods and MM can often provide more practical guidance that is easier to implement, bringing the validation practitioner more quickly to workable procedures. The validation studies by Groenvold et al. [ 42 ] and Waldrip and Fisher [ 41 ] illustrate examples that relied less on deep methodological grounding and instead focused on practical aspects of validation. One of their validation steps involved showing that respondents understood the questions as they were intended.

The shortcomings in the validation frameworks highlight the lack of practical guidance for practitioners who wish to gain deeper insights into an instrument than can be provided by psychometric analysis alone. Given that research projects typically face various practical constraints [ 30 ], a validation study would benefit from deciding early on which data are feasible to collect, which criteria can be evaluated using them, and which procedures need to be applied.

In our study, we were guided by philosophical considerations based on mixed methods validation frameworks as well as mixed methods theory in general, but focused on the procedures for testing congruence, convergence, and credibility.

Advantages of the proposed criteria and assessment procedure

Our analysis shows that evaluating congruence between a quantitative questionnaire item and what a respondent writes in the associated comment box can serve as an indicator that the question was understood as intended. Conversely, incongruence may indicate that a respondent understood an item differently than intended, for instance when comments are off-topic or when it cannot be clearly decided whether a comment is on- or off-topic.

Convergence between a quantitative rating and its associated comment box can serve as an indicator of convergent validity because the qualitative comment confirms what is being stated in the quantitative measure. This bears similarities to Campbell and Fiske's [ 23 ] conceptualisation of convergent validity as the confirmation of findings between two independent quantitative measures.

Credibility, assessed in three different types of analyses, provides a summary evaluation of instrument quality. These analyses support the credibility of the quantitative rating because they may provide indications of why a given response was chosen. Thus, they can serve as indicators that the question was understood as intended [ 54 , 55 ]. This is an important consideration, as the misinterpretation of questions can pose a threat to the accuracy of answers [ 56 ].

Advantages of the proposed criteria include that they are simple to administer and evaluate using a questionnaire, requiring only a comment box next to the rating scale or below the item. Their implementation only marginally increases questionnaire completion time, as respondents who wish to write something can do so, while others can simply skip the comments. The procedure allows respondents to comment on and clarify their responses for each item. The ease of data collection and the simple analytic procedures allow the proposed mixed methods validation to be scaled to large samples more easily than cognitive interviewing. Thus, the proposed criteria and their procedures can complement cognitive interviewing and, through the larger sample size, may reveal quality issues that were missed in cognitive interviewing. This makes the procedures particularly useful for new instruments being pretested or undergoing their first psychometric validation.

Disadvantages of implementing the criteria may include the need to adjust the questionnaire layout. Implementation also requires that the comments be interpretable; a lack of clarity in open questionnaire comments is a common issue in survey research and needs to be anticipated as a potential data issue. Because mixed analysis involves qualitative analysis, the criteria may not have clear cut-offs. Thus, even when applying the analytic procedures to establish credibility, for instance, the decision of whether a questionnaire answer is credible remains a judgment call to be made by the researcher.

We encountered item non-response for the comments as a particularly prevalent issue in our study. Andrews [ 57 ] found that item non-response may be a greater issue for open-ended questions than for closed-ended ones. He also found that dissatisfied employees or customers are more likely to respond to open-ended questions and to use comment boxes to vent their frustrations. This may explain some off-topic comments gathered in our study which expressed criticism but did not directly relate to the question being asked. We also found respondents contradicting themselves in their quantitative response and comment. Contradictory statements from the same respondent within the same questionnaire have previously been found in hospital patient surveys [ 58 ]. It was suggested that this does not imply the question was misunderstood, but rather that patients may have negative comments to make about topics that were not part of the questionnaire, or that patients have negative comments but do not adjust their quantitative ratings. Although we administered general comment questions, which are more likely to be answered than explanation-seeking questions [ 59 ], the cognitive effort required by our open-ended request for comment may have increased the non-response rate [ 60 ]. For instance, it is possible that the cognitive effort to produce a response was high due to the request's lack of specificity. Another explanation might be that the phrase "Please enter your comment," rather than a question about whether the respondent had "any thoughts," raised the barrier to providing a response, because a request to enter "a comment" may easily be interpreted as asking respondents to write something only if they have something to say to the researchers. The generic request for a comment may also have made providing one appear less binding. To address these possible reasons for item non-response in the comment boxes, we propose rephrasing the request for comment as follows: "Do you have any other thoughts on the question you just answered? Please let us know!"

We have highlighted the additional data, analysis, and complexity involved in a mixed methods validation, which may help to explain why uptake among instrument developers has been modest so far. The lack of easy-to-follow procedures and the many different, ambiguous quality concepts likely make a mixed methods validation more daunting to attempt than a standard psychometric evaluation. This paper highlights some simple analytic procedures that require little additional data, which may help address some of the issues keeping practitioners from using mixed methods for validation.

Minor adaptations in preparation for future data collection

Our analyses of the qualitative comments, alone and in mixed analysis with the quantitative data, suggest that the questions of SIPEI were mostly understood as intended. Accordingly, adaptations to SIPEI were suggested sparingly. The adaptations focused on making questions clearer by adding definitions [ 56 ] and underlining keywords to emphasise key aspects [ 61 , 62 ]. These changes are unlikely to fundamentally change the instrument's psychometric properties but should help to reduce unwanted variability [ 48 ]. This has provided a refined instrument that can be retested in further psychometric evaluation.

Limitations

Traditional psychometric analyses were not within the scope of this paper; thus, the SIPEI instrument's performance cannot be judged based on the information presented. The mixed methods validation analyses were constrained by missing responses in the qualitative comments, which limited the ability to show convergence. Furthermore, the questionnaires collected were predominantly German-language questionnaires; we collected only 15 French- and 34 Italian-language questionnaires due to the limited hospital access imposed by the COVID-19 pandemic, which reduced the evidence for the French and Italian versions of the questionnaire. In addition, samples were obtained from a limited set of participating hospitals. Data collection spanned only two months for the patient survey and three months for the staff survey, limiting the number of questionnaires that could be obtained; more questionnaires could likely have been collected given a longer data collection period. Our analyses relied on qualitative data from comment boxes and field notes, and it is probable that a more expansive data collection strategy, for instance through additional cognitive interviews or focus groups, would have yielded more depth and breadth of data. Finally, no explicit instructions were given on which comments were expected in the comment boxes. This likely broadened the variety of comments and reduced the convergent validity evidence that could have been collected.

MM approaches can provide insights into an instrument's quality and can be used on their own or in conjunction with traditional quantitative psychometric approaches to establish evidence of construct validity. Our approach suggests procedures and criteria that are closer to the empirical data and provides practical examples of how the criteria of congruence, convergence, and credibility can be applied to collect construct validity evidence. This can provide research teams constrained by time, budget, and limited data with an avenue for enriching an IV through MM without necessarily requiring more data.

Acknowledgements

We thank the Swiss Federal Office of Public Health for supporting this research. We also thank Bern University of Applied Sciences for supporting this publication.

Abbreviations

IPC: Interprofessional collaboration; IPE: Interprofessional education; IPO: Interprofessional organisation; IV: Instrument validation; MM: Mixed methods; QL: Qualitative; QN: Quantitative; SIPEI: Swiss Instrument for Evaluating Interprofessional Collaboration (Schweizerisches InterProfessionalitäts-Evaluations-Instrumentarium); FOPH: Swiss Federal Office of Public Health

Authors' contributions

KUS served as project lead. FG served as project coordinator. SHa and AB served on the project advisory board. The study design was developed by KUS, FG, and SHu. FW, FN, and SHu developed the SIPEI instrument. AB recruited participants for questionnaire testing. JAG and KU developed the questions on education, occupation, and institutional/organisational affiliation. JAG designed and tested the online questionnaires. Electronic survey data collection was conducted by FG, KU, and JAG. KU conducted the in-person patient data collection and wrote the field notes. JAG conducted the data analysis. KU reviewed the patient data analysis. JAG drafted the final manuscript and implemented all feedback. All authors reviewed and provided critical feedback on the manuscript. AB proofread the manuscript. All authors read and approved the final manuscript.

Funding

This research was partly funded through the support program "Interprofessionality in healthcare 2017–2020" of the Swiss Federal Office of Public Health (FOPH), which aimed to improve interprofessional cooperation within the healthcare system in order to increase its efficiency [ 63 ]. The FOPH did not participate in the development of the study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Availability of data and materials

Declarations

The project was submitted to the cantonal ethics committees of Bern, Ticino, and Vaud for review and was found to not require ethical approval (BASEC-No.: Req-2019–00731), as the study does not fall within the scope of the Swiss Federal Human Research Act (SR 810.30). Written informed consent was obtained from patients at admission to use their survey data for research purposes. In addition, when approached by survey staff prior to patient survey data collection, patients were verbally informed of the study objectives and their right to decline to participate and discontinue at any time. All procedures followed prescribed ethical guidelines and regulations.

Consent for publication is not applicable to this study.

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Jean Anthony Grand-Guillaume-Perrenoud, Email: [email protected] .

Franziska Geese, Email: [email protected] .

Katja Uhlmann, Email: [email protected] .

Angela Blasimann, Email: [email protected] .

Felicitas L. Wagner, Email: [email protected] .

Florian B. Neubauer, Email: [email protected] .

Sören Huwendiek, Email: [email protected] .

Sabine Hahn, Email: [email protected] .

Kai-Uwe Schmitt, Email: [email protected] .

December 20, 2022

Auditing Analytical Method Validation in Pharma

The aim of validating any process is to ensure that it achieves its intended purpose with adequate accuracy and precision. The purpose of analytical procedures or methods is to measure certain attributes of the test article, whether a drug substance or a final product. This may include, among others, measuring the content (assay) or the amounts of certain undesirable components such as related substances, degradation products, or residual solvents. Regulatory authorities and auditors are increasingly focused on ensuring that all analytical methods used within the pharmaceutical industry are validated and that validation work complies with general GMP rules, as for any other process within the industry. In other words, validating analytical methods should follow structured, predefined procedures from start to finish. In this article, we discuss auditing analytical method validation as a critical industry process and answer the question: what do auditors look for when reviewing this process?

Validation master document

Auditors will initially examine whether master documents exist that describe analytical method validation as a process. This document (or documents) should adhere to general GMP rules and provide guidance to responsible personnel on the required analytical validation studies and the acceptance criteria as per ICH guidelines. The major sections of such document(s) should cover the following key points:

  • An overview of the validation studies to be conducted, their significance, and the acceptance criteria for each, aligned with ICH guidelines and general chapters from the various compendia.
  • A section describing the process of generating validation protocols, the types of validation studies to be included based on the type of method being validated, and who is responsible for approving the protocol.
  • A section dealing with Out of Limit (OoL) results obtained during validation work. It should detail how to investigate the root cause behind such results, assess their impact on the validation work, and document them in the final report.

Validation protocols

Analytical method validation is conducted according to a documented methodology, and the results should comply with predefined acceptance criteria. Both the methodology and the acceptance criteria are set out in protocols that should be approved before validation work begins.

Validation protocols are critical to the validation process and attract auditors' attention. Auditors initially examine the rationale, which should be stated in the protocol, for using the chosen analytical technique for the intended purpose. For instance, the choice of HPLC rather than simple TLC for the quantitative determination of related substances is backed by the ability of the technique to separate complex mixtures. Auditors then assess whether the validation studies included within the protocol are adequate to prove the suitability of the method for its purpose. For example, the validation of an HPLC method for assay tests must include experiments that demonstrate the specificity, accuracy, precision, linearity of response, and robustness of the method towards changes in the testing environment. For each of these studies, enough data points or measurements should be performed to comply with ICH guidelines, and any calculations or statistical analyses to be carried out on these results should be specified in the protocol.

The analyte concentration levels at which the validation studies are conducted must be aligned with the limits of the impurities to be tested or with the purpose of the method. For instance, if the assay procedure is used for testing the dissolution of tablets or capsules, then the range of the validation studies must cover the expected range of results from such a test, i.e., from 0 to 120% of the labeled claim. Finally, the acceptance criteria and their justification should be clearly defined in the protocol.

Validation analytical work

Generally, any validation-related work should adhere to broad GMP/GDP rules and to other site procedures relevant to the testing technique. Auditors will also examine the adherence of the work conducted to the methodology stated in the protocol and how any Out of Limit (OoL) results were captured, investigated, and documented. Auditors may specifically request examples of validation-related OoL results to examine how they were investigated to identify the root cause and the impact on the validation work, as well as on the final methodology. Any OoL results and the related investigations should be addressed as per the validation master document(s).

Validation reports

Auditors will examine the agreement of the validation report with the protocol. They will ensure that all results of the validation studies specified in the protocol are documented in the final report, together with any statistical analysis conducted on these results, such as regression analysis as proof of linearity, or standard deviations and confidence intervals for determining precision.
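As an illustration of the kind of statistical evidence auditors expect to find in a report, the sketch below computes a least-squares fit for linearity and the relative standard deviation (RSD) for precision. The data and the acceptance thresholds in the final checks are illustrative placeholders, not values taken from ICH guidelines or from any specific protocol.

# Minimal sketch: illustrative linearity and precision calculations.
from statistics import mean, stdev

def linearity(conc, response):
    """Ordinary least-squares fit: returns slope, intercept and r^2."""
    x_bar, y_bar = mean(conc), mean(response)
    sxx = sum((x - x_bar) ** 2 for x in conc)
    syy = sum((y - y_bar) ** 2 for y in response)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(conc, response))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

def precision_rsd(replicates):
    """Relative standard deviation (%) of replicate measurements."""
    return 100 * stdev(replicates) / mean(replicates)

# Hypothetical linearity data covering 0-120% of label claim (dissolution example)
conc = [0, 20, 40, 60, 80, 100, 120]                 # % of label claim
resp = [0.1, 19.8, 40.5, 59.9, 80.3, 99.7, 120.4]    # instrument response

slope, intercept, r2 = linearity(conc, resp)
rsd = precision_rsd([99.2, 100.1, 99.7, 100.4, 99.9, 100.2])  # six replicate assays

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r^2={r2:.5f}")
print(f"precision RSD={rsd:.2f}%")

# Example acceptance checks; thresholds are placeholders, not regulatory values
assert r2 >= 0.995, "linearity outside acceptance criterion"
assert rsd <= 2.0, "precision outside acceptance criterion"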

The final report should also include a section discussing any OoL results and the investigation work conducted on them. It should provide a detailed explanation of the root cause found, and the impact of the OoL on the intended purpose of the method should be stated in the report.

For instance, failure to demonstrate an acceptable linear relationship between the test response and the analyte concentration may lead to a change in the purpose of the analytical method, turning it into a semi-quantitative or limit test. Similarly, the presence of significant matrix interference may lead to adopting a standard addition procedure rather than relying on external standards for measurement. Auditors will carefully examine the scientific rationale behind such changes.

Finally, the validation report should include the final methodology resulting from the findings of the validation work. The method should include details of any solutions or preparations to be employed during routine analysis to assure the performance of the method, such as resolution and/or sensitivity solutions for chromatographic methods, in addition to any instructions identified during the validation work that would improve the outcome of the method or prevent failure of method performance criteria such as reproducibility or resolution.

To Summarize

Auditors adopt a comprehensive approach when auditing the process of analytical method validation rather than focusing only on the final report. They aim to ensure that the entire process follows GMP rules and a defined system that will capture any deviations, so that validated methods can serve their purpose accurately and efficiently.
