Random Assignment in Experiments | Introduction & Examples

Published on March 8, 2021 by Pritha Bhandari. Revised on June 22, 2023.

In experimental research, random assignment is a way of placing participants from your sample into different treatment groups using randomization.

With simple random assignment, every member of the sample has a known or equal chance of being placed in a control group or an experimental group. Studies that use simple random assignment are also called completely randomized designs.

Random assignment is a key part of experimental design. It helps you ensure that all groups are comparable at the start of a study: any differences between them are due to random factors, not research biases like sampling bias or selection bias.

Table of contents

  • Why does random assignment matter?
  • Random sampling vs random assignment
  • How do you use random assignment?
  • When is random assignment not used?
  • Other interesting articles
  • Frequently asked questions about random assignment

Random assignment is an important part of control in experimental research, because it helps strengthen the internal validity of an experiment and avoid biases.

In experiments, researchers manipulate an independent variable to assess its effect on a dependent variable, while controlling for other variables. To do so, they often use different levels of an independent variable for different groups of participants.

This is called a between-groups or independent measures design.

For example, suppose you're testing the effects of a medication at different dosages. You use three groups of participants, each given a different level of the independent variable:

  • a control group that’s given a placebo (no dosage, to control for a placebo effect),
  • an experimental group that’s given a low dosage,
  • a second experimental group that’s given a high dosage.

Random assignment helps you make sure that the treatment groups don’t differ in systematic ways at the start of the experiment, as such differences can seriously affect (and even invalidate) your work.

If you don’t use random assignment, you may not be able to rule out alternative explanations for your results. For example, suppose participants are assigned to groups based on where they were recruited:

  • participants recruited from cafes are placed in the control group,
  • participants recruited from local community centers are placed in the low dosage experimental group,
  • participants recruited from gyms are placed in the high dosage group.

With this type of assignment, it’s hard to tell whether the participant characteristics are the same across all groups at the start of the study. Gym-users may tend to engage in more healthy behaviors than people who frequent cafes or community centers, and this would introduce a healthy user bias in your study.

Although random assignment helps even out baseline differences between groups, it doesn’t always make them completely equivalent. There may still be extraneous variables that differ between groups, and there will always be some group differences that arise from chance.

Most of the time, the random variation between groups is low, and, therefore, it’s acceptable for further analysis. This is especially true when you have a large sample. In general, you should always use random assignment in experiments when it is ethically possible and makes sense for your study topic.


Random sampling and random assignment are both important concepts in research, but it’s important to understand the difference between them.

Random sampling (also called probability sampling or random selection) is a way of selecting members of a population to be included in your study. In contrast, random assignment is a way of sorting the sample participants into control and experimental groups.

While random sampling is used in many types of studies, random assignment is only used in between-subjects experimental designs.

Some studies use both random sampling and random assignment, while others use only one or the other.

Random sample vs random assignment

Random sampling enhances the external validity or generalizability of your results, because it helps ensure that your sample is unbiased and representative of the whole population. This allows you to make stronger statistical inferences.

For example, suppose you're studying the effects of a team-building intervention at a large company. You use a simple random sample to collect data. Because you have access to the whole population (all employees), you can assign all 8000 employees a number and use a random number generator to select 300 employees. These 300 employees are your full sample.

Random assignment enhances the internal validity of the study, because it ensures that there are no systematic differences between the participants in each group. This helps you conclude that the outcomes can be attributed to the independent variable.

You then divide your sample of 300 employees into two groups:

  • a control group that receives no intervention.
  • an experimental group that has a remote team-building intervention every week for a month.

You use random assignment to place participants into the control or experimental group. To do so, you take your list of participants and assign each participant a number. Again, you use a random number generator to place each participant in one of the two groups.

To use simple random assignment, you start by giving every member of the sample a unique number. Then, you can use computer programs or manual methods to randomly assign each participant to a group.

  • Random number generator: Use a computer program to generate random numbers from the list for each group.
  • Lottery method: Place all numbers individually in a hat or a bucket, and draw numbers at random for each group.
  • Flip a coin: When you only have two groups, for each number on the list, flip a coin to decide if they’ll be in the control or the experimental group.
  • Roll a die: When you have three groups, for each number on the list, roll a die to decide which group they will be in. For example, rolling 1 or 2 lands them in the control group, 3 or 4 in the first experimental group, and 5 or 6 in the second experimental group.

This type of random assignment is the most powerful method of placing participants in conditions, because each individual has an equal chance of being placed in any one of your treatment groups.
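If you use a computer program, the whole procedure takes only a few lines of code. The sketch below is a minimal illustration in Python, assuming 30 hypothetical participants and the three dosage groups from the earlier example; shuffling the numbered list and splitting it into equal slices gives every participant an equal chance of landing in any group.

```python
# A minimal sketch of simple random assignment (hypothetical participant IDs).
import random

random.seed(42)  # fixing the seed makes the allocation reproducible for auditing

participants = [f"P{i:03d}" for i in range(1, 31)]  # 30 numbered participants
groups = ["control", "low dosage", "high dosage"]

shuffled = participants[:]
random.shuffle(shuffled)  # randomize the order of the numbered list

group_size = len(shuffled) // len(groups)
assignment = {
    group: shuffled[i * group_size:(i + 1) * group_size]
    for i, group in enumerate(groups)
}

for group, members in assignment.items():
    print(group, members)
```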

Random assignment in block designs

In more complicated experimental designs, random assignment is only used after participants are grouped into blocks based on some characteristic (e.g., test score or demographic variable). These groupings mean that you need a larger sample to achieve high statistical power.

For example, a randomized block design involves placing participants into blocks based on a shared characteristic (e.g., college students versus graduates), and then using random assignment within each block to assign participants to every treatment condition. This helps you assess whether the characteristic affects the outcomes of your treatment.

In an experimental matched design, you use blocking and then match up individual participants from each block based on specific characteristics. Within each matched pair or group, you randomly assign each participant to one of the conditions in the experiment and compare their outcomes.
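As a rough illustration, the sketch below shows how random assignment within blocks might be carried out in Python. The blocking characteristic (college students versus graduates) comes from the example above; the participant IDs and condition labels are invented for illustration.

```python
# A sketch of random assignment within blocks (randomized block design).
import random
from collections import defaultdict

random.seed(7)

# (participant ID, blocking characteristic) - illustrative data only
sample = [("P01", "college student"), ("P02", "graduate"), ("P03", "college student"),
          ("P04", "graduate"), ("P05", "college student"), ("P06", "graduate"),
          ("P07", "college student"), ("P08", "graduate")]
conditions = ["control", "treatment"]

# Group participants into blocks by the shared characteristic
blocks = defaultdict(list)
for pid, characteristic in sample:
    blocks[characteristic].append(pid)

# Within each block, shuffle and deal participants to the conditions in turn
assignment = {}
for characteristic, members in blocks.items():
    random.shuffle(members)
    for i, pid in enumerate(members):
        assignment[pid] = conditions[i % len(conditions)]

for pid in sorted(assignment):
    print(pid, assignment[pid])
```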

Sometimes, it’s not relevant or ethical to use simple random assignment, so groups are assigned in a different way.

When comparing different groups

Sometimes, differences between participants are the main focus of a study, for example, when comparing men and women or people with and without health conditions. Participants are not randomly assigned to different groups, but instead assigned based on their characteristics.

In this type of study, the characteristic of interest (e.g., gender) is an independent variable, and the groups differ based on the different levels (e.g., men, women, etc.). All participants are tested the same way, and then their group-level outcomes are compared.

When it’s not ethically permissible

When studying unhealthy or dangerous behaviors, it’s often not ethical to use random assignment. For example, if you’re studying heavy drinkers and social drinkers, it would be unethical to randomly assign participants to one of the two groups and ask them to drink large amounts of alcohol for your experiment.

When you can’t assign participants to groups, you can also conduct a quasi-experimental study. In a quasi-experiment, you study the outcomes of pre-existing groups who receive treatments that you may not have any control over (e.g., heavy drinkers and social drinkers). These groups aren’t randomly assigned, but may be considered comparable when some other variables (e.g., age or socioeconomic status) are controlled for.


If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Student’s t-distribution
  • Normal distribution
  • Null and Alternative Hypotheses
  • Chi square tests
  • Confidence interval
  • Quartiles & Quantiles
  • Cluster sampling
  • Stratified sampling
  • Data cleansing
  • Reproducibility vs Replicability
  • Peer review
  • Prospective cohort study

Research bias

  • Implicit bias
  • Cognitive bias
  • Placebo effect
  • Hawthorne effect
  • Hindsight bias
  • Affect heuristic
  • Social desirability bias

In experimental research, random assignment is a way of placing participants from your sample into different groups using randomization. With this method, every member of the sample has a known or equal chance of being placed in a control group or an experimental group.

Random selection, or random sampling, is a way of selecting members of a population for your study’s sample.

In contrast, random assignment is a way of sorting the sample into control and experimental groups.

Random sampling enhances the external validity or generalizability of your results, while random assignment improves the internal validity of your study.

Random assignment is used in experiments with a between-groups or independent measures design. In this research design, there’s usually a control group and one or more experimental groups. Random assignment helps ensure that the groups are comparable.

In general, you should always use random assignment in this type of experimental design when it is ethically possible and makes sense for your study topic.

To implement random assignment, assign a unique number to every member of your study’s sample.

Then, you can use a random number generator or a lottery method to randomly assign each number to a control or experimental group. You can also do so manually, by flipping a coin or rolling a die to randomly assign participants to groups.



1.4.2 - Causal Conclusions

In order to control for confounding variables, participants can be randomly assigned to different levels of the explanatory variable. This act of randomly assigning cases to different levels of the explanatory variable is known as randomization. An experiment that involves randomization may be referred to as a randomized experiment or randomized comparative experiment. By randomly assigning cases to different conditions, a causal conclusion can be made; in other words, we can say that differences in the response variable are caused by differences in the explanatory variable. Without randomization, an association can be noted, but a causal conclusion cannot be made.

Note that randomization and random sampling are different concepts. Randomization refers to the random assignment of experimental units to different conditions (e.g., different treatment groups). Random sampling refers to probability-based methods for selecting a sample from a population.

Example: Fitness Programs

Two teams have designed research studies to compare the weight loss of participants in two different fitness programs. Each team used a different research study design.

The first team surveyed people who already participate in each program. This is an observational study, which means there is no randomization. Each group comprises participants who made the personal decision to engage in that fitness program. With this research study design, the researchers can only determine whether or not there is an association between the fitness program and participants' weight loss. A causal conclusion cannot be made because there may be confounding variables. The people in the two groups may be different in some key ways. For example, if the cost of the two programs is different, the two groups may differ in terms of their finances.

The second team of researchers obtained a sample of participants and randomly assigned half to participate in the first fitness program and half to participate in the second fitness program. They measured each participant's weight twice: once at the beginning and once at the end of the study. This is a randomized experiment because the researchers randomly assigned each participant to one of the two programs. Because participants were randomly assigned to groups, the groups should be balanced in terms of any confounding variables, and a causal conclusion may be drawn from this study.

Random Assignment in Psychology: Definition & Examples


In psychology, random assignment refers to the practice of allocating participants to different experimental groups in a study in a completely unbiased way, ensuring each participant has an equal chance of being assigned to any group.

In experimental research, random assignment, or random placement, organizes participants from your sample into different groups using randomization. 

Random assignment uses chance procedures to ensure that each participant has an equal opportunity of being assigned to either a control or experimental group.

The control group does not receive the treatment in question, whereas the experimental group does receive the treatment.

When using random assignment, neither the researcher nor the participant can choose the group to which the participant is assigned. This ensures that any differences between and within the groups are not systematic at the onset of the study. 

In a study to test the success of a weight-loss program, investigators randomly assigned a pool of participants to one of two groups.

Group A participants participated in the weight-loss program for 10 weeks and took a class where they learned about the benefits of healthy eating and exercise.

Group B participants read a 200-page book that explains the benefits of weight loss.

The researchers found that those who participated in the program and took the class were more likely to lose weight than those in the other group that received only the book.

Importance 

Random assignment helps ensure that the groups in an experiment are comparable before the independent variable is applied.

In experiments, researchers will manipulate an independent variable to assess its effect on a dependent variable, while controlling for other variables. Random assignment increases the likelihood that the treatment groups are the same at the onset of a study.

Thus, any changes that result from the independent variable can be assumed to be a result of the treatment of interest. This is particularly important for eliminating sources of bias and strengthening the internal validity of an experiment.

Random assignment is the best method for inferring a causal relationship between a treatment and an outcome.

Random Selection vs. Random Assignment 

Random selection (also called probability sampling or random sampling) is a way of randomly selecting members of a population to be included in your study.

On the other hand, random assignment is a way of sorting the sample participants into control and treatment groups. 

Random selection ensures that everyone in the population has an equal chance of being selected for the study. Once the pool of participants has been chosen, experimenters use random assignment to assign participants into groups. 

Random assignment is only used in between-subjects experimental designs, while random selection can be used in a variety of study designs.

Random Assignment vs Random Sampling

Random sampling refers to selecting participants from a population so that each individual has an equal chance of being chosen. This method enhances the representativeness of the sample.

Random assignment, on the other hand, is used in experimental designs once participants are selected. It involves allocating these participants to different experimental groups or conditions randomly.

This helps ensure that any differences in results across groups are due to manipulating the independent variable, not preexisting differences among participants.

When to Use Random Assignment

Random assignment is used in experiments with a between-groups or independent measures design.

In these research designs, researchers will manipulate an independent variable to assess its effect on a dependent variable, while controlling for other variables.

There is usually a control group and one or more experimental groups. Random assignment helps ensure that the groups are comparable at the onset of the study.

How to Use Random Assignment

There are a variety of ways to assign participants into study groups randomly. Here are a handful of popular methods: 

  • Random Number Generator: Give each member of the sample a unique number; use a computer program to randomly generate a number from the list for each group.
  • Lottery: Give each member of the sample a unique number. Place all numbers in a hat or bucket and draw numbers at random for each group.
  • Flipping a Coin: Flip a coin for each participant to decide if they will be in the control group or experimental group (this method can only be used when you have just two groups).
  • Roll a Die: For each number on the list, roll a die to decide which group they will be in. For example, assume that rolling 1, 2, or 3 places them in the control group and rolling 4, 5, or 6 places them in the experimental group.
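As a quick illustration of the coin-flip method, the Python sketch below assigns a hypothetical sample of 20 participants to two groups. Note that independent coin flips can leave the groups unequal in size by chance, which is one reason researchers often shuffle a fixed list instead.

```python
# A sketch of coin-flip assignment to two groups (hypothetical sample).
import random

random.seed(1)
participants = [f"P{i:02d}" for i in range(1, 21)]

# "Heads" (random value below 0.5) sends a participant to the experimental group
assignment = {
    pid: ("experimental" if random.random() < 0.5 else "control")
    for pid in participants
}

sizes = {"control": 0, "experimental": 0}
for group in assignment.values():
    sizes[group] += 1

print(assignment)
print(sizes)  # group sizes will usually drift somewhat from an even 10/10 split
```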

When is Random Assignment not used?

  • When it is not ethically permissible: Randomization is only ethical if the researcher has no evidence that one treatment is superior to the other or that one treatment might have harmful side effects. 
  • When answering non-causal questions : If the researcher is just interested in predicting the probability of an event, the causal relationship between the variables is not important and observational designs would be more suitable than random assignment. 
  • When studying the effect of variables that cannot be manipulated: Some risk factors cannot be manipulated and so it would not make any sense to study them in a randomized trial. For example, we cannot randomly assign participants into categories based on age, gender, or genetic factors.

Drawbacks of Random Assignment

While randomization assures an unbiased assignment of participants to groups, it does not guarantee the equality of these groups. There could still be extraneous variables that differ between groups or group differences that arise from chance. Additionally, there is still an element of luck with random assignments.

Thus, researchers cannot produce perfectly equal groups for any specific study. Differences between the treatment group and control group might still exist, and the results of a randomized trial may sometimes be wrong, but this is expected.

Scientific evidence is a long and continuous process, and the groups will tend to be equal in the long run when data is aggregated in a meta-analysis.

Additionally, external validity (i.e., the extent to which the researcher can use the results of the study to generalize to the larger population) is compromised with random assignment.

Random assignment is challenging to implement outside of controlled laboratory conditions and might not represent what would happen in the real world at the population level. 

Random assignment can also be more costly than simple observational studies, where an investigator is just observing events without intervening with the population.

Randomization also can be time-consuming and challenging, especially when participants refuse to receive the assigned treatment or do not adhere to recommendations. 

What is the difference between random sampling and random assignment?

Random sampling refers to randomly selecting a sample of participants from a population. Random assignment refers to randomly assigning participants to treatment groups from the selected sample.

Does random assignment increase internal validity?

Yes, random assignment ensures that there are no systematic differences between the participants in each group, enhancing the study’s internal validity.

Does random assignment reduce sampling error?

Yes, with random assignment, participants have an equal chance of being assigned to either a control group or an experimental group, resulting in a sample that is, in theory, representative of the population.

Random assignment does not completely eliminate sampling error because a sample only approximates the population from which it is drawn. However, random sampling is a way to minimize sampling errors. 

When is random assignment not possible?

Random assignment is not possible when the experimenters cannot control the treatment or independent variable.

For example, if you want to compare how men and women perform on a test, you cannot randomly assign subjects to these groups.

Participants are not randomly assigned to different groups in this study, but instead assigned based on their characteristics.

Does random assignment eliminate confounding variables?

Random assignment reduces the influence of confounding variables by distributing them at random among the study groups, so there is no systematic relationship between a confounding variable and the treatment. It does not, however, guarantee that the groups are perfectly balanced in any single study.

Why is random assignment of participants to treatment conditions in an experiment used?

Random assignment is used to ensure that all groups are comparable at the start of a study. This allows researchers to conclude that the outcomes of the study can be attributed to the intervention at hand and to rule out alternative explanations for study results.

Further Reading

  • Bogomolnaia, A., & Moulin, H. (2001). A new solution to the random assignment problem. Journal of Economic Theory, 100(2), 295-328.
  • Krause, M. S., & Howard, K. I. (2003). What random assignment does and does not do. Journal of Clinical Psychology, 59(7), 751-766.


Chapter 16: Causal-Comparative Research (from How to Design and Evaluate Research in Education, 8th edition)

Commentary on Causal Prescriptive Statements

Causal prescriptive statements are necessary in the social sciences whenever there is a mission to help individuals, groups, or organizations improve. Researchers inquire whether some variable or intervention A causes an improvement in some mental, emotional, or behavioural variable B. If they are satisfied that A causes B, then they can take steps to manipulate A in the real world and thereby help people by enhancing B.

In Part 4, we begin a more detailed discussion of some of the methodologies that educational researchers use. We concentrate here on quantitative research, with a separate chapter devoted to group-comparison experimental research, single-subject experimental research, correlational research, causal-comparative research, and survey research. In each chapter, we not only discuss the method in some detail, but we also provide examples of published studies in which the researchers used one of these methods. We conclude each chapter with an analysis of a particular study's strengths and weaknesses.



Causal Comparative Research: Methods And Examples


Ritu was in charge of marketing a new protein drink about to be launched. The client wanted a causal-comparative study highlighting the drink’s benefits. They demanded that comparative analysis be made the main campaign design strategy. After carefully analyzing the project requirements, Ritu decided to follow a causal-comparative research design. She realized that causal-comparative research emphasizing physical development in different groups of people would lay a good foundation to establish the product.

What Is Causal Comparative Research?


Causal-comparative research is a method used to identify the cause-effect relationship between a dependent and an independent variable. The relationship is usually a suggested one, because the independent variable can't be fully controlled. Unlike correlational research, it compares groups rather than simply measuring the strength of a relationship between variables. In a causal-comparative research design, the researcher compares two groups to find out whether the independent variable affected the outcome, the dependent variable.

A causal-comparative method determines whether one variable has a direct influence on another and why. It identifies the causes of certain occurrences (or non-occurrences). Because the independent variable has already occurred, the study is descriptive rather than experimental: the researcher scrutinizes the relationships among existing variables. When variables can't be manipulated, a link between the dependent and independent variables is still established, and the implications of possible causes are used to draw conclusions.

In a causal-comparative design, researchers study cause and effect in retrospect and determine consequences or causes of differences already existing among or between groups of people.

Let’s look at some characteristics of causal-comparative research:

  • This method tries to identify cause and effect relationships.
  • Two or more groups are included as variables.
  • Individuals aren’t selected randomly.
  • Independent variables can’t be manipulated.
  • It helps save time and money.

The main purpose of a causal-comparative study is to explore effects, consequences and causes. There are two types of causal-comparative research design. They are:

Retrospective Causal Comparative Research

For this type of research, a researcher has to investigate a particular question after the effects have occurred. They attempt to determine whether or not a variable influences another variable.

Prospective Causal Comparative Research

The researcher initiates the study with the causes and aims to analyze their effects under a given condition. This is not as common as retrospective causal-comparative research.

Usually, it’s easier to compare a variable with the known than the unknown.

Researchers use causal-comparative research to achieve research goals by comparing two variables that represent two groups. This data can include differences in opportunities, privileges exclusive to certain groups or developments with respect to gender, race, nationality or ability.

For example, to find out the difference in wages between men and women, researchers have to make a comparative study of wages earned by both genders across various professions, hierarchies and locations. None of the variables can be influenced, and the cause-effect relationship has to be established with a persuasive logical argument. Some common variables investigated in this type of research are:

  • Achievement and other ability variables
  • Family-related variables
  • Organismic variables such as age, sex and ethnicity
  • Variables related to schools
  • Personality variables

Raw test scores, assessments, and other measures (such as grade point averages) serve as data in this research, while standardized tests, structured interviews, and surveys are popular data collection tools.

However, there are drawbacks of causal-comparative research too, such as its inability to manipulate or control an independent variable and the lack of randomization. Subject-selection bias always remains a possibility and poses a threat to the internal validity of a study. Researchers can control it with statistical matching or by creating identical subgroups. Executives have to look out for loss of subjects, location influences, poor attitude of subjects and testing threats to produce a valid research study.



Step 6a: Determining Research Methodology - Quantitative Research Methods

Quantitative research offers a few designs to choose from, mostly rooted in the postpositivist worldview: the experimental design, the quasi-experimental design, and the single-subject experimental design (Bloomfield & Fisher, 2019; Creswell & Creswell, 2018). Single-subject (or applied behavioral analysis) designs administer an experimental treatment to a person or small group of people over an extended period of time. Quasi-experimental designs include two subcategories: causal-comparative design and correlational design. Causal-comparative research allows the investigator to compare two or more groups in terms of a treatment that has already happened. In correlational design, the researcher examines the relationship between variables or sets of scores (Bloomfield & Fisher, 2019; Creswell & Creswell, 2018).

Generally, these kinds of designs fall into two categories: survey research and experimental research. Survey research provides a quantitative (numerical) description of the trends, attitudes, or opinions of a population by examining a sample of that population, using questionnaires or structured interviews for data collection (Fowler, 2008; Fowler, 2014; Bloomfield & Fisher, 2019; Creswell & Creswell, 2018). These studies can be cross-sectional or longitudinal. Ultimately, the goal is to analyze the data and have the findings be generalizable to the entire population.

Experimental research uses the scientific method to determine if a specific treatment influences an outcome. This design requires random assignment of treatment conditions; the quasi-experimental and single-subject versions use nonrandomized assignment of treatment (Bloomfield & Fisher, 2019).

Survey Methods

Survey research methods are widely used and follow a standard format.  Examining survey research in scholarly journals would be a great way to familiarize yourself with the format and determine how to do it and, more importantly, if this method is right for your research.

How do you prepare to do survey research? Creswell and Creswell (2018), as well as Fowler (2014), have provided a basic framework for the rationale of survey research to consider as you decide what kind of methods you will employ to conduct your inquiry.

  • Identify the purpose of your survey research- what variables interest you? This means start sketching out a purpose statement such as “ The primary purpose of this study is to empirically evaluate whether the number of overtime hours predicts subsequent burnout symptoms in emergency room nurses” (Creswell & Creswell, 2018, p. 149).
  • Write out why a survey method is the appropriate kind of approach for your study. It may be beneficial to discuss the advantages of survey research and the disadvantages of other methods.
  • Decide whether the survey will be cross-sectional or longitudinal. Meaning, will you gather the data at the same time  or collect it over time?
  • How will the data be collected, meaning how will the survey be filled out? Mail, phone, internet, structured interviews? Please provide the rationale for your choice.
  • Discuss your population and sampling - who is the target population? What is the size? Who are they in terms of demographic information? How do you plan to identify individuals in this population? Random sampling or systematic sampling, and what is the rationale behind your choice? Aim for a sampling fraction of the population that is typical of past studies conducted on this topic.
  • Determine the estimated size of the correlation (r) . Using our above example, you might be looking at the relationship between hours worked and burnout symptoms. This might be difficult to determine if no other studies have been completed with these two variables involved.
  • Determine the two-tailed alpha value (a). This relates to Type I error, the risk of a false positive. Typically, alpha is set at 0.05, meaning you accept a 5% probability of concluding there is a significant (non-zero) relationship between the two variables (number of hours worked and burnout symptoms) when in fact there is none.
  • The beta value (b) relates to Type II error, the risk of concluding there is no significant effect when there is one (a false negative). Beta is commonly set at 0.20.
  • By plugging in these numbers, r, alpha, and beta into a power analysis tool, you will be able to determine your sample size.
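As a rough illustration of that last step, the Python sketch below computes a required sample size from r, alpha, and power (1 - beta) using the Fisher z approximation. The specific values are illustrative rather than drawn from any particular study, and dedicated power analysis tools may use slightly different formulas.

```python
# A sketch of a sample-size calculation for detecting a correlation of size r,
# based on the Fisher z approximation (two-tailed test).
import math
from scipy.stats import norm

def sample_size_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a correlation of size r."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the Type I error rate
    z_beta = norm.ppf(power)           # critical value for the desired power (1 - beta)
    c = math.atanh(r)                  # Fisher z transformation of r
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(sample_size_for_correlation(r=0.30))  # roughly 85 participants
```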

Survey Instrument

As you determine what instrument you will use, a survey you create or that has been used and created by someone else, you should consider the following (Fowler, 2008; Creswell & Creswell, 2018; Bloomfield & Fisher, 2019):

  • Name and give credit to the instrument and the researchers who developed it.  Or discuss your use of proprietary or free survey products online (Qualtrics, Survey Monkey).
  • Content validity (did the survey measure what it was intended to measure?)
  • Predictive validity (do scores predict a criterion measure? Do the scores correlate with other results?)
  • Construct validity (does the survey measure hypothetical concepts?)
  • What is the internal consistency of the survey? Does each variable and each item on the survey behave in the same way? You can also check test-retest reliability, that is, whether the instrument is stable over time.

Experimental Design

There are three components to experimental design which also follows a standard form: participants and design, procedure and measurement. There are a few considerations that Bouma et al. (2012), Bloomfield and Fisher (2019), Creswell and Creswell (2018) suggest you determine early on in your design.

  • Random Sampling - the sampling technique in which each sample has an equal probability of being chosen and is meant to be an unbiased representation of the total population.
  • Quota Sampling - is defined as a non-probability sampling method in which researchers create a sample involving individuals that represent a population.
  • Convenience Sampling - defined as a method adopted by researchers where they collect market research data from a conveniently available pool of participants.
  • Probability Sampling - refers to sampling techniques that aim to identify a representative sample from which to collect data.
  • The idea of randomized assignment is a distinct feature of experimental design. When participants are randomly assigned to groups, the process is called a true experiment. If this is the case with your study, you should discuss how, when and why you are assigning participants to treatment groups. You need to describe in detail how each participant is placed to eliminate systematic bias in assigning participants. If your study design deals with more than one variable or treatment that cannot utilize random assignment (i.e. female school children benefit from a different teaching technique than male school children), this would change your design from true experimental design to a quasi-experimental design.
  • As with survey research, it is essential to conduct a power analysis for sample size. The steps are the same as for survey design; however, a power analysis for experimental design focuses on measuring effect size, meaning the estimated differences between groups on the manipulated variables of interest (see the sketch after this list). Please review the steps for power analysis in the survey research section.
  • Identify variables in the study, specifically the dependent and independent variables, as well as any other variables you intend to measure in the study. For example, you might want to think about participant demographic variables, variables that might impact your study design like time of day (i.e. energy levels might fluctuate during the day so that could impact measurement) and lastly, other variables that might impact your study’s outcomes.
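As referenced above, here is a minimal sketch of an effect-size-based power analysis for a two-group experiment. It assumes a medium standardized effect size (Cohen's d = 0.5) purely for illustration and uses statsmodels, one of several tools that can perform the calculation.

```python
# A sketch of a power analysis for the difference between two independent groups.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # assumed standardized difference (Cohen's d)
                                   alpha=0.05,       # Type I error rate
                                   power=0.80)       # 1 - Type II error rate

print(round(n_per_group))  # roughly 64 participants per group
```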

Instrumentation

Just like with survey research, it is important to discuss how you are collecting your data, through what instrument or instruments are used, what scales are used, what their reliability and validity are based on past uses (Bouma et al., 2012; Creswell & Creswell, 2018; Bloomfield and Fisher, 2019). Ultimately, some quantitative experimental models may use  data sets that have already been collected like the National Center for Educational Statistics (NCES). In that case, you will be able to discuss the validity and reliability easily as it is well-established. However, if you are collecting your own data, you must discuss in detail what materials are used in the manipulation of variables. For example, you might want to pilot test the experiment so you have a detailed knowledge of the procedure (Bouma et al., 2012; Creswell & Creswell, 2018).

Also, often in experimental design, you don’t want the participants to know which variables are being manipulated or which group they are being assigned to. In order to be sure you are in line with IRB regulations (see the IRB section), you want to draft a letter that will be used to explain the procedures and the study’s purpose to the participants (Creswell & Creswell, 2018). If there is any deception used in the study, be sure to check the IRB guidelines to ensure that you have all procedures and documents approved by Kean University’s IRB.

Measurement and Data Analysis for Quantitative Methods

It is important to reiterate that there are several kinds of ways to collect data for a quantitative study. The data is always numerical, as opposed to qualitative data, which is largely narrative. The most common data collection methods for quantitative research are:

  • Close-ended surveys
  • Close-ended questionnaires
  • Structured interviews

The data is collected across populations, using a large sample size, and then analyzed using statistical analysis. The results would then be generalizable across populations. However, before you collect the data, you need to determine what exactly you are proposing to measure as you choose your variables. There are several kinds of statistical measurements in quantitative research. Each has its own purpose and objective. Ultimately, you need to decide if you are going to describe, explain, predict, or control your numerical data.

Quantitative data collection typically means there are a lot of data. Once the data is gathered, it may seem to be messy and disorganized at first. Your job as the researcher is to organize and then make the significance of the data clear. You do this by cleaning your data through “measurements” or scales and then running statistical analysis tests through your statistical analysis software program.

There are several purposes to statistical analysis in a quantitative study, such as (Kumar, 2015):

  • Summarize your data by identifying what is typical and what is atypical within a group.
  • Identify the rank of an individual or entity within a group
  • Demonstrate the relationship between or among variables.
  • Show similarities and differences among groups.
  • Identify any error that is inherent in a sample.
  • Test for significance.
  • Can support you in making inferences about the population being studied.

It is important to know that in order to properly analyze your numerical data, you will need access to statistical analysis software such as SPSS. The OCIS Help Desk website provides information on how to access SPSS under the Remote Learning (Students)  section.

Once you have collected your numerical data, you can run a series of statistical tests on your data, depending on your research questions.
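For instance, a common starting point is an independent-samples t-test comparing the scores of two groups. The sketch below uses invented data purely for illustration; SPSS, R, and Python's SciPy all perform the same calculation.

```python
# A sketch of an independent-samples t-test on invented scores for two groups.
from scipy import stats

control = [72, 68, 75, 70, 66, 74, 69, 71]
treatment = [78, 74, 80, 76, 73, 79, 75, 77]

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a p-value below .05 is conventionally treated as significant
```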

There are four kinds of statistical measurements that you will be able to choose from in order to determine the best statistical tests to be utilized to explore your research inquiry. These measurements are also referred to as scales, and have very particular sets of statistical analysis tools that go along with each kind of scale (Bryman & Cramer, 2009).

Nominal measurements are labels (names, hence nominal) of  specific categories within mutually exclusive populations or treatment groups. These labels delineate non-numerical data such as gender, city of birth, race, ethnicity, or marital status (Bryman & Cramer, 2009; Ong & Puteh, 2017).

Ordinal measurements detail the order in which data is organized and ranked. These measures or scales compare values that are greater than (>) or less than (<) others within a data set. Again, the data are organized (named/categorized) and ranked (ordinal), such as class rank, ability level (beginner, intermediate, expert), or Likert scale answers (strongly agree, agree, undecided, disagree, strongly disagree) (Bryman & Cramer, 2009; Ong & Puteh, 2017).

Interval measurements take data and order them (nominal), rank them (ordinal), and then distribute them in equal intervals. The zero point on an interval scale is arbitrary rather than absolute: temperature in degrees Celsius is a common example, because 0 °C does not mean an absence of temperature (Bryman & Cramer, 2009; Ong & Puteh, 2017).

Ratio measurements allow data to be measured in equal units (interval) with an absolute zero point established. Here, the absolute zero value signifies the absence of the variable; for example, 0 lbs means the absence of weight. Height, weight, and temperature on the Kelvin scale are all examples of variables that can be measured on a ratio scale (Bryman & Cramer, 2009; Ong & Puteh, 2017).
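As a small illustration, the sketch below shows one way the four scales might be represented in a Python data frame; the variable names and values are invented for illustration.

```python
# A sketch of the four measurement scales as columns of a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    # Nominal: labels with no order
    "marital_status": pd.Categorical(["single", "married", "single"]),
    # Ordinal: ranked categories
    "ability_level": pd.Categorical(["beginner", "expert", "intermediate"],
                                    categories=["beginner", "intermediate", "expert"],
                                    ordered=True),
    # Interval: equal units, arbitrary zero point
    "temperature_c": [21.5, 23.0, 19.8],
    # Ratio: equal units with an absolute zero
    "weight_kg": [70.2, 65.4, 80.1],
})

print(df.dtypes)
```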

Bloomfield, J., & Fisher, M. J. (2019). Quantitative research design. Journal of the Australasian Rehabilitation Nurses Association, 22 (2), 27-30. https://doi-org.kean.idm.oclc.org/10.33235/jarna.22.2.27-30

Bouma, G. D., Ling, R., & Wilkinson, L. (2012). The research process (2nd Canadian ed.). Oxford University Press.

Bryman, A., & Cramer, D. (2009). Quantitative data analysis with SPSS 14, 15 & 16: A guide for social scientists. Routledge/Taylor & Francis Group.

Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches. Sage.

Fowler, F. J., Jr. (2008). Survey research methods (4th ed.). Sage.

Fowler, F. J., Jr. (2014). The problem with survey research. Contemporary Sociology, 43 (5): 660-662.

Kraemer, H. C., & Blasey, C. (2015). How many subjects?: Statistical power analysis in research. Sage.

Kumar, S. (2015). IRS introduction to research in special and inclusive education. [PowerPoint slides 4, 5, 37, 38, 39,43]. Informační systém Masarykovy univerzity. https://is.muni.cz/el/1441/podzim2015/SP_IRS/

Ong, M. H. A., & Puteh, F. (2017). Quantitative data analysis: Choosing between SPSS, PLS, and AMOS in social science research. International Interdisciplinary Journal of Scientific Research, 3 (1), 14-25.

Sharma, G. (2017). Pros and cons of different sampling techniques. International Journal of Applied Research, 3 (7), 749-752.


An IERI – International Educational Research Institute Journal

  • Open access
  • Published: 26 February 2016

Causal inferences with large scale assessment data: using a validity framework

  • David Rutkowski
  • Ginette Delandshere

Large-scale Assessments in Education, volume 4, Article number: 6 (2016)


To answer the calls for stronger evidence by the policy community, educational researchers and their associated organizations increasingly demand more studies that can yield causal inferences. International large scale assessments (ILSAs) have been targeted as rich data sources for causal research. It is in this context that we take up a discussion around causal inferences and ILSAs. Although these rich, carefully developed studies have much to offer in terms of understanding educational systems, we argue that the conditions for making strong causal inferences are rarely met. To develop our argument we first discuss, in general, the nature of causal inferences and then suggest and apply a validity framework to evaluate the tenability of claims made in two well-cited studies. The cited studies exemplify interesting design features and advances in methods of data analysis and certainly contribute to the knowledge base in educational research; however, methodological shortcomings, some of which are unavoidable even in the best of circumstances, urge a more cautious interpretation than that of strict “cause and effect.” We then discuss how findings from causal-focused research may not provide answers to the often broad questions posed by the policy community. We conclude with examples of the importance of the validity framework for the ILSA research community and a suggestion of what should be included in studies that wish to employ quasi-experimental methods with ILSA data.

Policy makers often express a need for scientifically - based evidence to articulate policy and make funding decisions (e.g., Raudenbush 2008 ; Stevens 2011 ; Sutherland et al. 2012 ). To partially address this need, the United States government, for example, has invested heavily in the What Works Clearinghouse, which attempts to bank educational research findings resulting primarily from randomized controlled trial (RCT) studies so that evidence-based decisions can be made by both policy makers and practitioners. Internationally, the Organisation for Economic Cooperation and Development (Henry et al. 2001 ) and the World Bank (Jones 2007 ) as well as conglomerations of many international players (Rutkowski and Sparks 2014 ) have all placed great focus on attaining scientifically-based evidence so that “evidence-based policy decisions” are possible.

To answer the calls for stronger evidence, educational researchers and their associated organizations increasingly demand more studies that can yield causal inferences. For example, one of the largest educational research organizations in the world, the American Educational Research Association (AERA), commissioned a report on estimating causal effects using observational data (Schneider et al. 2007 ). In this report, the authors state that “there is a general consensus in the education research community on the need to increase the capacity of researchers to study educational problems scientifically” (Schneider et al. 2007 , p. 109). These same authors argue that large cross-sectional educational assessment datasets are an important and often underused resource from which educational researchers and policy makers can draw valid causal inferences. A prime example of the sort of datasets that can be brought to bear in the quest for scientifically-based evidence are international, comparative assessments, such as the Trends in International Mathematics and Science Study, among others.

In spite of a desire on the part of policymakers and researchers to use scientifically-based evidence in policy, research, and practice, there are necessarily limitations to gleaning causal inferences from observational data. It is in this context that we take up a discussion around causal inferences and international large-scale assessments (ILSAs). Although these rich, carefully developed studies have much to offer in terms of understanding educational systems, we argue that the conditions for making strong causal inferences are rarely met. To develop our argument we first discuss, in general, the nature of causal inferences and then in the following section titled limitations of experimental and quasi - experimental design we suggest and apply a validity framework to evaluate the tenability of claims made in two well-cited studies (Mosteller 1995 ; Schmidt et al. 2001 ). The cited studies exemplify interesting design features and advances in methods of data analysis and certainly contribute to the knowledge base in educational research; however, methodological shortcomings, some of which are unavoidable even in the best of circumstances, urge a more cautious interpretation than that of strict “cause and effect.” The next section titled usefulness for policy , discusses how findings from causal-focused research may not provide answers to the often broad questions posed by the policy community. We conclude with examples of the importance of the validity framework for the ILSA research community and a suggestion of what should be included in studies that wish to employ quasi-experimental methods with ILSA data.

Nature of causal inferences

Causal inferences have primarily relied on so-called “gold standard” experimental designs, especially RCTs (Campbell and Stanley 1963 ; Cook and Campbell 1979 ; Shadish et al. 2002 ; Campbell Collaboration 2015 ; What Works Clearinghouse 2015 ; Shavelson and Towne 2002 ). In studies using experimental designs, cases are randomly assigned to treatment and control groups, with the treatment manipulation under the complete control of the researcher. These types of studies are common in the physical, medical and psychological sciences where environments are controlled in laboratories, allowing scientists to experiment with control and treatment groups to test hypotheses. Even in ideal RCT studies, however, researchers contend with threats such as attrition, experimenter training, and so on. Such experimental conditions are less common in the social sciences because randomization is often difficult (or impossible) and even unethical (e.g., randomly assigning students to low performing vs high performing schools). Given the logistical difficulties and ethical concerns of randomly assigning people to groups and controlling their environments, quasi-experiments are more common in the social sciences. In quasi-experiments random assignment is not possible and statistical control of pre-existing differences between groups has to be carefully exercised. Social science researchers also use data from natural experiments or correlational studies—that is, where observations of naturally occurring phenomena are used—to address causal questions. These natural experiments have been most prevalent in economics and epidemiology but also in political sciences (Dunning 2008 ). Increasingly, however, quasi-experimental methods and natural experiments are used in connection with ILSA data (Cattaneo and Wolter 2012 ; Jürges et al. 2005 ). We discuss this evolution next.

For many years educational researchers have made causal claims based on large-scale observational data (e.g., Coleman Report; National Educational Longitudinal Study 1988–2000; High School and Beyond—HS&B), and there is now a renewed interest in drawing causal inferences from the analysis of large-scale national and international assessment data (e.g., Schneider et al. 2007 ; West and Woessmann 2010 ; Woessman 2014 ). To do so, statisticians, economists and other social scientists have developed methods of analysis (e.g., instrumental variable approach, propensity scores, fixed-effect models) to devise conditions that emulate random assignment by selecting “equivalent” treatment and control groups across a number of non-treatment variables (e.g., matching groups). Such a strategy seeks to equalize differences between groups with the exception of membership to either the “treatment” or “control” group. Each strategy has its own limitations, some of which will be addressed in the remaining articles of this special issue. Here, however, we want to broadly focus on the nature of causal inferences based on large-scale assessment data, which are typically observational and cross-sectional.
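To make the logic of these matching approaches concrete, the following sketch (ours, not drawn from any of the studies cited above, and using simulated data with hypothetical covariates) shows one common variant: estimating propensity scores and matching each "treated" case to the most similar "control" case before comparing outcomes.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Simulated observational data: treatment uptake depends on two covariates,
# so a naive comparison of group means is confounded.
rng = np.random.default_rng(0)
n = 2000
ses = rng.normal(size=n)          # hypothetical socio-economic index
prior = rng.normal(size=n)        # hypothetical prior achievement
p_treat = 1 / (1 + np.exp(-(0.8 * ses + 0.5 * prior)))
treated = (rng.random(n) < p_treat).astype(int)
outcome = 2.0 * treated + 1.5 * ses + 1.0 * prior + rng.normal(size=n)

# Step 1: estimate propensity scores from the observed covariates.
X = np.column_stack([ses, prior])
pscore = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated case to the control case with the closest score.
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(pscore[c_idx].reshape(-1, 1))
_, matched = nn.kneighbors(pscore[t_idx].reshape(-1, 1))

# Step 3: compare outcomes within matched pairs.
att = np.mean(outcome[t_idx] - outcome[c_idx[matched.ravel()]])
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"Matched estimate: {att:.2f}; naive difference in means: {naive:.2f}")

The sketch equalizes groups only on the covariates that were measured and modeled; as the remainder of this section argues, unmeasured differences and unexamined causal mechanisms are precisely what such procedures cannot rule out.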

In order to make causal claims, researchers often have to limit the scope of the claim because it must focus on a particular cause and a particular effect in order to establish a relationship between them, holding everything else equal. Although a traditional laboratory offers unrivaled control (e.g., a carefully designed and implemented RCT), scholars have questioned the associated causal claims, pointing to the inherent simplicity of the claim (Cronbach 1982 ). And even proponents of causal inferences in social science research (Cook 2002 ) have recognized that such studies are best suited for addressing very simple and focused questions. As educational researchers expand their questions to cover more complex topics, these criticisms become all the more relevant, as the risk of violating important assumptions increases. We outline examples of typical violations later in the paper. First, however, we discuss an important distinction between causal description and causal explanation.

Shadish et al. ( 2002 ) explain the distinction between causal descriptions—which describe the consequence of varying the cause on the effect or causal relationship—and causal explanations—which provide an account of “the mechanisms through which and the conditions under which that causal relationship holds” (p. 9). Further, Shadish et al. ( 2002 ) describe the strength of experiments as having an ability to describe the “consequences attributable to deliberately varying a treatment.” The same authors argue that “experiments do less well in clarifying the mechanisms through which and the conditions under which that causal relationship holds”—what they define as causal explanations (p. 9). This important distinction allows us to better understand what relationships are being defined by any causal statement. For many policy makers, understanding the mechanism is often of less importance, whereas researchers are more likely to value uncovering the specific causal mechanisms or explanations that underlie causal descriptions. For example, if a study finds that spending more time on a subject leads to better performance on a test, such findings are only helpful for researchers if we understand what teachers who spend more time are doing differently than the teachers who spend less time on a topic. In other words, time is not the cause here but how the time is used. Yet in much of social science research this distinction is not clear. Instead, blanket causal statements are made without any further explanation.

Let us now examine the various factors that might affect the validity of causal claims. To do so, we mainly use a validity framework that has been developed and refined over several decades (e.g., Campbell and Stanley 1963 ; Cook and Campbell 1979 ; Shadish et al. 2002 ) in the context of experimental and quasi-experimental designs. We could have also placed our focus on Rubin’s Causal Model (RCM) or framework of potential outcomes (Holland 1986 ; Rubin 1978 , 2008 ) that similarly focuses on the analysis of cause in experiments but also extends to the use of observational studies for approximating randomized experiments. As Rubin ( 2008 ) states, however, “many of the appealing features of randomized experiments can and should be duplicated when designing observational comparative studies, that is, nonrandomized studies whose purpose is to obtain, as closely as possible, the same answer that would have been obtained in a randomized experiment comparing the same analogous treatment and control conditions in the same population” (pp. 809–810). RCM outlines conceptual and design considerations that might make it possible to use observational studies to approximate randomized experiments and also requires an explicit consideration of the “assignment mechanism” by which cases were assigned to the treatment and control conditions (Rubin 2008 ). For example, in a study comparing students attending public and private schools, it is the individual (or his/her parents) who is responsible for deciding on assignment to the treatment (e.g., private school) or to the control (e.g., public school) conditions. The RCM also makes use of a probability model associated with the assignment mechanism and of Bayesian analysis to consider the full set of potential outcomes for each case rather than relying on the observed outcome, which, according to Rubin ( 2008 ), is inadequate and “can lead to serious errors” (p. 813). RCM also rests on the important concept of “key covariates,” or relevant variables that could explain pre-existing differences between the treatment and control conditions and that can be used to ensure that the distribution of these variables only differ randomly between the two groups—one of the crucial conditions to make causal inferences. In other words, this ensures that the cases in the treatment and control conditions are more or less equivalent on all important variables. Other design considerations relate to clear specification of the hypothesized experiment that is to be approximated (i.e., clear specification of treatment conditions and outcomes), adequacy of sample sizes under all conditions to ensure power, clear understanding of who made the treatment condition assignment and based on what variables, and measurement quality of key covariates (Rubin 2008 ). As we will see, many of these considerations are similar to and/or compatible with those we consider in the validity framework that we use and describe subsequently.
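As a compact reminder of the notation behind RCM (our rendering, not taken from Rubin's text), the framework can be summarized in a few lines of LaTeX, where Z is the treatment indicator and X the key covariates:

% Potential outcomes: each case i has an outcome under treatment and under
% control, but only one of the two is ever observed.
\tau_i = Y_i(1) - Y_i(0), \qquad
\text{ATE} = \mathbb{E}\left[ Y(1) - Y(0) \right]
% Observational data approximate a randomized experiment only if assignment is
% ignorable given the key covariates (and overlap holds):
\{\, Y(1), Y(0) \,\} \;\perp\; Z \mid X, \qquad 0 < \Pr(Z = 1 \mid X) < 1

On this reading, the assignment mechanism and the choice of key covariates carry the entire weight of the causal claim when randomization is absent.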

Drawing from the work of Campbell and Stanley ( 1963 ), Cook and Campbell ( 1979 ), and Shadish et al. ( 2002 ) on the validity and generalization of causal claims in the context of randomized experiments, causal claims are affected by a number of factors: (1) the meaning and representation of the constructs related to the claims, and the consequences of using these for a particular purpose— construct validity , (2) the study design and proper specification of the model(s) used to test the causal hypothesis and the ruling out of alternative hypotheses— internal validity , (3) the sampling of cases, “treatment or cause,” outcomes, and settings or contexts— external validity , and (4) the use of proper statistical methods to estimate the strength of the relationships between the presumed cause and effect— statistical conclusion validity . All these considerations are generally concerned with minimizing various “errors” in making causal claims: error of representation, error in the logic of reasoning or in the implied mechanism that underlies the causal relationship, error in estimating differences or relationships, and error in extrapolation. These validity concerns support one another on the one hand, but they also compete with one another on the other. For example, if the constructs used to represent the cause and effect are problematic in terms of their definitions and measurements, this will have consequences for all other aspects of validity and will seriously affect the validity of the causal claims. On the other hand, narrowing the definition of treatment conditions (or cause), for example, to better fit a particular context may enhance internal validity but may severely limit external validity or extrapolation. We use this validity framework subsequently to evaluate some causal claims.

An additional, yet related, issue concerns the comparability of constructs and their measurement across different populations, referred to as construct and measurement equivalence, respectively. This issue is especially important in the international context where complex social phenomena are measured across differing contexts, cultures, and locations. To be fair, in some instances, when important social concepts manifest themselves differently in different countries (e.g., socio-economic status), accommodations are made regarding the measurement of some of these concepts, for example, by adding national options to home background scales. This departure from strictly equivalent measures may enhance the meaningfulness of a concept in a particular context but also makes comparisons across contexts more problematic, since important concepts are measured differently.

Making causal claims using ILSA data, however, cannot simply be achieved by modifying the measurement of particular concepts or variables for a particular context. The broader issue is whether the causal mechanisms or causal explanations for a particular phenomenon are comparable across contexts (e.g., groups, countries)—an issue that has not received much attention in the literature on causal inference. The primary interest of most international assessment programs is in measuring educational achievement (defined and measured differently, depending on the study) and a pre-defined set of achievement correlates (e.g., the learning environment or the student’s home situation) (see Footnote 1). The mechanisms that explain the relationships between these variables and the outcome measures in the context of each country are rarely problematized or conceptualized, and a priori differences in the conceptualization or theorization of these mechanisms across contexts are often not considered. For example, answers to questions regarding the value and purposes of education and schooling in societies as different as Germany, Qatar, and Zimbabwe are not fully articulated prior to designing data collection instruments. Such articulation would likely explain the mechanisms underlying variability in student achievement or performance in school and their differing associated correlates. In other words, one universal questionnaire administered in all contexts cannot possibly cover all relevant explanatory variables for all participating countries. We know, for example, that issues of gender and socio-economic status are understood differently in countries around the world. As such, the conceptualizations of the mechanisms that might explain varied educational achievement would likely yield a number of other variables not currently included in the data collection design.

In the case of ILSA, when a set of variables is imposed across contexts and is examined in terms of the relationships to an outcome variable, some relationships between the variables are bound to be found even if only by chance. The meaning of these relationships, however, is not often clear, and plausible alternative causal relationships cannot be examined due to the limited set of variables and the absence of causal mechanisms conceptualized for different contexts. In the field of comparative policy analysis, Falleti and Lynch ( 2009 ) have emphasized the importance of causal mechanisms and their interactions with context in order to make credible causal explanations. They argue “that unless causal mechanisms are appropriately contextualized, we run the risk of making faulty causal inferences.” They see the importance of context for making causal claims as a “problem of unit homogeneity” (p. 1144), where unit, here, refers to the variables and to the attributes of the units of analysis as well as their meaning and equivalence in the presumed causal mechanisms. So, for example, is a 14-year-old boy from an economically developed country who is picked up 200 m from his house by the school bus “equivalent” to a 14-year-old boy from an economically developing country who has to walk to school barefoot for several miles every day? Does the same causal mechanism explain variability in achievement test scores? What are the relevant attributes that play out in these different contexts? These are the important and difficult questions that would need to be addressed instead of making comparisons on a universal set of variables, which may not universally apply.

The case of the boys from a developed and a developing economy is most likely not a meaningful comparison, but it serves here to illustrate the question of meaning and equivalence and the importance of examining these in articulating causal explanations or mechanisms and the setting in which they play out. In addition, the same causal mechanism may have different outcomes in different contexts, while different causal mechanisms and multiple causes for the same effect may be at play in the same context. This can be explained by the fact that contexts are multilayered and develop from the interactions of individual characteristics as well as social and institutional norms, values, and functioning. The multiplicity of possible and plausible causal mechanisms defies the usefulness of single, simplistic deterministic or probabilistic models, as they can only provide a very partial description of “a” possible causal relationship. Most statistical models currently in use cannot readily accommodate the complexity of causal mechanisms, and it is, therefore, necessary to reduce their complexity to test a causal relationship (or hypothesis) between a cause and an effect. We contend, therefore, that a crucial issue in the nature of causal inferences is their need for simplicity and their consequent lack of generalizability.

In the next section we outline some limitations to making causal claims in educational research by examining two popular educational studies and submitting them to the validity framework outlined above.

Limitations of experimental and quasi-experimental design

As we argued above, most research studies yielding causal inferences in the social sciences tend to be descriptive rather than explanatory. This is an important distinction as it sets expectations for the claims that can be made when using the results. Causal inferences are inherently linked to their validity and generalization—that is, how true, and how specific or universal, are the claims? We illustrate some of these considerations and limitations, first in a randomized experimental study, and then in a study that employs ILSA data where random assignment is not possible. We examine possible reasons why causal claims might be weakened and highlight the importance of acknowledging possible validity threats so that claims can be qualified accordingly. In the end, it may be the case that an experimental or quasi-experimental design is the best choice for the research question at hand; however, the threats to validity may not support causal conclusions.

For our first example we draw from a well-known experimental study in education (Mosteller 1995 ), and discuss some validity concerns that may affect the study’s claims. For the second example (Schmidt et al. 2001 ), we focus on a study that uses large-scale assessment data to examine the effect of curriculum coverage on achievement gain scores. This example was chosen because it is one of the first instances where researchers explored the Trends in International Mathematics and Science Study (TIMSS) data to design a study that uses matching groups to calculate gain scores and “sophisticated statistical techniques [that presumably] allows them to generate causal hypotheses concerning specific aspects of the curriculum on student learning” (p. 80). Additionally, both of these studies were chosen because they are reasonably straightforward to explain in a limited space, which is not often the case when non-experimental studies are used with the aim of making causal inferences. Although we offer only a brief review of two studies, in this special issue many of the included papers provide resources and examples of how their corresponding topic has been used in research.

Study 1: Tennessee class size study

In the often-cited Tennessee Study of Class Size in the Early School Grades (Mosteller 1995 ), early grades are defined as kindergarten through third grade, and the treatment conditions are (1) smaller classes of 13–17 students, (2) larger classes of 22–25 students, and (3) larger classes of 22–25 students with a teacher aide in the classroom. Students and teachers were randomly assigned to classes at least for the first year of implementation. To participate in this study, schools had to commit for a period of 4 years, have a minimum of 57 students in each of the targeted grade levels, and guarantee that no new textbooks or curricula would be introduced. Approximately 180 Tennessee schools volunteered to participate, 100 qualified for the study, and 79 ultimately participated in the first year of implementation in kindergarten. Achievement in reading and mathematics was measured using the Stanford Achievement Test (SAT) and the Tennessee Basic Skills First (TBSF), which is described as a “curriculum-based measure.” The differences between the students in smaller and larger classes (with and without an aide) are reported as effect sizes (differences in means divided by standard deviation) and range between .13 and .27 in mathematics and between .21 and .23 in reading.
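For readers unfamiliar with the metric, the reported effect sizes are standardized mean differences of roughly the following form (our notation; Mosteller's exact choice of standard deviation, e.g., pooled versus control-group, is not detailed here):

d = \frac{\bar{X}_{\text{smaller}} - \bar{X}_{\text{larger}}}{s}, \qquad
\text{e.g., } s = s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}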

From the basic study description, we raise two issues. First, setting cut-offs for determining smaller and larger classrooms is not discussed or justified. Second, the possible causal mechanism of the effect of class size on achievement is limited to the explanation that in smaller classes there are fewer distractions and the teacher has more time to attend to each child than in larger classes. Next, we consider the main conclusion that “[t]he evidence is strong that smaller class size at the beginning of the school experience does improve performance of children on cognitive tests” (p. 123).

We begin by examining the nature of the constructs related to cause (class size) and effect (performance on cognitive tests). Although class size is the variable manipulated in this study, it appears, from the brief explanation provided, that it is a proxy for number of distractions and teacher time spent with each child. Yet there is little information in the study about what teachers were doing in the smaller and larger classes during implementation in terms of working individually with children and minimizing distractions. In addition, although one can easily imagine differences between a class of 13 and a class of 25, it is not entirely clear how a class of 17 and a class of 22 vary. This is an example where the explicit articulation of the causal mechanism(s) that would explain the differences in achievement for students in smaller and larger classes, and evidence that these mechanisms actually took place, is missing. For example, it is possible that, since all treatment conditions were implemented in the same school, teachers assigned to larger classes might have felt some resentment or that teachers assigned to smaller class size felt re-energized, which, in both cases, could have affected their teaching performance and, in turn, the performance of their students. These reactions to treatment assignment, in essence, affect the construct of class size being tested in the study. Without an explanation of the causal mechanisms and supporting evidence, attributing differences in achievement to class size is potentially misleading. With regard to the effect of the cause, achievement is measured in mathematics and reading by two different tests—a general standardized test (SAT), and a standardized curriculum-based test (TBSF), which is presumably more sensitive to the Tennessee school curriculum. Given the necessarily limited nature of these measures—in just two content areas—we raise the possibility that other learning constructs should also be measured and compared before any decisions are made on the basis of the evidence from this study.

Next, we highlight issues around the study design and treatment manipulation that relate directly to internal and construct validity elements that may affect the validity of the study’s claim. In particular, the definition of treatment conditions—in this case, class size—is related to treatment manipulation. In this study, in addition to smaller and larger class sizes, the researchers included a treatment condition of larger class size with a teacher aide, presumably to understand whether an additional adult might confer the same advantage as a smaller class. Unfortunately, the role of the aides was not specifically defined; some aides engaged in instructional activities while others did not. The absence of data about what happened in the different treatment conditions during implementation makes it difficult to explain what about class size makes a difference in achievement or to rule out alternative explanations for the differences in achievement between treatment conditions. Taken as a whole, the study provides some evidence of the relationship between class size and student achievement. But threats to both construct and internal validity prevent strong causal inference and an understanding of the actual causal mechanism. The study does appear to establish a causal description that applies to the context of the particular treatment conditions, outcome measures, time, persons, and settings used in this study.

Beyond the identified issues, questions remain about the generalization of the causal relationship— external validity —to variations in treatment conditions, test scores, persons and settings. For example, one issue concerns the representativeness of the 79 participating schools out of 180 that initially volunteered. Eighty schools were eliminated because they did not meet the study criteria for participation, including smaller schools with fewer than 57 students per grade. Further, of the 100 schools that qualified, only 79 participated in the first year, raising questions about the degree to which participating schools might have differed from non-participating Tennessee schools. Although Mosteller states that “The study findings apply to poor and well-to-do, farm and city, minority and majority children” (p. 116), he also reports that the effect of class size on achievement was twice as large for minority students. In addition, how generalizable are the findings to other states that might differ in important ways, including education policies, funding structures, and curricula? Finally, the definition of class size is relative to a particular setting, leaving open the possibility of different outcomes if the cut-offs had been defined as fewer than 20 and more than 20, for example. Given what is currently known internationally, findings from this study, assuming they still hold today, should also be reconciled with other reports where some of the highest achieving countries (e.g., Japan, Korea) also have large class sizes. Looking at each of these issues as part of a whole brings us back to the need for clearly articulated causal mechanisms that would explain the presence or absence of a relationship between these two factors and the importance of the interaction with the setting in which this plays out. Although, practically speaking, no study will cover all design and implementation possibilities, we want to highlight the importance of both grounding study decisions in theory to the degree possible while also tempering claims that are associated with potential threats to validity.

In addition to possible problems with construct, internal, and external validity already discussed, another limitation relates to statistical inferences— statistical conclusion validity —regarding the co-variation of the cause and the effect. In this study the co-variation between class size and achievement is addressed by testing mean differences between the different treatment conditions. Systematic differences between the smaller and larger class size conditions are implied, although Mosteller does not directly report any test of statistical significance, focusing instead on the magnitude of the effect sizes. Given the large sample sizes we can reasonably assume that differences between means were indeed statistically significant. Differences in effect sizes were observed when different reading and math tests were used, but it is difficult to make sense of these differences without information about homogeneity of variance, psychometric quality of the measures, and confidence intervals about estimated effect sizes.
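To illustrate the kind of information we find missing, the following sketch computes an approximate 95% confidence interval for a standardized mean difference using the large-sample standard error; the effect size and group sizes are hypothetical, not values reported by Mosteller.

import math

def smd_confidence_interval(d, n1, n2, z=1.96):
    # Large-sample standard error of a standardized mean difference
    # (Hedges & Olkin approximation).
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical values: d = 0.20 with 1,000 students per condition.
low, high = smd_confidence_interval(0.20, 1000, 1000)
print(f"95% CI: ({low:.2f}, {high:.2f})")   # roughly (0.11, 0.29)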

With regard to the reliability of treatment implementation, Mosteller reports that after the first year of the study some “incompatible children” were moved from smaller to larger classes, which might have increased the differences between means and effect sizes. The author also reported some “class size drift,” with some smaller classes becoming larger and some larger classes becoming smaller than their initial limits, possibly reducing differences between treatment conditions. This study exemplifies how, even when random assignment to treatment conditions is possible, a number of important issues can threaten the validity of researchers’ causal claims.

Study 2: TIMSS 1995 curriculum and learning study

As a second example, we turn now to an observational context using ILSA data where, importantly, random assignment to treatment conditions is not possible. The highlighted study (Schmidt et al. 2001 , also analyzed in the AERA report Schneider et al. 2007 ) investigates the relationships between different aspects of curriculum, instruction, and learning using the 1995 TIMSS data. In what follows we consider the degree to which issues related to measurement, sampling, and model choice and specification might raise questions around some of the researchers’ claims. To be clear, according to the AERA report, this study does not claim to make causal inferences but rather to “conceptually model and statistically evaluate the potential causal effects of specific aspects of the curriculum on student learning” (p. 84)—a subtle distinction that will likely be overlooked by policy makers who might want to make use of these claims. Further, Schmidt et al. ( 2001 ) refer to the model they use as a “causal statistical model” (p. 164).

In this study, the researchers use the TIMSS data from approximately 30 countries to construct a quasi-longitudinal design in order to examine the impact of different aspects of curriculum on learning. The conceptual analysis for the study focused on the intended curriculum (represented by content standards and textbook ratings), the implemented curriculum (represented by textbook ratings and self-reported teacher content coverage), and the attained curriculum (represented by student achievement gain, or increase in percentage of correct items)—all ratings and percentages aggregated at the topic and country levels. The data used were collected at the end of the school year in “two adjacent grades containing the majority of thirteen-year-olds in each country” (p. 5) and consist of school and teacher questionnaires, curriculum documents (e.g., content standards, curriculum guides), textbooks, and achievement test results in mathematics (see Footnote 2) and science, covering a large number of topics and sub-topics. By sampling from the two adjacent grades (e.g., 7th and 8th grades in the US), the researchers constructed presumably equivalent groups to estimate achievement gain scores as the difference in percentage of items correct between the two grades, averaged over all items in a topic. Curriculum documents and textbooks were divided into blocks and qualitatively coded to capture the representation and content coverage. These codes were then quantified to characterize the national curriculum for each country relative to twenty mathematics topics included in the quantitative analyses.
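A minimal sketch of the gain-score logic, as we understand it, looks roughly as follows; the data are invented and the structure is simplified to a single country and two topics.

import pandas as pd

# Hypothetical item-level percent-correct values for two adjacent grades.
items = pd.DataFrame({
    "country": ["A"] * 6,
    "topic": ["fractions"] * 3 + ["geometry"] * 3,
    "item": ["f1", "f2", "f3", "g1", "g2", "g3"],
    "pct_correct_lower_grade": [42.0, 55.0, 38.0, 61.0, 47.0, 50.0],
    "pct_correct_upper_grade": [58.0, 66.0, 51.0, 70.0, 52.0, 63.0],
})

# Gain = upper-grade minus lower-grade percent correct, averaged over the
# items belonging to each topic (then, in the study, aggregated by country).
items["gain"] = items["pct_correct_upper_grade"] - items["pct_correct_lower_grade"]
topic_gain = items.groupby(["country", "topic"])["gain"].mean()
print(topic_gain)

The key assumption, of course, is that the two adjacent-grade cohorts are equivalent, so that the cross-sectional difference can stand in for longitudinal growth.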

Information was also collected on how many lessons teachers devoted to specific topics; this was translated into the percentage of teachers in each country addressing each topic, along with the percentage of instructional time allocated to it. At the end, the measures of content standards, curriculum coverage, teacher coverage, instructional time, and achievement gains were aggregated at the topic and country level in order to make the measurement consistent across all variables. This consistency in the level of measurement was perceived as making the impact on achievement gain more sensitive to variability in content and instructional time and coverage. The structural relationships between these constructs were then examined. The hypotheses tested in the model were that the “official” curriculum documents (developed at the national, regional or local level) or content standards would have an impact on textbook coverage used in the classroom, and a direct and indirect relationship to teacher coverage , instructional time and achievement gain . Textbook coverage was also hypothesized to have a direct effect on instructional time and teacher coverage and, through those variables, an indirect effect on achievement gain, in addition to a direct effect.

In what follows, we apply the same validity framework to analyze several issues that might justify tempered claims from the Schmidt et al. ( 2001 ) study. First, construct measurement is of particular concern. The challenge of retro-fitting the definition and measurement of constructs is inherent to ILSA data, which are often collected for different purposes, and is not a problem unique to the Schmidt et al. study. Further, most of the limitations regarding the measurement of the constructs used in this study were appropriately acknowledged by the researchers. For example, the researchers recognized that they are working from a very specific definition of curriculum, that is, one among several possible perspectives. The operationalization of these curriculum definitions and their measurement is another concern.

With regard to coding content standards and textbook ratings, there was considerable variability in document availability across countries—some had multiple documents and textbooks while others had just one. Further, there is only limited information on the rating system and the reliability of the ratings. Coding and rating were initially done at the sub-topic level and then aggregated at the topic and national level. Given that the availability of documents varied across countries, questions naturally arise regarding the meaning and comparability of these concepts and whether they represent well the intended or implemented curriculum. The aggregation procedures (at the national level) also leave open the possibility of committing an ecological fallacy—that a relationship that exists at a higher level does not exist or is in a different direction at a lower level. And, as was the case in the class size study, there is no articulation of the causal mechanism that would explain the relationship between content standard national ratings and average gain in percent correct responses on the items for a particular topic. Documents were also coded relative to the TIMSS frameworks (or world core curriculum ), which means that topics not included in the frameworks were not coded and therefore not taken into account in the study. These unaccounted-for topics were nevertheless part of the intended curriculum for the different countries but were excluded from the analyses. The study did not describe whether other coding schemes were considered and whether these different approaches would yield different ratings. Such an approach—often referred to as a sensitivity analysis—would have eliminated, to some extent, competing explanations for observed differences.

As in the class-size study, outcomes are limited in scope to measures of achievement in mathematics and science. Further, the number of items per topic is a possible area of concern. For example, in mathematics, the study estimates achievement gains for twenty different topics. Due to the nature of the data, 14 of the 20 topics only had 5–10 items, raising concern around construct representativeness. An additional issue worth highlighting is comparability across countries. For example, the researchers extensively describe meaningful variation between countries in the way and the level at which the curriculum is articulated and structured. Further, the researchers found large cross-country differences in the perceived influence that the curricular structure has on the implemented curriculum in schools. Countries varied in terms of topic coverage according to content standards, by textbooks, by teachers, and in instructional time allocated to the different topics. Although it is natural to expect cross-national variation in these relationships, the authors fit an overall structural model to understand the relationship between content standards, textbook coverage , and teacher coverage (and instructional time ) for all countries. A key assumption in such an approach is that the constructs are all understood and measured equivalently in each analyzed country (Millsap 2011 ). The authors do allow for country-by-topic interaction effects, which are further examined by fitting models to each country individually. Nevertheless, these by-country models assume that the same variables and “causal mechanisms” are at play in all countries and that, as we have already mentioned, these variables have the same meaning across different contexts, even though the study found major differences in how the curriculum is structured in different countries.

Causal relationships are based on the principle of ceteris paribus or “all else being equal”; however, given evidence to the contrary, the tenability of this assumption is difficult to defend. Therefore, the researchers’ general claim that “more curriculum coverage of a topic area—no matter whether manifested as emphasis in content standards, as proportion of textbook space, or as measured by either teacher implementation variable—is related to larger gains in the same topic area” (Schmidt et al. 2001 , p. 261) can only be descriptive, and is conditional on the definition and measurement of these variables in the particular study. Indeed, the authors recognize the descriptive nature of their claims when they qualify with the following: “the nature of the general relationship is not the same for all countries. This implies that a general relationship between achievement gain and one aspect of curriculum may not even exist at all for some countries.” (p. 261). Needless to say, this raises issues regarding the generalizability and the causal nature of their claims.

We also point to a few possible issues related to statistical conclusion validity, particularly as they pertain to cross-country comparability. For each of 29 countries, five regression analyses (one for each pair of coding variables) are fit to the data. Topic ( N  = 20) served as the unit of analysis and results are combined and presented together (see Table 8.1, pp. 274–275). Although the practice of presenting findings for dozens of countries is fairly common in ILSA research, extra care is warranted when the inferential target is causal. These analyses are followed by another 29 regression analyses (one per country) to simultaneously estimate the structural coefficients (direct effects only) between three of the curriculum variables and achievement gains (see Table 8.2, pp. 277–278). Ten countries appear to have a significant (p  <  .05) structural coefficient between textbook coverage and achievement gain ; three countries have a significant coefficient between content standards coverage and achievement gain ; and six countries have a significant coefficient between instructional time and achievement . The reported overall coefficients of determination for these 29 models range from .08 to .70. Given the variability in the findings and questions around comparability, the strength of causal claims arising from this study should likely be revisited.
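One small, purely illustrative way to show why extra care is warranted when fitting the same model in 29 countries is a multiplicity check: even with no true relationship, some country-level coefficients will be nominally significant, and an adjustment such as Benjamini-Hochberg tempers that. The simulation below uses random data, not TIMSS.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_countries, n_topics, alpha = 29, 20, 0.05

p_values = []
for _ in range(n_countries):
    x = rng.normal(size=n_topics)   # e.g., textbook coverage by topic
    y = rng.normal(size=n_topics)   # e.g., achievement gain (no true effect here)
    slope, intercept, r, p, se = stats.linregress(x, y)
    p_values.append(p)

print("Nominally significant at .05:", sum(p < alpha for p in p_values))

# Benjamini-Hochberg step-up procedure: reject the k smallest p-values, where k
# is the largest rank i with p_(i) <= (i / m) * alpha.
p_sorted = np.sort(p_values)
m = len(p_sorted)
k = 0
for i, p in enumerate(p_sorted, start=1):
    if p <= i / m * alpha:
        k = i
print("Significant after BH adjustment:", k)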

Usefulness for policy

Policy makers in modern democratic societies often look for causal inferences from the research community to support the perception of “objectivity” in decision making. As Stone ( 2011 ) has argued, ideas around objectivity and the subsequent need for “causal theories” are critical to the policy process even if the resulting decisions are not truly based on the causal information provided by the research community. In education, a number of countries, including the US, have invested a great deal of resources in programs like the What Works Clearinghouse and continue to invest in the development and promotion of research that focuses on making causal claims using ILSA data (see Footnote 3). Although the US government has viewed experiments as the “gold-standard” for social research since the 1950s, it is significant that it is now investing resources in promoting research aimed at causal inferences with cross-sectional international assessment data. Much of this is in response to the fact that experimental research is often not feasible in the social sciences; however, policy makers want clear and compelling “evidence” for policy making and distributing resources. As such, in the current manuscript our intention is to acknowledge that experiments and quasi-experiments have an important place in educational research but also to argue that the results from such research are often narrowly focused and rarely succeed in providing answers to larger questions that are most relevant to the policy making process. In what follows we discuss causal inferences in the policy context in light of our previous points.

Policy makers often ask questions in broad terms (e.g., How can we improve student achievement? What are effective instruction, programs, and curricula? How do we improve teacher quality? How can we close the achievement gaps?) (Huang et al. 2003 ), whereas researchers often address narrower questions (e.g., What is the effect of a particular pedagogical approach on students’ standardized test scores in mathematics and language arts? What is the effect of teacher retention on future student achievement? What is the effect of class size on student achievement?) due to methodological considerations and data limitations. Under the best conditions, even when using methods that focus on making causal claims, answers to most researchers’ questions are qualified and limited in scope. The need to qualify findings and/or limit their scope can be attributed to measurement problems, selection bias, and the lack of ability to control for all relevant variables. When, and if, we are able to attend to the majority of these threats, especially with ILSA data, the resulting claims are often limited in scope and not often suitable to address larger policy-focused questions. For example, simply creating a policy that mandates closer alignment to TIMSS will probably not improve test scores if the teachers do not teach the material or if the curriculum becomes too vast to be covered in 1 year. The necessarily narrow focus of most research aimed at causal inferences, especially with ILSA data, unfortunately creates a landscape where learning and achievement are presented as a highly simplified problem. In other words, findings that are overly narrow and do not account for the known complexity of our educational systems can misguide policymakers by ignoring complex interactions just as much as they can inform them about educational systems.

In this paper we have also shown some serious and some minor threats to construct, internal, external, and statistical conclusion validity, drawing from two well-known, often-cited examples in educational research. Each identified problem leads us away from clear explanatory causal claims and can even point to serious concerns about our ability to make descriptive causal claims. The distinction between explanatory and descriptive causal claims is important in part because most causal-focused research in education emphasizes descriptive claims; however, the same research falls short of articulating valid causal explanations. In the case of the Tennessee study, limitations aside, the findings only provide the causal description that a lower student/teacher ratio leads to higher achievement scores on select assessments for a sample of teachers and students in Tennessee. This is another example of how research aimed at making causal inferences is often focused and narrow in nature. To policy makers, however, findings from a class size study might seem like useful information. Unfortunately, such findings most likely do not provide the key information policymakers need to create general policies to reduce class size. Few of us in the field of education are naïve enough to believe that simply putting more teachers into classrooms will increase scores. In fact, having poorly qualified teachers in classrooms has been associated with lower scores on standardized assessments (Darling-Hammond 2000 ). As such, the information that would best inform policy is not simply knowing that we need more teachers but also a clear explanation of what teachers do in smaller classrooms that results in increased student understanding and performance. Hence, we need to know the conditions under which the causal relationship holds.

Another example of how a lack of clear explanation can lead to a range of policy prescriptions can be taken from ILSAs. As ILSAs have grown in both popularity and scope, national policy makers have taken a keen interest in identifying policy levers that can improve educational achievement, with ILSA results serving as a frequent pool from which to draw possible solutions. Both class size and curriculum have drawn the attention of policy makers. The recent US discussions around widening income gaps in general, and their impact on educational achievement in particular, are another example of an important policy issue upon which ILSA data can be brought to bear. An historical and clearly problematic approach to measuring socioeconomic status (SES) in ILSA studies is via the “books in the home” indicator. This single item captures a rough count of the self-reported number of books in a child’s home. In conjunction with causal models and methods, it is possible to identify what appears to be a “causal effect” of SES (as measured by the number of books in the home) on achievement. Indeed, there is often a fairly strong, positive association between number of books and achievement; however, the explanation for this relationship is unclear, and therefore it is important that we also depend on a very clear and well-reasoned argument from the researcher who employed the causal modeling. For example, is the mere presence of books in the home sufficient to stimulate interest in reading, which translates into improved achievement? Or is the number of books serving as a proxy for cultural possessions and indicating a better resourced home environment? In the absence of any clear and well established explanation, the findings are not even causally descriptive, and are of limited usefulness for enacting meaningful policy, where possible policies could range from wealth redistribution to providing books for children with fewer resources.

An important barrier to supporting causal explanations in the ILSA context also includes the design of the studies (e.g., they are cross-sectional and observational). Prima facie these design features do not lend themselves to causal explanations. And although quasi-experimental methods can overcome this barrier, a host of validity assumptions must be tested before causal explanations are supported. For example, except in limited cases such as the Schmidt et al. ( 2001 ) study, it is very difficult to know if the cause precedes the effect. Not being able to provide such information to policy makers greatly reduces the usefulness of any causal claims being made. Although the findings might be able to suggest to a policy maker that a causal relationship exists, the claim does not provide policy makers with the breadth of information that they need to enact change.

As we have argued, even though many policy makers emphasize the need for causal inferences to support “objective” policy decisions, the reality of the policy process is much more complex and influenced by a host of social and political values and interests. That said, as a research community we often embrace uncertainty and operate with caution as we move forward with research, especially when using ILSA data. Although policy makers’ and researchers’ goals are not mutually exclusive, since they both aim to improve education, the two groups approach problems differently and, as a result, often have different objectives for the findings. Understanding and communicating these differences will be an important step as we move forward with more causal modeling of ILSA data.

Experimental and quasi-experimental designs play a key role in developing an understanding of important phenomena in educational research. In fact, we contend that many such studies assist us in better understanding our educational system and also allow both policy makers and researchers to explore different types of questions. For example, each author in this special issue provides interesting examples of how a given quasi-experimental design or method can be used in educational research to explore important topics and, to a certain degree, eliminate alternative explanations for identified relationships. Nevertheless, even in an ideal setting, where subjects are randomly assigned to treatment and control groups, threats to the validity of inferences are persistent and should be recognized when interpreting findings. In quasi-experimental research, these issues are even more prevalent given the fact that no random assignment has occurred and only an approximation of this process is possible. Finally, regardless of whether subjects are randomized, it is important to recognize the narrow focus of most research studies that aim at making causal inferences as well as the critical difference between causal descriptions and explanations. With this in mind, we offer a few recommendations to researchers who are using and interpreting the results of research that attempts to approximate randomized experiments using ILSA data.

Paying attention to both conceptual and design considerations is key to approximating randomized experiments with ILSA data. A clear articulation of the causal mechanism(s) investigated would greatly enhance the design of a study by highlighting the important elements that should be included in the design as well as those that would allow for testing alternative explanations across different contexts. We agree with Rubin ( 2008 ) and contend that ILSA researchers should make use of a probability model associated with the assignment mechanism as well as a Bayesian analysis to consider the full set of potential outcomes for each case. This process would strengthen the inferences that can be made from the research. Researchers should also justify their selection of what Rubin termed “key covariates” and ensure that they only differ randomly between control and treatment groups. Again, “key covariates” can only be identified if researchers have carefully articulated the possible mechanisms that could explain the association between the constructs and events being investigated. Further, all research should clearly articulate: the specification of the treatment and outcome conditions; sample sizes that ensure acceptable power; who made the treatment condition assignment and based on what variables; and evidence of the measurement quality of the key covariates. Within ILSA research, each of these poses its own set of challenges. For example, missing rates and disagreement between students and parents on identical, policy-relevant variables have been shown to be high (see Rutkowski and Rutkowski 2010 ). Similarly, scale reliabilities can vary widely between countries, from high to unacceptably low (see Rutkowski and Rutkowski 2013 ).
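As one concrete (and deliberately minimal) illustration of the kind of covariate check these recommendations imply, the sketch below computes standardized mean differences for two hypothetical key covariates; the variable names and the 0.1 rule of thumb are our assumptions, not an ILSA- or Rubin-specified procedure.

import numpy as np

def standardized_mean_difference(treat, control):
    pooled_sd = np.sqrt((treat.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treat.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(2)
# Hypothetical covariates for 400 "treated" and 600 "control" students.
covariates = {
    "ses_index": (rng.normal(0.2, 1.0, 400), rng.normal(0.0, 1.0, 600)),
    "books_in_home": (rng.normal(3.1, 1.0, 400), rng.normal(2.9, 1.0, 600)),
}

for name, (treat, control) in covariates.items():
    smd = standardized_mean_difference(treat, control)
    flag = "acceptable" if abs(smd) < 0.1 else "imbalanced"   # common rule of thumb
    print(f"{name}: SMD = {smd:+.2f} ({flag})")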

The validity framework offers important considerations that the research community can use to further minimize errors when quasi-experimental designs are used with ILSA data. Reminding ourselves that error exists throughout the entire research process and our reasoning helps clarify that statistical techniques alone do not establish causal claims. In other words, sound statistical conclusion validity does not lead to an acceptable causal claim unless it is supported by a compelling causal mechanism that has been clearly articulated and taken into account in the design of the study. When designing a study that approximates a randomized experiment using ILSA data, issues of construct, internal, and external validity are of critical importance, as we have illustrated in the context of the two studies we discussed earlier. As we have noted, there are a number of ways to examine the validity of findings in relation to experimental and quasi-experimental studies. In this paper we depended largely on the framework that was first developed by Campbell and Stanley ( 1963 ) and later refined by Cook and Campbell ( 1979 ) and Shadish et al. ( 2002 ). This framework allowed us to productively examine the validity of claims made by two well-cited studies in education. As such, we recommend that all quasi-experimental studies that use ILSA data: (1) choose an established validity framework to work from; and (2) clearly explain threats to the validity of their claims. For example, if the validity framework outlined in this paper is chosen the study should include a discussion of: construct validity , internal validity , external validity , and statistical conclusion validity . Shadish et al. ( 2002 ) provide a detailed description of possible threats to validity and are a useful resource for both researchers and the reviewers of these studies.

Finally, we would like to point out some issues that are especially important given the design of ILSA data collection. The following issues do not constitute an exhaustive list but are simply examples where the validity of inferences may be threatened. With respect to construct validity, we are always at the mercy of the available data. That is, the validity of the claims made about a self-efficacy construct, for example, rests on the availability of sufficient variables or items to meaningfully represent this construct. Defending this proposition is the responsibility of the researcher, using the self-efficacy literature as well as a thorough psychometric investigation of the data and providing supporting evidence for the validity of the claims made. Similarly, a primary issue with internal validity relates to model specification and ensuring that all relevant variables have been included in the model to support a thorough investigation of the hypothesized causal mechanism(s) and to make possible ruling out alternative hypotheses for an estimated effect. Again, based on a thorough examination of the substantive literature, the researcher is responsible for evaluating whether possible (reasonable) alternative explanations can be tested and whether relevant variables are included in the model. Given that ILSA data only provide a fixed set of variables, researchers need to be transparent about the variables that were not included and have an open discussion about how that weakens their conclusions. Regarding external validity, a key issue in ILSA data is the operationalization of the outcome variable(s), which relies on appropriate and relevant but necessarily limited measurements. That is, one cannot reasonably use ILSA data to estimate the causal effect of some variable on “schooling” or “education” writ large, since most ILSAs are limited to only a few schooling outcomes (e.g., math, science, reading, and affective variables) and a highly selective sample of students (i.e., 8th graders or 15-year-olds).
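Returning to the construct validity point above: one simple piece of the psychometric evidence a researcher might provide is an internal-consistency estimate computed separately by country. The sketch below applies Cronbach's alpha to simulated responses on a hypothetical four-item scale; it is our illustration, not an ILSA-prescribed procedure.

import numpy as np

def cronbach_alpha(items):
    # items: 2-D array, rows = respondents, columns = scale items.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(3)
latent = rng.normal(size=500)
# Country A: items track the latent trait closely; Country B: much noisier items.
country_a = np.column_stack([latent + rng.normal(scale=0.6, size=500) for _ in range(4)])
country_b = np.column_stack([latent + rng.normal(scale=1.8, size=500) for _ in range(4)])
print("alpha, country A:", round(cronbach_alpha(country_a), 2))
print("alpha, country B:", round(cronbach_alpha(country_b), 2))

When such reliabilities diverge sharply across countries, the same scale score cannot be assumed to carry the same meaning in every context.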

Of course, even RCTs can fail to meet ideal conditions (e.g., the Tennessee study). In ILSA research, there will always be further threats and more justifications will be necessary to allay concerns surrounding the validity of causal claims. Clearly articulated and reasonable research questions, well specified research design consistent with hypothesized causal mechanism(s), relevant and quality data, and well-specified models continue to be critical to support research claims made in this context. It is equally important to be thorough and transparent in acknowledging weaknesses in the causal chain of inferences as well as other limitations. As such, we urge everyone who works with ILSA data, and especially in applying “causal models” to ILSA data, to openly engage in a thorough and self-critical process that utilizes a well-recognized validity framework such as the one we discussed in the current paper. Through this process, we can have honest conversations about what the data and models can reasonably tell us about educational inputs, processes, and outcomes. We can also better engage with policy makers about the usefulness and limitations of research claims to inform policy.

Footnote 1: Such large-scale data collection efforts are also constrained by a concern for trend analysis (across different waves of data collection), which prevents changes that would compromise comparisons across time.

Footnote 2: In this analysis we focus only on mathematics and on the cross-country analyses to illustrate the claims made and the limitations of the study.

Footnote 3: With support from the U.S. National Science Foundation, in 2015 AERA held workshops on making causal claims with ILSA data. More information can be found here: http://www.aera.net/ProfessionalOpportunitiesFunding/AERAFundingOpportunities/StatisticalAnalysis-CausalAnalysisUsingInternationalData/tabid/14751/Default.aspx .

Campbell Collaboration. (2015). The Campbell collaboration: What helps? What harms? Based on what evidence? Retrieved 21 July 2015, from http://www.campbellcollaboration.org/ .

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching . Chicago: Rand McNally.

Cattaneo, A., & Wolter, S. C. (2012). Migration policy can boost PISA results: Findings from a natural experiment (SSRN Scholarly Paper No. ID 1999328). Rochester: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=1999328 .

Cook, T. (2002). Randomized experiments in educational policy research: a critical examination of the reasons the educational evaluation community has offered for not doing them. Educational Evaluation and Policy Analysis , 24 (3), 175–199.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis for field settings . Chicago: Rand McNally.

Cronbach, L. J. (1982). Prudent aspirations for social inquiry. The social sciences: Their nature and uses, 61 , 81.

Darling-Hammond, L. (2000). Teacher quality and student achievement. Education Policy Analysis Archives, 8 , 1.

Dunning, T. (2008). Improving causal inference: Strengths and limitations of natural experiments. Political Research Quarterly, 61 (2), 282–293.

Falleti, T. G., & Lynch, J. F. (2009). Context and causal mechanisms in political analysis. Comparative Political Studies, 42 (9), 1143–1166.

Henry, M., Lingard, B., Rizvi, F., & Taylor, S. (2001). The OECD, globalisation and education policy . Published for IAU Press, Pergamon. Retrieved from http://www.lavoisier.fr/livre/notice.asp?id=OOSWOKA2KK6OWG .

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistics Association, 81 , 945–970.

Huang, G., Reiser, M., Parker, A., Muniec, J., & Salvucci, S. (2003). Institute of education science findings from interviews with education policymakers. Institute of Education Sciences. Retrieved from http://eric.ed.gov/?id=ED480144 .

Jones, P. W. (2007). World Bank financing of education: Lending, learning and development . New York: Routledge.

Jürges, H., Schneider, K., & Büchel, F. (2005). The effect of central exit examinations on student achievement: Quasi-experimental evidence from TIMSS Germany. Journal of the European Economic Association , 3 (5), 1134–1155. http://doi.org/10.1162/1542476054729400 .

Millsap, R. E. (2011). Statistical approaches to measurement invariance . New York: Routledge.

Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5 (2), 113–127.

Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction. American Educational Research Journal, 45 (1), 206–230.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6 (1), 34–58.

Rubin, D. B. (2008). Objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2 (3), 808–840.

Rutkowski, L., & Rutkowski, D. (2010). Getting it better: The importance of improving background questionnaires in International Large‐Scale Assessment. Journal of Curriculum Studies , 42 (3), 411–430. http://doi.org/10.1080/00220272.2010.487546 .

Rutkowski, D., & Rutkowski, L. (2013). Measuring socioeconomic background in PISA: One size might not fit all. Research in Comparative and International Education, 8 (3), 259–278.

Rutkowski, D., & Sparks, J. (2014). The new scalar politics of evaluation: An emerging governance role for evaluation. Evaluation, 20 (4), 492–508. http://doi.org/10.1177/1356389014550561 .

Schmidt, W. H., McKnight, C. C., Houang, R. T., Wang, H., Wiley, D. E., Cogan, L. S., et al. (2001). Why schools matter: A cross-national comparison of curriculum and learning . New York: Wiley.

Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs . Washington: American Educational Research Association.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference . Boston: Houghton Mifflin Company.

Shavelson, R. J., & Towne, L. (2002). Scientific research in education . Washington: National Academy Press.

Stevens, A. (2011). Telling policy stories: An ethnographic study of the use of evidence in policy-making in the UK. Journal of Social Policy, 40 (2), 237–255.

Stone, D. (2011). Policy paradox: The art of political decision making (3rd ed.). New York: W. W. Norton & Company.

Sutherland, W. J., Bellingan, L., Bellingham, J. R., et al. (2012). A collaboratively-derived science-policy research agenda. PLoS One. doi: 10.1371/journal.pone.0031824 .

West, M. R., & Woessmann, L. (2010). Every catholic child in a catholic school: Historical resistance to state schooling, contemporary private competition and student achievement across countries. The Economic Journal, 120 (546), F229–F255.

What Works Clearinghouse. (n.d.). Homepage. Retrieved July 21, 2015, from http://ies.ed.gov/ncee/wwc/ .

Woessmann, L. (2014). The economic case for education (No. 20). European Expert Network on Economics of Education.


Authors' contributions

The work represents extensive collaboration and discussion between DR and GD. Both authors participated in the development of the rationale for the study as well as the writing. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Authors’ information

David Rutkowski is a professor of educational assessment at the Center for Educational Measurement (CEMO) at the University of Oslo, Norway. David's research is focused in the area of educational policy and technical topics within international large-scale assessment and program evaluation. His interests include how large scale assessments are used within policy debates, the impact of background questionnaire quality on achievement results, and topics concerning immigrant students at the international level.

Ginette Delandshere is a professor of Inquiry Methodology and Chair of the Counseling and Educational Psychology Department in the School of Education at Indiana University, Bloomington. Her research interests are measurement and assessment and the associated validity of inferences and research claims as well as the study of the socio-political practice of assessment and its purpose and meaning in the context of teaching and learning.

Author information

Authors and Affiliations

University of Oslo, Oslo, Norway

David Rutkowski

Indiana University, Bloomington, IN, USA

Ginette Delandshere


Corresponding author

Correspondence to David Rutkowski .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Rutkowski, D., Delandshere, G. Causal inferences with large scale assessment data: using a validity framework. Large-scale Assess Educ 4 , 6 (2016). https://doi.org/10.1186/s40536-016-0019-1


Received : 12 January 2016

Accepted : 20 January 2016

Published : 26 February 2016

DOI : https://doi.org/10.1186/s40536-016-0019-1


Keywords

  • Causal Mechanism
  • Causal Inference
  • Causal Explanation
  • Causal Claim
  • Large Class Size



Methods for Evaluating Causality in Observational Studies

Emilio A. L. Gianicolo

1 Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University of Mainz

2 Institute of Clinical Physiology of the Italian National Research Council, Lecce, Italy

Martin Eichler

3 Technical University Dresden, University Hospital Carl Gustav Carus, Medical Clinic 1, Dresden

Oliver Muensterer

4 Department of Pediatric Surgery, Faculty of Medicine, Johannes Gutenberg University of Mainz

Konstantin Strauch

5 Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg; Chair of Genetic Epidemiology, Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität, München

Maria Blettner

In clinical medical research, causality is demonstrated by randomized controlled trials (RCTs). Often, however, an RCT cannot be conducted for ethical reasons, and sometimes for practical reasons as well. In such cases, knowledge can be derived from an observational study instead. In this article, we present two methods that have not been widely used in medical research to date.

The methods of assessing causal inferences in observational studies are described on the basis of publications retrieved by a selective literature search.

Two relatively new approaches—regression-discontinuity methods and interrupted time series—can be used to demonstrate a causal relationship under certain circumstances. The regression-discontinuity design is a quasi-experimental approach that can be applied if a continuous assignment variable is used with a threshold value. Patients are assigned to different treatment schemes on the basis of the threshold value. For assignment variables that are subject to random measurement error, it is assumed that, in a small interval around a threshold value, e.g., cholesterol values of 160 mg/dL, subjects are assigned essentially at random to one of two treatment groups. If patients with a value above the threshold are given a certain treatment, those with values below the threshold can serve as control group. Interrupted time series are a special type of regression-discontinuity design in which time is the assignment variable, and the threshold is a cutoff point. This is often an external event, such as the imposition of a smoking ban. A before-and-after comparison can be used to determine the effect of the intervention (e.g., the smoking ban) on health parameters such as the frequency of cardiovascular disease.

The approaches described here can be used to derive causal inferences from observational studies. They should only be applied after the prerequisites for their use have been carefully checked.

The fact that correlation does not imply causality was frequently mentioned in 2019 in the public debate on the effects of diesel emission exposure ( 1 , 2 ). This truism is well known and generally acknowledged. A more difficult question is how causality can be unambiguously defined and demonstrated ( box 1 ) . According to the eighteenth-century philosopher David Hume, causality is present when two conditions are satisfied: 1) B always follows A—in which case, A is called a “sufficient cause” of B; 2) if A does not occur, then B does not occur—in which case, A is called a “necessary cause” of B ( 3 ). These strict logical criteria are only rarely met in the medical field. In the context of exposure to diesel emissions, they would be met only if fine-particle exposure always led to lung cancer, and lung cancer never occurred without prior fine-particle exposure. Of course, neither of these is true. So what is biological, medical, or epidemiological causality? In medicine, causality is generally expressed in probabilistic terms, i.e. exposure to a risk factor such as cigarette smoking or diesel emissions increases the probability of a disease, e.g., lung cancer. The same understanding of causality applies to the effects of treatment: for instance, a certain type of chemotherapy increases the likelihood of survival in patients with a diagnosis of cancer, but does not guarantee it.

Causality in epidemiological observational studies (modified from Parascandola and Weed [34])

  • Causality as production: A produces B. Causality is to be distinguished from mere temporal sequence. It does not suffice to note that A is always followed by B; rather, A must in some way produce, lead to, or create B. However, it remains unclear what “producing,” “leading to,” or “creating” exactly means. On a practical level, the notion of production is what is illustrated in the diagrams of cause-and-effect relationships that are commonly seen in medical publications.
  • Sufficient and necessary causes: A is a sufficient cause of B if B always happens when A has happened. A is a necessary cause of B if B only happens when A has happened. Although these relationships are logically clear and seemingly simple, this type of deterministic causality is hardly ever found in real-life scientific research. Thus, smoking is neither a sufficient nor a necessary cause of lung cancer. Smoking is not always followed by lung cancer (not a sufficient cause), and lung cancer can occur in the absence of tobacco exposure (not a necessary cause, either).
  • Sufficient component cause: This notion was developed in response to the definitions of sufficient and necessary causes. In this approach, it is assumed that multiple causes act together to produce an effect where no single one of them could do so alone. There can also be different combinations of causes that produce the same effect.
  • Probabilistic causality: In this scenario, the cause (A) increases the probability (P) that the effect (B) will occur: in symbols, P(B | A) > P(B | not A). Sufficient and necessary causes, as defined above in (2), are only those extreme cases in which P(B | A) = 1 and P(B | not A) = 0, respectively. When these probabilities take on values that are neither 0 nor 1, causality is no longer deterministic, but rather probabilistic (stochastic). There is no assumption that a cause must be followed by an effect. This viewpoint corresponds to the method of proceeding in statistically oriented scientific disciplines.
  • Causal inference: This is the determination that a causal relationship exists between two types of event. Causal inferences are made by analyzing the changes in the effect that arise when there are changes in the cause. Causal inference goes beyond the mere assertion of an association and is connected to a number of specific concepts: some that have been widely discussed recently are counterfactuals, potential outcomes, causal diagrams, and structural equation models ( 36 , 37 ).
  • Triangulation: Not all questions can be answered with an experiment or a randomized controlled trial. Alternatively, methodological pluralism is needed, or, as it is now sometimes called, triangulation: confidence in a finding increases when the same finding is arrived at from multiple data sets, multiple scientific disciplines, multiple theories, and/or multiple methods ( 35 ).
  • The criterion of consequentiality: The claim that a causal relationship exists has consequences on a societal level (taking action or not taking action). Olsen has called for the formulation of a criterion to determine when action should be taken and when not ( 7 ).

In many scientific disciplines, causality must be demonstrated by an experiment. In clinical medical research, this purpose is achieved with a randomized controlled trial (RCT) ( 4 ). An RCT, however, often cannot be conducted for either ethical or practical reasons. If a risk factor such as exposure to diesel emissions is to be studied, persons cannot be randomly allocated to exposure or non-exposure. Nor is any randomization possible if the research question is whether or not an accident associated with an exposure, such as the Chernobyl nuclear reactor disaster, increased the frequency of illness or death. The same applies when a new law or regulation, e.g., a smoking ban, is introduced.

When no experiment can be conducted, observational studies need to be performed. The object under study—i.e., the possible cause—cannot be varied in a targeted and controlled way; instead, the effect this factor has on a target variable, such as a particular illness, is observed and documented.

Several publications in epidemiology have dealt with the ways in which causality can be inferred in the absence of an experiment, starting with the classic work of Bradford Hill and the nine aspects of causality (viewpoints) that he proposed ( box 2 ) ( 5 ) and continuing up to the present ( 6 , 7 ).

The Bradford Hill criteria for causality (modified from [5])

  • Strength: the stronger the observed association between two variables, the less likely it is due to chance.
  • Consistency: the association has been observed in multiple studies, populations at risk, places, and times, and by different researchers.
  • Specificity: it is a strong argument for causality when a specific population suffers from a specific disease.
  • Temporality: the effect must be temporally subsequent to the cause.
  • Biological gradient: the association displays a dose–response effect, e.g., the incidence of lung cancer is greater when more cigarettes are smoked per day.
  • Plausibility: a plausible mechanism linking the cause to the effect is helpful, but not absolutely required. What is biologically plausible depends upon the state-of-the-art knowledge of the time.
  • Coherence: the causal interpretation of the data should not conflict with biological knowledge about the disease.
  • Experiment: experimental evidence should be adduced in support, if possible.
  • Analogy: an association speaks for causality if similar causes are already known to have similar effects.

Aside from the statistical uncertainty that always arises when only a sample of an affected population is studied, rather than its entirety ( 8 ), the main obstacle to the study of putative causal relationships comes from confounding variables (“confounders”). These are so named because they can, depending on the circumstances, either obscure a true effect or simulate an effect that is, in fact, not present ( 9 ). Age, for example, is a confounder in the study of the association between occupational radiation exposure and cataract ( 10 ), because both cumulative radiation exposure and the risk of cataract rise with increasing age.

The various statistical methods of dealing with known confounders in the analysis of epidemiological data have already been presented in other articles in this series ( 9 , 11 , 12 ). In the current article, we discuss two new approaches that have not been widely applied in medical and epidemiological research to date.

Methods of evaluating causal inferences in observational studies

The main advantage of an RCT is randomization, i.e., the random allocation of the units of observation (patients) to treatment groups. Potential confounders, whether known or unknown, are thereby distributed to the treatment groups at random as well, although differences between groups may arise through sample variance. Whenever randomization is not possible, the effect of confounders must be taken into account in the planning of the study and in data analysis, as well as in the interpretation of study findings.

Classic methods of dealing with confounders in study planning are stratification and matching ( 13 , 14 ), as well as so-called propensity score matching (PSM) ( 11 ).

The best-known and most commonly used method of data analysis is regression analysis, e.g., linear, logistic, or Cox regression ( 15 ). This method is based on a mathematical model created in order to explain the probability that any particular outcome will arise as the combined result of the known confounders and the effect under study.

Regression analyses are used in the analysis of clinical or epidemiological data and are found in all commonly used statistical software packages. However, they are often used inappropriately because the prerequisites for their correct application have not been checked. They should not be used, for example, if the sample is too small, if the number of variables is too large, or if a correlation between the model variables makes the results uninterpretable ( 16 ).
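To make this concrete, here is a minimal sketch (in Python, using numpy, pandas, and statsmodels) of how regression adjustment for a known confounder works. The data are simulated: age drives both the probability of exposure and the outcome, while the true effect of the exposure is zero. All variable names, sample sizes, and effect sizes are invented for illustration and are not taken from any real study.

# Minimal sketch: a crude versus a confounder-adjusted regression on simulated data.
# Assumption: age is the only confounder and the true exposure effect is zero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(30, 70, n)                        # confounder
p_exposed = 1 / (1 + np.exp(-(age - 50) / 5))       # older subjects are exposed more often
exposure = rng.binomial(1, p_exposed)
outcome = 0.05 * age + rng.normal(0, 1, n)          # outcome depends on age only

df = pd.DataFrame({"age": age, "exposure": exposure, "outcome": outcome})
crude = smf.ols("outcome ~ exposure", data=df).fit()
adjusted = smf.ols("outcome ~ exposure + age", data=df).fit()

print(round(crude.params["exposure"], 3))     # spurious, nonzero "effect" driven by age
print(round(adjusted.params["exposure"], 3))  # close to the true value of zero

The crude model attributes part of the age effect to the exposure; including the confounder in the model removes most of this bias, which is the basic logic behind the regression approaches described above.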

Regression-discontinuity methods

Regression-discontinuity methods have been little used in medical research to date, but they can be helpful in the study of cause-and-effect relationships from observational data ( 17 ). Regression-discontinuity design is a quasi-experimental approach ( box 3 ) that was developed in educational psychology in the 1960s ( 18 ). It can be used when a threshold value of a continuous variable (the “assignment variable”) determines the treatment regimen to which each patient in the study is assigned ( box 4 ) .

Terms used to characterize experiments ( 18 )

  • Experiment/trial: A study in which an intervention is deliberately introduced in order to observe an effect.
  • Randomized experiment/trial: An experiment in which persons, patients, or other units of observation are randomly assigned to one of two or more treatment groups (or intervention groups).
  • Quasi-experiment: An experiment in which the units of observation are not randomly assigned to the treatment/intervention groups.
  • Natural experiment: A study in which a natural event (e.g., an earthquake) is compared with a comparison scenario.
  • Non-experimental observational study: A study in which the size and direction of the association between two variables is determined.

In the simplest case, that of a linear regression, the parameters in the following model are to be estimated:

y_i = β_0 + β_1 z_i + β_2 (x_i − x_c) + e_i, where

i (from 1 to N) indexes the statistical units

y is the outcome

β_0 is the y-intercept

z is a dichotomous variable (0, 1) indicating whether the patient was treated (1) or not treated (0)

x is the assignment variable

x_c is the threshold

β_1 is the effect of treatment

β_2 is the regression coefficient of the assignment variable

e is the random error

A possible assignment variable could be, for example, the serum cholesterol level: consider a study in which patients with a cholesterol level of 160 mg/dL or above are assigned to receive a therapy. Since the cholesterol level (the assignment variable) is subject to random measurement error, it can be assumed that patients whose level of cholesterol is close to the threshold (160 mg/dL) are randomly assigned to the different treatment regimens. Thus, in a small interval around the threshold value, the assignment of patients to treatment groups can effectively be considered random ( 18 ). This sample of patients with near-threshold measurements can thus be used for the analysis of treatment efficacy. For this line of argument to be valid, it must truly be the case that the value being measured is subject to measuring error, and that there is practically no difference between persons with measured values slightly below or slightly above threshold. Treatment allocation in this narrow range can be considered quasi-random.
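The following minimal sketch (in Python, using numpy, pandas, and statsmodels) illustrates how the model given above, y_i = β_0 + β_1 z_i + β_2 (x_i − x_c) + e_i, could be estimated on simulated data built around the cholesterol example. The 160 mg/dL threshold follows the text; the bandwidth of ±10 mg/dL, the sample size, and the simulated treatment effect of −2.0 are invented purely for illustration.

# Minimal regression-discontinuity sketch on simulated data.
# Assumptions: threshold 160 mg/dL (from the text); bandwidth, sample size,
# and the true treatment effect of -2.0 are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20000
x = rng.normal(170, 25, n)                  # assignment variable: measured cholesterol (mg/dL)
x_c = 160.0                                 # threshold
z = (x >= x_c).astype(int)                  # treatment is assigned by the threshold
y = 0.10 * (x - x_c) - 2.0 * z + rng.normal(0, 3, n)   # simulated outcome

df = pd.DataFrame({"y": y, "z": z, "x_centered": x - x_c})

# Keep only near-threshold observations, where assignment is treated as quasi-random
bandwidth = 10.0
near = df[df["x_centered"].abs() <= bandwidth]

fit = smf.ols("y ~ z + x_centered", data=near).fit()
print(fit.params["z"])                      # estimate of beta_1, the treatment effect (about -2.0)

The coefficient on z recovers the simulated treatment effect because, within the narrow bandwidth, treated and untreated patients differ essentially only in which side of the threshold their measured value happened to fall on.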

This method can be applied if the following prerequisites are met:

  • The assignment variable is a continuous variable that is measured before the treatment is provided. If the assignment variable is totally independent of the outcome and has no biological, medical, or epidemiological significance, the method is theoretically equivalent to an RCT ( 19 ).
  • The treatment must not affect the assignment variable ( 18 ).
  • The patients in the two treatment groups with near-threshold values of the assignment variable must be shown to be similar in their baseline properties, i.e., covariables, including possible confounders. This can be demonstrated either with statistical techniques or graphically ( 20 ).
  • The range of the assignment variable in the vicinity of the threshold must be optimally set: it must be large enough to yield samples of adequate size in the treatment groups, yet small enough that the effect of the assignment variable itself does not alter the outcome being studied. Methods of choosing this range appropriately are available in the literature ( 21 , 22 ).
  • The treatment can be decided upon solely on the basis of the assignment variable (deterministic regression-discontinuity methods), or on the basis of other clinical factors (fuzzy regression-discontinuity methods).

Example 1: The one-year mortality of neonates as a function of the intensity of medical and nursing care was to be studied, where the intensity of care was determined by a birth-weight threshold: infants with very low birth weight (<1500 g) (group A) were cared for more intensively than heavier infants (group B) ( 23 ). The question to be answered was whether the greater intensity of care in group A led to a difference in mortality between the two groups. It was assumed that children with birth weight near the threshold are identical in all other respects, and that their assignment to group A or group B is quasi-random, because the measured value (birth weight) is subject to a relatively small error. Thus, for example, one might compare children weighing 1450–1500 g to those weighing 1501–1550 g at birth to study whether, and how, a greater intensity of care affects mortality.

In this example, it is assumed that the variable “birth weight” has a random measuring error, and thus that neonates whose (true) weight is near the threshold will be randomly allocated to one or the other category. But birth weight itself is an important factor affecting infant mortality, with lower birth weight associated with higher mortality ( 23 ); thus, the interval taken around the threshold for the purpose of this study had to be kept narrow. The study, in fact, showed that the children treated more intensively because their birth weight was just below threshold had a lower mortality than those treated less intensively because their birth weight was just above threshold.

Example 2: A regression-discontinuity design was used to evaluate the effect of a measure taken by the Canadian government: the introduction of a minimum age of 19 years for alcohol consumption. The researchers compared the number of alcohol-related disorders and of violent attacks, accidents, and suicides under the influence of alcohol in the months leading up to (group A) and subsequent to (group B) the 19th birthday of the persons involved. It was found that persons in group B had a greater number of alcohol-related inpatient treatments and emergency hospitalizations than persons in group A. With the aid of this quasi-experimental approach, the researchers were able to demonstrate the success of the measure ( 24 ). It may be assumed that the two groups differed only with respect to age, and not with respect to any other property affecting alcohol consumption.

Interrupted time series

Interrupted time series are a special type of regression-discontinuity design in which time is the assignment variable. The cutoff point is often an external event that is unambiguously identifiable as having occurred at a certain point in time, e.g., an industrial accident or a change in the law. A before-and-after comparison is made in which the analysis must still take adequate account of any relevant secular trends and seasonal fluctuations ( box 5 ) .

In the simplest case of a study involving an interrupted time series, the temporal sequence is analyzed with a piecewise regression. The following model is used to study both a shift in slope and a shift in the level of an outcome before and after an intervention, e.g., the introduction of a law banning smoking ( figure 2 ):

y = β_0 + β_1 × time + β_2 × intervention + β_3 × time × intervention + e, where

y is the outcome, e.g., cardiovascular diseases

intervention is a dummy variable for the time before (0) and after (1) the intervention (e.g., the smoking ban)

time is the time since the beginning of the study

β_0 is the baseline incidence of cardiovascular diseases

β_1 is the slope in the incidence of cardiovascular diseases over time before the introduction of the smoking ban

β_2 is the change in the incidence level of cardiovascular diseases after the introduction of the smoking ban (level effect)

β_3 is the change in the slope over time (cf. β_1) after the introduction of the smoking ban (slope effect)
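As a minimal illustration of this piecewise regression, the following sketch (in Python, using numpy, pandas, and statsmodels) fits the model above to simulated monthly data with an intervention at month 36. The length of the series, the intervention date, and the simulated level and slope effects are invented for illustration and do not correspond to any real smoking-ban study.

# Minimal interrupted-time-series sketch: segmented regression on simulated monthly data.
# Assumptions: 60 months of data, intervention at month 36, true level effect -8.0,
# true slope change -0.3 -- all invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
time = np.arange(60)
intervention = (time >= 36).astype(int)
y = 100 + 0.5 * time - 8.0 * intervention - 0.3 * time * intervention + rng.normal(0, 2, 60)

df = pd.DataFrame({"y": y, "time": time, "intervention": intervention})
fit = smf.ols("y ~ time + intervention + time:intervention", data=df).fit()

print(fit.params["intervention"])        # beta_2: level effect (about -8.0)
print(fit.params["time:intervention"])   # beta_3: change in slope (about -0.3)

In applied work, time is often re-centered at the intervention so that β_2 can be read directly as the level change at the cutoff; the sketch keeps the parameterization exactly as written in the equation above.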

The prerequisites for the use of this method must be met ( 18 , 25 ):

  • Interrupted time series are valid only if a single intervention took place in the period of the study.
  • The time before the intervention must be clearly distinguishable from the time after the intervention.
  • There is no required minimum number of data points, but studies with only a small number of data points or small effect sizes must be interpreted with caution. The power of a study is greatest when the number of data points before the intervention equals the number after the intervention ( 26 ).
  • Although the equation in Box 5 has a linear specification, polynomial and other nonlinear regression models can be used as well. Meticulous study of the temporal sequence is very important when a nonlinear model is used.
  • If an observation at time t—e.g., the monthly incidence of cardiovascular diseases—is correlated with previous observations (autoregression), then the appropriate statistical techniques must be used (autoregressive integrated moving average [ARIMA] models); a minimal residual check along these lines is sketched below.
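Building on the segmented-regression sketch above, the following lines show one simple way to check the residuals for first-order autocorrelation before trusting ordinary least-squares standard errors. The Durbin–Watson statistic is only a rough screen; the interpretation thresholds in the comments are conventional rules of thumb, not exact tests.

# Minimal residual check for autocorrelation; assumes `fit` is the model from the sketch above.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(fit.resid)
print(dw)
# Values near 2 suggest little first-order autocorrelation. Values well below 2 suggest
# positive autocorrelation; in that case an ARIMA-type model or autocorrelation-robust
# standard errors should be considered instead of plain OLS.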

Example 1: In one study, the rates of acute hospitalization for cardiovascular diseases before and after the temporary closure of Heathrow Airport because of volcanic ash were determined to investigate the putative effect of aircraft noise ( 27 ). The intervention (airport closure) took place from 15 to 20 April 2010. The hospitalization rate was found to have decreased among persons living in the urban area with the most aircraft noise. The number of observation points was too low, however, to show a causal link conclusively.

Example 2: In another study, the rates of hospitalization before and after the implementation of a smoking ban (the intervention) in public areas in Italy were determined ( 28 ). The intervention occurred in January 2004 (the cutoff time). The number of hospitalizations for acute coronary events was measured from January 2002 to November 2006 ( figure 1 ) . The analysis took account of seasonal dependence, and an effect modification for two age groups—persons under age 70 and persons aged 70 and up—was determined as well. The hospitalization rate declined in the former group, but not the latter.

Figure 1. Age-standardized hospitalization rates for acute coronary events (ACE) in persons under age 70 before and after the implementation of a smoking ban in public places in Italy, studied with the corresponding methods ( 30 ). The observed and predicted rates are shown (circles and solid lines, respectively). The dashed lines show the seasonally adjusted trend in ACE before and after the introduction of the nationwide smoking ban.

The necessary distinction between causality and correlation is often emphasized in scientific discussions, yet it is often not applied strictly enough. Furthermore, causality in medicine and epidemiology is mostly probabilistic in nature, i.e., an intervention alters the probability that the event under study will take place. A good illustration of this principle is offered by research on the effects of radiation, in which a strict distinction is maintained between deterministic radiation damage on the one hand, and probabilistic (stochastic) radiation damage on the other ( 29 ). Deterministic radiation damage—radiation-induced burns or death—arises with certainty whenever a subject receives a certain radiation dose (usually a high one). On the other hand, the risk of cancer-related mortality after radiation exposure is a stochastic matter. Epidemiological observations and biological experiments should be evaluated in tandem to strengthen conclusions about probabilistic causality ( box 1 ) .

While RCTs still retain their importance as the gold standard of clinical research, they cannot always be carried out. Some indispensable knowledge can only be obtained from observational studies. Confounding factors must be eliminated, or at least accounted for, early on when such studies are planned. Moreover, the data that are obtained must be carefully analyzed. And, finally, a single observational study hardly ever suffices to establish a causal relationship.

In this article, we have presented two newer methods that are relatively simple and which, therefore, could easily be used more widely in medical and epidemiological research ( 30 ). Either one should be used only after the prerequisites for its applicability have been meticulously checked. In regression-discontinuity methods, the assumption of continuity must be verified: in other words, it must be checked whether other properties of the treatment and control groups are the same, or at least equally balanced. The rules of group assignment and the role played by the continuous assignment variable must be known as well. Regression-discontinuity methods can generate causal conclusions, but any such conclusion will not be generalizable if the treatment effects are heterogeneous over the range of the assignment variable. The estimate of effect size is applicable only in a small, predefined interval around the threshold value. It must also be checked whether the outcome and the assignment variable are in a linear relationship, and whether there is any interaction between the treatment and assignment variables that needs to be considered.
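Two of the checks mentioned here can be sketched directly as extensions of the earlier regression-discontinuity example. The code assumes the `near` data frame from that sketch is still available; the specific model terms are illustrative, not prescriptive.

# Minimal robustness checks for the regression-discontinuity fit.
# Assumes `near` (the near-threshold data) from the earlier sketch.
import statsmodels.formula.api as smf

# 1) Separate slopes on each side of the threshold: a clearly nonzero interaction
#    indicates that the treatment effect varies with the assignment variable,
#    so a single local estimate may not generalize.
fit_interact = smf.ols("y ~ z + x_centered + z:x_centered", data=near).fit()
print(fit_interact.params["z:x_centered"])

# 2) A quadratic term probes the assumed linear relationship near the threshold.
fit_quad = smf.ols("y ~ z + x_centered + I(x_centered ** 2)", data=near).fit()
print(fit_quad.params["I(x_centered ** 2)"])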

In the analysis of interrupted time series, the assumption of continuity must be tested as well. Furthermore, the method is valid only if the occurrence of any other intervention at the same time point as the one under study can be ruled out ( 20 ). Finally, the type of temporal sequence must be considered, and more complex statistical methods must be applied, as needed, to take such phenomena as autoregression into account.

Observational studies often suggest causal relationships that will then be either supported or rejected after further studies and experiments. Knowledge of the effects of radiation exposure was derived, at first, mainly from observations on victims of the Hiroshima and Nagasaki atomic bomb explosions ( 31 ). These findings were reinforced by further epidemiological studies on other populations exposed to radiation (e.g., through medical procedures or as an occupational hazard), by physical considerations, and by biological experiments ( 32 ). A classic example from the mid-19th century is the observational study by Snow ( 33 ): until then, the biological cause of cholera was unknown. Snow found that there had to be a causal relationship between the contamination of a well and a subsequent outbreak of cholera. This new understanding led to improved hygienic measures, which did, indeed, prevent infection with the cholera pathogen. Cases such as these prove that it is sometimes reasonable to take action on the basis of an observational study alone ( 6 ). They also demonstrate, however, that further studies are necessary for the definitive establishment of a causal relationship.

Figure 2. The effect of a smoking ban on the incidence of cardiovascular diseases

Key messages

  • Causal inferences can be drawn from observational studies, as long as certain conditions are met.
  • Confounding variables are a major impediment to the demonstration of causal links, as they can either obscure or mimic such a link.
  • Random assignment leads to the even distribution of known and unknown confounders among the intervention groups that are being compared in the study.
  • In the regression-discontinuity method, it is assumed that, within a small range of the assignment variable around the threshold, the assignment of patients to treatment groups is effectively random, with the result that the confounders are randomly distributed as well.
  • The interrupted time series is a variant of the regression-discontinuity method in which a given point in time splits the subjects into a before group and an after group, with random distribution of confounders to the two groups.

Acknowledgments

Translated from the original German by Ethan Taub, M.D.

Conflict of interest statement The authors state that they have no conflict of interest.


Research Methods Simplified

Comparative Method/Quasi-Experimental


Comparative method or quasi-experimental: a method used to describe similarities and differences in variables in two or more groups in a natural setting; it resembles an experiment in that it uses manipulation, but it lacks random assignment of individual subjects and instead uses existing groups. For examples, see http://www.education.com/reference/article/quasiexperimental-research/#B


