Assessment Systems

What is automated essay scoring?

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment.  In fact, it has been around far longer than “machine learning” and “artificial intelligence” have been buzzwords with the general public; the field of psychometrics has been doing this kind of groundbreaking work for decades.

So how does AES work, and how can you apply it?

The first and most critical thing to know is that there is no algorithm that “reads” the student essays.  Instead, you need to train an algorithm.  That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them into an essay scoring system.  You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm.  Data scientists call this “training the model,” which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

There are three steps for automated essay scoring:

  • Establish your data set (collate student essays and grade them).
  • Determine the features (predictor variables that you want to pick up on).
  • Train the machine learning model.

Here’s an extremely oversimplified example:

  • You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
  • The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you give them “credit” in the model if they use words like Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, or St. Helena.  You might also add other features such as word count, number of grammar errors, and number of spelling errors.
  • You create a map of which students used each of these words, as 0/1 indicator variables.  You can then fit a multiple regression with seven predictor variables (whether they used each of the seven words) and the 5-point scale as your criterion variable.  You can then use this model to predict each student’s score from just their essay text.

Obviously, this example is too simple to be of use, but the same general idea is done with massive, complex studies.  The establishment of the core features (predictive variables) can be much more complex, and models are going to be much more complex than multiple regression (neural networks, random forests, support vector machines).
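To make the regression step concrete, here is a minimal sketch in Python of the Napoleon example.  The essays, scores, and keyword list are invented toy data, and ordinary least squares stands in for whatever model you would actually choose:

```python
import numpy as np

# Hypothetical keyword list for the Napoleon example
KEYWORDS = ["corsica", "consul", "josephine", "emperor",
            "waterloo", "austerlitz", "st. helena"]

def keyword_features(essay):
    """One 0/1 indicator per keyword: did the essay use it?"""
    text = essay.lower()
    return [1.0 if kw in text else 0.0 for kw in KEYWORDS]

# Toy stand-in for the 100 human-graded essays (scores on a 0-5 scale)
graded = [
    ("Napoleon was born on Corsica and exiled to St. Helena.", 3),
    ("As First Consul and later Emperor, he reshaped Europe.", 4),
    ("Josephine was crowned Empress by the Emperor himself.", 3),
    ("The defeat at Waterloo ended his rule.", 2),
    ("Austerlitz was his greatest victory; Waterloo his last defeat.", 4),
    ("He was a famous French general.", 1),
]

# Multiple regression: an intercept plus one slope per keyword indicator
X = np.array([[1.0] + keyword_features(e) for e, _ in graded])
y = np.array([s for _, s in graded], dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_score(essay):
    """Score an ungraded essay with the fitted model."""
    return float(np.array([1.0] + keyword_features(essay)) @ coef)
```

With real data you would have far more essays than features, and you would hold some graded essays out to check how well the model predicts scores it has never seen.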

Here’s an example of the very start of a data matrix for features, from an actual student essay.  Imagine that you also have data on the final scores, 0 to 5 points.  You can see how this is then a regression situation.

How do you score the essay?

If the essays are on paper, automated essay scoring won’t work unless you have extremely good optical character recognition (OCR) software to convert them into a digital database of text.  More likely, you have delivered the exam as an online assessment and already have the database.  If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics.  An example from our FastTest platform is provided below.

[Screenshot: essay marking in FastTest]

Some rubrics you might use:

  • Supporting arguments
  • Organization
  • Vocabulary / word choice

How do you pick the Features?

This is one of the key research problems.  In some cases, it might be something similar to the Napoleon example.  Suppose you had a complex item on accounting, where examinees review reports and spreadsheets and need to summarize a few key points.  You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and treat them as features.  I saw a presentation at Innovations in Testing 2022 that did exactly this.  Think of it as giving students “points” for using those keywords, though because you are using complex machine learning models, it is not simply a single unit point; each keyword contributes to a regression-like model with a positive slope.

In other cases, you might not know in advance.  Maybe it is an item on an English test delivered to English language learners, and you ask them to write about what country they want to visit someday.  You have no idea what they will write about.  But you can tell the algorithm to find the words or terms that are used most often, and try to predict the scores with those.  Maybe words like “jetlag” or “edification” show up in essays from students who tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores.  The algorithm might also pick up on spelling errors.  I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (the name of an American rap artist) instead of “ludicrous” when trying to describe an argument; they had literally never seen the word used or spelled correctly.  Maybe the model learns to give that a negative weight.  That’s the next section!
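The “let the algorithm find the terms” idea can be sketched with a simple frequency count.  This is a toy illustration; a real system would drop stop words, weight terms by something like TF-IDF, and so on:

```python
from collections import Counter
import re

def tokenize(essay):
    """Lowercase word tokens; apostrophes kept so "don't" survives."""
    return re.findall(r"[a-z']+", essay.lower())

def top_terms(essays, k=20):
    """The k most frequent words across all graded essays become candidate features."""
    counts = Counter()
    for essay in essays:
        counts.update(tokenize(essay))
    return [word for word, _ in counts.most_common(k)]

def term_features(essay, vocabulary):
    """0/1 indicator per vocabulary word, ready for a regression or ML model."""
    words = set(tokenize(essay))
    return [1 if w in words else 0 for w in vocabulary]
```

The resulting indicator matrix is then used exactly like the hand-picked keywords: the training step decides which discovered terms get positive weights and which get negative ones.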

How do you train a model?

Well, if you are familiar with data science, you know there are TONS of models, and many of them have numerous parameterization options.  This is where more research is required: which model works best on your particular essay and doesn’t take five days to run on your data set?  That’s for you to figure out.  There is a trade-off between simplicity and accuracy.  Complex models might be more accurate but take days to run; a simpler model might take two hours with a 5% drop in accuracy.  It’s up to you to evaluate.
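A small harness makes that trade-off measurable.  This sketch assumes a scikit-learn-style fit/predict interface, so real candidates (random forests, SVMs, neural nets) can be dropped into the `models` dict unchanged; only a trivial mean-score baseline is included here:

```python
import time

def mean_absolute_error(predictions, truth):
    """Average absolute difference between predicted and human scores."""
    return sum(abs(p - t) for p, t in zip(predictions, truth)) / len(truth)

class MeanBaseline:
    """Predicts the mean training score for every essay -- the floor any real model should beat."""
    def fit(self, X, y):
        self._mean = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self._mean for _ in X]

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each candidate and record both accuracy and wall-clock cost,
    since the simplicity/accuracy trade-off is part of the decision."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = {
            "seconds": time.perf_counter() - start,
            "mae": mean_absolute_error(predictions, y_test),
        }
    return results
```

Running this over a handful of candidate models on a held-out set gives you the seconds-versus-accuracy table you need to pick a model.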

If you have experience with Python and R, you know that there are many packages which provide this analysis out of the box – it is a matter of selecting a model that works.

How well does automated essay scoring work?

Well, as psychometricians love to say, “it depends.”  You need to do the model-fitting research for each prompt and rubric; it will work better for some than others.  The general consensus in the research is that AES algorithms work about as well as a second human rater, and therefore serve very well in that role.  But you shouldn’t use them as the only score, although in many cases that is hard to avoid.
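“Works as well as a second human” is usually quantified with an agreement statistic such as Pearson correlation or quadratic weighted kappa (QWK), a standard benchmark metric in AES research.  A minimal QWK implementation for an ordinal 0–5 scale (a sketch, not any vendor’s production code):

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, min_score=0, max_score=5):
    """Agreement between two raters on an ordinal scale, penalizing large
    disagreements quadratically; 1.0 = perfect, 0.0 = chance-level agreement."""
    n = max_score - min_score + 1
    observed = np.zeros((n, n))
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum disagreement
    weights = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)]
                        for i in range(n)])
    # Expected counts if the two raters were statistically independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Comparing the machine’s QWK against humans to the QWK between two human raters is a common way to decide whether the algorithm is ready to serve as the “second rater.”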

Here’s a graph from some research we did on our algorithm, showing the correlation between human and AES scores.  The three lines are for the proportion of the sample used in the training set; we saw decent results from only 10% in this case!  Some of the models correlated above 0.80 with humans, even though this is a small data set.  We found that the Cubist model took a fraction of the time needed by complex models like neural nets or random forests; in this case it might be sufficiently powerful.

[Graph: correlation between human and AES scores by training-set proportion]

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market.  Some are standalone; some are integrated with a human-based essay scoring platform.  ASC’s platform for automated essay scoring is SmartMarq; click here to learn more.  It is currently a standalone tool like you see below, making it extremely easy to use.  It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more?  Contact us to request a demonstration.

[Screenshot: SmartMarq automated essay scoring]


Nathan Thompson, PhD


e-rater® Scoring Engine

Evaluates students’ writing proficiency with automatic scoring and feedback


About the e-rater Scoring Engine

The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer’s grammar, mechanics, word use and complexity, style, organization and more.

Who uses the e-rater engine and why?

Companies and institutions use this patented technology to power their custom applications.

The e-rater engine is used within the Criterion® Online Writing Evaluation Service.  Students use the e-rater engine's feedback to evaluate their essay-writing skills and to identify areas that need improvement.  Teachers use the Criterion service to help their students develop their writing skills independently and receive automated, constructive feedback.  The e-rater engine is also used in other low-stakes practice tests, including TOEFL® Practice Online and GRE® ScoreItNow!™.

In high-stakes settings, the engine is used in conjunction with human ratings for both the Issue and Argument prompts of the GRE test's Analytical Writing section and the TOEFL iBT® test's Independent and Integrated Writing prompts. ETS research has shown that combining automated and human essay scoring yields score reliability and measurement benefits.

For more information about the use of the e-rater engine, read E-rater as a Quality Control on Human Scores (PDF).

How does the e-rater engine grade essays?

The e-rater engine provides a holistic score for an essay that has been entered into the computer electronically. It also provides real-time diagnostic feedback about grammar, usage, mechanics, style and organization, and development. This feedback is based on NLP research specifically tailored to the analysis of student responses and is detailed in ETS's research publications (PDF).

How does the e-rater engine compare to human raters?

The e-rater engine uses NLP to identify features relevant to writing proficiency in training essays and their relationship with human scores. The resulting scoring model, which assigns weights to each observed feature, is stored offline in a database that can then be used to score new essays according to the same formula.
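That “weights stored offline” design can be sketched very simply.  Everything below is invented for illustration (the feature names, weights, and placeholder extraction function are not ETS's); the point is only that the scoring model is a reusable formula, stored separately from the engine that trains it:

```python
import json

# Hypothetical trained model: one weight per feature, plus an intercept
weights = {"intercept": 1.2, "word_count": 0.003, "spelling_errors": -0.15}

def extract_features(essay):
    """Placeholder extraction; a real engine derives its features via NLP."""
    words = essay.split()
    return {"word_count": len(words), "spelling_errors": 0}

def score_essay(essay, weights):
    """Apply the stored formula: intercept plus weighted sum of features."""
    features = extract_features(essay)
    return weights["intercept"] + sum(
        weights[name] * value for name, value in features.items()
    )

# The model can be serialized, stored offline, and reloaded to score new essays
stored = json.dumps(weights)
reloaded = json.loads(stored)
```

Because the stored model is just data, new essays can be scored by any service that loads it, without re-running the training pipeline.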

The e-rater engine doesn’t have the ability to read, so it can’t evaluate essays the same way that human raters do. However, the features used in e-rater scoring have been developed to be as substantively meaningful as possible, given the state of the art in NLP. They have also been developed to demonstrate strong reliability, often greater than that of human raters themselves.

Learn more about how it works.

About Natural Language Processing

The e-rater engine is an artificial intelligence engine that uses Natural Language Processing (NLP), a field of computer science and linguistics that uses computational methods to analyze characteristics of a text. NLP methods support such burgeoning application areas as machine translation, speech recognition and information retrieval.

Ready to begin? Contact us to learn how the e-rater service can enhance your existing program.



Automated Essay Grading

State-of-the-art machine learning framework for automatically grading student essays.


Project Details

  • Product: Machine learning framework for automated essay grading
  • Technologies: Python, JavaScript, Flask, NLTK, SciPy, Deep Learning, Cloud Infrastructure

In close collaboration with our client, a global education company, Boulder Labs designed and implemented a state-of-the-art machine learning system with cloud infrastructure for automated essay grading.

More About this project

We developed a system to automate and streamline much of the work involved in building models to perform automated essay grading. The system includes an API for data collection and validation, tools to automate the modeling process and facilitate research, an interface for reporting on modeling performance, and support for deploying trained models to a production environment.

The system is useful to both the R&D and engineering teams, and facilitates an easy transfer of technology from research to production.

As a separate project for the same client, Boulder Labs developed a grammar checker specifically tailored for student writing. The new system provides a natural interface for integrating external tools, including this grammar checker, into the automated essay grading system.

Automated Essay Scoring

24 papers with code • 1 benchmark • 1 dataset

Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

Source: A Joint Model for Multimodal Document Quality Assessment


Most implemented papers

Automated Essay Scoring Based on Two-Stage Learning

Current state-of-the-art feature-engineered and end-to-end Automated Essay Scoring (AES) methods are proven to be unable to detect adversarial samples, e.g. essays composed of permuted sentences and prompt-irrelevant essays.

A Neural Approach to Automated Essay Scoring

nusnlp/nea • EMNLP 2016

SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring


Our new method proposes a new \textsc{SkipFlow} mechanism that models relationships between snapshots of the hidden representations of a long short-term memory (LSTM) network as it reads.

Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input

Youmna-H/Coherence_AES • NAACL 2018

We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences.

Co-Attention Based Neural Network for Source-Dependent Essay Scoring

This paper presents an investigation of using a co-attention based neural network for source-dependent essay scoring.

Language models and Automated Essay Scoring

In this paper, we present a new comparative study on automatic essay scoring (AES).

Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

midas-research/calling-out-bluff • 14 Jul 2020

This number is increasing further due to COVID-19 and the associated automation of education and testing.

Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring

Cross-prompt automated essay scoring (AES) requires the system to use non target-prompt essays to award scores to a target-prompt essay.

Many Hands Make Light Work: Using Essay Traits to Automatically Score Essays

To find out which traits work best for different types of essays, we conduct ablation tests for each of the essay traits.

EXPATS: A Toolkit for Explainable Automated Text Scoring

octanove/expats • 7 Apr 2021

Automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment, are important educational applications of natural language processing.


An automated essay scoring systems: a systematic literature review

Dadi Ramesh

1 School of Computer Science and Artificial Intelligence, SR University, Warangal, TS India

2 Research Scholar, JNTU, Hyderabad, India

Suresh Kumar Sanampudi

3 Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS India

Abstract

Assessment in the education system plays a significant role in judging student performance. The present evaluation system relies on human assessment. As the student-to-teacher ratio gradually increases, the manual evaluation process becomes complicated; it is time-consuming, lacks reliability, and has other drawbacks. In this connection, online examination systems have evolved as an alternative to pen-and-paper methods. Present computer-based evaluation systems work only for multiple-choice questions, and there is no proper evaluation system for grading essays and short answers. Many researchers have worked on automated essay grading and short answer scoring over the last few decades, but assessing an essay by considering all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a big challenge. A few researchers have focused on content-based evaluation, while many have addressed style-based assessment. This paper provides a systematic literature review of automated essay scoring systems. We studied the artificial intelligence and machine learning techniques used for automatic essay scoring and analyzed the limitations of the current studies and research trends. We observed that essay evaluation is not done based on the relevance of the content and coherence.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10462-021-10068-2.

Introduction

Due to the COVID-19 outbreak, an online educational system has become inevitable. In the present scenario, almost all educational institutions, from schools to colleges, have adopted online education. Assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remains a challenge. The education system is shifting to online mode, conducting computer-based exams with automatic evaluation. This is a crucial application in the education domain that uses natural language processing (NLP) and machine learning techniques. The evaluation of essays is impossible with simple programming techniques like pattern matching alone: for a single question, we will get many responses from students, each with a different explanation, so we need to evaluate all the answers with respect to the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades student responses by considering appropriate features. AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. (1973). PEG evaluates writing characteristics such as grammar, diction, and construction to grade the essay. A modified version of PEG by Shermis et al. (2001) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. (1999) introduced the Intelligent Essay Assessor (IEA), which evaluates content using latent semantic analysis to produce an overall score. E-rater, proposed by Powers et al. (2002), IntelliMetric by Rudner et al. (2006), and the Bayesian Essay Test Scoring sYstem (BETSY) by Rudner and Liang (2002) use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of essay scoring systems in the 1990s followed traditional approaches like pattern matching and statistical methods. Over the last decade, essay grading systems have adopted regression-based and natural language processing techniques. AES systems developed from 2014 onward, like that of Dong et al. (2017), use deep learning techniques, inducing syntactic and semantic features and yielding better results than earlier systems.

Ohio, Utah, and many other US states use AES systems in school education, such as the Utah Compose tool and the Ohio standardized test (an updated version of PEG), evaluating millions of student responses every year. These systems work for both formative and summative assessments and give feedback to students on their essays. Utah provides basic essay evaluation rubrics (six characteristics of essay writing): development of ideas, organization, style, word choice, sentence fluency, and conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and designed an algorithm to evaluate essays across different domains, providing an opportunity for test-takers to improve their writing skills; their current research also addresses content-based evaluation.

The evaluation of essays and short answers should consider the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge. Proper assessment of the parameters mentioned above defines the accuracy of the evaluation system. But these parameters do not all play an equal role in essay scoring and short answer scoring. In short answer evaluation, domain knowledge is required; for example, the meaning of "cell" differs between physics and biology. In essay evaluation, the development of ideas with respect to the prompt is required. The system should also assess the completeness of the responses and provide feedback.

Several studies have examined AES systems, from the earliest to the latest. Blood (2011) provided a literature review covering PEG from 1984 to 2010, which addressed only general aspects of AES systems, such as ethical considerations and system performance. It did not cover implementation, was not a comparative study, and did not discuss the actual challenges of AES systems.

Burrows et al. (2015) reviewed AES systems on six dimensions: dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They did not cover feature extraction techniques or the challenges in feature extraction, and covered machine learning models only briefly. Their review did not include a comparative analysis of AES systems in terms of feature extraction, model building, or the handling of relevance, cohesion, and coherence.

Ke et al. (2019) provided a state-of-the-art review of AES systems but covered very few papers, did not list all challenges, and included no comparative study of AES models. Hussein et al. (2019) studied two categories of AES systems: four papers using handcrafted features and four papers using neural network approaches. They discussed a few challenges but did not cover feature extraction techniques or the performance of AES models in detail.

Klebanov et al. (2020) reviewed 50 years of AES systems, listing and categorizing all the essential features that need to be extracted from essays, but provided no comparative analysis of the work and did not discuss the challenges.

This paper aims to provide a systematic literature review (SLR) of automated essay grading systems. An SLR is an evidence-based systematic review to summarize the existing research. It critically evaluates and integrates the findings of all relevant studies and addresses specific research questions for the research domain. Our research methodology uses the guidelines given by Kitchenham et al. (2009) for conducting the review process, which provide a well-defined approach to identify gaps in current research and to suggest further investigation.

We describe our research method, research questions, and the selection process in Sect. 2; the results of the research questions are discussed in Sect. 3; the synthesis of all the research questions is addressed in Sect. 4; and the conclusion and possible future work are discussed in Sect. 5.

Research method

We framed the research questions with PICOC criteria.

Population (P): Student essays and answer evaluation systems.

Intervention (I): Evaluation techniques, datasets, feature extraction methods.

Comparison (C): Comparison of various approaches and results.

Outcomes (O): Estimates of the accuracy of AES systems.

Context (C): NA.

Research questions

To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1: What are the datasets available for research on automated essay grading?

The answer to this question provides a list of the available datasets, their domains, and access to them. It also provides the number of essays and corresponding prompts.

RQ2: What are the features extracted for the assessment of essays?

The answer provides insight into the various features extracted so far and the libraries used to extract them.

RQ3: Which evaluation metrics are available for measuring the accuracy of algorithms?

The answer provides the different evaluation metrics for accurately measuring each machine learning approach, and the most commonly used techniques.

RQ4: What are the machine learning techniques used for automatic essay grading, and how are they implemented?

The answer provides insight into the various machine learning techniques used to implement essay grading systems, such as regression models, classification models, and neural networks, and into the different assessment approaches of automated essay grading systems.

RQ5: What are the challenges/limitations in the current research?

The answer provides the limitations of existing research approaches regarding cohesion, coherence, completeness, and feedback.

Search process

We conducted an automated search of well-known computer science repositories, including ACL, ACM, IEEE Xplore, Springer, and ScienceDirect. We considered papers published from 2010 to 2020, as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. The availability of free datasets, such as Kaggle (2012) and the Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. (2011), also spurred research in this domain.

Search strings: We used search strings like “Automated essay grading” OR “Automated essay scoring” OR “short answer scoring systems” OR “essay scoring systems” OR “automatic essay evaluation” and searched on metadata.

Selection criteria

After collecting all relevant documents from the repositories, we prepared selection criteria for the inclusion and exclusion of documents. The inclusion and exclusion criteria make the review more accurate and specific.

Inclusion criteria 1: We work with datasets comprising essays written in English. We excluded essays written in other languages.

Inclusion criteria 2: We included papers implementing AI approaches and excluded traditional methods from the review.

Inclusion criteria 3: Since the study is on essay scoring systems, we included only research carried out on text datasets rather than other datasets such as image or speech.

Exclusion criteria: We removed papers in the form of review papers, survey papers, and state-of-the-art papers.

Quality assessment

In addition to the inclusion and exclusion criteria, we assessed each paper with quality assessment questions to ensure article quality. We included documents that clearly explained the approach used, the result analysis, and validation.

The quality checklist questions were framed based on the guidelines from Kitchenham et al. (2009). Each quality assessment question was graded as either 1 or 0, so the final score of a study ranges from 0 to 3. The cut-off score for inclusion in the review is 2 points; papers scoring 2 or 3 points were included in the final evaluation. We framed the following quality assessment questions for the final study.

Quality Assessment 1: Internal validity.

Quality Assessment 2: External validity.

Quality Assessment 3: Bias.

Two reviewers reviewed each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the final agreement between the two reviewers. The resulting average kappa score is 0.6942, indicating substantial agreement between the reviewers. The results of the evaluation criteria are shown in Table 1. After quality assessment, the final list of papers for review is shown in Table 2. The complete selection process is shown in Fig. 1, and the number of selected papers per year is shown in Fig. 2.

Table 1: Quality assessment analysis

Table 2: Final list of papers

Fig. 1: Selection process

Fig. 2: Year-wise publications

What are the datasets available for research on automated essay grading?

To work on a problem statement, especially in the machine learning and deep learning domains, we require a considerable amount of data to train the models. To answer this question, we listed all the datasets used for training and testing automated essay grading systems. The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. (2011) contains 1244 essays and ten prompts. This corpus evaluates whether a student can write relevant English sentences without grammatical and spelling mistakes, and helps to test models built for GRE- and TOEFL-type exams. It gives scores between 1 and 40.

Bailey and Meurers (2008) created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems; the corpus consists of 566 responses from intermediate students. Mohler and Mihalcea (2009) created a dataset for the computer science domain consisting of 630 responses to data structure assignment questions, with scores ranging from 0 to 5 given by two human raters.

Dzikovska et al. (2012) created the Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus, with 56 questions and approximately 3000 student responses in the electrical and electronics domain, and the SCIENTSBANK (SemEval-2013) corpus (Dzikovska et al. 2013a, b), with 10,000 responses to 197 prompts in various science domains. The student responses are labeled "correct, partially correct incomplete, contradictory, irrelevant, non-domain."

The Kaggle (2012) competition released three corpora through the Automated Student Assessment Prize (ASAP1) (“ https://www.kaggle.com/c/asap-sas/ ”), covering essays and short answers. It has nearly 17,450 essays and provides up to 3000 essays per prompt. It has eight prompts that test US students in grades 7 to 10, with scores in ranges such as [0–3] and [0–60]. The limitations of these corpora are: (1) different prompts have different score ranges, and (2) evaluation relies on statistical features such as named entity extraction and lexical features of words. ASAP++ is a further dataset from Kaggle, with six prompts, each with more than 1000 responses, totaling 10,696 from 8th-grade students. Another corpus contains ten prompts from the science and English domains and a total of 17,207 responses. Two human graders evaluated all these responses.
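One practical response to the different-score-range-per-prompt limitation noted above is to min-max normalize scores onto a common [0, 1] scale before training across prompts, then map predictions back to each prompt's own scale for reporting. A minimal sketch:

```python
def normalize(score, low, high):
    """Map a raw score from the prompt's own range [low, high] onto [0, 1]."""
    return (score - low) / (high - low)

def denormalize(value, low, high):
    """Map a model prediction in [0, 1] back to the prompt's own scale."""
    return value * (high - low) + low
```

For example, a 30 on a [0–60] prompt and a 1.5 on a [0–3] prompt both normalize to 0.5, so a single model can treat them as equivalent targets.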

Correnti et al. (2013) created the Response-to-Text Assessment (RTA) dataset, used to check student writing skills in all directions, such as style, mechanics, and organization. The RTA responses come from 4th-8th grade students. Basu et al. (2013) created a power grading dataset with 700 responses to ten different prompts from US immigration exams. It contains all short answers for assessment.

The TOEFL11 corpus (Blanchard et al. 2013) contains 1100 essays evenly distributed over eight prompts. It is used to test the English language skills of candidates taking the TOEFL exam. It scores the language proficiency of a candidate as low, medium, or high.

For the International Corpus of Learner English (ICLE), Granger et al. (2009) built a corpus of 3663 essays covering different dimensions. It has 12 prompts with 1003 essays that test the organizational skill of essay writing, and 13 prompts, each with 830 essays, that examine thesis clarity and prompt adherence.

For Argument Annotated Essays (AAE), Stab and Gurevych (2014) developed a corpus that contains 102 essays with 101 prompts taken from the essayforum2 site. It tests the persuasive nature of the student essay. The SCIENTSBANK corpus used by Sakaguchi et al. (2015), available on GitHub, contains 9804 answers to 197 questions in 15 science domains. Table 3 lists all datasets related to AES systems.

All types of datasets used in automatic scoring systems

Features play a major role in neural network and other supervised Machine Learning approaches. Automatic essay grading systems score student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics, the features are categorized into three groups: (1) statistical-based features (Contreras et al. 2018; Kumar et al. 2019; Mathias and Bhattacharyya 2018a; b); (2) style-based (syntax) features (Cummins et al. 2016; Darwish and Mohamed 2020; Ke et al. 2019); (3) content-based features (Dong et al. 2017). A good set of features combined with an appropriate model yields a better AES system. The vast majority of researchers use regression models when the features are statistical-based; for neural network models, researchers use both style-based and content-based features. Table 4 presents the set of features used for essay grading in existing AES systems.
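The statistical-based category above covers surface counts computed directly from the essay text. A minimal sketch, using an illustrative regex tokenizer and feature names of our own choosing (not taken from any cited system):

```python
# Toy extraction of statistical (surface) features for AES.
# Tokenizer and feature set are illustrative assumptions.
import re

def extract_statistical_features(essay: str) -> dict:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "unique_word_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

features = extract_statistical_features(
    "Napoleon was born in Corsica. He became Emperor of France."
)
```

Such features feed regression models directly; style- and content-based features require parsers or embeddings on top of this.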

Types of features

We studied the feature-extracting NLP libraries used in the papers, as shown in Fig. 3. NLTK is an NLP tool used to retrieve statistical features like POS, word count, sentence count, etc. With NLTK alone, we can miss the essay's semantic features. To find semantic features, Word2Vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014) are the most used libraries for retrieving semantic text from the essays. In some systems, the model is trained directly on word embeddings to find the score. From Fig. 4, we observe that non-content-based feature extraction is used more often than content-based.

Fig. 3 Usage of tools

Fig. 4 Number of papers using content-based features

RQ3: Which evaluation metrics are available for measuring the accuracy of the algorithms?

The majority of AES systems use three evaluation metrics: (1) quadratic weighted kappa (QWK), (2) Mean Absolute Error (MAE), and (3) Pearson Correlation Coefficient (PCC) (Shehab et al. 2016). The quadratic weighted kappa measures agreement between the human evaluation score and the system evaluation score, producing values ranging from 0 to 1. The Mean Absolute Error is the average absolute difference between the human-rated score and the system-generated score. The mean square error (MSE) measures the average of the squared errors, i.e., the average squared difference between the human-rated and the system-generated scores; MSE is always non-negative. Pearson's Correlation Coefficient (PCC) finds the correlation between the two score variables and ranges from −1 to 1: "0" means the human-rated and system scores are unrelated, "1" means the two scores increase together, and "−1" indicates a negative relationship between the two scores.
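The three metrics can be computed with standard libraries; a minimal sketch on fabricated toy scores:

```python
# QWK, MAE, and PCC on toy human vs. system scores.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

human  = [3, 2, 4, 4, 1, 5]   # human-rated scores (toy data)
system = [3, 2, 3, 4, 2, 5]   # system-generated scores (toy data)

qwk = cohen_kappa_score(human, system, weights="quadratic")
mae = mean_absolute_error(human, system)
pcc, _ = pearsonr(human, system)
```

Note that QWK applies to integer rating scales, so regression outputs are usually rounded to the scale before computing it.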

RQ4: What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

After scrutinizing all documents, we categorize the techniques used in automated essay grading systems into four baskets: (1) regression techniques, (2) classification models, (3) neural networks, and (4) ontology-based approaches.

All the existing AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods viewed the AES problem as either a regression or a classification task. The goal of the regression task is to predict the score of an essay. The classification task is to classify essays as having low, medium, or high relevance to the question's topic. In the last three years, most AES systems developed have made use of neural networks.

Regression based models

Mohler and Mihalcea (2009) proposed text-to-text semantic similarity to assign scores to student essays. There are two classes of text similarity measures: knowledge-based measures and corpus-based measures. They evaluated eight knowledge-based measures. The shortest-path measure determines similarity based on the length of the shortest path between two concepts. Leacock & Chodorow find the similarity based on the length of the shortest path between two concepts using node counting. The Lesk measure finds the overlap between the corresponding definitions, and the Wu & Palmer algorithm finds similarity based on the depth of the two given concepts in the WordNet taxonomy. Resnik, Lin, Jiang & Conrath, and Hirst & St-Onge find the similarity based on different parameters such as concept probability, normalization factors, and lexical chains. Among the corpus-based measures are LSA BNC, LSA Wikipedia, and ESA Wikipedia; latent semantic analysis trained on Wikipedia has excellent domain knowledge. Among all similarity measures, LSA Wikipedia achieved the highest correlation with human scores. But these similarity measure algorithms do not use deeper NLP concepts. These pre-2010 models provided the basic concepts, and research on automated essay grading has continued with updated neural network algorithms and content-based features.
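The Lesk idea mentioned above, scoring word senses by the overlap of their definitions, can be sketched in a few lines. The glosses below are hypothetical dictionary definitions written for illustration, not from WordNet:

```python
# Toy gloss-overlap (Lesk-style) similarity: count shared content
# words between two definitions. Glosses are fabricated examples.

def lesk_overlap(gloss_a: str, gloss_b: str) -> int:
    stop = {"a", "an", "the", "of", "or", "that", "is", "to", "in"}
    tokens_a = {w for w in gloss_a.lower().split() if w not in stop}
    tokens_b = {w for w in gloss_b.lower().split() if w not in stop}
    return len(tokens_a & tokens_b)

gloss_bank_river = "sloping land beside a body of water"
gloss_bank_money = "a financial institution that accepts deposits of money"
gloss_shore      = "the land along the edge of a body of water"
```

The river sense of "bank" overlaps more with "shore" than the money sense does, which is how Lesk-style measures disambiguate senses.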

Adamson et al. (2014) proposed an automatic essay grading system using a statistical approach. They retrieved features like POS, character count, word count, sentence count, misspelled words, and n-gram representations of words to prepare an essay vector. They formed a matrix from all these vectors and applied LSA to give a score to each essay. It is a statistical approach that does not consider the semantics of the essay. The agreement between the human rater score and the system was 0.532.
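The LSA step in such a pipeline amounts to a truncated SVD of the essay-term matrix; a minimal sketch with toy essays and an arbitrary latent dimensionality (both our assumptions, not from the cited system):

```python
# LSA sketch: essay-term count matrix reduced with truncated SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

essays = [
    "napoleon was emperor of france",
    "napoleon was exiled to st helena",
    "the essay discusses waterloo and austerlitz",
    "france crowned napoleon emperor",
]
X = CountVectorizer().fit_transform(essays)          # essay-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
X_lsa = lsa.fit_transform(X)                         # one dense vector per essay
```

Each essay becomes a low-dimensional vector whose proximity to other essays can drive scoring.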

Cummins et al. (2016) proposed a Timed Aggregate Perceptron vector model to rank all the essays, and later converted the ranking algorithm to predict the score of each essay. The model was trained with features like word unigrams, bigrams, POS, essay length, grammatical relations, maximum word length, and sentence length. It is a multi-task learning approach that ranks the essays and predicts a score for each. The performance evaluated through QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. (2016) proposed a Ridge regression model for short answer scoring with Question Demoting. Question Demoting is a concept included in the answer's final assessment to discount words repeated from the question in the response. The extracted features are text similarity (the similarity between the student response and the reference answer), question demoting (the number of question words repeated in a student response), term weights assigned with inverse document frequency, and the sentence length ratio (based on the number of words in the student response). With these features, the Ridge regression model achieved an accuracy of 0.887.
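A ridge-regression scorer over features of this kind is straightforward to sketch. The feature values and human scores below are fabricated toy data, not from the cited study:

```python
# Ridge regression on toy short-answer features:
# [similarity to reference, idf-weighted overlap, length ratio].
from sklearn.linear_model import Ridge

X_train = [[0.9, 0.8, 1.0],
           [0.4, 0.3, 0.5],
           [0.7, 0.6, 0.9],
           [0.1, 0.1, 0.2]]
y_train = [5.0, 2.0, 4.0, 0.5]   # human-rated scores (toy)

model = Ridge(alpha=1.0).fit(X_train, y_train)
pred = model.predict([[0.8, 0.7, 0.9]])[0]
```

The L2 penalty (alpha) keeps the weights stable when the handcrafted features are correlated, as similarity-style features usually are.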

Contreras et al. (2018) proposed an ontology-based text-mining model that scores essays in phases. In phase I, they generated ontologies with OntoGen and used SVM to find the concepts and similarity in the essay. In phase II, from the ontologies they retrieved features like essay length, word counts, correctness, vocabulary, types of words used, and domain information. After retrieving this statistical data, they used a linear regression model to find the score of the essay. The accuracy score averages 0.5.

Darwish and Mohamed (2020) proposed a fusion of fuzzy ontology with LSA. They retrieve two types of features: syntax features and semantic features. For syntax features, they perform lexical analysis on the tokens and construct a parse tree; if the parse tree is broken, the essay is inconsistent, and a separate grade is assigned to the essay for the syntax features. The semantic features include similarity analysis and spatial data analysis: similarity analysis finds duplicate sentences, and spatial data analysis finds the Euclidean distance between the center and a part. Later, they combine the syntax and morphological feature scores into a final score. The accuracy achieved with the multiple linear regression model, mostly on statistical features, is 0.77.

Süzen et al. (2020) proposed a text-mining approach for short answer grading. First, the model answer is compared with the student response by calculating the distance between the two sentences. By comparing the model answer with the student response, they measure the answer's completeness and provide feedback. In this approach, the model vocabulary plays a vital role in grading: with this vocabulary, a grade is assigned to the student's response and feedback is provided. The correlation between the student answer and the model answer is 0.81.

Classification based Models

Persing and Ng (2013) used a support vector machine to score the essays. The features extracted are POS, n-grams, and semantic text to train the model, and keywords identified from the essay determine the final score.

Sakaguchi et al. (2015) proposed two methods: response-based and reference-based. In response-based scoring, the extracted features are response length, an n-gram model, and syntactic elements, used to train a support vector regression model. In reference-based scoring, features such as sentence similarity computed with word2vec are used: the cosine similarity of the sentences gives the score of the response. The scores were first computed individually and then combined into a final score. This system gave a remarkable increase in performance by combining the scores.
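The reference-based part can be sketched by averaging word vectors for the response and the reference answer and taking their cosine similarity. The 3-dimensional vectors below are hand-made toys standing in for real word2vec embeddings:

```python
# Reference-based scoring sketch: cosine similarity of averaged
# word vectors. Vectors are fabricated, not trained embeddings.
import numpy as np

toy_vectors = {
    "napoleon": np.array([0.9, 0.1, 0.0]),
    "emperor":  np.array([0.8, 0.2, 0.1]),
    "france":   np.array([0.7, 0.3, 0.0]),
    "banana":   np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(words):
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = sentence_vector(["napoleon", "emperor", "france"])
good      = sentence_vector(["emperor", "france"])
bad       = sentence_vector(["banana"])
```

A response sharing vocabulary (or nearby vectors) with the reference answer scores higher than an off-topic one.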

Mathias and Bhattacharyya (2018a; b) proposed an automated essay grading dataset with essay attribute scores. The first concern is feature selection, which depends on the essay type; the common attributes are content, organization, word choice, sentence fluency, and conventions. In this system, each attribute is scored individually, identifying the strength of each attribute. The model used is a random forest classifier that assigns scores to the individual attributes. The QWK accuracy is 0.74 for prompt 1 of the ASAP-SAS dataset (https://www.kaggle.com/c/asap-sas/).

Ke et al. (2019) used a support vector machine to find the response score. This method uses features like agreeability, specificity, clarity, relevance to prompt, conciseness, eloquence, confidence, direction of development, justification of opinion, and justification of importance. The individual parameter scores were obtained first and later combined into a final response score. The features are also used in a neural network to determine whether a sentence is relevant to the topic.

Salim et al. (2019) proposed an XGBoost Machine Learning classifier to assess essays. The algorithm was trained on features like word count, POS, parse tree depth, and coherence in the articles, with sentence similarity percentage; cohesion and coherence are considered for training. They implemented K-fold cross-validation, and the average accuracy after validation is 68.12.
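The k-fold protocol used there can be sketched as follows. Since XGBoost may not be installed, scikit-learn's gradient boosting classifier stands in for it, and the features and labels are random toy data:

```python
# K-fold cross-validation sketch with a gradient-boosting classifier
# (stand-in for XGBoost) on fabricated essay features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 4))                     # e.g. word count, POS ratio, tree depth, coherence
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)   # toy "good essay" label

scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
mean_accuracy = scores.mean()
```

Averaging accuracy across folds, as in the cited work, gives a less optimistic estimate than a single train/test split.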

Neural network models

Shehab et al. (2016) proposed a neural network method that uses learning vector quantization to train on human-scored essays. After training, the network can provide scores to ungraded essays. First, the essay is spell-checked, then preprocessing steps like document tokenization, stop-word removal, and stemming are applied before submitting it to the neural network. Finally, the model provides feedback on whether the essay is relevant to the topic. The correlation coefficient between the human rater and system score is 0.7665.

Kopparapu and De (2016) proposed automatic ranking of essays using structural and semantic features. This approach constructs a super-essay from all the responses; a student essay is then ranked against the super-essay. The derived structural and semantic features help to obtain the scores. Fifteen structural features per paragraph, such as the average number of sentences, the average length of sentences, and the counts of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score. A similarity score is used as the semantic feature to calculate the overall score.

Dong and Zhang (2016) proposed a hierarchical CNN model. The model's first layer uses word embeddings to represent the words. The second layer is a word-level convolution layer with max-pooling to find word vectors. The next layer is a sentence-level convolution layer with max-pooling to capture each sentence's content and synonyms. A fully connected dense layer produces an output score for the essay. The hierarchical CNN model resulted in an average QWK of 0.754.
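The convolution-plus-max-pooling step at the heart of such models can be sketched in plain numpy. The filter weights here are random and untrained, and all shapes are toy assumptions; a real model learns the weights end-to-end:

```python
# Numpy sketch of word-level convolution + max-pooling over embeddings.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.random((7, 5))          # 7 words, 5-dim embeddings (toy)
window, n_filters = 3, 4
W = rng.random((n_filters, window * 5))  # untrained convolution filters

# One activation per window position per filter.
conv = np.stack([W @ embeddings[i:i + window].ravel()
                 for i in range(7 - window + 1)])      # shape (5, 4)
sentence_vector = conv.max(axis=0)                     # max-pool over positions
```

Max-pooling keeps the strongest response of each filter regardless of where in the sentence it fired, which is what lets the model tolerate variable-length input.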

Taghipour and Ng (2016) proposed one of the first neural approaches for essay scoring, in which convolutional and recurrent neural network concepts help in scoring an essay. The network uses a lookup table with a one-hot representation of the word vectors of an essay. The final network model with LSTM resulted in an average QWK of 0.708.

Dong et al. (2017) proposed an attention-based scoring system with CNN + LSTM to score an essay. For the CNN, the input parameters were character embeddings and word embeddings (obtained with NLTK), with attention pooling layers. The output is a sentence vector, which provides sentence weights. After the CNN, an LSTM layer with an attention pooling layer produces the final score of the responses. The average QWK score is 0.764.

Riordan et al. (2017) proposed a neural network with CNN and LSTM layers. Word embeddings are given as input to the network. An LSTM layer retrieves the window features and delivers them to the aggregation layer. The aggregation layer is a shallow layer that takes the correct window of words and feeds successive layers to predict the answer's score. The network resulted in a QWK of 0.90.

Zhao et al. (2017) proposed a Memory-Augmented Neural network with four layers: an input representation layer, a memory addressing layer, a memory reading layer, and an output layer. The input layer represents all essays in vector form based on essay length. After converting to word vectors, the memory addressing layer takes a sample of the essays and weights all the terms. The memory reading layer takes the input from the memory addressing segment and finds the content to finalize the score. Finally, the output layer provides the final score of the essay. The accuracy of essay scoring is 0.78, which is far better than an LSTM neural network.

Mathias and Bhattacharyya (2018a; b) proposed deep learning networks using LSTM with a CNN layer and GloVe pre-trained word embeddings. They retrieved features like sentence count, word count per sentence, number of OOVs in the sentence, language model score, and the text's perplexity. The network predicted the goodness score of each essay; the higher the goodness score, the higher the rank, and vice versa.

Nguyen and Dery (2016) proposed neural networks for automated essay grading. In this method, a single-layer bi-directional LSTM accepts word vectors as input. Using GloVe vectors, this method achieved an accuracy of 90%.

Ruseti et al. (2018) proposed a recurrent neural network that is capable of memorizing the text and generating a summary of an essay. A Bi-GRU network with a max-pooling layer is built on the word embeddings of each document. It scores the essay by comparing it with a summary of the essay from another Bi-GRU network. The result obtained an accuracy of 0.55.

Wang et al. (2018a; b) proposed an automatic scoring system with a bi-LSTM recurrent neural network model, retrieving features with the word2vec technique. This method generated word embeddings from the essay words using the skip-gram model, and the embeddings were later used to train the neural network to find the final score. The softmax layer in the LSTM obtains the importance of each word. This method achieved a QWK score of 0.83.

Dasgupta et al. (2018) proposed a technique for essay scoring that augments textual qualitative features. It extracted three types of features (linguistic, cognitive, and psychological) associated with a text document. The linguistic features are Part of Speech (POS), universal dependency relations, structural well-formedness, lexical diversity, sentence cohesion, causality, and informativeness of the text. The psychological features are derived from the Linguistic Inquiry and Word Count (LIWC) tool. They implemented a convolutional recurrent neural network that takes word embeddings and sentence vectors, retrieved from GloVe word vectors, as input. The second layer is a convolution layer to find local features, and the next layer is a recurrent neural network (LSTM) to find correspondences in the text. The accuracy of this method resulted in an average QWK of 0.764.

Liang et al. (2018) proposed a symmetrical neural network AES model with Bi-LSTM. They extract features from sample essays and student essays and prepare an embedding layer as input. The embedding layer output is transferred to a convolution layer, from which the LSTM is trained. Here the LSTM model has a self-feature extraction layer, which finds the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. (2019) proposed two-stage learning. In the first stage, a score is assigned based on semantic data from the essay. The second-stage scoring is based on handcrafted features like grammar correctness, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Uria Rodriguez et al. (2019) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics of a sentence from both directions, and the XLNet sequence-to-sequence learning model to extract features like the next sentence in an essay. With these pre-trained models, they captured coherence from the essay to give the final score. The average QWK score of the model is 75.5.

Xia et al. (2019) proposed a two-layer bi-directional LSTM neural network for scoring essays. The features were extracted with word2vec to train the LSTM, and the model's accuracy in average QWK is 0.870.

Kumar et al. (2019) proposed AutoSAS for short answer scoring. It used pre-trained Word2Vec and Doc2Vec models, trained on the Google News corpus and a Wikipedia dump respectively, to retrieve the features. First, they POS-tagged every word and found weighted words from the response. They also computed prompt overlap to observe how relevant the answer is to the topic, defining lexical overlaps like noun overlap, argument overlap, and content overlap. The method further uses statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, sentence statistics, word length, and logical-operator-based features. A random forest model is trained on the dataset, which has sample responses with their associated scores; the model retrieves features from both graded and ungraded short answers together with the questions. The accuracy of AutoSAS in QWK is 0.78. It works on many topics, such as science, arts, biology, and English.
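A random forest over statistical features of this kind can be sketched as follows; the feature columns and scores are fabricated toy data, not AutoSAS's actual feature set:

```python
# Random-forest scorer sketch on toy statistical features:
# [word count, unique-word ratio, prompt-overlap count].
from sklearn.ensemble import RandomForestRegressor

X = [[120, 0.62, 8],
     [40,  0.50, 2],
     [95,  0.70, 6],
     [20,  0.45, 1],
     [150, 0.66, 9]]
y = [4.0, 2.0, 3.5, 1.0, 4.5]   # human-rated scores (toy)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
score = forest.predict([[100, 0.65, 7]])[0]
```

Because forest predictions average leaf values, the output always stays within the range of the training scores, a convenient property for bounded rating scales.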

Lun et al. (2020) proposed automatic short answer scoring with BERT, comparing student responses with a reference answer and assigning scores. Data augmentation is done with the neural network: using one correct answer from the dataset, the remaining responses are classified as correct or incorrect.

Zhu and Sun (2020) proposed a multimodal Machine Learning approach for automated essay scoring. First, they compute a grammar score with the spaCy library, along with numerical counts such as the number of words and sentences using the same library. With this input, they trained single and bi-directional LSTM neural networks to find the final score. For the LSTM model, they prepared sentence vectors with GloVe and word embeddings with NLTK. The Bi-LSTM checks each sentence in both directions to extract semantics from the essay. The average QWK score across the multiple models is 0.70.

Ontology based approach

Mohler et al. (2011) proposed a graph-based method to find semantic similarity in short answer scoring. For ranking the answers, they used a support vector regression model. A bag of words is the main feature extracted in the system.

Ramachandran et al. (2015) also proposed a graph-based approach to find lexically based semantics. Identified phrase patterns and text patterns are the features used to train a random forest regression model to score the essays. The accuracy of the model in QWK is 0.78.

Zupanc et al. (2017) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola (2017) recommended an ontology-based information extraction approach and a domain-based ontology to find the score.

Speech response scoring

Automatic scoring comes in two forms: text-based scoring and speech-based scoring. This paper has discussed text-based scoring and its challenges; we now cover speech scoring and the common points between text- and speech-based scoring. Evanini and Wang (2013) worked on speech scoring of non-native school students, extracted features with a speech rater, and trained a linear regression model, concluding that accuracy varies based on voice pitch. Loukina et al. (2015) worked on feature selection from speech data and trained an SVM. Malinin et al. (2016) used neural network models to train the data. Loukina et al. (2017) proposed speech- and text-based automatic scoring: they extracted text-based and speech-based features and trained a deep neural network for speech-based scoring, extracting 33 types of features based on acoustic signals. Malinin et al. (2017) and Wu et al. (2020) worked on deep neural networks for spoken language assessment, incorporating and testing different types of models. Ramanarayanan et al. (2017) worked on feature extraction methods, extracted punctuation, fluency, and stress, and trained different Machine Learning models for scoring. Knill et al. (2018) worked on automatic speech recognizers and how their errors impact speech assessment.

The state of the art

This section provides an overview of the existing AES systems with a comparative study with respect to the models, features applied, datasets, and evaluation metrics used for building automated essay grading systems. We divided all 62 papers into two sets; Table 5 presents the first set of reviewed papers with a comparative study of the AES systems.

State of the art

Comparison of all approaches

In our study, we divided the major AES approaches into three categories: regression models, classification models, and neural network models. The regression models fail to find cohesion and coherence in the essay because they are trained on BoW (Bag of Words) features. In processing data from input to output, the regression models are less complicated than neural networks, but they are unable to find many intricate patterns in the essay and unable to capture sentence connectivity. Even in the neural network approach, if we train the model with BoW features, the model never considers the essay's cohesion and coherence.

First, to train a Machine Learning algorithm with essays, all the essays are converted to vector form. We can form a vector with BoW, TF-IDF, or Word2vec. The BoW and Word2vec vector representations of essays are shown in Table 6. The vector representation of BoW with TF-IDF does not incorporate the essay's semantics; it is just statistical learning from a given vector. A Word2vec vector captures the semantics of an essay, but only in a unidirectional way.

Vector representation of essays

In BoW, the vector contains the frequency of word occurrences in the essay: the vector holds 1 or more based on the occurrences of words in the essay and 0 when a word is not present. So in BoW, the vector does not maintain any relationship with adjacent words; it treats single words only. In word2vec, the vector represents the relationship between a word and other words and sentences of the prompt in multiple dimensions. But word2vec prepares vectors in a unidirectional way, not bidirectionally; word2vec fails to produce a proper semantic vector when a word has two meanings and the meaning depends on adjacent words. Table 7 presents a comparison of Machine Learning models and feature extraction methods.
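The contrast between raw BoW counts and TF-IDF weights can be shown on two toy essays (fabricated for illustration): both vectorizers count words, but TF-IDF down-weights words shared across documents.

```python
# BoW counts vs. TF-IDF weights on the same toy essays.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

essays = ["napoleon was emperor of france",
          "napoleon was exiled to st helena"]

bow = CountVectorizer().fit(essays)
tfidf = TfidfVectorizer().fit(essays)

bow_vec = bow.transform(essays).toarray()[0]      # raw counts for essay 0
tfidf_vec = tfidf.transform(essays).toarray()[0]  # idf-weighted for essay 0
```

In the first essay, "emperor" and "napoleon" each occur once, so their BoW counts are identical; TF-IDF gives "emperor" a higher weight because "napoleon" also appears in the second essay. Neither representation encodes word order or sense.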

Comparison of models

In AES, cohesion and coherence check the content of the essay with respect to the essay prompt; these can be extracted from the essay in vector form. Two more parameters for assessing an essay are completeness and feedback. Completeness checks whether a student's response is sufficient, even when what the student wrote is correct. Table 8 compares all four parameters for essay grading. Table 9 compares all approaches based on various features like grammar, spelling, organization of the essay, and relevance.

Comparison of all models with respect to cohesion, coherence, completeness, and feedback

Comparison of all approaches on various features

What are the challenges/limitations in the current research?

From our study and the results discussed in the previous sections, many researchers have worked on automated essay scoring systems with numerous techniques. We have statistical methods, classification methods, and neural network approaches for evaluating essays automatically. The main goal of an automated essay grading system is to reduce human effort and improve consistency.

The vast majority of essay scoring systems deal with the efficiency of the algorithm. But there are many challenges in automated essay grading systems. One should assess the essay by parameters like the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge.

No model works on the relevance of content, i.e., whether the student's response or explanation is relevant to the given prompt and, if relevant, how appropriate it is; and there is little discussion about the cohesion and coherence of the essays. Most research has concentrated on extracting features using NLP libraries, training the models, and testing the results. But there is no treatment in these essay evaluation systems of consistency and completeness, although Palma and Atkinson (2018) explained coherence-based essay evaluation, and Zupanc and Bosnic (2014) also used coherence to evaluate essays. They measured consistency with latent semantic analysis (LSA) to find coherence in essays; the dictionary meaning of coherence is "the quality of being logical and consistent."

Another limitation is that there is no domain-knowledge-based evaluation of essays using Machine Learning models. For example, the meaning of "cell" differs between biology and physics. Many Machine Learning models extract features with Word2Vec and GloVe; these NLP libraries cannot assign distinct vectors to words that have two or more meanings.

Other challenges also influence automated essay scoring systems.

All these approaches worked to improve the QWK score of their models. But QWK does not assess the model in terms of feature extraction or constructed irrelevant answers; it does not evaluate whether the model is assessing the answer correctly. There are many challenges concerning students' responses to an automatic scoring system: in terms of evaluation approach, no model has examined how to evaluate constructed irrelevant and adversarial answers. Black-box approaches like deep learning models especially give students more options to bluff the automated scoring systems.

The Machine Learning models that work on statistical features are very vulnerable. Based on Powers et al. (2001) and Bejar et al. (2014), the E-rater failed against the Constructed Irrelevant Responses Strategy (CIRS). From the studies of Bejar et al. (2013) and Higgins and Heilman (2014), it is observed that when a student response contains irrelevant content or shell language conforming to the prompt, it will influence the final score of the essay in an automated scoring system.

In deep learning approaches, most of the models read the essay's features automatically; some methods work on word-based embeddings and others on character-based embedding features. From the study of Riordan et al. (2019), character-based embedding systems do not prioritize spelling correction, yet spelling influences the final score of the essay. From the study of Horbach and Zesch (2019), various factors influence AES systems, for example dataset size, prompt type, answer length, training set, and human scorers for content-based scoring.

Ding et al. (2020) showed that automated scoring systems are vulnerable when a student response contains many words from the prompt, i.e., prompt vocabulary repeated in the response. Parekh et al. (2020) and Kumar et al. (2020) tested various neural network AES models by iteratively adding important words, deleting unimportant words, shuffling the words, and repeating sentences in an essay, and found no change in the final scores of the essays. These neural network models failed to recognize the lack of common sense in adversarial essays, giving students more options to bluff the automated systems.
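The shuffle-words probe is easy to reproduce on any order-insensitive pipeline. The sketch below uses a toy BoW linear scorer rather than the neural models in the cited studies, but the insensitivity to word order is the same for any bag-of-words representation (essays and scores are fabricated):

```python
# Shuffle-words probe: a BoW-trained scorer gives identical scores
# to an essay and its word-shuffled version. Toy data throughout.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

train = ["napoleon was emperor of france",
         "the weather is nice today",
         "napoleon won at austerlitz"]
scores = [5.0, 1.0, 4.0]

vec = CountVectorizer()
model = LinearRegression().fit(vec.fit_transform(train), scores)

essay = "napoleon was emperor of france"
shuffled = "france of emperor was napoleon"   # nonsense order, same words
same = (model.predict(vec.transform([essay]))[0]
        == model.predict(vec.transform([shuffled]))[0])
```

Both strings map to the same count vector, so the model cannot distinguish a coherent sentence from its scrambled counterpart.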

Beyond NLP and ML techniques for AES, from Wresch (1993) to Madnani and Cahill (2018), researchers have discussed the complexity of AES systems and the standards that need to be followed, like assessment rubrics to test subject knowledge, handling of irrelevant responses, and ethical aspects of an algorithm such as measuring the fairness of student response scoring.

Fairness is an essential factor for automated systems. For example, in AES, fairness can be measured through the agreement between human and machine scores. Beyond this, from Loukina et al. (2019), the fairness standards include overall score accuracy, overall score differences, and conditional score differences between human and system scores. In addition, scoring responses with a view to constructed relevant and irrelevant content will improve fairness.

Madnani et al. (2017a, b) discussed the fairness of AES systems for constructed responses and presented RSMTool, an open-source tool for detecting biases in scoring models; with it, users can adapt the fairness standards to their own analyses.

In Berzak et al.'s (2018) approach, behavioral factors are a significant input for automated scoring systems: they help determine language proficiency, identify word characteristics (essential words in the text), predict critical patterns, find related sentences in an essay, and yield a more accurate score.

Rupp (2018) discussed methodologies for designing, evaluating, and deploying AES systems, and identified the notable characteristics an AES system needs for deployment: model performance, evaluation metrics, threshold values, dynamically updated models, and the overall framework.

First, model performance should be checked on different datasets and parameters before operational deployment. The evaluation metrics for AES models are typically quadratic weighted kappa (QWK), the correlation coefficient, or both. Kelley and Preacher (2012) discussed three categories of threshold values: marginal, borderline, and acceptable; the appropriate values vary with data size, model performance, and type of model (single scoring model vs. multiple scoring models). Once a model is deployed and evaluating millions of responses, it must be updated dynamically based on the prompt and incoming data. Finally, there is framework design: here a framework contains the prompts to which test-takers write responses. One can design either a single scoring model for a single methodology or multiple scoring models for multiple concepts. When multiple scoring models are deployed, each prompt can be trained separately, or a generalized model can serve all prompts; with the latter, accuracy may vary, which is challenging.
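As an illustration of the QWK metric mentioned above, here is a minimal pure-Python implementation (our own sketch; production systems typically rely on a library implementation):

```python
from collections import Counter

def quadratic_weighted_kappa(human, machine, min_rating=0, max_rating=5):
    """Agreement between two integer-valued raters, penalizing large
    disagreements quadratically; 1.0 = perfect, 0.0 = chance-level."""
    n = max_rating - min_rating + 1
    # Observed joint distribution of (human, machine) ratings.
    observed = [[0.0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h - min_rating][m - min_rating] += 1
    total = len(human)
    # Marginal histograms give the expected counts under independence.
    hist_h = Counter(h - min_rating for h in human)
    hist_m = Counter(m - min_rating for m in machine)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2      # quadratic penalty
            expected = hist_h[i] * hist_m[j] / total  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

# Perfect agreement yields kappa = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
```

Because the weight grows with the square of the disagreement, being off by two points costs four times as much as being off by one, which matches how essay-scoring competitions such as ASAP rank systems.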

Our systematic literature review of automated essay grading systems first collected 542 papers matching the selected keywords in various databases. After applying the inclusion and exclusion criteria, 139 articles remained; to these we applied the quality assessment criteria with two reviewers, finally selecting 62 papers for the review.

Our observations on automated essay grading systems from 2010 to 2020 are as follows:

  • The implementation techniques of automated essay grading systems fall into four buckets: (1) regression models, (2) classification models, (3) neural networks, and (4) ontology-based methods. Of these, neural network approaches achieve higher accuracy than the other techniques; the state of the art for each method is provided in Table 3.
  • The majority of regression and classification models for essay scoring use statistical features to arrive at the final score; that is, the models are trained on parameters such as word count and sentence count. Although these parameters are extracted from the essay, the algorithm does not train on the essay text directly: it trains on numbers derived from the essay, and if those numbers match the expected profile the composition receives a good score; otherwise the rating is lower. In such models the evaluation rests entirely on numbers, irrespective of the essay itself, so an algorithm trained only on statistical parameters has a high chance of missing the coherence and relevance of the essay.
  • In the neural network approaches, many models were trained on bag-of-words (BoW) features. BoW features lose the relationships between words and the semantic meaning of a sentence. For example, Sentence 1: "John killed Bob." Sentence 2: "Bob killed John." Both sentences reduce to the same bag of words: "John," "killed," "Bob."
  • With the Word2Vec library, a word vector prepared from an essay in a unidirectional way captures dependencies with neighboring words and finds semantic relationships between them. But when a word has two or more senses, as in "bank loan" versus "river bank," the adjacent words determine the intended meaning; in such cases Word2Vec, which assigns a single vector per word, cannot recover the word's real meaning in the sentence.
  • The features extracted from essays in essay scoring systems are classified into three types: statistical features, style-based features, and content-based features, as explained in RQ2 and Table 3. Statistical features play a significant role in some systems and a negligible one in others. In Shehab et al. (2016), Cummins et al. (2016), Dong et al. (2017), Dong and Zhang (2016), and Mathias and Bhattacharyya (2018a, b), the assessment rests entirely on statistical and style-based features; these systems do not retrieve any content-based features. In other systems, which do extract content from the essays, statistical features serve only to preprocess the essays and are not included in the final grading.
  • In AES systems, coherence is a main feature to be considered when evaluating essays. Coherence literally means "sticking together": the logical connection of sentences (local-level coherence) and paragraphs (global-level coherence) in a text. Without coherence, the sentences of a paragraph are independent and meaningless. In an essay, coherence is the feature that keeps the explanation flowing and meaningful, and a powerful one for capturing an essay's semantics. With coherence, one can assess whether all sentences connect in a flow and all paragraphs work together to address the prompt. Retrieving the coherence level from an essay remains a critical task for researchers in AES.
  • In automatic essay grading systems, assessing an essay's content is critical, as this determines the student's actual score. Most research has used statistical features such as sentence length, word count, and number of sentences, but according to our collected results only 32% of systems used content-based features for essay scoring. Examples of content-based assessment include Taghipour and Ng (2016), Persing and Ng (2013), Wang et al. (2018a, b), Zhao et al. (2017), and Kopparapu and De (2016); Kumar et al. (2019), Mathias and Bhattacharyya (2018a, b), and Mohler and Mihalcea (2009) used both content-based and statistical features. The results are shown in Fig. 3. Content-based features are mainly extracted with the word2vec NLP library. Word2vec can capture a word's context in a document, semantic and syntactic similarity, and relations with other terms, but it captures context in only one direction, left or right; if a word has multiple meanings, there is a chance of missing its context in the essay. After analyzing all the papers, we found that content-based assessment amounts to a qualitative assessment of essays.
  • On the other hand, Horbach and Zesch (2019), Riordan et al. (2019), Ding et al. (2020), and Kumar et al. (2020) proved that neural network models are vulnerable to construct-irrelevant and adversarial answers: a student can bluff an automated scoring system by, for example, repeating sentences or repeating prompt words in an essay. As Loukina et al. (2019) and Madnani et al. (2017b) argue, the fairness of the algorithm is an essential factor to consider in AES systems.
  • For speech assessment, the datasets contain audio clips of up to one minute. The feature extraction techniques differ entirely from those for text assessment, and accuracy varies with speaking fluency, pitch, male versus female voice, and child versus adult voice; the training algorithms, however, are the same for text and speech assessment.
  • Once AES systems can evaluate essays and short answers accurately in all respects, there will be massive demand for them in education and related fields. AES systems are already deployed in the GRE and TOEFL exams; they could also be deployed in massive open online courses such as Coursera (“ https://coursera.org/learn//machine-learning//exam ”) and NPTEL ( https://swayam.gov.in/explorer ), which still assess student performance with multiple-choice questions. From another perspective, AES systems could be deployed in information-retrieval platforms such as Quora and Stack Overflow to check whether a retrieved response is appropriate to the question and to rank the retrieved answers.
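The bag-of-words limitation noted in the observations above takes only a few lines of plain Python to demonstrate (a toy sketch, not code from any surveyed system):

```python
from collections import Counter

def bag_of_words(sentence):
    """Order-free word counts: all positional and semantic information
    about who-did-what-to-whom is discarded."""
    return Counter(sentence.lower().replace(".", "").split())

s1 = bag_of_words("John killed Bob.")
s2 = bag_of_words("Bob killed John.")
print(s1 == s2)  # True: a BoW model cannot tell who killed whom
```

Any downstream model fed only these counts necessarily assigns both sentences the same score, which is the core argument for sequence-aware representations.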

Conclusion and future work

As per our systematic literature review, we studied 62 papers. Significant challenges remain for researchers implementing automated essay grading systems, and several researchers are working rigorously on building robust AES systems despite the difficulty of the problem. No existing method evaluates essays on coherence, relevance, completeness, feedback, and domain knowledge together. Moreover, 90% of essay grading systems use the Kaggle ASAP (2012) dataset, whose general student essays require no domain knowledge, so domain-specific essay datasets are needed for training and testing. Feature extraction relies on the NLTK, Word2Vec, and GloVe NLP libraries, which have many limitations when converting a sentence into vector form. Beyond feature extraction and training machine learning models, no system assesses an essay's completeness, provides feedback on the student's response, or retrieves coherence vectors from the essay. From another perspective, construct-irrelevant and adversarial student responses still call AES systems into question.

Our proposed research will pursue content-based assessment of essays with domain knowledge, scoring essays for internal and external consistency. We will also create a new dataset for a single domain; improving feature extraction techniques is another area for future work.

This study includes only four digital databases for study selection and may therefore miss some relevant studies on the topic. However, we hope we have covered most of the significant studies, as we also manually collected papers published in relevant journals.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Dadi Ramesh, Email: dadiramesh44@gmail.com

Suresh Kumar Sanampudi, Email: sureshsanampudi@jntuh.ac.in

  • Adamson, A., Lamb, A., & December, R. M. (2014). Automated Essay Grading.
  • Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development
  • Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE
  • Alva-Manchego F, et al. (2019) EASSE: Easier Automatic Sentence Simplification Evaluation. ArXiv abs/1908.04567
  • Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115
  • Basu S, Jacobs C, Vanderwende L. Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 2013; 1:391–402. doi: 10.1162/tacl_a_00236
  • Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48-59.
  • Bejar I, et al. (2013) Length of textual response as a construct-irrelevant response strategy: The case of shell language. Research Report ETS RR-13-07, ETS Research Report Series
  • Berzak Y, et al. (2018) Assessing language proficiency from eye movements in reading. ArXiv abs/1804.07329
  • Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013
  • Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).
  • Burrows S, Gurevych I, Stein B. The eras and trends of automatic short answer grading. Int J Artif Intell Educ. 2015; 25:60–117. doi: 10.1007/s40593-014-0026-8
  • Cader, A. (2020, July). The Potential for the Use of Deep Neural Networks in e-Learning Student Evaluation with New Data Augmentation Method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.
  • Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications (2019)
  • Chen M, Li X (2018) "Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, p 378–383, doi: 10.1109/IALP.2018.8629256
  • Chen Z, Zhou Y (2019) "Research on Automatic Essay Scoring of Composition Based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18, doi: 10.1109/ICAIBD.2019.8837007
  • Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1-6
  • Correnti R, Matsumura LC, Hamilton L, Wang E. Assessing students' skills at writing analytically in response to texts. Elem Sch J. 2013; 114(2):142–177. doi: 10.1086/671936
  • Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.
  • Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications
  • Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102
  • Ding Y, et al. (2020) "Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input." In: Proceedings of the 28th International Conference on Computational Linguistics
  • Dong F, Zhang Y (2016) Automatic features for essay scoring–an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077
  • Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162
  • Dzikovska M, Nielsen R, Brew C, Leacock C, Gi ampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge
  • Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics
  • Educational Testing Service (2008) CriterionSM online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf .
  • Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).
  • Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 2, http://imej.wfu.edu/articles/1999/2/04/ index.asp
  • Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.
  • Higgins D, Heilman M. Managing what we can measure: quantifying the susceptibility of automated scoring systems to gaming behavior. Educ Meas Issues Pract. 2014; 33:36–46. doi: 10.1111/emip.12036
  • Horbach A, Zesch T. The influence of variance in learner answers on automatic content scoring. Front Educ. 2019; 4:28. doi: 10.3389/feduc.2019.00028
  • https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables/attempt
  • Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208.
  • Ke Z, Ng V (2019) “Automated essay scoring: a survey of the state of the art.” IJCAI
  • Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994-4004).
  • Kelley K, Preacher KJ. On effect size. Psychol Methods. 2012; 17(2):137–152. doi: 10.1037/a0028086
  • Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering–a systematic literature review. Inf Softw Technol. 2009; 51(1):7–15. doi: 10.1016/j.infsof.2008.09.009
  • Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing–50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).
  • Knill K, Gales M, Kyriakopoulos K, et al. (4 more authors) (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018.02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)
  • Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523
  • Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).
  • Kumar Y, et al. (2020) “Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing.” ArXiv abs/2007.06796
  • Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-Based Automated Essay Scoring Using Self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. 10.1007/978-3-030-01716-3_32
  • Liang G, On B, Jeong D, Kim H, Choi G. Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry. 2018; 10:682. doi: 10.3390/sym10120682
  • Liua, H., Yeb, Y., & Wu, M. (2018, April). Ensemble Learning on Scoring Student Essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.
  • Liu J, Xu Y, Zhao L (2019) Automated Essay Scoring based on Two-Stage Learning. ArXiv, abs/1901.07744
  • Loukina A, et al. (2015) Feature selection for automated speech scoring.” BEA@NAACL-HLT
  • Loukina A, et al. (2017) “Speech- and Text-driven Features for Automated Scoring of English-Speaking Tasks.” SCNLP@EMNLP 2017
  • Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL
  • Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389-13396
  • Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).
  • Madnani N, et al. (2017b) “Building better open-source tools to support fairness in automated scoring.” EthNLP@EACL
  • Malinin A, et al. (2016) “Off-topic response detection for spontaneous spoken english assessment.” ACL
  • Malinin A, et al. (2017) “Incorporating uncertainty into deep learning for spoken language assessment.” ACL
  • Mathias S, Bhattacharyya P (2018a) Thank “Goodness”! A Way to Measure Style in Student Essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41
  • Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Mikolov T, et al. (2013) “Efficient Estimation of Word Representations in Vector Space.” ICLR
  • Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575
  • Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies p 752–762
  • Muangkammuen P, Fukumoto F (2020) Multi-task Learning for Automated Essay Scoring with Sentiment Analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123
  • Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.
  • Palma D, Atkinson J. Coherence-based automatic essay assessment. IEEE Intell Syst. 2018; 33(5):26–36. doi: 10.1109/MIS.2018.2877278
  • Parekh S, et al. (2020) My Teacher Thinks the World Is Flat! Interpreting Automatic Essay Scoring Mechanism. ArXiv abs/2012.13872
  • Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
  • Persing I, Ng V (2013) Modeling thesis clarity in student essays. In:Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser. 2001; 2001(1):i–44.
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping e-rater: challenging the validity of automated essay scoring. Comput Hum Behav. 2002; 18(2):103–134. doi: 10.1016/S0747-5632(01)00052-8
  • Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106
  • Ramanarayanan V, et al. (2017) “Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions.” INTERSPEECH
  • Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168
  • Riordan B, Flor M, Pugh R (2019) "How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models."In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
  • Rodriguez P, Jafari A, Ormerod CM (2019) Language models and Automated Essay Scoring. ArXiv, abs/1909.09482
  • Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).
  • Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).
  • Rupp A. Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ. 2018; 31:191–214. doi: 10.1080/08957347.2018.1464448
  • Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham
  • Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 conference of the North American Chapter of the association for computational linguistics: Human language technologies p 1049–1054
  • Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English Digital Essay Grader Using Machine Learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.
  • Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for Automated Essay Grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65-70
  • Shermis MD, Mzumara HR, Olson J, Harrington S. On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ. 2001; 26(3):247–259. doi: 10.1080/02602930120052404
  • Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56
  • Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075
  • Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.
  • Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 conference on empirical methods in natural language processing p 1882–1891
  • Tashu TM (2020) "Off-Topic Essay Detection Using C-BGRU Siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225, doi: 10.1109/ICSC.2020.00046
  • Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham
  • Tashu TM, Horváth T (2020) Semantic-Based Feedback Recommendation for Automatic Essay Evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham
  • Uto M, Okano M (2020) Robust Neural Automated Essay Scoring Using Item Response Theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham
  • Wang Z, Liu J, Dong R (2018a) Intelligent Auto-grading System. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE.
  • Wang Y, et al. (2018b) “Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning.” EMNLP
  • Zhu W, Sun Y (2020) Automated essay scoring system using multi-model machine learning. In: Wyld DC et al. (eds) MLNLP, BDIoT, ITCCMA, CSITY, DTMN, AIFZ, SIGPRO
  • Wresch W. The imminence of grading essays by computer–25 years later. Comput Compos. 1993; 10:45–58. doi: 10.1016/S8755-4615(05)80058-1
  • Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.
  • Xia L, Liu J, Zhang Z (2019) Automatic Essay Scoring Model Based on Two-Layer Bi-directional Long-Short Term Memory Network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137
  • Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies p 180–189
  • Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale p 189–192
  • Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE.
  • Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72
  • Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In  Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  (pp. 200-210).
  • Kumar, N., & Dey, L. (2013, November). Automatic Quality Assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.
  • Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125-129).
  • Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim I., Husna D., Dewi Purnamasari P. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm, 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 565–569. 10.1109/ISEMANTIC.2018.8549789.
  • Sharma A., & Jayagopi D. B. (2018). Automated Grading of Handwritten Essays 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp 279–284. 10.1109/ICFHR-2018.2018.00056

Automated Essay Grading

A CS109a final project by Anmol Gupta, Annie Hwang, Paul Lisker, and Kevin Loughlin

Introduction

One of the main responsibilities of teachers and professors in the humanities is grading students' essays [1]. Of course, manual essay grading for a classroom of students is a time-consuming process, and can even become tedious at times. Furthermore, essay grading can be plagued by inconsistencies in determining what a “good” essay really is. Indeed, the grading of essays is often a topic of controversy due to its intrinsic subjectivity: instructors might be more inclined to reward essays with a particular voice or writing style, or even a specific position on the essay prompt.

With these and other issues taken into consideration, the problem of essay grading is clearly a field ripe for a more systematic, unbiased method of rating written work. There has been much research into creating AI agents, ultimately based on statistical models, that can automatically grade essays and therefore reduce or even eliminate the potential for bias. Such a model would take typical features of strong essays into account, analyzing each essay for the existence of these features.

In this project, which stems from an existing Kaggle competition sponsored by the William and Flora Hewlett Foundation [2], we have attempted to provide an efficient, automated solution to essay grading, thereby eliminating grader bias, as well as expediting a tedious and time-consuming job. While superior auto-graders that have resulted from years of extensive research surely exist, we feel that our final project demonstrates our ability to apply the data science process learned in this course to a complex, real-world problem.

Data Exploration

It was unnecessary for us to collect any data, as the essays were provided by the Hewlett Foundation. The data comprises eight separate essay sets, consisting of a training set of 12,976 essays and a validation set of 4,218 essays. Each essay set had a unique topic and scoring system, which certainly complicated fitting a model, given the diverse data. On the bright side, the essay sets were complete—that is, there was no missing data. Furthermore, the data was clearly laid out in both txt and csv formats, which made importing it into a Pandas DataFrame a relatively simple process. In fact, the only complication to arise from collecting the data was a rather sneaky one, only discovered in the later stages when we attempted to spell check the essays. A very small number of essays contained special characters that could not be decoded as Unicode (the most popular method of text encoding for English). To handle these special characters, we used ISO-8859-1 text encoding, which eliminated the decoding errors.

The training and validation sets did have plenty of information that we deemed to be extraneous to the scope of our project. For example, the score was often broken down by scorer, and at times into subcategories. We decided to take the average of the overall provided scores as our notion of “score” for each essay. Ultimately, then, the three crucial pieces of information were the essay, the essay set to which it belonged, and the overall essay score.

With the information we needed in place, we tested a few essay features at a basic level to get a better grasp of the data’s format, as well as to investigate the sorts of features that might prove useful in predicting an essay’s score. In particular, we calculated the word count and vocabulary size (number of unique words) for each essay in the training set, plotting them against the provided essay scores.
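These two features can be computed in a few lines of Python; the tokenization regex here is an illustrative choice, not necessarily the one used in the project.

```python
import re

def word_count(essay):
    # Count runs of word characters; a rough proxy for essay length.
    return len(re.findall(r"\w+", essay))

def vocab_size(essay):
    # Count distinct, case-folded word tokens.
    return len({w.lower() for w in re.findall(r"\w+", essay)})
```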

We hypothesized that word count would certainly be correlated positively with essay score. As students, we note that longer essays often reflect deeper thought and stronger content. On the flip side, there is also value in being succinct by eliminating “filler” content and unnecessary details in papers. As such, we figured that the strength of correlation would weaken as the length of the essays increased.

With regard to vocabulary sizes, we reasoned that individuals who typically read and write more have broader lexicons, and as such a larger vocabulary size would correlate with a higher quality essay, and thus a higher score. After all, the more individuals read and write, the greater their exposure to a larger vocabulary and the more thorough their understanding of how to use it properly in their own writing. As such, a skilled writer will likely use a variety of exciting words in an effort to more effectively keep readers engaged and to best express their ideas in response to the prompt. Therefore, we hypothesized that a larger vocabulary list would correlate with a higher essay score.

Figure 1: Essay set 1 stats

The first notable finding, as evidenced in Figure 1, is that the Word Count vs. Score scatter plots closely mirror the Vocab Size vs Score scatter plots when paired by essay set. This suggests that there might be a relationship between the length of the essay and the different number of words that a writer uses, a discovery that makes sense: a longer essay is bound to have more unique words.

From the Word Count vs Score scatter plots, we note that in general, there seems to be an upward, positive trend between the essay word counts and the scores, with the data expanding in a funnel-like shape. In set 4, there are certainly a couple of essays with a score of 1 that have a smaller word count and vocabulary list than the essays with a score of 0, but that result is likely due to essays with a score of 0 being either incomplete or unrelated to the prompt. As such, these data represent outliers and therefore do not speak to the general, positive relationship. Similar trends and patterns hold true for Vocab Size vs Score.

For set 3, we see that as the scores increase, the range of values for the number of words also increases, meaning word counts tend to increase with score in the funnel-like shape mentioned above. That is, while low scores were almost exclusively reserved for short essays, good grades were assigned to essays anywhere along the word count spectrum. In other words, there are many essays which have comparable word and vocabulary counts with different scores—especially those of smaller size. On the other hand, those essays with a distinctly greater word count and vocabulary size clearly receive higher scores. Similarly, for sets 1, 2, 4, 5, 6, and 7, we noted that, although the average word count increases as the score increases, the range of word counts also becomes wider, resulting in significant overlap of word counts across scores. This reinforces the conclusion that while word count is, in fact, correlated with essay score, the correlation is weaker for higher-scored essays, since there exists a significant overlap of word counts across different scores.

Essay set 8 shows different trends: essays with large word counts and vocabulary sizes range greatly in scores. However, despite the unpredictability highlighted by this wide range, a clear predictor does emerge: essays with a small word count and small vocabulary size are graded with correspondingly low scores. As such, unlike in other datasets, where higher word and vocabulary counts equate to higher scores, we see that high word count essays may still be graded across the full range of scores. On the other hand, low word and vocabulary counts are a strong predictor of a low score. In our investigation of this phenomenon, we noticed a disparity with essay set 8: it was the only prompt that had a maximum essay length, as measured by word count. Ultimately, this factor could have encouraged essays of a particular size, regardless of essay quality.

The Baseline

With a sufficient grasp on the data, we set out to create a baseline essay grading model to which we could compare our final (hopefully more advanced) model. In order to generalize the model across different essay sets (which each contained different scoring systems, as mentioned), we standardized each essay set’s score distribution to have a mean of 0 and a standard deviation of 1.
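The per-set standardization can be sketched with a pandas groupby; the column names `essay_set` and `score` are assumptions for illustration.

```python
import pandas as pd

def standardize_scores(df):
    """Z-score each essay set's scores so sets graded on different
    scales can share one model (mean 0, standard deviation 1 per set)."""
    out = df.copy()
    per_set = out.groupby("essay_set")["score"]
    out["std_score"] = (out["score"] - per_set.transform("mean")) / per_set.transform("std")
    return out
```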

For the baseline model, we began by considering the various essay features in order to choose the ones that we believed would be most effective, ultimately settling on n-grams. In natural language processing, n-grams are a very powerful resource. An n-gram refers to a consecutive sequence of n words in a given text. As an example, the n-grams with n=1 (unigram) of the sentence “I like ice cream” would be “I”, “like”, “ice”, and “cream”. The bigrams (n=2) of this same sentence would thus be “I like”, “like ice”, and “ice cream”. Ultimately, for a sufficiently large text, n-grams may be analyzed for any positive, nonzero integer n.

In analyzing a text, using n-grams of different n values may be important. For example, while the meaning of “bad” is successfully conveyed as a unigram, it would be lost in “not good,” since the two words would be analyzed independently. In this scenario, then, a bigram would be more useful. By a similar argument, a bigram may be effective for “not good,” but less so for “bad,” since it could associate the word with potentially unrelated words. For our baseline, however, we decided to proceed with unigrams in the name of simplicity.
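A minimal n-gram extractor reproducing the examples above (whitespace tokenization is an illustrative simplification):

```python
def ngrams(text, n):
    """All n-grams of a whitespace-tokenized text, joined back into strings."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```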

To quantify the concept of n-grams, we used an information retrieval method called term frequency-inverse document frequency (tf-idf). This measure counts the number of times that an n-gram appears in the essay while weighting the counts based on how frequently the words appear in a general corpus of text. In other words, the tf-idf measure provides a powerful way of standardizing n-gram counts based on the expected number of times that they would have appeared in an essay in the first place. As a result, while the count of a particular n-gram may be large if it is found often in the text, this can be offset when processed by the tf-idf method if the n-gram is one that already appears frequently in essays.

As such, given the benefits of n-grams and their quantification via the tf-idf method, we created a baseline model using unigrams with tf-idf as the predictive features. As our baseline model, we decided to use a simple linear regression model to predict a set of (standardized) scores for our training essays.
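The baseline can be sketched with scikit-learn's TfidfVectorizer and LinearRegression; this is an illustrative reconstruction, not the project's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

def fit_baseline(essays, scores):
    """Baseline model: unigram tf-idf features fed to plain linear regression."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
    features = vectorizer.fit_transform(essays)
    model = LinearRegression().fit(features, scores)
    return vectorizer, model

def predict_scores(vectorizer, model, essays):
    return model.predict(vectorizer.transform(essays))
```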

To evaluate our linear regression model, we opted to eschew the traditional R^2 measure in favor of Spearman’s Rank Correlation Coefficient. While the traditional R^2 measure determines the accuracy of our model—that is, how closely the predicted scores correspond to the true scores—Spearman instead measures the strength and direction of the monotonic association between the essay features and the score. In other words, it determines how well the ranking of the features corresponds with the ranking of the scores. This is a useful measure for grading essays, since we're interested in how well a feature predicts the relative score of an essay (i.e., how an essay compares to another essay) rather than the actual score given to the essay. Ultimately, this is a better quantity to measure than accuracy, since it gives direct insight into the influence of each feature on the score, and because relative accuracy might matter more than absolute accuracy.

Spearman results in a score ranging from -1 to 1, where the closer the score is to an absolute value of 1, the stronger the monotonic association (and where positive values imply a positive monotonic association, versus negative values implying a negative one). The closer the value to 0, the weaker the monotonic association. The general consensus of Spearman correlation strength interpretation is as follows:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”[3]
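Spearman's coefficient is available in SciPy; a small illustration of its behavior on monotonic data:

```python
from scipy.stats import spearmanr

def rank_correlation(predicted, actual):
    """Spearman's rho: +1 for a perfectly increasing monotonic relation,
    -1 for a decreasing one, and near 0 when the rankings are unrelated."""
    rho, p_value = spearmanr(predicted, actual)
    return rho, p_value
```

Note that a nonlinear but monotonic relationship (e.g., exponential growth) still scores a perfect +1, which is exactly why the measure suits ranking essays.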

As seen in Figure 2, the baseline model received scores that ranged from very weak to moderate, all with p-values well below 0.05 (i.e., statistically significant results). However, even with this statistical significance, such weak Spearman correlations are ultimately far too low for this baseline model to provide a trustworthy system. As such, we clearly need a stronger model with a more robust selection of features, as expected!

Advanced Modeling

To improve upon our original model, we first brainstormed what other essay features might better predict an essay’s score. Our early data exploration pointed to word count and vocab size being useful features. Other trivial features that we opted to include were the number of sentences, the percentage of misspellings, and the percentage of each part of speech. We believed these features would be valuable additions to our existing baseline model, as they provide greater insight into the overall structure of each essay, and thus could foreseeably be correlated with score.
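For instance, a crude sentence counter in the same spirit as these trivial features could look like the following (an illustrative sketch, not the project's implementation):

```python
import re

def sentence_count(essay):
    # Split on runs of terminal punctuation and count non-empty pieces.
    return len([s for s in re.split(r"[.!?]+", essay) if s.strip()])
```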

However, we also wanted to include at least one nontrivial feature, operating under the belief that essay grading depends on the actual content of the essay—that is, an aspect of the writing that is not captured by trivial statistics on the essay. After all, the number of words in an essay tells us very little about the essay’s content; rather, it is simply generally correlated with better scores. Based on a recommendation by our Teaching Fellow Yoon, we decided to implement the nontrivial perplexity feature.

Perplexity is a measure of the likelihood of a sequence of words appearing, given a training set of text. Somewhat confusingly, a low perplexity score corresponds to a high likelihood of appearing. As an example, if my training set of three essays were “I like food”, “I like donuts”, and “I like pasta”, the essay “I love pasta” would have a lower perplexity than “you hate cabbage,” since “I love pasta” is more similar to an essay in the training set. This is important because it gives us a quantifiable way to measure an essay’s content relative to other essays in a set. One would logically conclude that good essays on a certain topic would have similar ideas (and thus similar vocabulary). As such, it follows that given a sufficient training set, perplexity may well provide a valid measure of the content of the essays [4].

Using perplexity proved to be much more of a challenge than anticipated. While the NLTK module once provided a method that builds a language model and can subsequently calculate the perplexity of a string based on this model, the method has been removed from NLTK due to several existing bugs [5]. While alternatives to NLTK do exist, they are all either (a) not free, or (b) generally implemented in C++. Though it is possible to port C++ code into Python, this approach seemed time-consuming and beyond the scope of this project. As such, we concluded that the most appealing option was to implement a basic version of the perplexity functionality ourselves.

We therefore constructed a unigram language model and perplexity function. Ideally, we will be able to expand this functionality to n-grams in the future, but due to time constraints, complexity, code efficiency, and the necessity of testing code we write ourselves, we have only managed to implement perplexity on a unigram model for now. The relationship of each feature to the score can be seen in Figure 3.
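A toy version of such a unigram language model and perplexity function follows; the add-one (Laplace) smoothing is an assumption for illustration, since the report does not specify a smoothing scheme.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Unigram counts over a training corpus, with one extra vocabulary
    slot reserved for unseen words (add-one smoothing, assumed here)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return counts, total, vocab

def perplexity(model, text):
    """Lower perplexity = the text is more likely under the model."""
    counts, total, vocab = model
    words = text.lower().split()
    log_prob = sum(math.log((counts.get(w, 0) + 1) / (total + vocab))
                   for w in words)
    return math.exp(-log_prob / len(words))
```

Running this on the toy corpus above reproduces the intuition in the text: “I love pasta” scores a lower perplexity than “you hate cabbage”.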

Figure 3: features vs. score

Unique word count, word count, and sentence count all seem to have a clearly correlated relationship with score, while perplexity demonstrates a possible trend. It is our belief that with a more advanced perplexity library, perhaps one based on n-grams rather than unigrams, this relationship would be strengthened. Indeed, this is a point of discussion later in this report.

With these additional features in place, we moved on to select the actual model to predict our response variable. In the end, we decided to continue using linear regression, as we saw no reason to stray from this approach, and also because we were recommended to use such a model! However, we decided that it was important to include a regularization component in order to limit the influence of any collinear relationships among our thousands of features.

We experimented with both Lasso and Ridge regularization, tuning for optimal alpha with values ranging from 0.05 to 1 in 0.05 increments. As learned in class, Lasso performs both parameter shrinkage and variable selection, automatically removing predictors that are collinear with other predictors. Ridge regression, on the other hand, does not zero out coefficients for the predictors, but does minimize them, limiting their effect on the Spearman correlation.
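The tuning loop can be sketched as follows; the function and its validation-set Spearman scoring are an illustrative reconstruction of the procedure described above, not the project's code.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Lasso, Ridge

def tune_alpha(X_train, y_train, X_val, y_val, Model):
    """Sweep alpha over 0.05..1.0 in 0.05 steps and keep the value with
    the best validation-set Spearman correlation. Model is a class such
    as Lasso or Ridge."""
    best_alpha, best_rho = None, float("-inf")
    for alpha in np.arange(0.05, 1.0001, 0.05):
        model = Model(alpha=alpha).fit(X_train, y_train)
        rho, _ = spearmanr(model.predict(X_val), y_val)
        if rho > best_rho:
            best_alpha, best_rho = float(alpha), rho
    return best_alpha, best_rho
```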

Figure 4: Lasso and Ridge Regularization

Analysis & Interpretation

With this improved model, we see that the Spearman rank correlations have significantly improved from the baseline model. The Spearman rank correlation values now mostly lie in either the “strong” or “very strong” range, a notable improvement from our baseline model producing mostly “very weak” to “moderate” Spearman values.

In Figure 5, we highlight the scores of the models that yielded the highest Spearman correlations for each of the essay sets. Our highest Spearman correlation was achieved on Essay Set 1, at approximately 0.884, whereas our lowest was achieved on Essay Set 8, at approximately 0.619. It is interesting to note the vast difference in performance across essay sets, a fact that may indicate a failure to sufficiently and successfully generalize the model’s accuracy across such a wide variety of essay sets and prompts. We discuss ways to improve this in the following section.

Figures 4 and 5 also show that Lasso regularization generally performed better than the Ridge regularization, exhibiting better Spearman scores in six out of the eight essay sets; in fact, the average score of Lasso was also slightly higher (.793 as compared to .780). While this difference is not large, we would nonetheless opt for the Lasso model. Given that we have thousands of features with the inclusion of tf-idf, it is likely that plenty of these features are not statistically significant in our linear model. Hence, completely eliminating those features—as Lasso does—rather than just shrinking their coefficients, gives us a more interpretable, computationally efficient, and simpler model.

Ultimately, our Lasso linear regression yielded the greatest overall Spearman correlation, and is intuitively justifiable as a model. With proper tuning for regularization, we note that an alpha value of no greater than 0.5 yielded the best results (it should be noted, though, that all nonzero alphas produced comparable Spearman scores, as evidenced in Figure 4). Importantly, p-values remained well below 0.05, confirming the statistical significance of our findings. In layman’s terms, this high Spearman correlation is significant because it indicates that the scores we predicted for the essays are similar in rank to the actual scores that the essays received (that is, if essay A is ranked higher than essay B, our model did well in reaching the same conclusion).

Future Work & Concluding Thoughts

In sum, we were able to successfully implement a Lasso linear regression model using both trivial and nontrivial essay features to vastly improve upon our baseline model. While features like word count appear to have the most correlated relationship with score from a graphical standpoint, we believe that a feature such as perplexity, which actually takes a language model into account, would in the long run be a superior predictor. Namely, we would ideally extend our self-implemented perplexity functionality to the n-gram case, rather than simply using unigrams. With this added capability, we believe our model could achieve even greater Spearman correlation scores.

Other features that we believe could improve the effectiveness of the model include parse trees. Parse trees are ordered trees that represent the syntactic structure of a phrase. This linguistic model is, much like perplexity, based on content rather than the “metadata” that many trivial features provide. As such, it may prove effective in contributing to the model a more in-depth analysis of the context and construction of sentences, pointing to writing styles that may correlate with higher grades. Finally, we would like to take the prompts of the essays into account. This could be a significant feature for our model, because depending on the type of essay being written—e.g. persuasive, narrative, summary—the organization of the essay could vary, which would then affect how we create our models and which features become more important.

There is certainly room for improvement on our model—namely, the features we just mentioned, as well as many more we have not discussed. However, given the time, resources and scope for this project, we were very pleased with our results. None of us had ever performed NLP before, but we now look forward to continuing to apply statistical methodology to such problems in the future!

  • U.S. Bureau of Labor Statistics. "What High School Teachers Do." U.S. Bureau of Labor Statistics, Dec. 2015. Web. 13 Dec. 2016. http://www.bls.gov/ooh/education-training-and-library/high-school-teachers.htm#tab-2 .
  • The Hewlett Foundation. "The Hewlett Foundation: Automated Essay Scoring." Kaggle, Feb. 2012. Web. 13 Dec. 2016. https://www.kaggle.com/c/asap-aes .
  • "Spearman's Correlation." Statstutor, n.d. Web. 14 Dec. 2016. http://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf
  • Berwick, Robert C. "Natural Language Processing Notes for Lectures 2 and 3, Fall 2012." Massachusetts Institute of Technology - Natural Language Processing Course. Massachusetts Institute of Technology, n.d. Web. 13 Dec. 2016. http://web.mit.edu/6.863/www/fall2012/lectures/lecture2&3-notes12.pdf .
  • "NgramModel No Longer Available? - Issue #738 - Nltk/nltk." GitHub. NLTK Open Source Library, Aug. 2014. Web. 13 Dec. 2016. https://github.com/nltk/nltk/issues/738 .


Texas Launches AI Grader for Student Essay Tests But Insists It's Not Like ChatGPT

Kids in Texas are taking state-mandated standardized tests this week to measure their proficiency in reading, writing, science, and social studies. But those tests aren’t necessarily going to be graded by human teachers anymore. In fact, the Texas Education Agency will deploy a new “automated scoring engine” for open-ended questions on the tests. And the state hopes to save millions with the new program.

The technology, which has been dubbed an “auto scoring engine” (ASE) by the Texas Education Agency, uses natural language processing to grade student essays, according to the Texas Tribune. After the initial grading by the AI model, roughly 25% of test responses will be sent back to human graders for review, according to the San Antonio Report news outlet.

Texas expects to save somewhere between $15-20 million with the new AI tool, mostly because fewer human graders will need to be hired through a third-party contracting agency. Previously, about 6,000 graders were needed, but that’s being cut down to about 2,000, according to the Texas Tribune.

A presentation published on the Texas Education Agency’s website appears to show that tests of the new system revealed humans and the automated system gave comparable scores to most kids. But a lot of questions remain about how the tech works exactly and what company may have helped the state develop the software. Two education companies, Cambium and Pearson, are mentioned as contractors at the Texas Education Agency’s site but the agency didn’t respond to questions emailed Tuesday.

The State of Texas Assessments of Academic Readiness (STAAR) was first introduced in 2011 but redesigned in 2023 to include more open-ended essay-style questions. Previously, the test contained many more questions in the multiple-choice format, which, of course, was also graded by computerized tools. The big difference is that scoring a bubble sheet is far simpler than scoring a written response, which computers have more difficulty understanding.

In a sign of potentially just how toxic AI tools have become in mainstream tech discourse, the Texas Education Agency has apparently been quick to shoot down any comparisons to generative AI chatbots like ChatGPT, according to the Texas Tribune. And the PowerPoint presentation on the Texas Education Agency’s site appears to confirm that unease with comparisons to anything like ChatGPT.

“This kind of technology is different from AI in that AI is a computer using progressive learning algorithms to adapt, allowing the data to do the programming and essentially teaching itself,” the presentation explains. “Instead, the automated scoring engine is a closed database with student response data accessible only by TEA and, with strict contractual privacy control, its assessment contractors, Cambium and Pearson.”

Any family who’s upset with their child’s grade can request that a human take another look at the test, according to the San Antonio Report. But it’ll set you back $50.



Automated Scoring of Writing

  • Open Access
  • First Online: 15 September 2023


  • Stephanie Link, ORCID: orcid.org/0000-0002-5586-1495
  • Svetlana Koltovskaia


For decades, automated essay scoring (AES) has operated behind the scenes of major standardized writing assessments to provide summative scores of students’ writing proficiency (Dikli in J Technol Learn Assess 5(1), 2006). Today, AES systems are increasingly used in low-stakes assessment contexts and as a component of instructional tools in writing classrooms. Despite substantial debate regarding their use, including concerns about writing construct representation (Condon in Assess Writ 18:100–108, 2013; Deane in Assess Writ 18:7–24, 2013), AES has attracted the attention of school administrators, educators, testing companies, and researchers and is now commonly used in an attempt to reduce human efforts and improve consistency issues in assessing writing (Ramesh and Sanampudi in Artif Intell Rev 55:2495–2527, 2021). This chapter introduces the affordances and constraints of AES for writing assessment, surveys research on AES effectiveness in classroom practice, and emphasizes implications for writing theory and practice.


Keywords: Automated essay scoring, Summative assessment

Automated essay scoring (AES) is used internationally to rapidly assess writing and provide summative holistic scores and score descriptors for formal and informal assessments. The ease of using AES for response to writing is especially attractive for large-scale essay evaluation, providing also a low-cost supplement to human scoring and feedback provision. Additionally, intended benefits of AES include the elimination of human bias, such as rater fatigue, expertise, severity/leniency, inconsistency, and Halo effect. While AES developers also commonly suggest that their engines perform as reliably as human scorers (e.g., Burstein & Chodorow, 2010 ; Riordan et al., 2017 ; Rudner et al., 2006 ), AES is not free of critique. Automated scoring is frequently under scrutiny for use with university-level composition students in the United States (Condon, 2013 ) and second language writers (Crusan, 2010 ), with some writing practitioners discouraging its replacement of adequate literacy education because of its inability to evaluate meaning from a humanistic, socially-situated perspective (Deane, 2013 ; NCTE, 2013 ). AES also suffers from biases, such as imperfections in the quality and representation of training data to develop the systems and inform feedback generation. These biases question the fairness of AES (Loukina et al., 2019 ), especially if scores are modeled based on data that does not adequately represent a user population—a particular concern for use of AES with minoritized populations.

Despite reservations, the utility of AES in writing practices has increased significantly in recent years (Ramesh & Sanampudi, 2021 ), partially due to its integration into classroom-based tools (see Cotos, “ Automated Feedback on Writing ” for a review of automated writing evaluation). Thus, the affordances of AES for language testing are now readily available to writing practitioners and researchers, and the time is ripe for better understanding its potential impact on the pedagogical approaches to writing studies by first better understanding the history that drives AES development.

Dating back to the 1960s, AES started with the advent of Project Essay Grade (Page, 1966 ). Since then, automated scoring has advanced into leading technologies, including e-rater by the Educational Testing Service (ETS) (Attali & Burstein, 2006 ), Intelligent Essay Assessor (IEA) by Knowledge Analysis Technologies (Landauer et al., 2003 ), Intellimetric by Vantage Learning (Elliot, 2003 ), and a large number of prospective newcomers (e.g., Nguyen & Dery, 2016 ; Riordan et al., 2017 ). These AES engines are used for tests like the Test of English as a Foreign Language (TOEFL iBT), Graduate Management Admissions Test (GMAT), and the Pearson Test of English (PTE). In such tests, AES researchers not only found the scores reliable, but some argued that they also allowed for reproducibility, tractability, consistency, objectivity, item specification, granularity, and efficiency (William et al., 1999 ), characteristics that human raters can lack (Williamson et al., 2012 ).

The immediate AES response to writing is without much question a salient feature of automated scoring for testing contexts. However, research on classroom-based implementation has suggested that instructors can utilize the AES feedback to flag students’ writing that requires teachers’ special attention (Li et al., 2014 ), highlighting its potential for constructing individual development plans or conducting analysis of students’ writing needs. AES also provides constant, individualized feedback to lighten instructors’ feedback load (Kellogg et al., 2010 ), enhance student autonomy (Wang et al., 2013 ), and stimulate editing and revision (Li et al., 2014 ).

2 Core Idea of the Technology

Automated essay scoring involves the automatic assessment of a student’s written work, usually in response to a writing prompt. This assessment generally includes (1) a holistic score of the student’s performance, knowledge, and/or skill and (2) a score descriptor on how the student can improve the text. For example, e-rater by ETS (2013) scores essays on a scale from 0 to 6. A score of 6 may include the following feedback:

Score of 6: Excellent

Looks at the topic from a number of angles and responds to all aspects.

Responds thoughtfully and insightfully to the issues in the topic.

Develops with a superior structure and apt reasons or examples.

Uses sentence styles and language that have impact and energy.

Demonstrates that you know the mechanics of correct sentence structure.

AES engine developers over the years have undertaken a core goal of making the assessment of writing accurate, unbiased, and fair (Madnani & Cahill, 2018 ). The differences in score generation, however, are stark given the variation in philosophical foundations, intended purposes, extraction of features for scoring writing, and criteria used to test the systems (Yang et al., 2002 ). To this end, it is important to understand the prescribed use of automated systems so that they are not implemented inappropriately. For instance, if a system is meant to measure students’ writing proficiency, the system should not be used to assess students’ aptitude. Thus, scoring models for developing AES engines are valuable and effective in distinct ways and for their specific purposes.

Because each engine may be designed to assess different levels, genres, and/or skills of writing, developers utilize different natural language processing (NLP) techniques for establishing construct validity, or the extent to which an AES scoring engine measures what it intends to measure—a common concern for AES critics (Condon, 2013 ; Perelman, 2014 , 2020 ). NLP helps computers understand human input (text and speech) by starting with human and/or computer analysis of textual features so that a computer can process the textual input and offer reliable output (e.g., a holistic score and score descriptor) on new text. These features may include statistical features (e.g., essay length, word co-occurrences also known as n-grams), style-based features (e.g., sentence structure, grammar, part-of-speech), and content-based features (e.g., cohesion, semantics, prompt relevance) (see Ramesh & Sanampudi, 2021 , for an overview of features). Construct validity should thus be interpreted in relation to feature extraction of a given AES system to adequately appreciate (or challenge) the capabilities that system offers writing studies.

In addition to a focus on a variety of textual features, AES developers have utilized varied machine learning (ML) techniques to establish construct validity and efficient score modeling. Machine learning is a category of artificial intelligence (AI) that helps computers recognize patterns in data and continuously learn from the data to make accurate holistic score predictions and adjustments without further programming (IBM, 2020). Early AES research utilized standard multiple regression analysis to predict holistic scores based on a set of rater-defined textual features. This approach was utilized in the early 1960s for developing Project Essay Grade by Page (1966), but it has been criticized for its bias in favor of longer texts (Hearst, 2000) and its inattention to content and domain knowledge (Ramesh & Sanampudi, 2021).
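
The regression-based approach can be sketched with ordinary least squares. The feature values and scores below are fabricated for illustration (the features Page actually used differed); the point is only that a linear model maps rater-defined features to a holistic score:

```python
import numpy as np

# Fabricated training data: each row describes one essay by two
# rater-defined features (word count, error count); y holds the
# human holistic scores on a 0-5 scale.
X = np.array([[120, 5], [300, 2], [450, 1], [200, 4], [380, 0]], dtype=float)
y = np.array([2.0, 4.0, 5.0, 3.0, 5.0])

# Fit ordinary least squares with an intercept column.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(features):
    """Predicted holistic score for a new essay's feature vector."""
    return float(np.array([1.0, *features]) @ coef)

predicted = predict([350, 1])
```

This mirrors the worked example in the introduction: grade a sample of essays by hand, encode each as a feature vector, and fit a regression that scores the rest.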

In subsequent years, classification models, such as the bag-of-words (BOW) approach, were common (e.g., Chen et al., 2010; Leacock & Chodorow, 2003). BOW models extract features using NLP by counting the occurrences and co-occurrences of words within and across texts. Texts with multiple shared word strings are classified into similar holistic score categories (e.g., low, medium, high) (Chen et al., 2010; Zhang et al., 2010). E-rater by ETS is a good example of this approach. The aforementioned approaches are human-labor intensive. Latent semantic analysis (LSA) is advantageous in this regard and is also strong in evaluating semantics. In LSA, the semantic representation of a text is compared to the semantic representations of other similarly scored responses. This analysis is done by training the computer on specific corpora that mimic a given writing prompt. Landauer et al. (2003) used LSA in the Intelligent Essay Assessor.
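
A minimal LSA sketch: build a term-document matrix, project it into a low-rank "semantic" space with a truncated SVD, and score a new response by its most similar previously scored neighbour. The tiny corpus, scores, and rank are invented for illustration and bear no relation to any real system's training data:

```python
import numpy as np

# Fabricated, already-scored responses to a prompt, plus one new essay.
scored = [
    (5, "napoleon was defeated at waterloo and exiled to saint helena"),
    (2, "napoleon was a french man who fought in many wars"),
]
new_essay = "after waterloo napoleon was exiled to saint helena"

docs = [text for _, text in scored] + [new_essay]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document matrix of raw counts (rows = terms, columns = documents).
M = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD gives each document a low-dimensional semantic vector.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The new essay inherits the score of its most similar scored neighbour.
sims = [cos(doc_vecs[-1], doc_vecs[i]) for i in range(len(scored))]
predicted = scored[int(np.argmax(sims))][0]
```

In practice the comparison space is built from thousands of prompt-specific responses, but the mechanics are the same: similarity in the reduced space, not exact word overlap, drives the score.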

Advances in NLP and progress in ML have motivated AES researchers to move away from statistical regression-based modeling and classification approaches to advanced models involving neural network approaches (Dong et al., 2017; Kumar & Boulanger, 2020; Riordan et al., 2017). To develop these AES models, data undergo a process of supervised learning, in which the computer is provided with labeled data that enables it to produce a score as a human would. The supervised learning process often starts with a training set—a large corpus of representative, unbiased writing that is typically human- or auto-coded for specific linguistic features, with each text receiving a holistic score. Models are then generated to teach a computer to identify and extract these features and provide a holistic score that correlates with the human rating. The models are evaluated on a testing set that the computer has never seen previously. Accuracy of the algorithms is then evaluated by comparing testing set scores with human scores to determine human–computer consistency and reliability. Common evaluation metrics are quadratic weighted kappa, mean absolute error, and the Pearson correlation coefficient.
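
Of those metrics, quadratic weighted kappa can be computed directly from paired human and machine scores. Below is a generic, self-contained implementation with made-up score pairs, not any vendor's evaluation code:

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, min_score=0, max_score=5):
    """Agreement between two integer score sequences, weighted so that
    large disagreements are penalized quadratically; 1.0 = perfect."""
    n = max_score - min_score + 1
    observed = np.zeros((n, n))  # joint histogram of (human, machine) scores
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1
    weights = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)]
                        for i in range(n)])
    # Expected matrix if the two raters' score distributions were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1 - (weights * observed).sum() / (weights * expected).sum()

human = [4, 3, 5, 2, 4, 3]
machine = [4, 3, 4, 2, 5, 3]
kappa = quadratic_weighted_kappa(human, machine)
```

A kappa of 1.0 indicates perfect human–computer agreement, 0 indicates chance-level agreement, and off-by-one disagreements are penalized far less than off-by-several ones, which suits ordinal holistic scales.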

Once accuracy results meet an industry standard (Powers et al., 2015), which varies across disciplines (Weigle, 2013), the algorithms are made public through user-friendly interfaces for testing contexts (i.e., formal assessments of students' performance or proficiency that provide summative feedback) and direct classroom use (i.e., informal assessments to improve students' learning). For the classroom, teachers should be active in evaluating the feedback to determine whether it is reasonably accurate in assessing a learning goal, does not lead students away from the goal, and encourages students to engage in different ways with their text and/or the course content. Effective evaluation of AES should start with an awareness of AES affordances that can impact writing practice and then continue with the training of students in the utility of these affordances.

3 Functional Specifications

The overall functionality of AES for classroom use is to provide summative assessment of writing quality. AES accomplishes this through two key affordances: a holistic score and score descriptor.

Holistic score: The summative score provides an overall, generic assessment of writing quality. For example, Grammarly provides a holistic or "performance" score out of 100%. The score represents the quality of writing as determined by features such as word count, readability statistics, and vocabulary usage. A readability score in the 60–70 range indicates text that can be understood by a reader with a 9th-grade education; Grammarly suggests reaching at least this range so that the text is readable by 80% of English speakers.
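
Grammarly's exact scoring formula is proprietary, but readability statistics of the kind it reports are typically variants of standard indices such as Flesch reading ease, on which 60–70 corresponds roughly to a 9th-grade reading level. The syllable counter below is a crude approximation used only for illustration:

```python
import re

def flesch_reading_ease(text):
    """Flesch reading ease: higher is easier to read; 60-70 is roughly
    the level a 9th-grade reader can understand."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]

    def syllables(word):
        # Rough heuristic: count runs of vowels.
        return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * sum(syllables(w) for w in words) / len(words))

score = flesch_reading_ease("The cat sat on the mat. It was happy.")
```

Short sentences of short words score high (easy); long, polysyllabic academic prose scores low, which is why such indices are only one ingredient of a holistic score.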

Score descriptor: The holistic score is typically accompanied by a descriptor that indicates what the score represents. This characterization of the score meaning can be used to interpret the feedback, evaluate the feedback, and make decisions regarding editing and revising.

That is, these key affordances can be utilized to complete several main activities.

Interpreting feedback : Once students receive the holistic score along with the descriptor, they should interpret the score. Information provided for adequate score interpretation varies across AES systems, so students may need help in interpreting the meaning of this feedback.

Evaluating feedback : After interpreting the score and the descriptor, students need to think critically about how the feedback applies to their writing. That is, students need to determine whether the computer feedback is an adequate representation of their writing weaknesses. Evaluating feedback thus entails noticing the gap or problem found in one’s own writing and becoming consciously aware of how the feedback might be used to increase the quality of writing through self-editing (Ferris, 2011 ).

Making a decision about action : Once students evaluate their writing based on a given score and descriptor, they then need to decide whether to address the issues highlighted in the descriptor or seek additional feedback. Making and executing a revision plan can ensure that the student is being critical towards the feedback rather than accepting it outright.

Revising/editing : The student then revises the paper and resubmits it to the system to see if the score improves—an indicator of higher quality writing. If needed, the student can repeat the above actions or move on to editing of surface-level writing concerns.

4 Research on AES

AES research can be categorized along two lines: system-centric research that evaluates the system itself and user-centric research that evaluates the use and impact of a system on learning. From a system-centric perspective, various studies have been conducted to validate AES-system-generated scores for the testing context. The majority have focused on reliability, or the extent to which results can be considered consistent or stable (Brown, 2005). They often evaluate reliability based on agreement between human and computer scoring (e.g., Burstein & Chodorow, 1999; Elliot, 2003; Streeter et al., 2011). (See Table 1 for a summary of reliability statistics from three major AES developers.)

The process of establishing validity should not start and stop with inter-coder reliability, however; automated scoring presents some distinctive validity challenges, such as “the potential to under- or misrepresent the construct of interest, vulnerability to cheating, impact on examinee behavior, and score users’ interpretation and use of scores” (Williamson et al., 2012, p. 3). Thus, some researchers have also demonstrated reliability by using alternative measures, such as the association with independent measures and the generalizability of scores (Attali et al., 2010). Others have gone a step further and suggested a unified approach to AES validation (Weigle, 2013; Williamson et al., 2012). In general, results reveal promising developments in AES, with modest correlations between AES and external criteria, such as independent proficiency assessments (Attali et al., 2010; Powers et al., 2015), suggesting that automated scores can relate in a similar manner to select assessment criteria and that both have the potential to reflect similar constructs, although results across AES systems can vary, and not all data are readily available to the public.

While much research has focused on the reliability of AES, little is known about the quality of holistic scores in testing or classroom contexts, or about teachers’ and students’ use and perceptions of automatically generated scores. In a testing context, James (2006) compared the IntelliMetric scores of the ACCUPLACER OnLine WritePlacer Plus test to the scores of “untrained” faculty raters. Results revealed a relatively high level of correspondence between the two. In a similar study with a group of developmental writing students at a two-year college in South Texas, Wang and Brown (2007) found that ACCUPLACER’s overall holistic mean score differed significantly between IntelliMetric and human raters, indicating that IntelliMetric tends to assign higher scores than human raters do. Li et al. (2014) investigated the correlation of Criterion’s numeric scores with English as a second language instructors’ numeric grades and analytic ratings for classroom-based assessment. The results showed low to moderate positive correlations between Criterion’s scores and instructors’ scores and analytic ratings. Taken together, these studies suggest limited continuity of findings on AES reliability across tools.

Results of multiple studies demonstrate varied uses for holistic scores and varied teacher and student perceptions of the scores. For example, Li et al. (2014) found that Criterion’s holistic scores in the English as a second language classroom were used in three ways. First, instructors used the scores as a forewarning; that is, the scores alerted instructors to problematic writing. Second, the scores were used as a pre-submission benchmark; that is, students were required to obtain a certain score before submitting a final draft to their teacher. Finally, Criterion’s scores were utilized as an assessment tool—scores were part of course grading. Similar findings were reported in Chen and Cheng’s (2008) study, which focused on EFL Taiwanese teachers’ and students’ use and perception of My Access! While one teacher used My Access! as a pre-submission benchmark, another used it for both formative and summative assessment, heavily relying on the scores to assess writing performance. The third teacher did not make My Access! a requirement and asked the students to use it only if they needed to.

In terms of teachers’ perceptions, holistic scores seem to motivate student revision (Li et al., 2014; Scharber et al., 2008), although a few teachers in Maeng (2010) commented that, while the score caused some stress, it was still helpful for facilitating the feedback process (i.e., for providing sample writing and revising). Teachers also tend to have mixed confidence in holistic scores (Chen & Cheng, 2008; Li et al., 2014). For example, in Li et al.’s (2014) study, English as a second language instructors had high trust in Criterion’s low holistic scores because the essays Criterion scored low were, in fact, poor essays. However, instructors possessed low levels of trust when Criterion assigned high scores to writing that the instructors judged to be of lower quality.

Students also tend to have low trust in holistic scores (Chen & Cheng, 2008; Scharber et al., 2008). For example, Chen and Cheng (2008) found that EFL Taiwanese students’ low level of trust in holistic scores was influenced by teachers’ low level of trust in the scores, as well as by discrepancies students noticed between teachers’ scores and the holistic scores of My Access! Similar findings were reported in Scharber et al.’s (2008) study, which focused on the Educational Theory into Practice Software (ETIPS) automated scorer implemented in a post-baccalaureate program at a large public Midwestern US university. The students in their study experienced negative emotions due to discrepancies between teachers’ and ETIPS’ holistic scores; ETIPS scores were one point lower than teachers’ scores. Additionally, the students found holistic scores with the short descriptor insufficient in guiding them as to how to actually improve their essays.

5 Implications of This Technology for Writing Theory and Practice

The rapid advancement of NLP and ML approaches to automated scoring lends itself well to theoretical contributions that help to (re-)define traditional notions of how learning takes place and the phenomena that underscore language development. Social and cognitive theories of writing can be expanded through the integration of AES technology, which offers new, socially situated learning opportunities in online environments that can shape how students respond to feedback. These digitally rich learning opportunities can thus significantly impact the writing process, offering a new mode of feedback that can be meaningful, constant, timely, and manageable while addressing individual learner needs. In traditional pen-and-paper settings, such feedback qualities are known to contribute significantly to writing accuracy (Hartshorn et al., 2010), so the addition of rapid technology has the potential to add new knowledge to writing development research.

AES research can also contribute to practice. Due to its instantaneous nature, AES holistic scores could be used for placement purposes (e.g., by using ACCUPLACER) at schools, colleges, and universities. However, relying on the AES holistic score alone may not be adequate. Therefore, just as in large-scale tests, it is important that students’ writing be double-rated to enhance reliability, with a third rater used if there is a discrepancy between the AES holistic score and a human rater’s score. Similarly, AES holistic scores could be used for diagnostic assessment, which is given prior to or at the start of a semester or course to gather information about students’ language proficiency as well as their strengths and weaknesses in writing. Finally, AES scoring could be used for summative classroom assessment. For example, teachers could use AES scores as a pre-submission benchmark and require students to revise their essays until they reach a predetermined score, or teachers could use the AES score for partial (rather than sole) assessment of goal attainment (Li et al., 2014; Weigle, 2013). Overall, to avoid pitfalls such as students focusing too intensively on obtaining high scores without actually improving their writing skills, teachers and students need to be trained, or seek training, on the merits and demerits of a selected AES scoring system.
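
The double-rating procedure described above can be expressed as a simple adjudication rule. The function name, tolerance, and averaging choice are all hypothetical; actual placement programs define their own discrepancy thresholds and resolution policies:

```python
def resolve_placement_score(aes_score, human_score, tolerance=1,
                            third_rater=None):
    """Hypothetical adjudication: average the AES and human scores when
    they agree within `tolerance`; otherwise defer to a third human rater."""
    if abs(aes_score - human_score) > tolerance and third_rater is not None:
        return third_rater()
    return (aes_score + human_score) / 2

agreed = resolve_placement_score(4, 5)  # within tolerance, so average
adjudicated = resolve_placement_score(2, 5, third_rater=lambda: 4)
```

Encoding the rule explicitly makes the reliability policy auditable: every placement decision either reflects two agreeing raters or a documented third rating.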

6 Concluding Remarks

While traditional approaches to written corrective feedback are still leading writing studies research, the ever-changing digitalization of the writing process shines light on new opportunities for enhancing the nature of feedback provision. The evolution of AI will undoubtedly expand the affordances of AES so that writing in digital spaces can be supplemented by computer-based feedback that is increasingly accurate and reliable. For now, these technologies are only foregrounding what can come from technological advancements, and in the meantime, it is the task of researchers and practitioners to cast a critical eye while also remaining open to the potential for AES technologies to promote autonomous, lifelong learning and writing development.

7 Tool List

List of well-known Automated Essay Scoring (AES) Tools

References

Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3). http://www.jtla.org

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning, and Assessment, 4 (3), 1–30.

Brown, J. D. (2005). Testing in language programs. A comprehensive guide to English language assessment . McGraw Hill.

Burstein, J., & Chodorow, M. (1999). Automated essay scoring for nonnative English speakers . Proceedings of the ACL99 Workshop on Computer-Mediated Language Assessment and Evaluation of Natural Language Processing. http://www.ets.org/Media/Research/pdf/erater_acl99rev.pdf

Burstein, J., & Chodorow, M. (2010). Progress and new directions in technology for automated essay evaluation. In R. Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 487–497). Oxford University Press.

Chen, C., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12 (2), 94–112.

Chen, Y. Y., Liu, C. L., Chang, T. H., & Lee, C. H. (2010). An unsupervised automated essay scoring system. IEEE Intelligent Systems, 25 (5), 61–67. https://doi.org/10.1109/MIS.2010.3

Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing, 18 , 100–108. https://doi.org/10.1016/j.asw.2012.11.001

Crusan, D. (2010). Assessment in the second language writing classroom . University of Michigan Press.

Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18 , 7–24.

Dexter, S. (2007). Educational theory into practice software. In D. Gibson, C. Aldrich, & M. Prensky (Eds.), Games and simulations in online learning: Research and development frameworks (pp. 223–238). IGI Global. https://doi.org/10.4018/978-1-59904-304-3.ch011

Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5 (1). https://ejournals.bc.edu/index.php/jtla/article/view/1640

Dong, F., Zhang, Y., & Yang, J. (2017). Attention-based recurrent convolutional neural network for automatic essay scoring . Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). https://aclanthology.org/K17-1017.pdf

Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automatic essay scoring: A cross-disciplinary perspective (pp. 71–86). Lawrence Erlbaum Associates.

ETS. (2013). Criterion scoring guide . Retrieved September 27, 2013, from http://www.ets.org/Media/Products/Criterion/topics/co-1s.htm

Ferris, D. R. (2011). Treatment of errors in second language student writing (2nd ed.). The University of Michigan Press.

Hartshorn, K. J., Evans, N. W., Merrill, P. F., Sudweeks, R. R., Strong-Krause, D., & Anderson, N. J. (2010). Effects of dynamic corrective feedback on ESL writing accuracy. TESOL Quarterly, 44 , 84–109.

Hearst, M. (2000). The debate on automated essay grading. IEEE Intelligent Systems and their Applications, 15 (5), 22–37. https://doi.org/10.1109/5254.889104

IBM. (2020). Machine learning . IBM Cloud Education. https://www.ibm.com/cloud/learn/machine-learning

James, C. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11 (3), 167–178.

Kellogg, R., Whiteford, A., & Quinlan, T. (2010). Does automated feedback help students learn to write? Journal of Educational Computing Research, 42 , 173–196.

Kumar, V., & Boulanger, D. (2020). Explainable automated essay scoring: Deep learning really has pedagogical value. Frontiers in Education, 5. https://doi.org/10.3389/feduc.2020.572367

Landauer, T. K., Laham, D., & Foltz, P. (2003). Automatic essay assessment. Assessment in Education, 10 (3), 295–308.

Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37 , 389–405.

Li, Z., Link, S., Ma, H., Yang, H., & Hegelheimer, V. (2014). The role of automated writing evaluation holistic scores in the ESL classroom. System, 44 , 66–78. https://doi.org/10.1016/j.system.2014.02.007

Loukina, A., et al. (2019). The many dimensions of algorithmic fairness in educational applications . BEA@ACL.

Madnani, N., & Cahill, A. (2018). Automated scoring: Beyond natural language processing . COLING.

Maeng, U. (2010). The effect and teachers’ perception of using an automated essay scoring system in L2 writing. English Language and Linguistics, 16 (1), 247–275.

NCTE. (2013, April 20). NCTE position statement on machine scoring . National Council of Teachers of English. https://ncte.org/statement/machine_scoring/

Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading (pp. 1–11). CS224d Stanford Reports.

Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48 , 238–243.

Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21 , 104–111.

Perelman, L. (2020). The BABEL generator and E-rater: 21st century writing constructs and automated essay scoring (AES).  Journal of Writing Assessment, 13 (1).

Powers, D. E., Escoffery, D. S., & Duchnowski, M. P. (2015). Validating automated essay scoring: A (modest) refinement of the “gold standard.” Applied Measurement in Education, 28 (2), 130–142. https://doi.org/10.1080/08957347.2014.1002920

Ramesh, D., & Sanampudi, S. K. (2021). An automated essay scoring systems: A systematic literature review. The Artificial Intelligence Review, 55 (3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2

Riordan, B., Horbach, A., Cahill, A., Zesch, T., & Lee, C. M. (2017). Investigating neural architectures for short answer scoring . Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. https://aclanthology.org/W17-5017.pdf

Rudner, L., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetricTM essay scoring system. Journal of Technology, Learning, and Assessment, 4 (4). http://escholarship.bc.edu/ojs/index.php/jtla/article/view/1651/1493

Scharber, C., Dexter, S., & Riedel, E. (2008). Students’ experiences with an automated essay scorer. Journal of Technology, Learning and Assessment, 7 (1), 1–45. https://ejournals.bc.edu/index.php/jtla/article/view/1628

Streeter, L., Bernstein, J., Foltz, P., & DeLand, D. (2011). Pearson’s automated scoring of writing, speaking, and mathematics . White Paper. http://images.pearsonassessments.com/images/tmrs/PearsonsAutomatedScoringofWritingSpeakingandMathematics.pdf

Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6 (2). http://www.jtla.org

Wang, Y., Shang, H., & Briody, P. (2013). Exploring the impact of using automated writing evaluation in English as a foreign language university students’ writing. Computer Assisted Language Learning, 26 (3), 1–24.

Weigle, S. C. (2013). English as a second language writing and automated essay evaluation. In M. D. Shermis & J. C. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 36–54). Routledge.

Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). ‘Mental model’ comparison of automated and human scoring. Journal of Educational Measurement, 35(2), 158–184.

Williamson, D., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31 (1), 2–13.

Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15 (4), 391–412. https://doi.org/10.1207/S15324818AME1504_04

Zhang, Y., Jin, R., & Zhou, Z. H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1 , 43–52.

Author information

Authors and affiliations.

Oklahoma State University, 205 Morrill Hall, Stillwater, OK, 74078, USA

Stephanie Link

Department of Languages and Literature, Northeastern State University, Tahlequah, OK, 74464, USA

Svetlana Koltovskaia

Corresponding author

Correspondence to Stephanie Link .

Editor information

Editors and affiliations.

School of Applied Linguistics, Zurich University of Applied Sciences, Winterthur, Switzerland

School of Management and Law, Center for Innovative Teaching and Learning, Zurich University of Applied Sciences, Winterthur, Switzerland

Christian Rapp

North Carolina State University, Raleigh, NC, USA

Chris M. Anson

TECFA, Faculty of Psychology and Educational Sciences, University of Geneva, Geneva, Switzerland

Kalliopi Benetos

English Department, Iowa State University, Ames, IA, USA

Elena Cotos

School of Education, Trinity College Dublin, Dublin, Ireland

TD School, University of Technology Sydney, Sydney, NSW, Australia

Antonette Shibani

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2023 The Author(s)

About this chapter

Link, S., Koltovskaia, S. (2023). Automated Scoring of Writing. In: Kruse, O., et al. Digital Writing Technologies in Higher Education . Springer, Cham. https://doi.org/10.1007/978-3-031-36033-6_21

DOI: https://doi.org/10.1007/978-3-031-36033-6_21

Published: 15 September 2023

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-36032-9

Online ISBN: 978-3-031-36033-6

eBook Packages: Education (R0)


IMAGES

  1. 5 Best Automated AI Essay Grader Software in 2024

    automated essay grading project

  2. GitHub

    automated essay grading project

  3. (PDF) A hybrid scheme for Automated Essay Grading based on LVQ and NLP

    automated essay grading project

  4. Automated Essay Scoring Explained

    automated essay grading project

  5. LSTM based Automated Essay Scoring System Python Project using HTML

    automated essay grading project

  6. GitHub

    automated essay grading project

VIDEO

  1. Essay Grading Demo

  2. Essay Grading Tip ✏️

  3. Automated Essay Grading

  4. Brisk

  5. TSSA 2020

  6. AI and the Future of Education: Mind-blowing Benefits vs. Alarming Risks

COMMENTS

  1. PDF Automated Essay Grading Using Machine Learning

    Automated grading, if proven to match or exceed the reliability of human graders, will signi cantly reduce costs. The purpose of this project is to implement and train machine learning algorithms to automatically assess and grade essay responses. These grades from the automatic grading system should match the human grades consistently.

  2. What is Automated Essay Scoring, Marking, Grading?

    Nathan Thompson, PhDApril 25, 2023. Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it's been around far longer than "machine learning" and "artificial intelligence" have been buzzwords in the general public!

  3. About the e-rater Scoring Engine

    The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer's grammar, mechanics, word use and complexity, style, organization and more. ...

  4. PDF Automated Essay Scoring Using Machine Learning

    The automated essay scoring model is a topic of in-terest in both linguistics and Machine Learning. The model systematically classi es the quality of writing and can be applied in both academia and large indus-trial organizations to improve operational e ciency. 1.1. Motivation.

  5. Boulder Labs

    More About this project. We developed a system to automate and streamline much of the work involved in building models to perform automated essay grading. The system includes an API for data collection and validation, tools to automate the modeling process and facilitate research, an interface for reporting on modeling performance, and support ...

  6. PDF Neural Networks for Automated Essay Grading

    tional multiple-choice assessments is the large cost and effort required for scoring. This project is an attempt to use different neural network architectures to build an accurate automated essay grading system to solve this problem. 1 Introduction Attempts to build an automated essay grading system dated back to 1966 when Ellis B. Page proved

  7. Automated essay scoring

    Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6.
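
    Since the allowed grades form a small discrete set, a model's continuous output has to be snapped onto that scale. A minimal Python sketch, assuming the 1-to-6 scale mentioned above (the function name and nearest-grade rule are illustrative, not taken from any particular AES system):

```python
def to_grade(raw_score, grades=range(1, 7)):
    """Snap a continuous model output to the nearest allowed discrete grade."""
    return min(grades, key=lambda g: abs(g - raw_score))
```

    A regression output of 4.3 becomes grade 4, and out-of-range outputs clip to the ends of the scale (0.2 becomes 1, 7.9 becomes 6).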

  8. (PDF) A Comprehensive Review of Automated Essay Scoring ...

    Automated Essay Scoring (AES) is a service or software that can predictively grade essays based on a pre-trained computational model. ... Project essay grade: PEG. In M. Shermis, & J. Burstein (Eds ...

  9. [2102.13136] Automated essay scoring using efficient transformer-based

    Automated Essay Scoring (AES) is a cross-disciplinary effort involving Education, Linguistics, and Natural Language Processing (NLP). The efficacy of an NLP model in AES tests its ability to evaluate long-term dependencies and extrapolate meaning even when text is poorly written. Large pretrained transformer-based language models have dominated the current state-of-the-art in many NLP tasks ...

  10. PDF Automated Essay Scoring Using Machine Learning

    The automated essay scoring model is a topic of interest in both linguistics and Machine Learning. The model systematically classifies our varying degrees of speech and can be applied in both academia and large industrial organizations to improve ... CS224N Final Project, Shihui Song, Jason Zhao.

  11. Automated Essay Scoring

    Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.
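
    A common way to turn per-dimension scores into one essay score is a weighted sum. The weights and dimension names below are hypothetical, purely to illustrate the idea:

```python
# Hypothetical weights for the four dimensions named above (illustrative only).
WEIGHTS = {
    "topic_relevance": 0.3,
    "organization_coherence": 0.3,
    "word_usage_complexity": 0.2,
    "grammar_mechanics": 0.2,
}

def holistic_score(dim_scores):
    """Weighted sum of per-dimension scores, all assumed on the same scale."""
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
```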

  12. Automated Essay Scoring Systems

    The first widely known automated scoring system, Project Essay Grader (PEG), was conceptualized by Ellis Batten Page in the late 1960s (Page, 1966, 1968). PEG relies on proxy measures, such as average word length, essay length, number of certain punctuation marks, and so forth, to determine the quality of an open-ended response item.
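
    Proxy measures like these are cheap to compute directly from the text. A rough sketch (the exact feature set and tokenization are illustrative assumptions, not PEG's actual implementation):

```python
import re

def proxy_features(essay):
    """PEG-style proxy measures: length, average word length, punctuation counts."""
    words = re.findall(r"[A-Za-z']+", essay)
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "essay_length": n_words,
        "avg_word_length": avg_word_len,
        "comma_count": essay.count(","),
        "sentence_count": essay.count(".") + essay.count("!") + essay.count("?"),
    }
```

    A downstream model (e.g. linear regression) would then be fit on these features against human-assigned scores.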

  13. Automatic Essay Grading System Using Deep Neural Network

    Dong et al. [] built a hierarchical sentence-document model to represent essays, using an attention mechanism to automatically decide the relative weights of words and sentences. To perform sentiment analysis of short texts, dos Santos [] proposed a deep convolutional neural network that operates at several levels of analysis, from character-level to sentence-level information.
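
    The attention idea can be sketched in a few lines: each sentence receives a softmax weight derived from a relevance score, and the document representation is the weighted sum of sentence vectors. This is a toy illustration of the mechanism, not a reconstruction of the hierarchical model cited here:

```python
import math

def attention_weights(scores):
    """Softmax over per-sentence relevance scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(sentence_vecs, scores):
    """Attention-weighted sum of sentence vectors."""
    w = attention_weights(scores)
    dim = len(sentence_vecs[0])
    return [sum(w[i] * v[d] for i, v in enumerate(sentence_vecs)) for d in range(dim)]
```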

  14. An automated essay scoring systems: a systematic literature review

    Many researchers have worked on automated essay grading and short-answer scoring over the last few decades, but assessing an essay on all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a major challenge. Few researchers have focused on content-based evaluation, while ...

  15. Ahead of the Curve: How PEG™ Has Led Automated Scoring for Years

    PEG, or Project Essay Grade, is the automated scoring system at the core of ERB Writing Practice. It was invented in the 1960s by Ellis Batten Page, a former high school English teacher, who spent "many long weekends sifting through stacks of papers wishing for some help."

  16. An automated essay scoring systems: a systematic literature review

    Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. PEG evaluates writing characteristics such as grammar, diction, construction, etc., to grade ...

  17. Automated Essay Scoring: Kaggle Competition

    Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a of educational assessment and an application of ...

  18. How the PEG Scoring Algorithm Builds Student Writing Skills

    The Project Essay Grade® (PEG®) software is an automated essay scoring solution that builds upon Dr. Ellis Batten Page's research in computational linguistics spanning over 40 years. PEG analyzes written prose, measures writing characteristics like fluency and grammar, and models the decision-making process of expert readers, ultimately ...

  20. CS109a Final Project: Automated Essay Grading

    Automated Essay Grading: a CS109a Final Project by Anmol Gupta, Annie Hwang, Paul Lisker, and Kevin Loughlin. Introduction. One of the main responsibilities of teachers and professors in the humanities is grading students' essays [1].

  21. 5 Best Automated AI Essay Grader Software in 2024

    Project Essay Grade, by Measurement Incorporated (MI), is a great automated grading software that uses AI technology to read, understand, process and give you results. Using the advanced statistical techniques found in this software, PEG can analyze written prose and make calculations based on more than 300 measurements (fluency, diction ...

  22. PDF Automated Essay Scoring Systems

    automated essay scoring systems generate a single score or detailed evaluation of predefined assessment features. This chapter describes the evolution and features ... The first widely known automated scoring system, Project Essay Grader (PEG), was conceptualized by Ellis Batten Page in the late 1960s (Page, 1966, 1968). PEG relies on ...

  23. Texas Launches AI Grader for Student Essay Tests But Insists It's ...

    In fact, the Texas Education Agency will deploy a new "automated scoring engine" for open-ended questions on the tests. And the state hopes to save millions with the new program.

  24. Automated Scoring of Writing

    Automated essay scoring involves automatic assessment of a student's written work, usually in response to a writing prompt. This assessment generally includes (1) a holistic score of the student's performance, knowledge, and/or skill and (2) a score descriptor on how the student can improve the text. For example, e-rater by ETS (2013) scores ...

  25. How Texas will use AI to grade this year's STAAR tests

    Texas will use computers to grade written answers on this year's STAAR tests. The state will save more than $15 million by using technology similar to ChatGPT to give initial scores, reducing ...

  26. Transformer-based Joint Modelling for Automatic Essay Scoring and Off

    Automated Essay Scoring (AES) systems are widely popular in the market, as they constitute a cost-effective and time-effective option for grading systems. Nevertheless, many studies have demonstrated that AES systems fail to assign lower grades to irrelevant responses. Thus, detecting off-topic responses in automated essay scoring is crucial in practical tasks where candidates write ...
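
    A crude way to flag such off-topic responses is to compare the essay's bag-of-words with the prompt's, e.g. via cosine similarity. The threshold below is an arbitrary illustration; production systems use far richer representations:

```python
import math
import re
from collections import Counter

def _bow(text):
    """Lower-cased bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = _bow(a), _bow(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def is_off_topic(prompt, essay, threshold=0.1):
    """Flag essays whose vocabulary barely overlaps the prompt's (toy heuristic)."""
    return cosine_similarity(prompt, essay) < threshold
```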