AI, English Testing and Transparency: A Tutor’s Perspective on PTE Academic

By Reena Lopes
Over the past decade, artificial intelligence has become increasingly integrated into high-stakes language testing. The Pearson Test of English Academic (PTE Academic) is one of the leading tests in this space, and it relies heavily on automated scoring technologies to assess candidates’ English proficiency.
As a tutor who prepares students for both IELTS and PTE Academic, I do not approach this topic from a position of brand loyalty. My interest is pedagogical: how well do these tests measure real-world English proficiency, and how transparent and fair are the systems behind them?
AI-based assessment offers genuine advantages, but there are important questions — particularly around validity, transparency and test design — that deserve closer scrutiny.
Disclaimer: This article is written from the perspective of an independent tutor and educator. It does not represent the views of Pearson, IELTS, or any testing authority. The analysis presented here is based on publicly available documentation, published research, and professional experience in test preparation. The intent is not to challenge the legitimacy of any English proficiency test, but to examine questions of assessment design, transparency, and validity in the context of AI-based scoring.
What PTE Academic’s AI Scoring Actually Does
PTE Academic uses automated scoring systems derived from Pearson’s Versant technology (Pearson, n.d.) and related machine-learning models. In simplified terms, speaking responses are analysed using:
Acoustic features (timing, stress, rhythm, intonation)
Pronunciation and phoneme matching
Fluency metrics (speech rate, pauses, continuity)
Automatic speech recognition (ASR) confidence scores
These systems are extremely good at identifying whether speech sounds fluent and English-like. They are fast, consistent and scalable — qualities that are attractive for institutions processing millions of test results globally (Pearson, n.d.).
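To make the surface-level nature of these measurements concrete, the sketch below approximates two such fluency metrics (speech rate and pause ratio) from a recorded response. It is a minimal illustration, assuming the open-source librosa audio library and a hypothetical response recording; it is not Pearson's feature set or scoring pipeline.

```python
# Minimal, illustrative approximation of two surface fluency metrics.
# NOT Pearson's feature set or model; assumes the open-source librosa
# library and a hypothetical "response.wav" recording.
import librosa

def rough_fluency_features(audio_path: str, transcript_word_count: int) -> dict:
    """Estimate speech rate and pause ratio for one spoken response."""
    y, sr = librosa.load(audio_path, sr=16000)        # mono audio at 16 kHz
    total_seconds = librosa.get_duration(y=y, sr=sr)

    # Non-silent intervals in samples; top_db sets the silence threshold.
    voiced_intervals = librosa.effects.split(y, top_db=30)
    voiced_seconds = sum(end - start for start, end in voiced_intervals) / sr

    return {
        "speech_rate_wpm": 60 * transcript_word_count / max(total_seconds, 1e-6),
        "pause_ratio": 1 - voiced_seconds / max(total_seconds, 1e-6),
    }

# Example (hypothetical file and word count, e.g. taken from an ASR transcript):
# print(rough_fluency_features("response.wav", transcript_word_count=14))
```

Neither number depends on whether the candidate actually said the right words, which is exactly the gap explored in the rest of this article.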
What these systems do not do well is equally important.
They do not meaningfully or consistently evaluate:
communicative intent
pragmatic appropriateness
interactional competence
whether a candidate is responding in good faith to the task
This distinction sits at the heart of many concerns about automated language assessment.
Reliability Is Not the Same as Validity
Pearson frequently emphasises the reliability of PTE Academic scoring — and rightly so.
Automated systems do not experience fatigue, bias or inconsistency in the way human examiners can.
However, reliability only answers one question:
Is the test measuring something consistently?
It does not answer:
Is the test measuring the right thing?
This is where validity becomes critical.
A test can be highly reliable while still under-representing the construct it claims to measure (O’Sullivan, 2012; Xie & Cheng, 2015). In language testing terms, this is known as construct underrepresentation — and it becomes visible when test-wise strategies outperform genuine language ability.
Research into automated speech scoring has increasingly focused on “interpretability” — the attempt to understand which linguistic features machine learning models rely on when assigning proficiency scores. However, even studies that explicitly examine interpretability rely on post-hoc statistical techniques to infer feature importance, rather than offering direct or inspectable decision rules. Bamdev et al. (2021), for example, demonstrate that while automated systems can learn patterns aligned with scoring rubrics, interpreting why a particular response receives a given score remains an indirect and probabilistic process, requiring specialised analytical tools and access to internal model behaviour. This highlights a fundamental limitation of automated assessment systems: interpretability is not inherent, and transparency cannot be assumed.
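To illustrate what "post-hoc" means in practice, the sketch below reproduces the general shape of such an analysis: fit a scoring model on response features, then use Shapley values to infer which features drive its predictions. Everything in it (the data, the feature names, the model) is synthetic and invented for illustration; it follows the broad approach described by Bamdev et al. (2021), not their actual pipeline, and certainly not Pearson's.

```python
# Post-hoc interpretability sketch on synthetic data (illustration only).
# Assumes scikit-learn and the shap package; none of this is Pearson's model.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
feature_names = ["speech_rate", "pause_ratio", "phoneme_match", "word_accuracy"]

# Synthetic "responses": the fake ground-truth score leans on surface features.
X = rng.uniform(0, 1, size=(500, 4))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * X[:, 2] + 0.1 * X[:, 3] \
    + rng.normal(0, 0.02, 500)

model = GradientBoostingRegressor().fit(X, y)

# Shapley values approximate each feature's contribution to each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean absolute contribution: an inferred, not inspected, importance ranking.
for name, importance in zip(feature_names, np.abs(shap_values).mean(axis=0)):
    print(f"{name}: {importance:.3f}")
```

Even in this tiny example, the analyst only ever sees an estimated ranking of feature contributions, never the decision rules themselves, which is the sense in which interpretability remains indirect and probabilistic.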
Repeat Sentence: A Case Study in System Vulnerability
The PTE Academic Repeat Sentence question type illustrates this problem clearly.
In theory, the task is designed to measure (Pearson, 2025):
listening comprehension
short-term memory
spoken language accuracy
In practice, many candidates have learned that:
fluent rhythm and timing matter more than lexical accuracy
semantic nonsense delivered confidently can still score highly
Some test takers simply reproduce random or approximate word strings, matching speed and intonation rather than meaning. From a human examiner’s perspective, this behaviour would immediately raise red flags.
From an automated scoring system’s perspective, however, fluent delivery often satisfies the scoring criteria: lexical accuracy is not absent from scoring, but it is frequently outweighed by fluency-related features.
This is not merely anecdotal. It is a predictable outcome when:
surface features are rewarded more heavily than meaning
tasks assume honest participation
optimisation strategies spread faster than algorithmic updates
The issue is not student behaviour — it is test design interacting with fixed algorithms.
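As a purely hypothetical illustration of that interaction, the toy scorer below weights fluency-style features more heavily than lexical accuracy. The features and weights are invented for this example and bear no relation to Pearson's actual model; the point is only that, under such a weighting, a fluent but inaccurate response can outrank an accurate but hesitant one.

```python
# Toy illustration only: invented features and weights, not PTE's model.
FEATURE_WEIGHTS = {
    "fluency": 0.4,            # speech rate, continuity
    "rhythm": 0.3,             # timing and intonation match
    "lexical_accuracy": 0.3,   # proportion of target words reproduced
}

def toy_score(features: dict) -> float:
    """Weighted sum of normalised (0-1) features."""
    return sum(FEATURE_WEIGHTS[name] * value for name, value in features.items())

# A confident candidate repeating approximate, partly meaningless words:
fluent_but_wrong = {"fluency": 0.95, "rhythm": 0.90, "lexical_accuracy": 0.40}

# A careful candidate reproducing the sentence accurately but hesitantly:
accurate_but_hesitant = {"fluency": 0.50, "rhythm": 0.50, "lexical_accuracy": 0.95}

print(toy_score(fluent_but_wrong))        # 0.77
print(toy_score(accurate_but_hesitant))   # 0.635
```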
Gaming the System Is Structural, Not Exceptional
High-stakes tests inevitably encourage optimisation (O’Sullivan, 2012). This is not unique to PTE Academic.
However, automated systems are particularly vulnerable because:
scoring rules are consistent and repeatable
feedback loops emerge through coaching platforms
candidates adapt faster than scoring models can be recalibrated
When success depends on how the system interprets signals, rather than how humans interpret communication, strategic behaviour becomes rational.
This creates a widening gap between test performance and actual communicative competence.
What has Pearson done to mitigate gaming?
In response to concerns about test-wise strategies and memorised responses or templates, Pearson has introduced human review for a limited number of PTE Academic item types that have historically shown higher susceptibility to gaming, particularly in writing tasks where memorised templates are common (Pearson, 2025).
While this change acknowledges a genuine weakness in fully automated assessment, key details remain unclear. Pearson has not publicly disclosed:
the proportion of tests or responses that receive human review
the qualifications, training, or standardisation procedures of human assessors
the criteria used to determine when a response is escalated from automated scoring to human judgement
Pearson’s Scoring Information for Teachers and Partners document clarifies that while question weighting and initial scoring decisions are generated algorithmically, extended responses that undergo rescoring are reviewed entirely by trained human raters, with adjudication applied where discrepancies arise (Pearson, 2026). While this process is presented as a fairness safeguard, it operates post hoc rather than as an integrated component of primary scoring, limiting its capacity to address construct validity issues at the task-design level.
This lack of transparency makes it difficult for educators and candidates to understand how automated and human scoring interact in practice.
The issue is further complicated by the fact that test fees have remained unchanged, raising reasonable questions about how human assessment is funded, how frequently it occurs, and what role it plays in final score decisions.
Transparency, Secrecy, and Gatekeeping
Perhaps the most consequential issue is not AI itself, but opacity.
Pearson’s scoring models are proprietary (Pearson, 2026). As a result:
scoring weightings are undisclosed
feature importance is unknown
error margins are unpublished
task-level validity cannot be independently verified
Tutors and candidates can only engage with the system through Pearson-controlled platforms.
Recently, Pearson released a Scoring Information for Teachers and Partners document aimed at explaining scoring procedures and correcting misconceptions (Pearson, 2026). This effort offers limited clarity, however, as the underlying algorithmic logic remains proprietary and inaccessible to external researchers, reinforcing ongoing transparency concerns.
By contrast, IELTS operates within a far more open ecosystem (O’Sullivan, 2012; Xie & Cheng, 2015):
public band descriptors
examiner criteria available for scrutiny
multiple publishers and preparation providers
ongoing academic debate and independent research
This openness does not eliminate flaws, but it allows for shared accountability. It also allows test takers, teachers, institutions and government bodies alike to ask questions.
Opacity, on the other hand, shifts authority entirely to the test provider.
What Independent Research Exists?
There is some independent research on PTE Academic — particularly studies examining correlations with IELTS scores or academic performance. These generally show moderate to strong correlations, suggesting PTE Academic measures something related to academic English ability (Xie & Cheng, 2015).
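For context, a concordance study of this kind ultimately reduces to computing a correlation between paired scores. The sketch below shows that basic calculation on invented, idealised data (not figures from any published study), using SciPy's pearsonr.

```python
# Hypothetical concordance check on invented scores (not real study data).
from scipy.stats import pearsonr

pte_scores   = [50, 58, 65, 72, 79, 84, 90]          # same candidates, PTE Academic
ielts_scores = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]   # their IELTS overall bands

r, p_value = pearsonr(pte_scores, ielts_scores)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r is near 1 only because the data are idealised
```

A high correlation tells us the two tests rank candidates similarly; it says nothing by itself about whether either test captures the full construct of communicative competence.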
However, what is missing is:
independent psychometric analysis of AI speaking scores
task-level validation of automated speaking constructs
transparent evaluation of how gaming behaviour is handled
Most detailed evidence about scoring accuracy still comes from Pearson-produced reports, not externally audited research (Pearson, 2025; Pearson, 2026).
In assessment science, absence of independent evidence does not imply failure — but it does warrant caution, especially in high-stakes contexts such as migration and university admission.
Policy Decisions vs Pedagogical Ideals
At this point, PTE Academic appears to prioritise policy and operational considerations over pedagogical ideals.
AI scoring enables:
speed
scalability
cost efficiency
global standardisation
What it sacrifices is:
transparency
communicative richness
adaptive judgement
This is not a technological inevitability. More advanced language models already demonstrate deeper discourse-level understanding. The decision not to deploy them fully is shaped by legal, operational, and financial considerations, rather than by pedagogical validity or communicative realism.
As discussed earlier, even research-grade attempts at interpretability rely on indirect, post-hoc methods rather than transparent decision rules. They typically apply techniques such as feature importance and Shapley analysis to estimate which linguistic cues drive a model’s scores and how well those cues align with proficiency rubrics, rather than reflecting surface acoustic features alone (Bamdev et al., 2021).
Final Thoughts
PTE Academic is not an illegitimate test, and AI-based assessment is not inherently flawed. But when scoring systems are opaque, when optimisation strategies outpace design updates, when independent validation is limited, and when access to understanding outcomes is tightly controlled, it becomes reasonable — and necessary — for teachers, institutions and candidates to ask harder questions.
Precision in scoring should never be confused with precision in measurement.
Language is not just a signal.
It is interaction, intent, repair and meaning, and any test that claims to measure proficiency must ultimately be judged by how well it captures that reality.
Pearson’s partial reintroduction of human assessment reinforces a central point: while AI offers efficiency and consistency, human judgement remains necessary when test validity is threatened by strategic behaviour. The question is not whether AI should be used, but whether it can function as the sole arbiter of language proficiency in high-stakes contexts.
The primary objective of English language assessments for visa applicants, migrants, and international students is to evaluate a candidate’s ability to communicate effectively in Australia’s national language. Based on the issues outlined above, it is reasonable to question whether PTE Academic, in its current format, fully meets this objective. While the use of artificial intelligence in assessment has clear benefits, effective evaluation of human communication requires careful consideration of where automation supports judgement — and where it may constrain it. Further work is therefore needed to ensure that AI-driven assessment systems align with the communicative demands they are intended to measure.
References
Bamdev, P., Grover, M. S., Singla, Y. K., Vafaee, P., Hama, M., & Shah, R. R. (2021). Automated speech scoring systems under the lens: Evaluating and interpreting the linguistic cues for language proficiency. arXiv. https://arxiv.org/abs/2106.12066
O’Sullivan, B. (2012). Assessment issues in language testing. Routledge.
Pearson. (2025). PTE Academic test taker score guide. Pearson Education. https://www.pearsonpte.com/ctf-assets/yqwtwibiobs4/3TQDBW61bfUHn8XJJAXm8v/ef900ec4e82f2043248e485bd4e3b15d/PTE_Academic_Test_Taker_Score_Guide.pdf
Pearson. (2025). The official guide to PTE Academic (3rd ed.). Pearson Education.
Pearson. (2026). Scoring Information for Teachers and Partners. Pearson Education.
Pearson. (n.d.). Human and automated scoring in PTE Academic. Pearson Education.
Pearson. (n.d.). PTE AI uncovered [White paper]. Pearson Education. https://www.pearsonpte.com/ctf-assets/yqwtwibiobs4/8WsMCnBlinM5ATEUXocXG/a722e363be50ffa7284ec9ca480a0b2d/ai-uncovered-dr-bonk.pdf
Xie, Q., & Cheng, L. (2015). The IELTS and PTE Academic: A comparison of test performance and construct coverage. Language Testing in Asia, 5(1). https://doi.org/10.1186/s40468-015-0010-8






