Q&A on AI Text Detection
Are there any reliable tools available to help me identify AI-generated writing in student assignments?
The short answer is no.
Educational technology companies are saying ‘yes, please buy our product’, but active research and recent publications in the field of AI say ‘there is currently no way to reliably detect AI-generated text with a tool.’
A case study published in June 2023 in the Croatian Medical Journal sought to gauge the effectiveness of current AI detection products and surfaced an interesting finding in its experimental setup. To establish a control group, the author isolated a subset of abstracts published in 2003 and 2013, well before generative AI was generally accessible, and ran them through three AI detection products. He then ran the experimental group, abstracts published in 2023, through the same products. The number of papers flagged as “potentially AI written” was roughly equivalent between the experimental and control groups (Homolak, 2023).
Multiple papers describe this same issue: either no significant difference between the control and experimental groups, or too many ‘false positives’, human-written text flagged as AI-generated (see the extended references below). So who are the companies claiming they can reliably detect AI-generated text, and what evidence do they have to back their claims?
Takeaways
- There is currently no tool that reliably detects AI-generated text, and such text is likely to become even harder to detect as AI continues to improve.
- Content experts become more effective at discerning AI-generated text in their area of expertise after familiarizing themselves with examples.
- Addressing AI use as a quality-of-work issue connects the potential long-term impacts of using AI as a student to students’ goals of becoming independent practitioners after graduation.
Turnitin.com is a well-known name in plagiarism detection and has arguably become the front-runner in AI detection (Knox, 2023). Of note, UAF currently subscribes to their plagiarism detector and has integrations set up in both the Blackboard and Canvas LMSs. Their lead is not surprising: they are sitting on a database of student papers collected over the past two decades, coupled with the rights to use these works to provide “other Services” as outlined in their user agreement: “This User Agreement grants Turnitin and its affiliates, vendors, service providers and licensors only a non-exclusive right to Your paper solely for the purposes of plagiarism prevention and the other Services provided as part of Turnitin.” (Turnitin, 2018) That’s right: every paper that has been submitted for plagiarism checking is potentially being used to build a market-leading AI-detection tool and to profit off the backs of students.

Setting aside the ethical discussion, this still leaves the question of accuracy. Turnitin backs its claims by citing a paper published in late 2023 in Open Information Science titled “The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors” (Walters, 2023). The study shows that Turnitin and its competitor Copyleaks both identify AI-generated text with high accuracy and close to zero false positives. The experiment consisted of a control group of essays written by college freshmen, an experimental group of similar essays generated by GPT-3.5, and a second experimental group of similar essays generated by GPT-4. Notably, all essays were stripped of the introductory paragraph, identifying information, references, and any tables or figures. The biggest limitation the study notes is its omission of an experimental group whose essays mixed human and AI-generated writing.
A follow-up study by five computer scientists at the University of Maryland investigated this limitation using what they term a “paraphrasing attack” (Sadasivan et al., 2023). They take an AI-generated response and run it through another AI trained to paraphrase text; depending on how strongly a given detector suspects the response of being AI-generated, they repeat the process until that detector exhibits a 20-30% loss in accuracy. All of the detectors tested dropped significantly in accuracy within three to five rounds of paraphrasing, leading the authors to conclude, “… the practical applications of AI-text detectors can become unreliable and invalid. Security methods need not be foolproof. However, we need to make sure that it is not an easy task for an attacker to break these security defenses.” (Sadasivan et al., 2023) In summary, AI detectors like Turnitin’s can be very accurate (Walters, 2023), but they are susceptible to simple, broadly available workarounds (Sadasivan et al., 2023).
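To make the mechanics concrete, here is a minimal sketch of that attack loop in Python. Both functions are hypothetical stand-ins (a real attack would call an actual detection service and a paraphrasing model); only the control flow, repeated paraphrasing until the detector’s score drops, reflects the idea described in the paper.

```python
# Minimal sketch of the "paraphrasing attack" loop described by Sadasivan et
# al. (2023). Both functions below are hypothetical stand-ins for a real
# detection API and a real paraphrasing model; only the loop structure
# illustrates the attack.

def detect_ai_probability(text: str) -> float:
    """Hypothetical detector: probability that the text is AI-generated."""
    # Placeholder heuristic for illustration only.
    return 0.9 if "as an ai language model" in text.lower() else 0.3

def paraphrase(text: str) -> str:
    """Hypothetical paraphraser: rewrites text while preserving its meaning."""
    # Placeholder rewrite for illustration only.
    return text.lower().replace("as an ai language model", "in my view")

def paraphrasing_attack(text: str, threshold: float = 0.5, max_rounds: int = 5) -> str:
    """Paraphrase repeatedly until the detector's score falls below threshold."""
    for _ in range(max_rounds):
        if detect_ai_probability(text) < threshold:
            break  # the detector no longer flags the text
        text = paraphrase(text)
    return text

print(paraphrasing_attack("As an AI language model, I would argue that..."))
```

The point of the sketch is how little machinery the attack requires: a few rounds of automated rewriting, each preserving the meaning, are enough to push a detector below its decision threshold.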
For contrast, OpenAI, the developers of ChatGPT, also built an AI detector, but shut it down the same year it launched (Hendrik Kirchner et al., 2023). In fact, OpenAI’s website directly answers the question “Do AI detectors work?”: “In short, no, not in our experience. Our research into detectors didn’t show them to be reliable enough given that educators could be making judgments about students with potentially lasting consequences.” (How Can Educators Respond to Students Presenting AI-Generated Content as Their Own? | OpenAI Help Center, n.d.)
Let’s take a minute to let that sink in. The company willing to develop artificial intelligence and risk the potential of an AI-assisted armageddon is worried about the “lasting consequences” of AI detector use in education. They also corroborate what previous experiments on AI detectors show: “When we at OpenAI tried to train an AI-generated content detector, we found that it labeled human-written text like Shakespeare and the Declaration of Independence as AI-generated.” (Hendrik Kirchner et al., 2023)
OpenAI controls a reported 36% of the generative AI market and was most recently (2024) valued at over 80 billion USD (Williams, n.d.). If they cannot make AI detection work, and have stopped investing in it, I have a hard time believing that others somehow can.
The above only addresses reliability; there are other issues surrounding AI detection as well. The ethics of using AI in education (Han et al., 2023) and the long-term sustainability of a product built to keep pace with AI development, which is unlikely to slow down (Henshall, 2023), further make AI detection a questionable proposition for academics.
If AI detectors don’t work, what can I do?
In short: familiarize yourself with AI-generated content in your discipline to develop your own detection skills, and when you suspect AI use, treat it as a quality-of-work issue. Eventually, AI use will be much less detectable than it is now, as the quality of easily accessible AI models increases. Use this in-between time to talk with your students about how AI use affects their skill development.
First, recognize that there is an overlooked AI detector that can work quite well. That detector is you, the instructor and content expert.
Research has shown significant improvements in people’s ability to identify AI-generated text after they reviewed a number of examples of it (Abdelaal et al., 2022). Notably, the participants in that study were not experts in the content they reviewed. Taken inversely, if the participants had been content experts reviewing material in their own area of expertise, you could posit even greater improvements in detection accuracy. Apply these findings to your course by reviewing AI-generated text in your discipline.
Go ahead and use a large language model-powered AI such as ChatGPT to attempt to respond to some of your course’s assignments; one scripted way to do this is sketched below.
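If you would rather script this than paste prompts into the chat interface, the sketch below shows one way to batch-generate sample responses with OpenAI’s Python library. It assumes the `openai` package (version 1 or later) is installed and an API key is set in the `OPENAI_API_KEY` environment variable; the assignment prompts are placeholders for your own.

```python
# A sketch for batch-generating sample AI answers to assignment prompts so you
# can study their style. Assumes the `openai` package (v1+) is installed and
# OPENAI_API_KEY is set in the environment; the prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assignment_prompts = [
    "Explain the difference between mitosis and meiosis in 300 words.",
    "Summarize the main argument of this week's assigned reading.",
]

for prompt in assignment_prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # GPT-3.5, the model behind the free ChatGPT tier
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{response.choices[0].message.content}\n")
```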
What you are likely to find is what you already knew anecdotally: AI-generated text in a discipline-specific context is generally of lower quality than the student work you have come to expect. However, this current state of AI is likely to last only a short while.
A study published this year (2024) in the Archives of Pathology & Laboratory Medicine entered GPT-3.5 (the free, public version of ChatGPT), GPT-4 (the paid version), and a staff pathologist who had recently passed their Canadian pathology licensing exams as participants in a mock pathology licensing exam (Wang et al., 2024). Fifteen licensed pathologists then reviewed the answers, rated their quality on a Likert scale, and tried to identify which answer came from which participant, a sort of Turing test setup. The pathologists were, for the most part, able to correctly identify the GPT-3.5 answers based on the quality of the response. The surprising result was that GPT-4’s responses were frequently rated higher in quality than the staff pathologist’s answers.
While GPT-3.5 and models like it remain the free, easily accessible option, we can expect students to use them most, leaving their answers detectable because of their low-quality output on domain-specific questions. If and when OpenAI or another AI provider upgrades its free service to a more powerful model such as GPT-4, AI-generated answers will become much harder to discern, even for content experts.
So how should you respond in this window of time while you can still detect AI-generated responses? Say you identify a student’s response as AI-generated: what then? Consider addressing this as a quality issue rather than a conduct issue. As the content expert, you understand what constitutes quality work in your field. An article from Higher Education entitled ‘Developing evaluative judgement: enabling students to make decisions about the quality of work’ elaborates on this.
The paper focuses on evaluative judgement: the ability to discern quality work within a domain. Domain practitioners possess this skill, and students hoping to become practitioners aspire to it. While the paper, published in 2017, makes no connection to AI, there is a clear link between poor-quality student responses generated by AI and the student in question’s current level of evaluative judgement in that domain. Discussing why a response is of poor quality, how to improve it, and what the standards of quality are is a great way to address AI use. The authors include a list of assessment activities in their Table 1 that can be built into a course to help students develop evaluative judgement (Tai et al., 2017). Including these assessments, or retrofitting a current assessment in your course, can improve the overall quality of student responses. AI use itself may remain unaffected, but which is more important: a student’s ability to identify and produce a quality response, or their abstinence from AI tools?
Perhaps more than anything, having a conversation with your students early on about how AI dependence may affect their learning goals can help address the problem before it appears. As Tai et al. point out, “… students must gain an understanding of quality and how to make evaluative judgements, so that they may operate independently on future occasions, taking into account all forms of information and feedback comments, without explicit external direction from a teacher or teacher-like figure.” (2017) Confirming this shared goal of independence with your students, identifying generative AI as a ‘teacher-like figure’, and adapting your assessments to AI can set you and your students up for a productive semester.
The UAF Center for Teaching and Learning has a number of resources to help you adapt to AI. View any of the linked articles in the resources section, join us at our monthly What’s New in AI drop-in session, or schedule a one-to-one consultation. Finally, colleagues from the CTL and I will be presenting at the UA Faculty Thought Forum on AI March 28 and 29, schedule TBD.
References
Abdelaal, E., Gamage, S., & Mills, J. (2022). Assisting academics to identify computer generated writing. European Journal of Engineering Education, 47, 1–21. https://doi.org/10.1080/03043797.2022.2046709
Hendrik Kirchner, J., Ahmad, L., Aaronson, S., & Leike, J. (2023, January 31). New AI classifier for indicating AI-written text. Openai.com. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text
Henshall, W. (2023, August 2). 4 Charts That Show Why AI Progress Is Unlikely to Slow Down. Time. https://time.com/6300942/ai-progress-charts/
Homolak, J. (2023, June 30). Exploring the adoption of ChatGPT in academic publishing: Insights and lessons for scientific writing. Croatian Medical Journal. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10332292/
How can educators respond to students presenting AI-generated content as their own? | OpenAI Help Center. (n.d.). Help.openai.com. https://help.openai.com/en/articles/8313351-how-can-educators-respond-to-students-presenting-ai-generated-content-as-their-own
Knox, L. (2023, April 3). Can Turnitin Cure Higher Ed’s AI Fever? Inside Higher Ed. https://www.insidehighered.com/news/2023/04/03/turnitins-solution-ai-cheating-raises-faculty-concerns
Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023). Can AI-Generated Text be Reliably Detected? ArXiv, 2303.11156v3. https://doi.org/10.48550/arxiv.2303.11156
Seo, K., Tang, J., Roll, I., Fels, S., & Yoon, D. (2021). The impact of artificial intelligence on learner–instructor interaction in online learning. International Journal of Educational Technology in Higher Education, 18(1). https://doi.org/10.1186/s41239-021-00292-9
Tai, J., Ajjawi, R., Boud, D., Dawson, P., & Panadero, E. (2017). Developing evaluative judgement: enabling students to make decisions about the quality of work. Higher Education, 76(3), 467–481. https://doi.org/10.1007/s10734-017-0220-3
Turnitin. (2018). Turnitin End-User License Agreement. Turnitin.com. https://www.turnitin.com/agreement.asp
Walters, W. H. (2023). The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors. Open Information Science, 7(1). https://doi.org/10.1515/opis-2022-0158
Wang, A. Y., Lin, S., Tran, C., Homer, R. J., Wilsdon, D., Walsh, J. C., Goebel, E. A., Sansano, I., Sonawane, S., Cockenpot, V., Mukhopadhyay, S., Taskin, T., Zahra, N., Cima, L., Semerci, O., Özamrak, B. G., Mishra, P., Vennavalli, N. S., Chen, P. C., & Cecchini, M. J. (2024). Assessment of Pathology Domain-Specific Knowledge of ChatGPT and Comparison to Human Performance. Archives of Pathology & Laboratory Medicine. https://doi.org/10.5858/arpa.2023-0296-oa
Williams, S. (n.d.). NVIDIA, OpenAI & Microsoft leading in generative AI market. CFOtech India. Retrieved March 8, 2024, from https://cfotech.in/story/nvidia-openai-microsoft-leading-in-generative-ai-market
Extended References on AI Detection Feasibility
Search terms used in Google Scholar: AI detection, writing, education
Cooperman, S. R., & Brandão, R. A. (2024). AI Tools vs AI Text: Detecting AI-Generated Writing in Foot and Ankle Surgery. Foot & Ankle Surgery: Techniques, Reports & Cases, 100367–100367. https://doi.org/10.1016/j.fastrc.2024.100367
Elkhatat, A. M., Elsaid, K., & Al-Meer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19(1). https://doi.org/10.1007/s40979-023-00140-5
Han, B., Nawaz, S., Buchanan, G., & McKay, D. (2023). Ethical and Pedagogical Impacts of AI in Education. Lecture Notes in Computer Science, 667–673. https://doi.org/10.1007/978-3-031-36272-9_54
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2304.02819
Resources (Most are external links)
OpenAI’s resource site for educators
On the Opportunities and Risks of Foundation Models (Stanford)
Professors proceed with caution using AI-detection tools (Higher Ed Chronicles)
Guidance on AI Detection and Why We’re Disabling Turnitin’s AI Detector | Vanderbilt University