Plagiarism Detection in English Academic Documents using A Lexical-Semantic Hybrid and Support Vector Machine
DOI:
https://doi.org/10.35314/2zz12581
Keywords:
Plagiarism, lexical-semantic hybrid, SVM
Abstract
Detecting plagiarism in academic writing has become increasingly challenging because advanced text modification strategies reduce surface-level similarity while preserving the original meaning. This study proposes a hybrid plagiarism detection system that integrates lexical and semantic similarity features to distinguish between plagiarized and altered documents in academic texts. Its key contribution is a systematic evaluation of a lexical–semantic hybrid plagiarism detection approach using a Support Vector Machine (SVM) on English-language academic documents, in which plagiarism cases from all obfuscation levels are consolidated into a single plagiarism class. Lexical similarity is modeled using Term Frequency–Inverse Document Frequency (TF–IDF), while semantic similarity is captured through Sentence-BERT embeddings. The two similarity scores are combined into a two-dimensional hybrid similarity representation and classified using the SVM. The proposed approach is evaluated on the PAN 2025 dataset using stratified 5-fold cross-validation. Experimental results show that the hybrid SVM-based model achieves an average accuracy of 92.5% with the best-performing kernel, along with competitive precision, recall, F1-score, and AUC values. Kernel-based evaluation and cross-validation analyses further demonstrate the robustness and generalization capability of the proposed framework, indicating that the hybrid lexical–semantic representation is effective for distinguishing plagiarism from altered content in English academic writing.
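The two-dimensional hybrid representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each document pair is reduced to one lexical score (cosine similarity of TF-IDF vectors) and one semantic score (cosine similarity of sentence embeddings). The function names (`tfidf_vectors`, `hybrid_features`, `toy_embed`) are hypothetical, the TF-IDF weights are fitted on the pair alone rather than on a full corpus, and `toy_embed` is a character-trigram placeholder that stands in for Sentence-BERT only so the example runs without external models.

```python
import math
import re
import zlib
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercase and keep alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Cosine similarity; defined as 0.0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(docs: Sequence[str]) -> List[List[float]]:
    # Assumption: TF-IDF is fitted on the given documents only,
    # with raw term counts and a smoothed IDF.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


def toy_embed(text: str, dim: int = 64) -> List[float]:
    # Hypothetical placeholder for a Sentence-BERT encoder: hashed
    # character-trigram counts. It does NOT capture meaning.
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def hybrid_features(
    suspicious: str,
    source: str,
    embed: Callable[[str], Sequence[float]],
) -> Tuple[float, float]:
    # Returns (lexical similarity, semantic similarity) for one document pair.
    lex_a, lex_b = tfidf_vectors([suspicious, source])
    lexical = cosine(lex_a, lex_b)
    semantic = cosine(embed(suspicious), embed(source))
    return lexical, semantic


features = hybrid_features(
    "The model was trained on labelled data.",
    "We trained the model using labelled data.",
    toy_embed,
)
```

In a fuller pipeline, `embed` would wrap a real sentence-embedding model, and the resulting pairs of scores, together with their class labels, would be passed to an SVM classifier (for example scikit-learn's `SVC`) under stratified cross-validation.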
License
Copyright (c) 2026 INOVTEK Polbeng - Seri Informatika

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

