An Indonesian Chatbot for Disease Diagnosis Using Retrieval-Augmented Generation

Authors

  • Muhammad Adrinta Abdurrazzaq, Universitas Kalbis
  • Edwin Lesmana Tjiong, Universitas Kalbis
  • Aulia Fasya, Universitas Kalbis
  • Michelle Hiu, Universitas Kalbis
  • Joses Tanuwidjaya, Universitas Kalbis

DOI:

https://doi.org/10.35314/9nnkn955

Keywords:

Retrieval-Augmented Generation, GPT-OSS, Medical Chatbot, Information Retrieval, Hybrid Ranking

Abstract

The rapid advancement of Large Language Models (LLMs) has enabled their use in medical information systems, but hallucinations, domain mismatches, and the lack of a verified knowledge base remain significant challenges, particularly in low-resource languages such as Indonesian. This study introduces an Indonesian-language medical chatbot based on the open-source GPT-OSS-20B model, enhanced through a Retrieval-Augmented Generation (RAG) pipeline. The system combines semantic retrieval using jina-embeddings-v3, lexical reranking with the BM25 algorithm, and a lightweight Logistic Regression-based domain filter that screens incoming queries so that out-of-domain queries are not passed to the LLM. Evaluation on Indonesian medical articles and annotated patient-doctor conversations shows that the domain filter works well on synthetic data but misclassifies natural queries. A hybrid weighted reranker (FAISS L2 + BM25) performed best, with a Top-30 accuracy of 0.699. Black-box testing indicates that the system flow functions as designed, although response quality has not been validated by clinical experts. These findings suggest that RAG-based open-source LLMs can improve access to Indonesian-language medical information, but important limitations remain: the lack of clinical validation, potential errors in scraped data, and the limited robustness of the domain filter.
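The hybrid weighted reranker mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting, normalization, or tokenization, so the weight `alpha`, the min-max normalization, the BM25 parameters `k1` and `b`, and all function names below are assumptions. Plain-Python L2 distance over toy two-dimensional vectors stands in for a FAISS L2 index over jina-embeddings-v3 embeddings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rerank(query_vec, query_tokens, doc_vecs, docs_tokens,
                  alpha=0.5, top_k=3):
    """Return document indices ordered by a weighted semantic + lexical score."""
    # A smaller L2 distance means a closer match, so negate before normalizing.
    semantic = min_max([-l2_distance(query_vec, v) for v in doc_vecs])
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical): three tokenized documents with made-up embeddings.
docs_tokens = [
    ["demam", "batuk", "pilek"],
    ["nyeri", "dada", "sesak"],
    ["demam", "ruam", "kulit"],
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]

ranking = hybrid_rerank([1.0, 0.0], ["demam", "batuk"], doc_vecs, docs_tokens)
print(ranking)
```

In this sketch both scores are rescaled to the same range before mixing, so that neither the distance scale nor the BM25 scale dominates; `alpha` then controls the balance between semantic and lexical evidence.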

Published

26-11-2025

Section

Articles

How to Cite

An Indonesian Chatbot for Disease Diagnosis Using Retrieval-Augmented Generation. (2025). INOVTEK Polbeng - Seri Informatika, 10(3), 1877-1887. https://doi.org/10.35314/9nnkn955