Digital Archive Classification Performance Analysis Using a Decision Tree Based on TF-IDF Features

Authors

Keywords:

Data Mining, Decision Tree, Archive Classification, Records Management, Office Automation, TF-IDF

Abstract

Digital archive management at PT LNS Indonesia is inefficient because there is no automatic classification system, so employees must inspect file names manually. This slows document retrieval and creates a high risk of misfiled archives. This study designs and implements a digital archive classification system based on the Decision Tree algorithm to automate archive management. The research method follows the Knowledge Discovery in Databases (KDD) framework, starting with the selection of file names from four divisions, namely Human Resources (HR), Finance, Information Technology (IT), and Engineering, followed by text preprocessing and feature extraction using TF-IDF over a combination of word n-grams and character n-grams. The Decision Tree model is developed and evaluated using 5-fold cross-validation. Experimental results show that the model achieves 91% accuracy, a macro-precision of 0.92, a macro-recall of 0.90, and a macro F1-score of 0.91. The implemented system automates the grouping of thousands of files into folders by division and document type, thereby improving the efficiency and accuracy of digital archive management within the company.
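The feature extraction step described in the abstract can be illustrated with a minimal sketch, written here in plain Python. It is not the authors' implementation: the abstract does not state the n-gram ranges, the tokenization rule, or the IDF variant, so the ranges (word 1 to 2, character 2 to 3), the split on non-alphanumeric characters, the smoothed IDF formula, the function names, and the sample file names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_ngrams(name, n_max=2):
    """Lowercase a file name, split on non-alphanumerics, return word 1..n_max-grams."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(name, n_min=2, n_max=3):
    """Return all character n-grams of the lowercased name for n_min <= n <= n_max."""
    text = name.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def features(name):
    """Combine both feature families; prefixes keep the two spaces separate."""
    return (["w:" + g for g in word_ngrams(name)]
            + ["c:" + g for g in char_ngrams(name)])


def tfidf(names):
    """Return one L2-normalized TF-IDF dict per file name.

    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1), a smoothed IDF.
    """
    docs = [Counter(features(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: in how many names each feature occurs at least once.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        weights = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, tf in doc.items()}
        # Guard against an empty name, which would give a zero norm.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical file names, not taken from the study's data set.
names = ["HR_payroll_2023.xlsx", "FIN_invoice_0042.pdf", "HR_leave_form.docx"]
vectors = tfidf(names)
```

In this toy input the word feature `w:hr` occurs in two of the three names, so it receives a lower weight in the first vector than `w:payroll`, which occurs in only one. The resulting sparse vectors are the kind of input a decision tree classifier would be trained on and then scored with 5-fold cross-validation.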

Published

2026-05-02

How to Cite

[1] J. Handoyo, D. F. Sushananto Wijaya, and M. Junaidi, “Digital Archive Classification Performance Analysis Using a Decision Tree Based on TF-IDF Features,” IJCEDS, vol. 5, no. 1, pp. 37–46, May 2026.