Multilabel Classification: Methods Comparison, Performance Improvement, And Model Explainability

Prasetyo, Teguh

Date

2024

Author

Prasetyo, Teguh

Susetyo, Budi

Kurnia, Anang

Abstract

Optimizing tax revenue in Indonesia is difficult because of obstacles such as tax evasion and tax avoidance. Both are closely related to an organization's compliance with tax regulations, which is captured in the Taxpayers Risk Profile. However, this mechanism does not accurately detect tax avoidance and tax evasion risks. To overcome this limitation, this study uses multilabel classification, a machine learning method that assigns a single observation to one or more labels at once. The approaches compared are Problem Transformation (Binary Relevance and Label Powerset), Algorithm Adaptation (ML-kNN and ML-ARAM), and Ensemble (Label Space Partitioning and RAkELd). Based on the model performance comparison, the deep learning-based ML-ARAM method performs best, with an average F1-score of 95.5% and a Hamming loss of 7.4%. We also examine the feature importance of the best model to reduce the feature dimensions and to identify the dominant factors that encourage a taxpayer entity to engage in tax avoidance or tax evasion. The findings improve the accuracy of tax avoidance and tax evasion risk detection in the Risk Profile using machine learning methods, which helps to maximize tax revenue in Indonesia.
Multilabel classification models built with deep learning-based methods achieve superior prediction accuracy but are difficult for researchers to interpret and understand, because these methods use more complex algorithms than probabilistic methods. Even when prediction accuracy is good, model performance is also affected by the condition of the data, such as class imbalance. Limited interpretability and class imbalance are both challenges in implementing classification models, especially in multilabel classification (MLC), which has specific characteristics compared to other classification models.
To address these challenges, this study handles class imbalance with a combination of stratification and resampling methods, and applies SHAP (SHapley Additive exPlanations) to obtain an understandable interpretation of the deep learning-based MLC method. The study uses empirical tax avoidance and tax evasion data from the Indonesian Ministry of Finance from 2018 to 2022. We apply the deep learning-based MLC method ML-ARAM to several combinations of stratification and resampling methods. We measure model performance using F1-score and Hamming loss, and we interpret and explain the model using feature importance values based on SHAP. We found that combining stratification and resampling increased prediction accuracy on data with class imbalance, compared both to the untreated data and to the use of either method alone. Furthermore, we generated an understandable interpretation of the MLC model using SHAP, including the dominant explanatory variables (features) and their influence on the prediction for each observation, based on the feature importance values. The findings offer an alternative way to address class imbalance in MLC data by combining stratification and resampling, together with an understandable interpretation of the MLC model. This interpretation of the deep learning-based model is helpful for decision-making and for improving risk profile policies at the Indonesian Ministry of Finance.
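The two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. The toy label matrices, the function names, and the choice of macro averaging (an unweighted mean of per-label F1) are assumptions made for illustration only; the abstract does not state which F1 averaging the thesis uses, and the data below are hypothetical, not taken from the study.

```python
from typing import List

Matrix = List[List[int]]  # rows = observations, columns = binary labels


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    n_slots = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / n_slots


def f1_for_label(y_true: Matrix, y_pred: Matrix, j: int) -> float:
    """F1-score of label column j: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of the per-label F1-scores (assumed averaging)."""
    n_labels = len(y_true[0])
    return sum(f1_for_label(y_true, y_pred, j) for j in range(n_labels)) / n_labels


# Hypothetical toy data: column 0 = tax avoidance, column 1 = tax evasion.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 0], [0, 0]]

print(hamming_loss(y_true, y_pred))  # 1 wrong slot out of 8
print(macro_f1(y_true, y_pred))      # mean of the two per-label F1-scores
```

In this toy case one of eight label slots is wrong, so the Hamming loss is 0.125, while the macro F1 is the mean of a perfect score on the first label and 2/3 on the second. The example shows why the two metrics are reported together: Hamming loss counts every slot equally, whereas F1 ignores true negatives and is therefore more sensitive to rare positive labels, which matters under class imbalance.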

URI

http://repository.ipb.ac.id/handle/123456789/156306

Collections

MT - Mathematics and Natural Science [4143]