Detecting Outliers Using Modified Recursive PCA Algorithm For Dynamic Streaming Data
Abstract
Outlier analysis has been widely studied and has produced many methods. However, methods for detecting outliers in dynamically streaming batch data (online learning) remain rare. This research proposes a novel online algorithm for detecting outliers in such data. Data points are processed by a modified recursive PCA that estimates the model parameters sequentially: the eigenvalues and eigenvectors of the statistical detection model are updated recursively using approximate values obtained by perturbation methods. More specifically, the recursive eigenstructure is derived from the covariance matrix using the first-order perturbation technique. The Mahalanobis distance is then used as the outlier score. The algorithm's performance is evaluated using accuracy, precision, recall, F1-score, AUC-PR, and execution time. Results show that the proposed online outlier detection is computationally efficient in time and that its detection performance is comparable to that of offline outlier detection via classical PCA.
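The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' exact algorithm: it assumes points arrive one at a time rather than in batches, uses the population covariance (division by n), and assumes the eigenvalues stay distinct so that the first-order eigenvector correction is well defined. The function names `perturbation_update` and `mahalanobis_score` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def perturbation_update(mean, eigvals, eigvecs, n, x):
    """One recursive PCA step using a first-order perturbation approximation.

    Assumes eigvecs (columns) and eigvals describe the population covariance
    of the first n points, and that the eigenvalues are distinct.
    """
    d = x - mean                      # deviation from the current mean
    z = eigvecs.T @ d                 # deviation in the current eigenbasis
    w = n / (n + 1) ** 2              # weight of the rank-one covariance update
    # Covariance update: C_new = n/(n+1) * C + w * d d^T, so in the eigenbasis
    # the perturbation has entries w * z_j * z_i off the diagonal.
    P = w * np.outer(z, z)
    # First-order eigenvalue update: lambda_i + v_i^T (C_new - C) v_i
    new_vals = (n / (n + 1)) * eigvals + w * z**2
    # First-order eigenvector update:
    # v_i + sum_{j != i} P[j, i] / (lambda_i - lambda_j) * v_j
    gaps = eigvals[None, :] - eigvals[:, None]   # gaps[j, i] = lambda_i - lambda_j
    np.fill_diagonal(gaps, np.inf)               # excludes the j == i term
    new_vecs = eigvecs + eigvecs @ (P / gaps)
    new_vecs /= np.linalg.norm(new_vecs, axis=0)  # keep unit-length columns
    new_mean = mean + d / (n + 1)
    return new_mean, new_vals, new_vecs, n + 1

def mahalanobis_score(x, mean, eigvals, eigvecs):
    """Mahalanobis distance of x computed from the eigenstructure."""
    z = eigvecs.T @ (x - mean)
    return float(np.sqrt(np.sum(z**2 / eigvals)))

# Illustrative run on synthetic data (an assumption, not data from the paper).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 1.0])

# Initialize with an exact eigendecomposition of the first 200 points.
n = 200
mean = data[:n].mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data[:n], rowvar=False, bias=True))

# Stream the remaining points, updating the model approximately each time.
for x in data[n:]:
    mean, eigvals, eigvecs, n = perturbation_update(mean, eigvals, eigvecs, n, x)

# Reference values from an offline (classical) eigendecomposition.
exact_vals = np.linalg.eigvalsh(np.cov(data, rowvar=False, bias=True))

inlier_score = mahalanobis_score(np.zeros(3), mean, eigvals, eigvecs)
outlier_score = mahalanobis_score(np.array([15.0, 15.0, 15.0]), mean, eigvals, eigvecs)
```

Each update in this sketch avoids a full eigendecomposition, which is where the time saving of an online approach would come from. The price is an approximation error that accumulates over the stream, so comparing `eigvals` against `exact_vals` is a useful sanity check. In practice a threshold on the score (not specified here) would decide which points are flagged as outliers.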
Copyright (c) 2023 MENDEL
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.