Speaker-Specific Method of Spoofing Attack Detection Based on Anomaly Detection
Keywords:
speaker-specific approach, spoofing detection, presentation attack detection, biometric systems, voice biometrics, transfer learning, anomaly detection
Abstract
Most research in the field of voice presentation attack detection relies on the speaker-independent approach. Nevertheless, several studies indicate that the speaker-specific approach, which uses prior knowledge about the identity of the claimed speaker to improve the accuracy of spoofing detection, is likely to be beneficial. The goal of this work is therefore to propose a speaker-specific method of spoofing attack detection based on anomaly detection and to evaluate its applicability to the detection of synthesized speech and converted voice. Artificial neural networks pre-trained for the tasks of spoofing detection, speaker recognition, and audio pattern recognition are used for feature extraction. A set of anomaly detection models is used as backend classifiers, each trained on bona fide data of a target speaker. The experimental evaluation of the proposed method on the ASVspoof 2019 LA dataset shows that the best speaker-specific spoofing detection system, which combines an anomaly detection model with a neural network pre-trained for the task of speaker recognition, achieves an EER of 4.74%. This result suggests that embeddings extracted by networks pre-trained for speaker recognition contain information that can be used for spoofing detection. In addition, the proposed method made it possible to improve the accuracy of three baseline systems pre-trained for the task of spoofing detection. Experiments with two baseline systems on the ASVspoof 2019 LA dataset showed relative improvements in EER of 7.1% and 9.2%, and a relative improvement in min t-DCF of 4.6%. Experiments with the third baseline system on the ASVspoof 2021 LA dataset showed a relative improvement in EER of 3.9% without a significant improvement in min t-DCF.
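The speaker-specific idea summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: synthetic Gaussian vectors stand in for embeddings from a pre-trained network, and cosine similarity to the centroid of a speaker's bona fide enrollment data stands in for the anomaly detection backends, which the abstract does not specify. All function and variable names are hypothetical, and the EER routine is a simple threshold sweep rather than the official ASVspoof scoring tool.

```python
import math
import random


def centroid(vectors):
    """Mean vector of one speaker's bona fide enrollment embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity; used here as a 'normality' score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Higher scores mean 'more likely bona fide'. A trial is accepted
    when its score is greater than or equal to the threshold.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def noisy_copy(rng, base, sigma):
    """Synthetic 'embedding': a base vector plus Gaussian noise."""
    return [x + rng.gauss(0.0, sigma) for x in base]


# Synthetic data (an assumption for illustration, not real embeddings).
rng = random.Random(0)
dim = 16
speaker_mean = [rng.gauss(0.0, 1.0) for _ in range(dim)]
spoof_shift = [rng.gauss(0.0, 1.0) for _ in range(dim)]
spoof_mean = [m + s for m, s in zip(speaker_mean, spoof_shift)]

enrollment = [noisy_copy(rng, speaker_mean, 0.3) for _ in range(20)]
test_bonafide = [noisy_copy(rng, speaker_mean, 0.3) for _ in range(50)]
test_spoof = [noisy_copy(rng, spoof_mean, 0.3) for _ in range(50)]

# The per-speaker model is fitted on bona fide data only.
speaker_model = centroid(enrollment)
bonafide_scores = [cosine_similarity(speaker_model, e) for e in test_bonafide]
spoof_scores = [cosine_similarity(speaker_model, e) for e in test_spoof]

eer = compute_eer(bonafide_scores, spoof_scores)
print(f"Toy speaker-specific EER: {eer:.2%}")
```

The point of the sketch is that the model never sees spoofed data during fitting: it only learns what the target speaker's genuine speech looks like and flags deviations from it. In a real system the centroid scorer would presumably be replaced by a proper anomaly detection model, and the printed EER on synthetic data says nothing about the results reported above.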
Copyright (c) Михаил Витальевич Евсюков

This work is licensed under a Creative Commons Attribution 4.0 International License.