Personality Detection on Reddit Using DistilBERT

  • Alif Rahmat Julianda Telkom University
  • Warih Maharani Telkom University
Keywords: personality detection, Reddit, DistilBERT

Abstract

Personality is the unique set of motivations, feelings, and behaviors that an individual possesses. Detecting personality from social media is a common research topic in computer science. The personality models most often used in such research are the Big Five Inventory (BFI) and the Myers-Briggs Type Indicator (MBTI). Whereas the BFI describes personality as a set of continuous traits, the MBTI assigns each individual to a discrete type, which makes the MBTI perform better in several classification scenarios. Many studies detect personality on social media with classical machine learning methods such as Logistic Regression, Naïve Bayes, and Support Vector Machines. With the recent rise of deep learning, language models such as DistilBERT can also be applied to this task: the transformer architecture allows DistilBERT to process long input sequences and to parallelize computation. This research therefore detects MBTI personality on Reddit using DistilBERT. The evaluation shows that removing stopwords during preprocessing reduces the model's performance, and that DistilBERT performs worse with class imbalance handling than without it. As a comparison, DistilBERT outperforms the other machine learning classifiers, Naïve Bayes, SVM, and Logistic Regression, in accuracy, precision, recall, and F1-score.
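The classical baselines the abstract compares against can be sketched as a standard text-classification pipeline. The snippet below is a minimal illustration, not the paper's actual setup: the texts, the two MBTI labels, and the TF-IDF features are hypothetical stand-ins, and for brevity it evaluates on the training data rather than a held-out split.

```python
# Illustrative sketch of the baseline classifiers named in the abstract
# (Naive Bayes, SVM, Logistic Regression) on a tiny synthetic MBTI-style
# dataset. All texts and labels here are made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus covering only two MBTI types, for illustration.
texts = [
    "i love quiet evenings reading alone",
    "thinking through every detail before deciding",
    "big parties energize me so much",
    "i talk to everyone at the event",
    "journaling helps me reflect in solitude",
    "meeting new people is the best part of my day",
]
labels = ["INTJ", "INTJ", "ESFP", "ESFP", "INTJ", "ESFP"]

results = {}
for name, clf in [
    ("NaiveBayes", MultinomialNB()),
    ("SVM", LinearSVC()),
    ("LogReg", LogisticRegression(max_iter=1000)),
]:
    # TF-IDF features feed each classifier, mirroring common prior work.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)      # in practice: fit on a training split
    preds = pipe.predict(texts)  # in practice: predict on held-out data
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    results[name] = (acc, p, r, f1)

for name, (acc, p, r, f1) in results.items():
    print(f"{name}: acc={acc:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A DistilBERT model would replace the TF-IDF-plus-classifier pipeline with a fine-tuned transformer encoder, but the same four metrics (accuracy, precision, recall, F1-score) apply to the comparison.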



Published
2023-10-01
How to Cite
Julianda, A. R., & Maharani, W. (2023). Personality Detection on Reddit Using DistilBERT. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 7(5), 1140 - 1146. https://doi.org/10.29207/resti.v7i5.5236
Section
Information Technology Articles