Personality Detection on Reddit Using DistilBERT

Personality is a unique set of motivations, feelings, and behaviors that a person possesses. Personality detection on social media is a widely studied research topic in computer science. The personality models most often used in personality detection research are the Big Five Indicator (BFI) and the Myers-Briggs Type Indicator (MBTI). Unlike the BFI, which classifies personalities based on an individual's traits, the MBTI model classifies personalities based on the type of the individual. MBTI also performs better than the Big Five model in several scenarios. Many studies use machine learning to detect personality on social media, such as Logistic Regression, Naïve Bayes


Introduction
Personality is a unique set of motivations, feelings, and behaviors that a person possesses. Personality strongly influences a person's life, life choices, attitudes, health, and desires [1]. Every person has a different personality. There have been many studies on the personality of an individual or a group of individuals. Personality detection on social media is a widely studied research topic in Computer Science. Shantika et al. detected the Big Five personalities of Twitter users: from 549,151 tweets and 287 Twitter accounts, the largest group of users had the Openness personality (91 users), while the smallest group had the Neuroticism personality (43 users) [2]. Besides the Big Five personality traits, personality can also be identified using the Myers-Briggs Type Indicator (MBTI) model. Extraversion-Introversion (E/I), Sensing-Intuition (S/N), Thinking-Feeling (T/F), and Judging-Perceiving (J/P) are the four binary classes that define MBTI, which are then combined into 16 personality types, while BFI defines a person's personality based on five scalars, namely Extraversion, Neuroticism, Agreeableness, Conscientiousness, and Openness [3]. Unlike the Big Five model, which classifies personalities based on an individual's traits, the MBTI classifies personalities based on the type of the individual [4]. The MBTI personality model performs better than the Big Five personality model in several scenarios [3].
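As an illustration of how the four binary classes combine into 16 personality types, the following minimal sketch enumerates them. The names `DICHOTOMIES`, `MBTI_TYPES`, and `LABEL_TO_ID` are hypothetical and are not taken from the study; the integer encoding is only one possible way to prepare labels for a 16-class classifier.

```python
from itertools import product

# The four MBTI dichotomies, in the conventional letter order.
DICHOTOMIES = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

# Choosing one letter per dichotomy yields one personality type.
MBTI_TYPES = ["".join(letters) for letters in product(*DICHOTOMIES)]

# Hypothetical integer encoding of the types for multi-class classification.
LABEL_TO_ID = {label: idx for idx, label in enumerate(MBTI_TYPES)}
```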
Most of the research on personality detection is conducted on Twitter because it has a public API (Application Programming Interface) that makes it easy for researchers to get quality data [5]. However, only a few studies on personality detection are conducted on the online forum Reddit. According to Proferes et al., Reddit has started to approach Twitter in ease of data collection and provides additional benefits for research [6]. For example, the structure of subreddits makes it easier for researchers to search for data based on topics relevant to their research. Topics obtained on Reddit are also of higher quality because of the moderators and rules that must be obeyed in a subreddit [7]. [10]. Additionally, BERT can process long sentences and can be parallelized thanks to the transformer architecture [11]. Therefore, BERT has an advantage on NLP tasks over other deep learning models, such as RNN and LSTM, which struggle with long-sentence processing and parallelization. However, BERT also has drawbacks in terms of performance because it has many parameters [12]. In 2019, Victor Sanh et al. applied knowledge distillation to BERT to produce DistilBERT, which is 40% smaller and 60% faster than BERT while retaining 97% of the language understanding capability of the original BERT language model [13]. Therefore, given the smaller size and faster performance of DistilBERT compared to BERT while retaining 97% of BERT's language knowledge, and given that previous research mainly uses conventional machine learning methods such as Logistic Regression, AdaBoost, and Decision Tree, as well as BERT, in this study we conduct MBTI personality detection on Reddit using DistilBERT.

Research Methods
The system design process in this study begins with data acquisition. After the data is obtained, it is preprocessed and split before the DistilBERT model is trained. The trained model is then tested, and the test results are analyzed. For more details, see Figure 1. 7 million records of Reddit posts scraped using Google BigQuery [14]. The data consists of three main columns: 'author_flair_text', which is the user's MBTI personality type; 'body', which is the text that the Reddit user posted; and 'subreddit', which denotes the subreddit the post was submitted to. This study used only 120,000 data samples from the original dataset.

Data Preprocessing
Data preprocessing is performed so that the data used to train the model supports efficient training and good model performance. In the system built, data preprocessing is divided into four stages. First, data cleaning removes URLs, emojis, emoticons, and special characters from the text. Examples of the data cleaning stage can be seen in Table 1. The second stage is stopword removal, in which unimportant words such as conjunctions, pronouns, articles, and prepositions are removed. An example of the stopword removal stage can be seen in Table 2. The third stage is lemmatization, which removes inflectional endings from words to return them to their base forms, known as lemmas. An example of the lemmatization stage can be seen in Table 3. The final stage is tokenization. Tokenization separates a text into tokens, which can be words, characters, or subwords [15]. Embedding tokenized sentences allows text data to be converted into numbers that machine-learning models can read. According to Rahman et al., "Word embedding is a method of word insertion, converting words into a continuous vector of a predefined length so that it will not be limited by a larger vocabulary" [16]. BERT uses the WordPiece tokenizer to represent text as tokens, where the first token of every sequence is a special classification token [CLS] and the last token is a separator token [SEP] [10]. The details of BERT tokenization can be seen in Figure 2.
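The data cleaning and stopword removal stages described above can be sketched with the standard `re` module alone. This is a minimal illustration and not the study's code: the regular expressions, the function names `clean_text` and `remove_stopwords`, and the tiny `STOPWORDS` set are all assumptions, and a real run would use a full stopword list such as the one shipped with NLTK. Lemmatization is omitted because it needs an external lexicon.

```python
import re

# Assumed patterns: the study does not give its exact cleaning rules.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Keep ASCII letters, digits, whitespace, and basic punctuation; everything
# else (emojis, other special characters) is replaced by a space.
NON_TEXT_PATTERN = re.compile(r"[^A-Za-z0-9\s.,!?'/-]")

# Tiny illustrative stopword list (hypothetical, far from complete).
STOPWORDS = {"a", "an", "and", "the", "of", "at", "in", "on",
             "there's", "about", "this", "among"}


def clean_text(text: str) -> str:
    """Remove URLs and non-text symbols, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_TEXT_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated words that appear in the stopword list."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
```

Splitting on whitespace keeps punctuation attached to words, so "moment." is not matched against the stopword list; a word-level tokenizer would behave differently.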
According to Song et al., "Given a vocabulary, WordPiece tokenizes words by repeatedly choosing the longest prefix of the remaining text that matches a vocabulary token until the entire word is segmented. If a word cannot be tokenized, the whole word is mapped to a special token" [17]. WordPiece tokenization distinguishes word pieces at the start of a word from word pieces starting in the middle. The latter begin with a special symbol, ## in BERT, called the suffix indicator. For data preprocessing, we used Python libraries such as re, emojis, and NLTK. For word tokenization, we used the Python transformers module.
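The longest-prefix matching described in the quotation can be sketched for a single word as follows. This is a simplified illustration rather than the transformers implementation; the function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and the unknown token is assumed to be `[UNK]`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it matches the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, far smaller than BERT's real one.
TOY_VOCAB = {"un", "##aff", "##able", "play", "##ing"}
```

With this toy vocabulary, `wordpiece_tokenize("unaffable", TOY_VOCAB)` returns `["un", "##aff", "##able"]`, and a word with no matching prefix returns `["[UNK]"]`.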

Data Splitting
Data splitting is a method of dividing data into two or more subsets. The most common split ratios are 80:20, 70:30, and 60:40 [18]. In this study, 80% of the data is used as training data to train the system and the remaining 20% as testing data to evaluate the trained system.
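A minimal sketch of an 80:20 split is shown below. It assumes a simple random shuffle with a fixed seed; the study does not state whether the split was random or stratified by class, and the function name and `seed` value are hypothetical.

```python
import random


def train_test_split(records, test_ratio=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

Applied to 120,000 samples, this yields 96,000 training and 24,000 testing records.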

Model Fine-Tuning
BERT, or Bidirectional Encoder Representations from Transformers, is a Natural Language Processing (NLP) language model designed to pre-train deep bidirectional representations from unlabeled data [8]. DistilBERT can relate each word to the other words in a sentence because it uses a self-attention mechanism. Self-attention is a mechanism that gives a weight (attention) to each input word, represented as a vector, by using the position of each input to extract the relationships between words in a sentence. An illustration of self-attention can be seen in Figure 3. The way attention works can be summarized as follows: each input vector (X) is mapped through three matrices, namely query (Q), key (K), and value (V) [11]. The self-attention value/score (Z) is obtained by applying softmax to the scaled dot product of the query and key and then multiplying the result by the value.
Then the attention value is used in the DistilBERT feed forward stage.
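For reference, the attention computation summarized above can be written in the standard scaled dot-product form used by transformer models. The projection matrices W^Q, W^K, W^V and the key dimension d_k are notation introduced here for illustration and do not appear in the original text.

```latex
\[
Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}
\]
\[
Z = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The softmax is applied row by row, so each row of the weight matrix sums to one and each output vector in Z is a weighted average of the value vectors.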
With BERT, we do not need to train from scratch because BERT has already been pre-trained on a large dataset for a simple NLP task of predicting missing words in a text, so we only need to freeze the weights of BERT and train the additional layers [19]. Figure 4 shows how BERT is pre-trained and fine-tuned: on the left is the pre-training process, and on the right is the fine-tuning process.

Evaluation
The evaluation was performed to measure the quality of the model's predictions. The evaluation phase has four scenarios, which examine how different preprocessing steps affect the overall performance of DistilBERT. The first scenario fine-tunes the model with fully preprocessed data (data cleaning, stop word removal, and lemmatization). The second fine-tunes the model with data cleaning as the only preprocessing step. The third fine-tunes the model with data cleaning and stop word removal. Lastly, the fourth scenario fine-tunes the model with data that went through data cleaning and lemmatization. The model is evaluated using a confusion matrix for multi-class classification to obtain accuracy, precision, recall, and f1-score. The multiclass confusion matrix can be seen in Figure 5.

Results and Discussions
Before the data is used for fine-tuning the model, we first examine its class distribution. The class distribution can be seen in Figure 6.

Model Performance Evaluation
The results of the evaluation are as follows. The first scenario, in which the model was fine-tuned on fully preprocessed data (cleaned of URLs, emojis, special characters, and stop words, and then lemmatized), scored an accuracy of 0.336. The second scenario, in which the data only went through the data cleaning stage, scored an accuracy of 0.347, the highest of all the scenarios. The third scenario, in which the data was preprocessed with data cleaning and stop word removal, scored the lowest accuracy. Lastly, the fourth scenario, in which the data was preprocessed with data cleaning and lemmatization, scored an accuracy of 0.345, the second highest of all the scenarios. The full classification report is shown in Table 4. The scenario with the highest precision is scenario two, with a score of 0.347, while scenario three has the lowest precision, 0.335. Scenario four has the highest recall, 0.352, while scenario one has the lowest, 0.341. Lastly, scenarios two and four have the highest f1-score of all the scenarios, 0.348, while scenario one has the lowest, 0.338. Based on these results, it can be observed that removing stop words reduces the model's performance, and that with lemmatization the recall of the model improves while the precision decreases very slightly in our case.
The low scores in the classification report are due to the imbalanced data, as seen in Figure 6. In addition, the training and testing data still contain noise, such as imperfectly removed URLs, non-ASCII characters, and posts consisting of only 1-2 words, which provide too little context to train the model. As Table 4 shows, the preprocessing scenario has very little effect on model performance because DistilBERT's transformer architecture uses all the information in a sentence thanks to its ability to process long sentences [11]. In contrast, other machine learning models can use only a little of the information in a sentence because they cannot process long sentences. The most influential preprocessing stage is stopword removal, which reduces the context that DistilBERT can learn from.

Handling Class Imbalance
As Figure 6 shows, the dataset used is very imbalanced. To handle this class imbalance, we used resampling techniques and trained the DistilBERT model with manually predefined class weights rather than randomly assigned ones. The resampling techniques we used were random oversampling (ROS) and random undersampling (RUS). As shown in Figure 7, we oversample our dataset by randomly duplicating examples from the minority classes to even out the data. For undersampling, we do the opposite: we randomly delete examples from the majority classes so that the data is even across the classes. The class data distribution with random undersampling can be seen in Figure 8.
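The two resampling techniques can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the study's code: samples are assumed to be `(text, label)` pairs, every class is resampled to the size of the largest (ROS) or smallest (RUS) class, and the function names and `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate random samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


def random_undersample(samples, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

Oversampling keeps every original sample but repeats minority examples, whereas undersampling discards majority examples, so the two produce datasets of very different sizes.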

Comparing with other methods
Our fine-tuned DistilBERT model is also compared with other machine-learning methods: Naïve Bayes, SVM, and Logistic Regression. We used scenario four as our baseline because it is one of the best scenarios for the DistilBERT model. The performance comparison between these methods and the DistilBERT model can be seen in Table 6. The model with the highest performance is DistilBERT. Its transformer-based architecture can learn from data in parallel and, through the attention mechanism, can map the correlation between words in a sentence. This lets the model learn from the data more efficiently than traditional machine learning models, which learn directly from the data without considering the correlation between words in a sentence and without the ability to parallelize [11].

Conclusion
The scenarios that produce the best-performing DistilBERT models for MBTI personality detection on Reddit are scenarios two and four, where the data went through only data cleaning (scenario two) or data cleaning and lemmatization (scenario four). It can also be observed that removing stop words lowers the model's overall performance, as seen in scenarios one and three. The dataset used in this study is also imbalanced; because of that, we tried applying imbalanced data handling techniques such as random oversampling, random undersampling, and class weighting. With data imbalance handling, the model performs worse than the original model with random class weight assignments and without over/undersampling. As a comparison, we also compared our best model with several machine learning classifiers: Naïve Bayes, SVM, and Logistic Regression. The comparison shows that DistilBERT is the best-performing model because it can learn from the training data in parallel and because each word in the text is given a correlation weight with respect to the other words. For further research, we recommend using more data samples, a better data preprocessing strategy to eliminate data noise, and a balanced dataset to fine-tune the model.

Figure 1. Research stages

Vol. 7 No. 5 (2023) DOI: https://doi.org/10.29207/resti.v7i5.5236 Creative Commons Attribution 4.0 International License (CC BY 4.0)

Data cleaning:
Text: I'm out of the office from June 3 to June 9. I'll be happy to answer your message once I return! ✈️
Cleaned Text: I'm out of the office from June 3 to June 9. I'll be happy to answer your message once I return!

Stopword removal:
Text: midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs Shears' house
Cleaned Text: lying grass middle lawn Shears' house
Text: Drought and there's a disease spreading among the lettuce crops at the moment. Estimated about 1/3 of the yield this fall
Cleaned Text: Drought disease spreading lettuce crops moment. Estimated 1/3 yield fall

Lemmatization:
Text: Drought and there's a disease spreading among the lettuce crops at the moment. Estimated about 1/3 of the yield this fall
Cleaned Text: Drought and there 's a disease spread among the lettuce crop at the moment. Estimate about 1/3 of the yield this fall

Figure 4. BERT pre-training and fine-tuning process [10]

Because of its large number of parameters, BERT is very computationally heavy. For this reason, a smaller and faster version of BERT called DistilBERT was developed [13]. The model used in this study is DistilBERT, fine-tuned with the recommended configuration of a batch size of 16-32, a learning rate between 2e-5 and 5e-5, and 2-4 epochs [10]. To fine-tune DistilBERT, we used the Python transformers module. The training stage produces a DistilBERT classification model, which is then tested and analyzed on the testing data.

Figure 5. Multiclass confusion matrix [20]

TP or True Positive is a positive prediction result with positive actual data, TN or True Negative is a negative prediction result with negative actual data, FP or False Positive is a positive prediction result with negative actual data, and FN or False Negative is a negative prediction result with positive actual data. Lastly, C is a class or label from the dataset. From Figure 5, we can calculate accuracy, precision, recall, and f1-score to evaluate the performance of the model, which can be seen in Formula 1, 2, 3 and 4 [21]:

\[ \text{Accuracy} = \frac{\sum_{i=1}^{N} TP(C_i)}{\sum_{i=1}^{N} \sum_{j=1}^{N} c_{i,j}} \qquad (1) \]

where N is the number of classes and c_{i,j} is the entry in row i and column j of the confusion matrix.
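As a worked illustration of these metrics, the sketch below derives accuracy and per-class precision, recall, and f1-score from a multiclass confusion matrix using their standard definitions. It assumes rows hold actual classes and columns hold predicted classes, which the text does not specify, and it leaves out the averaging of per-class values because the averaging method used in the study is not stated. The function name is hypothetical.

```python
def classification_metrics(cm):
    """Return (accuracy, [(precision, recall, f1), ...]) for a square matrix.

    Assumes cm[i][j] counts samples of actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Correct predictions lie on the diagonal.
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, actually other
        fn = sum(cm[i]) - tp                       # actually i, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    return accuracy, per_class
```

For the toy matrix `[[5, 1, 0], [2, 3, 1], [0, 1, 7]]`, 15 of the 20 samples lie on the diagonal, so the accuracy is 0.75.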

Figure 6. Class data distribution

As shown in Figure 6, the dataset used is imbalanced. The personality types INTP and INTJ have the most records in the dataset, while ESFJ has the fewest. The models in all scenarios are fine-tuned with a batch size of 32, a learning rate of 2e-5, and 2 epochs.
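To give a sense of the training budget these settings imply, the following sketch computes the number of optimizer steps. It assumes that all 120,000 samples survive preprocessing and that exactly 80% of them are used for training; both are assumptions, and the dictionary keys are hypothetical names rather than parameters of any specific library.

```python
import math

# Hyperparameters reported for every scenario; key names are illustrative.
FINE_TUNE_CONFIG = {
    "batch_size": 32,
    "learning_rate": 2e-5,
    "epochs": 2,
}

TOTAL_SAMPLES = 120_000  # samples used in the study
TRAIN_PERCENT = 80       # 80:20 train/test split

# Integer arithmetic avoids floating-point rounding in the sample count.
train_samples = TOTAL_SAMPLES * TRAIN_PERCENT // 100
steps_per_epoch = math.ceil(train_samples / FINE_TUNE_CONFIG["batch_size"])
total_steps = steps_per_epoch * FINE_TUNE_CONFIG["epochs"]
```

Under these assumptions there are 96,000 training samples, 3,000 steps per epoch, and 6,000 steps in total.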

Figure 7. Class data distribution with ROS

Figure 8. Class data distribution with RUS

To determine the initial class weights of the model, we used the class weight utility of the scikit-learn Python library to compute class weights that even out the bias in the initial learning step of the model. The results of fine-tuning the model with the resampled data and manual class weight assignment on the fourth scenario can be seen in Table 5.
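The sketch below shows how such class weights can be computed. It reproduces the "balanced" heuristic of scikit-learn's class weight utility, n_samples / (n_classes * class_count), in plain Python; whether the study used this mode is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute weight[c] = n_samples / (n_classes * count[c]) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on minority personality types contribute more to the training loss.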
Personality detection on social media often uses machine learning and deep learning methods to classify sentiment. In a study on personality detection on Twitter conducted by Ayu and Maharani, the IndoBERT method had the highest accuracy value among the two classification scenarios

Alif Rahmat Julianda, Warih Maharani
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 7 No. 5 (2023)

BERT, or Bidirectional Encoder Representations from Transformers, is a Natural Language Processing (NLP) language model designed to pre-train deep bidirectional representations from unlabeled data. It allows pre-trained BERT models to be fine-tuned for diverse NLP tasks

Table 1. Data cleaning example

Table 4. Classification report for each scenario

Table 5. Classification report with class imbalance handling

Table 6. Comparing DistilBERT with other methods