FUDMA Journal of Sciences (FJS)
ISSN online: 2616-1370
ISSN print: 2645-2944
Vol. 5 No. 2, June, 2021, pp 26 - 33
DOI: https://doi.org/10.33003/fjs-2021-0502-519

A COMPREHENSIVE DATA ANALYSIS ON FUDMA ASUU WHATSAPP GROUP CHAT

*1Abubakar Ahmad, 1Nuhu Abdulhafiz A. and 2Abubakar Abdulkadir

1Department of Computer Science and Information Technology, Federal University Dutsin-Ma, Katsina State.
2Department of Computer Science, Isah Kaita College of Education Dutsin-Ma, Katsina State

*Corresponding Author's email: abubakarahmad82@gmail.com; aahmad1@fudutsinma.edu.ng

ABSTRACT
WhatsApp is an instant messaging application for real-time information exchange. It is a medium for communication and interaction among individuals, groups, institutions and business partners. WhatsApp generates an enormous amount of information in velocity, volume and variety, which can serve as a source for various analyses, predictions and other purposes. In this paper, a dataset was collected from a WhatsApp group chat, FUDMA ASUU MATTERS (FAM), a chat group of lecturers from the Academic Staff Union of Universities (ASUU), Federal University Dutsin-Ma, Katsina State, Nigeria. The primary goal is to present a detailed analysis of the WhatsApp group chat to ascertain the level of involvement and participation by members of the group. Facts such as the number of messages sent in different formats, the most active date and time, and the most active user(s) are investigated. A text classification method implemented in Python in a Jupyter notebook was used. The Python libraries applied include NumPy, Pandas, Matplotlib and Seaborn. The results show that participation is highly uneven: the top ten members alone accounted for more than half of the cumulative messages sent over a period of fourteen months.
The research encourages members to be actively involved instead of allowing a few members to dominate the platform. It is better to be an active contributor than to remain a passive onlooker.

Keywords: Data Mining, Sentiment Analysis, WhatsApp, Emoji, Python libraries

INTRODUCTION
Recently, the internet and its associated technologies have continued to revolutionize information exchange. Social media platforms such as WhatsApp, Facebook, Twitter and blogs are becoming indispensable for all forms of communication among individuals, corporate bodies, organizations and governments, among others. The need to analyze the huge volume of data generated, whether to determine views and sentiments (for example, whether the information shared is relevant) or to predict the occurrence of an event, cannot be overemphasized. In most cases, messages exchanged by group members appear irrelevant, as investigated by Ahmad et al. (2021). Their findings revealed that, of over sixteen thousand messages posted over a period of fourteen months, only 8.7% were relevant, a very small share compared with the 43.3% found to be irrelevant. The result further showed that messages such as jokes, spam, forwarded messages, greeting messages, blank emotions, devotional messages, social event wishes and personal comments dominate most social network groups (Ahmad et al., 2021). These and other factors compel groups to set rules and regulations to guide the exchange of information, with limited or no compliance by recalcitrant members.

The process of discovering and extracting sentiment or opinion from text is called opinion mining or knowledge discovery. Opinion mining is a type of natural language processing used to track people's opinions about a particular event or subject (Harshal et al., 2018).
Opinion mining is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc., is positive, negative, or neutral (Oxford, 2019). The goal of opinion mining is to determine the viewpoint of an individual or a group of people towards a topic or an event by analyzing their views on social networking sites. This is possible through knowledge discovery, which is synonymous with data or text mining in databases. It is a way of discovering hidden patterns or beneficial knowledge from data. The mining of huge datasets, also called Big Data, is a trending area of research involving methods at the intersection of machine learning, statistics and mathematics. Today, companies, organizations and individuals leverage mining techniques to look for patterns in large batches of data in order to improve their business and marketing strategies, learn about customers, increase sales and decrease costs, among other benefits (Court, 2015).

Classification is the process of sorting and categorizing data into various groups, forms or other distinct classes. It separates and categorizes data according to the requirements of the data set for various purposes, so that a target class can be assigned to each and every case in the data. The algorithms driving this process are termed ‘classifiers’. In machine learning and statistics, classification is a supervised learning approach in which a program learns from the data given to it and then uses that experience to classify new observations.

The goal of this paper is to present a detailed analysis of the WhatsApp group chat to ascertain the level of involvement and participation by members of the group.
To achieve this, we analyzed the FAM WhatsApp group dataset, extracted the thoughts and opinions lecturers expressed through their posts, and classified the posts into four categories: text messages, media messages, website links and emojis. The results were used to determine the level of participation of members in the group chat. The analysis covered the most active date in the group, the number of messages sent on the most active date, the overall most active user, the total number of users, the number of posts made by each individual, and the most used word on the platform. In addition, the top 10 users were analyzed and the results visualized using Python code in a Jupyter notebook.

The remaining parts of this paper are organized as follows. Section 2 reviews literature related to this paper. Section 3 describes the methodology for WhatsApp sentiment analysis used in this research. Section 4 presents the experimental results obtained on the FAM WhatsApp dataset. Finally, Section 5 concludes and offers some recommendations.

RELATED WORK
A great deal of work has been done on sentiment analysis, particularly on classifying Twitter messages as positive, negative or neutral, and on prediction. Sentiment analysis has been handled as a Natural Language Processing task at many levels of granularity: word level, sentence level and document level (Liu, 2010). Sentiment classification techniques can broadly be grouped into the Machine Learning (ML) approach, the lexicon-based approach and the hybrid approach (Diana & Adam, 2011). The ML text classification approach is divided into supervised and unsupervised learning methods. Supervised methods exploit a large number of labeled training documents and use classifiers such as Decision Tree, Linear, Rule-based and Probabilistic classifiers.
The unsupervised method is used when it is difficult to find labeled training documents (Li & Tsai, 2011). The lexicon-based approach relies on a sentiment lexicon, a collection of known and precompiled sentiment terms. It is divided into the dictionary-based approach and the corpus-based approach. The former finds opinion seed words and then searches a dictionary for their synonyms and antonyms, while the latter uses statistical or semantic methods to find sentiment polarity. Combining the two approaches (machine learning and lexicon-based), referred to as the hybrid approach, gives better classification results (Mudinas et al., 2012). Concept-level sentiment analysis systems such as pSenti integrate lexicon-based and learning-based approaches to opinion mining. The system achieved higher accuracy in sentiment polarity classification as well as sentiment strength detection compared with pure lexicon-based systems on two real-world data sets (CNET software reviews and IMDB movie reviews). The proposed hybrid approach also outperformed state-of-the-art systems like SentiStrength (Mudinas et al., 2012).

Kontopoulos et al. (2013) proposed ontology-based techniques for more efficient sentiment analysis of Twitter posts in the domain of smartphones. Their architecture gives a more detailed analysis of post opinions regarding a specific topic. Other techniques not categorized as either ML or lexicon-based include Formal Concept Analysis (FCA), a mathematical approach used for structuring, analyzing and visualizing data, and Fuzzy Formal Concept Analysis (FFCA), developed to deal with uncertain and unclear information (Walaa et al., 2014). In related research, multiple techniques are combined to improve performance.
Thakare and Sachin (2016) combined lexicon-based and machine learning approaches and achieved improved precision and recall on Twitter data. Generally, the bag-of-words model has been used for mining sentiments online. In this approach, individual words are considered instead of complete sentences. Traditional machine learning algorithms such as Support Vector Machines, Naive Bayes and Maximum Entropy are therefore commonly used to solve such classification problems, with features such as unigrams, n-grams and Part-Of-Speech (POS) tags (Paridhi et al., 2018). Other researchers proposed enhancements such as an ensemble model aimed at improving the performance metrics of existing algorithms like Naive Bayes, SVM and the Linear Regression model (Mohanavalli et al., 2018). Bhattacharjee et al. (2019) worked on boosting TF-IDF for effectiveness. Darwich et al. (2019) presented a comprehensive review of notable research on the corpus-based approach to sentiment lexicon generation. The authors arrived at eight reasons in favor of the corpus-based approach over the dictionary-based approach and concluded that corpus-based techniques are a vital part of any modern sentiment analysis system. More recently, Cambria et al. (2020) built a new version of SenticNet, called SenticNet 6, a commonsense knowledge base for sentiment analysis that integrates top-down and bottom-up learning via an ensemble of symbolic and subsymbolic AI tools for logical reasoning within deep learning architectures. The system exploits both Artificial Intelligence and Semantic Web technologies to recognize meaningful patterns in natural language text and represent them in a knowledge base using symbolic logic.

In this research, the authors adopted the supervised machine learning method of text classification in a unique way. Most existing research focused on binary classification of sentiments, i.e.
whether sentiments are positive or negative, or even neutral in some cases. The authors' approach, however, seeks to find the level of involvement and participation by members in a chat group, and to uncover insights such as the most used form of messaging, the most active date and time of day, and the most active users. To accomplish that, the dataset created in the previous work of Ahmad et al. (2021) was adopted, excluding the “Remark” column, which contained the labeled attribute in the dataset.

METHODOLOGY
In this paper, the authors exploit a text classification method. There are many kinds of supervised classifiers in the literature. Some of the most frequently used in sentiment analysis are Decision Tree, Linear, Rule-based and Probabilistic classifiers. Since the focus of this paper is not prediction, attention is not paid to classifiers. The text classification steps in this research are: collection of data from the FAM WhatsApp group, transformation of the data, and then exploration, analysis and eventual visualization of the data using Python in a Jupyter Notebook. The processes are depicted in Figure 1 below. The different phases of text classification are explained in the following subsections.

Figure 1: Text Classification Technique showing different Processes

Data Collection Process
The first step was to obtain the data from the target source. WhatsApp provides a feature for exporting chats in .txt format; this was used on the chat group (FAM). To export the WhatsApp data file, open the FAM group page, click on settings, select export data and then select without media. Media files were thus excluded to narrow the scope of the work.
Additionally, exporting along with the media files would produce a larger volume of data and lengthen data collection. The text file is then loaded into a pandas data frame. Figure 2 shows how the file looks in .txt format.

Figure 2: FAM WhatsApp Group Chats exported to Notepad

Data Preparation and Transformation
At this stage, the dataset was uploaded to Jupyter Notebook for exploration, analysis and presentation. Details of the visualization are presented in the ‘Results and Discussion’ section of the paper. The plain text file is parsed and tokenized so that it can be stored in a pandas data frame. The raw messages are broken down into four tokens, date, time, phone number (author) and message, which form the table head. A function detects whether each line begins with a date and time and then splits the line into date, time, author and message tokens. Figure 3 shows how the table is structured in the pandas data frame.

Figure 3: FAM WhatsApp Group Chats in data frame

The data frame holds the date, time, phone and message tokens of each line. The data available to this research cover the period from 17th August 2018, when the group was created, to 21st October 2019, when the group migrated from WhatsApp to Telegram. A total of sixteen thousand, three hundred and ninety-nine (16,399) messages were generated and transformed to form the dataset for this research. It is important to note that the chat file was exported without media. Therefore, the phrase “<Media omitted>” appears in place of media messages. The other messages are text, emoji and link messages. The implementation tools are explained in the following subsection.
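The parsing step described under Data Preparation and Transformation can be sketched as follows. This is a minimal illustration and not the authors' code: the line layout in the comment, the pattern `LINE_START` and the function name `parse_chat` are all assumptions, since the paper does not reproduce the exact export format, and real WhatsApp exports vary with phone locale and operating system. The sketch uses only the standard library so that it runs without extra dependencies.

```python
import re

# ASSUMPTION: Android-style export lines such as
#   17/08/2018, 10:34 - +234 XXXXXX 7330: Good morning
# Real exports differ by phone locale and OS, so adjust the pattern as needed.
LINE_START = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{2,4}), "
    r"(?P<time>\d{1,2}:\d{2}(?:\s?[AaPp][Mm])?) - "
    r"(?P<rest>.*)$"
)


def parse_chat(lines):
    """Split exported chat lines into date, time, author and message tokens."""
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_START.match(line)
        if match:
            # A new message: separate the author from the text at the first ": ".
            author, sep, message = match.group("rest").partition(": ")
            if not sep:
                # No "author: " prefix, e.g. a system notice about the group.
                author, message = None, match.group("rest")
            rows.append({
                "date": match.group("date"),
                "time": match.group("time"),
                "author": author,
                "message": message,
            })
        elif rows:
            # No leading date and time: continuation of the previous message.
            rows[-1]["message"] += "\n" + line
    return rows
```

The returned list of dictionaries can be passed to `pandas.DataFrame(rows)` to obtain a four-column data frame. Treating lines without a leading date and time as continuations keeps multi-line messages together instead of dropping them.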
Data Mining Platform and Tools
The Python programming language was used in this research not only because of its simplicity and popularity but also because of its extensive collection of libraries for tasks such as data imputation, transformation, exploration and visualization. A Python package is a directory of Python scripts; each script is a module that can define functions, methods or new Python types for a particular functionality. NumPy was used to handle the multidimensional arrays and functions required for classifying chats into days, hours, minutes and seconds. Pandas was used for data extraction, manipulation, cleaning and analysis, and was well suited to the kind of data used in this research. Another package used is Matplotlib; matplotlib.pyplot is a plotting library for 2D graphics, and it was used for the various graphs plotted in this work. In addition, Seaborn, a popular data visualization library, was imported. Being a higher-level library built on top of Matplotlib rather than a standalone tool, it was used to extend and better style the plots.

Ethical Considerations
The WhatsApp file contains phone numbers and messages sent by members of the group. There are therefore privacy considerations that must be taken into account. Most notably, individual phone numbers should not be collected and/or released to the public. The authors sought the consent of the administrators to use the information for research purposes only. As a result, the identities of individual members are hidden.

RESULTS AND DISCUSSION
Figures 4a and 4b show the distribution of the various message types across all posts in the group chat. Emojis accounted for 63.4% of all chats, amounting to 10,393 messages. This is the highest share by a very wide margin, almost double the other message types put together.
Media messages followed, with a total of 2,264 messages accounting for 13.8% of all messages. Text messages, which ought to be at the forefront, came next with 12.8% (2,096 messages), a little behind media posts. Link posts were the fewest, with a total of 1,646 messages or 10% of all posts. This leads to the conclusion that the FAM group chat was mainly used to communicate via emoji posts over the period reviewed, while text, media and link posts were used to a lesser extent.

Figure 4a: Bar Chart showing distribution of various message types

Figure 4b: Pie Chart showing percentage distribution of message types

Furthermore, the frequency of posts on particular dates was investigated. Figure 5a illustrates the dates (y-axis) against the relative frequency of posts (x-axis). Most posts were sent on 15th and 20th December 2018, with over 350 messages and about 300 messages respectively. Further checks on the records revealed that this was the period when funds for Earned Academic Allowance (EAA) were released to universities. The graph as a whole presents the top ten most active dates over the period under review. This finding supports the statement that the FAM WhatsApp group constitutes a very fast communication medium among members.

Similarly, the frequency of posts at particular times of day was investigated. Figure 5b illustrates the times of day in hours and minutes (y-axis) against the relative frequency of posts (x-axis). Most posts were sent around 10:34 AM and 2:36 PM. The chart shows that members communicate more after 10:30 AM in the morning, around 2:36 PM in the afternoon and mostly around 9:54 PM at night.
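The ranking of the most active dates and times described above amounts to a frequency count over one field of each message. A minimal sketch, assuming each message is available as a dictionary with `date` and `time` string fields (hypothetical field names, and a hypothetical helper `top_active`, neither taken from the paper):

```python
from collections import Counter


def top_active(rows, key, n=10):
    """Return the n most frequent values of `key` with their message counts."""
    counts = Counter(row[key] for row in rows)
    return counts.most_common(n)


# Toy data for illustration only; these are not messages from the FAM dataset.
toy_rows = [
    {"date": "15/12/2018", "time": "10:34"},
    {"date": "15/12/2018", "time": "14:36"},
    {"date": "20/12/2018", "time": "10:34"},
    {"date": "15/12/2018", "time": "10:34"},
]
print(top_active(toy_rows, "date", n=2))
print(top_active(toy_rows, "time", n=1))
```

With the data in a pandas data frame, `df["date"].value_counts().head(10)` gives the same ranking. Counting exact hour-and-minute strings, as the paper's Figure 5b does, produces very fine-grained buckets; grouping by hour would be an alternative design choice.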
Figure 5a: A Bar chart showing the top ten dates with the most frequent posts in FAM Chat group

Figure 5b: A Bar chart showing the top ten times in hours and minutes with the most frequent posts in FAM Chat group

Figure 6a depicts the total messages sent by the top ten authors. Strikingly, of the 16,399 cumulative messages, 11,290 were sent by the top ten authors alone, meaning that well over half of the messages are attributable to just ten members. Even more striking, almost half of the 11,290 messages sent by the top ten members came from the single highest contributing member. Relative to the cumulative total, the highest contributing member accounted for nearly one third (31.2%) of all messages over the period reviewed.

Figure 6a: A Bar chart showing the number of messages sent by the top 10 active authors in FAM Chat group

Figure 6b: A Bar chart showing distribution of messages sent by the top 10 active authors in FAM Chat group

Furthermore, Figure 6b shows that the largest category of messages sent by the top ten authors is emoji, with 48.8% of their total messages (5,507); this holds across authors with few exceptions. Text messages followed with 43.7% (4,934), then media messages with 6.9% (780). Link messages were the fewest with 0.6% (69), as presented in Figure 6b above. This reaffirms that members used emojis more than other forms of messages.
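The per-author breakdown by message type discussed above can be sketched as follows. The paper does not state its classification rules, so everything here is an assumption made for illustration: the precedence media, then link, then emoji, then text; the link pattern; the rough Unicode ranges used to detect emojis; and the names `message_type` and `top_author_breakdown`. Only the “<Media omitted>” placeholder comes from the paper.

```python
import re
from collections import Counter, defaultdict

MEDIA_PLACEHOLDER = "<Media omitted>"  # phrase the export writes for media posts
LINK_RE = re.compile(r"https?://|www\.", re.IGNORECASE)  # assumed link rule
# ASSUMPTION: rough emoji ranges (pictographs, symbols, dingbats, flags).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]")


def message_type(message):
    """Label a message as media, link, emoji or text (assumed precedence)."""
    text = message.strip()
    if text == MEDIA_PLACEHOLDER:
        return "media"
    if LINK_RE.search(text):
        return "link"
    if EMOJI_RE.search(text):
        return "emoji"
    return "text"


def top_author_breakdown(rows, n=10):
    """Count message types for the n most active authors, with overall share."""
    totals = Counter(row["author"] for row in rows)
    by_type = defaultdict(Counter)
    for row in rows:
        by_type[row["author"]][message_type(row["message"])] += 1
    grand_total = sum(totals.values())
    table = []
    for author, total in totals.most_common(n):
        table.append({
            "author": author,
            "text": by_type[author]["text"],
            "media": by_type[author]["media"],
            "emoji": by_type[author]["emoji"],
            "link": by_type[author]["link"],
            "total": total,
            # Share of ALL messages, not only of the top-n subtotal.
            "share_pct": round(100 * total / grand_total, 1),
        })
    return table
```

The share is computed against the cumulative total rather than the top-n subtotal, which is how the paper's figure for the most active member arises: 5,109 of 16,399 messages is 31.2%.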
Details of the statistics are presented in Table 1 below.

Table 1: Detailed statistics of the top ten authors

Author              Text  Media  Emojis  Links  Total
+234 XXXXXX 7330     260     21     251     13    545
+234 XXXXXX 5120     295     24     180      2    501
+234 XXXXXX 3141     372     18     278      4    672
+234 XXXXXX 2270     346     13     114      7    480
+234 XXXXXX 0804     345     17       3      0    365
+234 XXXXXX 2802     349     15      74      4    442
+234 XXXXXX 5540     429     63     504      5   1001
+234 XXXXXX 4266     444     93     658      4   1199
+234 XXXXXX 0879     658     83     232      3    976
+234 XXXXXX 5797    1436    433    3213     27   5109
Total               4934    780    5507     69  11290

Figures 7a and 7b present the distribution of the number of words and letters respectively. The charts indicate that four of the top ten contributing members stand out in the number of words and letters used throughout the period covered.

Figure 7a: A Bar chart showing the number of words sent by the top 10 active authors in FAM Chat group

Figure 7b: A Bar chart showing the number of letters sent by the top 10 active authors in FAM Chat group

SUMMARY OF FINDINGS
This paper investigated FAM group-based communication. The result shows that, of the 16,399 cumulative messages, only 12.8% were text messages, a small share compared with the 63.4% that were emojis; members therefore used emojis more than any other form of messaging. In addition, 13.8% of the total messages were classified as media messages, indicating that members posted more media than link messages to the group. The result further revealed that heavy message traffic was recorded on 15th and 20th December 2018 as a result of the Earned Academic Allowance (EAA). The result adds that members communicate more after 10:30 AM in the morning, around 2:36 PM in the afternoon and mostly around 9:54 PM at night.
With regard to posting on the FAM platform, a single member alone contributed nearly one third (31.2%) of the 16,399 messages, while the top four members accounted for more than half of all messages. The top ten members contributed 69% of the cumulative messages, confirming that they are the most active members, posting most often on the platform.

Future Work
In the future, there is a need to extend the scope of this research to ascertain the level of non-involvement and inactivity by members in the chat group, and to uncover insights such as the most passive member. The scope of the dataset could be expanded to include multimedia messages such as voice, video and empty messages, among others, for a more detailed analysis.

CONCLUSION
In conclusion, Python in a Jupyter notebook has shown its strength as a tool for analyzing data exported from the WhatsApp application. This work investigated the WhatsApp application, its potential, and how data from the platform can be exported, prepared, transformed, analyzed and visualized. The analysis was done in a Jupyter notebook with the Pandas, NumPy, Matplotlib and Seaborn libraries, using a text classification technique. The results give comprehensive details of the most active user(s), date and time of day, and the most used format for posting messages on the FAM group platform.

RECOMMENDATION
The authors encourage members to keep to the purpose of the group, exchange information from verified sources, and make meaningful and relevant contributions on the platform. Passivity, sycophancy and praise singing by members are discouraged.

REFERENCES
Ahmad, A., Mukhtar, A. & Akinyemi, O. O. (2021, January). Sentiment Analysis and Classification of ASUU WhatsApp Group Post Using Data Mining. Journal of Conflict Resolution and Social Issues, 1(2), 18-27.

Bhattacharjee, U., Srijith, P. K. & Maunendra, D. (2019). Term Specific TF-IDF Boosting for Detection of Rumours in Social Networks. In D. o. Engineering (Ed.), Proceedings of the Sixth Social Networking Workshop, SN@COMSNETS 2019, Bengaluru, 116, pp. 10-19. IIT Hyderabad, India.

Cambria, E., Yang, L., Xing, Z. F., et al. (2020). SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis. CIKM (pp. 105-114). Ireland.

Court, D. (2015). Marketing & Sales Big Data, Analytics, and the Future of Marketing & Sales. McKinsey & Company.

Darwich, M., Mohd, N., Shahrul, A., Omar, N. & Osman, N. (2019). Corpus-Based Techniques for Sentiment Lexicon Generation: A Review. Journal of Digital Information Management, 17, 296. 10.6025/jdim/2019/17/5/296-305.

Diana, M. & Adam, F. (2011). Automatic detection of political opinions in tweets. Proceedings of the 8th international conference on the semantic web, European Semantic Web Conference (ESWC’11) (pp. 88-99). Springer.

Harshal, K., Kalyani, G. & Tanmay, S. (2018, March 03). A review on: Sentiment polarity analysis on Twitter data from different Events. International Research Journal of Engineering and Technology (IRJET), 05(03), p. 1479.

Kontopoulos, E., Berberidis, C., Dergiades, T. & Bassiliades, N. (2013). Ontology-based sentiment analysis of twitter posts. Expert Systems with Applications, 40(10), 4065-4074.

Li, S. & Tsai, F. (2011). Noise control in document classification based on fuzzy formal concept analysis. Presented at the IEEE International Conference on Fuzzy Systems (FUZZ). IEEE.

Liu, B. (2010). Sentiment Analysis and Subjectivity. In Handbook of Natural Language Processing, Second Edition. Taylor and Francis Group, Boc.

Mohanavalli, S., Karthika, S., Srividya, K. R. & Uthayan, N. S. (2018). Categorisation of Tweets Using Ensemble Classification Methods. International Journal of Engineering & Technology, 7(3.12), 722-725.

Mudinas, A., Zhang, D. & Levene, M. (2012). Combining lexicon and learning based approaches for concept-level sentiment analysis. Presented at WISDOM’12, Beijing, China.

Neumann, G. (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Text. From Data and Information Analysis to Knowledge Engineering (pp. 390-397). Springer, Berlin, Heidelberg.

Oxford. (2019). Oxford online dictionary. Accessed 12:53 PM, 16th October 2019: https://www.lexico.com/en/definition/sentiment_analysis.

Paridhi, P. N., Dinesh, D. P. & Yogesh, S. P. (2018). Sentiment Classification of Twitter Data: A Review. International Research Journal of Engineering and Technology (IRJET), 05, pp. 929-931. p-ISSN: 2395-0072: ISO 9001:2008 Certified Journal.

Patil, S. (2016). WhatsApp Group Data Analysis with R. International Journal of Computer Applications, 0975-8887.

Walaa, M., Ahmed, H. & Hoda, K. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5, 1093-1113.