SNET: A statistical normalisation method for Twitter
Citation: Sosamphan, P. (2016). SNET: A statistical normalisation method for Twitter. An unpublished thesis submitted for the degree of Master of Computing, Unitec Institute of Technology, New Zealand.
Permanent link to Research Bank record: http://hdl.handle.net/10652/3508
One of the major problems in the era of big data is how to ‘clean’ the vast amount of data on the Internet, particularly data from the micro-blogging website Twitter. Twitter enables people to connect with their friends, colleagues, or even people they have never met before. Twitter, one of the world’s biggest social media networks, has around 316 million users and 500 million tweets posted per day (Twitter, 2016). Undoubtedly, social media networks create huge opportunities for businesses to build relationships with customers, gain more insight into them, and deliver more value to them. Despite these advantages, comments posted on Twitter (called tweets) may be of little use if they contain irrelevant or incomprehensible information, which makes them difficult to analyse. Tweets are commonly written in ‘ill-forms’, such as abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ pose a text normalisation challenge: selecting the proper methods to detect the ill-formed words and convert them into the most accurate English sentences. Several text cleaning techniques have been proposed to address these issues; however, they have limitations and still do not achieve good results overall. In this research, our aim is to propose the SNET, a statistical normalisation method that cleans noisy tweets (those containing abbreviations, repeated letters, and misspelled words) at character level by combining different techniques to produce more accurate and cleaner data. Existing techniques are evaluated in order to find the combination that solves all three problems with high accuracy. This research proposes that abbreviations are converted to their standard form by using an abbreviation dictionary lookup, while repeated characters are normalised by the Natural Language Toolkit (NLTK) platform and a dictionary-based approach.
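The two steps just described can be illustrated with a minimal sketch. It is not the thesis implementation: the `ABBREVIATIONS` table and `VOCABULARY` set are tiny hypothetical stand-ins for a real abbreviation dictionary and an English word list, and the function names are chosen here for illustration only. Repeated characters are removed one at a time until the word is found in the vocabulary; if no known form is reached, the token is left unchanged.

```python
import re

# Hypothetical, deliberately tiny lookup tables (illustration only).
ABBREVIATIONS = {"u": "you", "gr8": "great", "b4": "before", "thx": "thanks"}
VOCABULARY = {"you", "great", "before", "thanks", "are", "so", "good", "cool"}

# Matches a word containing at least one doubled character.
_REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")


def expand_abbreviation(token):
    """Replace a known abbreviation with its standard form."""
    return ABBREVIATIONS.get(token.lower(), token)


def reduce_repeats(token):
    """Drop one repeated character at a time until a known word appears."""
    word = token.lower()
    while word not in VOCABULARY:
        reduced = _REPEAT.sub(r"\1\2\3", word)
        if reduced == word:
            return token  # no known form found; leave the token unchanged
        word = reduced
    return word


def normalise(tweet):
    """Apply abbreviation lookup, then repeated-character reduction."""
    tokens = tweet.split()
    return " ".join(reduce_repeats(expand_abbreviation(t)) for t in tokens)


print(normalise("thx u are sooo goooood"))  # thanks you are so good
```

Checking the vocabulary before each reduction keeps legitimate double letters intact, so a word such as "cool" is not shortened to "col".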
Besides the NLTK, the edit distance algorithm is utilised to solve misspelling problems, while the “Enchant” dictionary can be used to access the spell-checking library. Furthermore, existing models, such as a spell corrector, can be deployed for conversion purposes, while a text cleanser is adopted as the baseline model against which the SNET is compared. In experiments on a Twitter sample dataset, the SNET achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a word error rate (WER) of 7%, both of which are better than the baseline model. Such a method for cleaning tweets can make a significant contribution through its adoption in brand sentiment analysis or opinion mining, political analysis, and other applications that seek to make sound predictions.
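As an illustration of the edit distance and WER measures mentioned above, the following is a minimal pure-Python sketch. It assumes the standard Levenshtein distance and the usual definition of WER as word-level edit distance divided by the reference length; the function names and the small vocabulary are hypothetical, and the sketch uses neither NLTK nor Enchant, which the thesis relies on.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_word(token, vocabulary):
    """Pick the vocabulary word with the smallest edit distance to token."""
    # Sorting makes the choice deterministic when distances tie.
    return min(sorted(vocabulary), key=lambda w: edit_distance(token, w))


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


vocabulary = {"tomorrow", "today", "tonight"}  # hypothetical word list
print(closest_word("tomorow", vocabulary))     # tomorrow
print(word_error_rate("see you tomorrow", "see u tomorrow"))  # one error in three words
```

A lower WER is better, since it counts the fraction of reference words that still need correcting, whereas a higher BLEU score indicates closer agreement with the reference.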