
dc.contributor.author: Sosamphan, Phavanh
dc.date.accessioned: 2016-08-01T23:11:28Z
dc.date.available: 2016-08-01T23:11:28Z
dc.date.issued: 2016
dc.identifier.uri: https://hdl.handle.net/10652/3508
dc.description.abstract: One of the major problems in the era of big data is how to ‘clean’ the vast amount of data on the Internet, particularly data on the micro-blogging website Twitter. Twitter enables people to connect with friends, colleagues, or even people they have never met before. One of the world’s biggest social media networks, it has around 316 million users and 500 million tweets posted per day (Twitter, 2016). Social media networks undoubtedly create huge opportunities for businesses to build relationships with customers, gain more insight into them, and deliver more value to them. Despite these advantages, comments posted on Twitter (called tweets) may not be very useful if they contain irrelevant or incomprehensible information, which makes them difficult to analyse. Tweets commonly contain ‘ill-forms’, such as abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ pose a text normalisation challenge: selecting the proper methods to detect them and convert them into the most accurate English sentences. Several text cleaning techniques have been proposed to solve these issues, but they have limitations and still do not achieve good results overall. This research proposes the SNET, a statistical normalisation method for cleaning noisy tweets at character level (tweets that contain abbreviations, repeated letters, and misspelled words), which combines different techniques to produce more accurate and cleaner data. Existing techniques are evaluated in order to find the combination that solves all of these problems with high accuracy.
This research proposes that abbreviations be converted to their standard form using an abbreviation dictionary lookup, while repeated characters are normalised using the Natural Language Toolkit (NLTK) platform and a dictionary-based approach. Alongside the NLTK, the edit distance algorithm is used to correct misspellings, while the “Enchant” dictionary can be used to access the spell-checking library. Furthermore, existing models, such as a spell corrector, can be deployed for conversion purposes, while a text cleanser is put forward as the baseline model against which the SNET is compared. In experiments on a Twitter sample dataset, the SNET achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a word error rate (WER) of 7%, both of which are better than the baseline model. Such a method for cleaning tweets can make a great contribution through its adoption in brand sentiment analysis or opinion mining, political analysis, and other applications seeking to make sound predictions. [en_NZ]
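The three normalisation steps named in the abstract (abbreviation lookup, repeated-character reduction checked against a dictionary, and edit-distance spelling correction) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the SNET's dictionaries, thresholds, or step order, so `ABBREVIATIONS`, `VOCABULARY`, `max_distance`, and all function names below are hypothetical, and plain standard-library Python stands in for the NLTK and Enchant. The last function shows how WER is conventionally computed, as word-level edit distance divided by the reference length.

```python
import re
from typing import Sequence

# Hypothetical toy resources; the real SNET dictionaries are not in the abstract.
ABBREVIATIONS = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
VOCABULARY = {"you", "are", "so", "cool", "great", "see",
              "tomorrow", "good", "morning"}


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences, using a rolling DP row."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def collapse_repeats(word: str) -> str:
    """Drop one doubled letter at a time until the word is in the vocabulary."""
    while word not in VOCABULARY:
        shorter = re.sub(r"(\w)\1", r"\1", word, count=1)
        if shorter == word:  # no doubled letters left to remove
            break
        word = shorter
    return word


def correct_spelling(word: str, max_distance: int = 2) -> str:
    """Replace an unknown word with its nearest vocabulary entry, if close enough."""
    if word in VOCABULARY:
        return word
    # sorted() makes tie-breaking deterministic (alphabetical).
    best = min(sorted(VOCABULARY), key=lambda known: edit_distance(word, known))
    return best if edit_distance(word, best) <= max_distance else word


def normalise(tweet: str) -> str:
    """Apply abbreviation lookup, then repeat reduction and spelling correction."""
    cleaned = []
    for token in tweet.lower().split():
        if token in ABBREVIATIONS:
            token = ABBREVIATIONS[token]
        else:
            token = correct_spelling(collapse_repeats(token))
        cleaned.append(token)
    return " ".join(cleaned)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(normalise("u are sooo coool"))  # you are so cool
```

With so small a vocabulary, very short unknown tokens can be mapped to the wrong word, which is one reason a real system needs a full dictionary and a tuned distance threshold.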
dc.language.iso: en [en_NZ]
dc.subject: Twitter [en_NZ]
dc.subject: micro-blogs [en_NZ]
dc.subject: noisy tweets [en_NZ]
dc.subject: tweets [en_NZ]
dc.subject: data normalisation [en_NZ]
dc.subject: normalisation [en_NZ]
dc.subject: big data [en_NZ]
dc.subject: text cleansers [en_NZ]
dc.subject: spell checkers [en_NZ]
dc.subject: Natural Language Toolkit (NLTK) [en_NZ]
dc.subject: abbreviations [en_NZ]
dc.subject: social media [en_NZ]
dc.title: SNET : a statistical normalisation method for Twitter [en_NZ]
dc.type: Masters Thesis [en_NZ]
thesis.degree.name: Master of Computing [en_NZ]
thesis.degree.level: Masters [en_NZ]
thesis.degree.grantor: Unitec Institute of Technology [en_NZ]
dc.subject.marsden: 080109 Pattern Recognition and Data Mining [en_NZ]
dc.subject.marsden: 150502 Marketing Communications [en_NZ]
dc.identifier.bibliographicCitation: Sosamphan, P. (2016). SNET: A statistical normalisation method for Twitter. An unpublished thesis submitted for the degree of Master of Computing, Unitec Institute of Technology, New Zealand. [en_NZ]
unitec.pages: 108 [en_NZ]
unitec.institution: Unitec Institute of Technology [en_NZ]
dc.contributor.affiliation: Unitec Institute of Technology [en_NZ]
unitec.advisor.principal: Yongchareon, Dr. Sira
unitec.advisor.associated: Liesaputra, Veronica
unitec.advisor.associated: Mohaghegh, Dr Mahsa

