

Text Mining & Natural Language Processing
Ali Hürriyetoglu, Piet Daas
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat

Outline
- Introduction
- Background
- Basic steps
- Use cases
- Machine learning for text mining

Introduction

What can you do with text mining?
- Named entity recognition
- Sentiment analysis
- Topic detection
- Information extraction
- Trend detection
- Clustering similar documents
- Automatic summarisation

Ingredients of text mining
Text analytics is a function of:
- The amount and type of text you have
- The task you want to achieve
- The precision and recall you want to get
- The time you can spend

Text types
- Semi-structured language use: addresses, phone numbers, named entities, etc.
- Standard text: news articles, books, etc.
- User-generated text: social media, comments

Background

Text
Text is a rich combination of symbols that lead to a structure, which has a context-dependent interpretation.
- Symbols: character, word, punctuation, digit, emoticon
- Structure: tokens, links, user names, hashtags, noun, verb, named entity, emoticon, phrases, codes, etc.
- Context: writer, genre, platform, social environment, time, geographic location, etc.
- Interpretation: sense, meaning, ...

Symbols
- Letters: A B Ç X
- Digits: 1 5 3 2
- Punctuation: . , ! ?
- Emoticons
- Special characters: # &

Structure
- Tokens: any space-separated symbol sequence (for European languages)
- Numbers: 6, 123, ...
- Web-specific tokens: user names, hashtags, URLs, ...
- Abbreviations: vs., etc., ...
- Syntactic interpretation: noun, verb, adjective, ...
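The structure types above can be illustrated with a small token labeller. This is only a sketch: the function name `classify_tokens` and the exact patterns are assumptions made for illustration, and real text would need more careful rules (for example, punctuation attached to a word stays part of the token here).

```python
import re

# Patterns are tried in order; the first full match decides the token type.
# The categories mirror the slide; the patterns themselves are illustrative assumptions.
TOKEN_TYPES = [
    ("url", re.compile(r"https?://\S+")),
    ("hashtag", re.compile(r"#\w+")),
    ("username", re.compile(r"@\w+")),
    ("number", re.compile(r"\d+([.,]\d+)*")),
    ("abbreviation", re.compile(r"(vs|etc|e\.g|i\.e)\.", re.IGNORECASE)),
]

def classify_tokens(text):
    """Split on whitespace and label each token with a coarse type."""
    labelled = []
    for token in text.split():
        label = "word"  # default when no pattern matches
        for name, pattern in TOKEN_TYPES:
            if pattern.fullmatch(token):
                label = name
                break
        labelled.append((token, label))
    return labelled

example = "@eurostat published 123 tables vs. 6 last year #opendata https://ec.europa.eu/eurostat"
for token, label in classify_tokens(example):
    print(f"{label:12} {token}")
```

Splitting on whitespace follows the slide's definition of a token; languages that do not separate words with spaces would need a different first step.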

Context
Anything about the use of a token may have a significant effect:
- The person who uses it
- The aim of the phrase
- The time and place of the language use
- Preceding and following expressions
- ...

Interpretation
Tokens and phrases may have one or more interpretations.
- Ambiguity: the lexical meaning may differ
- Named entities: the same entity name may refer to different real-world entities
- Genre: orders, compliments, statements, instructions, etc.
- Usernames: are interpreted differently on different platforms

Basic steps

Basic steps and tools
You need some combination of:
- Language identification
- Sentence splitting
- Tokenization
- Lemmatization
- Anaphora resolution
- Regular expressions
- POS tagging
- Named entity recognition
- Parsing methodology, Pyparsing
- Language resources: stop words, a sentiment lexicon, multi-word expressions, ontology, etc.
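A few of these steps (sentence splitting, tokenization and stop word filtering) can be chained with the standard library alone. The sketch below is deliberately naive: the stop word list is a tiny made-up sample, and the function names are assumptions for illustration rather than part of any toolkit.

```python
import re

# A tiny illustrative stop word list; a real project would load a fuller language resource.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase the sentence and keep runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

def preprocess(text):
    """Return one list of non-stop-word tokens per sentence."""
    return [
        [tok for tok in tokenize(sentence) if tok not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]

print(preprocess("The price of oil is rising. It fell in 2015!"))
# [['price', 'oil', 'rising'], ['fell', '2015']]
```

The splitter would wrongly break after abbreviations such as "vs.", which is one reason dedicated sentence splitters and tokenizers exist.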

Use cases

Named entities
Problem: You want to know which named entities occur in a text. You do not have much time or resources. An approximate result is sufficient for you.
Solution: Find and count all proper-cased token sequences:
    ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
    ('Sherlock Holmes', 90), ('United States', 71), ('New York', 54), ('New England', 46), ('Baker Street', 29), ...
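As a rough sketch of this approach in Python, the pattern can be applied with `re` and the matches counted with `collections.Counter`. The sample text and the function name `count_proper_sequences` are made up for illustration, and the inner group is written as non-capturing so that each match is the whole sequence.

```python
import re
from collections import Counter

# Two or more consecutive capitalised words, e.g. "Sherlock Holmes".
# The inner group is non-capturing so that each match is the full sequence.
PROPER_SEQUENCE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def count_proper_sequences(text, top=5):
    """Return the `top` most frequent proper-cased token sequences."""
    matches = (m.group(0) for m in PROPER_SEQUENCE.finditer(text))
    return Counter(matches).most_common(top)

# Hypothetical sample text, not the corpus behind the counts on the slide.
sample = (
    "Sherlock Holmes lived in Baker Street. "
    "Dr Watson met Sherlock Holmes in London before leaving for New York."
)
print(count_proper_sequences(sample))
# [('Sherlock Holmes', 2), ('Baker Street', 1), ('Dr Watson', 1), ('New York', 1)]
```

The result is approximate by design: single-word names such as "London" are missed, and a capitalised sentence-initial word followed by a name would be counted as part of the entity.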

Street names
Problem: You have a set of crime reports. You want to know which street names are mentioned most often.
Solution: Write a more specific regular expression:
    [A-Z][a-z]+ [sS]treet
    ('Baker Street', 29), ('Leadenhall Street', 5), ('Fresno Street', 2), ('Fenchurch Street', 2), ('Bow Street', 2), ('Oxford Street', 2), ...
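A minimal sketch of the same idea over a list of reports follows. The reports are invented examples, `count_streets` is a hypothetical helper, and normalising "street" to "Street" before counting is an added assumption, not something the slide specifies.

```python
import re
from collections import Counter

# One capitalised word followed by "Street" or "street".
STREET = re.compile(r"[A-Z][a-z]+ [sS]treet")

def count_streets(reports):
    """Count street names across reports, merging 'street' and 'Street' spellings."""
    counts = Counter()
    for report in reports:
        counts.update(name.title() for name in STREET.findall(report))
    return counts.most_common()

# Invented reports, for illustration only.
reports = [
    "A burglary was reported in Baker Street last night.",
    "Witnesses saw the suspect leave Oxford Street and turn into Baker street.",
    "The shop in Baker Street was closed.",
]
print(count_streets(reports))
# [('Baker Street', 3), ('Oxford Street', 1)]
```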

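Detect economic indicators
Problem: You want to detect and track price changes. You want to be precise. You know what you are looking for and can spend some time specifying it.
Solution: Parse text with Pyparsing*
    from pyparsing import Literal, Word, alphas, oneOf

    # Verbs that signal a change, matched case-insensitively
    action = oneOf(["lower", "increase", "decrease"], caseless=True)
    # Economic terms the change applies to
    econ = oneOf(["prices", "expense", "cost", "price"], caseless=True)
    # Any alphabetic word, e.g. the product or service concerned
    item = Word(alphas)
    # e.g. "increase fuel prices"
    economy_grammar = action("action") + item("item") + econ
    # e.g. "price of fuel increase"
    economy_grammar2 = econ + Literal("of") + item + action
*For R use the tm package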

Sentiment Analysis
Problem: You want to understand how people feel about a certain issue or entity.
Solution 1: Create or use an available sentiment lexicon. Count the number of occurrences of the lexicon entries.
Solution 2: Detailed syntactic and semantic analysis.
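Solution 1 could look roughly like the sketch below. The two word sets are a tiny made-up lexicon and `sentiment_score` is a hypothetical helper; a real analysis would use a published lexicon for the language at hand.

```python
import re

# Tiny made-up lexicon, for illustration only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def sentiment_score(text):
    """Count-based score: positive lexicon hits minus negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(tok in POSITIVE for tok in tokens)
    negative_hits = sum(tok in NEGATIVE for tok in tokens)
    return positive_hits - negative_hits

print(sentiment_score("I love this great phone"))     # 2
print(sentiment_score("Terrible service, bad food"))  # -2
print(sentiment_score("Not good"))                    # 1, because counting ignores negation
```

The last call shows why pure counting is only approximate: handling negation, irony or scope is where the more detailed analysis of Solution 2 comes in.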

Wordclouds
Problem: You have text and want a quick insight into what it mostly contains.
Solution: Word cloud, streamgraph, t-SNE, ...

https://github.com/amueller/word_cloud/blob/master/examples/constitution.png

Track co-evolution of language use
https://blog.twitter.com/2010/the-2010-world-cup-a-global-conversation

Topic modelling
Problem: You need a detailed analysis of the topics in a text collection (corpus).
Solution: Topic modelling

http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html

Machine learning

Machine Learning
You can attempt to solve almost any text mining task with machine learning approaches. The outcome will depend on:
- Feature extraction and selection
- The amount of labeled data, in the case of supervised learning
- The time you have to analyze the output, in unsupervised learning
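To make the supervised case concrete, here is a minimal sketch of a bag-of-words text classifier. The slides do not name a specific algorithm; multinomial Naive Bayes with add-one smoothing is chosen here only because it is short to write. The training texts, labels and class name are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extraction: bag of lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(extract_features(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in extract_features(text):
                if word in self.vocabulary:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented labeled data; four examples are far too few for real use.
texts = ["prices increase again", "cost of fuel rises",
         "team wins the match", "player scores a goal"]
labels = ["economy", "economy", "sport", "sport"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("fuel prices"))      # economy
print(model.predict("the team scores"))  # sport
```

Changing `extract_features` (for example, removing stop words or adding word pairs) or the amount of labeled data changes the predictions, which is the dependency the slide describes.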

Thanks for listening! Any questions or comments?

Exercises
6) Search for key terms on Twitter and collect n tweets (n 200)
7) Determine the most frequent hashtags, links and mentions
8) Create a wordcloud of these tweets
9) Topic detection from tweets (either user or key terms search result)
10) Sentiment analysis: create your own list of 10 positive and 10 negative words and calculate a count-based score
11) Look for an online classifier (for the language of your tweets), get an access key and test it (watch the rate limit), e.g. MonkeyLearn
12) Study emoticons as an example for basic emotions

Additional exercises
Additional tasks:
13) Place name, person name, organisation name, number and date recognition; geolocation/temporal characteristics; finding similar tweets
14) Apply the t-distributed stochastic neighbour embedding (t-SNE) visualization technique to tweets
