Wednesday, June 5, 2019

Constructing Social Knowledge Graph from Twitter Data

Yue Han Loke

1.1 Introduction

The current era of technology allows users to post and share their thoughts, images, and content over networks through various applications and websites such as Twitter, Facebook and Instagram. With social media emerging in our daily lives and data sharing becoming the norm for the current generation, researchers have started to study the data that can be collected from social media [1][2].

This research is dedicated solely to Twitter data because of its wealth of publicly available data and its public Stream API. Tweets can be used to discover new knowledge, such as recommendations and relationships for data analysis. Tweets are short microblog posts of at most 140 characters that range from normal sentences to hashtags, tags with @, short abbreviations of words (gtg, 2night), and variant forms of a word (yup, nope). Observing how tweets are posted shows the noisy and short lexical nature of these texts, which challenges the flexibility of Twitter data analysis. On the other hand, existing research on entity extraction and entity linking has narrowed the gap between the entities extracted and the relationships that can be discovered. Since 2014, the Named Entity rEcognition and Linking (NEEL) Challenge [3] has demonstrated the significance of automated entity extraction, entity linking and classification for event streams of English tweets, encouraging the research and commercial communities to design and develop systems that cope with the challenging nature of tweets and mine semantics from them.

1.2 Final Project Aim

This research aims to construct a social knowledge graph (knowledge base) from Twitter data.
A knowledge graph is a technique for analysing social media networks by mapping and measuring both the relationships and the information flows among groups, organizations, and other connected entities in social networks [4]. Several tasks are required to create a knowledge graph from Twitter data.

The first component in constructing the knowledge graph is extracting named entities such as persons, organizations, locations, or brands from the tweets [5]. In this research, a named entity referenced in a tweet is defined as a proper noun or acronym that is found in the NEEL Taxonomy in Appendix A of [3] and is linked either to an English DBpedia [6] referent or to a NIL referent. The second component is to take those extracted entities and link them to their respective entities in a knowledge base. For example, consider the tweet "The ITEE department is organizing a pizza gettogether at UQ. #awesome". Here ITEE refers to an organization, and UQ refers to an organization as well. The annotation for this tweet is [ITEE, Organization, NIL1], where NIL1 is the unique NIL referent describing the real-world entity ITEE, which has no equivalent entry in DBpedia, and [UQ, Organization, dbp:University_of_Queensland], which represents the RDF triple (subject, predicate, object).

1.3 Project Goals

The first task is obtaining the tweets. This can be achieved by crawling Twitter data with the Public Stream API available on the Twitter developer website. The Public Stream API allows extraction of Twitter data in real time. The second task is entity extraction and typing with the aid of a specifically chosen information extraction pipeline called TwitIE, which is open-source, specific to social media, and has been tested most extensively on microblog sentences.
This pipeline receives the tweets as input and recognises the entities in each tweet.

The third task is to link the entities mined from tweets to the entities in the available knowledge base. The knowledge base selected for this project is DBpedia. If there is a referent in DBpedia, the extracted entity is linked to that referent, and the entity type is retrieved from the category received from the knowledge base. If no referent is available, a NIL identifier is given, as shown in Section 1.2. This task requires selecting an entity linking system with appropriate entity disambiguation and candidate entity generation, one that receives the extracted entities from the same tweet and produces a list of all the candidate entities in the knowledge base. The task is then to link each extracted entity accurately to the correct candidate.

The social knowledge graph is an entity-entity graph that combines two extracted sources of information. The first is the co-occurrence of entities in the same tweet or the same sentence. The second is the existing relationships or categories extracted from DBpedia. The project therefore aims to combine the co-occurrence of extracted entities with the extracted relationships to create a social knowledge graph and to discover new knowledge from the fusion of the two data sources.

Named Entity Recognition (NER) and Information Extraction (IE) are generally well researched for longer text such as newswire. Overall, however, microblogs are possibly the hardest kind of content to process. For Twitter, several methods have been proposed by the research community. For example, [7] uses a pipeline approach that first performs tokenisation and POS tagging and then uses topic models to find named entities. [8] propose a gradient-descent graph-based method for joint text normalisation and recognition, reaching an F1 measure of 83.6%.
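As a concrete illustration of the co-occurrence source named in the project goals, the following is a minimal sketch of how weighted entity-entity edges might be counted. It assumes the entity linker has already produced one list of entity identifiers per tweet; the function name `cooccurrence_edges` and the sample identifiers (including `dbp:Brisbane`) are hypothetical and not taken from the project.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tweet_entities):
    """Count how often each unordered pair of entities appears in the same tweet."""
    edges = Counter()
    for entities in tweet_entities:
        # Deduplicate within a tweet and sort so that (a, b) and (b, a) collapse
        # into a single undirected edge key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical linker output: one list of entity identifiers per tweet.
tweets = [
    ["NIL1", "dbp:University_of_Queensland"],
    ["dbp:University_of_Queensland", "dbp:Brisbane"],
    ["NIL1", "dbp:University_of_Queensland", "dbp:Brisbane"],
]
edges = cooccurrence_edges(tweets)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The edge weights could later be merged with relationships retrieved from DBpedia; how the two sources are fused is left open here, as it is in the goals above.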
Besides that, entity linking in knowledge graphs has been studied in [9] using a graph-based method that collectively infers the referent entities of all named entities in the same document by modelling and exploiting the global interdependence between entity linking decisions. However, the combination of NER and entity linking on tweets is still a new area of research, since the NEEL challenge was first established in 2013. According to the evaluation of the NEEL challenge in [10], the submitted systems used lexical similarity mention detection strategies that exploit the popularity of the entities and apply distance similarity functions to rank entities efficiently, together with n-gram [11] features. Conditional Random Fields (CRF) [12] is another entity extraction strategy mentioned there. For entity detection, graph distances and various ranking features were used.

2.1. Twitter crawling

According to [13], the public Twitter Streaming API provides the ability to collect a sample of user tweets. The statuses/filter API provides a constant stream of public tweets. Multiple optional parameters may be specified, such as keywords and locations. Using the method CreateStreamingConnection, a POST request to the API returns the public statuses as a stream. The rate limit of the Streaming API allows each application to submit up to 5,000 Twitter user IDs [13]. Based on the documentation, Twitter currently allows the public to retrieve at most a 1% sample of the data posted on Twitter at a specific time. Twitter begins to return sampled data to the user when the number of matching tweets reaches 1% of all tweets on Twitter.

According to the research in [14] comparing the Twitter Streaming API with the Twitter Firehose, the results obtained from the Streaming API depend strongly on the coverage and on the type of analysis that the researcher wishes to perform.
For example, the researchers found that, for a given set of parameters, the coverage of the Streaming API is reduced as the number of tweets matching them increases. Thus, if the research concerns filtered content, the Twitter Firehose would be a better choice despite its drawback of restrictive cost. However, since our project requires random sampling of Twitter data with no filters except for the English language, the Twitter Streaming API is an appropriate choice because it is freely available.

2.2. Entity Extraction

[15] suggested an open-source pipeline called TwitIE, which is dedicated solely to social media and built from components in GATE [16]. TwitIE consists of the following parts: tweet import, language identification, tokenisation, gazetteer, sentence splitter, normalisation, part-of-speech tagging, and named entity recogniser. Twitter data is delivered from the Twitter Streaming API in JSON format. TwitIE includes a new Format_Twitter plugin in the most recent GATE codebase, which automatically converts tweets in JSON format into GATE documents. This converter is automatically associated with document names that end in .json; otherwise text/x-json-twitter should be specified. TwitIE uses TextCat, a language identification algorithm, for its language identification. It provides reliable language identification, so that tweets written in English can be processed with the English POS tagger and named entity recogniser. Tokenisation deals with different character classes, sequences and rules. Since TwitIE deals with microblogs, it treats abbreviations and URLs as one token each, following Ritter's tokenisation scheme. Hashtags and user mentions are each treated as two tokens and are covered by a separate annotation, such as Hashtag. Normalisation in TwitIE is divided into two tasks: the identification of orthographic errors and the correction of the errors found. The TwitIE normaliser is designed specifically for social media.
TwitIE reuses the ANNIE gazetteer lists, which contain lists such as cities, organisations, days of the week, etc. TwitIE uses an adapted version of the Stanford part-of-speech tagger that is trained on tweets tagged with the Penn TreeBank (PTB) tagset. Using the combination of normalisation, gazetteer name lookup, and the POS tagger, performance increased to 86.93%. It increased further to 90.54% token accuracy when the PTB tagset was used. Named entity recognition in TwitIE shows a +30% absolute precision and +20% absolute performance increase compared to ANNIE, mainly with respect to Date, Organization and Person.

[7] proposed an innovative approach to distant supervision using topic models that draws on a large number of entities gathered from Freebase and a large amount of unlabelled data. Using the gathered entities, the approach combines information about an entity's context across its mentions. The POS tagging system of T-NER, called T-POS, adds new tags for Twitter-specific phenomena: retweets, @usernames, URLs and hashtags. The system uses clustering to group together distributionally similar words in order to handle lexical variations and out-of-vocabulary (OOV) words. T-POS uses Brown clusters and Conditional Random Fields. The combination of both features makes it possible to model strong dependencies between adjacent POS tags and to make use of highly correlated features. The results of T-POS are reported on a 4-fold cross validation over 800 tweets. T-POS outperforms the Stanford tagger, obtaining a 26% reduction in error. When trained on 102K tokens, the error reduction is 41%. The system includes shallow parsing, which identifies non-recursive phrases such as noun, verb and prepositional phrases in text. The shallow parsing component of T-NER, called T-CHUNK, performs better at shallow parsing of tweets than the off-the-shelf OpenNLP chunker, with a reported 22% reduction in error.
Another component of T-NER is the capitalization classifier, T-CAP, which analyses a tweet to predict capitalization. Named entity recognition in T-NER is divided into two components: named entity segmentation using T-SEG, and named entity classification using LabeledLDA. T-SEG treats segmentation as a sequence-labelling task and uses IOB encoding to represent segmentations. Conditional Random Fields are used for learning and inference. Contextual, dictionary and orthographic features are used, and a set of type lists gathered from Freebase is included in the in-house dictionaries. Additionally, the outputs of T-POS, T-CHUNK and T-CAP, and the Brown clusters, are used to generate features. As stated in the paper, compared with the state-of-the-art news-trained Stanford Named Entity Recognizer, T-SEG obtains a 52% increase in F1 score. To address the lack of context in tweets for identifying the types of entities they contain, and the large number of distinctive named entity types present in tweets, the paper presents and assesses a distantly supervised approach based on LabeledLDA. This approach models every entity as a mixture of types. This allows information about an entity's distribution over types to be shared across mentions, naturally handling ambiguous entity strings whose mentions could refer to different types. Based on the empirical experiments conducted, there is a 25% increase in F1 score over the co-training approach to named entity classification suggested by Collins and Singer (1999) when applied to Twitter.

[17] proposed a Twitter-adapted version of Kanopy called Kanopy4Tweets, which interlinks text documents with a knowledge base by using the relations between concepts and their neighbouring graph structure. The system consists of four parts: Named Entity Recogniser (NER), Named Entity Linking (NEL), Named Entity Disambiguation (NED) and NIL Resources Clustering (NRC).
The NER of Kanopy4Tweets uses TwitIE, the Twitter information extraction pipeline mentioned above. For NEL, a DBpedia index is built from a selection of datasets to search for suitable DBpedia resource candidates for each extracted entity. The datasets are stored in a single binary file using the HDT RDF format. This format has compact structures due to its binary representation of RDF data. It allows faster search without the need for decompression. The datasets can be quickly browsed and scanned for a specific object, subject or predicate at a glance. For each named entity found by the NER component, a list of resource candidates is retrieved from DBpedia using the top-down strategy. One challenge is that a large volume of resource candidates has a negative impact on the processing time of the disambiguation process. This problem can be resolved by reducing the number of candidates with a ranking method. The proposed ranking method ranks the candidates according to the document score assigned by the indexing engine and selects the top-x elements. The NED takes as input a list of named entities together with the candidate DBpedia resources produced by the preceding NEL process. The best candidate resource for each named entity is selected as output. A relatedness score is calculated from the number of paths between the resources, weighted by the exclusivity of the edges of these paths, and is applied to each candidate with respect to the candidate resources of all other entities. The input named entities are jointly disambiguated and linked to the candidate resources with the highest combined relatedness. NRC is the stage that handles extracted named entities for which no resource in the knowledge base can be linked. Using the Monge-Elkan similarity measure, the first NIL element is assigned to a new cluster, and each subsequent element is then compared with the previous ones.
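The NRC clustering loop can be sketched as follows. This is only an illustration under stated assumptions, not the Kanopy4Tweets implementation: the source does not say which inner token similarity Monge-Elkan is paired with, so the sketch assumes Python's `difflib.SequenceMatcher` ratio; the threshold of 0.8, the single-link comparison against every cluster member, and the sample mentions are also hypothetical.

```python
from difflib import SequenceMatcher


def monge_elkan(a, b):
    """Monge-Elkan similarity: for each token of a, take its best match among the
    tokens of b, then average those best scores. The measure is asymmetric."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    best_scores = [
        # Inner similarity is an assumption: difflib's ratio in [0, 1].
        max(SequenceMatcher(None, x, y).ratio() for y in tokens_b)
        for x in tokens_a
    ]
    return sum(best_scores) / len(tokens_a)


def cluster_nil_mentions(mentions, threshold=0.8):
    """Assign each NIL mention to the most similar existing cluster if the
    similarity is above the threshold; otherwise start a new cluster."""
    clusters = []
    for mention in mentions:
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            # Assumed single-link rule: compare with the closest cluster member.
            score = max(monge_elkan(mention, member) for member in cluster)
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([mention])
        else:
            best_cluster.append(mention)
    return clusters


clusters = cluster_nil_mentions(["ITEE", "itee", "Pizza Hut", "Piza Hut"])
print(clusters)
```

Each resulting cluster would then receive one NIL identifier (NIL1, NIL2, and so on), so that different surface forms of the same unlinkable entity share a referent.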
An element is added to an existing cluster when its similarity to that cluster is above a fixed threshold, whereas a new cluster is formed if no existing cluster has a similarity above the threshold.

2.3. Entity Extraction and Entity Linking

[18] proposed a lexicon-based joint entity extraction and entity linking approach, in which n-grams from tweets are mapped to DBpedia entities. A pre-processing stage cleans the tweets, classifies the part-of-speech tags, and normalises the initial tweets by converting alphabetic, numeric, and symbolic Unicode characters to their ASCII equivalents. Tokenisation splits the text on non-word characters, except for special characters that join compound words. The resulting list of tokens is fed into a shingle filter to construct token n-grams from the token stream. In the candidate mapping component, each token is mapped using a gazetteer compiled from DBpedia redirect labels, disambiguation labels and entity labels, each linked to its own DBpedia entities. All labels are indexed in lowercase and linked by exact matches only to the list of candidate entities for the tokens. The researchers prioritise longer tokens over shorter ones to remove possible overlaps between tokens. For each entity candidate, the system considers both local and context-related features via a pipeline of analysis scorers. Examples of local features are the string distance between the candidate labels and the n-gram, the origin of the label, its DBpedia type, the candidate's link graph popularity, the level of uncertainty of the token, and the surface form that matches best. The relation between a candidate entity and the other candidates within a given context, on the other hand, is assessed by the context-related features.
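The shingle and candidate mapping steps described above for [18] can be illustrated with a small sketch. It is an assumption-laden toy rather than the authors' system: the function names, the maximum n-gram length of 3, the tie-breaking rule (leftmost first among equally long matches) and the gazetteer entries are all hypothetical.

```python
def shingles(tokens, max_n=3):
    """Yield (start, end, text) for every token n-gram of up to max_n tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def match_candidates(tokens, gazetteer, max_n=3):
    """Exact lowercase gazetteer lookup that keeps longer n-grams and drops
    shorter n-grams overlapping an already accepted match."""
    lowered = [t.lower() for t in tokens]
    hits = [hit for hit in shingles(lowered, max_n) if hit[2] in gazetteer]
    # Longest matches first; among equal lengths, leftmost first.
    hits.sort(key=lambda h: (-(h[1] - h[0]), h[0]))
    taken, result = set(), []
    for start, end, text in hits:
        span = set(range(start, end))
        if taken.isdisjoint(span):
            taken |= span
            result.append((text, gazetteer[text]))
    return result


# Hypothetical gazetteer: lowercase label -> candidate DBpedia entities.
GAZETTEER = {
    "new york": ["dbp:New_York_City", "dbp:New_York_(state)"],
    "york": ["dbp:York"],
    "new york times": ["dbp:The_New_York_Times"],
}
tokens = "I read the New York Times in York".split()
result = match_candidates(tokens, GAZETTEER)
print(result)
```

In this example the three-token match suppresses the overlapping "new york" and the first "york", while the final, non-overlapping "york" is kept. Ranking the remaining candidates is the job of the local and context-related scorers.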
Examples of the context-related features mentioned are direct links to other context candidates in the DBpedia link graph, co-occurrence of other tokens' surface forms in the Wikipedia article of the candidate under consideration, co-references in the Wikipedia article, and further graph-based features of the link graph induced by all candidates of the context graph, which include graph distance measurements, connected component analysis, and centrality and density observations. The candidates are then sorted by a confidence score that reflects how well an entity describes a mention. If the confidence score is lower than the chosen threshold, a NIL referent is annotated.

[19] proposed lexical and n-gram features to look up resources in DBpedia. The entity type is assigned by a Conditional Random Fields (CRF) classifier that is trained specifically using DBpedia-related features (local features), word embeddings (contextual features), temporal popularity knowledge of an entity extracted from Wikipedia page view data, string similarity measures between the title of the entity and the mention (string distance), and linguistic features, with an additional pruning stage to increase the precision of entity linking. The whole process is split into five stages: pre-processing, mention candidate generation, mention detection and disambiguation (candidate selection), NIL detection, and entity mention type prediction. In the pre-processing stage, tweet tokenisation and part-of-speech tagging are based on the ARK Twitter Part-of-Speech Tagger, together with the tweet timestamps extracted from the tweet ID. The researchers used an in-house mention-entity dictionary of acronyms. This dictionary computes the n-grams (n [...]

[20] proposed an entity linking technique to link named entity mentions appearing in Web text with their corresponding entities in a knowledge base.
The solution discussed employs a knowledge base. The vast knowledge shared among communities and the development of information extraction techniques have made automated large-scale knowledge bases possible. Given this rich information about the world's entities, their relationships, and their semantic classes, all of which can be stored in a knowledge base, relation extraction techniques are vital for obtaining web data that promotes the discovery of useful relationships between the entities extracted from text. One possible way is to map the extracted entities to a knowledge base before they are stored in a knowledge base. The goal of entity linking is to map every textual entity mention m ∈ M to its corresponding entry e ∈ E in the knowledge base. When the entity mentioned in the text has no corresponding entity record in the given knowledge base, a NIL referent is given as a special label for un-linkable mentions. The paper notes that named entity recognition and entity linking should be performed jointly so that both processes strengthen one another. One method proposed in the paper is candidate entity generation. Its objective is to filter out, for each extracted entity, the irrelevant entities in the knowledge base. A list of candidates that the extracted entity might refer to is retrieved. The paper suggests three families of techniques for this goal. The first is name dictionary based techniques, which draw on entity pages, redirect pages, disambiguation pages, bold phrases from the first paragraphs, and hyperlinks in Wikipedia articles. The second is surface form expansion from the local document, which consists of heuristic-based methods and supervised learning methods. The third is methods based on search engines.
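To make the name dictionary technique concrete, here is a minimal sketch of candidate entity generation with a NIL fallback. It assumes the entity titles, redirects and disambiguation entries have already been extracted into plain Python structures; the function names and every sample entry are hypothetical and only mirror the UQ example from Section 1.2.

```python
from collections import defaultdict


def build_name_dictionary(entity_titles, redirects, disambiguations):
    """Map lowercase surface forms to the set of entities they may refer to."""
    names = defaultdict(set)
    for entity in entity_titles:
        # An entity page contributes its own title as a name.
        names[entity.replace("_", " ").lower()].add(entity)
    for alias, target in redirects.items():
        # A redirect page contributes an alias for a single entity.
        names[alias.lower()].add(target)
    for alias, targets in disambiguations.items():
        # A disambiguation page contributes one alias for several entities.
        names[alias.lower()].update(targets)
    return names


def generate_candidates(mention, names):
    """Return the sorted candidate entities for a mention, or ["NIL"] if none."""
    return sorted(names.get(mention.lower(), ())) or ["NIL"]


# Hypothetical extracts standing in for knowledge base dumps.
names = build_name_dictionary(
    entity_titles=["University_of_Queensland", "UQ_Holder!"],
    redirects={"UQ": "University_of_Queensland"},
    disambiguations={"UQ": ["University_of_Queensland", "UQ_Holder!"]},
)
print(generate_candidates("UQ", names))
print(generate_candidates("ITEE", names))
```

The candidate list for "UQ" contains two entries that a ranking method would then have to order, while "ITEE" falls through to the NIL label because no name in the dictionary matches it.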
For candidate entity ranking, five categories of methods are described: supervised ranking methods, unsupervised ranking methods, independent ranking methods, collective ranking methods and collaborative ranking methods. Lastly, the paper discusses ways to evaluate entity linking systems using precision, recall, F1-measure and accuracy. Despite all the methods proposed within the three main approaches to entity linking, the paper clarifies that it is still unclear which techniques and systems are best, because different entity linking systems perform differently depending on the datasets and domains.

[21] proposed a new versatile algorithm based on multiple additive regression trees called S-MART (Structured Multiple Additive Regression Trees), which emphasises non-linear tree-based models and structured learning. The framework is a generalisation of Multiple Additive Regression Trees (MART) adapted for structured learning. The algorithm was tested on entity linking, primarily tweet entity linking, and evaluated in both IE and IR settings. Non-linear models are shown to perform better than linear models in the IE setting. In the IR setting, however, the results are similar, except for LambdaRank, a neural network based model. Adopting a polynomial kernel further improves the entity linking performance of the non-linear SSVM. The paper shows that entity linking of tweets performs better with tree-based non-linear models than with the alternative linear and non-linear methods in both IE-driven and IR-driven evaluations. Based on the experiments conducted, the S-MART framework outperforms the current state-of-the-art entity linking systems.

2.4. Entity Linking and Knowledge Base

[22] proposed an approach to free-text relation extraction. The system is trained to use the text and an existing large-scale knowledge base cooperatively for extraction.
Furthermore, it learns low-dimensional embeddings of words, entities and relationships from a knowledge base with respect to scoring functions. It builds on the common practice of employing weakly labelled text mention data, but in a modified version that also extracts triples from the existing knowledge bases. By generalising from the knowledge base, it can learn the plausibility of a new triple (h, r, t), where h is the left-hand side entity (or head), t is the right-hand side entity (or tail) and r is the relationship linking them, even though this specific triple does not exist. By using all knowledge base triples rather than training only on (mention, relationship) pairs, the precision of relation extraction was shown to improve significantly.

[1] presented a novel system for named entity linking over microblog posts that leverages the linked nature of DBpedia as a knowledge base and uses graph centrality scoring as the disambiguation method to overcome polysemy and synonymy problems. The authors' motivation is the assumption that, because tweets are topic specific, related entities tend to appear in the same tweet. Since the system tackles noisy tweets, acronym handling and hashtag handling were integrated into the entity linking process. The system was compared with TAGME, a state-of-the-art system for named entity linking designed for short text. The results show that it outperformed TAGME in precision, recall and F1, with 68.3%, 70.8% and 69.5% respectively.

[23] presented an automated method to populate a Web-scale probabilistic knowledge base called Knowledge Vault (KV), which combines extractions from Web content such as text documents (TXT), HTML trees (DOM), HTML tables (TBL), and human annotated pages (ANO).
KV stores facts as RDF triples (subject, predicate, object), each associated with a confidence score that represents the probability that KV believes the triple is correct. In addition, all four extractors are merged into one system called FUSED-EX by constructing a feature vector for each extracted triple. A binary classifier is then applied to compute the probability that each triple is correct given its feature vector. The advantage of this fusion extractor is that it can learn the relative reliabilities of each system and build a model of those reliabilities. The benefits of combining multiple extractors include 7% more high-confidence triples and a high AUC score of 0.927 (AUC is the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one). To overcome the unreliability of facts extracted from the Web, prior knowledge is used. In this paper, Freebase is used to fit the prior models. Two methods were proposed: the path ranking algorithm, with an AUC score of 0.884, and a neural network model, with an AUC score of 0.882. A fusion of both methods increased the AUC score to 0.911. With the benefits of fusion shown quantitatively, the authors proposed a further fusion of the prior methods and the extractors to gain an additional performance boost. The result is 271M high-confidence facts, 33% of which are new facts unavailable in Freebase.

[24] proposed TremenRank, a graph-based model that tackles the target entity disambiguation challenge, the task of identifying target entities of the same domain. The motivation for this system is the difficulty and unreliability of current methods that rely on knowledge resources, the shortness of the context in which a target word occurs, and the large scale of the document collection. To overcome these challenges, TremenRank was first built on the notion of collectively identifying target entities in short texts.
This reduces memory use because the graph is constructed locally and scales up linearly with the number of target entities. The graph is created locally via inverted index technology. Two types of index are used: the document-to-word index and the word-to-document index. Next, the collection of documents (the short texts) is modelled as a multi-layer directed graph that holds various trust scores obtained via propagation. A trust score indicates how likely a short text is to contain a true mention. A series of experiments was conducted on TremenRank, and the model outperforms the current advanced methods, with a 24.8% increase in accuracy and a 15.2% increase in F1.

[25] introduced a probabilistic fusion system called SIGMAKB that integrates strong, high-precision knowledge bases and weaker, noisier knowledge bases into a single monolithic knowledge base. The system uses the Consensus Maximization Fusion algorithm to validate, aggregate, and ensemble knowledge extracted from web-scale knowledge bases such as YAGO and NELL and from 69 Knowledge Base Population systems. The algorithm combines multiple supervised classifiers (high-quality and clean KBs), motivated by distant supervision, with unsupervised classifiers (noisy KBs). Using this algorithm, a probabilistic interpretation of the results from complementary and conflicting data values can be shown to the user in a single response. Thus, using a consensus maximization component, the supervised and unsupervised data collected by the method above produce a final combined probability for each triple.
The standardization of string named entities and the alignment of different ontologies are done in the pre-processing stage.

Project plan

Semester 1

Task                                            Start       End         Duration (days)  Milestone
Research                                                                                 23/03/2017
Twitter Call                                    27/02/2017  02/03/2017  4
Entity Recognition                              27/02/2017  02/03/2017  4
Entity Extraction                               02/03/2017  02/03/2017  7
Entity Linking                                  09/03/2017  16/03/2017  7
Knowledge Base Fusion                           16/03/2017  23/03/2017  7
Proposal                                        27/02/2017  30/03/2017  30               30/03/2017
Crawling Twitter data using Public Stream API   31/03/2017  15/04/2017  15               15/04/2017
Collect Twitter data for training purp
