Tanaysingh42

Data Science, Soccer and Disasters

Data has many uses in our society, and Twitter data in particular has found many applications in data science. It has been used to study psychological issues, for example by predicting depression ^[1]. It has also been used to predict elections ^[2] and to poll public opinion across numerous sectors ^[3]. These examples show how diverse the uses of data can be. Beyond them, social media data and other collected data have been applied to several problems, including:

Soccer
Fake Disasters
Predicting Disasters

Soccer Rankings and Data

Much of the data analysis in soccer is based on basic statistics such as shots, passes, and saves, which do not truly reflect a player’s effectiveness in a particular role. This matters to football clubs because they use such data to decide which players to sign and whether a current player is pulling their weight and should be kept. One way to address the problem is to focus on where passes occur on the field in order to gauge players’ offensive abilities. The data already existed: hand-labeled annotations of every ball event in each match of the 2012-2013 La Liga season. The data is unstructured, since positions on the field are represented by words and diagrams rather than numbers. Its features include the location of each ball event, the start and end points of each pass, the player involved, and the outcome of the event. The analysis extracts features by splitting each possession of the ball into a vector over segments of the field. The possession vectors are then divided into two groups as a form of pseudo-predictive analysis: models are fit on the first 80% of the data and then used to predict the remaining 20%, which simulates applying the models to newly generated data. The study trained a Support Vector Machine (SVM) and measured its performance on the holdout (remaining 20%) data using the Area Under the Receiver Operating Characteristic Curve (AUROC). As a result, each player was given an Average Pass Shot Value (APSV) that reflected their passing capabilities. Players with a higher APSV tended to be the players who performed well in goals and assists, and the correlation between APSV and goals for “offensive” players was deemed significant (ρ = .27, p < 0.05).
The study used the data to build a map of passing relevance and then used that map to rank players by their offensive capabilities. This approach could support better player comparisons in the future and, with more detailed parameters, more situation-specific analytics, since soccer is such a complex sport ^[4].
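The 80/20 holdout evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's code: the possession scores and labels are invented, the label "possession ended in a shot" is an assumption made for the example, and the scores stand in for the output of a trained model such as an SVM. The names `holdout_split` and `auroc` are hypothetical helpers.

```python
from typing import Sequence, Tuple


def holdout_split(items: Sequence, train_fraction: float = 0.8) -> Tuple[list, list]:
    """Split time-ordered items into an earlier training part and a later holdout part."""
    cut = int(len(items) * train_fraction)
    return list(items[:cut]), list(items[cut:])


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical possessions in match order: (model score, 1 if a shot followed else 0).
# In a real pipeline the scores would come from a model fit on the training part only.
possessions = [(0.9, 1), (0.2, 0), (0.7, 1), (0.4, 0), (0.6, 0),
               (0.8, 1), (0.1, 0), (0.3, 1), (0.5, 0), (0.95, 1)]

train, holdout = holdout_split(possessions)
holdout_scores = [score for score, _ in holdout]
holdout_labels = [label for _, label in holdout]
print(len(train), len(holdout), auroc(holdout_labels, holdout_scores))
```

Splitting by position in the sequence rather than at random keeps the holdout set strictly later than the training set, which matches the goal of simulating newly generated data.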

Fake Hurricane Images on Twitter

During emergencies, individuals and groups often use social media to communicate and get information out. A problem arises when people exploit these dire times to spread fake images or information, for varying reasons, in ways that may cause panic or miscommunication. One way to combat this is to find a method for distinguishing fake tweets from real ones. The data was collected using an API that ensured the tweets met certain parameters, and it was then run through another API that checked for keywords to determine which tweets would be used in the study. The data is unstructured because it is text. The tweets met a certain word count, were posted during a certain time period, and may or may not have included geolocation data. The data was analyzed by classifying each tweet about Hurricane Sandy as either real or fake. The study found that this classification was roughly 90% accurate, which suggests that the automated techniques it used are fairly effective at determining whether a tweet is fake ^[5].
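To make the idea of classifying tweets as real or fake concrete, here is a minimal sketch using a bag-of-words Naive Bayes classifier. The summary above does not say which classifier or features the study used, so Naive Bayes is a stand-in chosen for brevity, and the example tweets, labels, and function names are all invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples: List[Tuple[str, str]]):
    """Count words per class; these counts are the whole multinomial Naive Bayes model."""
    word_counts: Dict[str, Counter] = {}
    class_counts: Counter = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text: str) -> str:
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples: List[Tuple[str, str]]) -> float:
    """Fraction of labeled examples the model classifies correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Invented training tweets; a real study would use thousands of labeled tweets.
training = [
    ("shark swimming on flooded highway sandy", "fake"),
    ("unbelievable shark photo in subway sandy", "fake"),
    ("power outage reported in lower manhattan", "real"),
    ("flooding reported near battery park shelters open", "real"),
]
held_out = [("shark photo on highway", "fake"), ("power outage reported", "real")]

model = train_naive_bayes(training)
print(accuracy(model, held_out))
```

An accuracy figure like the roughly 90% reported above is computed the same way as `accuracy` here: the share of held-out tweets whose predicted label matches the true label.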

Crisis Mapping of Natural Disasters

Mobile communication devices have become a major part of society, especially with regard to social media. This matters during natural disasters because it allows data to be collected about the event. However, most of that data is retrospective. One way to address that problem is to use Twitter data to map natural disasters in real time, which gives people current information so that they can respond appropriately. The data was collected with several geospatial data extraction tools, such as the GooglePlaces API, which were used to download geospatial data. The data is unstructured, since it consists of tweet text and the geographical information gathered from those tweets. Its features are the place, street, and region of each geographical reference. The data was analyzed by comparing the number of tweets in an area to the level of disaster impact in that area (e.g., an area with a high number of tweets is expected to have been heavily affected). The study found that the models built from the geospatial data were over 90% accurate when compared to the actual disaster ^[6].
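The comparison between tweet volume and disaster impact can be sketched as follows. This is a simplified illustration, not the study's method: the region names, the tweet counts, the threshold `min_tweets`, and the set of "actually impacted" regions are all hypothetical, and the functions `flag_impacted_regions` and `agreement` are invented for this example.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def flag_impacted_regions(tweet_regions: Iterable[str], min_tweets: int) -> Set[str]:
    """Flag every region mentioned in at least `min_tweets` tweets."""
    counts = Counter(tweet_regions)
    return {region for region, n in counts.items() if n >= min_tweets}


def agreement(flagged: Set[str], actual: Set[str], all_regions: Sequence[str]) -> float:
    """Fraction of regions where the tweet-based flag matches the ground truth."""
    matches = sum((region in flagged) == (region in actual) for region in all_regions)
    return matches / len(all_regions)


# Hypothetical input: one entry per tweet, naming the region extracted from it.
tweet_regions = ["region_a"] * 5 + ["region_b"] * 3 + ["region_c"] * 1
all_regions = ["region_a", "region_b", "region_c", "region_d"]

# Hypothetical ground truth, e.g. from an official post-event impact assessment.
actually_impacted = {"region_a", "region_c"}

flagged = flag_impacted_regions(tweet_regions, min_tweets=3)
print(sorted(flagged), agreement(flagged, actually_impacted, all_regions))
```

The threshold is the main design choice in a sketch like this: set too low, a handful of stray tweets flags an unaffected area; set too high, sparsely populated but heavily affected areas are missed.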

References

[1] De Choudhury, Munmun, et al. "Predicting Depression via Social Media." ICWSM. 2013.

[2] Tumasjan, Andranik, et al. "Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment." ICWSM 10 (2010): 178-185.

[3] Cody, Emily M., et al. "Public Opinion Polling with Twitter." arXiv preprint arXiv:1608.02024 (2016).

[4] Brooks, Joel, Matthew Kerr, and John Guttag. "Developing a Data-Driven Player Ranking in Soccer using Predictive Model Weights."

[5] Gupta, Aditi, et al. "Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy." Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013.

[6] Middleton, Stuart E., Lee Middleton, and Stefano Modafferi. "Real-Time Crisis Mapping of Natural Disasters Using Social Media." IEEE Intelligent Systems 29.2 (2014): 9-17.
