Analysis Of Large-Scale Data From Human Activities On The Web
This work focuses on data mining and machine learning using large-scale datasets, with an emphasis on Web information, social computing, and online social networks. Such datasets are becoming more numerous as the Web's reach grows, and understanding them is important for two reasons. First, a better understanding of the systems that generate the data allows us to improve those systems. For example, by looking at where search queries come from, we can better select which results and advertisements to display. Second, an in-depth understanding of the data allows us to leverage it for a variety of purposes. For instance, by looking at the geographic sources of queries, we can discover the reach of various ideas. In particular, we will develop new algorithms for these large datasets, answering subtle and nuanced questions that require both a huge amount of data and novel methodology. We will examine large social networks and processes related to them, such as group formation and network evolution. We will also look at data from web search, showing that it is a rich source of information which, when combined with IP address geolocation, can tell us a great deal about the geographic extent of various terms. In addition to learning about these systems, we will design algorithms for improving them. Using server logs, we will show how changing content can be scheduled more effectively on web pages. Finally, we will examine some of the privacy implications of this style of research, presenting a negative result that illustrates how careful we must be with our data.
dissertation or thesis