Data Science Seminar, Fall 2015

Past Seminars: Spring 2014, Fall 2014
The seminar usually meets on Tuesdays at 11:25 am in SB-220.

Tuesday, Sep 8
11:25 am--12:40 pm
SB-220 Dr. Jeffrey Larson, Argonne National Lab

Title: Exploiting problem-specific knowledge and computational resources in derivative-free optimization

Abstract: This talk begins with a comparison of methods for optimizing computationally expensive functions that lack reliable gradient information. We highlight recently developed algorithms that exploit the structure of common problems, and demonstrate their efficacy on relevant applications. We then show how such algorithms can be incorporated into an asynchronous, multi-start framework. Theoretical results and practical performance of such a framework conclude the talk.

Tuesday, Oct 13
11:25 am--12:40 pm
SB-220 Dr. Pan Chen, Senior Director, Business Analytics at HAVI Global Solutions.

Title: A practitioner’s perspective on Big Data analysis

Abstract: In this talk, the speaker shares his observations on analytics business trends, some of the current gaps between promise and reality, and how analytics professionals and business professionals can work together to bridge these gaps. Lastly, he offers his opinions on what this means for schools that produce analytics talent.

Tuesday, Oct 20
11:25 am--12:40 pm
SB-220 Dr. Sydeaka Watson, Research Associate (Assistant Professor), Department of Public Health Sciences, University of Chicago.

Title: Survival model selection with missing data and correlated covariates

Abstract: A novel combination of existing methods was used to develop a survival prediction equation for pulmonary arterial hypertension patients awaiting lung transplantation. The Scientific Registry of Transplant Recipients (SRTR) dataset featured censored survival times, missing covariate data, and a large number of highly correlated candidate predictor variables. Penalized weighted least squares regression was repeatedly applied to bootstrap resamples of multiply imputed data, yielding a parsimonious model that satisfied internal validation criteria of clinical interest. Simulation studies under various degrees of predictor variable missingness, survival time censoring, effect size, and proportion of variables unrelated to survival have shown that this method tends to accurately recover the true list of Cox regression predictor variables.

Tuesday, Nov 17
11:25 am--12:40 pm
SB-220 Dr. Sou-Cheng Choi, Senior Statistician at NORC at the University of Chicago, and Research Assistant Professor in the Department of Applied Math at IIT.

Title: Probabilistic Record Linkage and Address Standardization

Abstract: Probabilistic record linkage (PRL) refers to the process of matching records from different data sources, such as database tables with missing values in their primary keys. It can be applied to join or de-duplicate records, or to impute missing data, resulting in better overall data quality. An important subproblem in PRL is to parse or standardize a text field such as an address into its component fields, e.g., street number, street name, city, state, zip code, and country. Modern data analysis techniques such as natural language processing and machine learning methods are often gainfully employed in both PRL and address standardization to achieve higher accuracy of linking or prediction. In a recent study, we compare the performance of a few widely used open-source PRL packages freely available in the public domain, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluate the baseline performance and sensitivity of a number of address-parsing web services, including the U.S. address parser, Google Maps APIs, and Data Science Toolkit. We will present strengths and limitations of the software and services we have evaluated. This is joint work with Yongheng Lin and Edward Mulrow, NORC at the University of Chicago.
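The core matching step in PRL can be sketched with Python's standard-library difflib as a stand-in for the probabilistic scoring that packages such as FRIL or R RecordLinkage perform. The toy tables, field names, and the 0.7 cutoff below are all hypothetical; real PRL systems combine per-field agreement weights, blocking, and trained thresholds.

```python
from difflib import SequenceMatcher

# Two toy tables whose records describe the same people but with
# noisy, non-identical key fields (no shared primary key).
table_a = [{"id": 1, "name": "Jonathan Smith", "city": "Chicago"},
           {"id": 2, "name": "Maria Garcia",   "city": "Evanston"}]
table_b = [{"id": "x", "name": "Jon Smith",    "city": "Chicago"},
           {"id": "y", "name": "M. Garcia",    "city": "Evanston"},
           {"id": "z", "name": "Alan Turing",  "city": "London"}]

def similarity(r1, r2):
    """Average per-field string similarity in [0, 1]."""
    fields = ("name", "city")
    return sum(SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
               for f in fields) / len(fields)

# Link each record in A to its best-scoring candidate in B,
# accepting the pair only above a similarity cutoff.
links = []
for a in table_a:
    best = max(table_b, key=lambda b: similarity(a, b))
    if similarity(a, best) >= 0.7:
        links.append((a["id"], best["id"]))

print(links)
```

Here the fuzzy comparison links "Jonathan Smith" to "Jon Smith" and "Maria Garcia" to "M. Garcia" while leaving the unrelated record unmatched, which is exactly the join-without-a-key problem the abstract describes.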

Tuesday, Nov 24
11:25 am--12:40 pm
Armour Dining Room-Hermann Hall Prof. William S. Cleveland, Shanti S. Gupta Professor of Statistics, Purdue University.

Title: Divide & Recombine (D&R) with Tessera: High Performance Computing for Deep Analysis of Big Data and Small

Abstract: The widely used term "big data" carries with it a notion of computational performance for the analysis of big datasets. But for data analysis, computational performance depends very heavily not just on size, but on the computational complexity of the analytic routines used in the analysis. Datasets that pose big computational challenges have a very wide range of sizes. Furthermore, the hardware power available to the data analyst is also an important factor. High performance computing for data analysis can be provided across wide ranges of dataset size, computational complexity, and hardware power by the Divide & Recombine (D&R) statistical approach, and by the Tessera D&R software implementation that makes programming D&R easy.

Bio: William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University. His areas of methodological research are in statistics, machine learning, and data visualization. He has analyzed data in his research in cyber security, computer networking, visual perception, environmental science, healthcare engineering, public opinion polling, and disease surveillance. In the course of this work, Cleveland has developed many new methods and models for data that are widely used throughout the worldwide technical community. He has led teams developing software systems implementing his methods that have become core programs in many commercial and open-source systems. In 1996 Cleveland was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science & Technology in the newly formed mathematics category. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute. Today, Cleveland and colleagues develop the Divide & Recombine (D&R) approach to data analysis, and the Tessera software system that implements D&R. This provides high performance computing for datasets whose sizes, computational complexities, and cluster hardware power range from very small to very big.

For more information contact Lulu Kang.