Data Sciences-Foundations: Theory, Algorithms, Software (DS-F: TAS)


This arm of my research program focuses on the foundational aspects of artificial intelligence and deep learning (AI), machine learning (ML), and Statistics.


We primarily study the theoretical foundations and principles of AI, ML, and Statistics, to answer the "why use a data science technique?" question, but we also research the algorithmic and methodological aspects, to answer the "how to use a data science technique?" question, often in relation to a specific inter-disciplinary application.


Our research often takes us into the worlds of Bayesian statistics and resampling techniques, which help build inter-disciplinary bridges between Statistics, Computer Science, and parts of Mathematics. We often look at high-dimensional data geometry to understand properties of the data and algorithms.
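
As a toy illustration of what high-dimensional data geometry can reveal (the setup below is an illustrative assumption, not a description of any particular project of ours), the following Python sketch shows how pairwise distances between iid Gaussian points concentrate as the dimension grows, which matters for any method that relies on distances or nearest neighbours.

    import numpy as np

    rng = np.random.default_rng(0)

    def distance_spread(dim, n_points=100):
        """Relative spread (std/mean) of pairwise distances for iid N(0, I) points."""
        x = rng.standard_normal((n_points, dim))
        diffs = x[:, None, :] - x[None, :, :]
        d = np.sqrt((diffs ** 2).sum(axis=-1))
        iu = np.triu_indices(n_points, k=1)   # keep each pair of points once
        return d[iu].std() / d[iu].mean()

    for dim in (2, 10, 100, 1000):
        print(dim, round(distance_spread(dim), 3))
    # The ratio shrinks as dim grows: in high dimensions all points look
    # nearly equidistant, a basic concentration-of-measure effect.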


A considerable part of our research may be termed uncertainty quantification (UQ). We take a very broad view of UQ, and study probability and probabilistic inference, including considerations for complex forms of dependence, multi-dimensional extremes and heavy tails, generative models of many kinds, risk assessment and decision making under uncertainty, forecasting and prediction, and multi-scale and multi-resolution aspects. In some studies, UQ also relates to privacy, confidentiality, fairness, equity, diversity, and representativeness in both the data and the methods, which we broadly study under limited description data methods.


The theoretical properties and algorithmic aspects may often be in competition with each other. For example, statistical optimality, which captures how to extract the greatest possible information out of data, may clash with computational optimality, which argues for efficient and fast computations.
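
As a small, hedged illustration of this trade-off (the numbers and estimators below are toy choices, not a claim about any specific method), the Python sketch compares the full-sample mean with a mean computed on a random subsample: the subsample is much cheaper to process, but pays for it with a larger error.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, reps = 100_000, 1_000, 500   # full sample, subsample, Monte Carlo repetitions

    full_est, sub_est = [], []
    for _ in range(reps):
        x = rng.normal(loc=0.0, scale=1.0, size=n)
        full_est.append(x.mean())                      # O(n) work per estimate
        sub_est.append(rng.choice(x, size=m).mean())   # O(m) work per estimate, m << n

    print("full-sample RMSE:", np.sqrt(np.mean(np.square(full_est))))
    print("subsample RMSE:  ", np.sqrt(np.mean(np.square(sub_est))))
    # The subsampled estimator is roughly n/m times cheaper per estimate, but its
    # root-mean-squared error is about sqrt(n/m) times larger: computational
    # savings traded against statistical efficiency.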







Limited Description Data


Much of Statistics, and almost all artificial intelligence, deep learning, and machine learning methodology, is built around assumptions of "nice" data: (i) there is representative, unbiased, and adequate data from all sub-groups and sub-domains of interest, (ii) sampling artefacts are not present in the data, (iii) observations are statistically independent and identically distributed (iid), and so on.
What if these assumptions do not hold, as is often the case for real data? That is, what if the data is biased or unrepresentative and dependent on the sampling scheme, there is little or no data from some sub-populations, there are missing observations and systematic biases and errors in the observations, and the observations are not independent or identically distributed? What if the data fails to meet fairness, diversity, equity, and representativeness standards? Can we ensure that the analysis of such datasets meets fairness, as well as privacy and confidentiality, standards?


We study such limited description data from a number of different perspectives:

  1. Small Area Models: This is now a well-established field, where we consider the problem that the data from some or all of the sub-populations are not adequate in size, but we can borrow strength across sub-populations (a small sketch of this borrowing-strength idea appears after this list). Our research on this takes us into Bayesian statistics, resampling techniques, and many other interesting topics.
  2. Record Linkage, Entity Resolution: We try to construct larger datasets from smaller ones, by discovering relationships between observations and among features. This topic engages us in Bayesian statistics as well as several interesting machine learning techniques.
  3. Federated Learning, Transfer Learning: This is a counterpart of record linkage and entity resolution: instead of trying to discover relations between the observations and features of different datasets, we transport the analyses and inferences from one dataset to another. This helps in many ways: we can preserve privacy, confidentiality, respondent rights, and data and intellectual property security; we can reduce computational costs; and we can obtain better statistical inference, with high accuracy and precision.
    This arm of research also involves deep dives into Bayesian statistics as well as machine learning and artificial intelligence techniques.
  4. Privacy, Confidentiality, Respondent Rights, Data Security, Fairness, Equity, Diversity: There is considerable overlap of this topic with small area techniques, record linkage and entity resolution, and federated and transfer learning.
    Apart from working on those topics, we also ask the question: how can we create artificial data that has desirable privacy, fairness, and other properties? Such artificial datasets are extremely useful for public dissemination, research purposes, and many other uses. We work on the delicate balance between confidentiality and security, representativeness, and the quality of the results obtained by analyzing such artificial datasets. This arm of research also involves deep dives into Bayesian statistics and resampling techniques, as well as machine learning and artificial intelligence techniques.
  5. Repeated Surveys: This research domain uses Bayesian statistics, resampling techniques, and machine learning and artificial intelligence techniques to learn from surveys that are repeated over time.
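
As mentioned in item 1, here is a minimal Python sketch of the borrowing-strength idea behind small area models, in the spirit of an area-level (Fay-Herriot type) model with an empirical-Bayes style shrinkage estimator. The data, the method-of-moments variance estimate, and all settings are illustrative assumptions, not a description of our published methodology.

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative area-level data: direct survey estimates y_i of unknown area
    # means theta_i, with known sampling variances D_i (some areas are noisy).
    n_areas = 12
    true_theta = rng.normal(10.0, 2.0, size=n_areas)
    D = rng.uniform(0.5, 4.0, size=n_areas)
    y = rng.normal(true_theta, np.sqrt(D))

    # Crude method-of-moments estimate of the between-area variance A
    # (truncated at zero), and a precision-weighted overall mean.
    A_hat = max(np.var(y, ddof=1) - D.mean(), 0.0)
    w = 1.0 / (A_hat + D)
    mu_hat = np.sum(w * y) / np.sum(w)

    # Shrinkage: areas with larger sampling variance D_i are pulled more
    # strongly toward the overall mean -- they "borrow strength" from the rest.
    B = D / (A_hat + D)
    theta_hat = B * mu_hat + (1.0 - B) * y

    print("direct estimates:  ", np.round(y, 2))
    print("shrunken estimates:", np.round(theta_hat, 2))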







Conditional Statistical Inference


Conditional statistical methodology is structured to exploit the data that is actually available. This is our core topic of research in mainstream Statistics. The three main lines of research that we pursue here (admittedly, three streams with completely different philosophical foundations) are:


  1. Bayesian Statistics: Our research mainly uses Bayesian principles and theory to understand and enhance machine learning and artificial intelligence (including deep learning) algorithms, primarily with a view to quantifying uncertainty in the answers obtained by such algorithms, assessing the risk associated with using results from such algorithms, and conducting statistical inference and causality studies in ML/AI frameworks.
    We also work on Bayesian methodological developments, their use in inter-disciplinary settings, and other statistical applications.
  2. Resampling Techniques: This is philosophically very different from Bayesian statistics, and encompasses a broad area of research where we either repeatedly use the observed data, assign random weights to the data, or imitate the data generating process or optimization framework in a variety of ways (a minimal bootstrap sketch follows this list). Despite being fundamentally different from Bayesian techniques from a philosophical standpoint, in practical terms resampling often achieves very similar goals of uncertainty quantification, risk assessment, statistical inference, and causality in all kinds of data science frameworks.
    I have been working on resampling for a long time now, and this remains a core (and fun) topic of research for me, owing to its never-ending potential.
  3. Empirical Likelihood Techniques: This is a new topic of research for me, and it's been great fun and full of promise!
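
As referenced in item 2, here is a minimal sketch of the nonparametric bootstrap, one of the simplest resampling techniques: resample the observed data with replacement and recompute the statistic, so that the spread of the recomputed values approximates its sampling distribution. The data, statistic (a sample median), and settings below are made up for illustration and are not tied to any particular project.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=2.0, size=80)   # an illustrative skewed sample

    def bootstrap_median(data, n_boot=2000):
        """Nonparametric bootstrap distribution of the sample median."""
        n = len(data)
        stats = np.empty(n_boot)
        for b in range(n_boot):
            resample = rng.choice(data, size=n, replace=True)
            stats[b] = np.median(resample)
        return stats

    boot = bootstrap_median(x)
    print("sample median:           ", np.median(x))
    print("bootstrap standard error:", boot.std(ddof=1))
    print("95% percentile interval: ", np.percentile(boot, [2.5, 97.5]))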







Trans-disciplinary Data Science Applications


It's wonderful to be able to use the multiple arms of Data Sciences, like AI, ML, and Statistics, to solve real-world scientific problems! Here are some of the topics I work on:


  1. Climate Informatics: We try to understand properties of this planet's climate, and Physics-based climate models, using various data science methods. Since understanding climate or climate change is not a black-box prediction problem, we use lots of statistical techniques for inference and uncertainty quantification.
  2. Conflict Informatics: Can we use data science methods to predict violence and, more importantly, to understand the drivers of political violence? We study such tremendously important questions, along with causes and effects tied to climate change, environmental and ecological systems, migration, and supply chain dynamics.
  3. Health Informatics: We study multimodal and multi-resolution data on various aspects of health and well-being, including medical images, genomics, cancer biology, and other omics data. This is an exciting field, one where I am learning a lot every day!