# Research

An overview of my research interests and publications

I received my PhD from in the Department of Statistics and Data Science at Carnegie Mellon University in Fall 2022. I was extremely fortunate to be advised by Matey Neykov.

Broadly speaking my primary research interests focus on

Developing various generalizations of models and estimators in classical statistics, and analyzing their theoretical properties.

More specifically, the classical topics I study include density estimation, isotonic regression, and location-scale estimation. I’m also actively interested in application-driven methodology via statistical ranking (in particular the Bradley-Terry-Luce model), and spatiotemporal modeling (wildfire prediction).

Feel free to reach out if you’d like to work on a problem together. I’m always open to discussing new research directions, and applying statistics to solve problems in interesting domains.

## Collaborators

I particularly enjoy the collaborative aspect of research. I’ve been privileged to work with the following amazing co-authors1: Rebecca Barter, Heejong Bong, Niccolò (Nic) Dalmasso, Riccardo Fogliato, Arun Kumar Kuchibhotla, Wanshan Li, Matey Neykov, Alessandro (Ale) Rinaldo.

I recommend perusing some of my papers below to get a better idea of the type of problems I like to work on, and the style of papers I enjoy writing with my co-authors.

# Papers

Below is a list2 of published works, papers under review, and competition submissions3. Reproducible code for my research is also publicly available via GitHub.

## Published

$$\ell_{\infty}$$-Bounds of the MLE in the BTL Model under General Comparison Graphs
Wanshan Li*, Shamindra Shrotriya*, Alessandro Rinaldo
Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), (2022).

The Bradley-Terry-Luce (BTL) model is a popular statistical approach for estimating the global ranking of a collection of items using pairwise comparisons. To ensure accurate ranking, it is essential to obtain precise estimates of the model parameters in the $$\ell_{\infty}$$-loss. The difficulty of this task depends crucially on the topology of the pairwise comparison graph over the given items. However, beyond very few well-studied cases, such as the complete and Erdös-Rényi comparison graphs, little is known about the performance of the maximum likelihood estimator (MLE) of the BTL model parameters in the $$\ell_{\infty}$$-loss under more general graph topologies. In this paper, we derive novel, general upper bounds on the ℓ∞ estimation error of the BTL MLE that depend explicitly on the algebraic connectivity of the comparison graph, the maximal performance gap across items and the sample complexity. We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results obtained using different loss functions and more restricted assumptions and graph topologies. We carefully compare our results to Yan et al. (2012), which is closest in spirit to our work. We further provide minimax lower bounds under $$\ell_{\infty}$$-error that nearly match the upper bounds over a class of sufficiently regular graph topologies. Finally, we study the implications of our $$\ell_{\infty}$$-bounds for efficient (offline) tournament design. We illustrate and discuss our findings through various examples and simulations.

@inproceedings{li2022ellinfboundsbtl,
title   = {
{$\ell_{\infty}$}-Bounds of the {MLE} in the
{BTL} Model under General Comparison Graphs
},
author  = {Wanshan Li and Shamindra Shrotriya and Alessandro Rinaldo},
year    = 2022,
booktitle = {
Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth
Conference on Uncertainty in Artificial Intelligence, {UAI} 2022, 1-5
August 2022, Eindhoven, The Netherlands
},
publisher = {{PMLR}},
series  = {Proceedings of Machine Learning Research},
volume  = 180,
pages   = {1178--1187},
url     = {https://proceedings.mlr.press/v180/li22g.html},
editor  = {James Cussens and Kun Zhang}
}

Nonparametric Estimation in the Dynamic Bradley-Terry Model
Heejong Bong*, Wanshan Li*, Shamindra Shrotriya*, Alessandro Rinaldo
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS), (2020).

We propose a time-varying generalization of the Bradley-Terry model that allows for nonparametric modeling of dynamic global rankings of distinct teams. We develop a novel estimator that relies on kernel smoothing to pre-process the pairwise comparisons over time and is applicable in sparse settings where the Bradley-Terry may not be fit. We obtain necessary and sufficient conditions for the existence and uniqueness of our estimator. We also derive time-varying oracle bounds for both the estimation error and the excess risk in the model-agnostic setting where the Bradley-Terry model is not necessarily the true data generating process. We thoroughly test the practical effectiveness of our model using both simulated and real world data and suggest an efficient data-driven approach for bandwidth tuning.

@inproceedings{bong2020nonparamestdynbtl,
title   = {Nonparametric Estimation in the Dynamic Bradley-Terry Model},
author  = {Heejong Bong and Wanshan Li and Shamindra Shrotriya and Alessandro Rinaldo},
year    = 2020,
booktitle = {
The 23rd International Conference on Artificial Intelligence and
Statistics, {AISTATS} 2020, 26-28 August 2020, Online [Palermo, Sicily,
Italy]
},
publisher = {{PMLR}},
series  = {Proceedings of Machine Learning Research},
volume  = 108,
pages   = {3317--3326},
url     = {http://proceedings.mlr.press/v108/bong20a.html},
editor  = {Silvia Chiappa and Roberto Calandra}
}

Predictive Inference of a Wildfire Risk Pipeline in the United States
Niccolo Dalmasso*, Alex Reinhart*, Shamindra Shrotriya*
NeurIPS 2019 Workshop on Tackling Climate Change with Machine Learning (Proposal Track), (2019).

Wildfires are rare events that present severe threats to life and property. Understanding their propagation is of key importance to mitigate and contain their impact, especially since climate change is increasing their occurrence. We propose an end-to-end sequential model of wildfire risk components, including wildfire location, size, duration, and risk exposure. We do so through a combination of marked spatio-temporal point processes and conditional density estimation techniques. Unlike other approaches that use regression-based methods, this approach allows both predictive accuracy and an associated uncertainty measure for each risk estimate, accounting for the uncertainty in prior model components. This is particularly beneficial for timely decision-making by different wildfire risk management stakeholders. To allow us to build our models without limiting them to a specific state or county, we have collected open wildfire and climate data for the entire continental United States. We are releasing this aggregated dataset to enable further open research on wildfire models at a national scale.

@inproceedings{dalmasso2019firepred,
title   = {Predictive Inference of a Wildfire Risk Pipeline in the United States},
author  = {Dalmasso, Niccolo and Shrotriya, Shamindra and Reinhart, Alex},
year    = 2019,
booktitle = {NeurIPS 2019 Workshop on Tackling Climate Change with Machine Learning},
url     = {https://www.climatechange.ai/papers/neurips2019/42}
}

## Under Review

Revisiting Le Cam’s Equation: Exact Minimax Rates over Convex Density Classes
Shamindra Shrotriya, Matey Neykov (2022).

We study the classical problem of deriving minimax rates for density estimation over convex density classes. Building on the pioneering work of Le Cam (1973), Birge (1983, 1986), Wong and Shen (1995), Yang and Barron (1999), we determine the exact (up to constants) minimax rate over any convex density class. This work thus extends these known results by demonstrating that the local metric entropy of the density class always captures the minimax optimal rates under such settings. Our bounds provide a unifying perspective across both parametric and nonparametric convex density classes, under weaker assumptions on the richness of the density class than previously considered. Our proposed ‘multistage sieve’ MLE applies to any such convex density class. We apply our risk bounds to rederive known minimax rates including bounded total variation, and Holder density classes. We further illustrate the utility of the result by deriving upper bounds for less studied classes, e.g., convex mixture of densities. relevant trigger.

@misc{shrotriya2022lecamminimax,
title   = {
Revisiting Le Cam's Equation: Exact Minimax Rates over Convex Density
Classes
},
author  = {Shamindra Shrotriya and Matey Neykov},
year    = 2022,
eprint  = {arXiv:2210.11436}
}

Shamindra Shrotriya, Matey Neykov (2022).

Classical univariate isotonic regression involves nonparametric estimation under a monotonicity constraint of the true signal. We consider a variation of this generating process, which we term adversarial sign-corrupted isotonic (ASCI) regression. Under this ASCI setting, the adversary has full access to the true isotonic responses, and is free to sign-corrupt them. Estimating the true monotonic signal given these sign-corrupted responses is a highly challenging task. Notably, the sign-corruptions are designed to violate monotonicity, and possibly induce heavy dependence between the corrupted response terms. In this sense, ASCI regression may be viewed as an adversarial stress test for isotonic regression. Our motivation is driven by understanding whether efficient robust estimation of the monotone signal is feasible under this adversarial setting. We develop ASCIFIT, a three-step estimation procedure under the ASCI setting. The ASCIFIT procedure is conceptually simple, easy to implement with existing software, and consists of applying the PAVA with crucial pre- and post-processing corrections. We formalize this procedure, and demonstrate its theoretical guarantees in the form of sharp high probability upper bounds and minimax lower bounds. We illustrate our findings with detailed simulations.

@misc{shrotriya2022ascireg,
title   = {Adversarial Sign-Corrupted Isotonic Regression},
author  = {Shamindra Shrotriya and Matey Neykov},
year    = 2022,
eprint  = {arXiv:2207.07075}
}

maars: Tidy Inference under the ‘Models as Approximations’ Framework in R
Riccardo Fogliato*, Shamindra Shrotriya*, Arun Kumar Kuchibhotla (2021).

Linear regression using ordinary least squares (OLS) is a critical part of every statistician’s toolkit. In R, this is elegantly implemented via lm() and its related functions. However, the statistical inference output from this suite of functions is based on the assumption that the model is well specified. This assumption is often unrealistic and at best satisfied approximately. In the statistics and econometrics literature, this has long been recognized and a large body of work provides inference for OLS under more practical assumptions. This can be seen as model-free inference. In this paper, we introduce our package maars (“models as approximations”) that aims at bringing research on model-free inference to R via a comprehensive workflow. The maars package differs from other packages that also implement variance estimationin three key ways. First, all functions in maars follow a consistent grammar and return output in tidy format, with minimal deviation from the typical lm() workflow. Second, maars contains several tools for inference including empirical, multiplier, residual bootstrap, and subsampling, for easy comparison. Third, maars is developed with pedagogy in mind. For this, most of its functions explicitly return the assumptions under which the output is valid. This key innovation makes maars useful in teaching inference under misspecification and also a powerful tool for applied researchers. We hope our default feature of explicitly presenting assumptions will become a de facto standard for most statistical modeling in R.

@misc{fogliato2021maars,
title   = {maars: Tidy Inference under the 'Models as Approximations' Framework in R},
author  = {Riccardo Fogliato and Shamindra Shrotriya and Arun Kumar Kuchibhotla},
year    = 2021,
journal = {ArXiv e-prints},
url     = {https://shamindras.github.io/maars/},
note    = {R package version 1.2.0},
eprint  = {arXiv:2106.11188}
}

## PhD Thesis

On Some Problems in Nonparametric and Location-Scale Estimation
Shamindra Shrotriya* (2022).

We study three generalizations of classical nonparametric, and location scale estimation problems.

First, we study the classical problem of deriving minimax rates for density estimation over convex density classes. Our work extends known results by demonstrating that the local metric entropy of the density class always captures the exact (up to constants) minimax optimal rates under such settings. Our bounds provide a unifying perspective across both parametric and nonparametric convex density classes, under weaker assumptions on the richness of the density class than previously considered.

Second, we consider a variation of classical isotonic regression, which we term adversarial sign-corrupted isotonic (ASCI) regression. Here, the adversary can corrupt the sign of the responses having full access to the true response terms. We formalize ASCIFIT, a three-step estimation procedure under this regime, and demonstrate its theoretical guarantees in the form of sharp high probability upper bounds and minimax lower bounds.

Finally, we extend classical univariate uniform location-scale estimation over an interval, to multivariate uniform location-scale estimation over general convex bodies. Unlike the univariate setting, the observations are no longer totally ordered, and previous estimation techniques prove insufficient to account for the more refined geometry of the generating process. Under fixed dimension, our proposed location estimators converge at an n-1 rate. Our minimax lower bounds justify the optimality of our estimators in terms of the sample complexity. We also provide practical algorithms with provable convergence rates for our estimators, over a wide class of convex bodies.

@article{shrotriya2022phdthesis,
title        = {{On Some Problems in Nonparametric and Location-Scale Estimation}},
author       = {Shamindra Shrotriya},
year         = 2022,
month        = 12,
doi          = {10.1184/R1/21764069.v1},
url          = {
https://kilthub.cmu.edu/articles/thesis/On_Some_Problems_in_Nonparametric_and_Location-Scale_Estimation/21764069
}
}

## Competitions

Efficient Estimation of Distribution-free dynamics in the Bradley-Terry Model
Heejong Bong*, Wanshan Li*, Shamindra Shrotriya* (2019).

🏆 Winner: CMSAC Reproducible Research Competition (2019)

We propose a time-varying generalization of the original Bradley-Terry model. Our model directly captures the temporal dependence structure of the pairwise comparison data to model time-varying global rankings of N distinct objects. The convex formulation enables efficient analysis on sparse time-varying pairwise comparison data. Furthermore, depending on the choice of penalization norm, our model effectively provides a control on the degree of smoothing in the time-varying global rankings. We also prove that a relatively weak condition is necessary and sufficient to guarantee the existence and uniqueness of the solution of our model. Our condition is the weakest in literature till now. We implement various optimization algorithms to solve the model efficiently. We test the practical effectiveness of our model by separately ranking 5 seasons of publicly available National Football League (NFL) team data from 2011-2015, and NASCAR 2002 racing data. In particular, our ranking results on the NFL data compare favourably to the well-accepted and feature-rich NFL ELO ratings system. We thus view our time-varying Bradley-Terry model as a useful benchmarking tool for other feature-rich time-varying ranking models since it simply relies on the minimal time-varying pairwise comparison results for modeling.

Integrated Data Analysis for Early Warning of Lung Failure
Rebecca Barter*, Shamindra Shrotriya*.
Operational Database Management Systems (ODBMS), (2016).

🏆 Winner: Geisinger Health Collider Project (2016)

Chronic obstructive pulmonary disease (COPD) is a major cause of mortality worldwide, with approximately 12 million adults in the U.S. having been diagnosed with COPD. Our aim is to develop methods capable of effectively predicting cases of undiagnosed COPD among those whose primary reason for hospitalization was pneumonia. Most existing algorithmic approaches to similar prediction problems focus only on utilizing clinical information. Our approach, however, aims to incorporate external environmental data sources that are not captured by the clinical records using a process called “data blending”. We also investigate several leading supervised machine learning algorithms including Random Forest, Gradient Boosting Machines (GBM) and eXtreme Gradient Boosting (XGBoost) to improve COPD classification accuracy. We find that smoking and weather information significantly improve the predictive power of these algorithms in terms of predicting COPD among pneumonia pa-tients.

## Footnotes

1. Listed alphabetically by surname. Check out their cool statistical research.↩︎

2. Most of my papers are also easily accessed via ArXiv here, or per the links below.↩︎

3. the * symbol next to a persons name indicates equal authorship for the work.↩︎