Statistics

New submissions

New submissions for Mon, 29 Apr 24

[1]  arXiv:2404.17019 [pdf, other]
Title: Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman's approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman's repeated sampling framework is as relevant for causal inference today as it has been since its inception.

[2]  arXiv:2404.17181 [pdf, other]
Title: Consistent information criteria for regularized regression and loss-based learning problems
Subjects: Methodology (stat.ME)

Many problems in statistics and machine learning can be formulated as model selection problems, where the goal is to choose an optimal parsimonious model among a set of candidate models. It is typical to conduct model selection by penalizing the objective function via information criteria (IC), as with the pioneering work by Akaike and Schwarz. Via recent work, we propose a generalized IC framework to consistently estimate general loss-based learning problems. In this work, we propose a consistent estimation method for Generalized Linear Model (GLM) regressions by utilizing the recent IC developments. We advance the generalized IC framework by proposing model selection problems, where the model set consists of a potentially uncountable set of models. In addition to theoretical expositions, our proposal introduces a computational procedure for the implementation of our methods in the finite sample setting, which we demonstrate via an extensive simulation study.

[3]  arXiv:2404.17209 [pdf, other]
Title: Generalized multi-view model: Adaptive density estimation under low-rank constraints
Authors: Julien Chhor (TSE-R), Olga Klopp (CREST-INSEE), Alexandre Tsybakov (CREST-INSEE)
Subjects: Statistics Theory (math.ST)

We study the problem of bivariate discrete or continuous probability density estimation under low-rank constraints. For discrete distributions, we assume that the two-dimensional array to estimate is a low-rank probability matrix. In the continuous case, we assume that the density with respect to the Lebesgue measure satisfies a generalized multi-view model, meaning that it is $\beta$-Hölder and can be decomposed as a sum of $K$ components, each of which is a product of one-dimensional functions. In both settings, we propose estimators that achieve, up to logarithmic factors, the minimax optimal convergence rates under such low-rank constraints. In the discrete case, the proposed estimator is adaptive to the rank $K$. In the continuous case, our estimator converges with the $L_1$ rate $\min((K/n)^{\beta/(2\beta+1)}, n^{-\beta/(2\beta+2)})$ up to logarithmic factors, and it is adaptive to the unknown support as well as to the smoothness $\beta$ and to the unknown number of separable components $K$. We present efficient algorithms for computing our estimators.

[4]  arXiv:2404.17211 [pdf, other]
Title: Pseudo-Observations and Super Learner for the Estimation of the Restricted Mean Survival Time
Authors: Ariane Cwiling (MAP5 - UMR 8145), Vittorio Perduca (MAP5 - UMR 8145), Olivier Bouaziz (MAP5 - UMR 8145)
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

In the context of right-censored data, we study the problem of predicting the restricted time to event based on a set of covariates. Under a quadratic loss, this problem is equivalent to estimating the conditional Restricted Mean Survival Time (RMST). To that end, we propose a flexible and easy-to-use ensemble algorithm that combines pseudo-observations and the super learner. The classical theoretical results of the super learner are extended to right-censored data, using a new definition of pseudo-observations, the so-called split pseudo-observations. Simulation studies indicate that the split pseudo-observations and the standard pseudo-observations are similar even for small sample sizes. The method is applied to maintenance and colon cancer datasets, showing its practical value compared to other prediction methods. We complement the predictions obtained from our method with our RMST-adapted risk measure, prediction intervals and variable importance measures developed in a previous work.

[5]  arXiv:2404.17222 [pdf, other]
Title: Asymptotic analysis for covariance parameter estimation of Gaussian processes with functional inputs
Authors: Lucas Reding (CERAMATHS), Andrés Felipe López-Lopera (CERAMATHS), François Bachoc (IMT)
Subjects: Statistics Theory (math.ST)

We consider covariance parameter estimation for Gaussian processes with functional inputs. From an increasing-domain asymptotics perspective, we prove the asymptotic consistency and normality of the maximum likelihood estimator. We extend these theoretical guarantees to encompass scenarios accounting for approximation errors in the inputs, which allows robustness of practical implementations relying on conventional sampling methods or projections onto a functional basis. Loosely speaking, both consistency and normality hold when the approximation error becomes negligible, a condition that is often achieved as the number of samples or basis functions becomes large. These latter asymptotic properties are illustrated through analytical examples, including one that covers the case of non-randomly perturbed grids, as well as several numerical illustrations.

[6]  arXiv:2404.17271 [pdf, other]
Title: To democratize research with sensitive data, we should make synthetic data more accessible
Comments: 4 pages, 2 figures
Subjects: Other Statistics (stat.OT); Computers and Society (cs.CY)

For over 30 years, synthetic data has been heralded as a promising solution to make sensitive datasets accessible. However, despite much research effort and several high-profile use-cases, the widespread adoption of synthetic data as a tool for open, accessible, reproducible research with sensitive data is still a distant dream. In this opinion, Erik-Jan van Kesteren, head of the ODISSEI Social Data Science team, argues that in order to progress towards widespread adoption of synthetic data as a privacy enhancing technology, the data science research community should shift focus away from developing better synthesis methods: instead, it should develop accessible tools, educate peers, and publish small-scale case studies.

[7]  arXiv:2404.17380 [pdf, other]
Title: Correspondence analysis: handling cell-wise outliers via the reconstitution algorithm
Subjects: Methodology (stat.ME); Applications (stat.AP)

Correspondence analysis (CA) is a popular technique to visualize the relationship between two categorical variables. CA uses the data from a two-way contingency table and is affected by the presence of outliers. The supplementary points method is a popular method to handle outliers. Its disadvantage is that the information from entire rows or columns is removed. However, outliers can be caused by cells only. In this paper, a reconstitution algorithm is introduced to cope with such cells. This algorithm can reduce the contribution of cells in CA instead of deleting entire rows or columns. Thus the remaining information in the row and column involved can be used in the analysis. The reconstitution algorithm is compared with two alternative methods for handling outliers, the supplementary points method and MacroPCA. It is shown that the proposed strategy works well.

[8]  arXiv:2404.17398 [pdf, other]
Title: Online Policy Learning and Inference by Matrix Completion
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Making online decisions can be challenging when features are sparse and orthogonal to historical ones, especially when the optimal policy is learned through collaborative filtering. We formulate the problem as a matrix completion bandit (MCB), where the expected reward under each arm is characterized by an unknown low-rank matrix. The $\epsilon$-greedy bandit and the online gradient descent algorithm are explored. Policy learning and regret performance are studied under a specific schedule for exploration probabilities and step sizes. A faster decaying exploration probability yields smaller regret but learns the optimal policy less accurately. We investigate an online debiasing method based on inverse propensity weighting (IPW) and a general framework for online policy inference. The IPW-based estimators are asymptotically normal under mild arm-optimality conditions. Numerical simulations corroborate our theoretical findings. Our methods are applied to the San Francisco parking pricing project data, revealing intriguing discoveries and outperforming the benchmark policy.

[9]  arXiv:2404.17429 [pdf, other]
Title: Separation capacity of linear reservoirs with random connectivity matrix
Authors: Youness Boutaib
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

We argue that the success of reservoir computing lies within the separation capacity of the reservoirs and show that the expected separation capacity of random linear reservoirs is fully characterised by the spectral decomposition of an associated generalised matrix of moments. Of particular interest are reservoirs with Gaussian matrices that are either symmetric or whose entries are all independent. In the symmetric case, we prove that the separation capacity always deteriorates with time; while for short inputs, separation with large reservoirs is best achieved when the entries of the matrix are scaled with a factor $\rho_T/\sqrt{N}$, where $N$ is the dimension of the reservoir and $\rho_T$ depends on the maximum length of the input time series. In the i.i.d. case, we establish that optimal separation with large reservoirs is consistently achieved when the entries of the reservoir matrix are scaled with the exact factor $1/\sqrt{N}$. We further give upper bounds on the quality of separation as a function of the length of the time series. We complement this analysis with an investigation of the likelihood of this separation and the impact of the chosen architecture on separation consistency.

[10]  arXiv:2404.17441 [pdf, other]
Title: Comparison results for Markov tree distributions
Subjects: Statistics Theory (math.ST); Probability (math.PR)

We develop comparison results for Markov tree distributions extending ordering results from the literature on discrete time Markov processes and recently studied ordering results for conditionally independent factor models to tree structures. Based on fairly natural positive dependence conditions, our main contribution is a comparison result with respect to the supermodular order. Since this order is a pure dependence order, it has many applications in optimal transport, finance, and insurance. As an illustrative example, we consider hidden Markov models and study distributional robustness for functionals of the random walk under model uncertainty. Further, we show that, surprisingly, more general comparison results via the recently established rearrangement-based Schur order for conditional distributions, which implies an ordering of Chatterjee's rank correlation, do not carry over from star structures to trees. Several examples and a detailed discussion of the assumptions demonstrate the generality of our results and provide further insights into the behavior of multidimensional distributions.

[11]  arXiv:2404.17442 [pdf, ps, other]
Title: Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on 'random sets' in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds, which can be applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.

[12]  arXiv:2404.17464 [pdf, other]
Title: Bayesian Federated Inference for Survival Models
Comments: 22 pages, 5 figures, 2 tables
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)

In cancer research, overall survival and progression-free survival are often analyzed with the Cox model. To accurately estimate the parameters in the model, sufficient data and, more importantly, sufficient events need to be observed. In practice, this is often a problem. Merging data sets from different medical centers may help, but this is not always possible due to strict privacy legislation and logistical difficulties. Recently, the Bayesian Federated Inference (BFI) strategy for generalized linear models was proposed. With this strategy the statistical analyses are performed in the local centers where the data were collected (or stored) and only the inference results are combined into a single estimated model; merging data is not necessary. The BFI methodology aims to compute from the separate inference results in the local centers what would have been obtained if the analysis had been based on the merged data sets. In this paper we generalize the BFI methodology as initially developed for generalized linear models to survival models. Simulation studies and real data analyses show excellent performance; i.e., the results obtained with the BFI methodology are very similar to the results obtained by analyzing the merged data. An R package for doing the analyses is available.

[13]  arXiv:2404.17468 [pdf, ps, other]
Title: On Elliptical and Inverse Elliptical Wishart distributions: Review, new results, and applications
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

This paper deals with matrix-variate distributions, from Wishart to Inverse Elliptical Wishart distributions over the set of symmetric positive definite matrices. Similar to the multivariate scenario, (Inverse) Elliptical Wishart distributions form a vast and general family of distributions, encompassing, for instance, Wishart or $t$-Wishart ones. The first objective of this study is to present a unified overview of Wishart, Inverse Wishart, Elliptical Wishart, and Inverse Elliptical Wishart distributions through their fundamental properties. This involves leveraging the stochastic representation of these distributions to establish key statistical properties of the Normalized Wishart distribution. Subsequently, this enables the computation of expectations, variances, and Kronecker moments for Elliptical Wishart and Inverse Elliptical Wishart distributions. As an illustrative application, the practical utility of these generalized Elliptical Wishart distributions is demonstrated using a real electroencephalographic dataset. This showcases their effectiveness in accurately modeling heterogeneous data.

[14]  arXiv:2404.17482 [pdf, other]
Title: A comparison of the discrimination performance of lasso and maximum likelihood estimation in logistic regression model
Subjects: Methodology (stat.ME); Computation (stat.CO)

Logistic regression is widely used in many areas of knowledge. Several works compare the performance of lasso and maximum likelihood estimation in logistic regression. However, some of these works do not perform simulation studies and the remaining ones do not consider scenarios in which the ratio of the number of covariates to sample size is high. In this work, we compare the discrimination performance of lasso and maximum likelihood estimation in logistic regression using simulation studies and applications. Variable selection is done both by lasso and by stepwise selection when maximum likelihood estimation is used. We consider a wide range of values for the ratio of the number of covariates to sample size. The main conclusion of the work is that lasso has a better discrimination performance than maximum likelihood estimation when the ratio of the number of covariates to sample size is high.

[15]  arXiv:2404.17483 [pdf, other]
Title: Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect Estimation
Comments: Accepted to UAI2024. 14 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. To avoid losing such predictive feature information, existing methods learn separate feature representations using inverse probability weighting (IPW). However, due to the numerically unstable IPW weights, they suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator via weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Experimental results show that by effectively correcting the weight values, our method outperforms the existing ones, including traditional weighting schemes.

[16]  arXiv:2404.17491 [pdf, other]
Title: Computationally Efficient Algorithms for Simulating Isotropic Gaussian Random Fields on Graphs with Euclidean Edges
Subjects: Statistics Theory (math.ST)

This work addresses the problem of simulating Gaussian random fields that are continuously indexed over a class of metric graphs, termed graphs with Euclidean edges, which are more general and flexible than linear networks. We introduce three general algorithms that allow one to reconstruct a wide spectrum of random fields having a covariance function that depends on a specific metric, called the resistance metric, proposed in recent literature. The algorithms are applied to a synthetic case study consisting of a street network. They prove to be fast and accurate in that they reproduce the target covariance function and provide random fields whose finite-dimensional distributions are approximately Gaussian.

[17]  arXiv:2404.17561 [pdf, other]
Title: Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set.

[18]  arXiv:2404.17562 [pdf, other]
Title: Boosting e-BH via conditional calibration
Authors: Junu Lee, Zhimei Ren
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

The e-BH procedure is an e-value-based multiple testing procedure that provably controls the false discovery rate (FDR) under any dependence structure between the e-values. Despite this appealing theoretical FDR control guarantee, the e-BH procedure often suffers from low power in practice. In this paper, we propose a general framework that boosts the power of e-BH without sacrificing its FDR control under arbitrary dependence. This is achieved by the technique of conditional calibration, where we take as input the e-values and calibrate them to be a set of "boosted e-values" that are guaranteed to be no less -- and are often more -- powerful than the original ones. Our general framework is explicitly instantiated in three classes of multiple testing problems: (1) testing under parametric models, (2) conditional independence testing under the model-X setting, and (3) model-free conformalized selection. Extensive numerical experiments show that our proposed method significantly improves the power of e-BH while continuing to control the FDR. We also demonstrate the effectiveness of our method through an application to an observational study dataset for identifying individuals whose counterfactuals satisfy certain properties.
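
For orientation, the base e-BH rule that this abstract starts from can be sketched as follows. This is a minimal illustration of the standard procedure as usually stated in the e-value literature, not the boosted method proposed in the paper: with $m$ e-values sorted in decreasing order, find the largest $k$ such that $k \cdot e_{[k]} \ge m/\alpha$ and reject the hypotheses with the $k$ largest e-values. The function name `ebh` and the example inputs are hypothetical.

```python
from typing import List, Sequence


def ebh(e_values: Sequence[float], alpha: float) -> List[int]:
    """Base e-BH sketch: return sorted indices of rejected hypotheses.

    Assumption: this is the plain e-BH rule, not the conditionally
    calibrated ("boosted") variant described in the abstract.
    """
    m = len(e_values)
    if m == 0:
        return []
    # Indices ordered from the largest e-value to the smallest.
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        # Keep the largest k whose k-th largest e-value clears m / (alpha * k).
        if k * e_values[idx] >= m / alpha:
            k_star = k
    return sorted(order[:k_star])


if __name__ == "__main__":
    # Hypothetical e-values for five hypotheses, target FDR level 0.1.
    print(ebh([60.0, 1.0, 30.0, 0.5, 10.0], alpha=0.1))
```

With these made-up inputs the threshold $m/\alpha$ is 50, so only the two largest e-values (60 and 30, at indices 0 and 2) are rejected.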

[19]  arXiv:2404.17576 [pdf, ps, other]
Title: Enhancing Longitudinal Clinical Trial Efficiency with Digital Twins and Prognostic Covariate-Adjusted Mixed Models for Repeated Measures (PROCOVA-MMRM)
Comments: 29 pages, 9 tables
Subjects: Applications (stat.AP)

Clinical trials are critical in advancing medical treatments but often suffer from immense time and financial burden. Advances in statistical methodologies and artificial intelligence (AI) present opportunities to address these inefficiencies. Here we introduce Prognostic Covariate-Adjusted Mixed Models for Repeated Measures (PROCOVA-MMRM) as an advantageous combination of prognostic covariate adjustment (PROCOVA) and Mixed Models for Repeated Measures (MMRM). PROCOVA-MMRM utilizes time-matched prognostic scores generated from AI models to enhance the precision of treatment effect estimators for longitudinal continuous outcomes, enabling reductions in sample size and enrollment times. We first provide a description of the background and implementation of PROCOVA-MMRM, followed by two case study reanalyses where we compare the performance of PROCOVA-MMRM versus the unadjusted MMRM. These reanalyses demonstrate significant improvements in statistical power and precision in clinical indications with unmet medical need, specifically Alzheimer's Disease (AD) and Amyotrophic Lateral Sclerosis (ALS). We also explore the potential for sample size reduction with the prospective implementation of PROCOVA-MMRM, finding that the same or better results could have been achieved with fewer participants in these historical trials if the enhanced precision provided by PROCOVA-MMRM had been prospectively leveraged. We also confirm the robustness of the statistical properties of PROCOVA-MMRM in a variety of realistic simulation scenarios. Altogether, PROCOVA-MMRM represents a rigorous method of incorporating advances in the prediction of time-matched prognostic scores generated by AI into longitudinal analysis, potentially reducing both the cost and time required to bring new treatments to patients while adhering to regulatory standards.

Cross-lists for Mon, 29 Apr 24

[20]  arXiv:2404.16881 (cross-list from cs.LG) [pdf, other]
Title: On uncertainty-penalized Bayesian information criterion
Comments: 4 pages, 2 figures
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)

The uncertainty-penalized information criterion (UBIC) has been proposed as a new model-selection criterion for data-driven partial differential equation (PDE) discovery. In this paper, we show that using the UBIC is equivalent to applying the conventional BIC to a set of overparameterized models derived from the potential regression models of different complexity measures. The result indicates that the asymptotic property of the UBIC and BIC holds indifferently.

[21]  arXiv:2404.16928 (cross-list from astro-ph.IM) [pdf, other]
Title: Notes on the Practical Application of Nested Sampling: MultiNest, (Non)convergence, and Rectification
Comments: 12 pages, 10 figures. Comments welcome
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation (stat.CO)

Nested sampling is a promising tool for Bayesian statistical analysis because it simultaneously performs parameter estimation and facilitates model comparison. MultiNest is one of the most popular nested sampling implementations, and has been applied to a wide variety of problems in the physical sciences. However, MultiNest results are frequently unreliable, and accompanying convergence tests are a necessary component of any analysis. Using simple, analytically tractable test problems, I illustrate how MultiNest (1) can produce systematically biased estimates of the Bayesian evidence, which are more significantly biased for problems of higher dimensionality; (2) can derive posterior estimates with errors on the order of $\sim100\%$; (3) is more likely to underestimate the width of a credible interval than to overestimate it - to a minor degree for smooth problems, but much more so when sampling noisy likelihoods. Nevertheless, I show how MultiNest can be used to jump-start Markov chain Monte Carlo sampling or more rigorous nested sampling techniques, potentially accelerating more trustworthy measurements of posterior distributions and Bayesian evidences, and overcoming the challenge of Markov chain Monte Carlo initialization.

[22]  arXiv:2404.16954 (cross-list from cs.LG) [pdf, other]
Title: Taming False Positives in Out-of-Distribution Detection with Human Feedback
Comments: Appeared in the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)
Journal-ref: PMLR 238:1486-1494, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Robustness to out-of-distribution (OOD) samples is crucial for safely deploying machine learning models in the open world. Recent works have focused on designing scoring functions to quantify OOD uncertainty. Setting appropriate thresholds for these scoring functions for OOD detection is challenging as OOD samples are often unavailable up front. Typically, thresholds are set to achieve a desired true positive rate (TPR), e.g., $95\%$ TPR. However, this can lead to very high false positive rates (FPR), ranging from 60% to 96%, as observed in the Open-OOD benchmark. In safety-critical real-life applications, e.g., medical diagnosis, controlling the FPR is essential when dealing with various OOD samples dynamically. To address these challenges, we propose a mathematically grounded OOD detection framework that leverages expert feedback to safely update the threshold on the fly. We provide theoretical results showing that it is guaranteed to meet the FPR constraint at all times while minimizing the use of human feedback. Another key feature of our framework is that it can work with any scoring function for OOD uncertainty quantification. Empirical evaluation of our system on synthetic and benchmark OOD datasets shows that our method can maintain FPR at most $5\%$ while maximizing TPR.

[23]  arXiv:2404.16956 (cross-list from cs.LG) [pdf, other]
Title: A Notion of Uniqueness for the Adversarial Bayes Classifier
Authors: Natalie S. Frank
Comments: 46 pages, 7 figures
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

We propose a new notion of uniqueness for the adversarial Bayes classifier in the setting of binary classification. Analyzing this notion of uniqueness produces a simple procedure for computing all adversarial Bayes classifiers for a well-motivated family of one dimensional data distributions. This characterization is then leveraged to show that as the perturbation radius increases, certain notions of regularity improve for adversarial Bayes classifiers. We demonstrate with various examples that the boundary of the adversarial Bayes classifier frequently lies near the boundary of the Bayes classifier.

[24]  arXiv:2404.17008 (cross-list from q-fin.RM) [pdf, other]
Title: The TruEnd-procedure: Treating trailing zero-valued balances in credit data
Comments: 21 pages, 7255 words, 10 Figures
Subjects: Risk Management (q-fin.RM); Statistical Finance (q-fin.ST); Applications (stat.AP)

A novel procedure is presented for finding the true but latent endpoints within the repayment histories of individual loans. The monthly observations beyond these true endpoints are false, largely due to operational failures that delay account closure, thereby corrupting some loans in the dataset with "false" observations. Detecting these false observations is difficult at scale since each affected loan history might have a different sequence of zero (or very small) month-end balances that persist towards the end. Identifying these trails of diminutive balances would require an exact definition of a "small balance", which can be found using our so-called TruEnd-procedure. We demonstrate this procedure and isolate the ideal small-balance definition using residential mortgages from a large South African bank. Evidently, corrupted loans are remarkably prevalent and have excess histories that are surprisingly long, which ruin the timing of certain risk events and compromise any subsequent time-to-event model such as survival analysis. Excess histories can be discarded using the ideal small-balance definition, which demonstrably improves the accuracy of both the predicted timing and severity of risk events, without materially impacting the monetary value of the portfolio. The resulting estimates of credit losses are lower and less biased, which augurs well for raising accurate credit impairments under the IFRS 9 accounting standard. Our work therefore addresses a pernicious data error, which highlights the pivotal role of data preparation in producing credible forecasts of credit risk.

[25]  arXiv:2404.17158 (cross-list from cs.LG) [pdf, ps, other]
Title: Online $\mathrm{L}^{\natural}$-Convex Minimization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

An online decision-making problem is a learning problem in which a player repeatedly makes decisions in order to minimize the long-term loss. Such problems, as they emerge in applications, often have nonlinear combinatorial objective functions, and developing algorithms for them has attracted considerable attention. An existing general framework for dealing with such objective functions is online submodular minimization. However, practical problems are often out of the scope of this framework, since the domain of a submodular function is limited to a subset of the unit hypercube. To address this limitation of the existing framework, in this paper we introduce online $\mathrm{L}^{\natural}$-convex minimization, where an $\mathrm{L}^{\natural}$-convex function generalizes a submodular function so that the domain is a subset of the integer lattice. We propose computationally efficient algorithms for online $\mathrm{L}^{\natural}$-convex function minimization in two major settings: the full information and the bandit settings. We analyze the regrets of these algorithms and show in particular that our algorithm for the full information setting obtains a tight regret bound up to a constant factor. We also present several motivating examples that illustrate the usefulness of online $\mathrm{L}^{\natural}$-convex minimization.

[26]  arXiv:2404.17180 (cross-list from physics.data-an) [pdf, other]
Title: PHYSTAT Informal Review: Marginalizing versus Profiling of Nuisance Parameters
Comments: 22 pages, 2 figures
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)

This is a writeup, with some elaboration, of the talks by the two authors (a physicist and a statistician) at the first PHYSTAT Informal review on January 24, 2024. We discuss Bayesian and frequentist approaches to dealing with nuisance parameters, in particular, integrated versus profiled likelihood methods. In regular models, with finitely many parameters and large sample sizes, the two approaches are asymptotically equivalent. But, outside this setting, the two methods can lead to different tests and confidence intervals. Assessing which approach is better generally requires comparing the power of the tests or the length of the confidence intervals. This analysis has to be conducted on a case-by-case basis. In the extreme case where the number of nuisance parameters is very large, possibly infinite, neither approach may be useful. Part I provides an informal history of usage in high energy particle physics, including a simple illustrative example. Part II includes an overview of some more recently developed methods in the statistics literature, including methods applicable when the use of the likelihood function is problematic.
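
As a brief reminder of the two constructions being compared, the textbook definitions can be written as follows. The notation is an assumption made here, not taken from the writeup: $\mu$ is the parameter of interest, $\nu$ the nuisance parameters, $L(\mu,\nu)$ the likelihood, and $\pi(\nu)$ a prior density on the nuisance parameters.

```latex
L_{\mathrm{prof}}(\mu) = \sup_{\nu} \, L(\mu, \nu),
\qquad
L_{\mathrm{marg}}(\mu) = \int L(\mu, \nu) \, \pi(\nu) \, \mathrm{d}\nu .
```

Profiling replaces the nuisance parameters by their best-fitting values for each $\mu$, whereas marginalizing averages the likelihood over them with weight $\pi(\nu)$.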

[27]  arXiv:2404.17249 (cross-list from cs.LG) [pdf, other]
Title: Making Better Use of Unlabelled Data in Bayesian Active Learning
Comments: Published at AISTATS 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Fully supervised models are predominant in Bayesian active learning. We argue that their neglect of the information present in unlabelled data harms not just predictive performance but also decisions about what data to acquire. Our proposed solution is a simple framework for semi-supervised Bayesian active learning. We find it produces better-performing models than either conventional Bayesian active learning or semi-supervised learning with randomly acquired data. It is also easier to scale up than the conventional approach. As well as supporting a shift towards semi-supervised models, our findings highlight the importance of studying models and acquisition methods in conjunction.

[28]  arXiv:2404.17293 (cross-list from cs.LG) [pdf, other]
Title: Lazy Data Practices Harm Fairness Research
Comments: Accepted for publication at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)

Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

[29]  arXiv:2404.17358 (cross-list from cs.LG) [pdf, ps, other]
Title: Adversarial Consistency and the Uniqueness of the Adversarial Bayes Classifier
Authors: Natalie S. Frank
Comments: 17 pages
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Adversarial training is a common technique for learning robust classifiers. Prior work showed that convex surrogate losses are not statistically consistent in the adversarial context; in other words, a minimizing sequence of the adversarial surrogate risk will not necessarily minimize the adversarial classification error. We connect the consistency of adversarial surrogate losses to properties of minimizers of the adversarial classification risk, known as adversarial Bayes classifiers. Specifically, under reasonable distributional assumptions, a convex loss is statistically consistent for adversarial learning if and only if the adversarial Bayes classifier satisfies a certain notion of uniqueness.

[30]  arXiv:2404.17451 (cross-list from cs.LG) [pdf, other]
Title: Any-Quantile Probabilistic Forecasting of Short-Term Electricity Demand
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Power systems operate under uncertainty originating from multiple factors that are impossible to account for deterministically. Distributional forecasting is used to control and mitigate risks associated with this uncertainty. Recent progress in deep learning has helped to significantly improve the accuracy of point forecasts, while accurate distributional forecasting still presents a significant challenge. In this paper, we propose a novel general approach for distributional forecasting capable of predicting arbitrary quantiles. We show that our general approach can be seamlessly applied to two distinct neural architectures, leading to state-of-the-art distributional forecasting results on the short-term electricity demand forecasting task. We empirically validate our method on 35 hourly electricity demand time-series for European countries. Our code is available here: https://github.com/boreshkinai/any-quantile.

[31]  arXiv:2404.17452 (cross-list from cs.LG) [pdf, other]
Title: A Continuous Relaxation for Discrete Bayesian Optimization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Optimizing efficiently over discrete data with only a few available target observations is a challenge in Bayesian optimization. We propose a continuous relaxation of the objective function and show that inference and optimization can be computationally tractable. We consider in particular the optimization domain where very few observations and strict budgets exist, motivated by optimizing protein sequences for expensive-to-evaluate bio-chemical properties. The advantages of our approach are two-fold: the problem is treated in the continuous setting, and available prior knowledge over sequences can be incorporated directly. More specifically, we utilize available and learned distributions over the problem domain for a weighting of the Hellinger distance which yields a covariance function. We show that the resulting acquisition function can be optimized with either continuous or discrete optimization algorithms and empirically assess our method on two bio-chemical sequence optimization tasks.

[32]  arXiv:2404.17487 (cross-list from cs.LG) [pdf, other]
Title: Conformal Prediction with Learned Features
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

In this paper, we focus on the problem of conformal prediction with conditional guarantees. Prior work has shown that it is impossible to construct nontrivial prediction sets with full conditional coverage guarantees. A wealth of research has considered relaxations of full conditional guarantees, relying on some predefined uncertainty structures. Departing from this line of thinking, we propose Partition Learning Conformal Prediction (PLCP), a framework to improve conditional validity of prediction sets through learning uncertainty-guided features from the calibration data. We implement PLCP efficiently with alternating gradient descent, utilizing off-the-shelf machine learning models. We further analyze PLCP theoretically and provide conditional guarantees for infinite and finite sample sizes. Finally, our experimental results over four real-world and synthetic datasets show the superior performance of PLCP compared to state-of-the-art methods in terms of coverage and length in both classification and regression scenarios.

[33]  arXiv:2404.17489 (cross-list from cs.LG) [pdf, other]
Title: Tabular Data Contrastive Learning via Class-Conditioned and Feature-Correlation Based Augmentation
Comments: 14 pages, 4 algorithms, 3 figures, 5 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Contrastive learning is a model pre-training technique that first creates similar views of the original data and then encourages the data and its corresponding views to be close in the embedding space. Contrastive learning has witnessed success in image and natural language data, thanks to domain-specific augmentation techniques that are both intuitive and effective. Nonetheless, in the tabular domain, the predominant augmentation technique for creating views is corrupting tabular entries via swapping values, which is not as sound or effective. We propose a simple yet powerful improvement to this augmentation technique: corrupting tabular data conditioned on class identity. Specifically, when corrupting a specific tabular entry from an anchor row, instead of randomly sampling a value in the same feature column from the entire table uniformly, we only sample from rows that are identified to be within the same class as the anchor row. We assume the semi-supervised learning setting, and adopt the pseudo labeling technique for obtaining class identities over all table rows. We also explore the novel idea of selecting features to be corrupted based on feature correlation structures. Extensive experiments show that the proposed approach consistently outperforms the conventional corruption method for tabular data classification tasks. Our code is available at https://github.com/willtop/Tabular-Class-Conditioned-SSL.
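
The class-conditioned sampling step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation (their repository is linked in the abstract): the function name `class_conditioned_corrupt`, the `corruption_rate` and `seed` parameters, and the independent per-feature coin flip are all hypothetical, and the labels are taken as given rather than produced by pseudo labeling.

```python
import random
from typing import Dict, Hashable, List, Optional, Sequence


def class_conditioned_corrupt(
    rows: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    corruption_rate: float = 0.3,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Create one corrupted view per row, sampling replacements within class."""
    rng = random.Random(seed)

    # Group row indices by their (pseudo) class label.
    by_class: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    views: List[List[float]] = []
    for i, row in enumerate(rows):
        donors = by_class[labels[i]]  # rows in the anchor's class (incl. itself)
        view = list(row)
        for j in range(len(row)):
            # Assumption: each feature is corrupted independently at this rate.
            if rng.random() < corruption_rate:
                donor = rng.choice(donors)
                view[j] = rows[donor][j]  # same column, same-class donor row
        views.append(view)
    return views


# Hypothetical toy table with two classes.
table = [[1, 2], [2, 1], [10, 20], [20, 10]]
classes = [0, 0, 1, 1]
print(class_conditioned_corrupt(table, classes, corruption_rate=0.5, seed=0))
```

The conventional corruption method would draw the donor row from the whole table; restricting `donors` to the anchor's class is the only change, so a corrupted view never mixes in values from another class.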

[34]  arXiv:2404.17554 (cross-list from cs.HC) [pdf, ps, other]
Title: A Novel Context driven Critical Integrative Levels (CIL) Approach: Advancing Human-Centric and Integrative Lighting Asset Management in Public Libraries with Practical Thresholds
Subjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Systems and Control (eess.SY); Applications (stat.AP)

This paper proposes the context driven Critical Integrative Levels (CIL), a novel approach to lighting asset management in public libraries that aligns with the transformative vision of human-centric and integrative lighting. This approach encompasses not only the visual aspects of lighting performance but also prioritizes the physiological and psychological well-being of library users. Incorporating a newly defined metric, Mean Time of Exposure (MTOE), the approach quantifies user-light interaction, enabling tailored lighting strategies that respond to diverse activities and needs in library spaces. Case studies demonstrate how the CIL matrix can be practically applied, offering significant improvements over conventional methods by focusing on optimized user experiences from both visual impacts and non-visual effects.

[35]  arXiv:2404.17563 (cross-list from cs.LG) [pdf, other]
Title: An exactly solvable model for emergence and scaling laws
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)

Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time ($T$), training data ($D$), or model size ($N$) increases, a phenomenon known as emergence. In this paper, we present a framework where each new ability (a skill) is represented as a basis function. We solve a simple multi-linear model in this skill-basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute ($C$). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power-law. Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size or model size increases in the neural network.

Replacements for Mon, 29 Apr 24

[36]  arXiv:2008.07007 (replaced) [pdf, other]
Title: Interpretable Representations in Explainable AI: From Theory to Practice
Comments: Published in the *Special Issue on Explainable and Interpretable Machine Learning and Data Mining* of the Springer *Data Mining and Knowledge Discovery* journal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[37]  arXiv:2009.06560 (replaced) [pdf, other]
Title: Dual-Mandate Patrols: Multi-Armed Bandits for Green Security
Comments: Published at AAAI 2021. 9 pages (paper and references), 3 page appendix. 6 figures and 1 table
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[38]  arXiv:2101.00009 (replaced) [pdf, other]
Title: Adversarial Estimation of Riesz Representers
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Machine Learning (stat.ML)
[39]  arXiv:2110.01360 (replaced) [pdf, other]
Title: Bayesian Machine Learning meets Formal Methods: An application to spatio-temporal data
Subjects: Computation (stat.CO); Logic in Computer Science (cs.LO)
[40]  arXiv:2111.06390 (replaced) [pdf, other]
Title: Full Characterization of Adaptively Strong Majority Voting in Crowdsourcing
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
[41]  arXiv:2304.09988 (replaced) [pdf, other]
Title: The effect of estimating prevalences on the population-wise error rate
Comments: 14 pages, 7 figures
Subjects: Methodology (stat.ME)
[42]  arXiv:2305.02434 (replaced) [pdf, other]
Title: Uncertainty Quantification and Confidence Intervals for Naive Rare-Event Estimators
Authors: Yuanlu Bai, Henry Lam
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[43]  arXiv:2305.10416 (replaced) [pdf, ps, other]
Title: Minimax rate for multivariate data under componentwise local differential privacy constraints
Subjects: Statistics Theory (math.ST)
[44]  arXiv:2307.09055 (replaced) [pdf, ps, other]
Title: Robust Data Clustering with Outliers via Transformed Tensor Low-Rank Representation
Authors: Tong Wu
Comments: AISTATS 2024
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[45]  arXiv:2309.15983 (replaced) [pdf, other]
Title: What To Do (and Not to Do) with Causal Panel Analysis under Parallel Trends: Lessons from A Large Reanalysis Study
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Applications (stat.AP)
[46]  arXiv:2310.19091 (replaced) [pdf, other]
Title: Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
[47]  arXiv:2311.01194 (replaced) [pdf, other]
Title: Predictive Modelling of Critical Variables for Improving HVOF Coating using Gamma Regression Models
Comments: 37 pages, 7 figures
Subjects: Applications (stat.AP); Numerical Analysis (math.NA); Applied Physics (physics.app-ph)
[48]  arXiv:2311.02610 (replaced) [pdf, other]
Title: An adaptive standardisation methodology for Day-Ahead electricity price forecasting
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME)
[49]  arXiv:2311.06108 (replaced) [pdf, other]
Title: Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
[50]  arXiv:2312.02246 (replaced) [pdf, other]
Title: Conditional Variational Diffusion Models
Comments: Denoising Diffusion Probabilistic Models, Inverse Problems, Generative Models, Super Resolution, Phase Quantification, Variational Methods
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
[51]  arXiv:2312.09633 (replaced) [pdf, other]
Title: Natural Gradient Variational Bayes without Fisher Matrix Analytic Calculation and Its Inversion
Comments: 43 pages
Subjects: Methodology (stat.ME)
[52]  arXiv:2312.16011 (replaced) [pdf, other]
Title: Assigning Stationary Distributions to Sparse Stochastic Matrices
Comments: 29 pages, code available from this https URL. In this third version, we have added clarifications, corrections and remarks suggested to us by anonymous reviewers
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR); Computation (stat.CO)
[53]  arXiv:2401.06575 (replaced) [pdf, other]
Title: A Weibull Mixture Cure Frailty Model for High-dimensional Covariates
Comments: 43 pages, 5 tables, 40 figures
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
[54]  arXiv:2401.11119 (replaced) [pdf, ps, other]
Title: Measurement and comparison of distributional shift with applications to ecology, economics, and image analysis
Comments: 31 pages, 2 tables, 7 figures
Subjects: Methodology (stat.ME)
[55]  arXiv:2401.16392 (replaced) [pdf, other]
Title: A comprehensive survey of the home advantage in American football
Subjects: Applications (stat.AP)
[56]  arXiv:2402.06133 (replaced) [pdf, ps, other]
Title: Leveraging Quadratic Polynomials in Python for Advanced Data Analysis
Comments: The datasets can be freely accessed at this https URL. To facilitate ease of use and accessibility, the code was made available through MyBinder.org (this https URL)
Subjects: Methodology (stat.ME); Computation (stat.CO)
[57]  arXiv:2403.13398 (replaced) [pdf, other]
Title: A unified framework for bounding causal effects on the always-survivor and other populations
Subjects: Methodology (stat.ME)
[58]  arXiv:2404.10759 (replaced) [pdf, other]
Title: Laplace-HDC: Understanding the geometry of binary hyperdimensional computing
Comments: 23 pages, 7 figures
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
[59]  arXiv:2404.15417 (replaced) [pdf, other]
Title: The Power of Resets in Online Reinforcement Learning
Comments: Fixed a small typo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)