Methodology
- [1] arXiv:2405.10371 [pdf, ps, html, other]
Title: Causal Discovery in Multivariate Extremes with a Hydrological Analysis of Swiss River Discharges
Subjects: Methodology (stat.ME); Applications (stat.AP)
Causal asymmetry is based on the principle that an event is a cause only if its absence would not have been a cause. From there, uncovering causal effects becomes a matter of comparing a well-defined score in both directions. Motivated by studying causal effects at extreme levels of a multivariate random vector, we propose to construct a model-agnostic causal score relying solely on the assumption of the existence of a max-domain of attraction. Based on a representation of a Generalized Pareto random vector, we construct the causal score as the Wasserstein distance between the margins and a well-specified random variable. The proposed methodology is illustrated on a hydrologically simulated dataset of different characteristics of catchments in Switzerland: discharge, precipitation, and snowmelt.
- [2] arXiv:2405.10461 [pdf, ps, html, other]
Title: Prediction in Measurement Error Models
Subjects: Methodology (stat.ME)
We study the well-known, difficult problem of prediction in measurement error models. By directly targeting the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval that achieves a pre-determined prediction level. The construction procedure requires a working model for the distribution of the variable prone to error. If the working model is correct, the prediction interval estimator obtains the smallest variability in terms of assessing the true center and length. If the working model is incorrect, the prediction interval estimation is still consistent. We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining minimal prediction interval length. Numerical experiments are conducted to illustrate the performance, and we apply our method to predict the concentration of Abeta1-12 in cerebrospinal fluid in an Alzheimer's disease dataset.
- [3] arXiv:2405.10490 [pdf, ps, other]
Title: Neural Optimization with Adaptive Heuristics for Intelligent Marketing System
Authors: Changshuai Wei, Benjamin Zelditch, Joyce Chen, Andre Assuncao Silva T Ribeiro, Jingyi Kenneth Tay, Borja Ocejo Elizondo, Keerthi Selvaraj, Aman Gupta, Licurgo Benemann De Almeida
Comments: KDD 2024
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Optimization and Control (math.OC)
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
- [4] arXiv:2405.10527 [pdf, ps, html, other]
Title: Hawkes Models And Their Applications
Subjects: Methodology (stat.ME); Probability (math.PR); Applications (stat.AP)
The Hawkes process is a model for counting the number of arrivals to a system which exhibits the self-exciting property - that one arrival creates a heightened chance of further arrivals in the near future. The model, and its generalizations, have been applied in a plethora of disparate domains, though two particularly developed applications are in seismology and in finance. As the original model is elegantly simple, generalizations have been proposed which: track marks for each arrival, are multivariate, have a spatial component, are driven by renewal processes, treat time as discrete, and so on. This paper creates a cohesive review of the traditional Hawkes model and the modern generalizations, providing details on their construction, simulation algorithms, and giving key references to the appropriate literature for a detailed treatment.
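As a rough illustration of the self-exciting property described in this abstract (not code taken from the paper), the sketch below simulates a univariate Hawkes process by thinning. It assumes an exponential kernel, so the intensity is `mu + alpha * sum(exp(-beta * (t - t_i)))` over past arrivals `t_i`; the function name `simulate_hawkes` and the parameter names `mu`, `alpha`, `beta`, and `horizon` are illustrative choices, not notation from the review.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate arrival times on (0, horizon] by thinning.

    Assumed intensity: mu + alpha * sum(exp(-beta * (t - t_i))) over past
    arrivals t_i. Choosing alpha / beta < 1 keeps the process from exploding.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity at the current time
    while True:
        # Between arrivals the intensity only decays, so its current value
        # is a valid upper bound for proposing the next candidate time.
        bound = mu + excitation
        wait = rng.expovariate(bound)
        if t + wait > horizon:
            break
        t += wait
        excitation *= math.exp(-beta * wait)  # decay up to the candidate time
        # Accept the candidate with probability intensity(t) / bound.
        if rng.random() * bound <= mu + excitation:
            events.append(t)
            excitation += alpha  # each arrival raises the future intensity
    return events


arrivals = simulate_hawkes(mu=1.0, alpha=0.5, beta=1.0, horizon=50.0, seed=1)
print(len(arrivals), "arrivals; first few:", [round(a, 3) for a in arrivals[:5]])
```

Setting `alpha=0` removes the excitation and reduces the sketch to a homogeneous Poisson process with rate `mu`, which is a convenient sanity check.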
- [5] arXiv:2405.10719 [pdf, ps, html, other]
Title: $\ell_1$-Regularized Generalized Least Squares
Comments: 13 pages, 6 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper we propose an $\ell_1$-regularized GLS estimator for high-dimensional regressions with potentially autocorrelated errors. We establish non-asymptotic oracle inequalities for estimation accuracy in a framework that allows for highly persistent autoregressive errors. In practice, the whitening matrix required to implement the GLS is unknown; we present a feasible estimator for this matrix, derive consistency results, and ultimately show how our proposed feasible GLS can closely recover the optimal performance of the LASSO (as if the errors were white noise). A simulation study verifies the performance of the proposed method, demonstrating that the penalized (feasible) GLS-LASSO estimator performs on par with the LASSO in the case of white noise errors, whilst outperforming it in terms of sign-recovery and estimation error when the errors exhibit significant correlation.
- [6] arXiv:2405.10742 [pdf, ps, html, other]
Title: Efficient Sampling in Disease Surveillance through Subpopulations: Sampling Canaries in the Coal Mine
Comments: 15 pages, 1 figure
Subjects: Methodology (stat.ME); Applications (stat.AP)
We consider disease outbreak detection settings where the population under study consists of various subpopulations available for stratified surveillance. These subpopulations can, for example, be based on age cohorts, but may also correspond to other subgroups of the population under study, such as international travellers. Rather than sampling uniformly over the entire population, one may elevate the effectiveness of the detection methodology by optimally choosing a subpopulation for sampling. We show that, under some assumptions, the relative sampling efficiency between two subpopulations is inversely proportional to the ratio of their respective baseline disease risks. This leads to a considerable potential increase in sampling efficiency when sampling from the subpopulation with higher baseline disease risk, if the two subpopulation baseline risks differ strongly. Our mathematical results require a careful treatment of the power curves of exact binomial tests as a function of their sample size, which are erratic and non-monotonic due to the discreteness of the underlying distribution. Subpopulations with comparatively high baseline disease risk are typically in greater contact with health professionals, and thus when sampled for surveillance purposes this is typically motivated merely through a convenience argument. With this study, we aim to elevate the status of such "convenience surveillance" to optimal subpopulation surveillance.
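The non-monotonic power curve mentioned in this abstract can be reproduced with a few lines of arithmetic. The sketch below is an illustration under assumptions of its own, not the paper's setup: it uses a one-sided exact binomial test of `p = p0` against `p > p0` at level `alpha`, and the values `p0 = 0.1` and `p1 = 0.3` are arbitrary choices made for the example.

```python
from math import comb


def upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def exact_power(n, p0, p1, alpha=0.05):
    """Power at p1 of the exact one-sided test of H0: p = p0 against p > p0.

    The test rejects when X >= c, where c is the smallest critical value
    whose type-I error under p0 does not exceed alpha.
    """
    for c in range(n + 2):  # c = n + 1 means "never reject" (tail is 0)
        if upper_tail(n, c, p0) <= alpha:
            return upper_tail(n, c, p1)


# Tabulate power against sample size; the curve is a sawtooth, not monotone.
for n in range(1, 11):
    print(n, round(exact_power(n, p0=0.1, p1=0.3), 4))
```

With these values, power drops when going from `n = 3` to `n = 4`: the critical value has to jump from 2 to 3 to keep the type-I error below `alpha`, and the extra observation does not compensate for the stricter threshold.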
- [7] arXiv:2405.10769 [pdf, ps, other]
Title: Efficient estimation of target population treatment effect from multiple source trials under effect-measure transportability
Subjects: Methodology (stat.ME)
When the marginal causal effect comparing the same treatment pair is available from multiple trials, we wish to transport all results to make inference on the target population effect. To account for the differences between populations, statistical analysis is often performed controlling for relevant variables. However, when transportability assumptions are placed on conditional causal effects, rather than the distribution of potential outcomes, we need to carefully choose these effect measures. In particular, we present identifiability results in two cases: target population average treatment effect for a continuous outcome and causal mean ratio for a positive outcome. We characterize the semiparametric efficiency bounds of the causal effects under the respective transportability assumptions and propose estimators that are doubly robust against model misspecifications. We highlight an important discussion on the tension between the non-collapsibility of conditional effects and the variational independence induced by transportability in the case of multiple source trials.
- [8] arXiv:2405.10773 [pdf, ps, other]
Title: Proximal indirect comparison
Subjects: Methodology (stat.ME)
We consider the problem of indirect comparison, where a treatment arm of interest is absent by design in the target randomized control trial (RCT) but available in a source RCT. The identifiability of the target population average treatment effect often relies on conditional transportability assumptions. However, it is a common concern whether all relevant effect modifiers are measured and controlled for. We highlight a new proximal identification result in the presence of shifted, unobserved effect modifiers based on proxies: an adjustment proxy in both RCTs and an additional reweighting proxy in the source RCT. We propose an estimator which is doubly-robust against misspecifications of the so-called bridge functions and asymptotically normal under mild consistency of the nuisance models. An alternative estimator is presented to accommodate missing outcomes in the source RCT, which we then apply to conduct a proximal indirect comparison analysis using two weight management trials.
- [9] arXiv:2405.10925 [pdf, ps, other]
Title: High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. Desai
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
New submissions for Monday, 20 May 2024 (showing 9 of 9 entries)
- [10] arXiv:2405.10795 (cross-list from math.ST) [pdf, ps, html, other]
Title: Non trivial optimal sampling rate for estimating a Lipschitz-continuous function in presence of mean-reverting Ornstein-Uhlenbeck noise
Comments: 14 pages, 5 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)
We examine a mean-reverting Ornstein-Uhlenbeck process that perturbs an unknown Lipschitz-continuous drift and aim to estimate the drift's value at a predetermined time horizon by sampling the path of the process. Due to the time-varying nature of the drift, we propose an estimation procedure that involves an online, time-varying optimization scheme implemented using a stochastic gradient ascent algorithm to maximize the log-likelihood of our observations. The objective of the paper is to investigate the optimal sample size/rate for achieving the minimum mean square distance between our estimator and the true value of the drift. In this setting, we uncover a trade-off between the correlation of the observations, which increases with the sample size, and the dynamic nature of the unknown drift, which is weakened by increasing the frequency of observation. The mean square error is shown to be non-monotonic in the sample size, attaining a global minimum whose precise description depends on the parameters that govern the model. In the static case, i.e. when the unknown drift is constant, our method outperforms the arithmetic mean of the observations in highly correlated regimes, despite the latter being a natural candidate estimator. We then compare our online estimator with the global maximum likelihood estimator.
Cross submissions for Monday, 20 May 2024 (showing 1 of 1 entries)
- [11] arXiv:2110.07051 (replaced) [pdf, ps, html, other]
Title: Fast and Scalable Inference for Spatial Extreme Value Models
Subjects: Methodology (stat.ME); Computation (stat.CO)
The generalized extreme value (GEV) distribution is a popular model for analyzing and forecasting extreme weather data. To increase prediction accuracy, spatial information is often pooled via a latent Gaussian process (GP) on the GEV parameters. Inference for GEV-GP models is typically carried out using Markov chain Monte Carlo (MCMC) methods, or using approximate inference methods such as the integrated nested Laplace approximation (INLA). However, MCMC becomes prohibitively slow as the number of spatial locations increases, whereas INLA is only applicable in practice to a limited subset of GEV-GP models. In this paper, we revisit the original Laplace approximation for fitting spatial GEV models. In combination with a popular sparsity-inducing spatial covariance approximation technique, we show through simulations that our approach accurately estimates the Bayesian predictive distribution of extreme weather events, is scalable to several thousand spatial locations, and is several orders of magnitude faster than MCMC. A case study in forecasting extreme snowfall across Canada is presented.
- [12] arXiv:2210.05983 (replaced) [pdf, ps, html, other]
Title: Model-based clustering in simple hypergraphs through a stochastic blockmodel
Authors: Luca Brusa (UNIMIB), Catherine Matias (LPSM (UMR_8001))
Subjects: Methodology (stat.ME)
We propose a model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochastic blockmodel for graphs and assumes the existence of latent node groups and hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational approximation Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection.
To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods using synthetic data generated from the model, as well as from a line clustering experiment and a co-authorship dataset.
- [13] arXiv:2210.09560 (replaced) [pdf, ps, html, other]
Title: A Bayesian Convolutional Neural Network-based Generalized Linear Model
Comments: 25 pages, 7 figures
Subjects: Methodology (stat.ME)
Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear models (GLMs) framework. We use extracted nodes from the last hidden layer of CNN with Monte Carlo (MC) dropout as informative covariates in GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems, which have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm can be broadly applicable to image regressions or correlated data analysis by enabling accurate Bayesian inference quickly.
- [14] arXiv:2305.15671 (replaced) [pdf, ps, html, other]
Title: Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data
Subjects: Methodology (stat.ME)
We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducing Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC.
- [15] arXiv:2306.08485 (replaced) [pdf, ps, html, other]
Title: Graph-Aligned Random Partition Model (GARP)
Comments: Journal of the American Statistical Association 2024
Subjects: Methodology (stat.ME)
Bayesian nonparametric mixtures and random partition models are powerful tools for probabilistic clustering. However, standard independent mixture models can be restrictive in some applications such as inference on cell lineage due to the biological relations of the clusters. The increasing availability of large genomic data requires new statistical tools to perform model-based clustering and infer the relationship between homogeneous subgroups of units. Motivated by single-cell RNA applications we develop a novel dependent mixture model to jointly perform cluster analysis and align the clusters on a graph. Our flexible graph-aligned random partition model (GARP) exploits Gibbs-type priors as building blocks, allowing us to derive analytical results on the graph-aligned random partition's probability mass function (pmf). We derive a generalization of the Chinese restaurant process from the pmf and a related efficient and neat MCMC algorithm to perform Bayesian inference. We perform posterior inference on real single-cell RNA data from mice stem cells. We further investigate the performance of our model in capturing the underlying clustering structure as well as the underlying graph by means of simulation studies.
- [16] arXiv:2306.09555 (replaced) [pdf, ps, html, other]
Title: Geometric-Based Pruning Rules For Change Point Detection in Multiple Independent Time Series
Comments: 34 pages, 11 figures, 1 table
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
We consider the problem of detecting multiple changes in multiple independent time series. The search for the best segmentation can be expressed as a minimization problem over a given cost function. We focus on dynamic programming algorithms that solve this problem exactly. When the number of changes is proportional to data length, an inequality-based pruning rule encoded in the PELT algorithm leads to a linear time complexity. Another type of pruning, called functional pruning, gives a close-to-linear time complexity whatever the number of changes, but only for the analysis of univariate time series.
We propose a few extensions of functional pruning for multiple independent time series based on the use of simple geometric shapes (balls and hyperrectangles). We focus on the Gaussian case, but some of our rules can be easily extended to the exponential family. In a simulation study we compare the computational efficiency of different geometric-based pruning rules. We show that for small dimensions (2, 3, 4) some of them ran significantly faster than inequality-based approaches, in particular when the underlying number of changes is small compared to the data length.
- [17] arXiv:2306.15075 (replaced) [pdf, ps, html, other]
Title: Differences in academic preparedness do not fully explain Black-White enrollment disparities in advanced high school coursework
Subjects: Methodology (stat.ME); Applications (stat.AP)
Whether racial disparities in enrollment in advanced high school coursework can be attributed to differences in prior academic preparation is a central question in sociological research and education policy. However, previous investigations face methodological limitations, for they compare race-specific enrollment rates of students after adjusting for characteristics only partially related to their academic preparedness for advanced coursework. Informed by a recently-developed statistical technique, we propose and estimate a novel measure of students' academic preparedness and use administrative data from the New York City Department of Education to measure differences in AP mathematics enrollment rates among similarly prepared students of different races. We find that preexisting differences in academic preparation do not fully explain the under-representation of Black students relative to White students in AP mathematics. Our results imply that achieving equal opportunities for AP enrollment not only requires equalizing earlier academic experiences, but also addressing inequities that emerge from coursework placement processes.
- [18] arXiv:2311.03644 (replaced) [pdf, ps, html, other]
Title: BOB: Bayesian Optimized Bootstrap for Uncertainty Quantification in Gaussian Mixture Models
Comments: 35 pages, 8 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
A natural way to quantify uncertainties in Gaussian mixture models (GMMs) is through Bayesian methods. That said, sampling from the joint posterior distribution of GMMs via standard Markov chain Monte Carlo (MCMC) imposes several computational challenges, which have prevented a broader full Bayesian implementation of these models. A growing body of literature has introduced the Weighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as alternatives to MCMC sampling. The core idea of these methods is to repeatedly compute maximum a posteriori (MAP) estimates on many randomly weighted posterior densities. These MAP estimates then can be treated as approximate posterior draws. Nonetheless, a central question remains unanswered: How to select the random weights under arbitrary sample sizes. We, therefore, introduce the Bayesian Optimized Bootstrap (BOB), a computational method to automatically select these random weights by minimizing, through Bayesian Optimization, a black-box and noisy version of the reverse Kullback-Leibler (KL) divergence between the Bayesian posterior and an approximate posterior obtained via random weighting. Our proposed method outperforms competing approaches in recovering the Bayesian posterior, it provides a better uncertainty quantification, and it retains key asymptotic properties from existing methods. BOB's performance is demonstrated through extensive simulations, along with real-world data analyses.
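The random-weighting idea this abstract builds on can be shown in a deliberately simple setting. The sketch below is not BOB and not a Gaussian mixture model: it assumes a single Gaussian mean with known variance and a flat prior, so that the maximizer of each randomly weighted likelihood is just a weighted average. The unit-rate exponential weights are one conventional choice and are an assumption here; selecting the weight distribution is precisely the question the paper addresses.

```python
import random


def weighted_bootstrap_mean_draws(data, n_draws=2000, seed=0):
    """Approximate posterior draws for a Gaussian mean by random weighting.

    Assumptions for this toy example: known variance, flat prior, and
    independent Exponential(1) weights. Under them, maximizing the
    weighted log-likelihood gives the weighted average of the data.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        weights = [rng.expovariate(1.0) for _ in data]
        total = sum(weights)
        draws.append(sum(w * x for w, x in zip(weights, data)) / total)
    return draws


data = [0.5, 1.5, 2.0, 2.5, 3.5]
draws = weighted_bootstrap_mean_draws(data)
center = sum(draws) / len(draws)
print("mean of the approximate posterior draws:", round(center, 3))
```

Each draw costs one optimization (here available in closed form), and the draws are independent of each other, which is what makes this family of methods attractive as an alternative to MCMC.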
- [19] arXiv:2311.05794 (replaced) [pdf, ps, html, other]
Title: An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
In multi-armed bandit (MAB) experiments, it is often advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. We develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit experiments that produces powerful and anytime-valid inference on the ATE for \emph{any} bandit algorithm of the experimenter's choice, even those without probabilistic treatment assignment. Intuitively, the MAD "mixes" any bandit algorithm of the experimenter's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that decreases the priority placed on the Bernoulli design as the sample size grows. We prove that for $\delta_t = \omega\left(t^{-1/4}\right)$, the MAD generates anytime-valid asymptotic confidence sequences that are guaranteed to shrink around the true ATE. Hence, the experimenter is guaranteed to detect a true non-zero treatment effect in finite time. Additionally, we prove that the regret of the MAD approaches that of its underlying bandit algorithm over time, and hence, incurs a relatively small loss in regret in return for powerful inferential guarantees. Finally, we conduct an extensive simulation study exhibiting that the MAD achieves finite-sample anytime validity and high power without significant losses in finite-sample reward.
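The "mixing" step described in this abstract can be sketched as a convex combination of assignment probabilities. The code below is an illustration under stated assumptions rather than the authors' implementation: the Bernoulli design is taken to assign each arm with equal probability, and the sequence `delta_t = t ** (-0.24)` is one hypothetical choice that decays more slowly than `t ** (-1/4)`. The names `delta` and `mad_assignment_probs` are invented for this example.

```python
import random


def delta(t, exponent=0.24):
    """Hypothetical mixing sequence; exponent < 1/4 gives delta_t = omega(t^(-1/4))."""
    return t ** (-exponent)


def mad_assignment_probs(bandit_probs, t):
    """Mix a bandit algorithm's arm probabilities with a uniform Bernoulli design.

    bandit_probs: probabilities the underlying bandit assigns to each arm at
    step t (a deterministic bandit puts probability 1 on a single arm).
    """
    k = len(bandit_probs)
    d = delta(t)
    return [d / k + (1.0 - d) * p for p in bandit_probs]


def draw_arm(bandit_probs, t, rng):
    """Sample an arm index from the mixed assignment probabilities."""
    probs = mad_assignment_probs(bandit_probs, t)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
greedy = [1.0, 0.0]  # a deterministic bandit that always picks arm 0
for t in (1, 100, 10_000):
    print(t, [round(p, 4) for p in mad_assignment_probs(greedy, t)])
print("arm drawn at t = 100:", draw_arm(greedy, 100, rng))
```

Even for the deterministic bandit, every arm keeps an assignment probability of at least `delta(t) / k`, which is how a design of this form can accommodate algorithms without probabilistic treatment assignment.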
- [20] arXiv:2312.10563 (replaced) [pdf, ps, html, other]
Title: Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variations as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide association study (GWAS) summary data, and provide solid statistical guarantees. Our framework employs carefully crafted estimating equations, allowing for different sets of genetic variations to instrument the exposure and the mediator, to efficiently integrate information stored in three independent GWAS. As part of this endeavor, we demonstrate that in mediation analysis, the challenge raised by instrument selection goes beyond the well-known winner's curse issue, and therefore, addressing it requires special treatment. We then develop bias correction techniques to address the instrument selection issue and the commonly encountered measurement error bias issue. Collectively, through our theoretical investigations, we show that our framework provides valid statistical inference for both direct and mediation effects with enhanced statistical efficiency compared to existing methods. We further illustrate the finite-sample performance of our approach through simulation experiments and a case study.
- [21] arXiv:2402.14775 (replaced) [pdf, ps, html, other]
Title: Localised Natural Causal Learning Algorithms for Weak Consistency Conditions
Comments: UAI2024
Subjects: Methodology (stat.ME)
By relaxing conditions for natural structure learning algorithms, a family of constraint-based algorithms containing all exact structure learning algorithms under the faithfulness assumption, we define localised natural structure learning algorithms (LoNS). We also provide a set of necessary and sufficient assumptions for consistency of LoNS, which can be thought of as a strict relaxation of the restricted faithfulness assumption. We provide a practical LoNS algorithm that runs in exponential time, which is then compared with related existing structure learning algorithms, namely PC/SGS and the relatively recent Sparsest Permutation algorithm. Simulation studies are also provided.
- [22] arXiv:2403.02058 (replaced) [pdf, ps, html, other]
Title: Utility-based optimization of Fujikawa's basket trial design -- Pre-specified protocol of a comparison study
Comments: 26 pages, 1 figure; updated content in reaction to anonymous review: new section "Methodology of utility functions in basket trial designs", discussion of literature and four new scenario sets in section "Outcome scenarios", two new algorithms and detailed explanation in section "Optimization algorithms", new section "Discussion", further minor changes
Subjects: Methodology (stat.ME); Applications (stat.AP)
Basket trial designs are a type of master protocol in which the same therapy is tested in several strata of the patient cohort. Many basket trial designs implement borrowing mechanisms. These allow sharing information between similar strata with the goal of increasing power in responsive strata while at the same time constraining type-I error inflation to a bearable threshold. These borrowing mechanisms can be tuned using numerical tuning parameters. The optimal choice of these tuning parameters is subject to research. In a comparison study using simulations and numerical calculations, we are planning to investigate the use of utility functions for quantifying the compromise between power and type-I error inflation and the use of numerical optimization algorithms for optimizing these functions. The present document is the protocol of this comparison study, defining each step of the study in accordance with the ADEMP scheme for pre-specification of simulation studies.
- [23] arXiv:2405.10067 (replaced) [pdf, ps, html, other]
Title: Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layouts
Subjects: Methodology (stat.ME)
Interest in unsupervised methods for joint analysis of heterogeneous data sources has risen in recent years. Low-rank latent factor models have proven to be an effective tool for data integration and have been extended to a large number of data source layouts. Of particular interest is the separation of variation present in data sources into shared and individual subspaces. In addition, interpretability of estimated latent factors is crucial to further understanding.
We present sparse and orthogonal low-rank Collective Matrix Factorization (solrCMF) to estimate low-rank latent factor models for flexible data layouts. These encompass traditional multi-view (one group, multiple data types) and multi-grid (multiple groups, multiple data types) layouts, as well as augmented layouts, which allow the inclusion of side information between data types or groups. In addition, solrCMF allows tensor-like layouts (repeated layers), estimates interpretable factors, and determines variation structure among factors and data sources.
Using a penalized optimization approach, we automatically separate variability into the globally and partially shared as well as individual components and estimate sparse representations of factors. To further increase interpretability of factors, we enforce orthogonality between them. Estimation is performed efficiently in a recent multi-block ADMM framework which we adapted to support embedded manifold constraints.
The performance of solrCMF is demonstrated in simulation studies and compares favorably to existing methods.
- [24] arXiv:2003.05492 (replaced) [pdf, ps, other]
Title: An asymptotic Peskun ordering and its application to lifted samplers
Journal-ref: Bernoulli 30(3), 2301-2325, (August 2024)
Subjects: Computation (stat.CO); Methodology (stat.ME)
A Peskun ordering between two samplers, implying a dominance of one over the other, is known among the Markov chain Monte Carlo community for being a remarkably strong result. It is however also known for being a result that is notably difficult to establish. Indeed, one has to prove that the probability of reaching a state $\mathbf{y}$ from a state $\mathbf{x}$, using one sampler, is greater than or equal to the probability using the other sampler, and this must hold for all pairs $(\mathbf{x}, \mathbf{y})$ such that $\mathbf{x} \neq \mathbf{y}$. We provide in this paper a weaker version that does not require an inequality between the probabilities for all these states: essentially, the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the inequality holds belong to a mass-concentrating set. The weak ordering turns out to be useful to compare lifted samplers for partially-ordered discrete state-spaces with their Metropolis--Hastings counterparts. An analysis in great generality yields a qualitative conclusion: they asymptotically perform better in certain situations (and we are able to identify them), but not necessarily in others (and the reasons why are made clear). A quantitative study in a specific context of graphical-model simulation is also conducted.
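As a rough illustration of the classical (non-asymptotic) Peskun condition described in this abstract, the sketch below checks the off-diagonal inequality for two small transition matrices that share the same invariant distribution. It is not taken from the paper: the four-state target `pi`, the uniform proposal, and the helper names `build_kernel` and `peskun_dominates` are assumptions chosen for illustration. It compares the Metropolis-Hastings acceptance rule with Barker's rule, a standard textbook pair for which the ordering is known to hold.

```python
import numpy as np

def peskun_dominates(P1, P2, tol=1e-12):
    """Return True if P1[x, y] >= P2[x, y] for every pair x != y."""
    off_diag = ~np.eye(P1.shape[0], dtype=bool)
    return bool(np.all(P1[off_diag] >= P2[off_diag] - tol))

def build_kernel(pi, accept):
    """Transition matrix for a uniform proposal over the other states,
    accepted with probability accept(pi[y] / pi[x])."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x != y:
                P[x, y] = accept(pi[y] / pi[x]) / (n - 1)
        # Remaining mass stays at x (rejections).
        P[x, x] = 1.0 - P[x].sum()
    return P

# Hypothetical target distribution on four states (illustrative only).
pi = np.array([0.1, 0.2, 0.3, 0.4])

P_mh = build_kernel(pi, lambda r: min(1.0, r))         # Metropolis-Hastings rule
P_barker = build_kernel(pi, lambda r: r / (1.0 + r))   # Barker's rule

print(peskun_dominates(P_mh, P_barker))   # MH moves at least as often as Barker
print(peskun_dominates(P_barker, P_mh))
```

The check has to pass for every off-diagonal pair, which is why the abstract calls the ordering difficult to establish in general; the asymptotic version relaxes it to a mass-concentrating set of states.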
- [25] arXiv:2307.09864 (replaced) [pdf, ps, other]
Title: Asymptotic equivalence of Principal Components and Quasi Maximum Likelihood estimators in Large Approximate Factor Models
Comments: arXiv admin note: text overlap with arXiv:2211.01921 which is written by the same author. The two papers do not overlap as they contain different results although they have the same assumptions
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
We provide an alternative derivation of the asymptotic results for the Principal Components estimator of a large approximate factor model. Results are derived under a minimal set of assumptions and, in particular, we require only the existence of 4th order moments. A special focus is given to the time series setting, a case considered in almost all recent econometric applications of factor models. Hence, estimation is based on the classical $n\times n$ sample covariance matrix and not on a $T\times T$ covariance matrix often considered in the literature. Indeed, despite the two approaches being asymptotically equivalent, the former is more coherent with a time series setting and it immediately allows us to write more intuitive asymptotic expansions for the Principal Component estimators, showing that they are equivalent to OLS as long as $\sqrt n/T\to 0$ and $\sqrt T/n\to 0$; that is, the loadings are estimated in a time series regression as if the factors were known, while the factors are estimated in a cross-sectional regression as if the loadings were known. Finally, we give some alternative sets of primitive sufficient conditions for mean-squared consistency of the sample covariance matrix of the factors, of the idiosyncratic components, and of the observed time series, which is the starting point for Principal Component Analysis.
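A minimal sketch of the estimator described in this abstract, on simulated data: loadings are taken from the leading eigenvectors of the $n\times n$ sample covariance matrix, and factors are then obtained by a cross-sectional regression on those loadings. The data-generating process, the dimensions, the noise level, and the normalization $\hat\Lambda'\hat\Lambda/n = I$ are assumptions made here for illustration, not the paper's setup. The final step only verifies a finite-sample algebraic identity (time-series OLS on the estimated factors returns the same loadings); it does not demonstrate the paper's asymptotic equivalence with OLS on the true factors.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, r = 200, 50, 2   # time periods, series, factors (illustrative sizes)

# Simulated factor model X = F Lam' + noise (assumed data-generating process).
F = rng.standard_normal((T, r))
Lam = rng.standard_normal((n, r))
X = F @ Lam.T + 0.5 * rng.standard_normal((T, n))

# n x n sample covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / T

# Loadings: top-r eigenvectors, scaled so that Lam_hat' Lam_hat / n = I.
eigval, eigvec = np.linalg.eigh(S)          # eigenvalues in ascending order
top = np.argsort(eigval)[::-1][:r]
Lam_hat = eigvec[:, top] * np.sqrt(n)

# Factors: cross-sectional OLS of x_t on Lam_hat, which under the
# normalization above reduces to Lam_hat' x_t / n.
F_hat = Xc @ Lam_hat / n

# Loadings re-estimated by time-series OLS of each series on F_hat.
Lam_ols = np.linalg.lstsq(F_hat, Xc, rcond=None)[0].T

print(np.allclose(Lam_ols, Lam_hat))
```

Working with the $n\times n$ matrix keeps the loadings as the primary eigen-object, which is the time-series-oriented choice the abstract argues for; the $T\times T$ route would instead extract the factors first.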
- [26] arXiv:2312.07792 (replaced) [pdf, ps, html, other]
Title: Differentially private projection-depth-based medians
Comments: 44 pages, 1 figure
Subjects: Statistics Theory (math.ST); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
We develop $(\epsilon,\delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure (e.g., we do not assume any moment bounds), we quantify the probability that the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We then present a new definition of the finite sample breakdown point which applies to a mechanism, and present a lower bound on the finite sample breakdown point of the projection-depth-based median. We demonstrate our main results on the canonical projection-depth-based median, as well as on projection-depth-based medians derived from trimmed estimators. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation. In the Cauchy setting, we show that the "outlier error amplification" effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics, which may be of general interest.
- [27] arXiv:2405.07836 (replaced) [pdf, ps, other]
Title: Forecasting with Hyper-Trees
Comments: Forecasting, Gradient Boosting, Hyper-Networks, LightGBM, Parameter Non-Stationarity, Time Series, XGBoost
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
This paper introduces the concept of Hyper-Trees and offers a new direction in applying tree-based models to time series data. Unlike conventional applications of decision trees that forecast time series directly, Hyper-Trees are designed to learn the parameters of a target time series model. Our framework leverages the gradient-based nature of boosted trees, which allows us to extend the concept of Hyper-Networks to Hyper-Trees and to induce a time-series inductive bias to tree models. By relating the parameters of a target time series model to features, Hyper-Trees address the issue of parameter non-stationarity and enable tree-based forecasts to extend beyond their training range. With our research, we aim to explore the effectiveness of Hyper-Trees across various forecasting scenarios and to extend the application of gradient boosted decision trees outside their conventional use in time series modeling.