Knockoffs (statistics)

In statistics, the knockoff filter, or simply knockoffs, is a framework for variable selection. It was originally introduced for linear regression by Rina Barber and Emmanuel Candès,[1] and later generalized to other regression models in the random design setting.[2] The knockoff framework has found application in many practical areas, notably in genome-wide association studies.[2][3]

Fixed-X knockoffs

Consider a linear regression model with response vector y and feature matrix X, which is treated as deterministic. A matrix X̃ is said to be a knockoff copy of X if it does not depend on y and satisfies X̃ᵀX̃ = XᵀX and XⱼᵀX̃ₖ = XⱼᵀXₖ for all j ≠ k, where Xⱼ and X̃ₖ denote columns of X and X̃. Barber and Candès showed that, equipped with a suitable feature importance statistic, fixed-X knockoffs can be used for variable selection while controlling the false discovery rate (FDR).
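As an illustration of the conditions above, the equicorrelated construction of Barber and Candès can be sketched in numpy as follows. This is a minimal sketch, not the authors' reference code: the function name is mine, it assumes n ≥ 2p and a full-rank X, and it shrinks s slightly below the theoretical maximum for numerical stability.

```python
import numpy as np

def fixed_x_knockoffs(X, rng):
    """Equicorrelated fixed-X knockoffs (assumes n >= 2p and full-rank X)."""
    n, p = X.shape
    assert n >= 2 * p, "fixed-X construction needs n >= 2p"
    X = X / np.linalg.norm(X, axis=0)            # normalize columns
    Sigma = X.T @ X                              # Gram matrix
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    # equicorrelated choice s_j = min(1, 2*lambda_min), shrunk for stability
    s = np.full(p, min(1.0, 2.0 * lam_min) * (1 - 1e-3))
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))
    # C satisfies C^T C = 2*diag(s) - diag(s) Sigma^{-1} diag(s)
    A = 2.0 * np.diag(s) - np.diag(s) @ Sigma_inv_S
    C = np.linalg.cholesky(A).T
    # U_tilde: n x p orthonormal matrix orthogonal to the column span of X
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U_tilde = Q[:, p:2 * p]
    X_ko = X @ (np.eye(p) - Sigma_inv_S) + U_tilde @ C
    return X, X_ko
```

One can check directly that the output satisfies X̃ᵀX̃ = XᵀX exactly and matches XᵀX̃ off the diagonal, while the diagonal of XᵀX̃ is shrunk by s — which is what makes each knockoff column distinguishable from its original.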

Model-X knockoffs

Consider a general regression model with response vector y and random feature matrix X. A matrix X̃ is said to be a knockoff copy of X if it is conditionally independent of y given X and satisfies a pairwise exchangeability condition: for any j ≤ p, the joint distribution of the random matrix [X, X̃] does not change if its jth and (j + p)th columns are swapped, where p is the number of features. While it is less clear how to create model-X knockoffs than their fixed-X counterpart, various algorithms have been proposed to construct them.[2][3][4][5] Once constructed, model-X knockoffs can be used for variable selection following the same procedure as fixed-X knockoffs, and they control the FDR.
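One concrete case where exact model-X knockoffs can be sampled in closed form is when the rows of X are i.i.d. multivariate Gaussian, as in Candès et al.[2] The sketch below is illustrative only (the function name is mine, and it uses the equicorrelated choice of s): exchangeability of [X, X̃] follows because the joint covariance has Σ on both diagonal blocks and Σ − diag(s) on both off-diagonal blocks.

```python
import numpy as np

def gaussian_model_x_knockoffs(X, mu, Sigma, rng):
    """Sample exact model-X knockoffs when rows of X are i.i.d. N(mu, Sigma)."""
    n, p = X.shape
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    # equicorrelated choice; s_j < 2*lambda_min keeps the joint covariance PSD
    s = np.full(p, 2.0 * lam_min * (1 - 1e-3))
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)
    cond_mean = X - (X - mu) @ Sigma_inv_S    # E[X_tilde | X], row-wise
    cond_cov = 2.0 * S - S @ Sigma_inv_S      # Cov[X_tilde | X]
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```

Marginally each knockoff row is again N(mu, Sigma), while Cov(Xⱼ, X̃ₖ) = Cov(Xⱼ, Xₖ) for j ≠ k, which is exactly the pairwise exchangeability needed.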

Properties

The knockoffs X̃ can be understood as negative controls. Informally speaking, knockoffs have the property that no method can statistically distinguish the original matrix X from its knockoffs X̃ without looking at the response y. Mathematically, the exchangeability conditions translate into a symmetry that allows the type I error to be estimated (e.g., if one wishes to control the FDR, the false discovery proportion is estimated), which then leads to exact type I error control.
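The estimated false discovery proportion drives the selection rule: with importance statistics W_j that are positive when a feature beats its knockoff, negative signs play the role of false-discovery counts. A minimal sketch of the resulting knockoff+ threshold (the helper name is mine; the formula is the standard one from Barber and Candès):

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t with estimated FDP (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target: select nothing

# toy importance statistics, one per feature
W = np.array([5.0, 4.0, -1.0, 3.0, 0.5, -0.2, 2.0, 1.5])
t = knockoff_plus_threshold(W, q=0.3)
selected = np.flatnonzero(W >= t)   # -> selects indices 0, 1, 3, 6, 7
```

Selecting {j : W_j ≥ t} with this data-dependent threshold is what yields exact FDR control at level q.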

Model-X knockoffs provide valid type I error control regardless of the unknown conditional distribution of y given X, and they can work with black-box variable importance statistics, including ones derived from complicated machine learning methods. The most significant challenge in implementing model-X knockoffs is that they require nontrivial knowledge of the distribution of X, which is usually high-dimensional. This knowledge can be gained with the help of unlabeled data.[2]

References

  1. Barber, Rina Foygel; Candès, Emmanuel J. (2015). "Controlling the false discovery rate via knockoffs". Annals of Statistics. 43 (5): 2055–2085.
  2. Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi (2018). "Panning for gold: model-X knockoffs for high dimensional controlled variable selection". Journal of the Royal Statistical Society, Series B. 80 (3): 551–577. arXiv:1610.02351.
  3. Sesia, Matteo; Sabatti, Chiara; Candès, Emmanuel (2019). "Gene hunting with hidden Markov model knockoffs". Biometrika. 106 (1): 1–18.
  4. Bates, Stephen; Candès, Emmanuel; Janson, Lucas; Wang, Wenshuo (2020). "Metropolized knockoff sampling". Journal of the American Statistical Association.
  5. Huang, Dongming; Janson, Lucas (2020). "Relaxing the assumptions of knockoffs by conditioning". Annals of Statistics.