The DeMaSk model

For a sequence s, DeMaSk models the impact of a substitution as a linear combination of:
  1. The Shannon entropy H_{s,p} of position p, computed across homologs of sequence s
  2. The log2-transformed variant frequency, log2 f_{s,p,var}, across homologs
  3. The DMS-derived average impact for the combination of wild-type and variant residue identity, stored in the substitution matrix D.

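That is, for a substitution from residue wt to residue var at position p in sequence s, the predicted impact is

\[
\hat{y}_{s,p,\mathrm{var}} = \beta_0 + \beta_1 H_{s,p} + \beta_2 \log_2 f_{s,p,\mathrm{var}} + \beta_3 D_{\mathrm{wt},\mathrm{var}}
\]

where \beta_0 is an intercept and \beta_1, \beta_2, \beta_3 are the coefficients of the three terms.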
We infer the coefficients by ordinary least squares regression on the variant fitness scores from the datasets below, together with the computed substitution matrix and the homologous sequences found for each protein using blastp.
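As an illustration of the fitting step, here is a minimal dependency-free sketch of ordinary least squares on the three features. It is not the DeMaSk implementation: the function names, the inclusion of an intercept term, and the toy data are assumptions made for this example, and in practice a numerical library would normally replace the hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, scores):
    """Return least-squares coefficients [intercept, entropy, log2f, matrix].

    features: list of (entropy, log2_frequency, matrix_value) tuples
    scores:   list of normalized fitness scores, in the same order
    The intercept column is an assumption of this sketch.
    """
    rows = [(1.0, *f) for f in features]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated from known (made-up) coefficients, so the fit
# should recover them up to floating-point error.
true_beta = [0.2, 0.1, 0.05, 0.6]
features = [(0.5, -1.0, 0.3), (2.0, -4.0, 0.1), (1.2, -2.5, 0.7),
            (3.1, -6.0, 0.2), (0.1, -0.5, 0.9), (2.5, -3.0, 0.4)]
scores = [true_beta[0] + sum(b * x for b, x in zip(true_beta[1:], f))
          for f in features]
beta = fit_ols(features, scores)
```

Because the toy scores are an exact linear function of the features, `beta` should match `true_beta` closely; real DMS scores are noisy, so the fit would only minimize the squared error.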

Once fitted, the model can be applied to any variant in a query protein by finding homologs, computing the relevant position’s Shannon entropy and the frequency of the variant at that position, and combining those with the appropriate substitution matrix element.
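The prediction step can be sketched as follows, assuming the homologs have already been found and aligned so that one alignment column is available as a string of residues. The function names, the intercept, the unweighted counting of homologs, and the optional pseudocount (used to avoid log2(0) for unobserved variants) are assumptions of this sketch and may differ from what DeMaSk actually does.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy, in bits, of the residues in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def log2_variant_frequency(column, variant, pseudocount=1.0, alphabet_size=20):
    """log2 of the variant's frequency in the column.

    The pseudocount is an assumption of this sketch; it keeps the
    frequency positive when the variant is absent from all homologs.
    """
    count = column.count(variant)
    freq = (count + pseudocount) / (len(column) + pseudocount * alphabet_size)
    return math.log2(freq)


def predict_impact(column, wt, variant, matrix, coefs):
    """Combine the three features linearly.

    matrix: dict mapping (wt, variant) to the substitution matrix element
    coefs:  (intercept, entropy_coef, frequency_coef, matrix_coef)
    """
    b0, b_entropy, b_freq, b_matrix = coefs
    return (b0
            + b_entropy * shannon_entropy(column)
            + b_freq * log2_variant_frequency(column, variant)
            + b_matrix * matrix[(wt, variant)])


# Hypothetical inputs: residues seen at one position across eight homologs,
# a one-element matrix, and made-up coefficients.
column = "AAAAVVLA"
matrix = {("A", "V"): -0.1}
score = predict_impact(column, "A", "V", matrix, (0.1, 0.05, 0.03, 0.8))
```

Gaps and sequence weighting are ignored here for brevity; a real pipeline would need to decide how to treat both.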

The directional substitution matrix

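The matrix D is computed from the DMS datasets in the table below. Each element gives the average impact for one combination of wild-type and variant residue identity.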
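One simple way to picture a directional matrix is as a per-pair average of normalized scores, keyed by the ordered pair (wild-type, variant). The sketch below assumes a plain unweighted mean over all records; the record format and function name are hypothetical, and the actual DeMaSk procedure may weight or transform the scores differently.

```python
from collections import defaultdict


def directional_matrix(records):
    """Average normalized scores for each ordered (wt, var) residue pair.

    records: iterable of (wt, var, normalized_score) tuples pooled
             across datasets (hypothetical format for this sketch)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for wt, var, score in records:
        totals[(wt, var)] += score
        counts[(wt, var)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Made-up scores: A->V and V->A are stored separately, so the matrix
# need not be symmetric.
records = [("A", "V", 0.8), ("A", "V", 0.6), ("V", "A", 0.3), ("L", "P", 0.1)]
D = directional_matrix(records)
```

Keying on the ordered pair is what makes the matrix directional: the entry for A to V is independent of the entry for V to A.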

DMS datasets used to compute the matrix

Within each selected dataset, the fitness values for all variants are rank-normalized since fitness metrics are dependent on experimental design. For this reason, datasets were included only from DMS studies that measured all or nearly all possible amino acid substitutions in a protein so that rank-normalized scores had a consistent interpretation across proteins. The measure of fitness must also be related to the protein’s function, which excludes, for example, studies that measured the protein’s evasion of a host’s immune system. In cases where multiple datasets cover the same protein, the datasets were merged by averaging normalized fitness scores for the same variant.
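The normalization and merging described above can be sketched as follows. The scaling of ranks to the range [0, 1], the average-rank handling of ties, and the variant labels are assumptions of this example rather than details confirmed by the text.

```python
def rank_normalize(scores):
    """Map {variant: raw_score} to ranks scaled to [0, 1].

    Tied raw scores receive the average of their ranks. The [0, 1]
    scale is an assumption of this sketch.
    """
    items = sorted(scores.items(), key=lambda kv: kv[1])
    n = len(items)
    normalized = {}
    i = 0
    while i < n:
        # Extend j to the end of the group tied with items[i].
        j = i
        while j + 1 < n and items[j + 1][1] == items[i][1]:
            j += 1
        avg_rank = (i + j) / 2  # zero-based average rank of the group
        for k in range(i, j + 1):
            normalized[items[k][0]] = avg_rank / (n - 1) if n > 1 else 0.5
        i = j + 1
    return normalized


def merge_datasets(datasets):
    """Rank-normalize each dataset, then average scores per variant."""
    sums, counts = {}, {}
    for dataset in datasets:
        for variant, score in rank_normalize(dataset).items():
            sums[variant] = sums.get(variant, 0.0) + score
            counts[variant] = counts.get(variant, 0) + 1
    return {variant: sums[variant] / counts[variant] for variant in sums}


# Two hypothetical datasets for the same protein, measured on different scales.
study_a = {"A2V": 1.5, "A2L": 0.2, "K3R": 0.9}
study_b = {"A2V": 12.0, "A2L": 3.0}
merged = merge_datasets([study_a, study_b])
```

After normalization the two studies agree that A2V ranks above A2L even though their raw values are on different scales, which is the point of ranking before averaging.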


| Study | PMID | Protein | Species | Positions mutated |
|---|---|---|---|---|
| Bloom 2014 | 24859245 | NP | influenza (A/WSN/1933) | 498 |
| Brenan et al. 2016 | 27760319 | MAPK1/ERK2 | human | 359 |
| Doud and Bloom 2016 | 27271655 | H1 HA | influenza (A/WSN/1933) | 564 |
| Firnberg et al. 2014 | 24567513 | TEM-1 | E. coli | 286 |
| Giacomelli et al. 2018 | 30224644 | TP53 | human | 393 |
| Haddox et al. 2018 | 29590010 | Env | HIV (BF520) | 662 |
| Haddox et al. 2018 | 29590010 | Env | HIV (BG505) | 670 |
| Heredia et al. 2018 | 29678950 | CCR5 | human | 351 |
| Heredia et al. 2018 | 29678950 | CXCR4 | human | 351 |
| Kelsic et al. 2016 | 28009265 | IF1 | E. coli | 72 |
| Klesmith et al. 2015 | 26369947 | LGK | Lipomyces starkeyi (oleaginous yeast) | 439 |
| Mavor et al. 2016 | 27111525 | RL401 | yeast | 75 |
| Melnikov et al. 2014 | 24914046 | APH(3')II | E. coli | 264 |
| Roscoe et al. 2013 | 23376099 | RL401 | yeast | 75 |
| Roscoe and Bolon 2014 | 24862281 | RL401 | yeast | 75 |
| Stiffler et al. 2015 | 25723163 | TEM-1 | E. coli | 263 |
| Thyagarajan and Bloom 2014 | 25006036 | H1 HA | influenza | 564 |
| Weile et al. 2017 | 29269382 | CALM1 | human | 149 |
| Weile et al. 2017 | 29269382 | SUMO1 | human | 101 |
| Weile et al. 2017 | 29269382 | TPK1 | human | 243 |
| Weile et al. 2017 | 29269382 | UBE2I | human | 159 |
| Wrenbeck et al. 2017 | 28585537 | amiE | P. aeruginosa | 341 |