DeMaSk: protein substitution impact prediction

For a sequence s, DeMaSk models the substitution impact as a linear combination of three terms:

The Shannon entropy H_s,p of position p, computed across homologs of sequence s;
The log₂ variant frequency, log₂ f_s,p,var, across the same homologs;
The DMS-derived average impact for the combination of wild-type and variant residue identity, stored in the substitution matrix D.

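That is, for a substitution from residue wt to residue var at position p in sequence s, the predicted score is

score_s,p,var = β₀ + β₁ H_s,p + β₂ log₂ f_s,p,var + β₃ D_wt,var

where β₀ is the intercept and β₁, β₂, and β₃ are the fitted coefficients.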

We use ordinary least squares regression to infer the coefficients using the variant fitness scores from the datasets below, along with the computed substitution matrix and the homologous sequences found for each protein using blastp.
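As an illustration of the fitting step, here is a minimal sketch of ordinary least squares over the three predictors, written in plain Python so it runs without extra packages. The function name `fit_ols`, the inclusion of an intercept term, and the input layout are assumptions made for this example, not the DeMaSk implementation.

```python
def fit_ols(features, targets):
    """Fit y ~ b0 + b1*entropy + b2*log2_freq + b3*matrix_value by OLS.

    features: list of (entropy, log2_freq, matrix_value) tuples, one per variant.
    targets:  list of fitness scores in the same order.
    Returns [b0, b1, b2, b3]. The intercept b0 is an assumption of this sketch.
    """
    rows = [[1.0, *f] for f in features]  # leading 1.0 column models the intercept
    k = len(rows[0])
    # Build the normal equations (X^T X) beta = X^T y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]
```

For real data with many thousands of variants, a library routine such as `numpy.linalg.lstsq` would be the more usual and numerically robust choice; the hand-rolled solver above is only meant to make the computation explicit.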

Once fitted, the model can be applied to any variant in a query protein by finding homologs, computing the relevant position’s Shannon entropy and the frequency of the variant at that position, and combining those with the appropriate substitution matrix element.
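The application step can be sketched as follows, assuming the homologs have already been aligned to the query so that one alignment column is available per position. The pseudocount, the choice of bits as the entropy unit, the handling of gaps, the presence of an intercept, and the coefficient key names are all assumptions made for this illustration; the actual tool may differ on each point.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_stats(column, variant, pseudocount=1.0):
    """Return (Shannon entropy in bits, log2 variant frequency) for one column.

    column:  string of residues observed at this position across homologs.
    variant: single-letter code of the substituted residue.
    The pseudocount (an assumption) keeps unseen residues at nonzero frequency.
    """
    # Gaps and non-standard symbols are ignored (an assumption of this sketch).
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    if total == 0:
        raise ValueError("column has no residues and pseudocount is zero")
    freqs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    # math.log2 raises ValueError if the variant frequency is exactly zero.
    return entropy, math.log2(freqs[variant])

def demask_score(entropy, log2_freq, matrix_value, coefs):
    """Combine the three predictors using fitted coefficients (hypothetical keys)."""
    return (
        coefs["intercept"]
        + coefs["entropy"] * entropy
        + coefs["log2f_var"] * log2_freq
        + coefs["matrix"] * matrix_value
    )
```

A caller would look up `matrix_value` as the entry of D for the wild-type and variant residues, then pass it together with the two column statistics to `demask_score`.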

The directional substitution matrix

The matrix is computed from the DMS datasets in the table below.

Within each selected dataset, the fitness values for all variants are rank-normalized since fitness metrics are dependent on experimental design. For this reason, datasets were included only from DMS studies that measured all or nearly all possible amino acid substitutions in a protein so that rank-normalized scores had a consistent interpretation across proteins. The measure of fitness must also be related to the protein’s function, which excludes, for example, studies that measured the protein’s evasion of a host’s immune system. In cases where multiple datasets cover the same protein, the datasets were merged by averaging normalized fitness scores for the same variant.
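The normalization and merging described above can be illustrated with a short sketch. Scaling ranks to the interval [0, 1], averaging ranks within ties, and the variant labels used as dictionary keys are assumptions of this example; the text specifies only that scores are rank-normalized per dataset and averaged per variant when datasets overlap.

```python
from collections import defaultdict

def rank_normalize(scores):
    """Map {variant: fitness} to {variant: rank scaled to [0, 1]}.

    Tied fitness values share the average of their ranks (an assumption).
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    ranks = {}
    i = 0
    while i < n:
        j = i
        # Extend j to cover the whole group of variants tied with ordered[i].
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2  # average 0-based rank of the tied group
        for variant in ordered[i:j + 1]:
            ranks[variant] = mean_rank / (n - 1) if n > 1 else 0.5
        i = j + 1
    return ranks

def merge_datasets(datasets):
    """Rank-normalize each dataset, then average scores for shared variants."""
    collected = defaultdict(list)
    for dataset in datasets:
        for variant, score in rank_normalize(dataset).items():
            collected[variant].append(score)
    return {variant: sum(vals) / len(vals) for variant, vals in collected.items()}
```

Normalizing before merging matters here: raw fitness values from two studies of the same protein are generally on different scales, so only their ranks are comparable.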

Study	PMID	Protein	Species	Positions mutated
Bloom 2014	24859245	NP	influenza (A/WSN/1933)	498
Brenan et al. 2016	27760319	MAPK1/ERK2	human	359
Doud and Bloom 2016	27271655	H1 HA	influenza (A/WSN/1933)	564
Firnberg et al. 2014	24567513	TEM-1	E. coli	286
Giacomelli et al. 2018	30224644	TP53	human	393
Haddox et al. 2018	29590010	Env	HIV (BF520)	662
Haddox et al. 2018	29590010	Env	HIV (BG505)	670
Heredia et al. 2018	29678950	CCR5	human	351
Heredia et al. 2018	29678950	CXCR4	human	351
Kelsic et al. 2016	28009265	IF1	E. coli	72
Klesmith et al. 2015	26369947	LGK	Lipomyces starkeyi (Oleaginous yeast)	439
Mavor et al. 2016	27111525	RL401	yeast	75
Melnikov et al. 2014	24914046	APH(3')II	E. coli	264
Roscoe et al. 2013	23376099	RL401	yeast	75
Roscoe and Bolon 2014	24862281	RL401	yeast	75
Stiffler et al. 2015	25723163	TEM-1	E. coli	263
Thyagarajan and Bloom 2014	25006036	H1 HA	influenza	564
Weile et al. 2017	29269382	CALM1	human	149
Weile et al. 2017	29269382	SUMO1	human	101
Weile et al. 2017	29269382	TPK1	human	243
Weile et al. 2017	29269382	UBE2I	human	159
Wrenbeck et al. 2017	28585537	amiE	P. aeruginosa	341
