Influenza A Virus (IAV) is remarkably adept at surviving in human populations. IAV thrives even among populations with widespread access to vaccines and antiviral drugs, and continues to be a major cause of morbidity and mortality. Correlated mutations are an important factor in IAV's evolution and are critical for host adaptation and pathogenicity. Large sets of publicly available IAV sequences, combined with the virus's rapid and complex evolutionary dynamics, present interesting opportunities and unique challenges for analyzing correlated mutations in influenza proteomes.

In this work, we performed a comprehensive analysis of correlated mutations in IAV using a network theory approach in which residues in each protein act as nodes in a graph and edges are created based on inter-residue correlated mutations. Our approach used the 'maximal information coefficient' (MIC) to compute correlations between residues, and we created an edge between two nodes if their MIC exceeded a threshold. We built a modular and robust pipeline and applied it to multiple datasets belonging to the H1N1, H3N2, H5 and H7N9 subtypes. We studied the structural dynamics of IAV sub-systems based on topological properties of their networks, which led to several important conclusions. We identified the nodes with the highest degree, along with the edges and triplets with the strongest weights, for each network. To contextualize our results, we performed entropy analysis to gain a global view of sequence variation, and computed solvent accessibility profiles to identify statistical differences in correlation profiles between surface and buried residues. We also computed residue co-occurrence counts to understand the internal mechanics behind MIC.
Click here if you want a quick primer on correlated mutations in proteins.
In this blog post, I provide an overview of our methodology for performing network analysis of correlated mutations in Influenza A.
This picture gives a very high-level overview of the steps in our pipeline.

Preprocessing of data
Download strains from fludb
- We searched for ‘strains’ using the Search Data –> Search Sequences –> Strain Data from fludb’s main menu.
- Depending on the specific dataset of interest, we selected the appropriate options. For all our datasets, we always selected Influenza A strains with complete genomes only, and we limited our search results to strains from a pure sub-type (excluding all mixed sub-types). We did not restrict our search results to any geographic grouping or country. The 'sub type', 'host' and 'flu season' options varied for each of our datasets.
- After selecting the appropriate options, we submitted the search and selected the results on the following screen. We downloaded the strains in 'Protein FASTA' format, which produces a single FASTA file containing the sequences of all 10 proteins of interest.
Primary datasets used in this effort
| name | count | comments |
|---|---|---|
| HUMAN_H1N1_ALL | 1769 | Human H1N1 strains from all years, maximum 300 strains per year |
| HUMAN_H3N2_ALL | 1940 | Human H3N2 strains from all years, maximum 300 strains per year |
| SWINE_H1N1_ALL | 1096 | All available Swine H1N1 strains |
| SWINE_H3N2_ALL | 794 | All available Swine H3N2 strains |
| AVIAN_H5_ALL | 1463 | All available Avian H5 strains |
| H7N9_ALL | ~500 | All available H7N9 strains |
Since the FASTA file that we downloaded contains sequences from multiple IAV proteins, we split it into 10 separate FASTA files containing protein sequences for HA, NA, M1, M2, NP, PA, NS1, NS2, PB1 & PB2 proteins. Sequences are then aligned using the MUSCLE alignment program. We then perform de-duplication of the sequences across strains by concatenating the sequences of all 10 proteins for a given strain and looking for a 100% match. Strains with duplicate sequences are discarded. We generate 10 aligned FASTA files with sequences for each of the 10 proteins after de-duplication.
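The de-duplication step above can be sketched as follows. This is a minimal illustration, assuming the sequences have already been grouped per strain; the strain names and (truncated) sequences are hypothetical.

```python
def deduplicate_strains(strain_seqs):
    """Drop strains whose concatenated protein sequence exactly matches
    one already seen (i.e. a 100% match across all proteins)."""
    seen = set()
    kept = {}
    for strain, proteins in strain_seqs.items():
        # Concatenate the sequences of all proteins for this strain,
        # in a fixed protein order so the comparison is well defined.
        key = "".join(proteins[p] for p in sorted(proteins))
        if key not in seen:
            seen.add(key)
            kept[strain] = proteins
    return kept

# Hypothetical example: two strains share identical sequences,
# so only the first of the pair is kept.
strains = {
    "A/x/2009": {"HA": "MKAIL", "NA": "MNPNQ"},
    "A/y/2009": {"HA": "MKAIL", "NA": "MNPNQ"},
    "A/z/2009": {"HA": "MKTIL", "NA": "MNPNQ"},
}
unique = deduplicate_strains(strains)
```

In the real pipeline, the per-strain dictionaries would be built from the 10 aligned FASTA files rather than hard-coded.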
MIC Computation
In our MIC-Computation pipeline (depicted in the figure below), we compute the pair-wise MIC correlation coefficient between residues in IAV proteins. We focused only on inter-protein correlations and therefore did not compute correlations between residues within the same protein. For a set of 10 proteins, there are a total of 45 pair-wise combinations. This pipeline is invoked once for each combination of proteins (examples: [HA, NA], [M1, M2], [NP, NS1], [NA, PB1]).
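The 45 combinations are simply the unordered pairs that can be chosen from the 10 proteins:

```python
from itertools import combinations

PROTEINS = ["HA", "NA", "M1", "M2", "NP", "PA", "NS1", "NS2", "PB1", "PB2"]

# Each unordered pair of distinct proteins gets its own pipeline run:
# C(10, 2) = 45 combinations in total.
protein_pairs = list(combinations(PROTEINS, 2))
```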

There are two steps in this sub-flow that require elaboration.
- First, we must match the strains from the two FASTA files and create matrices that can be used for pair-wise MIC computation.
- Second, in an important step, the sequences in the matrices have to be converted to a binary [0,1] format based on the following convention: for each position i of a given protein (in the two proteins), the type of amino acid s in the multiple sequence alignment (MSA) is represented by a binary variable xi(s), where xi(s) = 0 if the amino acid is the most frequent amino acid at this position within the MSA and xi(s) = 1 if it is another amino acid. xi(s) is set to 0 for all sequences in the MSA if the number of gaps at position i exceeds 10%.
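A minimal sketch of this binarization for a single MSA column might look like the following (assuming '-' is the gap character; the 10% cut-off comes from the convention above, and the example column is hypothetical):

```python
from collections import Counter

def binarize_column(column, max_gap_fraction=0.10):
    """Binarize one MSA column: 0 for the most frequent amino acid,
    1 for any other amino acid. If the gap fraction exceeds the
    cut-off, the whole column is set to 0."""
    gaps = column.count("-")
    if gaps / len(column) > max_gap_fraction:
        return [0] * len(column)
    # Most frequent symbol in the column (ties broken arbitrarily).
    most_common = Counter(column).most_common(1)[0][0]
    return [0 if aa == most_common else 1 for aa in column]

# Column with consensus 'A' and one variant 'G' at the fifth sequence.
bits = binarize_column(list("AAAAGA"))
```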
Before computing MIC, we also discard all positions with zero variance to improve computation speed. We compute MIC between two residue positions using functions from the minepy library. We created two sets of CSV files: the first contains data for all MIC scores greater than 0.1, and the second contains data for all MIC scores greater than 0.5.
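The pipeline itself used minepy (its MINE class exposes compute_score and mic). For the binary variables produced by the step above, MIC reduces to the mutual information in bits, since the best grid over {0,1}x{0,1} data is 2x2 and the normalizer log2(min(2, 2)) equals 1. A stdlib-only illustration of that special case:

```python
from collections import Counter
from math import log2

def mutual_information_bits(x, y):
    """Mutual information I(X;Y) in bits between two equal-length
    binary sequences. For binary data this coincides with MIC."""
    n = len(x)
    pxy = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Perfectly correlated binary columns share 1 bit of information.
x = [0, 0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1]
mic = mutual_information_bits(x, y)
```

Two independent columns would score 0 bits, which matches the intuition that MIC near 0 means no correlated mutation signal.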
Creation of correlated mutation graphs
We designed the graph-creation sub-flow to produce graphs with the best possible analysis capabilities, and to realize that objective we created the graphs using two different methods and tool sets.
- We created property graphs in a Neo4j graph database as our first method, since this approach gave us a great deal of querying flexibility and the ability to scale to additional datasets.
- For our second approach, we created ‘graphml’ files that represented our mutation graphs and this allowed us to exploit features in Gephi for visualization and analysis of global characteristics.
This picture depicts this workflow with two separate branches for our two approaches.

We stored all nodes and edges with MIC > 0.1 in the Neo4j database, since we were interested in the overall node and edge counts at different cut-offs. We used a MIC value of 0.5 as the notional cut-off above which a correlation is significant, and hence created "graphml" files from the CSV files generated with the 0.5 MIC cut-off.
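As an illustration of the second branch, a minimal GraphML document can be built from (residue_i, residue_j, MIC) rows with nothing but the standard library. This is a sketch, not the pipeline's actual code; the residue identifiers of the form protein_position are hypothetical.

```python
import xml.etree.ElementTree as ET

def edges_to_graphml(rows, cutoff=0.5):
    """Build a minimal GraphML document from (src, dst, mic) rows,
    keeping only edges whose MIC exceeds the cut-off."""
    root = ET.Element("graphml",
                      {"xmlns": "http://graphml.graphdrawing.org/xmlns"})
    # Declare the edge weight attribute so tools like Gephi pick it up.
    ET.SubElement(root, "key", {"id": "w", "for": "edge",
                                "attr.name": "weight",
                                "attr.type": "double"})
    graph = ET.SubElement(root, "graph", {"edgedefault": "undirected"})
    seen = set()
    for src, dst, mic in rows:
        if mic <= cutoff:
            continue
        for node in (src, dst):
            if node not in seen:
                seen.add(node)
                ET.SubElement(graph, "node", {"id": node})
        edge = ET.SubElement(graph, "edge", {"source": src, "target": dst})
        ET.SubElement(edge, "data", {"key": "w"}).text = str(mic)
    return ET.tostring(root, encoding="unicode")

# Hypothetical residue pairs; the 0.31 edge is below the cut-off.
rows = [("HA_153", "NA_42", 0.82), ("HA_153", "NP_16", 0.31)]
xml_text = edges_to_graphml(rows)
```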
Graph Analysis
Our computational pipeline for 'graph analysis' (depicted in the picture below) comprises two methods.
1. Creation of a web application that allows users to
a. Search connections of a specific node
b. View plots depicting number of nodes & edges for different cut-off values
c. View top nodes, edges and triplets for each dataset
d. View global properties of graphs generated with a MIC filter > 0.5 including degree distribution, clustering coefficient, edge density and average degree
e. View interactive network visualizations for correlated mutation graphs (generated with a MIC filter > 0.5)
2. Use of the Gephi visualization and exploration platform to analyze correlated mutation graphs. Note that the results of any analysis we performed with Gephi (including visualizations, images and other data points) are included in our web application.
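The global properties in step 1d can be computed directly from an edge list. Here is a stdlib-only sketch for average degree, edge density and the degree distribution of an undirected simple graph (clustering coefficient omitted for brevity):

```python
from collections import defaultdict, Counter

def global_properties(edges):
    """Average degree, edge density and degree distribution for an
    undirected simple graph given as a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    avg_degree = 2 * m / n
    # Fraction of the n*(n-1)/2 possible edges that are present.
    density = 2 * m / (n * (n - 1))
    degree_dist = Counter(len(nbrs) for nbrs in adj.values())
    return avg_degree, density, degree_dist

# A triangle: every node has degree 2 and the density is 1.0.
avg, dens, dist = global_properties([("a", "b"), ("b", "c"), ("a", "c")])
```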

Post-processing/additional analysis
In addition to analyzing the networks generated for the different datasets to determine their structural and topological properties, we performed post-processing steps to answer specific questions. First, we conducted entropy analysis to understand the amount of variation in different proteins across all datasets and thereby identify potential associations between entropy and MIC correlations. Second, we wanted to know whether most in-network residues lie on the surface of a protein with higher solvent accessibility values, and we conducted solvent accessibility analysis to answer this question. Third, we calculated residue co-occurrence counts for pairs to gain better insight into the underlying mechanics of the MIC algorithm for our use case. Finally, we created 'protein networks' to provide a macro picture and depict the extent of correlations between proteins.
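For the entropy analysis, the standard per-position measure is the Shannon entropy of each MSA column; a small sketch (the column contents are hypothetical):

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of one MSA column. Higher values
    mean more amino-acid variation at that position."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A fully conserved position has zero entropy;
# a 50/50 split between two residues carries 1 bit.
conserved = column_entropy(list("AAAA"))
variable = column_entropy(list("AAGG"))
```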