Protein subcellular localization

    Protein subcellular localization prediction is the task of predicting where in the cell a given protein functions. It is considered an important step towards protein function prediction and drug design. Here you are provided with a benchmark dataset of 523 Gram-negative proteins (proteins from Gram-negative bacteria). These proteins belong to 4 groups, corresponding to 4 locations in the Gram-negative bacterial cell. Hence, we can formulate the problem as a multi-class classification task (523 samples belonging to 4 classes). The proteins, their class labels, and a list of 55 physicochemical properties of the amino acids (the 20 letters that make up a protein sequence) are available on Canvas, in a module called Subcellular_data.
We can extract different kinds of features to represent a protein sequence. Two examples are amino acid occurrence and amino acid composition, defined as follows:
Amino acid occurrence: count how many times each amino acid appears in the sequence. This gives a vector of 20 numbers, one count per amino acid, for each protein. The sum of these 20 counts equals the length of the protein.
Amino acid composition: count the occurrences of each amino acid and then normalize by dividing each count by the length of the protein sequence. This gives a vector of 20 numbers, one fraction per amino acid, for each protein. The sum of these 20 numbers equals 1 (since they are already divided by the protein length).
These two feature vectors are also available in the same module. There is also an article (Dehzangi2015) in the same module that provides more information about the problem.
Your assignment is to work with your group and think of a feature vector to represent each protein for classification. Each group will propose a feature, implement it in Python or R, and extract the feature vector.
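For reference, the two baseline features defined above (amino acid occurrence and amino acid composition) can be sketched in Python as below. This is only a minimal illustration, not the code used to produce the files on Canvas: the function names, the alphabetical ordering of the 20 amino acids, and the toy sequence are assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code. Alphabetical order is an
# assumption; the feature files on Canvas may order the columns differently.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def occurrence(sequence):
    """Return a list of 20 counts, one per amino acid in AMINO_ACIDS."""
    counts = Counter(sequence.upper())
    return [counts[aa] for aa in AMINO_ACIDS]


def composition(sequence):
    """Return a list of 20 fractions: each count divided by the sequence length."""
    length = len(sequence)
    if length == 0:
        raise ValueError("cannot compute composition of an empty sequence")
    return [count / length for count in occurrence(sequence)]


# Made-up toy sequence, not taken from the dataset.
example = "MKKLLA"
occ = occurrence(example)
comp = composition(example)

print(occ)        # 20 counts
print(sum(occ))   # equals len(example) when every residue is a standard amino acid
print(sum(comp))  # approximately 1, up to floating-point rounding
```

If a sequence contains letters outside the 20 standard amino acids (for example X), those residues are not counted, so the occurrence vector sums to less than the sequence length and the composition vector sums to less than 1.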
One suggestion is as follows: choose a physicochemical property (e.g., Volume); for each amino acid in the protein sequence, look up its corresponding value and sum these values along the sequence; then normalize the sum by the length of the protein sequence. In this way, you extract 1 feature for a given protein.
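A minimal Python sketch of this suggestion is given below. The function name `property_average` and the dictionary `toy_volume` are hypothetical, and the numbers in `toy_volume` are placeholders chosen for easy arithmetic, not real physicochemical values; in practice the values would be read from the property list on Canvas.

```python
def property_average(sequence, property_values):
    """Sum a per-residue property along the sequence and divide by its length.

    property_values maps a one-letter amino acid code to a number.
    A residue missing from the mapping raises KeyError.
    """
    if not sequence:
        raise ValueError("cannot compute a feature for an empty sequence")
    total = sum(property_values[aa] for aa in sequence.upper())
    return total / len(sequence)


# Placeholder values for illustration only; NOT real Volume data.
toy_volume = {"A": 1.0, "K": 3.0, "L": 2.0, "M": 4.0}

# Made-up toy sequence: (4 + 3 + 3 + 2 + 2 + 1) / 6
print(property_average("MKKLLA", toy_volume))  # 2.5
```

Note that this feature is equal to the weighted sum of the amino acid composition vector, with the property values as weights. It therefore carries no information beyond the composition itself, which is worth keeping in mind when your group designs its own feature.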