A novel Fibonacci hash method for protein family identification by using recurrent neural networks
Abstract
Identification and classification of protein families is one of the most significant problems in bioinformatics and protein studies. It is essential to specify the family of a protein, since proteins are widely used in smart drug therapies, protein function studies, and, in some cases, phylogenetic trees. Some sequencing techniques allow researchers to identify the biological similarities of protein families and functions. Yet determining these families with sequencing applications requires a huge amount of time. Thus, a computer and artificial intelligence based classification system is needed to save time and reduce complexity in the protein classification process. To designate protein families with computer-aided systems, protein sequences need to be converted to numerical representations. In this paper, we provide a novel protein mapping method based on Fibonacci numbers and a hash table (FIBHASH). Each amino acid code is assigned a Fibonacci number according to its integer representation. These amino acid codes are then inserted into a hash table of size 20 and classified with recurrent neural networks. To determine the performance of the proposed mapping method, we used accuracy, F1-score, recall, precision, and AUC as evaluation criteria. In addition, we compared these results with those of other protein mapping techniques, including EIIP, hydrophobicity, CPNR, Atchley factors, BLOSUM62, PAM250, binary one-hot encoding, and randomly encoded representations. The proposed method showed promising results, with an accuracy of 92.77% and an AUC score of 0.98.
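
The mapping idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact assignment or hashing rule, so everything below is an assumption made for illustration only: the alphabetical ordering of the 20 standard amino acid codes, the Fibonacci sequence starting at 1, 2 (so that all values are distinct), the modulo hash function, and the hypothetical names `fibonacci`, `FIB_CODE`, `encode`, and `hash_features`. It is not the FIBHASH implementation itself.

```python
# Illustrative sketch only: the ordering, the hash rule, and all names are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes (assumed alphabetical order)
TABLE_SIZE = 20                        # hash table size stated in the abstract


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 2, 3, 5, ... (all distinct)."""
    values = []
    a, b = 1, 2
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values


# Assumed assignment: the i-th amino acid code receives the i-th Fibonacci number.
FIB_CODE = dict(zip(AMINO_ACIDS, fibonacci(len(AMINO_ACIDS))))


def encode(sequence):
    """Map a protein sequence to its list of Fibonacci codes."""
    return [FIB_CODE[residue] for residue in sequence.upper()]


def hash_features(sequence, table_size=TABLE_SIZE):
    """Count Fibonacci codes per bucket, using code % table_size as an assumed hash."""
    table = [0] * table_size
    for code in encode(sequence):
        table[code % table_size] += 1
    return table


if __name__ == "__main__":
    example = "ACDA"  # toy sequence, not taken from any dataset
    print(encode(example))
    print(hash_features(example))
```

Under these assumptions, every sequence, whatever its length, yields a fixed-length vector of 20 bucket counts, which is one plausible way such a representation could be fed to a recurrent neural network classifier.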