Python Deep Learning Projects

Abstract:

Protein sequence data from computational biology and bioinformatics are ideal input for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference cost. We trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on UniRef and BFD data containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction showed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using these embeddings as exclusive input for several downstream tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3 = 81%-87%); (2) per-protein (pooled) predictions of sub-cellular location (ten-state accuracy Q10 = 81%) and of membrane versus water-soluble proteins (2-state accuracy Q2 = 91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state of the art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results suggested that pLMs learned some of the grammar of the language of life.
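
The downstream tasks above treat the pre-trained pLM purely as a feature extractor: per-residue embeddings feed the secondary-structure predictor, while a pooled per-protein embedding feeds the location and membrane classifiers. The sketch below is a minimal illustration of that idea, not the project's own code; it assumes the publicly released ProtT5 encoder checkpoint Rostlab/prot_t5_xl_half_uniref50-enc on Hugging Face, the transformers, torch, and sentencepiece packages, an illustrative example sequence, and simple mean pooling.

```python
# Minimal sketch: using ProtT5 embeddings as the exclusive input features.
# Assumptions (not from the abstract): the public Hugging Face checkpoint
# "Rostlab/prot_t5_xl_half_uniref50-enc", the transformers/torch/sentencepiece
# packages, and the example sequence below.
import re

import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
model = T5EncoderModel.from_pretrained(checkpoint).to(device)
model.eval()

# ProtT5 expects space-separated residues; rare amino acids map to X.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]  # illustrative sequence only
sequences = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequences]

batch = tokenizer.batch_encode_plus(
    sequences, add_special_tokens=True, padding="longest", return_tensors="pt"
).to(device)

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"])

# Per-residue (per-token) embeddings: one 1024-d vector per amino acid,
# dropping the special end-of-sequence token added by the tokenizer.
seq_len = len(sequences[0].replace(" ", ""))
per_residue = out.last_hidden_state[0, :seq_len]   # shape: (L, 1024)

# Per-protein embedding for pooled tasks (e.g. sub-cellular location):
# a simple mean over the residue dimension.
per_protein = per_residue.mean(dim=0)              # shape: (1024,)

print(per_residue.shape, per_protein.shape)
```

In a full pipeline, the per-residue matrix would be passed to a small per-token classifier (e.g. a shallow CNN) for secondary structure, and the pooled vector to a feed-forward classifier for sub-cellular location or membrane versus water-soluble prediction.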

Note: Please discuss with our team before submitting this abstract to your college. The abstract or synopsis can be adapted to individual student project requirements.


To download this project's code along with the thesis report and project training, Click Here.
