SplitSeek-Pro Prediction

Instructions

Descriptions
- SplitSeek-Pro is a deep learning–based method for evaluating the feasibility of protein engineering strategies that split a protein at a specific residue, such as circular permutation or split-reconstitution. The method integrates sequence information (ESM2-650M embeddings and AAindex encodings) with structural information (pairwise atomic distances) to estimate a splitting probability for each amino acid residue.
- Currently, users can submit files in PDB format to obtain per-residue splitting probability predictions.
- On the web server, prediction results are displayed on a white-to-red color scale representing scores from 0 to 1. SplitSeek-Pro also outputs a predicted PDB file in which the original B-factor column is replaced with the corresponding splitting scores (0–100). Users can download this file and visualize it locally (e.g., in PyMOL) by coloring residues by B-factor.
- A score of 0.5 is recommended as the threshold to distinguish between feasible and infeasible splitting sites.
- We will continue to improve the prediction accuracy as more data become available.
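The per-residue scores can also be read back from the downloaded PDB file with a few lines of Python. The sketch below is illustrative only, not part of SplitSeek-Pro: the function names are hypothetical, and it assumes that the B-factor value is the 0–1 score multiplied by 100, that every atom of a residue carries the same value (so reading the CA atom is enough), and that a score equal to the threshold counts as feasible.

```python
def read_split_scores(pdb_text, scale=100.0):
    """Map (chain ID, residue number) -> splitting score on the 0-1 scale.

    Assumption: the B-factor column holds score * 100, so dividing by
    `scale` recovers the 0-1 score shown on the web server.
    Insertion codes are ignored in this sketch.
    """
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-based): atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66]) / scale
    return scores


def feasible_sites(scores, threshold=0.5):
    """Return residues whose score meets the recommended 0.5 threshold."""
    return sorted(key for key, s in scores.items() if s >= threshold)
```

To use it, pass the text of the downloaded file, for example the result of reading a file opened with Python's built-in open(), to read_split_scores and then filter the result with feasible_sites.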
Tips
- The predicted score at residue n corresponds to the feasibility of splitting the protein between residues n and n+1.
- Continuous regions with ≥3 consecutive residues scoring above 0.8 generally indicate high splitting feasibility. An isolated high-scoring residue that is not part of such a run is less likely to be practically feasible.
- The model was trained in two stages: (1) pretraining on a computational splitting-probability dataset and (2) fine-tuning on circular permutation data from structurally similar proteins. Because the fine-tuning dataset is derived from circular permutation examples, predictions are biased toward identifying sites for circular permutation.
- Predictions tend to be more reliable for proteins with fewer than 400 residues.
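The first two tips can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the server's own post-processing: the function names and the optional linker argument are hypothetical, residues are assumed to be numbered from 1 without gaps, and the permutant is built with the usual construction in which residue n+1 becomes the new N-terminus.

```python
def high_confidence_runs(scores, cutoff=0.8, min_len=3):
    """Find runs of >= min_len consecutive residues scoring above cutoff.

    scores: list of (residue_number, score) pairs in sequence order.
    Returns a list of (first_residue, last_residue) tuples.
    """
    runs, current = [], []
    for resnum, score in scores:
        contiguous = bool(current) and resnum == current[-1] + 1
        if score > cutoff and (not current or contiguous):
            current.append(resnum)
        else:
            # The run ended (low score or numbering gap): keep it if long enough.
            if len(current) >= min_len:
                runs.append((current[0], current[-1]))
            current = [resnum] if score > cutoff else []
    if len(current) >= min_len:
        runs.append((current[0], current[-1]))
    return runs


def circular_permutant(sequence, n, linker=""):
    """Sequence obtained by splitting between residues n and n+1 (1-based).

    The original termini are joined, optionally through `linker`
    (an assumption of this sketch), and residue n+1 becomes residue 1.
    """
    if not 1 <= n < len(sequence):
        raise ValueError("n must satisfy 1 <= n < len(sequence)")
    return sequence[n:] + linker + sequence[:n]
```

For example, a site n taken from inside a run returned by high_confidence_runs can be passed to circular_permutant together with the protein sequence to write out the candidate construct.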
Shortcomings
- While the model can partially recognize nonsplittable residues near active sites, the absence of explicit active-site annotations might lead to false positives due to perturbation of functionally relevant residues or ghost effects.
- The model is primarily optimized for distinguishing splittable sites in loop regions. It currently has limited accuracy for identifying splittable sites in rigid secondary structural elements, which is the main source of false negatives.
- Circular permutation at residues located near the termini is usually well tolerated. However, such sites are underrepresented in the pretraining data and experimental examples, so the model tends to underestimate their scores. Low scores at terminal residues should therefore be interpreted with caution, while high scores remain reliable.
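The last point can be turned into a simple labeling rule. The sketch below is only one possible reading: the function name is hypothetical, the 10-residue terminal window is an arbitrary placeholder (the text above does not define how close to a terminus a residue must be), and residues are assumed to be numbered from 1 to n_residues.

```python
def interpret_site(resnum, score, n_residues, threshold=0.5, terminal_window=10):
    """Label a site using the 0.5 threshold and the terminal-region caveat.

    terminal_window is a hypothetical value chosen for illustration;
    adjust it to whatever "near the terminus" means for your protein.
    """
    if score >= threshold:
        # High scores are described as reliable, including at the termini.
        return "predicted feasible"
    near_terminus = (resnum <= terminal_window
                     or resnum > n_residues - terminal_window)
    if near_terminus:
        # Low terminal scores may be underestimated by the model.
        return "uncertain: low score near a terminus"
    return "predicted infeasible"
```

A site labeled as uncertain is not a positive prediction; it only marks a low score that may deserve experimental testing rather than outright rejection.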
Download
- A predicted PDB file in which the original B-factor column is replaced with the corresponding splitting scores (0–100). Users can download this file and visualize it locally (e.g., in PyMOL) by coloring residues by B-factor.