SplitSeek-Pro is a deep learning–based method for evaluating the feasibility of protein engineering strategies that involve residue-level splitting, such as circular permutation or split-protein reconstitution. The method integrates sequence information (ESM2-650M embeddings and AAindex encodings) and structural information (pairwise atomic distances) to estimate a splitting probability for each amino acid residue.
Currently, users can submit files in PDB format to obtain per-residue splitting probability predictions.
On the web server, prediction results are displayed using a color scale (white → red, representing scores from 0 to 1). SplitSeek-Pro outputs a predicted PDB file in which the original B-factor column is replaced with the corresponding splitting scores. Note that the file stores these scores on a 0–100 scale, whereas the web display uses 0 to 1. Users can download this file and visualize it locally (e.g., in PyMOL) by coloring residues by B-factor.
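For users who prefer to inspect the scores outside a molecular viewer, the following is a minimal sketch of reading them back from the downloaded file. It relies only on the standard fixed-column layout of PDB `ATOM` records. The function name `read_split_scores` is hypothetical, and the sketch assumes that every atom of a residue carries the same score and that the 0–100 values in the file map linearly onto the 0 to 1 scale shown on the web server.

```python
from typing import Dict, Tuple


def read_split_scores(pdb_text: str) -> Dict[Tuple[str, int], float]:
    """Return {(chain_id, residue_number): score} from PDB-format text.

    Assumption: the B-factor column (columns 61-66) holds the splitting
    score on a 0-100 scale, and dividing by 100 gives the 0-1 score.
    """
    scores: Dict[Tuple[str, int], float] = {}
    for line in pdb_text.splitlines():
        # Only standard ATOM records with a complete B-factor field are read.
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        chain_id = line[21]            # column 22: chain identifier
        res_num = int(line[22:26])     # columns 23-26: residue sequence number
        b_factor = float(line[60:66])  # columns 61-66: B-factor (here, the score)
        # Keep the first value seen for each residue.
        scores.setdefault((chain_id, res_num), b_factor / 100.0)
    return scores


# Hypothetical usage; "predicted.pdb" is a placeholder file name:
# with open("predicted.pdb") as handle:
#     scores = read_split_scores(handle.read())
```

In PyMOL itself, a command such as `spectrum b, white red, minimum=0, maximum=100` should give a comparable white-to-red coloring, assuming the 0–100 range described above.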
A score of 0.5 is recommended as the threshold to distinguish between feasible and infeasible splitting sites.
We will continue to improve the prediction accuracy as more data become available.
Tips
The predicted score at residue n corresponds to the feasibility of splitting the protein between residues n and n+1.
Continuous regions of ≥3 consecutive residues scoring above 0.8 generally indicate high splitting feasibility. An isolated high-scoring residue that is not part of such a region is less likely to be practically feasible.
The model was trained in two stages: (1) pretraining on a computational splitting-probability dataset and (2) fine-tuning on circular permutation data from structurally similar proteins. Because the fine-tuning dataset is derived from circular permutation examples, predictions are biased toward identifying sites for circular permutation.
Predictions tend to be more reliable for proteins with fewer than 400 residues.
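The first two tips can be sketched in code. The example below is illustrative only and is not part of SplitSeek-Pro: the function names, the optional `linker` argument, and the use of a plain list of 0 to 1 scores indexed by residue position are all assumptions made for the sketch.

```python
from typing import List, Tuple


def high_confidence_runs(scores: List[float],
                         cutoff: float = 0.8,
                         min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (first, last) 1-based residue numbers of every run of at
    least `min_len` consecutive residues scoring above `cutoff`."""
    runs: List[Tuple[int, int]] = []
    start = None  # 1-based start of the run currently being tracked
    for i, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_len:
        runs.append((start, len(scores)))
    return runs


def split_at(sequence: str, n: int) -> Tuple[str, str]:
    """Split between residues n and n+1 (1-based), as the score at n implies."""
    if not 1 <= n < len(sequence):
        raise ValueError("n must satisfy 1 <= n < len(sequence)")
    return sequence[:n], sequence[n:]


def circular_permutant(sequence: str, n: int, linker: str = "") -> str:
    """Build a circular permutant whose new N-terminus is residue n+1.

    The old termini are joined by `linker`, a hypothetical argument;
    linker design is outside the scope of this sketch.
    """
    head, tail = split_at(sequence, n)
    return tail + linker + head
```

For example, with scores `[0.1, 0.85, 0.9, 0.95, 0.2, 0.9, 0.3]`, `high_confidence_runs` reports the region spanning residues 2 to 4 and ignores the isolated high score at residue 6.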
Shortcomings
While the model can partially recognize non-splittable residues near active sites, the absence of explicit active-site annotations may lead to false positives, caused by perturbation of functionally relevant residues or by ghost effects.
The model is primarily optimized for distinguishing splittable sites in loop regions. It currently has limited accuracy for identifying splittable sites in rigid secondary structural elements; this is the main source of false negatives.
Circular permutation at residues located near the terminal regions is usually well tolerated. However, these residues are underrepresented in the pretraining data and in the experimental examples, so the model tends to underestimate their scores. Low scores at terminal residues should therefore be interpreted with caution, whereas high scores remain reliable.
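The caveat about terminal residues can be combined with the recommended 0.5 threshold when post-processing scores. The sketch below is one possible way to do so and is not part of SplitSeek-Pro. The `terminal_window` of 10 residues is a purely illustrative assumption, since no size is specified for the terminal regions, and treating a score of exactly 0.5 as feasible is likewise an assumption.

```python
from typing import List


def label_sites(scores: List[float],
                threshold: float = 0.5,
                terminal_window: int = 10) -> List[str]:
    """Label each residue's 0-1 score, flagging low scores near the termini.

    `terminal_window` is a hypothetical parameter: the number of residues
    at each end of the chain treated as "terminal".
    """
    n = len(scores)
    labels: List[str] = []
    for i, score in enumerate(scores, start=1):
        near_terminus = i <= terminal_window or i > n - terminal_window
        if score >= threshold:
            # High scores are taken at face value, including near termini.
            labels.append("feasible")
        elif near_terminus:
            # Low terminal scores may be underestimates.
            labels.append("low score, interpret with caution")
        else:
            labels.append("infeasible")
    return labels
```

This labeling does not address the other shortcomings: possible false positives near active sites and false negatives within rigid secondary structure still need to be checked against the structure itself.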