TEAM 1: Machine learning to distinguish T-cell receptor subregions

Cancer immunity is mediated by the host immune system's recognition of cancer antigens. Specifically, binding of the T-cell receptor hypervariable complementarity determining region 3 (CDR3) to the surface antigen-presentation machinery on malignant cells is crucial for T-cell-mediated cancer killing. Identifying the CDR3 sequences that are associated with cancer antigens is therefore a key question in cancer immunology and immunotherapy, with important diagnostic and therapeutic value. Here we provide 20,000 known tumor-associated CDR3 sequences and 10,000 non-tumor CDR3 sequences. The goal is to develop a machine learning method that computationally distinguishes the two types of CDR3s. Additional test datasets are provided to evaluate the accuracy of the method. Participants are advised to first apply cross-validation on the training data, then use the test datasets as independent validation. Other datasets in the public domain may be used as needed. This is a challenging task, and an AUC as modest as 0.7 will be considered acceptable.
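As a starting point, the recommended workflow (cross-validate on the training data, report AUC) can be sketched in plain Python. This is a minimal baseline under stated assumptions, not a prescribed method: it assumes the sequences are available as amino-acid strings with label 1 for tumor-associated and 0 for non-tumor, and it uses overlapping 3-mer counts with a Laplace-smoothed naive Bayes log-odds score, a choice made here purely for illustration. All function names (kmers, train_nb, score, auc, cross_validate) are hypothetical.

```python
import math
import random
from collections import Counter

def kmers(seq, k=3):
    """Return all overlapping k-mers of an amino-acid sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train_nb(seqs, labels, k=3):
    """Collect per-class k-mer counts for a naive Bayes style scorer."""
    counts = {0: Counter(), 1: Counter()}
    for seq, y in zip(seqs, labels):
        counts[y].update(kmers(seq, k))
    totals = {y: sum(counts[y].values()) for y in (0, 1)}
    vocab_size = len(set(counts[0]) | set(counts[1]))
    return counts, totals, vocab_size

def score(model, seq, k=3):
    """Laplace-smoothed log-odds that seq belongs to class 1 (assumed: tumor)."""
    counts, totals, v = model
    s = 0.0
    for km in kmers(seq, k):
        p1 = (counts[1][km] + 1) / (totals[1] + v + 1)
        p0 = (counts[0][km] + 1) / (totals[0] + v + 1)
        s += math.log(p1 / p0)
    return s

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is quadratic in the number of sequences,
    which is acceptable for a sketch but slow for very large folds.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validate(seqs, labels, folds=5, k=3, seed=0):
    """Stratified k-fold cross-validation; returns the mean AUC over folds.

    Assumes each class has at least `folds` sequences, so that every
    held-out fold contains both classes.
    """
    rng = random.Random(seed)
    fold_of = {}
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            fold_of[i] = j % folds  # round-robin keeps class ratios per fold
    aucs = []
    for f in range(folds):
        train = [i for i in range(len(seqs)) if fold_of[i] != f]
        test = [i for i in range(len(seqs)) if fold_of[i] == f]
        model = train_nb([seqs[i] for i in train], [labels[i] for i in train], k)
        test_scores = [score(model, seqs[i], k) for i in test]
        aucs.append(auc(test_scores, [labels[i] for i in test]))
    return sum(aucs) / folds
```

Calling cross_validate(seqs, labels) on the training set gives a cross-validated AUC to compare against the 0.7 target. The same train_nb and score functions can then be fitted on the full training set and applied once to the held-out test datasets. Because the training data are imbalanced (20,000 versus 10,000), a ranking metric such as AUC is a more informative summary than raw accuracy.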

Team Lead: Bo Li, Bioinformatics & Immunology,