Inputs
The Artificial Intelligence Predictive Oncology Research Toolkit (AIxPORT) utilities exchange datasets via
RO-Crate bundles. Each bundle is
a directory that carries a ro-crate-metadata.json file. This page
documents the expected layout for the *_train_rocrate and
*_test_rocrate folders consumed by aixportcmd.py train and
aixportcmd.py predict.
Note
To streamline downstream matching, DRE expects directory names to end with
_train_rocrate or _test_rocrate. The prediction tooling will raise
an error if those suffixes are missing.
Feature Tables
Every training and testing crate must include the genomic feature panel used by the algorithms. The files below are placed at the top-level of the crate directory.
gene2ind.txt:Tab-delimited mapping from zero-based gene indices to gene symbols.
0 ABCB1 1 ABCC3 2 ABL1
cell2ind.txt:Tab-delimited mapping from cell (genotype) indices to cell line names.
0 201T_LUNG 1 22RV1_PROSTATE 2 2313287_STOMACH
cell2mutation.txt:Comma-delimited matrix whose rows correspond to
cell2ind.txtentries and whose columns followgene2ind.txtordering. Values are1when a gene carries a non-synonymous mutation and0otherwise.0,0,1,0,0,0.. 0,0,0,0,1,0.. 0,0,0,0,0,0..
cell2cndeletion.txt:Comma-delimited matrix with the same shape as
cell2mutation.txt. A value of1denotes a copy-number deletion event;0means no deletion.0,0,0,0,0,0.. 0,1,0,0,0,0.. 0,0,0,0,1,0..
cell2cnamplification.txt:Comma-delimited matrix indicating copy-number amplification events.
0,0,0,0,0,0.. 0,0,0,1,0,0.. 0,1,0,0,0,0..
References:
Park, S., Silva, E., Singhal, A. et al. A deep learning model of tumor cell architecture elucidates response and resistance to CDK4/6 inhibitors. Nat Cancer (2024). https://doi.org/10.1038/s43018-024-00740-1
Training RO-Crates
Additionally to the feature tables, training crates include the training data:
training_data.txt:Tab-delimited table with one row per training observation. Columns are:
Cell identifier matching
cell2ind.txt(column 1)Drug SMILES string (column 2)
Observed response value (floating point, column 3)
Optional data source label (column 4)
HS633T_SOFT_TISSUE CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C 0.6695136077442607 GDSC2 KINGS1_CENTRAL_NERVOUS_SYSTEM CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C 0.6444092636032414 GDSC1
Running aixportcmd.py train writes algorithm-specific output RO-Crates
under <output>/trainedmodels. Each subdirectory includes the fitted model (for example,
model.pt or model.pkl), and a train_predictions.txt file that
captures in-sample predictions.
Testing RO-Crates
Testing crates use the same feature tables as the training crates and includes a file with test data on which predictions are to be made:
test_data.txt:Tab-delimited table with one row per inference request. Columns mirror
training_data.txt(cell identifier, SMILES string, numeric response if available, and optional source label).EW24_BONE CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C 0.98852067122827 GDSC1 OCILY7_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C 0.2728634745574858 GDSC1
aixportcmd.py predict pairs each *_test_rocrate with the corresponding
trained model directory (<dataset>_train_rocrate_<algorithm>). The command
generates per-algorithm output directories beneath
<output>/predictions that contain RO-Crate metadata, prediction scores, and
any algorithm-specific logs.