Programming_I_Leukemia_analysis

Course project for Bioinformatics Programming I.

read_data.py reads in the data.
getcox.py calculates the Cox score of each gene for a given dataset and also reports a histogram of the genes' Cox scores.
f.py builds the PCA model on the training dataset using the qualified genes (those with a Cox score greater than or equal to the given threshold), then reports the Cox regression result on the test dataset.
change_col_names creates a dictionary with the V column names as keys and the probe names as values.
sampling_the_data.py generates 5 balanced train/test folds based on the key features: MRD day 29, age, and WBC at diagnosis.
1_gene_expression.md is a test of R Markdown.
project_notebook.ipynb is a Jupyter notebook showing the loaded dataframe and the Cox-score histogram based on the 207 patients.
get_5_testfolds.ipynb is a Jupyter notebook showing how to do stratified resampling to get 5 balanced train/test datasets.
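
The stratified resampling idea behind sampling_the_data.py and get_5_testfolds.ipynb can be sketched roughly as follows. This is only an illustration, not the code in this repository: the function names `stratified_folds` and `train_test_splits` are hypothetical, and it assumes each patient has already been given a stratum label (for example a tuple of binned MRD day 29, age, and WBC values; the binning itself is not shown and is an assumption).

```python
import random
from collections import defaultdict


def stratified_folds(strata, k=5, seed=0):
    """Assign sample indices to k folds so each stratum is spread evenly.

    strata: one hashable label per sample (hypothetical, e.g. a tuple of
    binned clinical features). Returns a list of k lists of indices.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # carry the position across strata to keep fold sizes even
    for members in by_stratum.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Toy labels only: 10 samples in stratum "a" and 5 in stratum "b".
toy_strata = ["a"] * 10 + ["b"] * 5
toy_folds = stratified_folds(toy_strata, k=5, seed=1)
```

With the toy labels above, every fold receives two samples from stratum "a" and one from stratum "b", so each test set has the same composition as the full set.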

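As a rough illustration of the per-gene Cox score and the threshold filter described for getcox.py and f.py, the sketch below uses one common definition: the standardized univariate Cox score statistic evaluated at a coefficient of zero, with the Breslow convention for ties. Whether getcox.py uses exactly this definition is an assumption, as are the function names `cox_score` and `qualified_genes`, the use of the absolute value in the filter, and the toy probe names and values.

```python
import math


def cox_score(x, time, event):
    """Standardized univariate Cox score statistic at beta = 0.

    x: expression values of one gene, one per patient.
    time: follow-up times; event: 1 if the event was observed, 0 if censored.
    """
    n = len(x)
    score = 0.0  # sum over events of (x_i - mean of x in the risk set)
    info = 0.0   # sum over events of the variance of x in the risk set
    for i in range(n):
        if not event[i]:
            continue
        # Risk set: patients still under observation at time[i].
        risk = [x[j] for j in range(n) if time[j] >= time[i]]
        mean = sum(risk) / len(risk)
        score += x[i] - mean
        info += sum((r - mean) ** 2 for r in risk) / len(risk)
    return score / math.sqrt(info) if info > 0 else 0.0


def qualified_genes(expression, time, event, threshold):
    """Return the genes whose |Cox score| is at least the threshold.

    Taking the absolute value is an assumption made for this sketch.
    """
    scores = {g: cox_score(v, time, event) for g, v in expression.items()}
    return [g for g, s in scores.items() if abs(s) >= threshold]


# Toy data only (hypothetical probes, four patients, no censoring).
toy_time = [4, 3, 2, 1]
toy_event = [1, 1, 1, 1]
toy_expression = {"probe_a": [1, 2, 3, 4], "probe_b": [2, 2, 2, 2]}
selected = qualified_genes(toy_expression, toy_time, toy_event, threshold=1.5)
```

In the toy data, probe_a is higher in patients with earlier events and gets a positive score of about 2.04, while the constant probe_b scores 0, so only probe_a passes a threshold of 1.5.
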
Repository contents:
Presentation
Results
dataset
images
paper
1_gene_expression.md
README.md
change_col_names
clinical.xlsx
describe_stratum.py
f.py
genefilter.R
get_5_testfolds.ipynb
getcox.py
midterm_report.ipynb
presentation_final_2omb.pptx
presentation_vo.7.pptx
presentation_vo.7_PDF.pdf
project_notebook.ipynb
proposal_v10.0.docx
read_data.py
sampling_the_data.py