Analysis of 700 patients to identify the common factors and risk factors of a particular disease, and to extrapolate the results to the general population. I would like to use R in the Azure environment and test different algorithms to find the best fit.
I am conducting research on cancer patients (700 cases) with about 50 variables each. I have already collected the data and am in the process of cleansing it. The patient data is 100% anonymous and was collected in an Excel table.
The idea is to perform statistical analysis and build a predictive model using R and the Azure big data machine learning environment.
I have collected the data and divided it into three components:
a.) Risk factors (lifestyle, previous diseases and medications)
b.) Classification and evolution (stage, symptoms, type of tumor, diagnostic tools, complications, case follow-up and current condition)
c.) Performed therapy (surgical, drugs, others)
I wish to make predictions in three areas:
d.) Best practices in diagnosis
e.) Best therapy approach (surgical, medical, drugs, combination, others)
f.) Prognosis (survival)
The job has 4 phases:
1. DATA GATHERING AND CLEANSING
Perform data cleansing operations, such as identifying noisy data and removing outliers, to make the predictions more accurate. The idea is to have basic code that applies R packages to handle missing, duplicate and impure values. After the initial run I will keep running tests and improving the data, so the model will be run several times and I will customize and improve the R code provided by the freelancer. In this step the freelancer will not perform the data cleansing; the task is only to provide the code for the basic cleaning process (R code in the Azure ML environment). The code could include comments to activate or deactivate other data cleansing steps in the future.
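To illustrate the kind of deliverable I have in mind for this phase, here is a minimal sketch in base R. It runs on made-up data: the column names (id, age, smoker), the clean_patients function and its switches are all hypothetical, the 1.5 x IQR rule and median imputation are just one possible choice, and the wiring into the Azure ML R script module is left out.

```r
# Hypothetical toy data standing in for the Excel export (column names are assumptions).
patients <- data.frame(
  id     = c(1, 2, 2, 3, 4, 5, 6),
  age    = c(54, 61, 61, NA, 47, 230, 58),
  smoker = c("yes", " Yes", " Yes", "no", "NO ", "no", NA),
  stringsAsFactors = FALSE
)

# Each cleansing step can be switched on or off through its own argument.
clean_patients <- function(df,
                           numeric_cols    = "age",
                           fix_text        = TRUE,
                           drop_duplicates = TRUE,
                           drop_outliers   = TRUE,
                           impute_missing  = TRUE) {
  if (fix_text) {
    # Impure values: trim whitespace and lower-case all text columns.
    is_chr <- vapply(df, is.character, logical(1))
    df[is_chr] <- lapply(df[is_chr], function(x) tolower(trimws(x)))
  }
  if (drop_duplicates) {
    # Remove rows that are exact copies of an earlier row.
    df <- df[!duplicated(df), , drop = FALSE]
  }
  for (col in numeric_cols) {
    if (drop_outliers) {
      # Outliers: drop values outside 1.5 * IQR of the quartiles (missing values are kept).
      q     <- unname(quantile(df[[col]], c(0.25, 0.75), na.rm = TRUE))
      fence <- 1.5 * (q[2] - q[1])
      keep  <- is.na(df[[col]]) |
        (df[[col]] >= q[1] - fence & df[[col]] <= q[2] + fence)
      df <- df[keep, , drop = FALSE]
    }
    if (impute_missing) {
      # Missing values: replace with the column median.
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

cleaned <- clean_patients(patients)
print(cleaned)
```

Calling clean_patients(patients, drop_outliers = FALSE) would skip the outlier step, which is the kind of activate/deactivate switch I would like to have for later runs.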
2. DATA ANALYSIS AND TRANSFORMATION
I would like to have frequency distributions for all the variables to start looking at trends and possible relations. After that I would like to perform a crosstab analysis and correlations on selected variables. The freelancer will provide the code to perform these three tests. Additionally, I would like to use correlation analysis to remove irrelevant attributes, i.e. those that play the least significant role in determining the outcomes.
The freelancer will provide the R code in the Azure ML environment to do all of this. The code could include comments for future crosstab analysis and regression.
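As a rough example of the output expected in this phase, the sketch below produces frequency tables, one crosstab, a correlation matrix and a simple correlation-based attribute screen. The data is simulated, and the variable names (smoker, stage, age, bmi, survival_months) and the 0.1 threshold are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned patient table (all names are hypothetical).
patients <- data.frame(
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  stage  = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  age    = round(rnorm(n, 60, 10)),
  bmi    = rnorm(n, 26, 4)
)
patients$survival_months <- 120 - patients$age + rnorm(n, 0, 5)

is_cat <- vapply(patients, is.factor, logical(1))

# 1. Frequency distributions: counts for categorical variables,
#    five-number summaries for numeric ones.
freq_tables <- lapply(patients[is_cat], table, useNA = "ifany")
num_summary <- lapply(patients[!is_cat], summary)

# 2. Crosstab of two categorical variables, with row percentages.
xt      <- table(smoker = patients$smoker, stage = patients$stage)
row_pct <- prop.table(xt, margin = 1)

# 3. Correlation matrix of the numeric variables.
cor_mat <- cor(patients[!is_cat], use = "pairwise.complete.obs")

# 4. Attribute screening: keep numeric predictors whose absolute correlation
#    with the outcome reaches the (assumed) threshold.
outcome          <- "survival_months"
predictors       <- setdiff(colnames(cor_mat), outcome)
cor_with_outcome <- abs(cor_mat[predictors, outcome])
threshold        <- 0.1
kept             <- names(cor_with_outcome)[cor_with_outcome >= threshold]

print(freq_tables)
print(round(row_pct, 2))
print(round(cor_mat, 2))
print(kept)
```

Correlation only screens numeric attributes against a numeric outcome; for categorical variables a test such as chisq.test on the crosstab would be the natural counterpart.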
3. BIG DATA ANALYSIS (USING AZURE MACHINE LEARNING AND THE ALGORITHMS TO CREATE THE APPROPRIATE PREDICTIVE MODEL)
I want to generate a decision tree or apply linear/logistic regression techniques to build a predictive model for the DIAGNOSIS (specificity), TREATMENT (best therapy) and PROGNOSIS elements. This procedure involves choosing a classification algorithm, identifying test data and generating classification rules. The confidence of the classification model should be assessed by applying it to test data, according to the three components mentioned above: a.) Risk factors, b.) Tumor type, and c.) Performed therapies.
The freelancer will provide the R code in the Azure Machine Learning environment to do all of this.
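The procedure described for this phase (choose a classifier, hold out test data, check the model against it) could look roughly like the following sketch. It uses logistic regression through glm on simulated data; the outcome column died, the predictors, the 70/30 split and the 0.5 cutoff are all assumptions for illustration, not requirements.

```r
set.seed(42)
n <- 400

# Simulated patients (all variable names and effect sizes are made up).
patients <- data.frame(
  age    = round(rnorm(n, 60, 10)),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  stage  = factor(sample(c("I", "II", "III"), n, replace = TRUE))
)
lin <- -6 + 0.08 * patients$age +
  1.2 * (patients$stage == "III") + 0.8 * (patients$smoker == "yes")
patients$died <- rbinom(n, 1, plogis(lin))

# Hold out 30% of the cases as test data.
test_idx <- sample(n, size = round(0.3 * n))
train    <- patients[-test_idx, ]
test     <- patients[test_idx, ]

# Fit a logistic regression on the training set only.
fit <- glm(died ~ age + smoker + stage, data = train, family = binomial)

# Odds ratios give a readable summary of each factor's effect.
odds_ratios <- exp(coef(fit))

# Apply the model to the test set and build a confusion matrix.
prob     <- predict(fit, newdata = test, type = "response")
pred     <- factor(as.integer(prob >= 0.5), levels = c(0, 1))
actual   <- factor(test$died, levels = c(0, 1))
conf_mat <- table(predicted = pred, actual = actual)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])

print(round(odds_ratios, 2))
print(conf_mat)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

A decision tree (for example from the rpart package) could be fitted on the same train/test split and compared using the same confusion-matrix measures.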
4. CLUSTER ANALYSIS
Perform a cluster analysis to segregate the data into groups, and use these meaningful subsets of the population to make inferences in the three areas mentioned above: DIAGNOSIS (specificity of the available tests), TREATMENT (best therapy) and PROGNOSIS. The freelancer will provide the R code in the Azure Machine Learning environment to do all of this.
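For the cluster analysis, something along the lines of the sketch below would be a reasonable starting point. It applies k-means to standardized variables and then profiles each cluster; the data is simulated with two obvious groups, and the variable names (age, tumor_size_mm, survival_months) and the choice of two clusters are assumptions, since the right number of clusters for the real data is not known in advance.

```r
set.seed(7)

# Simulated data with two clearly separated patient groups (names are hypothetical).
g1 <- data.frame(age = rnorm(50, 45, 4), tumor_size_mm = rnorm(50, 15, 3))
g2 <- data.frame(age = rnorm(50, 70, 4), tumor_size_mm = rnorm(50, 40, 3))
patients <- rbind(g1, g2)
patients$survival_months <- c(rnorm(50, 60, 8), rnorm(50, 25, 8))

# Standardize the clustering variables so that no single unit dominates the distance.
features <- scale(patients[, c("age", "tumor_size_mm")])

# k-means with several random starts; centers = 2 is an assumption for this toy data.
km <- kmeans(features, centers = 2, nstart = 25)
patients$cluster <- factor(km$cluster)

# Profile each cluster: mean of the clustering variables and of the outcome.
cluster_profile <- aggregate(
  cbind(age, tumor_size_mm, survival_months) ~ cluster,
  data = patients, FUN = mean
)

print(table(patients$cluster))
print(cluster_profile)
```

The outcome (survival_months here) is deliberately kept out of the clustering step, so that differences in prognosis between clusters can be read as a finding rather than something built into the groups.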