I was given a dataset of close to 100 human participants. For each participant, it includes values for 100+ hypothesized biomarkers of neurodegeneration, along with age, sex, and disease state (clinically normal, mild cognitive impairment, or Alzheimer's disease).
These biomarkers all use the same measurement unit (NPQ). The dataset also includes, when available, each participant's Clinical Dementia Rating (CDR) score and Mini-Mental State Examination (MMSE) score.
Finally, I have two other biomarkers that were measured with two different technologies and use different units of measurement (Cq and OD, respectively).
To me, this seems like a rather complicated and heterogeneous dataset, so I'm having difficulty deciding how to approach it. Should I start with simple linear regression analysis?
My main interest, initially, is determining which biomarker(s) do the best job of separating patients by disease state, while taking their cognitive scores into account. For example, does Biomarker 1/100 do a good enough job of separating the Alzheimer's patients from the clinically normal participants? Are there any other associations that are interesting?
Here's an example of the dataset:
| Patient ID | Study ID | Biomarker 1/100 (NPQ) | Biomarker 2/100 (NPQ) | Biomarker 100/100 (NPQ) | Sex | Age | Disease | CDR | MMSE | Biomarker X (Cq) | Biomarker Z (OD) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 56 | 3853 | 35 | M | 72 | Normal | 1 | 21 | 29.9 | 2.374 |
| 2 | B | 28 | 29494 | 24 | M | 78 | MCI | 0.5 | 30 | 30.95 | 1.677 |
| 3 | C | 294 | 2992 | 23 | F | 86 | MCI | NA | 17 | 28.89 | 1.789 |
| 4 | D | 29 | 223 | 14 | F | 65 | AD | 0 | 18 | 31.02 | 2.400 |
| 5 | E | 24 | 234 | 13 | M | 71 | AD | 2 | NA | 29.10 | 0.8694 |
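
To make the question concrete, here is a minimal sketch of the kind of per-biomarker screen I have in mind: for each biomarker, compute the area under the ROC curve (AUC) for Alzheimer's versus clinically normal, then rank the biomarkers by how far the AUC is from 0.5. Everything here is an assumption for illustration: the column names `bm1` and `bm2`, the toy values, and the helper names `auc` and `rank_biomarkers` are made up and are not my real data. It also ignores age, sex, and the cognitive scores, and it does not correct for testing 100+ biomarkers.

```python
from itertools import product


def auc(cases, controls):
    """AUC as the probability that a random case has a higher value than
    a random control, counting ties as 0.5 (Mann-Whitney formulation)."""
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c, k in product(cases, controls)
    )
    return wins / (len(cases) * len(controls))


def rank_biomarkers(rows, biomarkers, case="AD", control="Normal"):
    """Return (name, auc, separation) tuples, best separation first.

    separation = max(auc, 1 - auc), so a biomarker that is lower in
    cases is treated the same as one that is higher in cases.
    Missing values (None) are skipped per biomarker.
    """
    results = []
    for name in biomarkers:
        cases = [r[name] for r in rows
                 if r["Disease"] == case and r[name] is not None]
        controls = [r[name] for r in rows
                    if r["Disease"] == control and r[name] is not None]
        a = auc(cases, controls)
        results.append((name, a, max(a, 1.0 - a)))
    return sorted(results, key=lambda t: t[2], reverse=True)


# Hypothetical toy data, NOT taken from the real dataset.
rows = [
    {"Disease": "Normal", "bm1": 56, "bm2": 300},
    {"Disease": "Normal", "bm1": 60, "bm2": 250},
    {"Disease": "Normal", "bm1": 48, "bm2": 410},
    {"Disease": "MCI",    "bm1": 40, "bm2": 280},  # ignored in AD vs Normal
    {"Disease": "AD",     "bm1": 29, "bm2": 223},
    {"Disease": "AD",     "bm1": 24, "bm2": 420},
    {"Disease": "AD",     "bm1": 31, "bm2": None},  # missing value
]

ranked = rank_biomarkers(rows, ["bm1", "bm2"])
for name, a, sep in ranked:
    print(f"{name}: AUC={a:.2f}, separation={sep:.2f}")
```

With only about 100 participants, I assume any ranking like this would be noisy, so I would treat it as a first look rather than a result. Is this a reasonable starting point, or is there a better way to bring in CDR and MMSE?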