Given a dataset of human patients which includes their age, sex, and over 100 measured disease biomarkers, what type of statistical analysis is best?
08:09 31 Mar 2026

I was given a dataset of close to 100 human participants. For each participant, it includes values for 100+ hypothesized biomarkers of neurodegeneration, along with age, sex, and disease state (clinically normal, mild cognitive impairment, or Alzheimer's).

The 100 biomarkers all use the same measurement unit (NPQ). The dataset also includes, when available, each participant's Clinical Dementia Rating (CDR) score and Mini-Mental State Examination (MMSE) score.

Finally, I have two other biomarkers that were measured with two different technologies and are reported in different units (Cq and OD, respectively).

To me, this seems like a rather complicated and heterogeneous dataset, so I'm having difficulty deciding how to approach it. Should I start with simple linear regression?

My main interest, initially, is determining which biomarker(s) do the best job of separating patients by disease state, while taking their cognitive scores into account. For example, does Biomarker 1/100 do a good enough job of separating the Alzheimer's patients from the clinically normal ones? Beyond that, are there any interesting associations between the biomarkers and the other variables?

Here's an example of the dataset:

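| Patient ID | Study ID | Biomarker 1/100 (NPQ) | Biomarker 2/100 (NPQ) | Biomarker 100/100 (NPQ) | Sex | Age | Disease | CDR | MMSE | Biomarker X (Cq) | Biomarker Z (OD) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 56 | 3853 | 35 | M | 72 | Normal | 1 | 21 | 29.9 | 2.374 |
| 2 | B | 28 | 29494 | 24 | M | 78 | MCI | 0.5 | 30 | 30.95 | 1.677 |
| 3 | C | 294 | 2992 | 23 | F | 86 | MCI | NA | 17 | 28.89 | 1.789 |
| 4 | D | 29 | 223 | 14 | F | 65 | AD | 0 | 18 | 31.02 | 2.400 |
| 5 | E | 24 | 234 | 13 | M | 71 | AD | 2 | NA | 29.10 | 0.8694 |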
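
To make the question concrete, here is a minimal sketch of the kind of one-biomarker-at-a-time screen I have in mind, written in plain Python (standard library only) and using only the five example rows above. For each biomarker it computes the AUC for AD versus Normal, meaning the probability that a randomly chosen AD patient has a higher value than a randomly chosen normal one. Because this is rank-based, the differing units (NPQ, Cq, OD) should not matter. The short column names (`bm1`, `bm2`, `bm100`, `bmX`, `bmZ`) and the helper functions `auc` and `screen` are my own hypothetical names, not part of the dataset. The sketch ignores age, sex, CDR, MMSE, and multiple comparisons, so I see it as a starting point at best, not as the right analysis.

```python
from itertools import product

# Example rows copied from the table above (MCI rows are unused here).
# Column names are shortened, hypothetical labels for the original headers.
rows = [
    {"id": 1, "disease": "Normal", "bm1": 56,  "bm2": 3853,  "bm100": 35, "bmX": 29.9,  "bmZ": 2.374},
    {"id": 2, "disease": "MCI",    "bm1": 28,  "bm2": 29494, "bm100": 24, "bmX": 30.95, "bmZ": 1.677},
    {"id": 3, "disease": "MCI",    "bm1": 294, "bm2": 2992,  "bm100": 23, "bmX": 28.89, "bmZ": 1.789},
    {"id": 4, "disease": "AD",     "bm1": 29,  "bm2": 223,   "bm100": 14, "bmX": 31.02, "bmZ": 2.400},
    {"id": 5, "disease": "AD",     "bm1": 24,  "bm2": 234,   "bm100": 13, "bmX": 29.10, "bmZ": 0.8694},
]


def auc(cases, controls):
    """Fraction of (case, control) pairs where the case value is higher.

    Ties count as half a win. Returns None if either group is empty.
    """
    pairs = list(product(cases, controls))
    if not pairs:
        return None
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0 for c, k in pairs)
    return wins / len(pairs)


def screen(rows, markers, case="AD", control="Normal"):
    """Compute the case-vs-control AUC for each marker, skipping missing values."""
    results = {}
    for m in markers:
        cases = [r[m] for r in rows if r["disease"] == case and r.get(m) is not None]
        controls = [r[m] for r in rows if r["disease"] == control and r.get(m) is not None]
        results[m] = auc(cases, controls)
    return results


aucs = screen(rows, ["bm1", "bm2", "bm100", "bmX", "bmZ"])

# Rank markers by how far the AUC is from 0.5 (0.5 means no separation;
# values near 0 or 1 mean strong separation in one direction or the other).
ranked = sorted(
    ((m, a) for m, a in aucs.items() if a is not None),
    key=lambda item: abs(item[1] - 0.5),
    reverse=True,
)
for marker, value in ranked:
    print(f"{marker}: AUC = {value:.2f}")
```

With only one normal and two AD patients in the example rows, the numbers this prints mean nothing statistically; the point is only to show the shape of the comparison I am asking about.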
r statistics