Given a dataset of human patients which includes their age, sex, and over 100 measured disease biomarkers, what type of statistical analysis is best?

08:09 31 Mar 2026

I was given a dataset of close to 100 human participants. For each participant, it includes values for 100+ hypothesized biomarkers of neurodegeneration, along with age, sex, and disease state (clinically normal, mild cognitive impairment, or Alzheimer's).

The 100 biomarkers all use the same measurement unit (NPQ). The dataset also includes, when available, each participant's Clinical Dementia Rating (CDR) score and Mini-Mental State Examination (MMSE) score.

Finally, I have two other biomarkers that were measured using two different technologies and use different units of measurement (Cq and OD, respectively).

To me, this seems like a rather complicated and heterogeneous dataset, so I'm having difficulty deciding how to approach it. Should I start with a simple linear regression analysis?

My main interest, initially, is determining which biomarker(s) do the best job of separating patients by disease state while taking their cognitive scores into account. For example, does Biomarker 1/100 do a good enough job of separating the Alzheimer's patients from the clinically normal ones? Are there any interesting associations?
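To make the single-biomarker question concrete, one possible starting point (a sketch, not a claim about which analysis is best) is to score each biomarker by the area under the ROC curve (AUC) for Alzheimer's versus clinically normal. The AUC here is computed as the Mann-Whitney probability that a randomly chosen AD patient has a higher value than a randomly chosen normal participant. Because it depends only on ranks, it does not care whether a biomarker is in NPQ, Cq, or OD. The biomarker names and values below are hypothetical toy numbers, not taken from the real dataset, and Python is used only for illustration.

```python
def auc_mann_whitney(cases, controls):
    """Probability that a random case value exceeds a random control value.

    Ties count as half a win. 0.5 means no separation; values near 0 or 1
    mean strong separation (in opposite directions).
    """
    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def rank_biomarkers(data):
    """Rank biomarkers by how well each one separates AD from Normal."""
    scores = {}
    for name, groups in data.items():
        auc = auc_mann_whitney(groups["AD"], groups["Normal"])
        # Fold around 0.5 so that "lower in AD" counts as much as "higher in AD".
        scores[name] = max(auc, 1.0 - auc)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values for illustration only (NOT real measurements).
toy = {
    "biomarker_a": {"AD": [30, 42, 55, 61], "Normal": [12, 18, 25, 33]},
    "biomarker_b": {"AD": [5, 9, 14, 20], "Normal": [7, 10, 13, 21]},
}

ranking = rank_biomarkers(toy)
for name, score in ranking:
    print(f"{name}: folded AUC = {score:.4f}")
```

With 100+ biomarkers and only about 100 participants, a ranking like this is exploratory: some biomarkers will look good by chance, so the top candidates would still need a multiple-comparison correction or validation on held-out data.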

Here's an example of the dataset:

Patient ID	Study ID	Biomarker 1/100 (NPQ)	Biomarker 2/100 (NPQ)	Biomarker 100/100 (NPQ)	Sex	Age	Disease	CDR	MMSE	Biomarker X (Cq)	Biomarker Z (OD)
1	A	56	3853	35	M	72	Normal	1	21	29.9	2.374
2	B	28	29494	24	M	78	MCI	0.5	30	30.95	1.677
3	C	294	2992	23	F	86	MCI	NA	17	28.89	1.789
4	D	29	223	14	F	65	AD	0	18	31.02	2.400
5	E	24	234	13	M	71	AD	2	NA	29.10	0.8694
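Since CDR and MMSE contain NA entries, any association between a biomarker and a cognitive score has to decide how to treat missing values. A minimal sketch, assuming complete-case analysis (dropping any participant with a missing value in either variable) and Spearman's rank correlation, is shown below. The function names and the toy values are hypothetical, chosen only for illustration, with `None` standing in for NA.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman_complete_cases(x, y):
    """Spearman's rho on pairs where neither value is missing (None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return float("nan"), len(pairs)
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys)), len(pairs)


# Hypothetical toy values for illustration only (NOT real measurements).
biomarker = [10, 20, 30, 40, 50, 60]
mmse = [29, 27, None, 24, 20, 18]

rho, n = spearman_complete_cases(biomarker, mmse)
print(f"Spearman rho = {rho:.3f} using {n} complete pairs")
```

In this toy example the biomarker rises as MMSE falls in every complete pair, so rho comes out at the negative extreme. Complete-case analysis is the simplest option, but it quietly shrinks the sample, so the number of pairs actually used is returned alongside the correlation.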

r statistics
