I would like to ask you guys for some insight. I am currently working on my thesis and have run into something I just can't wrap my head around. I have an old dataset (18,000 samples) and a new one (26,000 samples); the new one is made up of the old dataset plus some extra samples. On both datasets I need to fit a regression model that predicts the fuel power consumption of an energy system (a cogenerator), using ambient temperature, output thermal power, and output electrical power as features. I trained a random forest (RF) regression model on each dataset; both were tuned with a hyperparameter grid search and 5-fold CV, and they turned out to be pretty different, with significantly different R² scores (old: 0.850, new: 0.935). Such a difference in R² seems odd to me, so I ran some further tests to dig into it, in particular:
1. Old model trained on the new dataset, and new model on the old dataset: similar R² on each dataset;
2. New model trained on increasing fractions of the new dataset: no significant change in R² (always close to the final R² of the new model);
3. Sub-datasets built as the old dataset plus increasing fractions of the samples that are only in the new dataset: here R² increases steadily from the old value towards the new one.

Since test 2 seems to suggest that dataset size is not significant, I am wondering whether test 3 means that the data added to the old dataset has a higher informative value. Are there further tests I can run to assess this hypothesis, and how could I formulate it mathematically? Or are you aware of any other phenomena that may be going on here? I am also adding some pics. Thank you in advance! Every suggestion would be much appreciated.
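For reference, here is a simplified sketch of how test 3 could be scripted with scikit-learn's `RandomForestRegressor`. The data below is synthetic (and scaled down so it runs quickly) just to keep the snippet self-contained; in the real experiment you would load the actual old/extra samples, use the tuned hyperparameters, and keep the feature columns (ambient temperature, thermal power, electrical power) from the thesis dataset.

```python
# Sketch of test 3: start from the old dataset and add increasing
# fractions of the extra samples (new minus old), tracking mean CV R².
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def make_synthetic(n):
    # Placeholder for the real data: 3 features (ambient temperature,
    # thermal power, electrical power), target = fuel power.
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=n)
    return X, y

# Scaled-down stand-ins for the 18,000 old samples and the extra
# samples unique to the new dataset.
X_old, y_old = make_synthetic(1500)
X_extra, y_extra = make_synthetic(700)

scores = {}
for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    k = int(frac * len(X_extra))
    X = np.vstack([X_old, X_extra[:k]])
    y = np.concatenate([y_old, y_extra[:k]])
    # In practice, reuse the grid-searched hyperparameters here.
    model = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=-1)
    scores[frac] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"fraction of extra data = {frac:.2f} -> mean CV R2 = {scores[frac]:.3f}")
```

With the real data, plotting `scores` against the fraction of extra samples would show whether R² climbs because of the added samples specifically, which is the pattern described in test 3.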