Importance of variables and correlation problems
For our last seminar we had the pleasure of hosting Pablo Brusco, who holds a degree in Computer Science and is a PhD student at the Computer Science Department (UBA). Pablo works on natural language processing, and in particular on speech processing, where he uses machine learning to analyze dialogues between people and to understand the cues we produce unconsciously (such as small changes in tone of voice) that allow conversations to flow with few interruptions. His talk at the Seminar focused on measuring the importance of variables in classification and regression, and he also answered some questions on the subject:
1. Are attribute selection practices equally applicable to all problem domains, or are there differences (text, images, time series, etc.)?
In machine learning, one performs attribute selection to obtain a subset of variables that contains most of the information relating the instances (X) to the value to be predicted (y). Sometimes the goal is to look for an explanation of a high-dimensional phenomenon; other times it is to make the work easier for classifiers or regressors that do not perform well in high dimensions. It is common to work in high dimensions in all the fields you named in the question. Therefore, yes, these practices apply to almost any type of machine learning problem.
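To make the idea of keeping an informative subset of variables concrete, here is a minimal sketch of one simple filter approach: rank each column of X by its absolute Pearson correlation with y and keep the top k. This is only one of many possible selection strategies and is not necessarily the one discussed in the talk. The helper names (`pearson`, `select_top_k`) and the synthetic data are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_top_k(X, y, k):
    """Return the indices of the k columns of X most correlated with y."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    scores.sort(reverse=True)  # highest absolute correlation first
    return sorted(j for _, j in scores[:k])


# Synthetic data (hypothetical): y depends only on columns 0 and 2.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(500)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]

selected = select_top_k(X, y, k=2)
print(selected)
```

A filter like this scores each variable in isolation, so it only captures linear, one-variable-at-a-time relationships; it is a starting point rather than a general solution.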
2. Do you think that in the next few years research in artificial intelligence is going to focus on understanding how models arrive at the results they do?
The problem of explaining what a model is learning, and why it makes the mistakes it does, is currently unsolved. In addition, models that are simpler to explain generally have less predictive power. The main conferences in the area now host events or meetings dedicated exclusively to “model interpretability”. I believe that it is important, and will continue to be important, to evaluate the quality of our predictors not only by observing their performance on a dataset, but also through intrinsic or extrinsic explanations (models that explain models). In the meantime, I think that many applications will have to wait, simply because of ethical issues or a lack of confidence in the results.
3. Random Forest is used because it’s an easy model to interpret, but today the field is dominated by neural networks. Do you think these will be equally interpretable in the coming years?
Random Forest is not that easy to interpret. It’s a middle ground: you get reasonable performance on many problems without an extensive hyperparameter search, and a model that can be interpreted through less-than-perfect techniques whose weaknesses are little studied. Since many more people are trying to solve the problem of interpreting results in neural networks, it is likely that in the future we will have a better understanding of how these networks work. But it always depends on the problem, the amount of training data, the type of attributes used, the correlation between attributes, and so on.
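As an illustration of why correlation between attributes matters when measuring importance, the sketch below applies permutation importance (one common technique, not necessarily the one used in the talk) to a hand-written model. Everything here is an assumption made for the example: the synthetic data, in which column 1 is an exact copy of column 0, and the hypothetical `predict` function, which stands in for a fitted model that happens to split its weight between the two duplicated columns.

```python
import random


def mse(predict, X, y):
    """Mean squared error of predict over the rows of X."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, rng):
    """Increase in MSE after shuffling a single column of X."""
    baseline = mse(predict, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(X, shuffled)
    ]
    return mse(predict, X_perm, y) - baseline


# Synthetic data (hypothetical): column 1 duplicates column 0,
# and the target depends equally on the two underlying signals.
rng = random.Random(0)
X = []
for _ in range(2000):
    a = rng.gauss(0, 1)
    b = rng.gauss(0, 1)
    X.append([a, a, b])
y = [row[0] + row[2] for row in X]


def predict(row):
    # Hypothetical fitted model: the weight of signal `a` is split
    # between its two identical copies.
    return 0.5 * row[0] + 0.5 * row[1] + row[2]


importances = [permutation_importance(predict, X, y, j, rng) for j in range(3)]
for j, value in enumerate(importances):
    print(f"column {j}: {value:.3f}")
```

In this setup the two underlying signals contribute equally to y, yet each copy of the duplicated signal receives a much smaller score than column 2, because shuffling one copy leaves the other intact. The importance is diluted across correlated attributes, which is one way these measures can mislead.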