We often hear people say that smoking causes lung cancer. Is this statement really true? Apart from smoking, other factors such as age, gender, and environmental exposure can also contribute to the development of lung cancer, and when such factors are also related to smoking they act as confounders. Working out how much of the risk is due to smoking itself, rather than to these other factors, is a classic problem of Causal Inference.
Causal Inference in statistics is often confused with predictive analysis in Machine Learning and computer science because the two share many techniques, from PCA and SVM to Neural Networks. The fundamental difference between these two fields lies not in the methods used but in the purpose the data serves. Statistics primarily focuses on inference, while Machine Learning concentrates on prediction. Data Science, a burgeoning field, emerges at the intersection of these two domains. We are well-acquainted with the applications of Machine Learning in Data Science, but what about the role of causal inference in this field?
Firstly, we need to distinguish between two concepts: Inference and Prediction. Inference is the process of understanding the factors and mechanisms behind observed data and generalizing them. Prediction, on the other hand, aims to determine what will happen for a new data point. For instance, using the Random Forest machine learning algorithm to predict the progression of diabetes from biochemical markers yields a prediction. Conversely, identifying which markers matter most for disease progression, using techniques like feature_importance, is a form of inference. Such importance scores describe how the model uses each marker, so they show association rather than cause; reading them causally requires further assumptions about confounding. As the number of features increases, the model's accuracy may improve, but inferring the influence of each feature on the outcome becomes harder because of issues like high dimensionality and interaction effects.
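The distinction can be sketched in a few lines of code. This is a minimal illustration only: it swaps the Random Forest for a one-variable least-squares fit so that it runs with the Python standard library alone, and the data, the marker x, and the helper fit_line are all made up for the example.

```python
import random

random.seed(0)

# Synthetic data (made up for illustration): a hypothetical marker x and a
# progression score y generated as y = 2.0 * x + 1.0 + noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)

# Prediction: what outcome do we expect for a new data point?
new_x = 4.0
predicted_y = slope * new_x + intercept

# Inference: how strongly is the marker associated with the outcome?
print(f"predicted progression at x={new_x}: {predicted_y:.2f}")
print(f"estimated change in y per unit of x: {slope:.2f}")
```

The same fitted model answers both questions, but they are different questions: the first only needs the output to be accurate, while the second asks us to interpret a parameter, and that interpretation stays associational unless confounding is ruled out.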
Causal inference guides us in making interventions or controlling what is happening in reality. In many cases, intervention and control are the most valuable outcomes Data Science can provide. In medical research in particular, we not only want to predict a patient's survival time from their biochemical status at diagnosis, but also need to propose interventions that can prolong that time. In such settings, predictions, even highly accurate ones, become nearly meaningless if we cannot infer the mechanisms influencing the output variable. Performing causal inference analysis is not easy; it requires a significant amount of time, resources, and human effort to produce meaningful results.
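To see why association alone can mislead an intervention, consider a small simulation. It is a sketch under stated assumptions, not a real study: the probabilities, the single binary confounder (age group), and the function names simulate and risk_difference are all invented for illustration. In this made-up world, exposure raises the risk of disease by exactly 0.10.

```python
import random

random.seed(42)

def simulate(n=20000):
    """Return (old, exposed, sick) records from a made-up data-generating process."""
    rows = []
    for _ in range(n):
        old = random.random() < 0.5
        # The confounder (age) makes exposure more likely...
        exposed = random.random() < (0.7 if old else 0.2)
        # ...and raises the risk of disease by itself. Exposure adds 0.10.
        risk = 0.05 + (0.20 if old else 0.0) + (0.10 if exposed else 0.0)
        sick = random.random() < risk
        rows.append((old, exposed, sick))
    return rows

def risk_difference(rows):
    """Disease rate among the exposed minus the rate among the unexposed."""
    exposed = [sick for _, e, sick in rows if e]
    unexposed = [sick for _, e, sick in rows if not e]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

rows = simulate()

# Naive comparison: mixes the effect of exposure with the effect of age.
naive = risk_difference(rows)

# Adjusted comparison: take the difference within each age group,
# then average the two differences weighted by group size.
adjusted = 0.0
for group in (True, False):
    stratum = [r for r in rows if r[0] == group]
    adjusted += risk_difference(stratum) * len(stratum) / len(rows)

print(f"naive risk difference:    {naive:.3f}")
print(f"adjusted risk difference: {adjusted:.3f}  (true effect in this simulation: 0.100)")
```

The naive comparison overstates the effect because older people are both more often exposed and more often sick. Stratifying by the confounder recovers a value close to the true effect here only because the simulation has a single, fully observed confounder; real data rarely offer that guarantee.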
In practice, the choice between causal inference and predictive analysis depends on the job requirements and the intended use of the data. In business and technical environments, prediction is more common: forecasting user behavior and prices, segmenting customers, detecting fraud, and so on. There, knowing what the outcome will be often matters more than intervening to change it. In contrast, in research-oriented environments such as science, healthcare, and the social sciences, the emphasis is on interventions and on understanding how they affect outcomes, so that negative factors can be minimized or eliminated. Therefore, causal inference plays a significant role not only in applied Data Science fields but also in bringing stability and progress to individuals and society.
So, if you were to work in the field of Data Science, which aspect would you prefer: causal inference or prediction?
References:
Linh Nghiem, Data Science: Inference or Prediction?: https://linhnghiem.org/2019/11/03/khoa-hoc-du-lieu-suy-luan-hay-du-doan/
Causal Inference for The Brave and True: https://matheusfacure.github.io/python-causality-handbook/landing-page.html
Detailed Information:
Hai Bang - Communication Collaborator, University of Information Technology
English version: Phan Huy Hoang