When to use linear regression

By Christina Ellis / May 26, 2022 / Machine learning



Are you wondering when you should choose a linear regression model over a similar machine learning model? Well then, you are in the right place! In this article, we tell you everything you need to know to determine when you should reach for a linear regression model.

This article starts out with a discussion of what kind of outcome variables linear regression is typically used for. After that, some of the main advantages and disadvantages of linear regression are discussed. Finally, we provide specific examples of scenarios where you should and should not use a linear regression model.

What outcomes can you use linear regression for?

What types of outcome variables can you use linear regression for? Linear regression should be used when your outcome variable is a continuous numeric variable. If your outcome variable is not continuous and numeric, then you should consider other types of regression models.

For example, if you have a binary outcome, you can use a logistic regression model. If your outcome variable is a count variable, you can look into using a Poisson regression model.
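To make the mapping from outcome type to model family concrete, here is a minimal sketch in Python using statsmodels. The data is synthetic and the variable names are purely illustrative; the point is only which model class matches which outcome.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))  # two features plus an intercept column

# Continuous numeric outcome -> linear regression (ordinary least squares)
y_numeric = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)
linear = sm.OLS(y_numeric, X).fit()

# Binary outcome -> logistic regression
y_binary = rng.integers(0, 2, size=200)
logistic = sm.Logit(y_binary, X).fit(disp=0)

# Count outcome -> Poisson regression
y_count = rng.poisson(lam=3.0, size=200)
poisson = sm.Poisson(y_count, X).fit(disp=0)
```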

Advantages and disadvantages of linear regression

Are you wondering what the main advantages and disadvantages of linear regression models are? Here are some of the most important ones.

Advantages of linear regression models

  • Interpretable coefficients. One of the main advantages of linear regression models is that they have easily interpretable coefficients that come along with confidence intervals and statistical tests. This is very important if inference is a high priority in the project you are working on. Most other machine learning models do not have the same straightforward interpretation that linear regression models do.
  • No hyperparameters. Another advantage of linear regression is that it does not have hyperparameters that need to be tuned. You may need to preprocess your data and select which features to use in your model, but other than that, there is no need to run different versions of your model with different hyperparameters.
  • Well understood. Another benefit of linear regression is that it is well studied and well understood. Most people who have taken an introductory statistics class have at least heard of linear regression. This means that it tends to be more popular with skeptical stakeholders who do not trust other machine learning models.
  • Fast prediction. A final advantage of linear regression is that generating predictions is fast and simple, and can be implemented even without dedicated machine learning libraries. This means that it is easier to put linear regression models into production at companies that have not built out infrastructure for serving machine learning models. A short sketch of this, along with the coefficient output described above, follows this list.
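As a rough illustration of the first and last advantages, here is a short statsmodels sketch on synthetic data. The fitted model exposes coefficient estimates, confidence intervals, and p-values directly, and scoring a new observation is just a dot product with the stored coefficients.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=100)

fit = sm.OLS(y, X).fit()
print(fit.params)      # interpretable point estimates for each coefficient
print(fit.conf_int())  # 95% confidence intervals
print(fit.pvalues)     # p-values from the coefficient t-tests

# Fast prediction: scoring new data needs no machine learning library at all,
# just the stored coefficients and a dot product.
x_new = np.array([1.0, 0.2, -0.3])  # the leading 1.0 is the intercept term
y_hat = x_new @ fit.params
print(y_hat)
```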

Disadvantages of linear regression models

  • Thrown off by outliers. One disadvantage of linear regression is that it is easily thrown off by outliers in your dataset. If you are using a linear regression model, you should examine your input data and model artifacts to make sure that the model is not being unduly influenced by outliers.
  • Thrown off by correlated features. Another disadvantage of linear regression is that it is easily thrown off if you have multiple highly correlated features in your model. Correlated features make the coefficient estimates unstable, so small changes in your data can produce large swings in the fitted coefficients. This problem is known as multicollinearity.
  • Need to specify interactions. Another disadvantage of linear regression is that you need to explicitly specify interactions that the model should consider when you build your model. If you do not specify interactions between your features, the model will not recognize and account for these interactions (see the sketch after this list).
  • Assumes linearity. Linear regression models also assume that there is a linear relationship between your model features and your outcome variable. This means that you might have to preprocess your model features, for example by log-transforming a skewed feature, to make the relationship more linear.
  • Cannot handle missing data. Most implementations of linear regression models cannot handle missing data natively. That means that you need to preprocess your data and handle the missing values before you run your model.
  • Not peak predictive performance. Another general disadvantage of linear regression is that it does not generally have peak predictive performance on tabular data. If prediction is your main goal, other machine learning models, such as gradient boosted trees, tend to have better predictive performance.
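To illustrate the interaction point above, here is a hedged sketch using the statsmodels formula API on made-up data. The interaction term x1:x2 has to be written out explicitly; without it, the model simply cannot capture the product effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - df["x2"] + 3.0 * df["x1"] * df["x2"] + rng.normal(size=200)

# Without the interaction term, the model has no way to represent x1 * x2.
base = smf.ols("y ~ x1 + x2", data=df).fit()

# The interaction must be specified explicitly ("x1:x2" is the product term).
inter = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()

print(base.rsquared, inter.rsquared)  # the interaction model fits far better
```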

When to use a linear regression model

When should you choose to use a linear regression model? Here are some examples of scenarios where you should use a linear regression model over another model.

  • Inference is your primary goal. If inference is your primary goal, you are often better off using linear regression than another machine learning model. Linear regression models give you estimates of the magnitude of the relationship between your features and your outcome variable along with other useful values like confidence intervals and statistical tests.
  • Baseline model. If you are looking for a simple baseline model that you can use to compare more complicated models against, a linear regression model is a decent choice. This is especially true if you have a relatively clean dataset that does not have many missing values or outliers. One of the main benefits linear regression has in these scenarios is that there are no hyperparameters that need to be tuned, so you only have to fit a single version of the model (see the sketch after this list).
  • Building trust. Since linear regression is a well studied and well publicized model, it is often a good model to reach for when you are still building trust with stakeholders who are skeptical of more complicated machine learning models. After you get buy-in for your linear regression model, you can start to compare the performance of other models to the performance of your linear regression model to show the business value that could be added by upgrading your model.
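As a rough sketch of the baseline idea, here is one way to compare a linear regression baseline against a more complex model using scikit-learn cross-validation (the data is synthetic and for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# The linear baseline has no hyperparameters, so a single fit per fold suffices.
baseline = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
boosted = cross_val_score(GradientBoostingRegressor(random_state=0), X, y, cv=5, scoring="r2")

print(f"linear baseline R^2:   {baseline.mean():.3f}")
print(f"gradient boosting R^2: {boosted.mean():.3f}")
```

The gap between the two scores is exactly the kind of evidence the trust-building point above calls for: it quantifies how much predictive performance a more complicated model would actually buy you.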

When not to use linear regression

When should you not use linear regression? Here are some examples of cases where you should avoid using a linear regression model.

  • Small improvements in predictive performance have a large impact. If you are operating in a scenario where small improvements in predictive performance can have large impacts on the business, you may be better off reaching for another model. For example, gradient boosted trees tend to have better predictive performance than linear regression models. This is especially true in cases where the relationships between your features and your outcome variable are not perfectly linear.
  • You don’t have a lot of time to explore the data. Since linear regression is easily thrown off by things like missing data, outliers, and correlated features, it is not a great choice to turn to if you do not have a lot of time to clean and preprocess your data. In these types of situations, you might be better off turning to a tree-based model, such as a random forest model, that is less sensitive to these issues.
  • You have more features than observations. If you have more features in your model than you do observations in your dataset, a standard linear regression is not a good choice. You should either reduce the number of features you are using in your model or use another model that can handle this situation. Ridge regression is one example of a model that can handle this situation.
  • You have many correlated features. If you have many features in your model that are correlated with one another, you may be better off using ridge regression. This is a regularized version of linear regression that handles correlated features much better than a standard regression model (a short sketch follows this list).
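As a sketch of the last two cases, here is ridge regression with scikit-learn on a synthetic problem that has more features than observations plus a nearly duplicated feature. Ordinary least squares has no unique solution in this setting, but the regularized model remains well behaved. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
n, p = 50, 200                                 # more features than observations
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # two almost perfectly correlated features

beta = np.zeros(p)
beta[:5] = 1.0                                 # only a handful of features actually matter
y = X @ beta + rng.normal(size=n)

# RidgeCV chooses the regularization strength by cross-validation, so the
# penalty is picked from the data rather than guessed.
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(model.alpha_)
```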

Related articles

  • When to use logistic regression
  • When to use ordinal logistic regression
  • When to use multinomial regression
  • When to use random forests
  • When to use ridge regression
  • When to use LASSO
  • When to use Bayesian regression
  • When to use support vector machines
  • When to use gradient boosted trees
  • When to use Poisson regression
  • When to use neural networks
  • When to use mixed models
  • When to use generalized additive models

Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.



