Are categorical variables getting lost in your random forests?
I’ve been one-hot encoding categorical variables for as long as I’ve been using scikit-learn. It turns out that you can lose a lot of predictive power this way, and that alternatives exist.
Decision tree models can handle categorical variables without one-hot encoding them. However, popular implementations of decision trees (and random forests) differ as to whether they honor this fact. We show that one-hot encoding can seriously degrade tree-model performance. Our primary comparison is between H2O (which honors categorical variables) and scikit-learn (which requires them to be one-hot encoded).
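For concreteness, here is a minimal sketch of the one-hot workflow the article questions, using a hypothetical toy DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "color":  ["red", "green", "blue", "green", "red", "blue"],
    "target": [1, 0, 1, 0, 1, 0],
})

# scikit-learn trees require numeric input, so the categorical column
# must be expanded into one 0/1 indicator column per level.
X = pd.get_dummies(df[["color"]])  # columns: color_blue, color_green, color_red
y = df["target"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Each level now competes as a separate binary feature. With many levels,
# each indicator carries little signal on its own and may never be chosen
# for a split — the degradation the article describes.
```

By contrast, an implementation like H2O accepts the raw categorical column directly, so a single split can group several levels together rather than testing one indicator at a time.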
(via Are categorical variables getting lost in your random forests?)