Are categorical variables getting lost in your random forests?2018-07-12T15:35:00+00:002018-07-12T15:35:00+00:00I've been one-hot encoding categorical variables for as long as I have been using
scikit-learn.
It turns out that you can lose a lot of predictive power this way,
and that alternatives do exist.</p>
<blockquote>
<p>Decision tree models can handle categorical variables without one-hot encoding them.
However, popular implementations of decision trees (and random forests) differ
as to whether they honor this fact.
We show that one-hot encoding can seriously degrade tree-model performance.
Our primary comparison is between H2O (which honors categorical variables)
and scikit-learn (which requires them to be one-hot encoded).</p>
</blockquote>
<p>(via <a href="https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/">Are categorical variables getting lost in your random forests?</a>)</p>
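<p>To make the quoted claim a little more concrete, here is a minimal sketch in plain Python of what one-hot encoding does to a single categorical column. It is not H2O's or scikit-learn's API; the <code class="highlighter-rouge">one_hot</code> helper and the toy <code class="highlighter-rouge">colors</code> data are hypothetical and exist only for illustration.</p>

```python
def one_hot(values):
    """Expand one categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical toy column with three levels.
colors = ["red", "green", "blue", "green", "red"]
encoded = one_hot(colors)

# One column with 3 levels becomes 3 sparse indicator columns.
for level, column in encoded.items():
    print(level, column)

# A tree split on a single indicator column can only separate one level from
# all the others. A split on the original categorical column could instead
# group levels, e.g. {red, green} versus {blue}, in a single step.
```

<p>With many levels, the indicator columns become very sparse, which is the situation the linked article argues hurts tree-based models.</p>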
Using multiple worktrees with git2017-07-12T12:06:00+00:002017-07-12T12:06:00
<p>When working with multiple branches at the same time, people clone the whole
git repository again.</p>
</blockquote>
<p>I am one of these people. I have been using <code class="highlighter-rouge">git</code> for years, and I can’t believe
I’ve not known about <code class="highlighter-rouge">git worktree</code>! It makes it easy to work on several
branches in the same repository without having to clone a new copy of the repo.</p>
Gathering weak npm credentials2017-07-06T10:41:00+00:002017-07-06T10:41:00We all know the importance of strong passwords, don't we?
<p>In case you don’t, here’s an
<a href="https://github.com/ChALkeR/notes/blob/master/Gathering-weak-npm-credentials.md">example</a>
of how a security researcher was able to
obtain direct publish access to <em>14%</em> of npm packages through some fairly
basic techniques that take advantage of poor password practices.</p>
Canonical Correlation Analysis for Analyzing Sequences of Medical Billing Codes2017-05-25T15:51:00+00:002017-05-25T15:51:00Due to the vast number of medical billing codes, it is generally infeasible to
generate machine learning features from them as one-hot vectors.
<a href="https://arxiv.org/abs/1612.00516">This paper</a> discusses the use of
<a href="https://en.wikipedia.org/wiki/Canonical_correlation">Canonical Correlation Analysis</a>
to reduce this dimensionality and capture the inherent relationships that exist
Explaining complex machine learning models with LIME2017-05-01T10:48:00+00:002017-05-01T10:48:00Another nice
<a href="https://shiring.github.io/machine_learning/2017/04/23/lime">write-up</a>
on the use of Local Interpretable Model-Agnostic Explanations (LIME) to
Combating Fake News With a Smartphone2017-03-09T10:01:00+00:002017-03-09T10:01:00
<p>Automatically [add] extra digital proof data to all photos and videos you take.</p>
</blockquote>
<p>Pretty cool!</p>
SHAttered2017-02-23T10:43:00+00:002017-02-23T10:43:00Awesome work to demonstrate how to deliberately cause a SHA-1 collision.
<blockquote>
<p>It is now practically possible to craft two colliding PDF files and obtain a
SHA-1 digital signature on the first PDF file which can also be abused as a
valid signature on the second PDF file.</p>
</blockquote>
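<p>As a rough sketch of what a collision means in practice, the snippet below uses Python's standard <code class="highlighter-rouge">hashlib</code> to compare two byte strings under SHA-1 and SHA-256. The helper names <code class="highlighter-rouge">digests</code> and <code class="highlighter-rouge">is_sha1_collision</code> are made up for this example; to check real files you would pass in their bytes.</p>

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return the SHA-1 and SHA-256 hex digests of some bytes."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_sha1_collision(a: bytes, b: bytes) -> bool:
    """True if the inputs differ but share the same SHA-1 digest."""
    return a != b and hashlib.sha1(a).digest() == hashlib.sha1(b).digest()


# Ordinary inputs (hypothetical data) do not collide.
first = b"first document"
second = b"second document"
print(digests(first))
print(digests(second))
print("SHA-1 collision:", is_sha1_collision(first, second))
```

<p>For a genuinely colliding pair, <code class="highlighter-rouge">is_sha1_collision</code> would return <code class="highlighter-rouge">True</code> while the SHA-256 digests would still differ.</p>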
Unlearning descriptive statistics2017-02-10T16:54:00+00:002017-02-10T16:54:00Top tips on better descriptive statistics:
<ul>
<li>Instead of the mean, use the median and/or the mode.</li>
<li>Instead of the standard deviation, use the mean absolute deviation, the median absolute deviation, or the interquartile range.</li>
<li>Instead of z-scores, use percentile ranks.</li>
<li>Instead of skewness, use a QQ-plot or a histogram.</li>
<li>Instead of x standard deviations from the mean, use x median absolute deviations from the median.</li>
<li>Instead of correlation metrics, just use a scatterplot.</li>
</ul>
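<p>A few of the tips above can be sketched with Python's standard <code class="highlighter-rouge">statistics</code> module. The sample data and the helper functions below are hypothetical, chosen only to show how one outlier distorts the mean and standard deviation while the robust measures barely move.</p>

```python
import statistics


def median_absolute_deviation(data):
    """Median of the absolute deviations from the median."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)


def interquartile_range(data):
    """Q3 minus Q1, using statistics.quantiles with its default method."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


def percentile_rank(data, value):
    """Percentage of observations less than or equal to value."""
    return 100 * sum(x <= value for x in data) / len(data)


# Hypothetical sample with a single large outlier.
data = [2, 3, 3, 4, 5, 6, 100]

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("standard deviation:", statistics.stdev(data))
print("median absolute deviation:", median_absolute_deviation(data))
print("interquartile range:", interquartile_range(data))
print("percentile rank of 6:", percentile_rank(data, 6))
```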
The legends of mathematics that almost never were2017-02-07T14:06:00+00:002017-02-07T14:06:00It's always made me sad when people tell me they dislike mathematics.
I always wonder if it was the subject or their teachers that they disliked…</p>
<blockquote>
<p>Mathematical genius resides within every one of us. Most people just don’t know it yet. That’s because genius is fragile. If you don’t embrace genius and tend to it with care, it will slip away, leaving behind just a subdued vision of the mathematicians we could have become.</p>
</blockquote>
The Truth About Bad Science2017-01-26T10:20:00+00:002017-01-26T10:20:00
on ‘bad science’ speaks to me on so many levels.
It hurts my soul to see how many published studies are not reproducible.</p>
<blockquote>
<p>“A new study shows…” are “the four most dangerous words”.</p>