Looking for data in all the right places…

Why You’re Not Getting Value from Your Data Science

2023-05-16T17:48:00-04:00

As a Data Science manager at LeanTaaS, I strive to maximize the return on investment (ROI) of the expensive time that Data Scientists put into new products.

In the Harvard Business Review, Kalyan Veeramachaneni has a great discussion of principles to meet this goal:

If companies want to get value from their data, they need to focus on accelerating human understanding of data, scaling the number of modeling questions they can ask of that data in a short amount of time, and assessing their implications. In our work with companies, we ultimately decided that creating true impact via machine learning will come from a focus on four principles:

Stick with simple models: We decided that simple models, like logistic regression or those based on random forests or decision trees, are sufficient for the problems at hand. The focus should instead be on reducing the time between the data acquisition and the development of the first simple predictive model.

Explore more problems: Data scientists need the ability to rapidly define and explore multiple prediction problems, quickly and easily. Instead of exploring one business problem with an incredibly sophisticated machine learning model, companies should be exploring dozens, building a simple predictive model for each one and assessing their value proposition.

Learn from a sample of data—not all the data: Instead of focusing on how to apply distributed computing to allow any individual processing module to handle big data, invest in techniques that will enable the derivations of similar conclusions from a data subsample. By circumventing the use of massive computing resources, they will enable the exploration of more hypotheses.

Focus on automation: To achieve both reduced time to first model and increased rate of exploration, companies must automate processes that are normally done manually. Over and over across different data problems, we found ourselves applying similar data processing techniques, whether it was to transform the data into useful aggregates, or to prepare data for predictive modeling—it’s time to streamline these, and to develop algorithms and build software systems that do them automatically.

List only untracked files

2023-05-03T21:02:00-04:00

git ls-files --others --exclude-standard

Pandas: Display DataFrames side by side

2023-05-03T09:46:00-04:00

import pandas as pd
from IPython.display import display_html
from itertools import chain, cycle

def display_side_by_side(*args, titles=cycle([''])):
    # source: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side
    html_str = ''
    # Pad the titles with '<br>' so DataFrames without a title still get a spacer.
    for df, title in zip(args, chain(titles, cycle(['<br>']))):
        html_str += '<th style="text-align:center"><td style="vertical-align:top">'
        html_str += "<br>"
        html_str += f'<h2 style="text-align: center;">{title}</h2>'
        # Render each table inline so the tables sit next to each other.
        html_str += df.to_html().replace('table', 'table style="display:inline"')
        html_str += '</td></th>'
    display_html(html_str, raw=True)

df1 = pd.read_csv("file.csv")
df2 = pd.read_csv("file2")
display_side_by_side(df1.head(), df2.head(), titles=['Sales', 'Advertising'])
Output: two DataFrames side by side. (Photo by Lucas Soares.)

Via Lucas Soares.

Pandas: Transforming two DataFrame columns into a dictionary

2023-05-03T09:44:00-04:00

import pandas as pd

df = pd.DataFrame(dict(a=["a","b","c"],b=[1,2,3]))

df_dictionary = dict(zip(df["a"],df["b"]))
df_dictionary

# Output is {'a': 1, 'b': 2, 'c': 3}

Via Lucas Soares.

How to Tag Docker Images with Git Commit Information

2023-05-03T09:26:00-04:00

I wanted to be able to track versions of the Docker image (and the Dockerfile used to create those images), and link those versions back to specific Git commits in the source repository.

This variation of the git log command will print only the full hash of the last commit to the repository: git log -1 --format=%H

If you prefer the shortened commit hash …, then just change the %H to %h, like this: git log -1 --format=%h.

You’ll need to add lines like this to your Dockerfile:

ARG GIT_COMMIT=unspecified
LABEL org.opencontainers.image.revision=$GIT_COMMIT

Note that I’ve updated the label name in the original post to reflect an update later in the post.

The first line defines a build-time argument, and [setting this to] =unspecified means that if the build-time argument is omitted or not supplied, it will default to the value of “unspecified”. The second line takes the information from the argument and adds it as a label on the image.

[Now] build the image with the --build-arg flag:

docker build -t flask-local-build --build-arg GIT_COMMIT=$(git log -1 --format=%h) .

Note that the --build-arg flag applies to docker-compose commands too.

When you build the image this way, you can then see the Git commit attached to the image as a label using this command:

docker inspect flask-local-build | jq '.[].Config.Labels'

Via Scott Lowe.
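
The build step above could also be wrapped in a small script. This is only a sketch, assuming git and docker are on the PATH; current_commit and build_command are hypothetical helper names, not part of the original post. It falls back to "unspecified", mirroring the Dockerfile default, when no commit hash is available.

```python
import subprocess


def current_commit(short=True):
    """Return the hash of the last commit, or None if git cannot provide one."""
    fmt = "%h" if short else "%H"
    try:
        result = subprocess.run(
            ["git", "log", "-1", f"--format={fmt}"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or we are not inside a repository
        return None
    return result.stdout.strip() or None


def build_command(tag, commit=None, context="."):
    """Assemble the docker build argument list used in this post."""
    # Mirror the Dockerfile default when no commit is available.
    revision = commit or "unspecified"
    return ["docker", "build", "-t", tag,
            "--build-arg", f"GIT_COMMIT={revision}", context]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run to execute it.
    print(" ".join(build_command("flask-local-build", current_commit())))
```

Building the command as a list rather than a shell string avoids quoting problems if the tag or build context ever contains spaces.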

Copy a file with progress and save hash to a different file

2023-05-02T21:03:00-04:00

pv file.txt | tee >(sha256sum > file.sha256) > file-copy.txt

Via commandlinefu.com.
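
The same idea, copying and hashing in a single pass over the data, can be sketched in Python. This is an illustration, not a replacement for the one-liner: copy_with_hash is a hypothetical helper name, the chunk size is an arbitrary choice, and the progress display that pv provides is left out.

```python
import hashlib


def copy_with_hash(src, dst, hash_path, chunk_size=1024 * 1024):
    """Copy src to dst in chunks and write the SHA-256 hex digest to hash_path."""
    digest = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(chunk_size):
            digest.update(chunk)  # hash exactly the bytes that get written
            fout.write(chunk)
    hex_digest = digest.hexdigest()
    with open(hash_path, "w") as fh:
        # "<digest>  <name>" is the line layout sha256sum uses
        fh.write(f"{hex_digest}  {src}\n")
    return hex_digest
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which is the same property the tee pipeline has.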

Using a single sudo to run multiple &&-chained commands

2023-05-02T21:00:00-04:00

Here are a couple of ways to run multiple commands using a single sudo:

# 1.
sudo -s <<< 'apt update -y && apt upgrade -y'

# 2.
sudo sh -c 'apt update -y && apt upgrade -y'

Via commandlinefu.com.

Articles from arXiv.org as responsive HTML5 web pages

2023-05-02T20:46:00-04:00

In my previous life as an astronomer, I spent a lot of time reading relevant scientific articles on arXiv.org, which in those days meant downloading (and printing! 😱) lots of pdfs.

So it was nice to see an evolution of this service – now you can:

Change the “X” in any arXiv article link to the “5” in ar5iv to get a modern HTML5 document.

📢 Welcome to https://t.co/YKX9oX7hp4

Thread: what is included, why now, and how we hope to merge back into arXiv. #OA #OpenScience #preprints

— Deyan Ginev (@dginev) January 31, 2022
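
For a batch of links, the "X" to "5" swap can be scripted. A minimal sketch, assuming the rewritten host is ar5iv.org as the tweet's instruction implies; to_ar5iv is a hypothetical helper name, and the article ID in the example is a made-up placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url):
    """Rewrite an arxiv.org article link to its ar5iv.org counterpart."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arXiv link: {url}")
    # Swap only the host; path, query and fragment are kept as they are.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


# The article ID below is a placeholder, not a real reference.
print(to_ar5iv("https://arxiv.org/abs/0000.00000"))
# output is: https://ar5iv.org/abs/0000.00000
```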

Python: Best way to implement a simple queue

2023-05-02T20:30:00-04:00

A simple list can easily be used to implement a queue abstract data structure. A queue implies the first-in, first-out principle.

However, this approach will prove inefficient because inserts and pops from the beginning of a list are slow (all elements need shifting by one).

It’s recommended to implement queues using collections.deque (a class in the collections module), as it was designed for fast appends and pops from both ends.

from collections import deque
queue = deque(["a", "b", "c"])
queue.append("d")
queue.append("e")
queue.popleft()
queue.popleft()
print(queue)
# output is: deque(['c', 'd', 'e'])

A reverse queue can be implemented by opting for appendleft instead of append and pop instead of popleft.

Via enki.com.
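
The reverse queue mentioned above can be sketched the same way: items enter on the left and leave on the right, so the order is still first-in, first-out. The variable names are only for illustration.

```python
from collections import deque

queue = deque(["a", "b", "c"])
queue.appendleft("d")   # new items enter on the left
queue.appendleft("e")
first = queue.pop()     # "c", the rightmost item leaves first
second = queue.pop()    # "b"
print(queue)
# output is: deque(['e', 'd', 'a'])
```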

Git: Show commits in one branch but not another

2023-05-02T17:03:00-04:00

To see a list of which commits are on one branch but not another, use git log:

git log --no-merges oldbranch ^newbranch

You can list multiple branches to include and exclude, e.g.:

git log --no-merges oldbranch1 oldbranch2 ^newbranch1 ^newbranch2

The --no-merges flag excludes commits that are merges.

[You can show] commits and commit contents from other-branch that are not in your current branch:

git show @..other-branch

Additionally you can apply the commits from other-branch directly to your current branch:

git cherry-pick @..other-branch

To show the commits in oldbranch but not in newbranch:

git log newbranch..oldbranch

To show the diff introduced by these commits (note there are three dots):

git diff newbranch...oldbranch

[To] list all branches [that] contain the commits from “branch-to-delete”:

git branch --contains branch-to-delete

Via Stack Overflow.
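
When the branch lists come from a script, the include and exclude notation above can be assembled programmatically. A small sketch: log_command and two_dot are hypothetical helper names, and the functions only build the arguments; nothing runs until the list is passed to something like subprocess.run inside a repository.

```python
def log_command(include, exclude, no_merges=True):
    """Build git log arguments for commits reachable from `include` but not from `exclude`."""
    cmd = ["git", "log"]
    if no_merges:
        cmd.append("--no-merges")
    cmd.extend(include)
    # A leading ^ excludes everything reachable from that branch.
    cmd.extend(f"^{branch}" for branch in exclude)
    return cmd


def two_dot(exclude, include):
    """Spell the same selection as a range: exclude..include means include ^exclude."""
    return f"{exclude}..{include}"


print(" ".join(log_command(["oldbranch"], ["newbranch"])))
# output is: git log --no-merges oldbranch ^newbranch
print(two_dot("newbranch", "oldbranch"))
# output is: newbranch..oldbranch
```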