Recent Posts

Spark: Date Arithmetic with Multiple Columns

less than 1 minute read

Say you have a timestamp column created_at and an integer column number that represents a number of weeks. How do you use the date_add function to calculate...

Spark: Count number of duplicate rows

less than 1 minute read

To count the number of duplicate rows in a pyspark DataFrame, you want to groupBy() all the columns and count(), then select the sum of the counts for the ro...
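As a rough illustration of the group-and-count idea in this teaser, here is a minimal plain-Python sketch rather than the post's PySpark code. The excerpt is cut off, so it assumes the sum is taken over groups whose count is greater than 1; the function name `count_duplicate_rows` and the sample rows are hypothetical.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Return how many rows belong to a group of identical rows.

    Assumption: a "duplicate row" is any row whose full set of column
    values appears more than once, and every member of such a group is
    counted (not only the extra copies).
    """
    # Counting identical tuples plays the role of grouping by all
    # columns and counting each group.
    counts = Counter(rows)
    # Keep only groups that occur more than once, then sum their sizes.
    return sum(c for c in counts.values() if c > 1)


# Hypothetical sample data: each tuple stands for one row.
rows = [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3), ("c", 3)]
print(count_duplicate_rows(rows))
```

With a PySpark DataFrame the same shape would presumably be a `groupBy` over `df.columns`, a `count()`, a filter on `count > 1`, and a sum of the `count` column.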

Docker: Set Timezone

less than 1 minute read

To set which timezone your docker container should use, add the following to your Dockerfile: