Spark: Count number of duplicate rows
To count the number of duplicate rows in a PySpark DataFrame, groupBy() all the columns, count() each group, then select the sum of the counts for the rows where the count is greater than 1:
import pyspark.sql.functions as funcs

# Group by every column, keep groups that occur more than once,
# and sum their counts.
df.groupBy(df.columns)\
  .count()\
  .where(funcs.col('count') > 1)\
  .select(funcs.sum('count'))\
  .show()
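For context, here is a minimal self-contained sketch with a tiny made-up DataFrame (the SparkSession setup, column names, and values are illustrative, not from the original post). Note that each duplicated row is counted every time it appears, so three identical rows contribute 3 to the sum:

from pyspark.sql import SparkSession
import pyspark.sql.functions as funcs

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: (1, 'a') appears three times, (2, 'b') once.
df = spark.createDataFrame(
    [(1, 'a'), (1, 'a'), (1, 'a'), (2, 'b')],
    ['id', 'label'],
)

df.groupBy(df.columns)\
  .count()\
  .where(funcs.col('count') > 1)\
  .select(funcs.sum('count'))\
  .show()
# +----------+
# |sum(count)|
# +----------+
# |         3|
# +----------+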
Via SO.