etl.deduplication

Eliminates duplicate data, either within each dataset or globally across multiple datasets.

etl.deduplication.common_crawl module

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.deduplication.common_crawl.deduplication___common_crawl___exact_line(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset='text', *args, **kwargs) pyspark.rdd.RDD

Performs exact line-by-line deduplication on the given data.

Each line is stripped and lowercased before comparison; the original text is left unchanged.

Examples

  • input

    text

    DuckY

    dUCKY

  • output

    text

    DuckY

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to be deduplicated.

  • subset (str, optional) – The column to consider when deduplicating. Defaults to ‘text’.

Returns:

The deduplicated data.

Return type:

RDD

Raises:

AssertionError – If the input data is not a DataFrame.
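
The normalize-then-compare behaviour described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the module's implementation: dedup_lines is a hypothetical helper, and keeping the first occurrence of a line and dropping documents left with no lines are both assumptions.

```python
def dedup_lines(texts):
    """Drop lines already seen in any earlier document (hypothetical helper)."""
    seen = set()
    result = []
    for text in texts:
        kept = []
        for line in text.split("\n"):
            # The normalized key is used only for comparison.
            key = line.strip().lower()
            if key in seen:
                continue
            seen.add(key)
            # The original, unnormalized line is what gets kept.
            kept.append(line)
        # Assumption: a document with no surviving lines is dropped.
        if kept:
            result.append("\n".join(kept))
    return result


print(dedup_lines(["DuckY", " ducky "]))  # ['DuckY']
```

The surviving line keeps its original casing and whitespace, because normalization is applied only to the comparison key.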

etl.deduplication.exact module

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.deduplication.exact.deduplication___exact___column(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: List[str] = ['text'], *args, **kwargs)

Performs exact deduplication based on the values of the given columns.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to be deduplicated.

  • subset (List[str]) – Subset of columns to consider when checking for duplicates. Defaults to [‘text’].

Returns:

The deduplicated DataFrame.
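
To make the role of subset concrete, here is a minimal pure-Python sketch of exact deduplication over a list of dict rows. It is an illustration only: dedup_rows is a hypothetical helper, and keeping the first matching row is an assumption. In PySpark the comparable built-in is DataFrame.dropDuplicates(subset), which does not guarantee which of the duplicate rows survives; whether this function relies on it is not stated here.

```python
def dedup_rows(rows, subset=("text",)):
    """Keep the first row for each distinct combination of the subset columns."""
    seen = set()
    result = []
    for row in rows:
        # Only the columns named in subset take part in the comparison.
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "hello"},
    {"id": 3, "text": "world"},
]
print([r["id"] for r in dedup_rows(rows)])                  # [1, 3]
print([r["id"] for r in dedup_rows(rows, ("id", "text"))])  # [1, 2, 3]
```

Adding more columns to subset makes the match stricter, so fewer rows are treated as duplicates.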

etl.deduplication.minhash module

Code is from ChenghaoMou/text-dedup https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py

This is a migration of the code to Dataverse.

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.deduplication.minhash.deduplication___minhash___lsh_jaccard(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, threshold: float = 0.7, ngram_size: int = 5, min_length: int = 5, num_perm: int = 250, band_n: int = None, row_per_band: int = None, subset: str = 'text', seed: int = 42, *args, **kwargs) pyspark.rdd.RDD

Fuzzy deduplication using MinHash and Locality Sensitive Hashing (LSH).

Parameters:
  • spark (SparkSession) – The SparkSession object.

  • data (Union[RDD, DataFrame]) – Input data to be deduplicated.

  • threshold (float, optional) – Similarity threshold. Default is 0.7.

  • ngram_size (int, optional) – Size of n-grams. Default is 5.

  • min_length (int, optional) – Minimum number of tokens a document must contain to be considered. Default is 5.

  • num_perm (int, optional) – Number of permutations. Default is 250.

  • band_n (int, optional) – Number of bands. If not provided, it will be calculated based on the threshold and num_perm.

  • row_per_band (int, optional) – Number of rows per band. If not provided, it will be calculated based on the threshold and num_perm.

  • subset (str, optional) – Column to deduplicate on. Default is “text”.

  • seed (int, optional) – Random seed. Default is 42.

Returns:

Deduplicated data as a DataFrame.

Return type:

RDD
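
The relationship between threshold, num_perm, band_n and row_per_band can be sketched with the standard LSH banding formula: with b bands of r rows each, two documents with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, and the curve rises most steeply near (1/b)^(1/r). The code below is a simplified heuristic under that formula, not the selection procedure this module uses; pick_bands and candidate_probability are hypothetical names.

```python
def candidate_probability(s, bands, rows):
    """Probability that two documents with Jaccard similarity s share a band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def pick_bands(threshold, num_perm):
    """Pick (bands, rows) with bands * rows <= num_perm whose approximate
    threshold (1 / bands) ** (1 / rows) is closest to the requested one."""
    best = None
    for rows in range(1, num_perm + 1):
        bands = num_perm // rows
        approx = (1.0 / bands) ** (1.0 / rows)
        err = abs(approx - threshold)
        if best is None or err < best[0]:
            best = (err, bands, rows)
    return best[1], best[2]


bands, rows = pick_bands(0.7, 250)
print(bands, rows)
print(candidate_probability(0.9, bands, rows))  # well above the threshold
print(candidate_probability(0.5, bands, rows))  # well below the threshold
```

For threshold 0.7 and num_perm 250 this heuristic selects 27 bands of 9 rows, leaving a few permutations unused. Pairs well above the threshold almost always become candidates, while pairs well below it rarely do.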

etl.deduplication.polyglot module

Code is from EleutherAI/dps https://github.com/EleutherAI/dps/blob/master/dps/spark/jobs/dedup_job.py

This is a migration of the deduplication job from the DPS project to Dataverse.

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.deduplication.polyglot.deduplication___polyglot___minhash(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, expand_size: int = 64, n_gram: int = 15, seed: int = 1, char_level: bool = False, sim_threshold: float = 0.8, *args, **kwargs)

Fuzzy deduplication using the MinHash algorithm.

Parameters:
  • spark (SparkSession) – The SparkSession object.

  • data (Union[RDD, DataFrame]) – The input data to be deduplicated.

  • expand_size (int, optional) – The size of expansion for each instance. Defaults to 64.

  • n_gram (int, optional) – The size of n-gram for tokenization. Defaults to 15.

  • seed (int, optional) – The seed value for random number generation. Defaults to 1.

  • char_level (bool, optional) – Whether to use character-level tokenization. Defaults to False.

  • sim_threshold (float, optional) – The similarity threshold for deduplication. Defaults to 0.8.

  • *args – Additional positional arguments.

  • **kwargs – Additional keyword arguments.

Returns:

The deduplicated data.

Return type:

RDD or DataFrame

Raises:

None

Examples

>>> deduplication___polyglot___minhash()(spark, data, expand_size=128, sim_threshold=0.9)
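
For readers unfamiliar with MinHash, the sketch below shows the general idea behind the n_gram, char_level and sim_threshold parameters: documents are turned into sets of n-gram shingles, and a MinHash signature estimates the Jaccard similarity of those sets. It is a generic pure-Python illustration, not the code migrated from DPS; all function names and the num_hashes parameter are hypothetical, and the SHA-1 based hashing is an assumption chosen for simplicity.

```python
import hashlib


def shingles(text, n_gram=2, char_level=False):
    """Return the set of n-grams over characters or whitespace-separated tokens."""
    tokens = list(text) if char_level else text.split()
    if len(tokens) < n_gram:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n_gram]) for i in range(len(tokens) - n_gram + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64, seed=1):
    """For each salted hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        return []
    signature = []
    for i in range(num_hashes):
        prefix = f"{seed}:{i}:".encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + repr(s).encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = shingles("a b c d", n_gram=2)
b = shingles("a b c e", n_gram=2)
print(jaccard(a, b))  # 0.5
print(estimated_similarity(minhash_signature(a), minhash_signature(b)))
```

A pair would be treated as near-duplicates when the estimate reaches the chosen similarity threshold. The estimate becomes more accurate as the number of hash functions grows.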