etl.deduplication
Eliminates duplicated data, either on a dataset-by-dataset basis or globally across multiple datasets.
etl.deduplication.common_crawl module
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.deduplication.common_crawl.deduplication___common_crawl___exact_line(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset='text', *args, **kwargs) → pyspark.rdd.RDD
Performs exact line-by-line deduplication on the given data.
Each line is stripped and lowercased before comparison. This normalization is used only for matching; the original text is left unchanged.
Examples
- input:
    text
    -----
    DuckY
    dUCKY
- output:
    text
    -----
    DuckY
- Parameters:
spark (SparkSession) – The Spark session object.
data (Union[RDD, DataFrame]) – The input data to be deduplicated.
subset (str, optional) – A subset or column to consider. Defaults to ‘text’.
- Returns:
The deduplicated data.
- Return type:
RDD
- Raises:
AssertionError – If the input data is not a DataFrame.
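The matching rule described above (compare on stripped, lowercased text, but return the original) can be illustrated with a minimal pure-Python sketch. This is not the Dataverse implementation, which runs on Spark. The helper name `dedup_exact_lines` is hypothetical, each input string is treated as one line, and keeping the first occurrence is an assumption made for the illustration.

```python
def dedup_exact_lines(texts):
    """Keep the first text seen for each normalized key (hypothetical helper).

    The key is the stripped, lowercased text. The returned values are the
    original, unmodified strings.
    """
    seen = set()
    kept = []
    for text in texts:
        key = text.strip().lower()  # normalization is used for matching only
        if key not in seen:
            seen.add(key)
            kept.append(text)  # the original text is preserved
    return kept


result = dedup_exact_lines(["DuckY", " dUCKY ", "Goose"])
print(result)  # ['DuckY', 'Goose']
```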
etl.deduplication.exact module
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.deduplication.exact.deduplication___exact___column(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: List[str] = ['text'], *args, **kwargs)
Performs exact deduplication on the given columns.
- Parameters:
spark (SparkSession) – The Spark session object.
data (Union[RDD, DataFrame]) – The input data to be deduplicated.
subset (List[str]) – Subset of columns to consider for the duplication check. Defaults to [‘text’].
- Returns:
Deduplicated DataFrame object
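To show how the choice of `subset` changes which rows count as duplicates, here is a small pure-Python sketch. It is an illustration only: `dedup_exact_columns` is a hypothetical helper, rows are modeled as plain dicts, and keeping the first matching row is an assumption. In PySpark, `DataFrame.dropDuplicates(subset)` provides comparable semantics, though it does not guarantee which of the duplicate rows is kept; whether Dataverse uses it internally is not stated here.

```python
def dedup_exact_columns(rows, subset=("text",)):
    """Drop rows whose values in `subset` were already seen (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)  # only these columns are compared
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "text": "duck", "lang": "en"},
    {"id": 2, "text": "duck", "lang": "fr"},
    {"id": 3, "text": "goose", "lang": "en"},
]

by_text = dedup_exact_columns(rows)
by_text_and_lang = dedup_exact_columns(rows, subset=("text", "lang"))

print([r["id"] for r in by_text])           # [1, 3]
print([r["id"] for r in by_text_and_lang])  # [1, 2, 3]
```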
etl.deduplication.minhash module
Code is from ChenghaoMou/text-dedup https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
This is a migration of the code to Dataverse.
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.deduplication.minhash.deduplication___minhash___lsh_jaccard(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, threshold: float = 0.7, ngram_size: int = 5, min_length: int = 5, num_perm: int = 250, band_n: int = None, row_per_band: int = None, subset: str = 'text', seed: int = 42, *args, **kwargs) → pyspark.rdd.RDD
Fuzzy deduplication using MinHash and Locality Sensitive Hashing (LSH).
- Parameters:
spark (SparkSession) – The SparkSession object.
data (Union[RDD, DataFrame]) – Input data to be deduplicated.
threshold (float, optional) – Similarity threshold. Default is 0.7.
ngram_size (int, optional) – Size of n-grams. Default is 5.
min_length (int, optional) – Minimum number of tokens a document must contain to be considered. Default is 5.
num_perm (int, optional) – Number of permutations. Default is 250.
band_n (int, optional) – Number of bands. If not provided, it will be calculated based on the threshold and num_perm.
row_per_band (int, optional) – Number of rows per band. If not provided, it will be calculated based on the threshold and num_perm.
subset (str, optional) – Column to deduplicate on. Default is “text”.
seed (int, optional) – Random seed. Default is 42.
- Returns:
The deduplicated data.
- Return type:
RDD
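The parameters above say that `band_n` and `row_per_band` are derived from `threshold` and `num_perm` when omitted. One common way to do this in MinHash LSH is to pick the pair that minimizes a weighted sum of the false-positive and false-negative areas under the banding S-curve. The sketch below illustrates that idea; it is an assumption that Dataverse follows the same criterion, and the function names and the equal weights are hypothetical.

```python
def collision_probability(s, bands, rows):
    """Chance that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def integrate(f, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def pick_bands_and_rows(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Search all (bands, rows) with bands * rows <= num_perm (hypothetical helper)."""
    best = None
    best_error = float("inf")
    for bands in range(1, num_perm + 1):
        for rows in range(1, num_perm // bands + 1):
            # False positives: pairs below the threshold that still collide.
            fp = integrate(
                lambda s: collision_probability(s, bands, rows), 0.0, threshold
            )
            # False negatives: pairs above the threshold that never collide.
            fn = integrate(
                lambda s: 1.0 - collision_probability(s, bands, rows), threshold, 1.0
            )
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best_error = error
                best = (bands, rows)
    return best


bands, rows = pick_bands_and_rows(threshold=0.7, num_perm=250)
print(bands, rows)
```

More bands with fewer rows each make collisions more likely (fewer missed duplicates, more candidate pairs to verify), while fewer bands with more rows do the opposite.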
etl.deduplication.polyglot module
Code is from EleutherAI/dps https://github.com/EleutherAI/dps/blob/master/dps/spark/jobs/dedup_job.py
This is a migration of the deduplication job from the DPS project to Dataverse.
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.deduplication.polyglot.deduplication___polyglot___minhash(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, expand_size: int = 64, n_gram: int = 15, seed: int = 1, char_level: bool = False, sim_threshold: float = 0.8, *args, **kwargs)
Fuzzy deduplication using MinHash algorithm.
- Parameters:
spark (SparkSession) – The SparkSession object.
data (Union[RDD, DataFrame]) – The input data to be deduplicated.
expand_size (int, optional) – The size of expansion for each instance. Defaults to 64.
n_gram (int, optional) – The size of n-gram for tokenization. Defaults to 15.
seed (int, optional) – The seed value for random number generation. Defaults to 1.
char_level (bool, optional) – Whether to use character-level tokenization. Defaults to False.
sim_threshold (float, optional) – The similarity threshold for deduplication. Defaults to 0.8.
*args – Additional positional arguments.
**kwargs – Additional keyword arguments.
- Returns:
The deduplicated data.
- Return type:
RDD or DataFrame
Examples
>>> deduplication___polyglot___minhash()(spark, data, expand_size=128, sim_threshold=0.9)
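To give a feel for how `n_gram`, `char_level`, and `sim_threshold` interact, the sketch below builds n-gram shingles from two texts and compares their exact Jaccard similarity against a threshold. It is an illustration under stated assumptions, not the Dataverse code: MinHash only estimates this similarity from hashed signatures, `expand_size` is not modeled, and the helpers `shingles` and `jaccard` are hypothetical, as is the assumption that word-level tokenization splits on whitespace.

```python
def shingles(text, n_gram=15, char_level=False):
    """Return the set of n-grams over characters or whitespace-split words."""
    units = list(text) if char_level else text.split()
    if len(units) <= n_gram:
        return {tuple(units)}  # short texts yield a single shingle
    return {tuple(units[i:i + n_gram]) for i in range(len(units) - n_gram + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

sim = jaccard(shingles(doc_a, n_gram=3), shingles(doc_b, n_gram=3))
is_duplicate = sim >= 0.8  # compare against a sim_threshold of 0.8

print(sim)           # 0.75
print(is_duplicate)  # False
```

With these two texts, lowering the threshold to 0.7 would flag the pair as duplicates, which shows how sensitive the result is to `sim_threshold` and the shingle size.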