etl.quality
Improves data quality for LLMs in terms of accuracy, consistency, and reliability.
etl.quality.language module
Language filtering from Common Crawl
This is a migration of the Common Crawl code to Dataverse. Parts of the code are taken from facebookresearch/cc_net: https://github.com/facebookresearch/cc_net/blob/main/cc_net/split_by_lang.py
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.quality.language.quality___language___fasttext_filter(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', top_k: int = 1, score_rounding: int = 2, threshold: float = 0.0, whitelist: List[str] = None, blacklist: List[str] = None, *args, **kwargs) → pyspark.rdd.RDD
Filters data by language using fastText. Rows whose language score is below threshold are filtered out.
- Parameters:
spark (SparkSession) – The Spark session object.
data (Union[RDD, DataFrame]) – The input data to be processed.
subset (str, optional) – A subset or column to consider. Defaults to ‘text’.
top_k (int, optional) – The number of top languages to keep after classification. Defaults to 1.
- if fasttext classified 3 languages, top_k=1 will keep the top language
[en, fr, de] -> [en]
- if fasttext classified 3 languages, top_k=2 will keep the top 2 languages
[en, fr, de] -> [en, fr]
score_rounding (int, optional) – The number of decimal places to round the scores. Defaults to 2.
threshold (float, optional) – The minimum score to keep the language. Defaults to 0.0.
whitelist (List[str], optional) – The list of languages to keep. Defaults to None.
blacklist (List[str], optional) – The list of languages to remove. Defaults to None.
- Raises:
ValueError – If both whitelist and blacklist are not None.
- Returns:
The filtered data.
- Return type:
rdd
- Caveats about whitelist and blacklist:
[Default] If both whitelist and blacklist are None, all languages will be kept.
If both whitelist and blacklist are not None, an error will be raised.
If whitelist is not None, only the languages in the whitelist will be kept.
If blacklist is not None, the languages in the blacklist will be removed.
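The parameter semantics described above can be illustrated with a small pure-Python sketch. It is not the Dataverse implementation: the helper names select_languages and keep_row are hypothetical, the input is assumed to be a list of (label, score) pairs using fastText's default `__label__` prefix, and the row-level rule (keep a row when at least one language survives) is an assumption, since the documentation does not spell out how top_k > 1 interacts with the threshold and the lists.

```python
from typing import List, Optional, Tuple

# Assumed input shape: (fastText label, probability), e.g. ("__label__en", 0.912).
Prediction = Tuple[str, float]


def select_languages(
    predictions: List[Prediction],
    top_k: int = 1,
    score_rounding: int = 2,
    threshold: float = 0.0,
    whitelist: Optional[List[str]] = None,
    blacklist: Optional[List[str]] = None,
) -> List[Prediction]:
    """Hypothetical helper: return the (language, score) pairs that survive."""
    # Documented behaviour: setting both lists is an error.
    if whitelist is not None and blacklist is not None:
        raise ValueError("whitelist and blacklist cannot both be set")

    # Rank by score, highest first, and keep only the top_k candidates.
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)[:top_k]

    kept = []
    for label, score in ranked:
        lang = label.replace("__label__", "")  # strip fastText's label prefix
        score = round(score, score_rounding)
        if score < threshold:
            continue  # below the minimum score
        if whitelist is not None and lang not in whitelist:
            continue  # only whitelisted languages are kept
        if blacklist is not None and lang in blacklist:
            continue  # blacklisted languages are removed
        kept.append((lang, score))
    return kept


def keep_row(predictions: List[Prediction], **options) -> bool:
    """Assumed row-level rule: keep the row if any language survives."""
    return bool(select_languages(predictions, **options))


preds = [("__label__en", 0.912), ("__label__fr", 0.061), ("__label__de", 0.027)]
print(select_languages(preds, top_k=1))  # [('en', 0.91)]
print(select_languages(preds, top_k=2))  # [('en', 0.91), ('fr', 0.06)]
print(keep_row(preds, whitelist=["fr"]))  # False: the top language is 'en'
```

With the defaults (threshold=0.0 and both lists None), every row that receives a prediction passes, which matches the [Default] caveat above.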