etl.quality

Improving the quality of data from the perspectives of accuracy, consistency, and reliability for LLMs.

etl.quality.language module

Language filtering from Common Crawl

This is a migration of the Common Crawl code to Dataverse. Some parts of the code are taken from facebookresearch/cc_net: https://github.com/facebookresearch/cc_net/blob/main/cc_net/split_by_lang.py

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.quality.language.quality___language___fasttext_filter(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', top_k: int = 1, score_rounding: int = 2, threshold: float = 0.0, whitelist: List[str] = None, blacklist: List[str] = None, *args, **kwargs) → pyspark.rdd.RDD

Filters data by language using fasttext. Rows whose language score is below threshold are removed.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to be processed.

  • subset (str, optional) – A subset or column to consider. Defaults to ‘text’.

  • top_k (int, optional) –

    The number of top languages to keep after classification. Defaults to 1.

    • if fasttext classified 3 languages, top_k=1 will keep the top language
      • [en, fr, de] -> [en]

    • if fasttext classified 3 languages, top_k=2 will keep the top 2 languages
      • [en, fr, de] -> [en, fr]

  • score_rounding (int, optional) – The number of decimal places to round the scores. Defaults to 2.

  • threshold (float, optional) – The minimum language score required to keep a row; rows scoring below it are removed. Defaults to 0.0.

  • whitelist (List[str], optional) – The list of languages to keep. Defaults to None.

  • blacklist (List[str], optional) – The list of languages to remove. Defaults to None.
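
The interaction of top_k, score_rounding, and threshold can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper names select_languages and passes_threshold are hypothetical, the input is assumed to be a list of (language, score) pairs such as a classifier might return, and comparing only the best score against threshold is an assumption based on the description above.

```python
from typing import List, Tuple

# Hypothetical shape of one classifier result: (language code, score).
Prediction = Tuple[str, float]


def select_languages(
    predictions: List[Prediction], top_k: int = 1, score_rounding: int = 2
) -> List[Prediction]:
    """Keep the top_k highest-scoring languages and round their scores."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [(lang, round(score, score_rounding)) for lang, score in ranked[:top_k]]


def passes_threshold(selected: List[Prediction], threshold: float = 0.0) -> bool:
    """Assumption: a row is kept when its best score is not below threshold."""
    return bool(selected) and selected[0][1] >= threshold


# Example input, deliberately unsorted to show that ranking is by score.
predictions = [("fr", 0.052), ("en", 0.912), ("de", 0.031)]
top1 = select_languages(predictions, top_k=1)
top2 = select_languages(predictions, top_k=2)
```

With the default threshold of 0.0, any row that received at least one non-negative prediction would pass in this sketch; raising threshold makes the filter stricter.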

Raises:

ValueError – If both whitelist and blacklist are not None.

Returns:

The filtered data.

Return type:

rdd

Caveats about whitelist and blacklist:
  • [Default] If both whitelist and blacklist are None, all languages will be kept.

  • If both whitelist and blacklist are not None, an error will be raised.

  • If whitelist is not None, only the languages in the whitelist will be kept.

  • If blacklist is not None, the languages in the blacklist will be removed.
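
The four cases above can be expressed as a small decision function. The sketch below is an illustration under stated assumptions rather than the module's code: is_language_kept is a hypothetical helper, and it is assumed that a row carries a single detected language code.

```python
from typing import List, Optional


def is_language_kept(
    lang: str,
    whitelist: Optional[List[str]] = None,
    blacklist: Optional[List[str]] = None,
) -> bool:
    """Decide whether a detected language survives the filter."""
    # Both lists set: the documented behavior is to raise ValueError.
    if whitelist is not None and blacklist is not None:
        raise ValueError("whitelist and blacklist cannot both be set")
    # Whitelist only: keep just the listed languages.
    if whitelist is not None:
        return lang in whitelist
    # Blacklist only: drop the listed languages.
    if blacklist is not None:
        return lang not in blacklist
    # Default: neither list is set, so every language is kept.
    return True
```

For example, is_language_kept("fr", whitelist=["en", "ko"]) returns False, while is_language_kept("fr") returns True because no list is set.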