etl.utils
Essential utilities for data processing, including sampling, logging, and statistical analysis.
etl.utils.log module
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.utils.log.utils___log___count(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, prev_etl_name: str = None, *args, **kwargs) → pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame
Count the number of rows in the data.
- Parameters:
spark (SparkSession) – The Spark session object.
data (Union[RDD, DataFrame]) – The input data whose rows are counted.
prev_etl_name (str, optional) – Name of the previous ETL process. Defaults to None.
- Returns:
The input data. Nothing is changed.
- Return type:
Union[RDD, DataFrame]
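The pass-through contract described above (count the rows, return the input untouched) can be sketched in plain Python. This is only an illustration under stated assumptions: a list of dicts stands in for the RDD or DataFrame, and `log_count`, the logger name, and the log message format are hypothetical names that do not come from the library.

```python
import logging
from typing import List, Optional

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("etl.sketch")


def log_count(rows: List[dict], prev_etl_name: Optional[str] = None) -> List[dict]:
    """Report how many rows were received and hand them back unchanged."""
    label = prev_etl_name if prev_etl_name is not None else "unknown"
    # The message format is an assumption, not the library's actual output.
    logger.info("rows after %s: %d", label, len(rows))
    return rows


rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
same_rows = log_count(rows, prev_etl_name="dedup_step")
```

Because the step returns its input as-is, it can be placed between any two stages of a pipeline without affecting the stages that follow.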
etl.utils.sampling module
Sampling module for data ingestion
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.utils.sampling.utils___sampling___random(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, replace: bool = False, sample_n_or_frac: float = 0.1, seed: int = 42, *args, **kwargs) → pyspark.rdd.RDD
Randomly sample the input RDD.
- Parameters:
spark (SparkSession) – The Spark session object.
data (Union[RDD, DataFrame]) – The input data to be sampled.
replace (bool, optional) – Whether to sample with replacement. Defaults to False.
sample_n_or_frac (float, optional) – Number of samples to take, or fraction of the RDD to sample. Defaults to 0.1.
seed (int, optional) – Seed for the random number generator. Defaults to 42.
- Returns:
Sampled RDD
- Return type:
RDD
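The interaction of `replace`, `sample_n_or_frac`, and `seed` can be sketched in plain Python. This is a hedged illustration, not the library's implementation: it assumes a float is read as a fraction and an int as an absolute count, it uses a list in place of an RDD, and `random_sample` is a hypothetical name. Unlike Spark's fraction-based sampling, this sketch always returns an exact number of items.

```python
import random
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


def random_sample(
    rows: Sequence[T],
    replace: bool = False,
    sample_n_or_frac: Union[int, float] = 0.1,
    seed: int = 42,
) -> List[T]:
    """Draw a reproducible random sample from an in-memory sequence."""
    rng = random.Random(seed)  # a private generator keeps the seed local
    # Assumption: float means "fraction of the data", int means "row count".
    if isinstance(sample_n_or_frac, float):
        n = round(len(rows) * sample_n_or_frac)
    else:
        n = sample_n_or_frac
    if replace:
        # With replacement, the same row may be drawn more than once.
        return rng.choices(rows, k=n)
    # Without replacement, the sample cannot exceed the population size.
    return rng.sample(rows, k=min(n, len(rows)))


rows = list(range(100))
by_fraction = random_sample(rows)                    # 10% of 100 rows
by_count = random_sample(rows, sample_n_or_frac=5)   # exactly 5 rows
oversampled = random_sample(rows, replace=True, sample_n_or_frac=200)
```

Fixing `seed` makes repeated calls return the same sample, which is useful when a downstream result needs to be reproduced.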
etl.utils.statistics module
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.utils.statistics.utils___statistics___korean_nouns(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', *args, **kwargs) → pyspark.rdd.RDD
Get the frequency of each noun in the given subset of the data.
- Parameters:
spark – The SparkSession object.
data – The data to extract the nouns from.
subset – The subset of the data to extract the nouns from. Defaults to 'text'.
- Returns:
The frequency of each noun in the given subset of the data.
- Return type:
RDD[List[Tuple[str, int]]]
- Raises:
ImportError – If konlpy or Mecab is not installed.
Examples
>>> data = [
...     {'text': '오리는 꽥꽥 웁니다. 거위는'},
...     {'text': '안녕 세상!'},
...     {'text': '사람들은 꽥꽥 울지 않습니다. 오리가 웁니다'},
... ]
>>> result = utils___statistics___korean_nouns()(spark, data, subset='text')
>>> result.collect()
[('오리', 2), ('거위', 1), ('세상', 1), ('사람', 1)]
- Caveats:
This function works for Korean text only.
The function returns the frequency of each noun, not the unique noun list.
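The counting step, and the difference between a frequency list and a unique noun list, can be sketched without Spark or Mecab. This is a minimal illustration under stated assumptions: `noun_frequencies` and `whitespace_tokens` are hypothetical names, and the whitespace splitter is only a stand-in for a real Korean noun extractor such as konlpy's `Mecab().nouns`.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def noun_frequencies(
    rows: Iterable[dict],
    extract_nouns: Callable[[str], List[str]],
    subset: str = "text",
) -> List[Tuple[str, int]]:
    """Count how often each extracted token appears in the chosen field."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(extract_nouns(row[subset]))
    # most_common() sorts by descending count; ties keep first-seen order.
    return counts.most_common()


def whitespace_tokens(text: str) -> List[str]:
    """Stand-in extractor for this sketch; it does no real noun detection."""
    return text.split()


rows = [{"text": "duck quack duck"}, {"text": "goose duck"}]
frequencies = noun_frequencies(rows, whitespace_tokens, subset="text")
# The unique noun list is recovered by dropping the counts.
unique_nouns = [noun for noun, _ in frequencies]
```

Passing the extractor in as an argument keeps the counting logic testable on machines where konlpy or Mecab is not installed.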