etl.utils

Provides essential utilities for data processing, including sampling, logging, and statistical analysis.

etl.utils.log module

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.utils.log.utils___log___count(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, prev_etl_name: str = None, *args, **kwargs) pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame

Count the number of rows in the data. The data itself is passed through unchanged.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data whose rows are counted.

  • prev_etl_name (str, optional) – Name of the previous ETL process. Defaults to None.

Returns:

The input data. Nothing is changed.

Return type:

Union[RDD, DataFrame]
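The pass-through behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `log_count_sketch` is a hypothetical name, a list of dicts stands in for the RDD or DataFrame, and the log message format is an assumption.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("etl.utils.log.sketch")  # hypothetical logger name


def log_count_sketch(data: List[dict], prev_etl_name: Optional[str] = None) -> List[dict]:
    """Hypothetical stand-in: report the row count, then return the data as-is."""
    total = len(data)  # with Spark input this would be data.count()
    logger.info("%s - total rows: %d", prev_etl_name, total)  # assumed message format
    return data  # the very same object; nothing is copied or modified


rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
out = log_count_sketch(rows, prev_etl_name="previous_step")
```

Because the step returns its input untouched, it can be placed between any two ETL steps without affecting the data that flows through.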

etl.utils.sampling module

Sampling module for data ingestion

etl.utils.sampling.utils___sampling___random(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, replace: bool = False, sample_n_or_frac: float = 0.1, seed: int = 42, *args, **kwargs) pyspark.rdd.RDD

Randomly sample the input data.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to be sampled.

  • replace (bool, optional) – Whether to sample with replacement. Defaults to False.

  • sample_n_or_frac (float, optional) – Number of samples to take or fraction of the data to sample. Defaults to 0.1.

  • seed (int, optional) – Seed for the random number generator. Defaults to 42.

Returns:

Sampled RDD

Return type:

RDD
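The parameters above can be illustrated with a plain-Python sketch. It rests on assumptions the reference does not confirm: that an int value of sample_n_or_frac requests an exact number of rows while a float is treated as a fraction, and that fraction-based sampling keeps each row independently. `random_sample_sketch` is a hypothetical name, and a list stands in for the RDD.

```python
import random
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


def random_sample_sketch(
    rows: Sequence[T],
    replace: bool = False,
    sample_n_or_frac: Union[int, float] = 0.1,
    seed: int = 42,
) -> List[T]:
    """Hypothetical plain-Python stand-in for the sampling step."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    rows = list(rows)
    if not rows:
        return []

    if isinstance(sample_n_or_frac, int):
        # Assumption: an int requests an exact number of rows.
        n = sample_n_or_frac
        if replace:
            return [rng.choice(rows) for _ in range(n)]
        return rng.sample(rows, min(n, len(rows)))

    # Assumption: a float is a fraction of the data.
    if replace:
        # Simplification: draw round(fraction * len(rows)) rows with replacement.
        k = round(sample_n_or_frac * len(rows))
        return [rng.choice(rows) for _ in range(k)]
    # Keep each row independently with the given probability,
    # so the sample size is only approximately fraction * len(rows).
    return [row for row in rows if rng.random() < sample_n_or_frac]


rows = list(range(1000))
by_fraction = random_sample_sketch(rows, sample_n_or_frac=0.1, seed=42)
by_count = random_sample_sketch(rows, sample_n_or_frac=5, seed=42)
```

Calling the sketch twice with the same seed yields the same sample, which is the purpose of the seed parameter.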

etl.utils.statistics module

etl.utils.statistics.utils___statistics___korean_nouns(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', *args, **kwargs) pyspark.rdd.RDD

Get the frequency of each noun in the given subset of the data.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The data to extract the nouns from.

  • subset (str, optional) – The subset of the data to extract the nouns from. Defaults to 'text'.

Returns:

The frequency of each noun in the given subset of the data.

Return type:

RDD[List[Tuple[str, int]]]

Raises:

ImportError – If konlpy or Mecab is not installed.

Examples

>>> data = [
...     {'text': '오리는 꽥꽥 웁니다. 거위는'},
...     {'text': '안녕 세상!'},
...     {'text': '사람들은 꽥꽥 울지 않습니다. 오리가 웁니다'},
... ]
>>> result = utils___statistics___korean_nouns()(spark, data, subset='text')
>>> result.collect()
[('오리', 2), ('거위', 1), ('세상', 1), ('사람', 1)]

Caveats:
  • This function works for Korean text only.

  • The function returns the frequency of each noun, not the unique noun list.
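The frequency counting and the ImportError behavior can be sketched without Spark. This is an illustration under stated assumptions, not the library's implementation: `load_mecab_noun_extractor` and `noun_frequencies_sketch` are hypothetical names, a list of dicts stands in for the RDD, and the extractor is injectable so the counting logic can be tried with a stand-in tokenizer when konlpy and Mecab are not installed.

```python
from collections import Counter
from typing import Callable, Iterable, List, Mapping, Optional, Tuple


def load_mecab_noun_extractor() -> Callable[[str], List[str]]:
    """Return Mecab's noun extractor, failing early if konlpy is missing."""
    try:
        from konlpy.tag import Mecab  # imported lazily; optional dependency
    except ImportError as exc:
        raise ImportError("konlpy and Mecab are required for Korean noun extraction") from exc
    return Mecab().nouns


def noun_frequencies_sketch(
    rows: Iterable[Mapping[str, str]],
    subset: str = "text",
    extract_nouns: Optional[Callable[[str], List[str]]] = None,
) -> List[Tuple[str, int]]:
    """Hypothetical stand-in: count how often each noun occurs in rows[subset]."""
    if extract_nouns is None:
        extract_nouns = load_mecab_noun_extractor()
    counts: Counter = Counter()
    for row in rows:
        counts.update(extract_nouns(row[subset]))
    return counts.most_common()  # (noun, frequency) pairs, most frequent first


# Stand-in extractor for illustration only: treats every whitespace-separated
# token as a noun, which a real morphological analyzer would not do.
toy_rows = [{"text": "duck goose duck"}, {"text": "world"}]
result = noun_frequencies_sketch(toy_rows, subset="text", extract_nouns=str.split)
```

The result contains one entry per distinct noun together with its count, which matches the second caveat: frequencies are returned, not just the list of unique nouns.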