etl.pii

Removes sensitive information, such as personally identifiable information (PII), from the dataset.

etl.pii.card module

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.pii.card.pii___card___replace_card_number(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', pattern: str = '(\\d{4}-\\d{4}-\\d{4}-\\d{4})', random_pii: bool = True, replace_pii: bool = False, replace_token: str = '[CARD_NUMBER]', start_token: str = '', end_token: str = '', *args, **kwargs) -> pyspark.rdd.RDD

Replace card number with a random number or a token

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to process.

  • subset (str, optional) – The subset or columns to consider. Defaults to ‘text’.

  • pattern (str, optional) – The regex pattern to find. Defaults to r’(\d{4}-\d{4}-\d{4}-\d{4})’.

  • random_pii (bool, optional) – If True, replace the pii with a random number. Defaults to True.

  • replace_pii (bool, optional) – If True, replace the pii with the replace_token. Defaults to False.

  • replace_token (str, optional) – The token to replace the pii with. Defaults to ‘[CARD_NUMBER]’.

  • start_token (str, optional) – The token to insert immediately before each match of the pattern. Defaults to ‘’.

  • end_token (str, optional) – The token to insert immediately after each match of the pattern. Defaults to ‘’.

Returns:

The processed data.

Return type:

RDD

Caveats:
  • replace_pii takes precedence over random_pii
    • e.g. when both are True, the card number is replaced with the token

    • e.g. this is 1234-1234-1234-1234 -> this is [CARD_NUMBER]

  • start_token and end_token are added immediately before and after the card number, respectively
    • this happens regardless of whether random_pii or replace_pii is True or False
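The precedence rule and the token wrapping described in these caveats can be illustrated on a plain string. The following is a minimal sketch under stated assumptions, not the library's implementation: mask_card_numbers is a hypothetical helper, and it assumes that random replacement swaps each digit independently while keeping the hyphens.

```python
import random
import re

CARD_PATTERN = r"(\d{4}-\d{4}-\d{4}-\d{4})"  # default pattern from the signature


def mask_card_numbers(
    text,
    pattern=CARD_PATTERN,
    random_pii=True,
    replace_pii=False,
    replace_token="[CARD_NUMBER]",
    start_token="",
    end_token="",
):
    """Hypothetical helper mirroring the documented options on a plain string."""

    def _replace(match):
        value = match.group()
        if replace_pii:
            # replace_pii wins when both flags are True
            value = replace_token
        elif random_pii:
            # assumption: every digit is swapped for a random one, hyphens stay
            value = re.sub(r"\d", lambda _: str(random.randint(0, 9)), value)
        # the tokens wrap the result regardless of the two flags
        return f"{start_token}{value}{end_token}"

    return re.sub(pattern, _replace, text)


text = "card number is 1234-1234-1234-1234."

both = mask_card_numbers(text, random_pii=True, replace_pii=True)
print(both)  # card number is [CARD_NUMBER].

wrapped = mask_card_numbers(
    text,
    random_pii=False,
    start_token="[CARD_NUMBER_START]",
    end_token="[CARD_NUMBER_END]",
)
print(wrapped)

randomized = mask_card_numbers(text)
print(randomized)  # digits differ from run to run
```

With both flags set to False, the sketch leaves the number itself unchanged and only adds the tokens, which is how the start token and end token examples in this section read.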

Examples

<input>
  • text = ‘card number is 1234-1234-1234-1234.’

<output>
  • random pii
    • text = ‘card number is 2238-1534-1294-1274.’

  • replace pii
    • replace_token = ‘[CARD_NUMBER]’

    • text = ‘card number is [CARD_NUMBER].’

  • start token
    • start_token = ‘[CARD_NUMBER_START]’

    • text = ‘card number is [CARD_NUMBER_START]1234-1234-1234-1234.’

  • end token
    • end_token = ‘[CARD_NUMBER_END]’

    • text = ‘card number is 1234-1234-1234-1234[CARD_NUMBER_END].’

etl.pii.nin module

NIN (National Identification Number)

A national identification number, national identity number, or national insurance number or JMBG/EMBG is used by the governments of many countries as a means of tracking their citizens, permanent residents, and temporary residents for the purposes of work, taxation, government benefits, health care, and other governmentally-related functions.

https://en.wikipedia.org/wiki/National_identification_number

etl.pii.nin.pii___nin___replace_korean_rrn(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', pattern: str = '\\d{6}-\\d{7}', random_pii: bool = True, replace_pii: bool = False, replace_token: str = '[NIN]', start_token: str = '', end_token: str = '', *args, **kwargs) -> pyspark.rdd.RDD

Replace Korean RRN (Resident Registration Number) with a random number or a token

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (Union[RDD, DataFrame]) – The input data to be processed.

  • subset (str, optional) – A subset or column to consider. Defaults to ‘text’.

  • pattern (str, optional) – The regex pattern to find. Defaults to r’\d{6}-\d{7}’.

  • random_pii (bool, optional) – If True, replace the pii with a random number. Defaults to True.

  • replace_pii (bool, optional) – If True, replace the pii with the replace_token. Defaults to False.

  • replace_token (str, optional) – The token to replace the pii with. Defaults to ‘[NIN]’.

  • start_token (str, optional) – The token to insert immediately before each match of the pattern. Defaults to ‘’.

  • end_token (str, optional) – The token to insert immediately after each match of the pattern. Defaults to ‘’.

Returns:

The processed data with replaced Korean RRN.

Return type:

RDD

Caveats:
  • replace_pii takes precedence over random_pii

  • start_token and end_token are added immediately before and after the number, respectively
    • this happens regardless of whether random_pii or replace_pii is True or False

Examples

<input>
  • text = ‘nin is 123456-1234567.’

<output>
  • random pii
    • text = ‘nin is 141124-1244121.’

  • replace pii
    • replace_token = ‘[NIN]’

    • text = ‘nin is [NIN].’

  • start token
    • start_token = ‘[NIN_START]’

    • text = ‘nin is [NIN_START]123456-1234567.’

  • end token
    • end_token = ‘[NIN_END]’

    • text = ‘nin is 123456-1234567[NIN_END].’
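
To show how the subset parameter picks the column that gets rewritten, the sketch below applies the default RRN pattern to a list of dict rows, which stands in for an RDD of rows. It is an illustration under assumptions rather than the library's code: replace_rrn_in_rows and its seed argument are hypothetical, and the random digits it produces are not guaranteed to form a valid RRN.

```python
import random
import re

RRN_PATTERN = r"\d{6}-\d{7}"  # default pattern from the signature


def replace_rrn_in_rows(rows, subset="text", replace_pii=False,
                        replace_token="[NIN]", seed=None):
    """Hypothetical stand-in for mapping the replacement over an RDD of dict rows."""
    rng = random.Random(seed)  # seed exists only to make the illustration repeatable

    def _replace(match):
        if replace_pii:
            return replace_token
        # assumption: each digit is randomized independently, the hyphen stays
        return re.sub(r"\d", lambda _: str(rng.randint(0, 9)), match.group())

    # Only the column named by `subset` is rewritten; other keys pass through.
    return [
        {**row, subset: re.sub(RRN_PATTERN, _replace, row[subset])}
        for row in rows
    ]


rows = [
    {"id": 1, "text": "nin is 123456-1234567."},
    {"id": 2, "text": "no identifier here"},
]

tokenized = replace_rrn_in_rows(rows, replace_pii=True)
print(tokenized[0]["text"])  # nin is [NIN].

randomized = replace_rrn_in_rows(rows, seed=0)
print(randomized[0]["text"])  # same 6-7 digit layout, different digits
```

Rows whose text does not match the pattern come back unchanged, so the same call can be applied to a mixed dataset.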