etl.pii

Ensures that sensitive information, such as personally identifiable information (PII), is removed from the dataset.

etl.pii.card module

etl.pii.card.pii___card___replace_card_number(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', pattern: str = '(\\d{4}-\\d{4}-\\d{4}-\\d{4})', random_pii: bool = True, replace_pii: bool = False, replace_token: str = '[CARD_NUMBER]', start_token: str = '', end_token: str = '', *args, **kwargs) → pyspark.rdd.RDD

Replace card number with a random number or a token

Parameters:

spark – The SparkSession object.
data (Union[RDD, DataFrame]) – The input data to process.
subset (str, optional) – The subset or columns to consider. Defaults to ‘text’.
pattern (str, optional) – The regex pattern to find. Defaults to r’(\d{4}-\d{4}-\d{4}-\d{4})’.
random_pii (bool, optional) – If True, replace the pii with a random number. Defaults to True.
replace_pii (bool, optional) – If True, replace the pii with the replace_token. Defaults to False.
replace_token (str, optional) – The token to replace the pii with. Defaults to ‘[CARD_NUMBER]’.
start_token (str, optional) – The token to insert immediately before each match of the pattern. Defaults to ‘’.
end_token (str, optional) – The token to insert immediately after each match of the pattern. Defaults to ‘’.

Returns:

The processed data.

Return type:

RDD

Caveats:

replace_pii takes precedence over random_pii
- e.g. when both are True, the card number is replaced with the token
- e.g. this is 1234-1234-1234-1234 -> this is [CARD_NUMBER]
start_token and end_token are added immediately before and after the card number
- this applies regardless of whether random_pii or replace_pii is True or False
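
The precedence rules above can be illustrated with a minimal pure-Python sketch. This is not the module's implementation: the helper name replace_card_number, the rng argument, and the digit-by-digit randomization are assumptions made for illustration, and the sketch works on a single string rather than on an RDD or DataFrame.

```python
import random
import re

# Same default pattern as in the documented signature.
CARD_PATTERN = r"(\d{4}-\d{4}-\d{4}-\d{4})"


def replace_card_number(text, pattern=CARD_PATTERN, random_pii=True,
                        replace_pii=False, replace_token="[CARD_NUMBER]",
                        start_token="", end_token="", rng=None):
    """Hypothetical helper mirroring the documented parameters on one string."""
    rng = rng or random.Random()

    def _substitute(match):
        found = match.group(0)
        if replace_pii:
            # replace_pii takes precedence over random_pii.
            body = replace_token
        elif random_pii:
            # Assumption: randomize each digit and keep the separators.
            body = re.sub(r"\d", lambda _: str(rng.randint(0, 9)), found)
        else:
            body = found
        # The start and end tokens wrap the result in every case.
        return f"{start_token}{body}{end_token}"

    return re.sub(pattern, _substitute, text)


both_true = replace_card_number(
    "this is 1234-1234-1234-1234", random_pii=True, replace_pii=True
)
print(both_true)  # this is [CARD_NUMBER]
```

Passing a seeded random.Random as rng keeps the randomized output reproducible, which can help when testing a pipeline that uses this kind of substitution.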

Examples

<input>

text = ‘card number is 1234-1234-1234-1234.’

<output>

random pii
- text = ‘card number is 2238-1534-1294-1274.’
replace pii
- replace_token = ‘[CARD_NUMBER]’
- text = ‘card number is [CARD_NUMBER].’
start token
- start_token = ‘[CARD_NUMBER_START]’
- text = ‘card number is [CARD_NUMBER_START]1234-1234-1234-1234.’
end token
- end_token = ‘[CARD_NUMBER_END]’

etl.pii.nin module

NIN (National Identification Number)

A national identification number, national identity number, or national insurance number or JMBG/EMBG is used by the governments of many countries as a means of tracking their citizens, permanent residents, and temporary residents for the purposes of work, taxation, government benefits, health care, and other governmentally-related functions.

https://en.wikipedia.org/wiki/National_identification_number

etl.pii.nin.pii___nin___replace_korean_rrn(self, spark, data: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, subset: str = 'text', pattern: str = '\\d{6}-\\d{7}', random_pii: bool = True, replace_pii: bool = False, replace_token: str = '[NIN]', start_token: str = '', end_token: str = '', *args, **kwargs) → pyspark.rdd.RDD

Replace Korean RRN (Resident Registration Number) with a random number or a token
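
As a rough illustration of what the default pattern matches, the sketch below applies it to plain strings with Python's re module. The helper mask_rrn and the sample values are hypothetical and are not part of the module. Note that the default pattern has no boundary anchors, so it can also match inside a longer run of digits.

```python
import re

# Same default pattern as in the documented signature.
RRN_PATTERN = r"\d{6}-\d{7}"


def mask_rrn(text, pattern=RRN_PATTERN, replace_token="[NIN]"):
    """Hypothetical helper: replace every match of the pattern with the token."""
    return re.sub(pattern, replace_token, text)


masked = mask_rrn("RRN: 900101-1234567")
print(masked)  # RRN: [NIN]

# Without anchors, the pattern also matches part of a longer digit sequence.
partial = re.findall(RRN_PATTERN, "1234567-12345678")
print(partial)  # ['234567-1234567']
```

If such partial matches are unwanted, a stricter pattern (for example one using lookarounds such as (?<!\d) and (?!\d)) can be supplied through the pattern parameter.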