etl.registry

Base classes that support the registration of ETL classes.

Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license

etl.registry.auto_register(etl_categories=['data_ingestion', 'decontamination', 'deduplication', 'bias', 'toxicity', 'cleaning', 'pii', 'quality', 'data_save', 'utils'])

Automatically registers all ETLs in the given categories with the registry.

class etl.registry.ETLStructure

Bases: object

class etl.registry.ETLRegistry

Singleton class to register the ETL classes.

This class provides a registry for ETL classes. It ensures that only one instance of the registry is created and provides methods to register, search, and retrieve ETL classes.

_initialized

Flag to check if the class has been initialized.

Type:

bool

_registry

Dictionary to store the registered ETL classes.

Type:

dict

_status

Dictionary to store the status of the registered ETL classes.

Type:

dict

__new__()

Creates a new instance of the class if it doesn’t exist.

__init__()

Initializes the class and registers the ETL classes.
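The interplay between __new__(), __init__() and the _initialized flag can be illustrated with a minimal sketch. This is not the library's source: SingletonRegistry and the _instance attribute are hypothetical names chosen for illustration, and only _initialized, _registry and _status come from the attribute list above.

```python
class SingletonRegistry:
    """Hypothetical stand-in for ETLRegistry, for illustration only."""

    _instance = None  # assumed storage slot for the single shared instance

    def __new__(cls):
        # Create the instance on the first call and reuse it afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonRegistry() call, so the flag
        # prevents the stored state from being wiped on later calls.
        if self._initialized:
            return
        self._registry = {}
        self._status = {}
        self._initialized = True

    def __len__(self):
        return len(self._registry)


first = SingletonRegistry()
first._registry["cleaning___unicode___normalize"] = object
second = SingletonRegistry()  # same object, state preserved
```

Because both names refer to the same object, an entry stored through one is visible through the other.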

__len__()

Returns the number of registered ETL classes.

__repr__()

Returns a string representation of the registry.

__str__()

Returns a string representation of the registry.

_update_status(key)

Updates the status of the registry.

_convert_to_report_format(status, print_sub_category, print_etl_name)

Converts the status to a report format.

register(self, key: str, etl: ETLStructure)

Registers the ETL (Extract, Transform, Load) process.

Parameters:
  • key (str) – The key used to identify the ETL process. Must follow the format described in the Note below.

  • etl (ETLStructure) – The ETL process to be registered. It should be a subclass of ETLStructure.

Raises:
  • ValueError – If the key is not all lowercase, not separated by ‘___’, or does not have 2 layers of category.

  • TypeError – If the ETL class is not a subclass of ETLStructure.

  • KeyError – If the key is already registered.

Note

  • The key should be in the format of:
    • all lowercase

    • separated by ___

    • it should have 2 layers of category (category and sub-category) before the ETL key

  • Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.
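The key rules and the three documented exceptions can be sketched as follows. This is an assumption-laden illustration rather than the library's code: validate_key and MiniRegistry are hypothetical names, the order of the checks is assumed, and the sub-category "unicode" and ETL key "normalize" are made up for the example.

```python
class ETLStructure:
    """Stand-in for etl.registry.ETLStructure."""


def validate_key(key: str) -> tuple:
    """Split a key into (category, sub_category, etl_key) or raise ValueError."""
    if not isinstance(key, str) or key != key.lower():
        raise ValueError(f"key must be all lowercase: {key!r}")
    parts = key.split("___")
    # Two category layers plus the ETL key gives exactly three non-empty parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"key must look like <category>___<sub_category>___<etl_key>: {key!r}"
        )
    return tuple(parts)


class MiniRegistry:
    """Hypothetical registry that mirrors the documented error cases."""

    def __init__(self):
        self._registry = {}

    def register(self, key, etl):
        validate_key(key)  # ValueError on a malformed key
        if not (isinstance(etl, type) and issubclass(etl, ETLStructure)):
            raise TypeError(f"{etl!r} is not a subclass of ETLStructure")
        if key in self._registry:
            raise KeyError(f"{key} is already registered")
        self._registry[key] = etl


class Normalize(ETLStructure):
    """Hypothetical ETL used only for this example."""


registry = MiniRegistry()
registry.register("cleaning___unicode___normalize", Normalize)
```

Registering the same key a second time, passing a class that does not derive from ETLStructure, or using a key such as "Cleaning___unicode" would raise KeyError, TypeError and ValueError respectively in this sketch.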

search(self, category: str | None = None, sub_category: str | None = None)

Searches the registered ETLs, optionally filtered by category and sub-category.

Parameters:
  • category (str, optional) – The category to search for. Defaults to None.

  • sub_category (str, optional) – The sub-category to search for. Defaults to None.

Returns:

A dictionary containing the filtered status information.

Return type:

dict

Raises:
  • AssertionError – If category is given but is not a string (for example, a list).

  • AssertionError – If sub_category is given but is not a string (for example, a list).

  • ValueError – If sub_category is specified without category.

Note

  • By default, all information is printed.

  • Set print_sub_category to True to print the sub-category.

  • Set print_etl_name to True to print the ETL name.
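The argument checks and filtering described for search() might look like the sketch below. The layout of the status dictionary is not documented here, so the nested shape {category: {sub_category: [etl_name, ...]}} is an assumption, as are the function name search_status, the sample entries, and the choice to return empty containers for unknown categories.

```python
def search_status(status, category=None, sub_category=None):
    """Filter an assumed nested status dict by category and sub-category."""
    assert category is None or isinstance(category, str), "category must be a str"
    assert sub_category is None or isinstance(sub_category, str), "sub_category must be a str"
    if sub_category is not None and category is None:
        raise ValueError("sub_category cannot be used without category")

    if category is None:
        return status  # no filter: report everything
    if sub_category is None:
        return {category: status.get(category, {})}
    return {category: {sub_category: status.get(category, {}).get(sub_category, [])}}


# Hypothetical status content; only the category names come from auto_register.
status = {
    "cleaning": {"unicode": ["normalize"], "html": ["strip_tags"]},
    "pii": {"email": ["mask"]},
}
only_pii = search_status(status, category="pii")
only_html = search_status(status, category="cleaning", sub_category="html")
```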

get(self, key: str) → ETLStructure

Retrieves the ETLStructure associated with the given key.

Parameters:

key (str) – The key used to retrieve the ETLStructure. Must follow the format described in the Note below.

Returns:

The ETLStructure associated with the given key.

Return type:

ETLStructure

Raises:
  • ValueError – If the key is not all lowercase, not separated by ‘___’, or does not have 2 layers of category.

  • KeyError – If the key is not registered in the registry.

Note

  • The key should be in the format of:
    • all lowercase

    • separated by ___

    • it should have 2 layers of category

  • Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.

get_all(self)

Gets all registered ETLs.

Returns:

A list of all registered ETLs.

Return type:

list

reset(self)

Resets the registry.

class etl.registry.ETLAutoRegistry(name, bases, attrs)

Bases: ABCMeta, type

class etl.registry.BaseETL

Bases: ETLStructure

Base class for Spark ETL.

This class provides a base structure for implementing Spark ETL processes. If you need to use self directly, inherit from this class.

run(self, data, *args, **kwargs)

Run the preprocessing logic. This method should be implemented by subclasses.

__call__(self, *args, **kwargs)

Call the run method to perform the preprocessing.
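The relationship between run() and __call__() can be sketched as below. SketchBaseETL and Lowercase are hypothetical names, and a plain Python list stands in for the Spark data a real ETL would receive, so that the example stays self-contained.

```python
class ETLStructure:
    """Stand-in for etl.registry.ETLStructure."""


class SketchBaseETL(ETLStructure):
    """Hypothetical mirror of BaseETL: __call__ forwards to run."""

    def run(self, data, *args, **kwargs):
        raise NotImplementedError("subclasses must implement run()")

    def __call__(self, *args, **kwargs):
        # Calling the instance is shorthand for calling run().
        return self.run(*args, **kwargs)


class Lowercase(SketchBaseETL):
    """Subclassing gives access to self, e.g. for configuration."""

    def __init__(self, strip=True):
        self.strip = strip

    def run(self, data, *args, **kwargs):
        cleaned = [text.lower() for text in data]
        return [text.strip() for text in cleaned] if self.strip else cleaned


result = Lowercase()([" Hello", "WORLD "])
```

Here the instance holds its own option (strip), which is the kind of state that requires inheriting from the base class rather than decorating a plain function.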

etl.registry.register_etl(func)

Decorator to register a function as an ETL.

Parameters:

func (callable) – The function to be registered as an ETL.

Returns:

A dynamically created class that inherits from BaseETL and wraps the original function.

Return type:

type

Raises:

None.

About Attributes:
  • __file_path__ (str): The file path of the function where it is defined.

  • __etl_dir__ (bool): If the file is in the etl directory. If not, it means it’s a dynamically registered user-defined ETL.

Example

>>> @register_etl
... def my_etl_function():
...     pass

Note

The registered ETL function should not rely on the self parameter.

If you need to use self, inherit from BaseETL directly.
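One way a decorator can turn a function into a dynamically created BaseETL subclass is sketched below. It is an illustration under assumptions, not the library's implementation: register_etl_sketch is a hypothetical name, the BaseETL shown is a local stand-in, the step that adds the class to the registry is omitted, and __etl_dir__ is left out because how it is computed is not described here.

```python
class ETLStructure:
    """Stand-in for etl.registry.ETLStructure."""


class BaseETL(ETLStructure):
    """Local stand-in for etl.registry.BaseETL."""

    def run(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.run(*args, **kwargs)


def register_etl_sketch(func):
    """Wrap func in a new class that inherits from BaseETL."""

    def run(self, *args, **kwargs):
        # self is dropped here, which is why the wrapped function
        # cannot rely on it.
        return func(*args, **kwargs)

    attrs = {
        "run": run,
        "__doc__": func.__doc__,
        # File where the function was defined, as in __file_path__ above.
        "__file_path__": func.__code__.co_filename,
    }
    # type(name, bases, attrs) builds the class at decoration time.
    return type(func.__name__, (BaseETL,), attrs)


@register_etl_sketch
def add_one(data):
    """Hypothetical ETL that increments every value."""
    return [value + 1 for value in data]


etl = add_one()          # add_one is now a class, so instantiate it first
result = etl([1, 2, 3])  # __call__ forwards to the wrapped function
```

After decoration the name add_one refers to a class rather than a function, so it is instantiated before being called.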