etl.registry
Base class to support the registration of the ETL classes
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
- etl.registry.auto_register(etl_categories=['data_ingestion', 'decontamination', 'deduplication', 'bias', 'toxicity', 'cleaning', 'pii', 'quality', 'data_save', 'utils'])
Automatically registers all ETLs in the given categories to the registry.
- class etl.registry.ETLStructure
Bases:
object
- class etl.registry.ETLRegistry
Singleton class to register the ETL classes.
This class provides a registry for ETL classes. It ensures that only one instance of the registry is created and provides methods to register, search, and retrieve ETL classes.
- _initialized
Flag to check if the class has been initialized.
- Type:
bool
- _registry
Dictionary to store the registered ETL classes.
- Type:
dict
- _status
Dictionary to store the status of the registered ETL classes.
- Type:
dict
- __new__()
Creates a new instance of the class if it doesn’t exist.
- __init__()
Initializes the class and registers the ETL classes.
- __len__()
Returns the number of registered ETL classes.
- __repr__()
Returns a string representation of the registry.
- __str__()
Returns a string representation of the registry.
- _update_status(key)
Updates the status of the registry.
- _convert_to_report_format(status, print_sub_category, print_etl_name)
Converts the status to a report format.
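The singleton behaviour described for `__new__` and `__init__` above can be sketched as follows. This is a minimal illustration, not the library's code: `SingletonRegistry` is a hypothetical stand-in, and the key used at the end is made up for the example.

```python
class SingletonRegistry:
    """Hypothetical stand-in for ETLRegistry; not the library's implementation."""

    _instance = None
    _initialized = False

    def __new__(cls):
        # Create the instance only once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonRegistry() call, so guard the setup
        # to avoid wiping the registry each time.
        if self._initialized:
            return
        self._registry = {}
        self._status = {}
        type(self)._initialized = True

    def __len__(self):
        return len(self._registry)


first = SingletonRegistry()
first._registry["cleaning___char___normalize"] = object  # illustrative key
second = SingletonRegistry()
print(first is second)  # True
print(len(second))      # 1
```

Because both names refer to the same object, an entry added through one is visible through the other.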
- register(self, key: str, etl: ETLStructure)
Registers the ETL (Extract, Transform, Load) process.
- Parameters:
key (str) – The key used to identify the ETL process. It should follow the format described in the Note below.
etl (ETLStructure) – The ETL process to be registered. It should be a subclass of ETLStructure.
- Raises:
ValueError – If the key is not all lowercase, not separated by ‘___’, or does not have 2 layers of category.
TypeError – If the ETL class is not a subclass of ETLStructure.
KeyError – If the key is already registered.
Note
- The key should be in the format of:
all lowercase
separated by ___
it should have 2 layers of category
Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.
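The key rules in the Note can be checked with a small helper like the one below. This is a sketch under assumptions: `validate_key` is a hypothetical name, and it assumes that "2 layers of category" means exactly three non-empty parts separated by `___`. The registry's own validation may differ in detail.

```python
def validate_key(key: str) -> tuple[str, str, str]:
    """Hypothetical helper: split a key into (category, sub_category, etl_key)."""
    if key != key.lower():
        raise ValueError(f"key must be all lowercase: {key!r}")
    parts = key.split("___")
    # Assumption: exactly three non-empty parts are required.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"key must look like <category>___<sub_category>___<etl_key>: {key!r}"
        )
    return parts[0], parts[1], parts[2]


# The key below is illustrative only.
print(validate_key("cleaning___char___normalize"))
# ('cleaning', 'char', 'normalize')
```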
- search(self, category: str | None = None, sub_category: str | None = None)
Search the ETL.
- Parameters:
category (str, optional) – The category to search for. Defaults to None.
sub_category (str, optional) – The sub-category to search for. Defaults to None.
- Returns:
A dictionary containing the filtered status information.
- Return type:
dict
- Raises:
AssertionError – If category is a list or not a string.
AssertionError – If sub_category is a list or not a string.
ValueError – If sub_category is specified without category.
Note
All information is printed by default.
Set print_sub_category to True to print the sub-category.
Set print_etl_name to True to print the ETL name.
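The category and sub-category filtering can be pictured with the sketch below. It is a simplification under assumptions: `filter_keys` is a hypothetical function that returns matching keys as a list, whereas `search` is documented to return a status dictionary whose exact layout is not shown here. The keys are made up for the example.

```python
def filter_keys(keys, category=None, sub_category=None):
    """Hypothetical filter over '<category>___<sub_category>___<etl_key>' keys."""
    # Mirrors the documented rule: a sub-category needs its parent category.
    if sub_category is not None and category is None:
        raise ValueError("sub_category cannot be specified without category")
    selected = []
    for key in keys:
        cat, sub, _name = key.split("___")
        if category is not None and cat != category:
            continue
        if sub_category is not None and sub != sub_category:
            continue
        selected.append(key)
    return selected


keys = [
    "cleaning___char___normalize",
    "cleaning___html___strip",
    "pii___email___mask",
]
print(filter_keys(keys, "cleaning"))
# ['cleaning___char___normalize', 'cleaning___html___strip']
print(filter_keys(keys, "cleaning", "html"))
# ['cleaning___html___strip']
```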
- get(self, key: str) → ETLStructure
Retrieves the ETLStructure associated with the given key.
- Parameters:
key (str) – The key used to retrieve the ETLStructure. It should follow the format described in the Note below.
- Returns:
The ETLStructure associated with the given key.
- Return type:
ETLStructure
- Raises:
ValueError – If the key is not all lowercase, not separated by ‘___’, or does not have 2 layers of category.
KeyError – If the key is not registered in the registry.
Note
- The key should be in the format of:
all lowercase
separated by ___
it should have 2 layers of category
Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.
- get_all(self)
Gets all the registered ETLs.
- Returns:
A list of all registered ETLs.
- Return type:
list
- reset(self)
Resets the registry.
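The documented behaviour of `register`, `get`, `get_all`, and `reset` can be summarised in a toy registry. This is a sketch, not the library's implementation: `MiniRegistry` and `NormalizeChars` are hypothetical names, the local `ETLStructure` is a stand-in for `etl.registry.ETLStructure`, key-format validation is left out, and `get_all` is assumed to return the registered classes.

```python
class ETLStructure:
    """Stand-in for etl.registry.ETLStructure."""


class MiniRegistry:
    """Hypothetical, non-singleton registry mirroring the documented errors."""

    def __init__(self):
        self._registry = {}

    def register(self, key, etl):
        # Only subclasses of ETLStructure may be registered.
        if not (isinstance(etl, type) and issubclass(etl, ETLStructure)):
            raise TypeError(f"{etl!r} is not a subclass of ETLStructure")
        if key in self._registry:
            raise KeyError(f"{key} is already registered")
        self._registry[key] = etl

    def get(self, key):
        if key not in self._registry:
            raise KeyError(f"{key} is not registered")
        return self._registry[key]

    def get_all(self):
        # Assumption: the list holds the registered classes themselves.
        return list(self._registry.values())

    def reset(self):
        self._registry.clear()

    def __len__(self):
        return len(self._registry)


class NormalizeChars(ETLStructure):
    """Illustrative ETL class."""


registry = MiniRegistry()
registry.register("cleaning___char___normalize", NormalizeChars)
print(registry.get("cleaning___char___normalize") is NormalizeChars)  # True
print(len(registry))  # 1
registry.reset()
print(len(registry))  # 0
```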
- class etl.registry.ETLAutoRegistry(name, bases, attrs)
Bases:
ABCMeta, type
- class etl.registry.BaseETL
Bases:
ETLStructure
Base class for Spark ETL.
This class provides a base structure for implementing Spark ETL processes. If you need to use self directly, inherit from this class.
- run(self, data, *args, **kwargs)
Run the preprocessing logic. This method should be implemented by subclasses.
- __call__(self, *args, **kwargs)
Call the run method to perform the preprocessing.
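The relationship between `run` and `__call__` can be illustrated with the sketch below. The local `BaseETL` is a stand-in written for this example, not the library class, and `DropShortRows` is a hypothetical subclass. A plain Python list is used in place of Spark data to keep the example self-contained.

```python
class BaseETL:
    """Stand-in for etl.registry.BaseETL: __call__ forwards to run."""

    def run(self, data, *args, **kwargs):
        raise NotImplementedError("subclasses must implement run()")

    def __call__(self, *args, **kwargs):
        return self.run(*args, **kwargs)


class DropShortRows(BaseETL):
    """Hypothetical ETL that needs self to read its own configuration."""

    min_chars = 5

    def run(self, data, *args, **kwargs):
        # Keep only rows with at least min_chars characters.
        return [row for row in data if len(row) >= self.min_chars]


etl = DropShortRows()
print(etl(["hi", "hello world", "abc", "spark"]))
# ['hello world', 'spark']
```

Calling the instance is equivalent to calling `etl.run(...)` with the same arguments.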
- etl.registry.register_etl(func)
Decorator to register a function as an ETL.
- Parameters:
func (callable) – The function to be registered as an ETL.
- Returns:
A dynamically created class that inherits from BaseETL and wraps the original function.
- Return type:
type
- Raises:
None.
- About Attributes:
__file_path__ (str): The file path of the function where it is defined.
__etl_dir__ (bool): Whether the file is in the etl directory. If False, the ETL is a dynamically registered, user-defined ETL.
Example
>>> @register_etl
... def my_etl_function():
...     pass
Note
The registered ETL function should not rely on the self parameter.
If you need to use self, inherit directly from the BaseETL class.
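One way such a decorator can turn a function into a class is sketched below. This is an assumption-laden illustration, not the library's code: `register_etl_sketch` and `lowercase` are hypothetical names, the local `BaseETL` is a stand-in, and the sketch neither adds the class to a registry nor sets `__etl_dir__`, since both depend on the package layout.

```python
class BaseETL:
    """Stand-in for etl.registry.BaseETL."""

    def run(self, data, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.run(*args, **kwargs)


def register_etl_sketch(func):
    """Hypothetical decorator: wrap func in a dynamically created BaseETL subclass."""

    def run(self, *args, **kwargs):
        # self is accepted but never passed on, which is why the wrapped
        # function cannot rely on it.
        return func(*args, **kwargs)

    attrs = {
        "run": run,
        "__doc__": func.__doc__,
        # File in which the wrapped function was defined.
        "__file_path__": func.__code__.co_filename,
    }
    # type(name, bases, attrs) builds the new class at runtime.
    return type(func.__name__, (BaseETL,), attrs)


@register_etl_sketch
def lowercase(data):
    """Lowercase every row."""
    return [row.lower() for row in data]


print(lowercase.__name__)              # lowercase
print(issubclass(lowercase, BaseETL))  # True
print(lowercase()(["Foo", "BAR"]))     # ['foo', 'bar']
```

After decoration, the name `lowercase` refers to a class, so it must be instantiated before it is called.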