etl.data_save
Persists processed data to a destination of your choice, such as a data lake or a database.
etl.data_save.aws module
TODO: Data saving to AWS S3
This is not implemented yet.
Copyright (c) 2024-present Upstage Co., Ltd. Apache-2.0 license
etl.data_save.huggingface module
Data saving to Hugging Face Datasets
Hugging Face Datasets supports Spark natively: https://huggingface.co/docs/datasets/use_with_spark
- etl.data_save.huggingface.data_save___huggingface___ufl2hf_hub(self, spark, ufl, hub_path, repartition=1, *args, **kwargs)
TODO: Save data to Hugging Face dataset and upload to hub.
- etl.data_save.huggingface.data_save___huggingface___ufl2hf(self, spark, ufl: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, save_path: str, repartition: int = 1, *args, **kwargs) → str
Save data to HuggingFace dataset and return the path.
- Parameters:
spark (SparkSession) – The Spark session.
ufl (Union[RDD, DataFrame]) – The input data to be saved.
save_path (str) – The path to save the HF dataset.
repartition (int, optional) – The number of partitions to repartition the data. Defaults to 1.
- Raises:
ValueError – If the save_path already exists.
AssertionError – If ufl is not an RDD or DataFrame.
- Returns:
The path where the HuggingFace dataset is saved.
- Return type:
str
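Because the function raises ValueError when save_path already exists, a caller may want to check the target before starting a long Spark job. The following is a minimal sketch of such a pre-flight check using only the standard library. The helper name ensure_new_path is hypothetical and is not part of the etl package; it only mirrors the documented "fail if the path exists" behaviour for local paths.

```python
import os


def ensure_new_path(save_path: str) -> str:
    """Return save_path unchanged, or raise ValueError if it already exists.

    Hypothetical helper: mirrors the documented ValueError behaviour,
    but only for paths visible to the local filesystem.
    """
    if os.path.exists(save_path):
        raise ValueError(f"save_path already exists: {save_path}")
    return save_path


# Usage: the current working directory always exists, so the guard rejects it.
try:
    ensure_new_path(os.getcwd())
except ValueError as err:
    print(f"refusing to overwrite: {err}")
```

Checking up front keeps the failure cheap: the error surfaces before any data is repartitioned or written.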
- etl.data_save.huggingface.data_save___huggingface___ufl2hf_obj(self, spark, ufl: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, repartition: int = 1, *args, **kwargs) → datasets.arrow_dataset.Dataset
Convert data to HuggingFace dataset object.
- Parameters:
spark (SparkSession) – The Spark session.
ufl (Union[RDD, DataFrame]) – The input data to be saved.
repartition (int, optional) – The number of partitions to repartition the data. Defaults to 1.
- Returns:
The HuggingFace dataset object.
- Return type:
Dataset
- Raises:
AssertionError – If the input data is not RDD or DataFrame.
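For orientation, a conversion of this kind can be sketched with the public Dataset.from_spark API from the Hugging Face guide linked in the module description. This is an illustration under stated assumptions, not the package's actual implementation: the function name to_hf_dataset is hypothetical, and the sketch assumes that an RDD input holds Row objects or dicts from which Spark can infer a schema.

```python
def to_hf_dataset(spark, ufl, repartition: int = 1):
    """Sketch: turn a Spark RDD or DataFrame into a datasets.Dataset."""
    # Imported lazily so the definition loads even where Spark is not installed.
    from datasets import Dataset
    from pyspark.rdd import RDD
    from pyspark.sql import DataFrame

    # Mirrors the documented AssertionError for unsupported input types.
    assert isinstance(ufl, (RDD, DataFrame)), "ufl must be an RDD or DataFrame"

    if isinstance(ufl, RDD):
        # Assumption: rows are Row objects or dicts with an inferable schema.
        ufl = spark.createDataFrame(ufl)

    # Repartition first, then let Hugging Face Datasets read the DataFrame.
    return Dataset.from_spark(ufl.repartition(repartition))
```

With a live SparkSession, a call such as to_hf_dataset(spark, spark.range(10)) would be expected to return a Dataset with a single id column.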
etl.data_save.parquet module
Data saving to Parquet
- etl.data_save.parquet.data_save___parquet___ufl2parquet(self, spark, ufl: pyspark.rdd.RDD | pyspark.sql.dataframe.DataFrame, save_path: str, repartition: int = 1, *args, **kwargs) → str
Save data to parquet and return the path.
- Parameters:
spark (SparkSession) – The Spark session.
ufl (Union[RDD, DataFrame]) – The input data to be saved.
save_path (str) – The path to save the Parquet output.
repartition (int, optional) – The number of partitions to repartition the data. Defaults to 1.
- Raises:
ValueError – If the save_path already exists.
- Returns:
The path where the parquet file is saved.
- Return type:
str
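The documented behaviour (fail if the path exists, repartition, write, return the path) can be sketched with the standard PySpark DataFrame writer. This is a hedged illustration rather than the package's implementation: the function name save_as_parquet is hypothetical, it accepts a DataFrame only, and the existence check covers local paths only.

```python
import os


def save_as_parquet(df, save_path: str, repartition: int = 1) -> str:
    """Sketch: write a Spark DataFrame to Parquet and return the path."""
    # Mirrors the documented ValueError; os.path.exists only sees local paths.
    if os.path.exists(save_path):
        raise ValueError(f"save_path already exists: {save_path}")

    # repartition(n) controls how many part files Spark writes.
    df.repartition(repartition).write.parquet(save_path)
    return save_path
```

Note that Spark writes save_path as a directory containing one part file per partition, so repartition=1 yields a single part file inside that directory.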