Getting Started
Documentation
config
config.interface.Config
etl
etl.bias
etl.cleaning.char
etl.cleaning.document
etl.cleaning.html
etl.cleaning.korean
etl.cleaning.length
etl.cleaning.number
etl.cleaning.table
etl.cleaning.unicode
etl.data_ingestion.arrow
etl.data_ingestion.common_crawl
etl.data_ingestion.csv
etl.data_ingestion.cultura_x
etl.data_ingestion.huggingface
etl.data_ingestion.parquet
etl.data_ingestion.red_pajama
etl.data_ingestion.slim_pajama
etl.data_ingestion.test
etl.data_save.aws
etl.data_save.huggingface
etl.data_save.parquet
etl.decontamination
etl.deduplication.common_crawl
etl.deduplication.exact
etl.deduplication.minhash
etl.deduplication.polyglot
etl.pii.card
etl.pii.nin
etl.quality.language
etl.toxicity
etl.utils.log
etl.utils.sampling
etl.utils.statistics