The FASTA File System (FASTAFS) file format consists of four blocks, (1) File Header, (2) Per-Sequence-Data, (3) Per-Sequence-Header and (4) File Metadata, which together store sequence data and metadata efficiently (Fig. 1). During conversion, the metadata flag marks the archive's status as incomplete. Each block of compressed sequence data is followed by an MD5 checksum that is compatible with the CRAM format and the BAM specification [14, 15]. In the last phase of file conversion, the file pointers are put in place and the metadata flag is updated to mark the archive's conversion status as complete. The file ends with a CRC32 checksum used for whole-file integrity verification.
The sequence compressor Nucleotide Archival Format (NAF) [10] first compresses sequence data with a 4-bit encoding and then applies the generic compressor Zstandard (zstd), but it lacks random access. Given that NAF achieves high compression ratios [10], FASTAFS was designed in a similar fashion: it first compresses sequence data to a lower-bit encoding and then applies the random-access implementation of zstd (zstd-seekable). A sequence consisting only of AC[T|U]GN is stored in a 2-bit encoding, a sequence that follows the IUPAC DNA/RNA dictionary is stored in a 4-bit encoding, and a protein sequence is stored in a 5-bit encoding (Fig. 1).
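The choice between encodings and the idea of packing four nucleotides per byte can be illustrated with a minimal Python sketch. It is not the FASTAFS implementation: the alphabets, the code assigned to each nucleotide, the bit order, and the decision to record N positions in a separate list (comparable to the N-blocks of UCSC TwoBit) are all assumptions made for illustration, as are the function names.

```python
# Assumed alphabets; the exact dictionaries used by FASTAFS may differ.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")

# Hypothetical 2-bit codes; the real on-disk mapping is not given in the text.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def bits_per_symbol(sequence):
    """Pick the narrowest encoding (2, 4 or 5 bits) that fits the sequence."""
    symbols = set(sequence.upper())
    if symbols <= set("ACGTN") or symbols <= set("ACGUN"):
        return 2
    if symbols <= IUPAC_NUCLEOTIDES:
        return 4
    return 5  # anything else is treated as protein in this sketch


def pack_2bit(sequence):
    """Pack an AC[T|U]GN sequence, four symbols per byte.

    N does not fit in 2 bits, so its positions are returned separately
    (an assumption of this sketch) and a placeholder code 0 is packed.
    """
    packed = bytearray()
    n_positions = []
    current = 0
    for i, base in enumerate(sequence.upper()):
        if base == "N":
            n_positions.append(i)
            code = 0
        else:
            code = CODES[base]
        current = (current << 2) | code
        if i % 4 == 3:  # a full byte has been collected
            packed.append(current)
            current = 0
    remainder = len(sequence) % 4
    if remainder:  # left-align the final, partially filled byte
        packed.append(current << (2 * (4 - remainder)))
    return bytes(packed), n_positions
```

In this sketch a sequence of n nucleotides occupies ceil(n / 4) bytes before zstd is applied, which is a quarter of the size of the one-byte-per-character FASTA representation.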
Toolkit
The Linux-based FASTAFS toolkit is a single executable (fastafs) with different subcommands. The package also comes with an executable, ‘mount.fastafs’, for mounting through the /etc/fstab table.
Cache: FASTA files can be converted to a FASTAFS archive using the ‘fastafs cache’ subcommand (Fig. 1), which adds a reference to the FASTAFS file to a config file (Additional file 1: Fig. S1A).
Mount: The ‘fastafs mount’ subcommand is used to mount a FASTAFS archive to a directory (mount point) to virtualise the FASTA, fai-index, dict and UCSC TwoBit files (Additional file 1: Fig. S1A). All files are mounted read-only, which guarantees that the identifiers remain persistent. Mount points can be configured in /etc/fstab, which requires using the binary ‘mount.fastafs’ instead of the binary ‘fastafs’. These entries can be configured to mount automatically during boot. Upon a file request, the kernel asks the FASTAFS toolkit, through Filesystem in Userspace (FUSE), either to provide file attributes such as timestamps, size or permissions, or to copy file content, decompressed in real time, into a buffer.
In addition, FASTAFS provides filesystem access to query partial sequences using a subsequence identifier as the filename in the ‘seq’ subdirectory. For example, the file <mountpoint>/seq/chr1:10-20 contains only the 10th up to the 20th nucleotide of chr1, without additional characters such as newlines or spaces. Likewise, requesting the file size of <mountpoint>/seq/chr1 will provide its size in nucleotides. These additional features do not solve backwards-compatibility issues, but they do provide virtualised random access by functioning as a programming-language-independent API implemented at the filesystem level.
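Because this interface consists of ordinary files, any language with file I/O can use it. The following Python sketch shows how a client might build the subsequence filename described above and read it. The helper names are hypothetical, and the coordinates are passed through unchanged, so the convention is whatever the mounted filesystem defines.

```python
import os


def subsequence_path(mountpoint, name, start, end):
    """Build <mountpoint>/seq/<name>:<start>-<end> as described in the text."""
    return os.path.join(mountpoint, "seq", f"{name}:{start}-{end}")


def read_subsequence(mountpoint, name, start, end):
    """Read a partial sequence; the file holds no newlines or spaces."""
    path = subsequence_path(mountpoint, name, start, end)
    with open(path, "r", encoding="ascii") as handle:
        return handle.read()


def sequence_length(mountpoint, name):
    """The file size of <mountpoint>/seq/<name> is the length in nucleotides."""
    return os.stat(os.path.join(mountpoint, "seq", name)).st_size
```

With an archive mounted at a hypothetical mount point such as /mnt/fastafs/hg19, a call like read_subsequence("/mnt/fastafs/hg19", "chr1", 10, 20) would return the requested nucleotides as a plain string, without any FASTA parsing on the client side.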
List: The ‘fastafs list’ command gives an overview of the FASTAFS archives, their alias, number of sequences, format, compression ratio and all active mount points (Additional file 1: Fig. S1A).
View: Besides mounting, the FASTA contents can be decompressed to stdout using ‘fastafs view’, of which the padding can be set to a desired value and masking can be virtually disabled. If the archive contains only DNA sequences, the contents can also be exported to the UCSC TwoBit format (Additional file 1: Fig. S1B).
Info: The ‘fastafs info’ subcommand gives information about the file layout, sequence size, the per-sequence MD5 checksum and the compression type used. This subcommand can also query the European Nucleotide Archive (ENA) [16] to verify whether the MD5 checksum of a sequence exists in that archive (Additional file 1: Fig. S1C).
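A per-sequence MD5 checksum in the style of the CRAM and SAM/BAM specifications can be sketched in a few lines of Python. The normalisation shown here (dropping characters outside the printable ASCII range 33 to 126 and converting to uppercase before hashing) follows the reference MD5 convention of the CRAM specification. That FASTAFS applies exactly these steps is an assumption of this sketch, and the function name is hypothetical.

```python
import hashlib


def sequence_md5(sequence):
    """MD5 of a sequence after CRAM-style normalisation.

    Whitespace and other characters outside ASCII 33..126 are removed and
    the remainder is uppercased, so line wrapping and soft-masking do not
    change the checksum.
    """
    cleaned = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()
```

Under this normalisation the same sequence yields the same checksum regardless of the line width or masking of the FASTA file it came from, which is what makes the checksum usable as a lookup key.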
Check: The ‘fastafs check’ command checks the file integrity using a CRC32 checksum. Integrity of compressed sequence data blocks can be checked separately using their MD5 checksums with the ‘--md5’ argument (Additional file 1: Fig. S1D).
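The whole-file check can be illustrated with a minimal Python sketch. It assumes that the last four bytes of the file hold the CRC32 of all preceding bytes in big-endian order. The text only states that the file ends with a CRC32 checksum, so the byte order and the exact range covered are assumptions here, and the function name is hypothetical.

```python
import zlib


def verify_trailing_crc32(data):
    """Return True if the trailing 4 bytes match the CRC32 of the rest.

    Assumes a big-endian checksum covering every byte before it; both
    points are assumptions of this sketch, not taken from the format.
    """
    if len(data) < 4:
        return False
    payload, stored = data[:-4], data[-4:]
    computed = zlib.crc32(payload) & 0xFFFFFFFF
    return computed == int.from_bytes(stored, "big")
```

A CRC32 is cheap to compute over a large archive and detects accidental corruption, whereas the per-block MD5 checksums narrow a failure down to an individual sequence.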
ps: A list of active FASTAFS mount points and their processes is provided by the ‘fastafs ps’ subcommand. The mount point has an extended file attribute (xattr) named ‘FASTAFS-file’ that returns the path to the mounted FASTAFS archive. When a FASTAFS file is mounted to multiple mount points, each is listed as a separate entry with the corresponding system process ID (Additional file 1: Fig. S1E).