Performance and usability
We defined four requirements for VGE in the “Goal and requirements” section. The first requirement (non-privileged server program) was addressed in the “Algorithms” and “Results” sections, so here we focus on the remaining three.
The second requirement is handling multiple tasks. Figure 3a shows a short excerpt of the job-submission script used in the “Simultaneous analyses of many samples” section. The fourteen samples were named Data 0 to Data 13, and the tasks for each sample, FASTQ data division and BWA alignment, were written in simple_pipeline.py.
Each line starting with “simple_pipeline.py” corresponds to the analysis of one sample; in this example, fourteen sample jobs were submitted to VGE independently. In this way, VGE accepts multiple job submissions at once, and different pipelines can of course also be submitted simultaneously.
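For illustration, the following sketch launches the fourteen pipeline runs as independent processes, which is the effect of the job script in Fig. 3a; the invocation style and sample-name format here are assumptions, not the actual script.

```python
# Minimal sketch of a job script like Fig. 3a (assumed invocation style):
# each sample is analyzed by an independent run of simple_pipeline.py,
# so fourteen jobs reach VGE at once.
import subprocess

procs = [subprocess.Popen(["python", "simple_pipeline.py", f"Data{i}"])
         for i in range(14)]  # Data0 .. Data13, launched concurrently
for p in procs:
    p.wait()  # wait for every sample pipeline to finish
```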
Here, we examine how jobs were assigned to, and kept filling, the workers. We used the same pipeline and data as in the “Simultaneous analyses of many samples” section; the only difference was the number of workers. In this case, we used 5,000 workers, far fewer than the total number of jobs, so each worker had to execute assigned jobs many times.
Figure 4 shows how the workers executed the jobs over time. In the first twenty minutes, only a few workers performed jobs while the majority stayed idle, because each of the fourteen pipelines had submitted fastq_splitter, which contains only two jobs. This indicates that VGE successfully handled the dependencies between tasks.
After this task, the FASTQ files had been split into thousands of files, which were then aligned with BWA by all the workers. Because the number of split files greatly exceeded the number of workers, all workers stayed busy. The figure shows that the assignment was tightly packed; thus, VGE's job management was highly effective in a real case.
The third requirement is dependency control among tasks. Python is an interpreted language that executes statements sequentially. Exploiting this property, VGE controls task dependencies simply through the order in which tasks are written in a script.
Figure 3 shows a short excerpt of the simple_pipeline.py used in the “Simultaneous analyses of many samples” section. It consists of two tasks: the division of the input FASTQ files (fastq_splitter) and the alignment of the resulting files with BWA (bwa_align). Here, bwa_align must wait for the fastq_splitter task to complete.
As described in the “Algorithms” section, jobs are submitted to VGE via the vge_task() function. As Fig. 3 shows, simple_pipeline.py contains two such calls: the first for fastq_splitter and the second for bwa_align.
The vge_task() call on the eighth line of Fig. 3 handles the fastq_splitter task and returns only after the VGE workers have finished dividing the FASTQ files. Therefore, the later vge_task() call, which corresponds to bwa_align, is not submitted to VGE until the first task has completed. In this manner, dependencies among tasks are controlled by the order in which the tasks are written in the script.
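As a sketch of this behavior (the import path and argument values below are assumptions; the actual code is shown in Fig. 3):

```python
# Hedged sketch of the two tasks in simple_pipeline.py (cf. Fig. 3).
from vge import vge_task  # import path is an assumption

# Step 1: split the input FASTQ files. vge_task() returns only after
# all array jobs of fastq_splitter have finished on the VGE workers.
vge_task(COMMAND="fastq_splitter", MAX_TASK=2, BASENAME_FOR_OUTPUT="split")

# Step 2: reached only after the call above returns, so bwa_align is
# never submitted to VGE before the split files exist.
vge_task(COMMAND="bwa_align", MAX_TASK=5000, BASENAME_FOR_OUTPUT="align")
```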
The final requirement is friendliness to medical and bioinformatics researchers. Pipelines using VGE consist of three parts: one describing the concrete contents of the tasks (hereinafter, the command-script), one denoting the flow of the pipeline (the pipeline-script), and one submitting tasks to VGE (the job-script) (Fig. 3).
The command-script and the pipeline-script can be written either in the same file or in separate files. The command-script can also be written as a shell script, so legacy scripts can be reused on VGE. Developing scripts from scratch is equally possible, as researchers in this field are familiar with coding in Python.
The pipeline-script, on the other hand, must be written in Python, but only vge_task() needs to be used. vge_task() takes three arguments: COMMAND, MAX_TASK, and BASENAME_FOR_OUTPUT. These indicate, respectively, the task name defined in the command-script, the number of workers necessary for the task (that is, the number of array jobs), and a unique ID (an arbitrary string) used for log files. The value assignments are straightforward, as shown in Fig. 3 (lines 5-7).
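An annotated call with example values illustrates the role of each argument (the keyword style follows the sketch above and is an assumption):

```python
vge_task(
    COMMAND="bwa_align",            # task name defined in the command-script
    MAX_TASK=5000,                  # number of array jobs, i.e., workers needed
    BASENAME_FOR_OUTPUT="bwa_log",  # arbitrary unique string used for log files
)
```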
As discussed, VGE is straightforward, user-friendly software.
Issues associated with distributed file systems
In the test described in the “Simultaneous analyses of many samples” section, we encountered an unexpectedly severe problem. The general behavior of a job using VGE is shown in Fig. 5. There are two intervals before the pipeline starts: the initialization of the system, such as setting environment variables (a), and the initialization of MPI and VGE (b). Both intervals usually take from a few seconds to a minute, but their total exceeded 2 hours in the first trial.
This problem was first observed in this large-scale computation test and had never appeared in smaller-scale tests, such as those on two thousand nodes; therefore, VGE itself was not the main cause. We carefully investigated the cause together with the K computer operating team. The investigation showed that the two initialization intervals ((a) and (b)) each took about 1 hour. During the system initialization interval (a), the operating system and the JMS perform various processes such as node assignment, and we found that the file system became overloaded.
In this study, we mainly used the K computer, one of the largest supercomputers in the world. Naturally, it is equipped with very large storage, over 30 PB, which traditional storage systems cannot handle. The K computer employs the Fujitsu Exabyte File System (FEFS) [20], which is based on Lustre [21].
Lustre is a distributed file system. Lustre-family file systems consist of three parts: physical disks, object storage servers (OSSs), and metadata servers (MDSs). Thousands of physical disks and OSSs are used in a Lustre-family system, but the number of MDSs is usually small; therefore, the MDSs can become a bottleneck.
The investigation revealed that, at VGE launch, the job sent far too many requests (e.g., file creation and removal) to the MDSs of the FEFS: over 20,000 per second, against an acceptable MDS request rate of 1,300 per second. The requests were caused by the creation of a log file for each worker; every VGE worker writes the received task information and its status to its own log file. The number of these files is proportional to the number of workers, which is why the problem did not appear in the earlier tests. To avoid it, we created the log files for all VGE workers from a single process before MPI launched VGE.
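A minimal sketch of this workaround, with illustrative paths and naming (not VGE's actual conventions):

```python
# Pre-create every per-worker log file from a single serial process
# before MPI launches VGE, so the metadata servers see one client
# instead of thousands issuing file-creation requests simultaneously.
import os

NUM_WORKERS = 5000     # number of VGE workers to be launched
LOG_DIR = "vge_logs"   # assumed log directory

os.makedirs(LOG_DIR, exist_ok=True)
for worker_id in range(NUM_WORKERS):
    # Touch the file once; each worker later only appends to its own log.
    open(os.path.join(LOG_DIR, f"worker_{worker_id:05d}.log"), "a").close()
```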
Figure 6 shows the structure of the FEFS in the K computer. Its unique characteristic is that the whole file system consists of two layers, a global file system (GFS) and a local file system (LFS), each complete as an independent file system. Programs and data required for a job are sent from the GFS to the LFS by the job management system (the staging function) through the data transfer network. Thus, user jobs are not affected by other miscellaneous activity on the login nodes.
The initialization time of MPI and VGE (b) was also related to the MDSs. During this period, the system performs MPI startup and loads the Python modules that VGE imports. Because every MPI process loads the Python module files independently, the requests to the MDSs become very high in large-scale tests. This problem is well known in the dynamic-library-loading research field [22], and tools that improve library-loading performance can mitigate it. In this study, we placed the main Python system and all modules on the local disks of all compute nodes. Since the VGE master and workers no longer accessed the Python files on the FEFS, the number of MDS accesses during this initialization interval was reduced.
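Conceptually, the staging amounts to resolving imports from a node-local path; a minimal sketch, with an illustrative path:

```python
# Prefer a node-local copy of the Python modules so that imports resolve
# on the local disk instead of sending requests to the FEFS MDSs.
# "/local/python/site-packages" is an illustrative node-local path.
import sys

sys.path.insert(0, "/local/python/site-packages")  # staged before the job
# Subsequent imports (e.g., the modules VGE needs) now read local files.
```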
These issues occurred in a large-scale computation analyzing multiple samples, which may seem a very particular situation. However, they can occur in any type of bioinformatics analysis. As described before, the data size of a single sample has grown considerably and already exceeds 5 TB in state-of-the-art studies. In such cases, typical sequence-analysis protocols pose potential risks to file systems.