To address compatibility problems among next-generation sequencing (NGS) data formats, we developed NGSML, a new XML-based data format that can represent most types of NGS data.
NGSML is based on the Extensible Markup Language (XML), which is widely used for data storage on the Internet as well as in fields such as mathematics and biology. Several well-known XML-based formats serve specific purposes in biology. For example, SBML is widely used in systems biology, CellML is an XML-based language for describing models of cellular processes, and SED-ML is an XML-based language for describing simulation experiments. NGSML was designed to describe the data produced by NGS technology. It draws on the strengths of other formats such as SBML, MIAME, SAM, and BCML, and extends them with functions of its own. The different types of information used in NGS, such as alignment, assembly, and annotation information, are integrated into NGSML. Because XML is highly extensible, NGSML is easy to extend with new features. Figure 1 shows the functions of NGSML.
NGSMLEditor is designed for creating and editing NGSML files. It has a user-friendly GUI and can also be run from the command line, which should make working with NGSML files much easier. Figure 2 shows the user interfaces of NGSMLEditor.
NGSML is suitable as a general format for representing, storing, and exchanging NGS data. It is a new data format description language for the NGS field that exploits the advantages of other XML-based formats and overcomes their shortcomings as far as possible. It has the following advantages.
First, NGSML uses a component structure where the sequence and description information is divided into three parts to make the structure of the format clear and to facilitate future expansion.
Second, NGSML introduces the concept of references, a technique widely employed in computer science, into the biological data format. In the NGSML format, different sequence records can refer to the same sequence or quality score if the content is the same or similar. This avoids storing duplicate content.
Third, NGSML exploits the merits of the most popular NGS data formats and can therefore be used to store most biological sequence information. In addition, NGSML inherits the flexibility and extensibility of XML. Because NGS technology is developing rapidly, new concepts and analysis tools emerge constantly, and it is difficult to adapt older data formats to current needs. The extensibility of NGSML overcomes the limitations of these purpose-specific data formats, and its flexibility can accommodate future developments.
Finally, NGSML is easy for computer programs to process and easy for humans to read. This advantage is attributable to the tree structure characteristic of XML.
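The reference idea described above can be illustrated with a minimal Python sketch. It is not the NGSML schema: the element and attribute names (ngsData, sequences, qualities, reads, seqRef, qualRef) and the helper functions build_document and resolve are all hypothetical, chosen only to show how identical sequences and quality strings can be stored once and shared by reference. It also handles only the identical-content case, not the similar-content case.

```python
import xml.etree.ElementTree as ET


def build_document(reads):
    """Build an XML tree that stores each distinct sequence and quality string once.

    `reads` is an iterable of (name, sequence, quality) tuples.
    All tag and attribute names are hypothetical, not the real NGSML schema.
    """
    root = ET.Element("ngsData")
    seq_pool = ET.SubElement(root, "sequences")
    qual_pool = ET.SubElement(root, "qualities")
    read_list = ET.SubElement(root, "reads")

    seq_ids = {}   # sequence string -> id of the stored copy
    qual_ids = {}  # quality string -> id of the stored copy

    for name, seq, qual in reads:
        if seq not in seq_ids:
            seq_ids[seq] = f"s{len(seq_ids) + 1}"
            ET.SubElement(seq_pool, "sequence", id=seq_ids[seq]).text = seq
        if qual not in qual_ids:
            qual_ids[qual] = f"q{len(qual_ids) + 1}"
            ET.SubElement(qual_pool, "quality", id=qual_ids[qual]).text = qual
        # The read itself holds only references to the shared content.
        ET.SubElement(read_list, "read", name=name,
                      seqRef=seq_ids[seq], qualRef=qual_ids[qual])
    return root


def resolve(root, read_name):
    """Follow a read's references and return its (sequence, quality) pair."""
    seqs = {n.get("id"): n.text for n in root.iter("sequence")}
    quals = {n.get("id"): n.text for n in root.iter("quality")}
    for read in root.iter("read"):
        if read.get("name") == read_name:
            return seqs[read.get("seqRef")], quals[read.get("qualRef")]
    raise KeyError(read_name)


# Made-up example data: r1 and r2 share a sequence, all three share a quality string.
example_reads = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "IIII"),
    ("r3", "TTGA", "IIII"),
]
example_root = build_document(example_reads)
```

In this sketch the three reads produce only two stored sequences and one stored quality string, and each read can still be fully reconstructed by following its references.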
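As a rough illustration of why a tree-structured XML format is easy to process, the following Python sketch parses a small document with the standard library and walks it recursively. The sample document, its tag names, and the outline helper are hypothetical and do not reproduce the actual NGSML schema.

```python
import xml.etree.ElementTree as ET

# A made-up document; the tags are illustrative only, not real NGSML.
SAMPLE = """<ngsData>
  <description><platform>Illumina</platform></description>
  <reads>
    <read name="r1">ACGT</read>
    <read name="r2">TTGA</read>
  </reads>
</ngsData>"""


def outline(element, depth=0):
    """Return the tree as a list of indented lines, one per element."""
    label = element.tag
    text = (element.text or "").strip()
    if text:
        label += f": {text}"
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)

# Any element type can be selected by tag, regardless of where it sits in the tree.
read_names = [read.get("name") for read in root.iter("read")]

if __name__ == "__main__":
    print("\n".join(outline(root)))
```

The same nesting that lets a program select every read with one call also gives a human reader an indented outline of the file.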
The program and datasets are free to use. For any questions, please do not hesitate to contact us:
Center for Systems Biology (CSB), Soochow University; No.1 Shizi Street, Suzhou, Jiangsu, China
E-mail: bairong.shen@scu.edu.cn, yucj@siso.edu.cn.
Copyright © 2020 CSB. All Rights Reserved.