Tuberculosis (TB) is usually a chronic, slowly progressing disease that frequently remains undiagnosed for many years. One-third of the world's population is thought to be infected, and in 2010 there were around 9 million new active cases of TB. It is the second-highest cause of death from an infectious disease worldwide, after HIV, and the biggest killer of people infected with HIV. The rapid evolution of drug-resistant strains is threatening to make TB incurable.
To control the progression of this disease, we need to define risk factors for transmission, which in turn requires detailed clinical and socio-demographic information. In scenarios of intense transmission, it is essential to identify the source patient in order to prevent activation of recent infections. On the other hand, in communities where transmission is rare, the main goal is to identify people who are latently infected, since most disease cases are a consequence of reactivated latent infection [3, 4].
Another question that remains unanswered is whether specific characteristics are features of individual strains or of broader strain lineages. Defining the nature of diversity in M. tuberculosis offers an ideal starting point for evaluating the clinical implications of such diversity [4–6]. The properties required to address bacterial diversity are unlikely to be met by a single marker. Since standard sequence-based genotyping, such as multilocus sequence typing (MLST), is not applicable in these bacteria, non-sequence-based tools such as Variable Number Tandem Repeat (VNTR) based techniques have become the gold standard for routine genotyping and have been successfully applied to answer a variety of epidemiological questions [2, 7–10]. While the significance of deep phylogenetic information for molecular epidemiology is yet to be established, unequivocal classification of bacterial strains is essential if phenotypic associations are to be unveiled [6, 7]. One way to address this problem is to combine different typing methods in order to take full advantage of their combined results. IS6110 RFLP, MIRU-VNTR and spoligotyping are methods that can be used for epidemiological purposes but, unlike SNPs, they do not provide a robust phylogenetic picture [11, 12].
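The idea of combining typing methods can be illustrated with a minimal Python sketch, in which two isolates are placed in the same cluster only if both their spoligotype and their MIRU-VNTR profile are identical. The record layout, the function name `cluster_by_combined_genotype` and the toy genotype values are assumptions made for illustration; they are not taken from any specific database or from inTB itself.

```python
from collections import defaultdict


def cluster_by_combined_genotype(isolates):
    """Group isolate IDs whose spoligotype AND MIRU-VNTR profile both match.

    `isolates` is a list of dicts with the hypothetical keys
    "id", "spoligotype" (string) and "miru_vntr" (list of repeat counts).
    Only groups with more than one isolate are returned as clusters.
    """
    groups = defaultdict(list)
    for isolate in isolates:
        # The combined key is stricter than either marker on its own.
        key = (isolate["spoligotype"], tuple(isolate["miru_vntr"]))
        groups[key].append(isolate["id"])
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Toy values for illustration only; these are not real genotype patterns.
toy_isolates = [
    {"id": "A", "spoligotype": "pattern-1", "miru_vntr": [2, 3, 2, 4]},
    {"id": "B", "spoligotype": "pattern-1", "miru_vntr": [2, 3, 2, 4]},
    {"id": "C", "spoligotype": "pattern-1", "miru_vntr": [2, 3, 2, 5]},
    {"id": "D", "spoligotype": "pattern-2", "miru_vntr": [2, 3, 2, 4]},
]

clusters = cluster_by_combined_genotype(toy_isolates)
print(clusters)
```

In this toy data, isolates A, B and C share a spoligotype, but only A and B also share a MIRU-VNTR profile, so the combined key splits a group that a single marker would have reported as one cluster.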
Addressing these questions requires an integrated framework capable of linking clinical and socio-demographic data with molecular data. This framework should be able to read sequence data from bacterial isolates, identify global patterns and automatically classify strains into families [4, 13]. Currently there are a few excellent public databases and web tools focused on tuberculosis. SpolDB4 provides a clear picture of the current M. tuberculosis complex genome diversity, through spoligotypes, with around 2000 sequences representative of several regions of the world. Nevertheless, it is not possible to correctly define the phylogenetic relationship of different strains through spoligotypes alone. MIRU-VNTRplus and SITVIT are broader than SpolDB4; they allow users to analyze and compare genotypes based on several methods: spoligotype, MIRU-VNTR, LSP, SNP or a combination of these markers. Although these databases contain information about drug sensitivity, little or no clinical data is available, nor can it be uploaded, and without this information it is not possible to address the questions raised above.
Other existing approaches, not specific to tuberculosis, allow users to upload and analyze their data, such as MLST [17, 18]. MLST is used by public health laboratories and researchers to query nucleotide data against databases over the Internet, but this system lacks clinical and/or socio-demographic information and does not provide any tools to analyze the data. Other systems have been designed for local installation, such as EpiPATH, developed as a generic framework for managing clinical and molecular data from infectious diseases. However, EpiPATH lacks any analysis tools and requires customization through programming before it can be used for a complex disease such as tuberculosis, with multiple typing methods and complex clinical data. Finally, generic systems like Bionumerics by Applied Maths NV are widely used as data management and analysis tools, but they are commercial and costly.
While all the databases and platforms described above have their merits, none provides a means to locally integrate and analyze the complexity of tuberculosis within the context of a research, public health or clinical unit. In this work we describe a novel integrative framework, inTB, developed to fill this gap. It is a free, locally installable, customizable data management and analysis system for Mycobacterium disease, aimed at research laboratories, public health authorities and, potentially, the clinical setting. inTB integrates different types of molecular data with clinical and socio-demographic information, and provides pre-defined data analysis and reporting tools. Adoption of this system ensures data consistency, through validation mechanisms, and data reusability, through the provided analysis tools. inTB contrasts with existing dedicated databases and tools (see above) by providing local data management and analysis. It thus addresses privacy and confidentiality concerns by providing easy-to-use packages for local installation and use, without requiring that sensitive information be transmitted over the Internet. Furthermore, inTB brings to the fore extensive clinical and socio-demographic data that can be analyzed together with genotypic information and, should the user wish to do so, it is simple to expand to include more variables. inTB was designed bearing in mind both the needs of our collaborators at the National Tuberculosis Program in Portugal (Programa Nacional de Luta Contra a Tuberculose), a national public health authority, and our own needs as research laboratories investigating the molecular epidemiology of M. tuberculosis.
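To give a concrete sense of what a validation mechanism for genotyping data might check, the following Python sketch tests two general properties of the markers discussed above: a binary spoligotype covers 43 spacers, and MIRU-VNTR typing is commonly performed on sets of 12, 15 or 24 loci. This is only an assumed illustration; the function `validate_isolate`, the record keys and the error messages are hypothetical and do not describe the validation rules actually implemented in inTB.

```python
import re

# A binary spoligotype records presence (1) or absence (0) of 43 spacers.
SPOLIGOTYPE_BINARY = re.compile(r"[01]{43}")
# Locus-set sizes commonly used for MIRU-VNTR typing.
MIRU_VNTR_LOCUS_SETS = {12, 15, 24}


def validate_isolate(record):
    """Return a list of human-readable problems found in one isolate record.

    `record` is a dict with the hypothetical keys "spoligotype" (string)
    and "miru_vntr" (list of repeat counts). An empty list means the
    record passed every check in this sketch.
    """
    errors = []

    spoligotype = record.get("spoligotype")
    if not (isinstance(spoligotype, str)
            and SPOLIGOTYPE_BINARY.fullmatch(spoligotype)):
        errors.append("spoligotype must be a string of 43 characters, each 0 or 1")

    miru_vntr = record.get("miru_vntr") or []
    if len(miru_vntr) not in MIRU_VNTR_LOCUS_SETS:
        errors.append("MIRU-VNTR profile must contain 12, 15 or 24 loci")
    if any(not isinstance(count, int) or count < 0 for count in miru_vntr):
        errors.append("MIRU-VNTR repeat counts must be non-negative integers")

    return errors


# Toy records for illustration only.
good_record = {"spoligotype": "1" * 43, "miru_vntr": [2] * 24}
bad_record = {"spoligotype": "10x", "miru_vntr": [1, 2, 3]}

good_errors = validate_isolate(good_record)
bad_errors = validate_isolate(bad_record)
print(good_errors)
print(bad_errors)
```

Returning a list of problems, rather than raising on the first one, lets a data-entry form report every issue with a record at once.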