Welcome to the second blog in the 10 part Cheminformatics for Beginners series. In the last blog, we saw a quick introduction to the SMILES notation. This post will introduce you to the mol2 format in cheminformatics.
One major setback from the SMILES notation was that it could not provide the positions of each atom in space, typically with X, Y, and Z cartesian coordinates. Mol2 file solves this problem. It is a plain text tabular format that represents a single or multiple chemical compounds and holds atomic coordinates, chemical bond information, and metadata of a molecule.
Anatomy of a mol2 file
A mol2 file consists of multiple sections with each section containing a part of the information about the molecule. For example, the section
@<TRIPOS>ATOM contains atomic coordinates, and the section
@<TRIPOS>BOND contains information about how atoms are connected
There are 32 sections in the file but for this tutorial, we’ll restrict ourselves to only the 3 most important sections, namely:
@<TRIPOS>BOND. Let’s see a sample mol2 file for benzene and then jump into each of the sections in detail.
NOTE: Each line in the following example is numbered starting from 1 to end at 38. These numbers are not a part of the mol2 file but rather for easy reference in the following discussion.
1 # Name: benzene 2 # Creating user name: ChemicBook 3 # Creation time: Sat Mar 20 00:18:30 2021 4 5 @<TRIPOS>MOLECULE 6 benzene 7 12 12 0 0 0 8 SMALL 9 NO_CHARGES 10 **** 11 Any comment about the molecule goes here. No need for a pound sign here 12 13 @<TRIPOS>ATOM 14 1 C -0.7600 1.1691 -0.0005 C.ar 1 BENZENE 0.000 15 2 C 0.6329 1.2447 -0.0012 C.ar 1 BENZENE 0.000 16 3 C 1.3947 0.0765 0.0004 C.ar 1 BENZENE 0.000 17 4 C 0.7641 -1.1677 0.0027 C.ar 1 BENZENE 0.000 18 5 C -0.6288 -1.2432 0.0001 C.ar 1 BENZENE 0.000 19 6 C -1.3907 -0.0751 -0.0015 C.ar 1 BENZENE 0.000 20 7 H -1.3536 2.0792 0.0005 H 1 BENZENE 0.000 21 8 H 1.1243 2.2140 -0.0028 H 1 BENZENE 0.000 22 9 H 2.4799 0.1355 -0.0000 H 1 BENZENE 0.000 23 10 H 1.3576 -2.0778 0.0063 H 1 BENZENE 0.000 24 11 H -1.1202 -2.2126 -0.0005 H 1 BENZENE 0.000 25 12 H -2.4759 -0.1340 -0.0035 H 1 BENZENE 0.000 26 @<TRIPOS>BOND 27 1 1 2 ar 28 2 2 3 ar 29 3 3 4 ar 30 4 4 5 ar 31 5 5 6 ar 32 6 1 6 ar 33 7 1 7 1 34 8 2 8 1 35 9 3 9 1 36 10 4 10 1 37 11 5 11 1 38 12 6 12 1
Each data record associated with this section consists of six data lines:
- The first data line is the name of the molecule(line 6)
- The second data line contains the number of atoms, bonds, substructures, features, and sets associated with the molecule(line 7)
- The third data line is the molecule type(line 8). The supported types are
- The fourth data line tells the type of charges associated with the molecule(line 9). The supported types are
- The fifth data line contains the internal SYBYL status bits associated with the molecule(line 10). These should never be set by the user. Valid status bits are
ref_angle. This is optional.
- The last data line contains any comment which may be associated with the molecule(line 11)
mol_name num_atoms [num_bonds [num_subst [num_feat[num_sets]]]] mol_type charge_type [status_bits [mol_comment]]
The square bracket notation tells us that the inner value is present only when the outer value is provided. For example,
mol_comment is provided only when
status_bits is also provided.
Both the following examples are valid for Methane:
@<TRIPOS>MOLECULE @<TRIPOS>MOLECULE Methane Methane 5 4 0 0 0 5 4 SMALL SMALL NO_CHARGES NO_CHARGES **** Some comment about Methane
Each data record associated with this section consists of a single data line. This data line contains all the information necessary to reconstruct one atom contained within the molecule. The atom ID numbers associated with the atoms in the molecule will be assigned sequentially when the .mol2 file is read into a mol2 parser software.
atom_id atom_name x y z atom_type [subst_id[subst_name [charge [status_bit]]]]
- atom_id (integer): The ID number of the atom at the time the file was created. This is provided for reference only and is not used when the .mol2 file is read into any mol2 parser software
- atom_name (string): The name of the atom
- x (real): The x coordinate of the atom
- y (real): The y coordinate of the atom
- z (real): The z coordinate of the atom
- atom_type (string): The SYBYL atom type for the atom
- subst_id (integer): The ID number of the substructure containing the atom
- subst_name (string): The name of the substructure containing the atom
- charge (real): The charge associated with the atom
- status_bit (string): The internal SYBYL status bits associated with the atom. These should never be set by the user. Valid status bits are
For any atom C, both the following configurations are correct:
@<TRIPOS>ATOM 1 C 1.0817 0.0395 -0.0687 C.1 1 LIG1 0.000 BACKBONE|DICT|DIRECT
@<TRIPOS>ATOM 1 C 1.0817 0.0395 -0.0687 C.1
In the first example, the atom has ID number
1 and is named
C with XYZ-coordinates as
(1.0817, 0.0395, -0.0687). Its atom type is
C.3(sp2 hybridization). It belongs to the substructure with ID
1 which is named
LIG1. The charge associated with the atom is
0.000 and the SYBYL status bits associated with the atom are
DIRECT. The second example is the minimal information necessary for the MOL2 command to create an atom.
Similar to the previous section, each data record associated here consists of a single data line that contains all the information necessary to reconstruct one bond in the molecule.
bond_id origin_atom_id target_atom_id bond_type [status_bits]
- bond_id (integer): The ID number of the bond at the time the file was created. This is provided for reference only and is not used when the .mol2 file is read into any mol2 parser
- origin_atom_id (integer): The ID number of the atom at one end of the bond
- target_atom_id (integer): The ID number of the atom at the other end of the bond
- bond_type (string): The SYBYL bond type. The supported bond types are:
- 1 = single
- 2 = double
- 3 = triple
- am = amide
- ar = aromatic
- du = dummy
- un = unknown (cannot be determined from the parameter tables)
- nc = not connected
- status_bits (string): The internal SYBYL status bits associated with the bond. These should never be set by the user. Valid status bit values are
As an example, once again, both the following configurations are correct:
@<TRIPOS>BOND 1 1 2 ar BACKBONE|DICT|INTERRES
@<TRIPOS>BOND 1 1 2 ar
The bond in the first example has ID number
1 and connects atoms
2. It is an
aromatic bond. The status bits indicate the bond is part of the backbone chain, joins two residues, and that a dictionary was used when creating the molecule. The second example is a minimal representation of the same bond.
Few important points
- Any line containing only white spaces is considered a blank line and is ignored by the MOL2 command
- A printable ASCII string beginning with a hash or pound sign (
#) in column 1 is called a comment line. These lines are again ignored by the MOL2 command
- Comments are terminated by the end of the line and may not be continued to the next line
- The string
****is used to indicate that a non-optional string field is empty