Welcome to the first article in the 10-part Cheminformatics for Beginners series. This post introduces the SMILES string notation used in cheminformatics.
SMILES (Simplified Molecular-Input Line-Entry System) is a line notation for representing molecules as well as reactions. It is one of the most common ways to represent molecules because it is simple and easy for humans to read.
Below are a few examples to give you an idea of the notation. Don’t worry, we will go into the details of how to write these notations next.
Propane: CCC
Butane: CCCC
Ethene: C=C
All non-Hydrogen atoms are represented by their atomic symbols. Remember the periodic table from the school days?
In a SMILES string, any unfilled valence of an atom is assumed to be filled by Hydrogen. For example, writing a simple C means that it’s actually CH4 (Methane) and not elemental Carbon. Similarly, N is NH3 (Ammonia) and O is H2O (Water).
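To make the implicit-Hydrogen rule concrete, here is a minimal Python sketch. The valence table and the function name implicit_hydrogens are assumptions made for this illustration (they do not come from any toolkit), and only the lowest common valence of a few atoms is covered.

```python
# Lowest common valence of a few atoms (a simplified assumption for this sketch).
DEFAULT_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(symbol, bonds_used=0):
    """Return how many Hydrogens are implied for an atom written without brackets.

    bonds_used is the sum of the bond orders the atom already has
    to other non-Hydrogen atoms.
    """
    return max(DEFAULT_VALENCE[symbol] - bonds_used, 0)


print(implicit_hydrogens("C"))  # 4 -> CH4 (Methane)
print(implicit_hydrogens("N"))  # 3 -> NH3 (Ammonia)
print(implicit_hydrogens("O"))  # 2 -> H2O (Water)

# Propane (CCC): the end Carbons each use 1 bond, the middle Carbon uses 2.
propane_h = sum(implicit_hydrogens("C", used) for used in (1, 2, 1))
print(propane_h)  # 8 -> C3H8
```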
To represent elemental atoms, the [] (square bracket) notation is used. For example, [S] is elemental Sulfur. If you want to add the Hydrogens to a SMILES string explicitly, square brackets can be used as well. For example, Methane and Ethane can be written as [CH4] and [CH3][CH3], respectively.
Single, double, and triple bonds are represented by the symbols -, =, and #, respectively. Like Hydrogens, single bonds are often omitted for simplicity. Adjacent atoms are assumed to be connected by a single or aromatic bond.
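The bond rules can be sketched in a few lines of Python. This is a simplified illustration, not a full SMILES parser: the token pattern and the function name chain_bond_orders are assumptions, and only linear chains (no branches, rings, or aromatic atoms) are handled.

```python
import re

# Simplified, hypothetical token pattern: bracket atoms, a few unbracketed
# atoms, and the three bond symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[-=#]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def chain_bond_orders(smiles):
    """Return the bond order between each pair of neighbouring atoms in a chain."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"not supported by this sketch: {smiles!r}")
    orders = []
    pending = None      # explicit bond symbol waiting for the next atom
    seen_atom = False
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if seen_atom:
            # No symbol between two atoms means a single bond.
            orders.append(pending if pending is not None else 1)
        seen_atom = True
        pending = None
    return orders


print(chain_bond_orders("CCC"))  # [1, 1]  Propane
print(chain_bond_orders("C=C"))  # [2]     Ethene
print(chain_bond_orders("C#N"))  # [3]
print(chain_bond_orders("C-C"))  # [1]     same as "CC"
```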
NOTE: As you may have guessed by now, there can be multiple ways to represent a molecule as a SMILES string. For example, all the following notations are correct for Ethane: CC, C-C, [CH3][CH3], and [CH3]-[CH3].
Here are a few more examples:
For charged atoms or molecules, the square bracket notation is again the way to go. A positive charge is represented by a + sign and a negative charge by a - sign. Some examples to clarify:
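As a rough illustration of how a bracket atom can be read, the sketch below pulls out the element, the explicit Hydrogen count, and the charge. The pattern and the function name parse_bracket_atom are assumptions made for this example. It deliberately ignores isotopes, chirality, aromatic atoms, and atom classes.

```python
import re

# Simplified, hypothetical pattern: [Element][H count][charge]
BRACKET_ATOM = re.compile(
    r"^\[(?P<element>[A-Z][a-z]?)"
    r"(?P<h>H(?P<hcount>\d*))?"
    r"(?:(?P<sign>\++|-+)(?P<size>\d*))?\]$"
)


def parse_bracket_atom(token):
    """Return (element, explicit_hydrogens, charge) for a token such as '[NH4+]'."""
    match = BRACKET_ATOM.match(token)
    if match is None:
        raise ValueError(f"not supported by this sketch: {token!r}")
    hydrogens = int(match["hcount"] or 1) if match["h"] else 0
    charge = 0
    sign = match["sign"]
    if sign:
        # "+2" and "++" both mean a charge of two.
        size = int(match["size"]) if match["size"] else len(sign)
        charge = size if sign[0] == "+" else -size
    return match["element"], hydrogens, charge


print(parse_bracket_atom("[S]"))     # ('S', 0, 0)
print(parse_bracket_atom("[CH4]"))   # ('C', 4, 0)
print(parse_bracket_atom("[NH4+]"))  # ('N', 4, 1)
print(parse_bracket_atom("[OH-]"))   # ('O', 1, -1)
```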
In some of the molecules in Table 1, you might have noticed () (parentheses) in the SMILES strings. These parentheses indicate a branch in the molecule.
Let’s take the example of Propionic Acid. If the green-highlighted atoms are taken as the base chain, then the yellow-highlighted atoms become a branch.
In either case, there’s a branch and, as we just discussed, parentheses are used to create a branch in the SMILES string. So both of the following strings are correct for Propionic Acid: CCC(=O)O and CCC(O)=O.
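One way to picture how parentheses work is a small stack-based reader: an opening parenthesis remembers the atom where the branch starts, and a closing parenthesis returns to it. The sketch below is a simplified illustration under that assumption (the names parse_branches and bond_summary are hypothetical, and rings and aromatic atoms are not handled). It shows that two spellings of Propionic Acid, CCC(=O)O and CCC(O)=O, describe the same set of bonds.

```python
import re

TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branches(smiles):
    """Return (atoms, bonds); bonds are (atom_index, atom_index, order) tuples."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"not supported by this sketch: {smiles!r}")
    atoms, bonds, stack = [], [], []
    previous = None   # index of the atom the next atom attaches to
    order = 1         # omitted bond symbol means a single bond
    for token in tokens:
        if token == "(":
            stack.append(previous)      # remember the branch point
        elif token == ")":
            previous = stack.pop()      # go back to the branch point
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, order))
            previous, order = current, 1
    if stack:
        raise ValueError("unbalanced parentheses")
    return atoms, bonds


def bond_summary(smiles):
    """Sorted list of (element, element, order), independent of writing order."""
    atoms, bonds = parse_branches(smiles)
    return sorted(tuple(sorted((atoms[a], atoms[b]))) + (o,) for a, b, o in bonds)


print(parse_branches("CCC(=O)O")[1])  # [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]
print(bond_summary("CCC(=O)O") == bond_summary("CCC(O)=O"))  # True
```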
Cyclic Structures/Ring Molecules
All the examples we have seen so far were noncyclic structures. Now, we’ll see how to represent cyclic structures.
Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
Let’s take the example of Cyclohexane. If we break the structure at the marked bond, it’ll become a linear structure.
Then we assign the same number to both atoms between which the bond was broken. In this case, the structure is symmetric, so it doesn’t matter which bond we break. The SMILES string now becomes C1CCCCC1.
NOTE: The number assigned in the SMILES string is just a marker to denote the ring closure. The above SMILES string can also be written as C2CCCCC2, but it’s good practice and convention to use the lowest available integer.
What if there are two rings? Then we break one bond in each ring and assign a separate number to each broken bond’s pair of atoms. Let’s take a look at a few examples to understand it better.
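The ring-closure bookkeeping can be sketched as follows: the first time a number appears it opens a ring at the preceding atom, and the second time it closes the ring there. This is a minimal sketch, assuming single-letter atoms, single-digit labels, and no square brackets. The function name ring_closures is hypothetical. Decalin (C1CCC2CCCCC2C1) is used as a two-ring example.

```python
def ring_closures(smiles):
    """Return the (atom_index, atom_index) pairs joined by ring-closure numbers."""
    open_labels = {}   # label -> index of the atom that opened the ring
    pairs = []
    atom_index = -1
    for char in smiles:
        if char.isalpha():
            atom_index += 1             # every letter is one atom in this sketch
        elif char.isdigit():
            if char in open_labels:
                pairs.append((open_labels.pop(char), atom_index))
            else:
                open_labels[char] = atom_index
        # bond symbols and parentheses do not affect the pairing
    if open_labels:
        raise ValueError(f"unclosed ring label(s): {sorted(open_labels)}")
    return pairs


print(ring_closures("C1CCCCC1"))        # [(0, 5)]  Cyclohexane
print(ring_closures("C2CCCCC2"))        # [(0, 5)]  same ring, different label
print(ring_closures("C1CCC2CCCCC2C1"))  # [(3, 8), (0, 9)]  Decalin, two rings
```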
Aromatic compounds can be represented by the methods we have learned so far. However, there is a preferred way to represent them: aromatic atoms are written in lowercase letters. For example, aromatic Carbon is written as c, Nitrogen as n, Boron as b, and so on.
Let’s start with the famous molecule, Benzene. In Table 3, we already showed that Benzene can be represented by C1=CC=CC=C1. With this new method, Benzene can be written as c1ccccc1. Here, adjacent atoms are not assumed to be connected by a single bond; rather, the lowercase letters tell us that this is an aromatic ring, conventionally drawn with alternating single and double bonds. Let’s look at a few more examples below:
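The lowercase convention is easy to check in code. The sketch below only reads the notation: it reports which atoms are written as aromatic and does not detect aromaticity itself (a real toolkit would also recognise the C1=CC=CC=C1 form as aromatic). The pattern and the function name atoms_with_aromaticity are assumptions, and bracket atoms are not handled.

```python
import re

# Unbracketed atoms only: uppercase = aliphatic, lowercase = written as aromatic.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def atoms_with_aromaticity(smiles):
    """Return (element, written_as_aromatic) for each unbracketed atom."""
    return [(symbol.capitalize(), symbol.islower()) for symbol in ATOM.findall(smiles)]


kekule = atoms_with_aromaticity("C1=CC=CC=C1")
aromatic = atoms_with_aromaticity("c1ccccc1")

print(kekule[0], aromatic[0])             # ('C', False) ('C', True)
print(len(kekule), len(aromatic))         # 6 6
print(atoms_with_aromaticity("c1ccncc1")[3])  # ('N', True)  Pyridine
```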
The ions in an ionic compound are not connected to each other by covalent bonds. Such disconnected compounds are written as individual structures separated by a . (period). For example, Sodium Hydroxide in its ionized form is written as [Na+].[OH-].
The order in which ions or ligands are listed is arbitrary. There is no implied pairing of one charge with another, nor is it necessary to have a net-zero charge. Below are a few examples to understand it better.
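Here is a small sketch of these points: it splits a SMILES string on the period and adds up the charge of each part. The charge pattern and the function names are assumptions made for this example. It expects charges written at the end of a bracket atom, such as [Na+] or [Ca+2].

```python
import re

# Simplified assumption: the charge is the last thing inside a bracket atom.
CHARGE = re.compile(r"(\++|-+)(\d*)\]")


def fragment_charge(fragment):
    """Sum the charges of all bracket atoms in one fragment."""
    total = 0
    for signs, digits in CHARGE.findall(fragment):
        size = int(digits) if digits else len(signs)
        total += size if signs[0] == "+" else -size
    return total


def split_fragments(smiles):
    """Return (fragment, charge) for every period-separated part."""
    return [(part, fragment_charge(part)) for part in smiles.split(".")]


print(split_fragments("[Na+].[OH-]"))  # [('[Na+]', 1), ('[OH-]', -1)]
# Listing order is arbitrary: both spellings contain the same fragments.
print(sorted(split_fragments("[OH-].[Na+]")) == sorted(split_fragments("[Na+].[OH-]")))  # True
# A net charge of zero is not required.
print(sum(charge for _, charge in split_fragments("[NH4+]")))  # 1
```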
Disadvantages of a SMILES string
- As we saw earlier, there can be multiple ways to write a SMILES string for a molecule, so SMILES strings are not unique.
- A SMILES string doesn’t provide the position of each atom in space, which is typically given as X, Y, and Z Cartesian coordinates.
This was a basic introduction to the SMILES notation, just enough to get you started in the field, but there are a lot more rules and scenarios we haven’t covered: for example, handling stereochemistry and chirality, or representing chemical reactions. That’s a tutorial for another day. If you want to jump into them right away, follow the links in the next section.