SMILES strings explained for beginners (Cheminformatics Part 1)


Welcome to the first article in the Cheminformatics for Beginners series. This blog will introduce you to the SMILES string notation in cheminformatics.

Introduction

SMILES (Simplified Molecular-Input Line-Entry System) is a line notation for representing molecules as well as reactions. It is one of the most common ways to represent molecules because it is simple and readable to the human eye.

Below are a few examples just to give you an idea of the notation. Don’t worry, we will go into the details of how to write them next.

Propane: CCC
Butane:  CCCC
Ethene:  C=C

Atoms

All non-Hydrogen atoms are represented by their atomic symbols. Remember the periodic table from the school days?

In a SMILES string, any unfilled valence of an atom written without brackets is assumed to be filled by Hydrogen. For example, writing a simple C means that it’s actually CH4 (Methane) and not elemental Carbon. Similarly, N is NH3 (Ammonia) and O is H2O (Water).
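To make the implied-Hydrogen rule concrete, here is a minimal Python sketch. The names `DEFAULT_VALENCE` and `implicit_hydrogens` are made up for this illustration, and the valence table is deliberately simplified (it keeps only the lowest normal valence and leaves out atoms such as P and S that can have more than one).

```python
# Simplified valence table for atoms that can be written without brackets.
# Assumption: only the lowest normal valence is kept for each atom.
DEFAULT_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(symbol: str, bonds_used: int = 0) -> int:
    """Return the number of Hydrogens implied on a bare (unbracketed) atom.

    bonds_used is the sum of the bond orders the atom already has to
    other non-Hydrogen atoms (a double bond counts as 2).
    """
    return max(DEFAULT_VALENCE[symbol] - bonds_used, 0)


print(implicit_hydrogens("C"))     # 4 -> a lone C is CH4 (Methane)
print(implicit_hydrogens("N"))     # 3 -> a lone N is NH3 (Ammonia)
print(implicit_hydrogens("O"))     # 2 -> a lone O is H2O (Water)
print(implicit_hydrogens("C", 1))  # 3 -> each Carbon in CC is a CH3
```

The same subtraction explains the examples at the top of the article: each Carbon in C=C has already used two of its four valences, so it carries two Hydrogens.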

To represent an elemental atom with no implied Hydrogens, square brackets [] are used. For example, [S] is elemental Sulfur. If you want to add the Hydrogens to a SMILES string explicitly, square brackets are used for that as well. For example, Methane and Ethane can be written as [CH4] and [CH3][CH3] respectively.

Bonds

Single, double, and triple bonds are represented by the symbols -, =, and #, respectively. Like Hydrogens, single bonds are usually omitted for simplicity: adjacent atoms with no bond symbol between them are assumed to be connected by a single or aromatic bond.

NOTE: As you may have guessed by now, there can be multiple ways to represent a molecule in the SMILES string. For example, all the following notations are correct for Ethane: CC, C-C, [CH3]-[CH3], [CH3]-C
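One way to convince yourself that these strings describe the same molecule is to compute a molecular formula from each of them. The sketch below is a toy reader written for this article, not a full SMILES parser: `molecular_formula`, `VALENCE`, and `TOKEN` are hypothetical names, and it only understands bare atoms, simple bracket atoms, the bond symbols -, =, # and parentheses (no rings, aromatic atoms, or isotopes).

```python
import re
from collections import Counter

# Assumption: simplified valences, enough for the examples in this article.
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# One token per match: bracket atom, bare atom, bond symbol, or parenthesis.
TOKEN = re.compile(
    r"\[(?P<sym>[A-Z][a-z]?)(?:H(?P<h>\d*))?(?:[+-]\d*)?\]"
    r"|(?P<bare>Cl|Br|[BCNOFI])"
    r"|(?P<bond>[-=#])"
    r"|(?P<paren>[()])"
)


def molecular_formula(smiles: str) -> str:
    atoms = []         # one [symbol, explicit_h_or_None, bond_order_sum] per atom
    branch_stack = []  # atom to return to after each ")"
    prev = None        # index of the atom the next atom bonds to
    order = 1          # order of the next bond (single unless stated)
    pos = 0
    while pos < len(smiles):
        tok = TOKEN.match(smiles, pos)
        if tok is None:
            raise ValueError(f"unsupported character at position {pos}")
        pos = tok.end()
        if tok["bond"]:
            order = BOND_ORDER[tok["bond"]]
        elif tok["paren"] == "(":
            branch_stack.append(prev)
        elif tok["paren"] == ")":
            prev = branch_stack.pop()
        else:
            if tok["bare"]:
                atoms.append([tok["bare"], None, 0])
            else:
                # "[CH3]" -> 3 Hydrogens, "[CH]" -> 1, "[S]" -> 0
                h = tok["h"]
                atoms.append([tok["sym"], 0 if h is None else int(h or 1), 0])
            if prev is not None:
                atoms[prev][2] += order
                atoms[-1][2] += order
            prev, order = len(atoms) - 1, 1

    counts = Counter()
    for symbol, explicit_h, bond_sum in atoms:
        counts[symbol] += 1
        if explicit_h is None:  # bare atom: fill the remaining valence with H
            explicit_h = max(VALENCE[symbol] - bond_sum, 0)
        counts["H"] += explicit_h

    def sort_key(symbol):
        # Carbon first, Hydrogen second, everything else alphabetically.
        return (symbol != "C", symbol != "H", symbol)

    return "".join(
        f"{symbol}{count if count > 1 else ''}"
        for symbol, count in sorted(counts.items(), key=lambda item: sort_key(item[0]))
        if count
    )


for text in ["CC", "C-C", "[CH3]-[CH3]", "[CH3]-C"]:
    print(text, "->", molecular_formula(text))  # every line ends in C2H6
```

Matching formulas do not prove that two strings are the same molecule (isomers share a formula), but a mismatch is a quick sign that one of the strings is wrong.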

Here are a few more examples:

Table 1
Molecule            SMILES       Alternate SMILES
Ethene              C=C          [CH2]=[CH2]
Formaldehyde        C=O          [CH2]=O
Carbon Dioxide      O=C=O        C(=O)=O
Hydrogen Cyanide    C#N          [CH]#N
Ethanol             CCO          CC[OH]
Propionic Acid      CCC(=O)O     [CH3][CH2]C(O)=O
Methyl Isocyanate   CN=C=O       [CH3]N=C=O
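If you want to see how a program might read these strings, a common first step is to split them into tokens. Here is a small sketch using a regular expression; `SMILES_TOKEN` and `tokenize` are names invented for this example, and the pattern only covers the pieces used in Table 1 (bracket atoms, bare atoms, bond symbols, and parentheses).

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"     # bracket atom such as [CH3] or [OH]
    r"|Cl|Br"         # two-letter atoms that may be written without brackets
    r"|[BCNOPSFI]"    # one-letter atoms that may be written without brackets
    r"|[-=#]"         # explicit single, double, or triple bond
    r"|[()]"          # start / end of a branch
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into atom, bond, and parenthesis tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips characters it cannot match, so check for gaps.
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return tokens


print(tokenize("CCC(=O)O"))    # ['C', 'C', 'C', '(', '=', 'O', ')', 'O']
print(tokenize("[CH3]N=C=O"))  # ['[CH3]', 'N', '=', 'C', '=', 'O']
```

Notice that a whole bracket atom such as [CH3] comes out as a single token, and that the implied single bonds do not appear at all: they exist only as the absence of a bond symbol between two atoms.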

Charged Molecules

For charged atoms or molecules, the square bracket notation is the way to go again. A positive charge is represented by a + sign and a negative charge by a - sign. Some examples to clarify:

Table 2
Molecule            SMILES
Sodium Ion          [Na+]
Ammonium Cation     [NH4+]
Hydroxide Anion     [OH-]
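Everything inside a pair of square brackets follows a fixed order: the element symbol, then the optional Hydrogens, then the optional charge. The sketch below pulls those three parts out with a regular expression. `BRACKET_ATOM` and `parse_bracket_atom` are hypothetical names, and the pattern ignores the extra fields that real bracket atoms may carry (isotopes, chirality, atom classes).

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)"  # element symbol, e.g. Na, N, O
    r"(?:H(?P<hcount>\d*))?"      # optional Hydrogens: H, H2, H3, ...
    r"(?P<charge>[+-]\d*)?\]"     # optional charge: +, -, +2, -2, ...
)


def parse_bracket_atom(text: str):
    """Return (symbol, hydrogen_count, charge) for a simple bracket atom."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a simple bracket atom: {text!r}")
    h = match["hcount"]
    hydrogens = 0 if h is None else int(h or 1)  # a bare "H" means one Hydrogen
    c = match["charge"]
    if c is None:
        charge = 0
    else:
        sign = 1 if c[0] == "+" else -1
        charge = sign * int(c[1:] or 1)          # a bare "+" or "-" means 1
    return match["symbol"], hydrogens, charge


print(parse_bracket_atom("[Na+]"))   # ('Na', 0, 1)
print(parse_bracket_atom("[NH4+]"))  # ('N', 4, 1)
print(parse_bracket_atom("[OH-]"))   # ('O', 1, -1)
print(parse_bracket_atom("[S]"))     # ('S', 0, 0)
```

Unlike bare atoms, a bracket atom gets no implied Hydrogens, which is why [S] comes back with a Hydrogen count of 0.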

Branched Molecules

You might have noticed parentheses () in some of the SMILES strings in Table 1. These parentheses indicate a branch in the molecule.

Let’s take the example of Propionic Acid. The third Carbon is bonded to two Oxygens. If one Oxygen is taken as part of the base chain, the other Oxygen becomes a branch, and the other way around.

In either case there’s a branch, and as we just discussed, parentheses are used to create a branch in the SMILES string. So both of the following strings are correct for Propionic Acid: CCC(=O)O, CCC(O)=O
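A reader program typically handles parentheses with a stack: an opening parenthesis remembers the current atom, and a closing parenthesis jumps back to it. Here is a minimal sketch of that idea. `parse_branches` and `bond_summary` are names invented for this example, and the code only supports bare atoms, bond symbols, and parentheses.

```python
import re

TOKEN = re.compile(r"(Cl|Br|[BCNOPSFI])|([-=#])|([()])")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branches(smiles: str):
    """Return (atoms, bonds); each bond is (atom_index_a, atom_index_b, order)."""
    tokens = TOKEN.findall(smiles)
    if "".join(a + b + p for a, b, p in tokens) != smiles:
        raise ValueError("only bare atoms, bonds, and parentheses are supported")
    atoms, bonds, stack = [], [], []
    prev, order = None, 1
    for atom, bond, paren in tokens:
        if bond:
            order = BOND_ORDER[bond]
        elif paren == "(":
            stack.append(prev)   # remember where the branch starts
        elif paren == ")":
            prev = stack.pop()   # go back to the branch point
        else:
            atoms.append(atom)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
    return atoms, bonds


def bond_summary(smiles: str):
    """Describe each bond by its two elements and its order, sorted."""
    atoms, bonds = parse_branches(smiles)
    return sorted(tuple(sorted((atoms[a], atoms[b]))) + (o,) for a, b, o in bonds)


print(parse_branches("CCC(=O)O")[1])  # [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]
print(bond_summary("CCC(=O)O") == bond_summary("CCC(O)=O"))  # True
```

The last bond in the first printout is the interesting one: the final O (atom 4) attaches to the third Carbon (atom 2), not to the Oxygen inside the parentheses.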

Cyclic Structures/Ring Molecules

All the examples we have seen so far were noncyclic structures. Now, we’ll see how to represent cyclic structures.

Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.

Let’s take the example of Cyclohexane. If we break one bond of the ring, it becomes a linear chain of six Carbons.

Then we assign the same number to both atoms between which the bond was broken. In this case, the structure is symmetric and hence it doesn’t matter which bond we break. The SMILES string will now become C1CCCCC1.

NOTE: The number assigned in the SMILES string is just a marker to denote the ring closure. The above SMILES string can also be written as C2CCCCC2. But it’s good practice and convention to go with the lowest integer.

What if there are 2 rings? Then we break one bond in each ring and give each ring its own number, written on the two atoms of its broken bond. Let’s take a look at a few examples to understand it better.

Table 3
Molecule              SMILES
1,4-Cyclohexadiene    C1=CCC=CC1
Benzene               C1=CC=CC=C1
Toluene               C1=C(C)C=CC=C1
Decalin               C1CCC2CCCCC2C1
Bicyclohexyl          C1CCC(C2CCCCC2)CC1
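To see which atoms each ring label connects, we can pair up the digits while counting atoms. The sketch below does only that. `ring_closures` is a made-up name, and the code assumes single-digit ring labels, which is all this article uses.

```python
import re

# An atom token (bracket atom, bare atom, or aromatic atom) or a ring digit.
# Bond symbols and parentheses are skipped: they do not affect the pairing.
TOKEN = re.compile(r"(\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])|(\d)")


def ring_closures(smiles: str):
    """Return (atom_count, closures); each closure is a pair of atom indices."""
    atom_count = 0
    open_rings = {}  # ring digit -> index of the atom that opened it
    closures = []
    for atom, digit in TOKEN.findall(smiles):
        if atom:
            atom_count += 1
        else:
            current = atom_count - 1  # a digit belongs to the atom just read
            if digit in open_rings:
                closures.append((open_rings.pop(digit), current))
            else:
                open_rings[digit] = current
    if open_rings:
        raise ValueError(f"unclosed ring label(s): {sorted(open_rings)}")
    return atom_count, closures


print(ring_closures("C1CCCCC1"))        # (6, [(0, 5)])
print(ring_closures("C2CCCCC2"))        # (6, [(0, 5)])  same ring, different label
print(ring_closures("C1CCC2CCCCC2C1"))  # (10, [(3, 8), (0, 9)])
```

For Decalin, the two closures (3, 8) and (0, 9) are the two broken bonds, one per ring, which matches the "one broken bond per ring" rule described above.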

Aromatic Compounds

Aromatic compounds can be represented by the methods we have learned so far. However, there is a preferred way to represent them: aromatic atoms are written with lowercase letters. For example, aromatic Carbon is c, Nitrogen is n, Boron is b, and so on.

Let’s start with the famous molecule, Benzene. In Table 3, we already showed that Benzene can be represented by C1=CC=CC=C1. With this new method, Benzene can be written as c1ccccc1. Here, adjacent atoms are not assumed to be connected by a single bond; instead, the lowercase letters tell us that this is an aromatic ring, signifying alternating single and double bonds. Let’s look at a few more examples below:

Table 4
Molecule        SMILES
Phenol          c1cc(O)ccc1
Pyridine        c1ccncc1
Benzoic Acid    c1cc(C(O)=O)ccc1
Isoquinoline    c1ccc2cnccc2c1
Biphenyl        c1ccc(c2ccccc2)cc1
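One practical catch with the lowercase convention is that lowercase letters also appear inside two-letter element symbols such as Cl and Br, so you cannot simply look for lowercase characters. The sketch below picks out aromatic atoms token by token. `ATOM` and `aromatic_atoms` are hypothetical names, and bracket atoms are skipped to keep the example short.

```python
import re

# Order matters: bracket atoms and two-letter symbols are tried first, so the
# "l" in Cl or the "a" in [Na+] is never mistaken for an atom of its own.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def aromatic_atoms(smiles: str) -> list:
    """Return the bare aromatic (lowercase) atoms of a SMILES string, in order."""
    return [token for token in ATOM.findall(smiles) if token.islower()]


print(aromatic_atoms("c1ccccc1"))          # ['c', 'c', 'c', 'c', 'c', 'c']
print(aromatic_atoms("c1ccncc1"))          # ['c', 'c', 'c', 'n', 'c', 'c']
print(aromatic_atoms("C1=CC=CC=C1"))       # []
print(len(aromatic_atoms("Clc1ccccc1")))   # 6
```

The third line shows that the alternating-bond form of Benzene contains no lowercase atoms at all: the two strings describe the same molecule, but only one of them marks the ring as aromatic in the text itself. The last line uses Chlorobenzene, an extra example not in Table 4, to show that the l of Cl is not counted.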

Disconnected Structures

The ions in an ionic compound are not connected to each other by a covalent bond. Such disconnected compounds are written as individual structures separated by a . (period). For example, Sodium Hydroxide in its ionized form will be written as [Na+].[OH-].

The order in which ions or ligands are listed is arbitrary. There is no implied pairing of one charge with another, nor is it necessary to have a net-zero charge. Below are a few examples to understand it better.

Table 5
Molecule            SMILES
Sodium Phenoxide    [Na+].[O-]c1ccccc1
Sodium Chloride     [Na+].[Cl-]
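Because the period never appears inside an atom, splitting on it is enough to recover the individual parts. The sketch below does that and also adds up the charges written in each part. `CHARGE` and `fragments_with_charge` are names made up for this example.

```python
import re

# A charge lives inside a bracket atom: one or more signs, then optional digits.
CHARGE = re.compile(r"\[[^\]]*?([+-]+)(\d*)\]")


def fragments_with_charge(smiles: str) -> list:
    """Split on '.' and return a (fragment, net_charge) pair for each part."""
    result = []
    for fragment in smiles.split("."):
        charge = 0
        for sign, digits in CHARGE.findall(fragment):
            magnitude = int(digits) if digits else len(sign)
            charge += magnitude if sign[0] == "+" else -magnitude
        result.append((fragment, charge))
    return result


parts = fragments_with_charge("[Na+].[O-]c1ccccc1")
print(parts)                                # [('[Na+]', 1), ('[O-]c1ccccc1', -1)]
print(sum(charge for _, charge in parts))   # 0
```

Here the charges happen to cancel, but as noted above, nothing in the notation requires the total to be zero.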

Disadvantages of a SMILES string

  • As we saw earlier, there can be multiple ways to write a SMILES string for a molecule, so SMILES strings are not unique
  • A SMILES string carries no information about the position of each atom in space, which is typically given as X, Y, and Z Cartesian coordinates

Summary

This was a basic introduction to the SMILES notation, just enough to get you started in the field, but there are many more rules and scenarios we haven’t covered: for example, the stereochemistry and chirality of molecules, or representing chemical reactions in SMILES notation. That’s a tutorial for another day. If you want to jump into them right away, follow the links in the next section.

References and Further Reading

If you found this article helpful, don't forget to subscribe and share it with your colleagues and friends. If you have any questions, suggestions, and/or criticisms, feel free to reach out to me on any of the social media platforms mentioned in this blog. Until next time. Stay safe.
