Molinspiration Batch Molecule Processing

mib batch molecule processing v2024.01

The Molinspiration mib engine allows batch processing of molecules encoded as SMILES or SDfiles. The engine supports a broad range of cheminformatics functions, for example file conversion, structure normalization, generation of tautomers, calculation of molecular physicochemical properties and many others. The mib program is able to process large molecule files. For example, as part of our research activities, 33 million molecules from the PubChem database were converted from SDfile format to SMILES and normalized, and their properties were calculated. mib is also used by the ZINC virtual screening database to standardize molecules and calculate properties.
The program is written in Java and is therefore platform independent: it runs on any Windows PC, Mac or Linux server where a Java runtime (version 11 or later) is available.

Running the program

The program is started by the following command:

java -jar mib.jar input_option [processing_options]

Input options

Molecules stored in the Daylight SMILES format or MDL SDfile format may be processed. Molecules from a file may be read by using the option

-f file_name

where file_name is a file containing a set of SMILES strings or an SDfile. An SDfile is recognized automatically by the extension ".sdf", ".sd" or ".mol". All other files are assumed to contain SMILES. In this case the SMILES must be the first item on a line. The line may also contain other items (molecule name, other data), tab separated.
Compressed files with the extension .gz, .gzip or .zip may be processed without unpacking them.
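The SMILES input layout described above (SMILES first, further items tab separated, optionally compressed) can be produced with a few lines of Python. This is a minimal sketch: the helper names smiles_lines and write_smiles_file and the two example molecules are illustrative assumptions, not part of mib.

```python
import gzip


def smiles_lines(records):
    """Format (smiles, item, ...) tuples as tab-separated lines, SMILES first."""
    return ["\t".join(str(item) for item in record) for record in records]


def write_smiles_file(path, records):
    """Write the records to path; a .gz suffix selects gzip compression."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for line in smiles_lines(records):
            handle.write(line + "\n")


# Illustrative molecules (assumption): SMILES followed by a molecule name.
records = [("c1ccccc1", "benzene"), ("CCO", "ethanol")]
for line in smiles_lines(records):
    print(line)
```

A file written this way, for example with write_smiles_file("demo.smi.gz", records), should then be readable through the -f option.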

Data from the SDfile may be retrieved by using the -keep parameter. For example:

java -jar mib.jar -f mdpi.smi -keep "MolName,Amount"

also retrieves the parameters named MolName and Amount from the SDfile.
To retrieve all parameters, use the option -keepall.

A single SMILES may be processed by using the -smi input parameter (in this case the SMILES string must be in quotes).

java -jar mib.jar -smi 'molecule_smiles'
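When mib is called from a script rather than typed into a shell, the same command line can be assembled as an argument list. The following is a minimal sketch in Python, assuming that mib.jar is in the working directory and that java is on the PATH; the helper names mib_command and run_mib are hypothetical.

```python
import subprocess


def mib_command(input_args, options=(), jar="mib.jar"):
    """Build the argument list for one mib run (the jar location is an assumption)."""
    return ["java", "-jar", jar, *input_args, *options]


def run_mib(input_args, options=(), jar="mib.jar"):
    """Run mib and return the lines it writes to standard output."""
    result = subprocess.run(
        mib_command(input_args, options, jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if java exits with an error
    )
    return result.stdout.splitlines()


# Arguments passed as a list bypass the shell, so the SMILES needs no extra quotes.
print(mib_command(["-smi", "c1ncccc1"]))
```

Calling run_mib(["-f", "gpcr.smi"], ["-standardize"]) would then return the output lines for further processing in the script.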

Output options

When no output options are provided, canonical SMILES are generated and sent to the standard output.
When an SDfile should be generated, the parameter -out sdf is required.

Processing options

When no processing options are given, canonical SMILES of the input molecules are generated and sent to the standard output. This may be redirected to a file by the > redirection operator.

Molecule normalization

Various parameters allow modification and processing of molecules.

-nostereo stereo information in molecules will not be considered
-normalizeCharges atomic charges will be normalized when possible
-singlePart only the main (largest) part of a multipart molecule will be processed
-standardize is a shortcut for all three previous options together
-normalizeIsotopes removes all isotope labels from atoms
-isostandardize is a shortcut for -standardize plus -normalizeIsotopes
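The relation between the two shortcuts and the individual options can be written out explicitly. The sketch below only restates the list above as a lookup table; the function expand_shortcuts is a hypothetical helper, and mib itself resolves the shortcuts internally.

```python
# Shortcut definitions as described in the option list above.
SHORTCUTS = {
    "-standardize": ["-nostereo", "-normalizeCharges", "-singlePart"],
    "-isostandardize": ["-standardize", "-normalizeIsotopes"],
}


def expand_shortcuts(options):
    """Replace shortcut options by the individual options they stand for."""
    expanded = []
    for option in options:
        if option in SHORTCUTS:
            # A shortcut may itself contain a shortcut, so expand recursively.
            expanded.extend(expand_shortcuts(SHORTCUTS[option]))
        else:
            expanded.append(option)
    return expanded


print(expand_shortcuts(["-isostandardize"]))
```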

The mib package performs strict valence checking and discards molecules violating organic valence rules (when molecules with errors are skipped, the respective error messages are issued). The keyword:

-kmwve (mnemonic for "keep molecules with valence errors") also allows processing of molecules with non-standard valences

Calculation of molecular properties

Molecular properties for molecules on input are calculated when using the keyword -properties

The following properties are available on the output (in this order); items are tab separated:

logP octanol-water partition coefficient
PSA polar surface area
number of nonhydrogen atoms
molecular weight
number of hydrogen-bond acceptors (O and N atoms)
number of hydrogen-bond donors (OH and NH groups)
number of Rule of 5 violations
number of rotatable bonds
molecular volume

When using the keyword -header in the SMILES output mode, the first line of output is a header with property names.
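Because the properties are written tab separated and in a fixed order, an output line can be turned into a dictionary with little code. This is a sketch under stated assumptions: it assumes the nine property values are the last nine fields of the line, the key names are hypothetical labels (the real names are those printed by -header), and the numbers in the sample line are placeholders, not real mib output.

```python
# Hypothetical labels, in the documented output order (assumption: not mib's own names).
PROPERTY_KEYS = [
    "logP", "PSA", "natoms", "MW", "nON", "nOHNH", "nviolations", "nrotb", "volume",
]


def parse_property_line(line):
    """Split one output line into SMILES, other input items and property values."""
    fields = line.rstrip("\n").split("\t")
    n = len(PROPERTY_KEYS)
    if n >= len(fields):
        raise ValueError("expected a SMILES followed by %d property values" % n)
    values = [float(value) for value in fields[-n:]]
    return {
        "smiles": fields[0],
        "data": fields[1:-n],  # items carried over from the input line, if any
        "properties": dict(zip(PROPERTY_KEYS, values)),
    }


# Placeholder values for illustration only, not calculated by mib.
sample = "CCO\tethanol\t-0.1\t20.2\t3\t46.1\t1\t1\t0\t0\t54.0"
parsed = parse_property_line(sample)
print(parsed["smiles"], parsed["properties"]["MW"])
```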

The mib property calculation engine is used in numerous instances by our industry customers and also powers our free online property calculation tool.

Molecule formula

When using the keyword -formula the molecule formula is part of the output. This keyword may be used together with the -properties keyword.

Molecule fragmentation

mib allows fragmentation of molecules into various types of fragments. Examples of the fragmentation options are given below. The parent molecule used for the fragmentation is the structure shown above.

-r1 - substituents (R-groups); all "breakable" nonring single bonds are broken to generate substituents

-r2 - spacers (groups with 2 attachment points)

-ringSystems - ring systems; a ring system is a collection of fused or spiro rings

-simpleRings - simple rings contained in the molecule. A simple ring does not need to be a valid molecule (in the example below, only 2 atoms in the sulfur ring are aromatic), therefore results are provided as fragment SMILES (note that aromatic bonds in fragments are displayed as dashed lines in the images below)

C1Sc:nN1
c:1:n:n:c:n:1
c:1:c:c:c:c:c:1
o:1:c:c:c:c:1

-scaffold - the ring part of a molecule (ring systems and their connections) without aliphatic substituents

-hose - generates so-called HOSE fragments (atoms with their environment). A HOSE fragment consists of a central atom (the first atom in the HOSE SMILES) and several levels of surrounding atoms.

HOSE fragments may be used as structural descriptors in QSAR studies or in fragment-based property prediction applications.

Fragmentation parameters

By default, the size of the r1 and r2 fragments is limited to 15 atoms. This may be changed by the parameter -maxsize n (this parameter does not affect ring or scaffold fragments; for HOSE fragments it sets the number of levels).

Generated fragments are written in the output line after the input SMILES (in a canonized form) and any other parameters from the input. All items are tab separated.

The parameter -list produces fragmentation statistics for large collections of molecules. The output is a list of fragments, together with the number of molecules containing each fragment.
The following command provides a list of substituents of up to 8 atoms that are most common in GPCR ligands.

java -jar mib.jar -f gpcr.smi -standardize -r1 -max 8 -list > gpcr.r1

The first lines of the output file are

[R]C	539
[R]c1ccccc1	213
[R]O	204
[R]OC	193
[R]N	191
[R]Cl	172
[R]F	133
[R]CC	83
[R]CCC	71
[R]C(O)=O	67  
. . .
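A list of this kind is easy to post-process. The sketch below parses the first lines shown above into a dictionary and selects the substituents found in at least 200 molecules; the function name parse_fragment_list and the threshold are illustrative assumptions.

```python
def parse_fragment_list(text):
    """Return {fragment SMILES: count} from the output of a -list run."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only "fragment<TAB>count" lines; anything else is skipped.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


# First lines of the gpcr.r1 output shown above.
sample = "[R]C\t539\n[R]c1ccccc1\t213\n[R]O\t204\n[R]OC\t193\n[R]N\t191\n"
counts = parse_fragment_list(sample)
common = [fragment for fragment, n in counts.items() if n >= 200]
print(common)
```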

When the parameter -count is used, the number of fragments of each type is also provided, in the form fragment1 count1 fragment2 count2 (tab separated).

For example, the command

java -jar mib.jar -smi 'c1ncccc1' -hose -maxSize 1

provides the output

c1ccncc1 [n] [cH]

while with the parameter -count the output also includes the number of the respective fragments in the molecule

c1ccncc1 [n] 1 [cH] 5
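The alternating fragment and count items of such a line can be paired up as follows. This is a minimal sketch which assumes that the input line carried no additional data, so that everything after the first item is a fragment/count pair; parse_count_line is a hypothetical helper name.

```python
def parse_count_line(line):
    """Split a -count output line into the SMILES and a {fragment: count} dict."""
    fields = line.rstrip("\n").split("\t")
    smiles, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating fragment and count items")
    # Even positions hold fragments, odd positions hold their counts.
    return smiles, {frag: int(n) for frag, n in zip(rest[0::2], rest[1::2])}


smiles, fragments = parse_count_line("c1ccncc1\t[n]\t1\t[cH]\t5")
print(smiles, fragments)
```

For the pyridine example the counts add up to 6, the number of nonhydrogen atoms in the molecule.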

When the parameter -count is used together with the parameter -list, the number in the output gives the total number of fragments of this type in the molecule set (and not just the number of molecules containing this fragment, as with -list alone).

-maxsize sets the maximum size of generated fragments. It is ignored when generating rings and scaffolds and is meaningful only for -r1 and -r2 fragments and HOSE fragments (for HOSE fragments, -maxsize is the number of levels: 1 - just the central atom, 2 - a single level of neighbors, 3 - 2 levels of neighbors, etc.)

Fragmentation is applied only to the main part of multipart molecules (as if the keyword -singlePart were used). [This may be modified in a later release of the toolkit].

Generation of tautomers

Tautomers of processed molecules may be generated by using the option -tautomers. The output line contains the canonized SMILES of the original molecule, followed by any data contained in the input line, followed by the number of generated tautomers and the SMILES codes of these tautomers. All items in the line are tab separated.

The command

java -jar mib.jar -smi 'n1c(O)cccc1' -tautomers

provides the following output

Oc1ccccn1 2 O=c1cccc[nH]1 Oc1ccccn1
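Since data from the input line may sit between the original SMILES and the tautomer count, a parser has to locate the count first. The sketch below uses a simple heuristic (an assumption, not mib's own logic): the count is the first purely numeric item whose value equals the number of items that follow it.

```python
def parse_tautomer_line(line):
    """Split a -tautomers output line into SMILES, input data and tautomers."""
    fields = line.rstrip("\n").split("\t")
    for i in range(1, len(fields)):
        # The count must equal the number of SMILES items after it.
        if fields[i].isdigit() and int(fields[i]) == len(fields) - i - 1:
            return {
                "smiles": fields[0],
                "data": fields[1:i],
                "tautomers": fields[i + 1:],
            }
    raise ValueError("no tautomer count found in line")


parsed = parse_tautomer_line("Oc1ccccn1\t2\tO=c1cccc[nH]1\tOc1ccccn1")
print(parsed["tautomers"])
```

The heuristic may pick the wrong item if the input data itself contains a number that happens to match, so it should be checked against real output before being relied on.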

To get the tautomers of a single molecule with one tautomer SMILES per line (this output may be displayed, for example, by the Molinspiration molecule viewer), use the -list option.

The command

java -jar mib.jar -smi 'Nc2nc1nc[nH]c1c(=S)[nH]2' -tautomers -list > tautomers.smi

saves the following output to the tautomers.smi file (the source SMILES in this example is thioguanine)

Nc2nc(=S)c1nc[nH]c1[nH]2	1
Sc1nc(=N)[nH]c2nc[nH]c12	1
Sc1nc(=N)[nH]c2[nH]cnc12	1
N=c2[nH]c(=S)c1nc[nH]c1[nH]2	1
Sc1[nH]c(=N)nc2nc[nH]c12	1
Nc2nc(S)c1nc[nH]c1n2	1
Nc2nc1ncnc1c(S)[nH]2	1
Sc1[nH]c(=N)nc2[nH]cnc12	1
Nc2nc(S)c1ncnc1[nH]2	1
Nc2nc(=S)c1[nH]cnc1[nH]2	1
Nc2nc(S)c1[nH]cnc1n2	1
Sc1[nH]c(=N)[nH]c2ncnc12	1
Nc2nc1[nH]cnc1c(=S)[nH]2	1
N=c2[nH]c(=S)c1[nH]cnc1[nH]2	1
Nc2nc1nc[nH]c1c(=S)[nH]2	1

The 15 tautomers generated are shown below

Some molecules may have a very large number of tautomers (several hundred). To keep computational time reasonable, the number of generated tautomers is limited to 50 by default. This limit may be increased by the parameter -maxtautomers n. Tautomers are listed in the output in alphabetical order.
E/Z stereochemistry on tautomeric bonds is not preserved during tautomer enumeration.