The misearch engine is written entirely in Java and may therefore be used on any platform where Java (version 11 or later) is installed. Java is supported on practically all platforms (Windows, Linux, Unix). The latest version of the Java runtime environment may be downloaded for free from www.java.sun.com. (You can check which version of Java is installed on your machine with the command java -version.) No other software is required to run misearch.
The misearch toolkit places no limit on the size of the molecular database. Some of our customers use databases with several million molecules. A typical structure or similarity search in a database of 100,000 molecules requires 3-4 seconds on an average, single-processor PC.
The misearch engine does not use any special database software (such as ORACLE or MySQL); data are stored in a simple ASCII file. The big advantage of this setup is that it makes creating and using misearch databases, and running molecule searches, simple and straightforward. On the other hand, misearch cannot provide sophisticated database functionality, such as storage of all types of additional data or modification of data in an existing database. Each time the data change, the misearch database file must be created from scratch.
Misearch functions are available from the DOS or UNIX command line.
As a first step, a molecular database must be created from SMILES data. Molinspiration can provide free conversion software to transform MDL SDfiles to SMILES.
The database is created by the command
java -jar misearch.jar -f source_file -create > database
source_file
is a file with information about the molecules to be stored in the database, one molecule per line. Each line holds a SMILES string, tab separated from the molecule identifier (molecule name) and, optionally, from additional data.
This command generates a database file, which is sent to the standard output and may be redirected into a file (with any name you wish) by using the > character.
The created database is a simple text file, one line per molecule, which contains the standardized SMILES, the molecule structure code (a string of characters which encodes the molecular structure in a compact way), the molecule identifier, and any additional data contained in the source file.
Progress of the database creation is shown on the screen. Creation of a database with 10,000 molecules will require about 10 minutes.
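The tab-separated source file layout expected by the creation command can be sketched in Python. This is only an illustration and not part of misearch: the function names format_source_line and write_source_file are hypothetical, and the use of ASCII encoding is an assumption based on the description of the data as a simple ASCII file.

```python
from typing import Iterable, Sequence


def format_source_line(smiles: str, identifier: str, extra: Sequence[str] = ()) -> str:
    """Return one source-file line: SMILES, identifier, optional data, tab separated."""
    fields = [smiles, identifier, *extra]
    for field in fields:
        # A tab or newline inside a field would break the one-molecule-per-line layout.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def write_source_file(path: str, records: Iterable[Sequence[str]]) -> None:
    """Write records of (smiles, identifier, *extra) to path, one molecule per line."""
    # Assumption: plain ASCII text; non-ASCII characters raise UnicodeEncodeError.
    with open(path, "w", encoding="ascii") as handle:
        for smiles, identifier, *extra in records:
            handle.write(format_source_line(smiles, identifier, extra) + "\n")
```

For example, write_source_file("source_file", [("c1ccccc1", "benzene"), ("CCO", "ethanol", "solvent")]) would produce a two-line file suitable as input for the -create command.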
New molecules may be added to the end of an existing database file by the command:
java -jar misearch.jar -f source_file >> database
(note the >> redirection operator, which appends to the file instead of overwriting it)
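When the database is built from a script rather than typed at the command line, the two redirections can be reproduced with file modes. The following is a minimal sketch under stated assumptions: the helper names database_command and run_database_command are hypothetical, misearch.jar is assumed to be in the current directory, and the argument lists simply mirror the two commands shown above.

```python
import subprocess
from typing import List, Tuple


def database_command(source_file: str, append: bool = False,
                     jar: str = "misearch.jar") -> Tuple[List[str], str]:
    """Return the argument list and the file mode for the output redirection."""
    command = ["java", "-jar", jar, "-f", source_file]
    if not append:
        # Mirrors the commands shown above: only the creation command carries -create.
        command.append("-create")
    # Mode "w" plays the role of the shell's > and mode "a" the role of >>.
    return command, ("a" if append else "w")


def run_database_command(source_file: str, database: str, append: bool = False) -> None:
    """Run misearch and send its standard output to the database file."""
    command, mode = database_command(source_file, append)
    with open(database, mode) as out:
        # check=True raises CalledProcessError if java exits with a nonzero status.
        subprocess.run(command, stdout=out, check=True)
```

Calling run_database_command("source_file", "database") corresponds to the creation command, and passing append=True corresponds to the >> variant.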
Once a database is created, searches are run with the command:
java -jar misearch.jar -db database -search_type -smi 'target' [-options] > resultFile
where
database
is the name of the database file
-search_type
is the type of search; currently 4 search types are available:
-simisearch
similarity search
-sssearch
substructure search
-exactsearch
exact molecule search
-namesearch
text search in the name field
target
is a target molecule in SMILES or SMARTS format, or target text for text searches
possible options are:
-jme
hits are provided not as SMILES but as JME strings (a molecule encoded as a JME string may be displayed by the JSME JavaScript editor).
-slimit value
sets the minimum required similarity in a similarity search (default value is 0.65).
-nhits n
limits the number of hits. In substructure and name searches, the search is terminated when the hit list reaches this limit. In similarity searches, the n most similar molecules are found and sent to the output ordered by their similarity value.
-skip n
skips n molecules at the beginning of the database file. May be used to continue a search after reaching the hit limit in a previous search.
A search command may look, for example, like this:
java -jar misearch.jar -db nci -sssearch -smi 'c1cccccn1=O' > out
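For scripted searches, the command line can be assembled as an argument list, which avoids shell quoting problems with SMILES characters. The sketch below is an illustration only: the function name search_command is hypothetical, the order in which options are appended is an assumption, and the checks encode the rules stated in this documentation (name searches take their query through -text, and -slimit cannot be combined with -nhits).

```python
from typing import List, Optional

SEARCH_TYPES = ("simisearch", "sssearch", "exactsearch", "namesearch")


def search_command(database: str, search_type: str, target: str, *,
                   jme: bool = False, slimit: Optional[float] = None,
                   nhits: Optional[int] = None, skip: Optional[int] = None,
                   jar: str = "misearch.jar") -> List[str]:
    """Build the argument list for one misearch database search."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search type: {search_type!r}")
    if slimit is not None and nhits is not None:
        raise ValueError("-slimit cannot be used together with -nhits")
    # Name searches take their query through -text; the other types use -smi.
    target_flag = "-text" if search_type == "namesearch" else "-smi"
    command = ["java", "-jar", jar, "-db", database,
               "-" + search_type, target_flag, target]
    if jme:
        command.append("-jme")
    if slimit is not None:
        command += ["-slimit", str(slimit)]
    if nhits is not None:
        command += ["-nhits", str(nhits)]
    if skip is not None:
        command += ["-skip", str(skip)]
    return command
```

The resulting list can be passed to subprocess.run together with stdout set to an open result file, which plays the role of the > redirection.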
Output of the search is sent to the standard output, which may be redirected to any file. Hits are written one molecule per line, starting with the molecule SMILES (or the JME code if the parameter -jme is used), followed by the molecule identifier / name. In the case of similarity searches, the similarity to the target molecule is also provided. If the original entry contained any additional data, all these data are added to the end of the line. The entries in the line are separated by tabs.
When the search is finished, the number of hits and time required are sent to the standard error.
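A result file in this layout can be read back with a few lines of Python. This is a sketch under a loudly stated assumption: the text above does not say in which column the similarity value appears, so the code assumes it is the third tab-separated field, directly after the identifier. The names Hit and parse_hit are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    structure: str               # SMILES, or JME code when -jme was used
    identifier: str              # molecule identifier / name
    similarity: Optional[float]  # only set for similarity searches
    extra: List[str]             # additional data carried over from the source file


def parse_hit(line: str, similarity_search: bool = False) -> Hit:
    """Split one tab-separated hit line into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least structure and identifier: {line!r}")
    structure, identifier, rest = fields[0], fields[1], fields[2:]
    similarity = None
    if similarity_search:
        # ASSUMPTION: the similarity value is the third column.
        if not rest:
            raise ValueError(f"no similarity column found: {line!r}")
        similarity = float(rest[0])
        rest = rest[1:]
    return Hit(structure, identifier, similarity, rest)
```

If the actual column order differs, only the block marked as an assumption needs to change.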
It is also possible to perform searches directly on a file containing molecules encoded as SMILES (one molecule per line, tab separated from other data items) instead of on a database file as described previously. The search is not as fast, but this option is recommended when only a few searches are needed and creating a database would be overkill.
For this type of search, use the -f parameter instead of the -db parameter, like:
java -jar misearch.jar -f smiles_file -sssearch -smi 'c1cccccn1=O' > out
Substructure search finds molecules containing the given substructure. Substructure queries may be submitted as SMILES or SMARTS (in the latter case use the keyword -smarts instead of -smi). SMARTS syntax allows the specification of complex substructure queries. The complete SMARTS specification is implemented, including recursive SMARTS; only stereo SMARTS queries are not supported. Be aware, however, that SMARTS searches are much slower than simple SMILES substructure searches.
Similarity search identifies the molecules most similar to the target structure. Similarity is expressed as a number between 0.0 (no similarity at all) and 1.0 (identical or very similar molecules). Molecules with a similarity greater than ca. 0.7 may be considered reasonably similar. By default, all molecules with a similarity to the target greater than 0.65 are identified and sent to the output unordered. This limit may be changed with the modifier -slimit value. When the parameter -nhits n is used, the n most similar molecules are found and sent to the output ordered by their similarity value. The -slimit parameter cannot be used together with the -nhits parameter.
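The difference between the two selection modes can be illustrated on a list of already computed similarity values. The sketch below does not reproduce how misearch calculates similarity, which is not described here; it only shows threshold filtering versus top-n selection. The function names and the sample scores are hypothetical, and returning threshold hits in database order is an assumption about what "unordered" means.

```python
import heapq
from typing import Iterable, List, Tuple

Scored = Tuple[str, float]  # (molecule identifier, similarity to the target)


def select_by_limit(scored: Iterable[Scored], slimit: float = 0.65) -> List[Scored]:
    """Keep every molecule whose similarity exceeds slimit, in input order."""
    return [(name, sim) for name, sim in scored if sim > slimit]


def select_top_n(scored: Iterable[Scored], nhits: int) -> List[Scored]:
    """Keep the nhits most similar molecules, most similar first."""
    return heapq.nlargest(nhits, scored, key=lambda item: item[1])


scores = [("mol_a", 0.90), ("mol_b", 0.50), ("mol_c", 0.70), ("mol_d", 1.00)]
print(select_by_limit(scores))   # mol_a, mol_c and mol_d, in input order
print(select_top_n(scores, 2))   # mol_d first, then mol_a
```

The threshold mode can return any number of hits, while the top-n mode always returns at most n, which is one way to see why the two parameters are not combined.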
Exact search finds molecules identical to the target SMILES. Stereochemistry is not considered, so all stereoisomers with the same basic connectivity will be found.
Name search performs a text search in the name field. When using the name search, you have to use the -text searchText parameter (instead of the -smi parameter) to define the text query. All names containing the text query are identified, so, for example, a search with 'thia' identifies 'thiazole' as well as 'benzothiazole' as hits.
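The "contains" behaviour of the name search corresponds to a plain substring test, sketched below. The function name name_matches is hypothetical, and case-sensitive matching is an assumption, since the text above does not say how upper and lower case are handled.

```python
from typing import List


def name_matches(name: str, query: str) -> bool:
    """Return True if the query text occurs anywhere in the name."""
    # ASSUMPTION: matching is case sensitive.
    return query in name


def filter_names(names: List[str], query: str) -> List[str]:
    """Return the names that contain the query text, in input order."""
    return [name for name in names if name_matches(name, query)]


print(filter_names(["thiazole", "benzothiazole", "benzene"], "thia"))
# prints ['thiazole', 'benzothiazole']
```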
For more information about the misearch database engine, or to arrange a free evaluation, contact info(at)molinspiration.com.