Source Code

The researchers, engineers, and other members of MMLI create a wide variety of source code, which is available for all to use, modify, and extend.

molli: Molecular Toolbox Library

molli is a cross-platform toolbox written in modern Python (3.10+) that provides a convenient API for molecule manipulation, combinatorial library generation with stereochemical fidelity from plain CDXML files, and a parallel computing interface. Its main feature is the full representation of molecular graphs, geometries, and geometry ensembles with no implicit atoms. Additionally, a compact and extensible format for molecular library storage makes it a useful tool for in silico library generation. molli runs on a wide range of hardware, from laptops and workstations to distributed-memory clusters.
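To make the idea of "no implicit atoms" concrete, here is a minimal sketch of a molecular graph in which every hydrogen is stored as its own node, together with an ensemble of geometries that share that graph. This is not molli's API: the class and function names (`Molecule`, `ConformerEnsemble`, `build_methane`) are hypothetical and serve only to illustrate the data model.

```python
from collections import Counter
from dataclasses import dataclass, field

Coord = tuple[float, float, float]


@dataclass
class Molecule:
    """Hypothetical molecular graph: every atom, hydrogens included, is a node."""
    name: str
    elements: list[str] = field(default_factory=list)
    bonds: list[tuple[int, int]] = field(default_factory=list)

    def add_atom(self, element: str) -> int:
        """Append an atom and return its index."""
        self.elements.append(element)
        return len(self.elements) - 1

    def add_bond(self, i: int, j: int) -> None:
        """Connect two distinct, existing atoms."""
        n = len(self.elements)
        if i == j or not (0 <= i < n and 0 <= j < n):
            raise ValueError(f"invalid bond ({i}, {j})")
        self.bonds.append((i, j))

    def neighbors(self, i: int) -> list[int]:
        """Indices of all atoms bonded to atom i."""
        return [b if a == i else a for a, b in self.bonds if i in (a, b)]

    def formula(self) -> dict[str, int]:
        """Element counts taken directly from the explicit atom list."""
        return dict(Counter(self.elements))


@dataclass
class ConformerEnsemble:
    """Hypothetical ensemble: one graph, many coordinate sets."""
    mol: Molecule
    conformers: list[list[Coord]] = field(default_factory=list)

    def add_conformer(self, coords: list[Coord]) -> None:
        """Store a geometry; it must supply one coordinate per atom."""
        if len(coords) != len(self.mol.elements):
            raise ValueError("conformer must have one coordinate per atom")
        self.conformers.append(list(coords))


def build_methane() -> Molecule:
    """Methane as five explicit atoms and four explicit C-H bonds."""
    mol = Molecule("methane")
    carbon = mol.add_atom("C")
    for _ in range(4):
        hydrogen = mol.add_atom("H")
        mol.add_bond(carbon, hydrogen)
    return mol
```

Because nothing is inferred from valence rules, the formula and the neighbor lists come straight from the stored atoms and bonds, and every conformer in an ensemble is checked against the same atom count.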


Reaction Miner

Reaction Miner is a system that extracts precise, extensive information about chemical reactions from scientific literature provided as PDF files. A text segmentation module ensures that the refined text captures complete chemical reactions, which improves extraction accuracy. Reaction Miner also broadens the scope of existing pre-defined reaction roles to include vital attributes that were previously neglected, thereby offering a more comprehensive depiction of chemical reactions.

Closed-Loop Transfer

Closed-Loop Transfer (CLT) is an AI-guided method for optimization campaigns. It consists of three phases: (1) Bayesian optimization in a closed-loop process with automated modular small-molecule synthesis and multidimensional characterization; (2) comprehensive whole-molecule DFT calculations and feature selection with physics-informed supervised learning models to extract physical insights; and (3) statistical validation of the physical insights and understanding across the entire chemical search space.
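The closed-loop idea in phase (1) can be illustrated with a toy propose, measure, update cycle. The sketch below is not the CLT implementation: it assumes a one-dimensional candidate list, a small Gaussian-process surrogate with a squared-exponential kernel, and an upper-confidence-bound acquisition rule. The `measure` function is a hypothetical stand-in for automated synthesis and characterization, and all parameter values are illustrative.

```python
import math


def rbf(a: float, b: float, length: float = 0.2) -> float:
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))


def solve(A: list[list[float]], b: list[float]) -> list[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))


def measure(x: float) -> float:
    """Hypothetical experiment: stands in for synthesis plus characterization."""
    return -(x - 0.7) ** 2


def closed_loop(candidates, n_rounds=8, kappa=2.0):
    """Run the loop: fit surrogate, pick the best-scoring untested candidate, measure it."""
    tested = [candidates[0]]
    results = [measure(candidates[0])]
    for _ in range(n_rounds):
        untested = [c for c in candidates if c not in tested]
        if not untested:
            break

        def ucb(c):
            mean, std = gp_posterior(tested, results, c)
            return mean + kappa * std  # trade off predicted value and uncertainty

        choice = max(untested, key=ucb)
        tested.append(choice)
        results.append(measure(choice))
    best = max(range(len(tested)), key=lambda i: results[i])
    return tested[best], results[best], tested
```

Calling `closed_loop([i / 20 for i in range(21)])` returns the best candidate found, its measured value, and the order in which candidates were tested. In a real campaign the measurement step would be an automated experiment, and the surrogate and acquisition rule would be chosen to suit the chemistry.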

Additional Repositories

  • Anisotropic_CG_ConjPoly_Model: LAMMPS implementation of the anisotropic coarse-grained model for conjugated polymers.
  • Atropselective-Iodination: Generation of an in silico library of DSI catalysts and validation of catalyst selection by committee.
  • Bio-Domain-Transfer: A model to transfer the knowledge from the source domain to the target domain, but, at the same time, to project the source entities and target entities into separate regions of the feature space, thus diminishing the risk of mislabeling the source entities as the target entities.
  • BOX_VMA: Programs for BOX library construction and ligand selection.
  • Chem-FINESE: Validating fine-grained few-shot entity extraction through text reconstruction.
  • Chem Reasoner: Monte Carlo Thought Search: LLM querying for complex scientific reasoning in catalyst design.
  • chemscraper-frontend: AlphaSynthesis web tool for ChemScraper.
  • chemscraper-helm-chart: Helm chart for running ChemScraper backend services in Kubernetes.
  • clean-frontend: AlphaSynthesis web tool for CLEAN.
  • digital-molecule-maker: The Digital Molecule Maker web frontend.
  • ECG_ActiveLearning: Dataset and code for “Coarse-Grained Density Functional Theory Predictions via Deep Kernel Learning.”
  • gECG_thiophene: Generalized RCG for thiophene polymers.
  • Graphics Extraction Pipeline: A tool for extracting mathematical and chemical formulas from PDF files. This is the core of the ChemScraper method.
  • ILCiteR: Engine to recommend papers to cite for a query grounded on similar evidence spans extracted from the existing research literature.
  • irproc: IR spectra batch processing and plotting.
  • KDBNet: Deep learning algorithm that incorporates 3D protein and molecule structure data to predict binding affinities.
  • L+M-24 Dataset: Resources to create, evaluate, and benchmark models for the L+M-24 Dataset, which consists of molecule-language pairs.
  • Lucid_Somnambulist: Active learning applied to Pd-catalyzed C-N couplings.
  • MAE-LM: Masked Language Modeling pretraining allocates some model dimensions exclusively for representing [MASK] tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model’s expressiveness when it is adapted to downstream data without [MASK] tokens.
  • mmli-backend: Web backend for AlphaSynthesis job execution and related functions.
  • molli-frontend: AlphaSynthesis web tool for Molli.
  • novoStoic2.0: Integrated pathway design tool with thermodynamic considerations and enzyme selection.
  • novostoic-frontend: AlphaSynthesis web tool for NovoStoic 2.0.
  • On-Demand IE: Extract user-requested information from text and return as structured data.
  • PIEClass: Weakly-supervised text classification with prompting and noise-robust iterative ensemble training.
  • Pre-Trained Language Models Tutorial: Outline, slides, references, and software links for a three-hour tutorial.
  • SEType: Seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types.
  • ShareGPT Investigation: Comparison of real-world user queries and established NLP benchmarks.
  • somn-frontend: AlphaSynthesis web tool for Lucid Somnambulist.
  • ToTER: Topical Taxonomy Enhanced Retrieval framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts.

GitHub and GitLab Organizations