ChemDFM: Dialogue Foundation Model for Chemistry

Zihan Zhao^*, Da Ma^*, Lu Chen^†, Liangtai Sun, Zihao Li, Hongshen Xu, Zichen Zhu,
Su Zhu, Shuai Fan, Guodong Shen, Xin Chen, Kai Yu^†

X-LANCE Lab, Department of Computer Science and Engineering
MoE Key Lab of Artificial Intelligence, SJTU AI Institute
Shanghai Jiao Tong University, Shanghai, China
Suzhou Laboratory, Suzhou, China
Shanghai Key Laboratory for Molecular Engineering of Chiral Drugs
School of Chemistry and Chemical Engineering
Shanghai Jiao Tong University, Shanghai, China
AI Speech Co., Ltd., Suzhou, China
*Equal contribution
zhao_mengxin@sjtu.edu.cn, chenlusz@sjtu.edu.cn, kai.yu@sjtu.edu.cn

Large language models (LLMs) have achieved great success in general-domain natural language processing. Their emergent task generalization and free-form dialogue capabilities can greatly help in building Chemical General Intelligence (CGI) that assists real-world research in chemistry. However, chemistry relies on specialized language and knowledge, such as the highly informative SMILES notation, which hinders the performance of general-domain LLMs in this field. To this end, we develop ChemDFM, a pioneering LLM toward CGI. ChemDFM-13B is trained on 34B tokens from chemical literature, textbooks, and instructions, as well as various data from the general domain. As a result, it can store, understand, and reason over chemical knowledge and languages while retaining advanced free-form language comprehension capabilities. Extensive quantitative evaluation shows that ChemDFM significantly outperforms representative open-source LLMs. Moreover, ChemDFM surpasses GPT-4 on a large portion of chemical tasks, despite the significant size difference. Further qualitative evaluations demonstrate the efficiency and effectiveness of ChemDFM in real-world research scenarios.
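To illustrate why SMILES counts as a specialized language, the sketch below splits a SMILES string into chemically meaningful tokens (atoms, bonds, branches, ring-closure labels). It is an illustrative example only: the regular expression is a simplified assumption covering common SMILES symbols, and the function name `tokenize_smiles` is hypothetical. It does not describe how ChemDFM itself processes its input.

```python
import re

# Simplified, assumed token pattern for common SMILES symbols.
# This is an illustration, not ChemDFM's tokenizer.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [Na+] or [C@@H]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|[BCNOSPFI]"           # one-letter aliphatic atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[()=#\-+\\/:.~@*$]"   # branches, bonds, and other symbols
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If the tokens do not reassemble into the input, some character was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin: a benzene ring (c1ccccc1) carrying an ester and a carboxylic acid.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
    # Two-letter atoms must stay together: Cl and Br, not C + l or B + r.
    print(tokenize_smiles("ClCCBr"))
```

Even in this short string, parentheses encode branching, digits encode ring closure, and letter case distinguishes aromatic from aliphatic atoms, none of which follows the conventions of ordinary text.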

Objective Evaluation

► ChemDFM-13B greatly outperforms representative open-source LLMs on typical chemical tasks.
► ChemDFM-13B achieves performance comparable to or even better than GPT-4 on a large portion of chemical tasks, despite the significant size difference.

Human Evaluation: Question Answering

We mark the correct and relevant information in the replies in green, the correct but irrelevant information in yellow, and the wrong information in red. In addition, the key points of the answer are marked in bold if they appear in the reply.
The results show that while open-source LLMs perform well when asked about existing knowledge (Q1), only ChemDFM provides correct and comprehensive answers when the questions involve new molecules and reactions (Q2 [Yin et al., 2023] and Q3 [Dargo et al., 2023]).

Human Evaluation: Dialogue

The above conversation is also inspired by Yin et al. [2023]. In the dialogue, ChemDFM shows promising capabilities in error correction (Round 2) and detailing (Round 3) when handling real-world research scenarios.

BibTeX

@misc{zhao2024chemdfm,
    title={ChemDFM: Dialogue Foundation Model for Chemistry},
    author={Zihan Zhao and Da Ma and Lu Chen and Liangtai Sun and Zihao Li and Hongshen Xu and Zichen Zhu and Su Zhu and Shuai Fan and Guodong Shen and Xin Chen and Kai Yu},
    year={2024},
    eprint={2401.14818},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}