Text mining facilitates materials discovery

Computer algorithms can be used to analyse text to find semantic relationships between words without human input. This method has now been adopted to identify unreported properties of materials in scientific papers.


The total number of materials that can potentially be made — sometimes referred to as materials space — is vast, because there are countless combinations of components and structures from which materials can be fabricated. The accumulation of experimental data that represent pockets of this space has created a foundation for the emerging field of materials informatics, which integrates high-throughput experiments, computations and data-driven methods into a tight feedback loop that enables rational materials design. Writing in Nature, Tshitoyan et al.1 report that knowledge of materials science ‘hidden’ in the text of published papers can be mined effectively by computers without any guidance from humans.


The discovery of materials that have a particular set of properties has always been a serendipitous process requiring extensive experimentation — a combination of craft and science practised by knowledgeable artisans. However, this trial-and-error approach is expensive and inefficient. There is therefore great interest in using machine learning to make materials discovery more efficient.

Currently, most machine-learning applications aim to find an empirical function that maps input data (for example, parameters that define a material’s composition) to a known output (such as measured physical or electronic properties). The empirical function can then be used to predict the property of interest for new input data. This approach is said to be supervised, because the process of learning from the training data is akin to a teacher supervising students by selecting the subjects and facts needed for a particular lesson. A contrasting approach involves using only input data, which have no obvious connection to a specific output. In this case, the goal is to identify intrinsic patterns in the data, which are then used to classify those data. Such an approach is called unsupervised learning, because there are no a priori correct answers and there is no teacher.
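The contrast between the two approaches can be sketched with toy numerical data (all values below are invented for illustration, not taken from the study): a supervised least-squares fit maps composition-like inputs to a known measured property, while an unsupervised clustering step groups the same inputs using no labels at all.

```python
import numpy as np

# --- Supervised: fit an empirical function from inputs to known outputs ---
# Toy data: each row is a hypothetical two-parameter material composition;
# y holds a "measured" property for each material (made-up numbers).
X = np.array([[0.1, 0.3], [0.4, 0.1], [0.5, 0.5], [0.8, 0.2]])
y = np.array([1.0, 1.4, 2.0, 2.3])

# Least-squares linear fit: find coefficients such that X @ w + b ~ y.
A = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(x):
    """Predicted property for a new composition vector x."""
    return np.append(x, 1.0) @ coef

# --- Unsupervised: find intrinsic structure with no labelled output ---
# One assignment step of k-means (k = 2): each point joins its nearest centroid.
centroids = X[[0, 3]]                           # arbitrary initial guesses
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)                   # cluster index per material
```

The supervised fit can then score unseen compositions, whereas the cluster labels emerge purely from the geometry of the inputs, with no "correct answers" supplied in advance.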

Tshitoyan and colleagues collected 3.3 million abstracts from papers published in the fields of materials science, physics and chemistry between 1922 and 2018. These abstracts were processed and curated, for example to remove text that was not in English and to exclude abstracts that had unsuitable metadata types, such as ‘Erratum’ or ‘Memorial’. This left 1.5 million abstracts, which were written using a vocabulary of about 500,000 words.
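A metadata-based curation step of this kind might look like the following sketch; the field names and example records are assumptions for illustration, not the authors' actual schema or data.

```python
# Hypothetical abstract records; only the excluded types 'Erratum' and
# 'Memorial' come from the study's description, the rest is invented.
abstracts = [
    {"text": "We report a new thermoelectric material ...", "type": "Article"},
    {"text": "Erratum to: ...", "type": "Erratum"},
    {"text": "In memory of ...", "type": "Memorial"},
]

EXCLUDED_TYPES = {"Erratum", "Memorial"}

# Keep only abstracts whose metadata type is suitable for training.
kept = [a for a in abstracts if a["type"] not in EXCLUDED_TYPES]
```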

The authors then analysed the curated text using an unsupervised machine-learning algorithm known as Word2vec2, which was developed to enable computers to process text and natural language. Word2vec takes a large body of text and passes it through an artificial neural network (a type of machine-learning algorithm) to map each word in the vocabulary to a numeric vector, each of which typically has several hundred dimensions. The resulting word vectors are called embeddings, and are used to position each word, represented as a data point, in a multidimensional space that represents the vocabulary. Words that share common meanings form clusters within that space. Word2vec can therefore make accurate estimates about the meaning of words, or about the functional relationships between them, on the basis of the patterns of usage of the words in the original text. Importantly, these meanings and relationships are not explicitly encoded by humans, but are learnt in an unsupervised way from the analysed text.
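The raw material for such training can be illustrated with a minimal sketch of the (centre, context) word pairs that a skip-gram-style Word2vec model learns from; the sentence and window size here are invented for illustration, and a real pipeline would also tokenize and normalize the text.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (centre, context) pairs within a fixed window of each word."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

tokens = "thermoelectric materials convert heat into electricity".split()
pairs = skipgram_pairs(tokens, window=2)
```

The neural network is trained to predict context words from centre words (or vice versa); the word vectors are the network's learned internal representation, so words used in similar contexts end up with similar vectors.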


The researchers found that the word embeddings obtained for materials-science terms produced word associations that reflect rules of chemistry, even though the algorithm did not use any specific labels to identify or interpret chemical concepts. When combined using various mathematical operations, the embeddings identified word associations that corresponded to concepts such as ‘chemical elements’, ‘oxides’, ‘crystal structures’, and so on. The embeddings also identified clusters of known materials (Fig. 1) corresponding to categorizations that could be used to classify new materials made in the future.
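One such mathematical operation is vector arithmetic followed by a nearest-neighbour search in the embedding space. The sketch below uses hand-made three-dimensional vectors contrived so that the element-to-oxide analogy works exactly; real embeddings are learned, have hundreds of dimensions, and exhibit this structure only approximately.

```python
import numpy as np

# Toy "embeddings" built so each oxide sits at a constant offset (+1 in the
# second dimension) from its parent element. Invented for illustration.
emb = {
    "Zr":   np.array([1.0, 0.0, 0.2]),
    "ZrO2": np.array([1.0, 1.0, 0.2]),
    "Ti":   np.array([0.0, 0.0, 0.9]),
    "TiO2": np.array([0.0, 1.0, 0.9]),
    "Hf":   np.array([0.5, 0.0, 0.5]),
    "HfO2": np.array([0.5, 1.0, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Analogy by vector arithmetic: ZrO2 - Zr + Ti should land near "oxide of Ti".
query = emb["ZrO2"] - emb["Zr"] + emb["Ti"]
candidates = (w for w in emb if w not in {"ZrO2", "Zr", "Ti"})
best = max(candidates, key=lambda w: cosine(emb[w], query))
```

The same arithmetic-plus-nearest-neighbour recipe, applied to embeddings trained on the abstracts, is what surfaces chemically meaningful associations without any human-encoded rules.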


For more information, see https://www.nature.com/articles/d41586-019-01978-x