Research Article

Feature importance-based interpretation of UMAP-visualized polymer space

Corresponding Author

Takuya Ehiro

ehirot@orist.jp

orcid.org/0000-0001-9512-3238

Research Division of Polymer Functional Materials, Osaka Research Institute of Industrial Science and Technology, 2-7-1 Ayumino, Izumi, Osaka, 594-1157 Japan

First published: 22 May 2023

https://doi.org/10.1002/minf.202300061

Abstract

Dimensionality reduction (DR) techniques are used for various purposes, such as exploratory data analysis. Principal component analysis (PCA) is one of the most popular linear DR techniques. Owing to its linear nature, PCA enables the determination of axes in a low-dimensional space and the calculation of the corresponding loading vectors. However, PCA cannot necessarily extract important features of non-linearly distributed data. This study presents a technique to aid the interpretation of data reduced through non-linear DR methods. In the proposed method, data reduced by a non-linear DR method were clustered via a density-based clustering method. Thereafter, random forest (RF) classifiers were trained to classify the obtained cluster labels. The feature importance (FI) of the RF classifiers and Spearman's rank correlation coefficients between the predicted probabilities of the obtained clusters and the original feature values were then used to characterize the visualized, dimensionally reduced data. The results showed that the proposed method can provide interpretable FI-based images of the handwritten digits dataset. The proposed method was also applied to a polymer dataset, for which incorporating signed FI was advantageous in achieving a meaningful interpretation. Furthermore, Gaussian process regression was used to produce intuitive FI-based heatmaps on the 2-dimensional space. To enhance the interpretability of the obtained clusters, a feature selection technique called Boruta was applied; it worked effectively, allowing the obtained clusters to be interpreted with a limited set of commonly important features. The study also suggested that computing FI solely from substructure-based descriptors could further enhance the interpretability of the results. Finally, the automation of the proposed method was investigated: by maximizing a target score based on the quality of both the DR and the clustering, indicative results were obtained automatically for both the handwritten digits and polymer datasets.
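The core of the workflow summarized in the abstract (non-linear DR, density-based clustering, RF classification of the cluster labels, and signing of FI by Spearman's rank correlation) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: scikit-learn's t-SNE stands in for UMAP, DBSCAN stands in for the density-based clustering method, the data are synthetic (three groups separated along the first four features, plus six noise features), one one-vs-rest RF is fitted per cluster, and signed FI is taken as the sign of the Spearman coefficient multiplied by the RF importance.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three groups that differ only in
# the first 4 features; the remaining 6 features are pure noise.
centers = np.array([[-6, -6, 0, 0], [6, 0, 6, 0], [0, 6, -6, 6]], dtype=float)
X_inf, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0,
                      random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X_inf, rng.normal(size=(300, 6))])

# 1) Non-linear DR to 2-D (t-SNE used here in place of UMAP).
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)

# 2) Density-based clustering of the standardized embedding.
#    DBSCAN marks points it cannot assign as noise (label -1); drop them.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(
    StandardScaler().fit_transform(emb))
keep = labels != -1
X_kept, labels_kept = X[keep], labels[keep]
cluster_ids = np.unique(labels_kept)
n_clusters = len(cluster_ids)  # the loop below assumes at least 2 clusters

# 3) One-vs-rest RF per cluster on the ORIGINAL features, then sign each
#    importance by the Spearman correlation between the (in-sample)
#    predicted cluster probability and the feature values.
signed_fi = np.zeros((n_clusters, X.shape[1]))
for row, k in enumerate(cluster_ids):
    target = (labels_kept == k).astype(int)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_kept, target)
    proba_k = rf.predict_proba(X_kept)[:, 1]
    for j in range(X.shape[1]):
        rho, _ = spearmanr(proba_k, X_kept[:, j])
        signed_fi[row, j] = np.sign(np.nan_to_num(rho)) * rf.feature_importances_[j]

# Report the three features with the largest |signed FI| per cluster.
for row, k in enumerate(cluster_ids):
    top = np.argsort(-np.abs(signed_fi[row]))[:3]
    print(f"cluster {k}: " +
          ", ".join(f"x{j} ({signed_fi[row, j]:+.2f})" for j in top))
```

In this sketch a positive value means that high values of a feature go along with a high predicted probability of belonging to the cluster, and a negative value means the opposite. The `eps` and `min_samples` settings are tuned to the synthetic data only and would need to be re-chosen for any real embedding.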

Conflict of interest

None declared.

Volume 42, Issue 8-9

August 2023

2300061

Publication History

Issue Online: 13 September 2023
Version of Record online: 16 June 2023
Accepted manuscript online: 22 May 2023
Manuscript accepted: 19 May 2023
Manuscript revised: 11 May 2023
Manuscript received: 12 March 2023
