Machine learning for metadata standardization

Craven Group meeting

April 12, 2021

Yuriy Sverchkov

Metadata standardization

We often encounter databases whose items are described by unstructured or semi-structured metadata.

Example: Sequence Read Archive

"age": "24",
"cell_type": "dermal fibroblast",
"isolate": "not applicable",
"organism": "Homo sapiens",
"sampling site": "shoulder",
"sex": "female",
"stimulus": "UV exposed"

"gender": "female",
"individual": "patient2",
"source_name": "kidney",
"tissue": "tumor"

Standardization task

Key | Value
---|---
gender | female
individual | patient2
source_name | kidney
tissue | tumor

Variable-length text input

Variable-length ontology+relationship output

Approach

Cast as classification

  • Domain-specific
  • Rule-based
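One way to picture "cast as classification" is as a multilabel problem in which every ontology term is its own yes/no label. The sketch below is illustrative only: the function name `to_indicator_rows`, the two annotated samples, and the use of plain term names in place of ontology IDs are all assumptions, not details from the talk.

```python
from typing import Iterable, List, Set, Tuple


def to_indicator_rows(term_sets: Iterable[Set[str]]) -> Tuple[List[str], List[List[int]]]:
    """Encode per-sample term sets as 0/1 rows over a shared, sorted label list."""
    term_sets = list(term_sets)
    labels = sorted(set().union(*term_sets))
    column = {label: i for i, label in enumerate(labels)}
    rows = []
    for terms in term_sets:
        row = [0] * len(labels)
        for term in terms:
            row[column[term]] = 1
        rows.append(row)
    return labels, rows


# Hypothetical annotations for two samples (term names stand in for ontology IDs).
annotations = [
    {"female", "kidney", "tumor"},
    {"female", "dermal fibroblast"},
]
labels, rows = to_indicator_rows(annotations)
for row in rows:
    print(dict(zip(labels, row)))
```

Each column can then be handled by its own binary classifier, or all columns by a single multilabel model; which of these the talk used is not stated here.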

MetaSRA text reasoning graph

We identify candidate terms using a text reasoning graph with a domain-agnostic set of rules
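As a rough illustration of candidate-term identification, the sketch below matches word n-grams from metadata values against a small vocabulary. It is a deliberately simplified stand-in, not the text reasoning graph itself: the vocabulary, the `TOY:` identifiers, and the function names are hypothetical, and the only "rules" are lowercasing and treating underscores as spaces.

```python
from typing import Dict, Iterator, List, Set


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous run of 1..max_n tokens, joined by spaces."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def candidate_terms(metadata: Dict[str, str], vocabulary: Dict[str, str], max_n: int = 3) -> Set[str]:
    """Return IDs of vocabulary terms whose names occur as n-grams in any metadata value."""
    found = set()
    for value in metadata.values():
        tokens = value.lower().replace("_", " ").split()
        for gram in ngrams(tokens, max_n):
            if gram in vocabulary:
                found.add(vocabulary[gram])
    return found


# Hypothetical vocabulary: term name -> made-up identifier (not real ontology IDs).
toy_vocabulary = {
    "female": "TOY:female",
    "kidney": "TOY:kidney",
    "tumor": "TOY:tumor",
    "dermal fibroblast": "TOY:dermal_fibroblast",
}

sample = {
    "gender": "female",
    "individual": "patient2",
    "source_name": "kidney",
    "tissue": "tumor",
}
print(sorted(candidate_terms(sample, toy_vocabulary)))
```

Exact lookup like this misses synonyms, misspellings, and values that only make sense together with their key, which is the kind of gap a richer set of graph rules is meant to cover.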

Feature groups extracted

Classifiers

Evaluation metrics for ontology terms

Classifier performance

Multilabel classification - all terms

Multilabel classification - most specific terms
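The difference between "all terms" and "most specific terms" can be sketched on a toy hierarchy: starting from a set of predicted terms, drop every term that is an ancestor of another predicted term. The `parents` mapping and the term names below are hypothetical, and this is only one plausible reading of "most specific", not necessarily the exact definition used in the evaluation.

```python
from typing import Dict, Iterable, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all strict ancestors of a term by walking parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def most_specific(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Keep only terms that are not an ancestor of another term in the set."""
    terms = set(terms)
    implied: Set[str] = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


# Hypothetical is-a hierarchy: child -> list of parents.
toy_parents = {
    "dermal fibroblast": ["fibroblast"],
    "fibroblast": ["cell"],
}

predicted = {"cell", "fibroblast", "dermal fibroblast", "kidney"}
print(sorted(most_specific(predicted, toy_parents)))
```

Scoring on all terms gives credit for the general ancestors as well, while scoring on the most specific terms asks whether the classifier found the finest-grained label.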