Clinical Genomics

Informative Ballot 1
HL7 V3 CG GEXPR, R1
HL7 Version 3 Standard: Clinical Genomics; Gene Expression, Release 1
Informative Ballot 1 - May 2012
Not Balloting This Cycle
HL7 V3DAM GEXPR, R1
HL7 Version 3 Domain Analysis Model: Clinical Genomics; Gene Expression, Release 1
Last Ballot: Informative Ballot 2 - May 2010
ANSI
ANSI/HL7 V3 CGPED, R1-2007
HL7 Version 3 Standard: Clinical Genomics; Pedigree, Release 1
7/5/2007
Responsible Group Clinical Genomics Work Group
HL7
Co-Chair, Facilitator & Primary Contributor Amnon Shabo (Shvo), Ph.D.
shabo@il.ibm.com
IBM Research Lab in Haifa
Co-Chair Kevin S. Hughes, M.D., FACS
KSHUGHES@PARTNERS.ORG
Massachusetts General Hospital, Partners HealthCare
Co-Chair Mollie Ullman-Cullere
MULLMANCULLERE@PARTNERS.ORG
Harvard-Partners Center for Genetics and Genomics
Co-Chair and RCRIM liaison Phil Pochon
Phil.Pochon@covance.com
Covance
Past Co-Chair Scott Whyte
scott.whyte@chw.edu
Catholic Healthcare West
Vocabulary Facilitator Joyce Hernandez
joyce_hernandez@merck.com
Merck Research Laboratories
Publishing Facilitator Grant Wood
grant.wood@imail.org
Intermountain Healthcare Clinical Genetics Institute
Meeting Minutes Marta Jaremek
Marta.Jaremek@siemens.com
Siemens AG
Project Management Facilitator Rick Haddorff
Haddorff.Richard@Mayo.edu
Mayo Clinic
Vocabulary contributor Yan Heras
Yan.Heras@imail.org
Intermountain Healthcare
Contributor Scott Bolte
Scott.Bolte@med.ge.com
GE Healthcare
Implementation guide editor Joann Larsen
Joann.Larsen@kp.org
Kaiser Permanente
Modeling and Vocabulary Contributor Perry Mar
PMAR@PARTNERS.ORG
Partners HealthCare - Clinical Informatics Research and Development
Modeling Contributor Allen Hobbs
allen.hobbs@kp.org
Kaiser Permanente
Modeling Contributor Michael Miller
Rosetta Biosoftware
Contributor Mark Sires
msires@gte.net
Sires Consulting
Modeling & Vocabulary Contributor Shosh Israel, Ph.D.
israels@hadassah.org.il
Hadassah University Hospital
Modeling & Vocabulary Contributor Chuanbo Xu
Chuanbo_Xu@perlegen.com
Perlegen
MnM Facilitator (retired) Charlie Mead, MD MSc
Booz Allen Hamilton
Storyboard Contributor Pravat Das
Mayo Clinic
Storyboard Contributor Jennifer Fostel, Ph.D.
NIH - National Center for Toxicogenomics
Storyboard Contributor Brent Gendleman
5AM Solutions
Storyboard Contributor Jim Holeman
HP Nonstop Enterprise Division
Storyboard Contributor Anajane Smith
Fred Hutchinson Cancer Research Center
Storyboard Contributor Derek Walker
Fred Hutchinson Cancer Research Center
Storyboard Contributor Sue-Jane Wang, Ph.D.
U.S. Food and Drug Administration
Storyboard Contributor Mathieu Wiepert
Mayo Clinic
Storyboard Contributor Katie Wittrup
Mayo Clinic

Content Last Edited: 2009-12-02T14:33:13



Table of Contents


Preface
    i January 2010 Ballot Cycle
    ii General Domain Overview
    iii Message Design Element Navigation
1  Overview
    1.1  Introduction & Scope
    1.2  Domain Information Models
2  Pedigree Topic
3  Genetic Variation Topic
    3.1  Introduction
    3.2  Storyboards
    3.3  Application Roles
    3.4  Trigger Events
    3.5  Refined Message Information Models
    3.6  Hierarchical Message Descriptions
    3.7  Interactions
4  Gene Expression DAM Topic
5  Quality Analysis Report Topic
6  CMETs Defined by this Domain
7  CMETs Used by this Domain
8  Interactions Annex
    8.1 By Application Role
    8.2 By Trigger Event
    8.3 By Message Type
9  Glossary

As has been done with topics in a number of other Normative domains, the Pedigree topic in the Clinical Genomics domain has been removed from the V3 Ballot Web site beginning with the January 2010 ballot. This material has been removed because a Normative version of it is available in the HL7 V3 Normative Edition, beginning with the 2008 release, and there are currently no ongoing ballots for this topic. This statement has been inserted as a placeholder to direct readers to approved versions of this domain content.

The HL7 Version 3 Normative Edition is available to HL7 members as a free download on the V3 Messaging Standard page of the HL7 web site. Non-members can purchase a copy of the Normative Edition online at the HL7 Store. For those who are not implementing and merely want to review the ballot version of this content, draft content remains available in previous ballot cycle versions of the V3 Ballot Web Site. The Version 3 Ballot Site Archive provides links to previous ballot web sites. Readers wishing to review the domain material are directed to the September 2009 ballot web site or earlier.

The Clinical Genomics SIG has thus far developed three topics: (1) Pedigree (Family History), (2) Genotype and (3) GeneticVariation. The Pedigree Topic was approved as normative in May 2007 (after being part of the Clinical Genomics DSTU package). The Genotype Topic was approved as DSTU in May 2005 and two updates have been approved since then. The current goal of the Clinical Genomics Working Group is to bring the Genotype Topic to normative as well. However, due to the broad scope of the Genotype DSTU, the decision is to progress to Normative in a stepwise approach so that each focal area of the DSTU will be balloted as a Normative Topic, containing a constrained R-MIM of the DSTU models.

The stepwise approach is based on identifying those areas in the Genotype Topic that (1) are most relevant to our stakeholders and (2) have actually been experimented with since the DSTU passed ballot. As a result of these considerations, the group chose the genetic variation area to be the first area to progress to normative. Indeed, the Clinical Genomics domain includes a new topic called "Genetic Variation" whose model is a constrained version of the Genotype models. Therefore, readers interested in standard specifications for genetic variation should use the GeneticVariation Topic and refer to the Genotype DSTU only as an overarching model describing how the various types of core genomic data could be associated with phenotypic data.

 Genetic Variation Topic ()
 
pointer GeneticVariation (POCG_RM000011UV

Since its formation, the Clinical Genomics Working Group has been developing HL7 V3 standards to enable the exchange of interrelated clinical and personalized genomic data between interested parties. In many cases the exchange of genomic data is done between disparate organizations (healthcare providers, genetic labs, research facilities, etc.) and acceptable standards are crucial for the usefulness of genomic data in healthcare practice. It is envisioned that the use of genomic data in healthcare practice will become ubiquitous.

The Clinical Genomics domain addresses requirements for the interrelation of clinical and genomic data at the individual level. Much of the genomic data is still generic; for example, the human genome is in fact the set of DNA sequences believed to be common to every human being. The vision of 'personalized medicine' is based on those correlations that make use of personal genomic data such as the SNPs (Single Nucleotide Polymorphisms) that differentiate any two persons and occur about every thousand bases. Besides normal differences, health conditions such as drug sensitivities, allergies and others could be attributed to individual SNPs or to differences in gene expression and proteomics.

The emphases of the Clinical Genomics domain are the personalization of the genomic data and the 'intelligent' linking to relevant clinical information. These links are probably the main source from which geneticists (genomicists?) and clinicians could benefit. The cases where genomic data are used in healthcare practice vary in complexity and in the extent of the data used, since the current testing methods are still very expensive and not widely used. We can see simple testing like identifying genes and mutations as well as full sequencing of alleles and the use of micro-arrays to identify the expression of a vast number of genes in each individual. Naturally, the Clinical Genomics Working Group has been focusing on tests that are routinely done in healthcare, while preparing the information infrastructure standard for more futuristic cases.

At first sight, it seems that genomic data sets are yet another type of observation. While this is of course true, there are a few characteristics that might distinguish them from typical observations such as blood pressure or potassium level:

  • The amount of data: potentially it could be the entire human genome

  • The personalization of the data is evolving as new discoveries are constantly made

  • The complexity of the data: not only the DNA sequences (...AGCT...) need to be represented, but also SNPs, annotations (automatic and manual), gene expression, protein translation, and more

  • The emerging standard formats being used by bioinformatics communities, for example: BSML (Bioinformatic Sequence Markup Language), MAGE-ML (Microarray and Gene Expression Markup Language)

  • Various standards organizations and many stakeholders are involved

  • The clinical-genomic correlations are represented in a variety of different ways depending on the point of view (clinical research, pharmaceutical or healthcare)

The core Genotype model is the GeneticLocus model. It consists of various types of genomic data relating to a specific DNA locus including sequencing, expression and proteomic data. Within the GeneticLocus model we have utilized existing bioinformatics markups to represent raw data received from genomic facilities. Examining and constraining these markups is a work in progress and thus this part of the GeneticLocus model is considered informative as well.

The FamilyHistory model is aimed at describing a patient's pedigree with genomic data and thus utilizes the GeneticLocus model (as a CMET) to carry the genomic data for the patient's relatives.

Diagram

T-POCG_DM000020UV.png
Description

Preface:

NOTE: THIS DIM IS NOT UNDER BALLOT AND CONSISTS OF PORTIONS OF THE DEPRECATED DSTU AS WELL AS NEW STRUCTURES TO ACCOMMODATE NON-LOCUS SPECIFIC DATA, ALL WITHIN AN UMBRELLA OF "GENOME" AS THE ULTIMATE ORGANIZER. THIS MODEL IS PRESENTED FOR THE PURPOSE OF INTERNAL DISCUSSION ONLY.

In previous ballot cycles, the Clinical Genomics DIM was an aggregation of two core models within the DSTU Genotype Topic (GeneticLocus and GeneticLoci), along with the Pedigree Topic model (Family History). Since then, the Pedigree Topic has been approved as normative and the DSTU Genotype Topic has been frozen and is going to be deprecated. As part of our effort to move the DSTU to Normative, the Clinical Genomics Domain has a new Topic called "Genetic Variation" which has a model that constrains the GeneticLoci and GeneticLocus models and is balloted in the normative track.

The roadmap of the Clinical Genomics Domain is to ballot more areas of the domain as normative and eventually aggregate all normative models into a revised Domain Information Model. The DIM available here contains the Genetic Variation model from September 2009 along with other portions of the core DSTU models in order to have a 'continuity of standards' in the Clinical Genomics Domain. This domain model will be continuously updated as the normative ballots of the various areas of genomics get finalized.

DIM Walk-Through

General Notes

* The Use of the 'id', 'code' and 'value' Attributes:

The use of these attributes in the various classes depends on the extent to which the data has been personalized and on how different the results are from the known genome. It also differs in those classes that encapsulate raw genomic data. For example, in the IndividualAllele class, if the patient's allele was fully sequenced and found to be slightly different from the one registered in GenBank or other reference databases, there is no external code to place in one of those attributes; rather, the IndividualAllele class is associated with the Sequence class, where the individual sequence could be placed in the value attribute. If it is indeed a new allele, temporary identifiers could be placed in the id attribute until it is registered externally. If, however, it is a known allele, then the 'value' attribute can be populated with the appropriate code, from GenBank for example. In this case there isn't much point in populating the Sequence class as the sequence can be retrieved from GenBank, but for self-containment purposes in a specific implementation, the GenBank sequence could be copied and placed in the Sequence class.

As for the 'id' attribute, it should be used to uniquely identify that specific instance (possibly using the LSID format). The 'code' attribute should identify the kind of data stored in the 'value' attribute, and the 'value' attribute should hold the actual data, for example, a characteristic (e.g., heterozygous) or an external gene code from GenBank or dbSNP. In the 'encapsulating' classes (e.g., Sequence, Expression, etc.) the 'value' attribute should hold the bioinformatics markup itself. In the latter case, the code should indicate the exact bioinformatics format used to populate the 'value' attribute.
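The id/code/value conventions described above can be illustrated with a minimal Python sketch. This is not an HL7 API or an XML ITS rendering: the class `GenomicObservation`, the function `describe_allele`, the code string "ALLELE", and the example identifiers are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GenomicObservation:
    """Hypothetical stand-in for a DIM observation class (not an HL7 API)."""
    id: str                      # uniquely identifies this instance, e.g. an LSID
    code: str                    # says what kind of data 'value' holds
    value: Optional[str] = None  # the data itself; may be left empty


def describe_allele(
    instance_id: str,
    known_code: Optional[str] = None,
    raw_markup: Optional[str] = None,
) -> Tuple[GenomicObservation, Optional[GenomicObservation]]:
    """Return (IndividualAllele, Sequence or None) per the conventions above."""
    if known_code is not None:
        # Known allele: the external code goes in 'value'. The sequence can be
        # retrieved from the reference database, so no Sequence is attached.
        return GenomicObservation(instance_id, "ALLELE", known_code), None
    # New allele: no external code exists, so 'value' stays empty and the
    # individual sequence is carried by an associated Sequence observation.
    # In an encapsulating class, 'code' names the markup format of 'value'
    # (BSML is assumed here purely as an example).
    allele = GenomicObservation(instance_id, "ALLELE")
    sequence = GenomicObservation(instance_id + ":sequence", "BSML", raw_markup)
    return allele, sequence
```

Calling `describe_allele("urn:lsid:example.org:allele:2", raw_markup="<Bsml/>")` would yield an allele with an empty value plus a Sequence whose code names the markup format, whereas passing `known_code` yields an allele carrying the external code and no Sequence.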

* Vocabularies

All vocabularies presented in the model walk-through below should be considered an informative part of this ballot document; in general, they were imported from the deprecated DSTU and are included here for illustration purposes. The ultimate goal is to have codes drawn from internationally recognized controlled vocabularies such as SNOMED and LOINC. For example, it is possible to use newly created LOINC codes in the area of genetic testing results for healthcare environments, developed within the HL7 v2 Implementation Guide effort. A few of those LOINC codes are presented below where appropriate, e.g., for DNA variation type, overall interpretation and others. The use of these value sets is further constrained in the various Topics of the Clinical Genomics Domain as well as in implementation guides for specific realms or use cases.

Genome:

The Genome class is the highest entry point in terms of the genomic data collected about an individual. Presumably, one can represent the entire genome of a patient for example in this area of the model.

  • The value attribute could hold the entire genome in a common markup like BSML or just contain a reference to the genome persisted in some other place. Its data type is ANY and should be dynamically typed in the instance according to the use of the attribute. In addition, the value attribute is optional and may be left unassigned. In this case the Genome class is used more as an entry point, and the expectation is to hold data in nested classes like Sequence (of some Genetic Locus) or in non-locus-specific placeholders (see below).
  • The code attribute could optionally identify the type of genome held in this class but no binding vocabulary has been developed thus far for this attribute.
  • The interpretation code could optionally hold a high level interpretation of all data held in the instance, whether it's held in this class or other nesting class of any model derived from this DIM. No binding vocabulary has been developed thus far for this attribute. It's more likely to see interpretation codes being populated at a lower level like the Genetic Loci or Genetic Variation classes.
  • The method code attribute could hold the method by which the genome held in the value attribute was obtained.
  • Other attributes in the Genome class should be used as implied from the general description of these attributes in the RIM.

Associations:

The Genome class has two major associations that allow for representation of locus-specific data and of other types of data that cannot be tied to specific loci. In our domain, a locus refers to a location on the genome which has the size of a gene.
  • Locus-specific data: Locus-specific data is represented by traversing to the Genetic Loci class (as many times as needed). In each traversal, a set of loci (e.g., a gene panel) could be described. The Genetic Loci portion of the model is described further on in the DIM walk through.
  • NonLocusData Choice: It is possible to traverse (as many times as needed) from the Genome class to any of the classes in the NonLocusData choice box in order to represent data that is not related to specific loci.
    • LargeDeletion - this class could hold a large deletion identified by a common representation of such deletion in bioinformatics. The value attribute will hold the actual deletion while other attributes provide metadata as defined in this walk through consistent with the RIM.
    • LargeDuplication - this class could hold a large duplication identified by a common representation of such duplication in bioinformatics. The value attribute will hold the actual duplication while other attributes provide metadata as defined in this walk through consistent with the RIM.
    • Cytogenetics - this class holds results of cytogenetics observations. The format still needs to be defined, but as in any RIM observation, the value attribute holds the actual observation while other attributes provide the metadata on that value.
    • OtherNonLocusData - this class is a catch-all for any type of non-locus data that could not be represented by any of the other classes in this choice box. The semantics of the data held by this class are controlled by the code attribute, whose binding vocabulary is under development.
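As a rough illustration of the two Genome associations, the following Python sketch models the Genome entry point with a repeatable traversal to Genetic Loci and a repeatable traversal into the NonLocusData choice. The class and field names are hypothetical simplifications chosen for this sketch; only the four choice-box class names are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# The four classes named in the NonLocusData choice box.
NON_LOCUS_KINDS = frozenset(
    {"LargeDeletion", "LargeDuplication", "Cytogenetics", "OtherNonLocusData"}
)


@dataclass
class NonLocusData:
    kind: str                   # which class of the choice box this instance is
    value: Any = None           # the actual deletion, duplication or observation
    code: Optional[str] = None  # semantics of the data (vocabulary not yet defined)

    def __post_init__(self) -> None:
        if self.kind not in NON_LOCUS_KINDS:
            raise ValueError(f"not a NonLocusData choice: {self.kind!r}")


@dataclass
class GeneticLoci:
    code: Optional[str] = None  # optional identifier of the loci set


@dataclass
class Genome:
    value: Any = None  # optional: whole-genome markup or a reference to it
    # Both associations may be traversed as many times as needed.
    genetic_loci: List[GeneticLoci] = field(default_factory=list)
    non_locus_data: List[NonLocusData] = field(default_factory=list)
```

Leaving `Genome.value` empty mirrors the case where the Genome class serves only as an entry point and the data lives in the associated classes.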

Genetic Loci:

The 'Genetic Loci' portion of the DIM allows for representation of data relating to a set of loci along a genome. The set of loci could be of different types, for example, a haplotype (allele or SNP), a genetic profile, a biological pathway, a set of genetic test results which contains results of multiple genes, etc.

Genetic Loci Walk-Through:

  • Entry Point:

    • GeneticLoci

      The entry point is a GeneticLoci class allowing the representation of the type of this loci group (e.g., allele haplotype like in tissue typing of HLA antigens) and an optional code for identifying the loci set (if available).

      GeneticLoci Attributes:

      GeneticLociAttr.gif
      • LociChoice

        A GeneticLoci instance consists of zero to many GeneticLocus CMETs or other GeneticLoci classes. This recursive structure allows the representation of a complex set of genetic loci as comprised of 'sub' genetic loci sets. The actual genetic/genomic data is represented through GeneticLocus instances at any level of nesting required.

      • AssociatedObservation

        The AssociatedObservation class associated with GeneticLoci is a place holder for various observations related to the set of loci that have been observed independently of the parent observation. This is a generic catcher for data that can be placed in any other class in this model. Population of the code & value attributes of this class is controlled by a vocabulary which is still under development.

      • Interpretation

        In cases where there is an interpretation of the entire set of loci data, it is possible to use the interpretationCode attribute of the GeneticLoci class or populate the associated Phenotype model (CMET). Examples of overall interpretation codes can be demonstrated through the following LOINC answer lists. Note that since the interpretation code is a single attribute, the context of each answer list should be represented by the code attribute. For example, in the case of the first answer list below, "Genetic disease analysis" is the type of Genetic Loci observation while any of the codes on the answer list could be a valid interpretation, i.e., the overall result of the analysis.

        LOINC "Genetic disease analysis overall interpretation" - 51968-6

        • LA6576-8: Positive
        • LA6577-6: Negative
        • LA9663-1: Inconclusive
        • LA9664-9: Failure

        LOINC "Genetic disease analysis overall carrier interpretation" - 53039-4

        • LA10314-5: Carrier
        • LA6577-6: Negative
        • LA9663-1: Inconclusive
        • LA9664-9: Failure

        LOINC "Drug efficacy analysis overall interpretation" - 51964-5

        • LA6677-4: Responsive
        • LA6676-6: Resistant
        • LA6577-6: Negative
        • LA9663-1: Inconclusive
        • LA9664-9: Failure

        LOINC "Drug metabolism analysis overall interpretation" - 51971-0

        • LA10315-2: Ultrarapid metabolizer
        • LA10316-0: Extensive metabolizer
        • LA10317-8: Intermediate metabolizer
        • LA9657-3: Poor metabolizer
      • Raw genomic data and the value attribute:

        While the value attribute typically holds a common code of this set of loci, it can also hold raw genomic data for the entire set of loci. For example, if the set of loci is a gene expression assay, the value attribute can hold the part of the MAGE-ML XML that holds data on all loci.

  • Other Classes

    • GeneticReportDocument

      A report document that summarizes the data of a genetic loci set could be associated with the GeneticLoci class and is of type DOCCLIN, which in HL7 represents a CDA type (Clinical Document Architecture). It is possible to actually embed an entire CDA document in the text attribute of the GeneticReportDocument class or just point to this document through the id attribute. Other related reports (e.g., follow-ups, addenda) could be associated as well. For more details about the use of clinical documents, please refer to the CDA and the Medical Records specification in the HL7 V3 Ballot Package.

      • sequelTo

        This is a recursive association to the GeneticReportDocument class, which allows the representation of several documents that relate to one another and, of course, to the Genetic Loci. For example, an addendum or follow-up document can be linked to the first summary.

    • Participants

      In general, a group of optional participants are associated with both the GeneticLoci and the GeneticTestOrder classes allowing the recording of participants of the order act as well as the results fulfilling that order:

      • recordTarget

        The record target indicates whose medical record holds the documentation of this act (i.e., the order or the results). This is especially important when the subject of a service is not the patient himself. Note that the subject can be overridden in certain GeneticLocus instances. This could be useful when describing, for example, genomic data relating to various specimens (healthy and tumor tissues) or relating to a virus, as part of the genetic testing of a patient who carries this virus.

      • author

        Author of the data.

      • performer

        Performer of the observations.

      • verifier

        Verifier of the observations.

      • informationRecipient

        To whom the data should be sent.
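The recursive LociChoice structure and the pairing of the code attribute with a single interpretationCode can be sketched in Python as follows. This is an illustrative simplification, not an HL7 artifact: the class layout, the field names, and the helper `collect_loci` are assumptions of this sketch, and using the LOINC question code directly as the GeneticLoci code is one possible reading of the walk-through. The two answer lists are copied from the walk-through above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

# LOINC answer lists from the walk-through, keyed by the code that sets
# the context of the interpretation.
ANSWER_LISTS: Dict[str, Dict[str, str]] = {
    "51968-6": {  # Genetic disease analysis overall interpretation
        "LA6576-8": "Positive",
        "LA6577-6": "Negative",
        "LA9663-1": "Inconclusive",
        "LA9664-9": "Failure",
    },
    "51971-0": {  # Drug metabolism analysis overall interpretation
        "LA10315-2": "Ultrarapid metabolizer",
        "LA10316-0": "Extensive metabolizer",
        "LA10317-8": "Intermediate metabolizer",
        "LA9657-3": "Poor metabolizer",
    },
}


@dataclass
class GeneticLocus:
    id: str


@dataclass
class GeneticLoci:
    code: Optional[str] = None                 # context of the interpretation
    interpretation_code: Optional[str] = None  # one answer from that context
    # LociChoice: zero to many GeneticLocus items or nested GeneticLoci sets.
    members: List[Union["GeneticLoci", GeneticLocus]] = field(default_factory=list)

    def __post_init__(self) -> None:
        allowed = ANSWER_LISTS.get(self.code) if self.code else None
        if self.interpretation_code and allowed is not None:
            if self.interpretation_code not in allowed:
                raise ValueError(
                    f"{self.interpretation_code} is not an answer for {self.code}"
                )


def collect_loci(group: GeneticLoci) -> List[GeneticLocus]:
    """Flatten a nested loci set into its GeneticLocus members, depth first."""
    found: List[GeneticLocus] = []
    for member in group.members:
        if isinstance(member, GeneticLoci):
            found.extend(collect_loci(member))
        else:
            found.append(member)
    return found
```

A loci set with code 51968-6 and interpretation LA6576-8 passes validation, while an answer from the drug metabolism list under the same code is rejected, which reflects why the code attribute must carry the context.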

Genetic Locus:

The GeneticLocus portion of the DIM describes data relating to a genetic locus, which we propose to be the basic unit of genomic information exchange in healthcare. This model is not meant to be a biological model; rather it is aimed at the needs of healthcare with the vision of personalized medicine in mind. Also it could facilitate the needs of clinical research conducted within the healthcare enterprises as well as the needs of clinical trials. This model is the result of the group effort to look for the commonalities in each genomic-oriented storyboard that we've initially explored (i.e., Tissue Typing, Cystic Fibrosis, BRCA and Pharmacogenomics). The entry class GeneticLocus might be further constrained by its main subject (e.g., human, animal, and viral) or by type of genomic data (e.g., DNA, Expression and Proteomics). Those types of constraining are presented in the various Topics of this HL7 Domain, e.g., the Genetic Variation Topic where both the Genetic Loci and Genetic Locus portions of this DIM were constrained to represent variation data.

The Genetic Locus area of the DIM evolved from the work on several use cases that involve genomic data: Tissue Typing, Cystic Fibrosis, BRCA, and Pharmacogenomics. For example, in the tissue typing use case for bone-marrow transplantation (BMT) we have identified the following: messages and documents being exchanged; tissue-typing observations, i.e., the individual tissue-typing observation and the matching observation which indicates the level of matching between two individual tissue-typing observations (e.g., patient and donor); and finally the individual genotype that describes a pair of HLA alleles. In this use case, the latter observations could be described by the Genetic Loci and Locus areas of this DIM.

Genetic Locus Main Characteristics:
  • The entry point to this area is a GeneticLocus observation which could be associated with a pair of alleles on paternal and maternal homologous chromosomes.

  • Core observations associated directly with each allele are Sequence, Sequence Variation and Expression. These core classes are also the ones which encapsulate raw genomic data.
  • In addition, sequence, sequence variation and expression data could be associated directly with the GeneticLocus observation if data is not available at the allelic level granularity.
  • Other classes hold extracted and derived data such as certain types of variations, expression main results, references to a set of loci (e.g., a haplotype that this gene belongs to), and proteomics data (e.g., determinant peptides).
  • Both the GeneticLocus and IndividualAllele observations can recurse to represent relationships to genes/alleles from other loci (e.g., in a biological pathway).
  • The Sequence class could recurse as well to allow the representation of the translational path, i.e., DNA, RNA and Protein sequences, derived from each other.
  • Entry Point and Locus/Gene/Allele Classes:

    • GeneticLocus

      Important note: The name 'GeneticLocus' refers to ALL genomic data and aspects of a specific locus along a chromosomal or mitochondrial DNA.

      The GeneticLocus class is the entry point for describing locus-level data, e.g., gene, genetic marker, small variation, etc. A genotype commonly stands for an allele pair, from paternal and maternal homologous chromosomes. However, in this model it could be that (1) only one allele is associated with the locus (in cases of insufficient data or interest in one allele only); (2) no alleles are associated with the locus, in cases where the locus' alleles have not been determined but there is a need to represent data related to the locus such as expression data, variations and even sequences; or (3) multiple alleles are available (in the case of tumor tissues where several acquired (somatic) variants are encountered).

      GeneticLocus Attributes:

      GenotypeLocusAttr.gif

      Table 1: GeneticLocus.value

      GenotypeLocus.value.gif
      • reference

        The reference class represents a significant relationship with another locus. Note that it is possible to either expand the associations of the referred allele, or just indicate its id, assuming that it is detailed elsewhere (and accessible using its id). The association class called reference has a typeCode attribute currently set to REFR but we are developing a new vocabulary with codes like FUNCTIONAL, PHYSICAL, SIGNALING, and METABOLIC_PATHWAY that will be described in subsequent ballot cycles.

      • AssociatedObservation

        The AssociatedObservation class associated with GeneticLocus is a placeholder for various observations related to a locus, for example, a Copy Number value that represents the number of copies of this gene or allele. The class has a shadow associated with all other classes, thus providing a generic mechanism to hold associated observations controlled by vocabularies. Both GeneticLocus and IndividualAllele share the same vocabulary at this point, but the vocabularies might be separated in future versions. The code attribute holds the type of Observation, e.g., COPY_NUMBER, and the value holds the actual result. Another example would be code=ZYGOSITY, where the value could then be either HOMOZYGOTE or HETEROZYGOTE. See Tables 3 and 4 for more possible codes.

        Table 3: AssociatedObservation.code

        LocusAssociatedObservation.code.gif

        Table 4 lists the vocabulary from which codes are drawn to populate the value attribute. The abstract codes in this vocabulary are the codes from table 3 and thus these two vocabularies will be maintained in synch.

        Table 4: AssociatedObservation.value

        LocusAssociatedObservation.value.gif

        A new LOINC value set is dedicated to Allelic state and can be used here:

        LOINC "Allelic state" - 53034-5

        • LA6703-8: Heteroplasmic
        • LA6704-6: Homoplasmic
        • LA6705-3: Homozygous
        • LA6706-1: Heterozygous
        • LA6707-9: Hemizygous

        Place the code 53034-5 ("Allelic state") in the code attribute and assign one of the codes in the answer list into the value attribute of the AssociatedObservation class.
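The code/value pairing for the Allelic state observation can be shown with a short Python sketch. The function name `allelic_state_observation` and the dictionary shape are hypothetical choices made for illustration; the LOINC question code and answer codes are those listed above.

```python
from typing import Dict

ALLELIC_STATE_CODE = "53034-5"  # LOINC "Allelic state"

# LOINC answer list for Allelic state, as listed in the walk-through.
ALLELIC_STATE_ANSWERS: Dict[str, str] = {
    "LA6703-8": "Heteroplasmic",
    "LA6704-6": "Homoplasmic",
    "LA6705-3": "Homozygous",
    "LA6706-1": "Heterozygous",
    "LA6707-9": "Hemizygous",
}


def allelic_state_observation(answer_code: str) -> Dict[str, str]:
    """Build a code/value pair for an AssociatedObservation (illustrative only)."""
    if answer_code not in ALLELIC_STATE_ANSWERS:
        raise ValueError(f"{answer_code} is not on the Allelic state answer list")
    return {
        "code": ALLELIC_STATE_CODE,   # identifies the kind of observation
        "value": answer_code,         # one answer from the list
        "display": ALLELIC_STATE_ANSWERS[answer_code],
    }
```

For a heterozygous finding, `allelic_state_observation("LA6706-1")` would produce a pair whose code is 53034-5 and whose value is LA6706-1.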

    • IndividualAllele

      Important note: The term 'Individual Allele' doesn't necessarily refer to a known variant of a gene (or any locus); rather, it refers to the patient data regarding the locus that might contain personal variations (e.g., rare SNPs of unknown significance). In addition, the individual allele could also be a wild type of that allele, i.e., no variations were found that could identify this allele as one of the known alleles.

      The GeneticLocus class is associated with 0 to many alleles represented by the IndividualAllele class. The IndividualAllele class identifies the specific allele instance (using the id attribute) and optionally specifies its external code (if known) and the method by which it was identified.

      IndividualAllele Attributes:

      IndividualAlleleAttr.gif

      Table 5: IndividualAllele.value:

      IndividualAllele.value.gif

      Table 6: IndividualAllele.methodCode:

      IndividualAllele.methodCode.gif
      • AssociatedObservation (SHADOW)

        This is a shadow of the AssociatedObservation class. Please refer to the description of that class in the context of GeneticLocus for more details.

      • reference

        The reference class represents a related allele of a different locus, and still has significant interrelation with the source allele (this is a recursive association of IndividualAllele). See the equivalent class in association with GeneticLocus for more details.

  • Encapsulating Classes

    • Sequence

      The Sequence class is a generalization of all types of genetic-related sequences (i.e., DNA, RNA, Protein), preferably encapsulating the raw sequencing results of the DNA, and the derived sequences of the resultant RNA and protein molecules. The Sequence class has a recursive relation that makes it possible to nest an RNA sequence within a DNA sequence, and a protein sequence within an RNA sequence. The relationship type is DRIV (derivation), as the nested Sequence classes are meant to be placeholders for sequences that were computed from the first Sequence class, which is the only "encapsulating" class in this path (by 'first' we mean the one that is associated directly with the IndividualAllele class, or with the GeneticLocus class in the case of non-allelic data sets).

      Sequence Attributes:

      SequenceAttr.gif

      Table 7: Sequence.methodCode

      Sequence.methodCode.gif
      • AssociatedProperty

        The AssociatedProperty class is a placeholder for various properties that relate to the parent class (e.g., the Sequence class), which are supposed to be extracted (bubbled-up) from the raw data encapsulated in the parent class. This class is basically a code-value pair allowing the association of multiple properties with the core observation class which sets the context of these property observations in terms of identification and time for example. See discussion about the differences between associated observations versus properties further on. Table 8 lists the vocabulary from which codes are drawn to populate the code attribute while table 9 lists the vocabulary from which codes are drawn to populate the value attribute. The two vocabularies are synchronized in the sense that table 8 codes are the abstract codes in table 9 and each of them defines the vocabulary (nested within the abstract code) used when that abstract code was selected to populate the code attribute.

        Table 8: AssociatedProperty.code (associated with Sequence)

        SequenceProperty.code.gif

        Table 9: AssociatedProperty.value (associated with Sequence)

        SequenceProperty.value.gif
    • SequenceVariation

      The class SequenceVariation is a generalization of all variation types, i.e., in all molecules (DNA, RNA, Protein) and of all types within each molecule (e.g., in DNA: SNP, Mutation, large deletion, etc.).

      SequenceVariation Attributes:

      SequenceVariationAttr.gif

      Table 10: SequenceVariation.interpretationCode

      SequenceVariation.interpretationCode.gif

      A more advanced value set for sequence variation interpretation is the following LOINC answer list. (Note that since the interpretation code is a single attribute, the context of each answer list should be represented by the SequenceVariation code attribute. See an example in the genetic loci interpretation code.)

      LOINC "Genetic disease sequence variation interpretation" - 53037-8

      • LA6668-3: Pathogenic
      • LA6669-1: Presumed pathogenic
      • LA6682-4: Unknown significance
      • LA6675-8: Benign
      • LA6674-1: Presumed benign

      Sequence variation interpretation related to drug efficacy is represented by the following LOINC answer list:

      LOINC "Drug efficacy sequence variation interpretation" - 51961-1

      • LA6676-6: Resistant
      • LA6677-4: Responsive
      • LA9660-7: Presumed resistant
      • LA9661-5: Presumed responsive
      • LA6682-4: Unknown Significance
      • LA6675-8: Benign
      • LA6674-1: Presumed Benign
      • LA9662-3: Presumed non-responsive

      Associations:

      • AssociatedProperty

        The class AssociatedProperty is a placeholder for various properties that relate to a sequence variation, for example, position, length, region, reference and more. It replaces the distinct observations we had in previous versions of the DSTU Genotype model. This class is basically a code-value pair allowing the association of multiple properties with the core variation class which sets the context of these property observations in terms of identification and time for example. See discussion about the differences between associated observations versus properties further on. Table 11 lists the vocabulary from which codes are drawn to populate the code attribute while table 12 lists the vocabulary from which codes are drawn to populate the value attribute. The two vocabularies are synchronized in the sense that table 11 codes are the abstract codes in table 12 and each of them defines the vocabulary (nested within the abstract code) used when that abstract code was selected to populate the code attribute.

        Table 11: AssociatedProperty.code (associated with SequenceVariation)

        SequenceVariationProperty.code.gif

        Table 12: AssociatedProperty.value (associated with SequenceVariation)

        SequenceVariationProperty.value.gif

        A major challenge is to accurately identify the type of sequence variation represented here. A set of DNA sequence variation types is presented here by their LOINC codes. The LOINC code that identifies the following value set is 48019-4 representing the LOINC component "DNA sequence variation type". Thus, it is sufficient to place this LOINC code in the code attribute of the AssociatedProperty class (associated with the SequenceVariation class) and one of the codes from the list below in its value attribute.

        LOINC "DNA sequence variation type" - 48019-4

        • LA9658-1: Wild type
        • LA6692-3: Deletion
        • LA6686-5: Duplication
        • LA6687-3: Insertion
        • LA6688-1: Insertion/Deletion
        • LA6689-9: Inversion

        Another example is the value set "Amino acid change type", LOINC code 48006-1. It is possible to place this LOINC code in the code attribute of the AssociatedProperty class (associated with the SequenceVariation class) and one of the codes from the list below in its value attribute.

        LOINC "Amino acid change type" - 48006-1

        • LA9658-1: Wild type
        • LA6692-3: Deletion
        • LA6686-5: Duplication
        • LA6694-9: Frameshift
        • LA6695-6: Initiating Methionine
        • LA9659-9: Insertion and Deletion
        • LA6698-0: Missense
        • LA6699-8: Nonsense
        • LA6700-4: Silent
        • LA6701-2: Stop Codon Mutation
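
        As a rough illustration of the code-value pairing described above, the following sketch checks that the value of an AssociatedProperty belongs to the answer list selected by its code. Only the "DNA sequence variation type" list is reproduced; the dictionary layout and the function name is_valid are assumptions for illustration, not part of the specification.

```python
from dataclasses import dataclass

# Answer lists keyed by the LOINC code placed in AssociatedProperty.code.
# Only the DNA sequence variation type list from the text is reproduced here.
VALUE_SETS = {
    "48019-4": {  # DNA sequence variation type
        "LA9658-1": "Wild type",
        "LA6692-3": "Deletion",
        "LA6686-5": "Duplication",
        "LA6687-3": "Insertion",
        "LA6688-1": "Insertion/Deletion",
        "LA6689-9": "Inversion",
    },
}


@dataclass(frozen=True)
class AssociatedProperty:
    code: str   # identifies the property (and thereby the value set)
    value: str  # one answer code from that value set


def is_valid(prop: AssociatedProperty) -> bool:
    """True if the value is drawn from the value set selected by the code."""
    return prop.value in VALUE_SETS.get(prop.code, {})


deletion = AssociatedProperty(code="48019-4", value="LA6692-3")
mismatch = AssociatedProperty(code="48019-4", value="LA6698-0")  # Missense: amino acid list
```

        Here is_valid(deletion) holds, while is_valid(mismatch) does not, because LA6698-0 belongs to the amino acid change list rather than to the DNA variation type list.
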
    • Expression

      The class Expression is a generalization of all types of expression data (typically DNA-->RNA but also protein). Its code attribute identifies the type of expression data it carries. This class is one of the encapsulating classes, that is, it holds in its value attribute portions of relevant bioinformatics markup (e.g., MAGE-ML for gene expression data), complying with constrained schemas of the full-fledged markups. In such a case, the code attribute holds the exact reference to the contained bioinformatics schema which the value's content should comply with. Note that the association cardinality between this class and its source class IndividualAllele is zero to many. The idea here is to be able to represent gene expression over multiple experiments for the same allele under possibly various clinical environments and expression testing methods. If this association is traversed several times then it's mandatory to populate the id & effectiveTime attributes so that each object of this class will be distinguished clearly and identified uniquely.
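
      The rule that repeated Expression objects must carry id and effectiveTime can be sketched as a simple check. The dictionary representation and the function name are hypothetical; a real implementation would operate on parsed HL7 instances.

```python
from typing import Dict, List, Optional


def expressions_are_distinguishable(
    expressions: List[Dict[str, Optional[str]]]
) -> bool:
    """Check the id/effectiveTime rule for Expression objects of one allele.

    A single Expression needs no further identification. When the association
    is traversed several times, every object must have both attributes
    populated and the ids must be unique.
    """
    if len(expressions) <= 1:
        return True
    for expr in expressions:
        if not expr.get("id") or not expr.get("effectiveTime"):
            return False
    ids = [expr["id"] for expr in expressions]
    return len(set(ids)) == len(ids)


# Two experiments on the same allele (identifiers and times are made up).
experiments = [
    {"id": "expr-1", "effectiveTime": "20120501"},
    {"id": "expr-2", "effectiveTime": "20120612"},
]
incomplete = [
    {"id": "expr-1", "effectiveTime": "20120501"},
    {"id": "expr-2", "effectiveTime": None},
]
```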

      Expression Attributes:

      ExpressionAttr.gif
      • AssociatedProperty

        The AssociatedProperty class is a placeholder for various properties that relate to expression data, for example, normalized intensity, qualitative indication, p-value and more, which are supposed to be extracted (bubbled-up) from the raw expression data encapsulated in the Expression class. This class is basically a code-value pair allowing the association of multiple properties with the core expression class which sets the context of these property observations in terms of identification and time for example. See discussion about the differences between associated observations versus properties further on. Table 13 lists the vocabulary from which codes are drawn to populate the code attribute while table 14 lists the vocabulary from which codes are drawn to populate the value attribute. The two vocabularies are synchronized in the sense that table 13 codes are the abstract codes in table 14 and each of them defines the vocabulary (nested within the abstract code) used when that abstract code was selected to populate the code attribute.

        Table 13: AssociatedProperty.code (associated with Expression)

        ExpressionProperty.code.gif

        Table 14: AssociatedProperty.value (associated with Expression)

        ExpressionProperty.value.gif

        Table 15: AssociatedProperty.methodCode (associated with Expression)

        ExpressionProperty.methodCode.gif
  • Other Classes and Proteomics

    • Polypeptide and DeterminantPeptide

      The Sequence class could be associated with its resultant or corresponding polypeptides (represented by the Polypeptide class) as well as with determinant peptides if applicable (represented by the DeterminantPeptide class). Note that the Sequence class has a recursive relation and it is possible to nest an RNA sequence within a DNA sequence, and a protein sequence within an RNA sequence. The Polypeptide could then be associated with the protein sequences or directly with any of the above levels. Also, it is possible to associate the DeterminantPeptide with the Polypeptide class or directly with the Sequence class. Both classes (Polypeptide and DeterminantPeptide) could be associated with several instances of the Phenotype model.

      The proteomics classes in this model represent protein data derived from the sequences (by means of computational biology) and are not intended to be direct observations of some protein. The latter could be represented as regular lab results (using the HL7 Lab specs), which could be referenced in the GeneticLocus instance as if they were phenotype observations.

      A common case for the use of proteomics in this model could be as follows: Checking whether an amino acid change would result from the variant; if so - whether the new amino acid change is to an amino acid of a different size or charge state that would likely change the shape of the active region of the protein; how far the change is from the active site; whether the change is in a regulator region, and so forth. These observations could then be associated to phenotypic data.

      For example, consider the following case described in OMIM: "Despite the dramatic responses to EGFR inhibitors in patients with non-small cell lung cancer, most patients ultimately have a relapse. Kobayashi et al. (2005) reported a patient with EGFR-mutant, gefitinib-responsive, advanced non-small cell lung cancer who had a relapse after 2 years of complete remission during treatment with gefitinib. The DNA sequence of the EGFR gene in his tumor biopsy specimen at relapse revealed the presence of a second mutation {131550.0006}. Structural modeling and biochemical studies showed that this second mutation led to the gefitinib resistance." (OMIM *131550)

      Polypeptide and DeterminantPeptide Attributes:

      PolypeptideAttr.gif
  • To Phenotype and Beyond...

    • Phenotype

      The Phenotype CMET is meant to complement or replace the use of the interpretation codes present in all core classes of this model. While the interpretation code attribute can hold only a single code, the Phenotype model has the full expressiveness of the Clinical Statement model. Phenotype is a separate model for modularity reasons: it is possible to make changes in this model without changing any of the Topic models derived from the DIM.

      The entry point to this model is a choice box that has two 'stub' observations targeted at distinguishing between two basic types of phenotypes: observed phenotypes and interpretive phenotypes. While the former represents observations made in the subject, the latter is an interpretation based on some evidence. To actually represent a phenotype (whether observed or interpretive), the choice box is associated with the Clinical Statement CMET so that the full expressiveness of that model is available to represent phenotypic data.

    • pertinentInformation

      This ActRelationship class (named pertinentInformation) represents the association of a genomic observation with a number of phenotypic observations. Its mandatory attribute typeCode holds the semantics of the association type. It is defined as <=PERT, which means that any code in the PERT sub-hierarchy of the HL7 ActRelationshipType Vocabulary is permitted here. Work is in progress to select appropriate codes from the HL7 ActRelationshipType Vocabulary as well as to add codes unique to genomics.
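
      The meaning of the <=PERT constraint can be sketched as a subsumption check over the vocabulary hierarchy. The hierarchy fragment below is hypothetical and deliberately tiny: only the PERT/CAUS relation is taken from this document (see the association types under Miscellaneous Issues); a real implementation would load the full HL7 ActRelationshipType Vocabulary.

```python
from typing import Dict, Optional

# Hypothetical fragment of the ActRelationshipType hierarchy: child -> parent.
# Only the fact that CAUS lies within the PERT sub-vocabulary comes from the
# text; every other code is simply absent from this illustration.
PARENT: Dict[str, str] = {"CAUS": "PERT"}


def is_within(code: str, root: str) -> bool:
    """True if code equals root or descends from it (the '<=' constraint)."""
    current: Optional[str] = code
    while current is not None:
        if current == root:
            return True
        current = PARENT.get(current)
    return False


def type_code_permitted(type_code: str) -> bool:
    """Check a pertinentInformation typeCode against the <=PERT constraint."""
    return is_within(type_code, "PERT")
```

      Under this sketch both PERT and CAUS are permitted, while a code outside the sub-hierarchy, such as COMP, is rejected.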

Miscellaneous Issues

  • Association types:

    Association types (ActRelationship typeCode) are consistent with the following principles:

    • Encapsulating classes are components (COMP typeCode) of the IndividualAllele and GeneticLocus classes, while...
    • Bubbled-up classes are derivations (DRIV typeCode) of encapsulating classes;
    • Clinical phenotypes are pertinent to (PERT typeCode) genomic classes (Note: in previous versions this was fixed to "caused by" but it was too restricted. The code CAUS is part of the PERT sub-vocabulary and thus could be still used, but other codes are available as well)
    • Non patient-specific data items are defined (INST typeCode) by classes with mood code = DEF (definitional), that is, defined and described outside of the patient medical records, in some kind of master file, dictionary, ontology, etc.;
    • Alleles in one GeneticLocus refer (REFR typeCode) to alleles in another locus. Nevertheless, we are developing a vocabulary for such types of relationship, which will be proposed for RIM harmonization as a new domain in the ActRelationshipType Vocabulary (see table 16 for a first draft).

    Table 16: reference.typeCode

    reference.typeCode.gif
  • Bioinformatics formats:

    In general, we use bioinformatics formats in the model to feature the encapsulation of raw genomic data such as sequencing, expression and proteomic data. To enable the embedding of such data accepted from labs that work with bioinformatics formats, it is possible to assign specific XML portions into the Sequence and Expression value attributes (as well as into SequenceVariation). This encapsulation of 'foreign' markup is made possible by the use of the HL7 ED (Encapsulated Data) Data Type, which is defined as follows: "ED holds data that is primarily intended for human interpretation or for further machine processing which is outside the scope of HL7. ED includes unformatted or formatted written language, multimedia data, or structured information as defined by a different standard (e.g., XML-signatures.)"

    The use of the XML bioinformatics markups is restricted: not all tags are allowed, but only a subset which relates to a specific patient and includes the information pertinent to healthcare. The restrictions on those external XML standards are specified elsewhere, but a draft of a constrained BSML schema for sequencing data is presented in Appendix C. For more details about the rationale behind this mixture of HL7 and bioinformatics markup, see the section "Coexistence of HL7 Objects and Bioinformatics Markup".

  • Validation:

    The use of external markup in HL7 messages requires that a receiver of an HL7 instance that contains a Genotype instance carry out a 'double-validation' process: the first step is to validate the instance against the HL7 message specification (of which the Genotype schema is a part), and the second step is to validate the content of those value attributes against their respective content models. The valid content models of the Sequence and Expression value attributes will be an integral part of the entire Genotype specification, but at this point they are still considered informative.
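
    The two steps can be sketched as follows. This is only an outline under stated assumptions: the outer element names (sequence, value) and the set of allowed embedded tags are hypothetical stand-ins, and simple structural checks replace real schema validation, which would use the HL7 message schema and the constrained bioinformatics schema respectively.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a constrained content model: the only tags the
# embedded markup may use. The real constraints live in a constrained schema.
ALLOWED_INNER_TAGS = {"Bsml", "Definitions", "Sequences", "Sequence", "Seq-data"}


def validate_outer(root: ET.Element) -> bool:
    """Step 1: stand-in for validation against the HL7 message specification."""
    return root.tag == "sequence" and root.find("value") is not None


def validate_inner(value_elem: ET.Element) -> bool:
    """Step 2: every embedded element must belong to the content model."""
    return all(
        elem.tag in ALLOWED_INNER_TAGS
        for embedded in value_elem   # top-level embedded elements
        for elem in embedded.iter()  # each one and all of its descendants
    )


def double_validate(xml_text: str) -> bool:
    """Run both validation steps; the second runs only if the first passes."""
    root = ET.fromstring(xml_text)
    if not validate_outer(root):
        return False
    return validate_inner(root.find("value"))


GOOD = (
    "<sequence><value><Bsml><Definitions><Sequences><Sequence>"
    "<Seq-data>ATGGCC</Seq-data>"
    "</Sequence></Sequences></Definitions></Bsml></value></sequence>"
)
# A presentation-oriented tag that the constrained model does not allow.
BAD = "<sequence><value><Bsml><Display/></Bsml></value></sequence>"
```

    With these inputs, double_validate(GOOD) passes both steps, while double_validate(BAD) passes the first step and fails the second.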

  • Associated Properties / Observations and the Harmonization Proposals:

    In early versions of the DSTU models we coped with the reluctance of both the HL7 RIM Harmonization process and the HL7 Clinical Genomics group to nail down common attributes of genomic observations through the addition of new classes and attributes to the RIM. We did so by elaborating on the SequenceVariation and Expression properties and creating two new Observation classes (SequenceVariationProperty and ExpressionProperty) to be placeholders for each of these properties. For example, the proposed 'position' property of a possible new SequenceVariation RIM class could be represented by an object of the SequenceVariationProperty class, which only has code and value attributes. The code indicates that this observation describes the position of the variation, and the value attribute holds the position itself. The assumption is that this observation is an integral part of the parent observation with the same effective time. It could be identified only by going through the source variation object. In contrast, we also had the LocusAssociatedObservation class, which is a placeholder for associated observations such as copy number, zygosity, dominancy and gene family. These observations are independent observations that do have an id, effective time and method code.

    In later versions of the DSTU, the associated properties/observation classes were consolidated into two classes: AssociatedProperty and AssociatedObservation. Instead of having specific class names (e.g., SequenceVariationProperty), all core classes now have these two generic classes coming off them. This makes the model simpler (but puts the burden on the parsing application to understand the context of each associated property/observation). The basic difference between associated properties and associated observations is that an associated property should have been (and eventually may be) part of the parent class attributes. It is an inherent part of the parent observation and thus doesn't have an id, time stamp, method, performer, etc.; it inherits all these attributes from its parent. An associated observation, in contrast, is an independent observation and a component of its parent class.
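
    The distinction can be illustrated with a small sketch. The class names follow the model, but the Python field names, the context_of helper, and the zygosity code strings are assumptions made for illustration; they are not defined by the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AssociatedProperty:
    # A bare code-value pair: no id, time or method of its own.
    code: str
    value: str


@dataclass
class AssociatedObservation:
    # An independent observation with its own identity and timing.
    id: str
    effective_time: str
    code: str
    value: str
    method_code: Optional[str] = None


@dataclass
class SequenceVariation:
    id: str
    effective_time: str
    properties: List[AssociatedProperty] = field(default_factory=list)
    observations: List[AssociatedObservation] = field(default_factory=list)

    def context_of(self, prop: AssociatedProperty) -> Tuple[str, str]:
        """A property is identified and timed only through its parent."""
        if prop not in self.properties:
            raise ValueError("property does not belong to this variation")
        return (self.id, self.effective_time)


variation = SequenceVariation(id="var-1", effective_time="20120501")
variation_type = AssociatedProperty(code="48019-4", value="LA6692-3")
variation.properties.append(variation_type)
# Hypothetical code and value strings, for illustration only.
zygosity = AssociatedObservation(
    id="obs-7", effective_time="20120503", code="zygosity", value="heterozygous"
)
variation.observations.append(zygosity)
```

    The property resolves its identification and time from the parent variation, whereas the observation keeps its own id and effective time even though it is a component of the same parent.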

Coexistence of HL7 Objects and Bioinformatics Markup:

When exploring this model one could identify the use of bioinformatics markup such as MAGE for gene expression and BSML for DNA sequencing. Also, a few of the HL7 Classes such as the property classes are overlapping the elements of the bioinformatics markup in a way that it is possible to find a SNP represented in both the Sequence class as well as in the AssociatedProperty class. The question then arises, what are the relationships of the two and how do they coexist? The following are a few points to note about that issue:

  • HL7's mission is to develop message/document specs that will be used in healthcare practice. The mission of the HL7 Clinical Genomics SIG is to develop message/document specs where genomic data are involved (e.g., Tissue Typing, Genetic Testing, and Clinical Trials).
  • Bioinformatics communities develop models/markups and are usually not ANSI-Accredited SDOs and thus cannot sanction and maintain these formats. Naturally, their orientation is more towards research and the needs of information exchange between research facilities, data mining and statistical analysis tools.
  • The HL7 Clinical Genomics SIG attempts to constrain existing bioinformatics markups and embed them in the HL7 model. This effort brings up the issue of overlap between the HL7 core Reference Information Model (RIM) classes & data types, from which we derive the Genotype model, and the embedded bioinformatics markups. One approach is not to allow overlaps. For example, if we model a SNP in an HL7 class, and also allow BSML extractions to be embedded in the Sequence class, then we should constrain the BSML extractions so that they will not hold SNP data (in the Isoform tag); rather, those parts of the BSML (if they exist) should be transformed and moved to HL7 classes.
  • Another approach is seeing the overlap as beneficial: the bioinformatics markups are a kind of raw data that might not always get into the actual HL7 Clinical Genomics instances, but rather might only be referenced from the HL7 instance as supporting evidence. When it does get into an HL7 instance, we have a blend of HL7 classes and embedded genomic markup. This situation is analogous to raw imaging data: the average referring physician doesn't dig into the dozens of CT images taken in a study, but rather looks at the already designated ROIs or just reads the radiologist's report.
  • A similar situation might exist in the Clinical Genomics specification: many clinicians might not be able to grasp the raw genomic data represented in the bioinformatics markup, but it could be helpful to have parts of it in the HL7 instances for purposes like evidence availability and machine-processability. The HL7 classes should be seen as representing the digest of the raw genomic data which is most pertinent to the healthcare practice itself. There is room here for applications that might parse the bioinformatics markup, intelligently populate the HL7 classes and then associate them with the appropriate clinical data.
  • The HL7 classes in this model have the advantage of being better tied to the other HL7 classes in the patient record (e.g., in a problem/allergy list) and thus better serve the ability to link individual genomic data to the clinical data of that individual. Note that bioinformatics models also include clinical data, so this poses an overlap problem in the other direction.
  • The integration applications that recognize bioinformatics markups could integrate various genomic sources into healthcare standards used for patient care, so there will not be one winning format necessarily. The issue should not be which bioinformatics model is the "best fit with HL7 Reference Information Model", rather, how could we develop mechanisms to digest data in various representations and link them to the HL7 RIM for the benefit of personalized medicine.
  • For example, the BSML markup is fairly simple and the HL7 SIG has done some work on constraining BSML so that it can be embedded in HL7 classes. XML samples are available to show how the BSML is embedded in an HL7 message of the tissue typing use case. The constraining process of BSML made sure that data items related to presentation elements are not included, that the instance includes one and only one patient's data, and that this patient is uniquely identified. In contrast, PharmGKB has a repository orientation like GenBank, with actual data available for retrieval. This could be a data source of generic knowledge that can be accessed by healthcare information systems. In HL7 these repositories are usually treated as controlled coding schemes. They need to be recognized by the HL7 Vocabulary Committee, and then any HL7 instance can reference terms, identifiers and data from that taxonomy.

Figure 3 shows a conceptual workflow where the above co-existence takes place and is executed step-wise. Figure 4 shows an example taken from the sequencing type of data: the most clinically-significant SNPs are bubbled-up from the raw sequencing data and associated with clinical phenotypes. For illustration of the latter scenario, see the sample on the EGFR (described in the samples appendix B), courtesy of the NHANES project carried out in the USA CDC.

Figure 3: Bubbling up the most clinically-significant portions of the raw genomic data embedded in the encapsulating objects.

Figure 4: Bubbling up the most clinically-significant SNPs of the raw sequencing data embedded in the encapsulating objects.
