rodche edited this page Mar 16, 2015 · 1 revision

Query and Export Services


Introduction

Phase 2 focuses on the development of the web service API and batch download for cpath2. The web service will provide a facility for data query and access. The batch download will provide the entire repository of pathway data in a variety of file formats.

There are two groups of users, direct and indirect. Direct users are software developers who will use this web API to help their own users (indirect users) answer biological questions. Although we are primarily concerned with direct users for immediate specification and design purposes, we must keep indirect users and their biological questions in view throughout.

cpath already provides a web API that covers an important portion of the use cases. The new web API will build on it to provide additional services and expose more BioPAX features.

Goal

Provide generic, straightforward, reliable and non-volatile access to pathway data in Pathway Commons, in a variety of formats, for software tools.

Requirements

User Requirements

  • Search by text or ID: Users can obtain the IDs of BioPAX elements by searching by text or by an external ID, such as a UniProt accession.

  • Fine-grained BioPAX access: The current cpath web API allows access to BioPAX only in a coarse-grained manner: BioPAX is served in chunks of Pathway, Physical Entity and/or Interaction. Users who require a specific individual that is not one of these "parents" must first obtain the "parent" individual, construct a BioPAX model and traverse that model on the client side. This is suboptimal and might not be possible in some scenarios. cpath2 should expose all BioPAX classes and properties in the API, allowing software to obtain exactly the portion it needs.

  • Common graph queries: Although fine-grained access allows users of the web API to traverse the BioPAX graph in any way they wish, this might not perform well for queries that require traversing multiple properties. We expect several types of such large graph searches, such as neighborhood, graph of interest or shortest path(s), to be common among use cases. cpath2 should provide these common searches as ready-made components that run on the server side.

  • Provide auto-completion: A BioPAX model returned by a query is not necessarily internally consistent. For example, a query that returns interactions by their ID does not necessarily return the participants of those interactions. Internal consistency, however, is well defined, and any BioPAX subgraph can be programmatically expanded to its minimal internally consistent subgraph. The cpath2 web API should give its users the option to auto-complete results.

  • Provide results in multiple formats/views: The current cpath API allows multiple output formats. These will be adopted and expanded by the cpath2 web API.

  • Provide batch download of data.
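The auto-completion requirement above amounts to computing a closure over the properties that a result needs in order to be internally consistent. The following is a minimal sketch in Python, assuming a toy in-memory model in which each element is a dict mapping property names to lists of referenced element IDs. The `REQUIRED_PROPERTIES` table, the `auto_complete` function and all IDs are hypothetical illustrations, not the actual BioPAX completion rules or the cpath2 implementation.

```python
from collections import deque

# Hypothetical rule table: properties that must be followed for each class
# so that a returned subgraph is internally consistent.
# These are illustrative only, not the real BioPAX completion rules.
REQUIRED_PROPERTIES = {
    "Interaction": ["participant"],
    "Protein": ["entityReference"],
}


def auto_complete(model, seed_ids):
    """Return the smallest set of IDs that contains seed_ids and is closed
    under the required properties (breadth-first expansion)."""
    completed = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        element = model[queue.popleft()]
        for prop in REQUIRED_PROPERTIES.get(element["type"], []):
            for target in element.get(prop, []):
                if target not in completed:
                    completed.add(target)
                    queue.append(target)
    return completed


# Toy model with hypothetical IDs.
model = {
    "i1": {"type": "Interaction", "participant": ["p1", "p2"]},
    "i2": {"type": "Interaction", "participant": ["p1"]},
    "p1": {"type": "Protein", "entityReference": ["r1"]},
    "p2": {"type": "Protein", "entityReference": ["r2"]},
    "r1": {"type": "ProteinReference"},
    "r2": {"type": "ProteinReference"},
}

# Asking for interaction i1 alone pulls in its participants and their references.
print(sorted(auto_complete(model, ["i1"])))
```

In this sketch the expansion only follows properties listed in the rule table, so an element that refers to nothing required (such as `r1`) completes to itself.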

System Requirements

  • Performance: We expect regular users to run queries that return several hundred nodes at most. The response time for these operations should be within several seconds.

  • Security: With the new API, users will be able to perform arbitrarily complex queries. If unmonitored, such queries can easily crash the server even without malicious intent. We need to monitor the load for each user and kill queries that take too long or use too much memory. We also need to provide an alternative for users who need to run long queries.

  • Backwards Compatibility: The web service will support all of the existing cpath web services for backwards compatibility. BioPAX output will change from Level 2 to Level 3, which breaks backwards compatibility for users of this output format. This is a direct consequence of the incompatibility between BioPAX Level 2 and Level 3 and is a general problem that we need to tackle. Users of other formats, such as SIF, will not be affected.

Unsupported/Future Requirements

  • Search for similar graphs: This is an important requirement for semantic integration, cross validation, "diff'ing" and identifying overlapping/similar portions of pathways. We have ongoing research in this direction and preliminary results from Arman's work. Unfortunately, there are still multiple issues and uncertainties preventing us from including this feature in the current plans. Schedule/Solution: We need a research person to work on this full time.

  • Get output in SBGN/SBML: Similarly, we would like to provide SBGN export in our API, but there are conversion issues we need to tackle first. Schedule/Solution: The pathway visualization problem is being tackled in parallel with the i-Vis group. This is still in very early stages. SBML export is currently possible but requires a person to work on it.

  • Provide full HQL/SQL access: This would provide a powerful facility for querying cpath2, but it has security issues associated with it.

  • Allow chained queries: An example of this would be using the results of a text search as the input set of a neighborhood search. In this version, this will require two hits to the web service API: one to get the results of the text search and one for the neighborhood query. A chained query, on the other hand, would hit the web service once, potentially increasing performance. Schedule/Solution: This is scheduled as low priority because of the uncertainty about its near-term usage and the relative complexity of the implementation.

  • Allow Logical Operators: This is related to the item above. Currently, all logical operators must be applied on the client side at the cost of performance. This can be avoided by providing logical operators in the query interface. Schedule/Solution: This is scheduled as low priority because of the uncertainty about its near-term usage and the relative complexity of the implementation.
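Until logical operators are available in the query interface, a client has to combine result sets itself. The sketch below shows one way this could look in Python, assuming each query result has already been reduced to a collection of element IDs. The `combine` helper and the example IDs are hypothetical and are not part of any cpath2 client library.

```python
def combine(operator, left, right):
    """Combine two collections of result IDs on the client side.

    operator is one of "AND", "OR" or "NOT" (left but not right).
    """
    left, right = set(left), set(right)
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    if operator == "NOT":
        return left - right
    raise ValueError(f"unsupported operator: {operator!r}")


# Hypothetical IDs, standing in for the results of two separate web service calls.
text_search_hits = ["id1", "id2", "id3"]
neighborhood_hits = ["id2", "id3", "id4"]

print(sorted(combine("AND", text_search_hits, neighborhood_hits)))
```

Each operand still costs one round trip to the web service, which is the performance penalty that server-side logical operators would remove.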

Use Cases / Biological Questions

Design

Query

See Also:

Query Input

The overall pattern for the new web API URLs is:

http://api.pathwaycommons.org/{queryname}/{parameter}/

First, we decide on the list of commands and their arguments. Web service URLs always begin with a command name (queryname), e.g., api.pathwaycommons.org/search. However, not everything is technically possible or fits nicely into this RESTful style. Any method that requires multiple arguments can accept GET requests (e.g., /search?q=aaa&type=Xref), POST requests (/search), or both; other methods (those without arguments, or with trivial arguments only) use the "pure" RESTful pattern. For example, api.pathwaycommons.org/help/types would list all BioPAX classes and /help/types/Xref/properties would list the properties of Xref; other candidates are /help/commands/ and /summary/{dataSource}.
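The two URL styles described above can be sketched with Python's standard library. The helper names `rest_url` and `query_url` are hypothetical; the base URL, commands and parameters are the ones used as examples in this section, and none of this is a confirmed client API.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the pattern given in this section.
BASE_URL = "http://api.pathwaycommons.org"


def rest_url(command, *path_segments):
    """Build a "pure" RESTful URL for commands with no or trivial arguments."""
    parts = [quote(str(segment), safe="") for segment in (command, *path_segments)]
    return BASE_URL + "/" + "/".join(parts)


def query_url(command, **params):
    """Build a GET URL for commands that take several named arguments."""
    return f"{BASE_URL}/{quote(command, safe='')}?{urlencode(params)}"


print(rest_url("help", "types", "Xref", "properties"))
print(query_url("search", q="aaa", type="Xref"))
```

Percent-encoding each path segment and each query value keeps search terms that contain spaces or slashes from changing the structure of the URL.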

Query Results

Query Types

  • Entry points: Entry points into the Pathway Commons query system are the first queries that users will run. Like cpath, cpath2 will support two entry points: text search and external IDs.
    • Text search is provided by the Lucene and Hibernate Search frameworks. The searchable fields are specified by the hibernate-search mappings. TODO(igor, ben): ranking.
      • Input: A search term.
      • Output:
      • URL example: api.pathwaycommons.org/search?q=TERM&type=Protein
    • For external IDs, we will use normalized MIRIAM forms and perform a simple database lookup.
  • BioPAX traversal: A low-level API that exposes all the BioPAX properties and classes.

Batch Download

See Also:

cpath2 will make pathway data available as batch downloads. For each data format (described below), we will provide two directories:

  • by_source: one file per data source, e.g. one file for Reactome, one for NCI-Nature, etc.
  • by_organism: one file per organism, e.g. one file for human, one for mouse, etc.

Data Formats

  • BioPAX: BioPAX is a data exchange format for biological pathway data. All pathways and interactions within Pathway Commons are available in BioPAX Level 2. For more information on the BioPAX exchange format, visit: http://www.biopax.org.

  • GSEA (MSigDB GMT): Gene Matrix Transposed file format (.gmt). This is the main tab-delimited file format specified by the Broad Molecular Signature Database (http://www.broad.mit.edu/gsea/msigdb/). We provide two versions of this file format. In the first, all participants in the pathway are specified as official gene symbols (if an official gene symbol is not available, the participant will not be exported). In the second, all participants are specified as Entrez Gene IDs (if an Entrez Gene ID is not available, the participant will not be exported). All participants for a pathway must come from the same species as the pathway; therefore, some participants from cross-species pathways are removed. Exporting to the MSigDB format will enable computational biologists to use Pathway Commons data within gene set enrichment algorithms, such as GSEA. Available for all pathways within Pathway Commons (only from pathway database sources, not interaction database sources). Full data format details are available at: Broad GSEA Wiki, http://www.broad.mit.edu/cancer/software/gsea/wiki/index.php/Data_formats.

  • Pathway Commons Gene Set Format: Similar to the MSigDB format (see above), except that all participants are micro-encoded with multiple identifiers. Each participant is specified as: CPATH_ID:RECORD_TYPE:NAME:UNIPROT_ACCESSION:GENE_SYMBOL:ENTREZ_GENE_ID. Also available for all explicit pathways within Pathway Commons (only from pathway database sources, not interaction database sources). All participants for a pathway must come from the same species as the pathway; therefore, some participants from cross-species pathways are removed.

  • SIF: Simple Interaction Format - SIF, used by Cytoscape: http://cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats. All participants will be specified as GENE_SYMBOL. If an official gene symbol is not available for all members of the interaction, the interaction will not be exported. This is why fewer species may be listed here as compared to other file formats. This format is available for all pathways and interactions within Pathway Commons.

  • Tab Delimited Network: Similar to the basic SIF export, except that each export consists of two files. Each file is tab-delimited and multi-column. The first file is SIF (using CPATH_ID instead of GENE_SYMBOL) plus edge attributes. Current edge attributes are the Participant-A GENE_SYMBOL, Participant-B GENE_SYMBOL, interaction data source and PubMed ID. The second file contains the participant CPATH_ID followed by node attributes. Current node attributes are GENE_SYMBOL, UNIPROT_ACCESSION, ENTREZ_GENE_ID, CHEBI_ID, NODE_TYPE, and Organism (NCBI taxonomy ID). If an attribute cannot be determined, "NOT SPECIFIED" will be used. This format is suitable for Cytoscape Attribute Table import and for loading into Excel. To prevent an unsuccessful import into Cytoscape due to missing attribute values, users should specify during import that all columns are strings. This format is available for all pathways and interactions within Pathway Commons.
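As an illustration of the gene set formats described above, the following Python sketch writes one GMT line and parses one micro-encoded participant. It assumes the standard GMT layout (set name, description, then one member per tab-separated column) and assumes that no field of a micro-encoded participant contains a colon. The helper names, the CPATH_ID value and the pathway name are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Participant(NamedTuple):
    """Fields of a micro-encoded participant, in the order given above."""
    cpath_id: str
    record_type: str
    name: str
    uniprot_accession: str
    gene_symbol: str
    entrez_gene_id: str


def parse_participant(token: str) -> Participant:
    """Split CPATH_ID:RECORD_TYPE:NAME:UNIPROT_ACCESSION:GENE_SYMBOL:ENTREZ_GENE_ID.

    Assumption: no field contains a colon itself.
    """
    fields = token.split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(fields)}")
    return Participant(*fields)


def gmt_line(pathway_name: str, description: str,
             identifiers: Iterable[Optional[str]]) -> str:
    """Format one gene set as a tab-delimited GMT line.

    Participants without an identifier (None or empty) are not exported.
    """
    kept = [identifier for identifier in identifiers if identifier]
    return "\t".join([pathway_name, description, *kept])


# Hypothetical example values; the CPATH_ID 12345 is made up.
participant = parse_participant("12345:protein:TP53:P04637:TP53:7157")
print(participant.gene_symbol, participant.entrez_gene_id)
print(gmt_line("Example pathway", "hypothetical source", ["TP53", None, "MDM2"]))
```

The same `gmt_line` helper covers both GMT variants: pass gene symbols for the first version and Entrez Gene IDs for the second.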
