9 Crucial Mistakes to Avoid in Biological Sequence Search

30 Mar 2015

Each year, pharmaceutical companies, biotech organizations, academic institutions and law firms make costly errors that stem from poorly informed IP portfolio decisions. When it comes to biological sequence search, here are nine serious mistakes we hate to see life science companies make.

1. Overlooking Patent Sequence Data

A serious sequence search requires specific, organized effort, and searching GenBank alone is not enough. GenBank had 180 million sequences as of its December 2014 build, only 32 million of which are in its patent division. By contrast, Aptean GenomeQuest's GQ-Pat had over 280 million sequences, all found in patents, making it almost nine times larger than GenBank's patent division.

2. Under-Utilizing Annotation Information

Ascertaining the legal or biological importance of the similarity between any two sequences requires a clean, curated database with organized annotation fields and content. Additional fields, such as bibliographic references, date of earliest publication and date of sequence disclosure add analytical speed and precision when used with a rapid search result filtering function.
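
As a rough illustration of how annotation fields speed up filtering, the sketch below narrows a list of hits by identity and sequence disclosure date. The record layout, field names and accession values are hypothetical and do not reflect any particular database schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    """One search hit with a few annotation fields (hypothetical schema)."""
    accession: str
    percent_identity: float
    earliest_publication: date
    sequence_disclosure: date


def filter_hits(hits, min_identity, disclosed_before):
    """Keep hits at or above min_identity that were disclosed before the cutoff."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity
        and h.sequence_disclosure < disclosed_before
    ]


# Made-up example records, for illustration only.
hits = [
    Hit("EX0001", 98.5, date(2009, 3, 1), date(2008, 11, 15)),
    Hit("EX0002", 99.0, date(2014, 6, 1), date(2013, 12, 20)),
    Hit("EX0003", 71.2, date(2005, 1, 10), date(2004, 7, 4)),
]

prior = filter_hits(hits, min_identity=95.0, disclosed_before=date(2010, 1, 1))
print([h.accession for h in prior])
```

A filter like this only works when the date fields are consistently populated, which is why a curated database matters.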

3. Forgetting the Dark Genome

Public BLAST portals search only the most readily accessible elements of the entire universe of genome data. The remaining information is sometimes referred to as the “dark genome.” Poorly annotated data in a readily accessible database may be considered part of the dark genome, in that it is “hiding in plain sight.” Additional data with low search accessibility includes the information held in proprietary databases, desktop hard drives, graphic images and illustrations, and print document collections. Searching the “dark genome” requires access to proprietary data and full-time, multiple-media genome information searching and database curation procedures.

4. Taking Too Much Time

Taking too much time to do a patent-related search is a root cause of research project and intellectual property decision delays. Researchers might spend weeks scouring the internet for new data related to a query sequence, or developing lists of databases holding separate or overlapping sets of genomic information. Unintuitive search software user interfaces cause “learning curve” delays, and sequence search outsourcing can cause vendor transaction and project scheduling delays of weeks or months.

5. Hoping for the Best

Moving forward with a research project without properly searching and evaluating sequences can prove costly in the long run. An incomplete evaluation of the data early in the research cycle becomes expensive once a completed project is found to have yielded unusable results.

6. Making Decisions Based on Yesterday’s Results

Genome sequence information is extremely dynamic. In addition to the steady addition of recorded primary sequence data, scientific and patent information about both new and previously existing sequences also grows and changes on a daily basis. A sequence data query affecting important scientific research and business decisions might not yield the same answer one week from now. The more sequences involved in the decision, the greater the risk. Research groups and businesses without access to an automated and continuous search-and-report system are particularly vulnerable.
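
One simple way to see how results drift is to compare the hit lists returned for the same query on two different dates. The sketch below assumes each run yields a list of hit identifiers. The function name and the identifiers are hypothetical, and a real monitoring system would also need to track changes in annotation, not just in the set of hits.

```python
def diff_results(previous_ids, current_ids):
    """Report which hit identifiers appeared or disappeared between two runs."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "new": sorted(current - previous),
        "dropped": sorted(previous - current),
    }


# Made-up identifiers from two runs of the same query, one week apart.
last_week = ["EX0001", "EX0002", "EX0003"]
this_week = ["EX0001", "EX0003", "EX0004", "EX0005"]

report = diff_results(last_week, this_week)
print(report)
```

Running a comparison like this on a schedule, for every sequence that feeds a decision, is the basic idea behind a continuous search-and-report system.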

7. Using the Wrong Algorithm

Even the most experienced analysts can choose the wrong algorithm for a sequence search. For example, using BLAST for short sequences will miss many approximate hits. GenePAST is a better algorithm to use in many sequence search cases.
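
One commonly cited reason seed-based heuristics struggle with short queries is that they only examine regions sharing an exact word of a given length. The toy sketch below is not BLAST itself. It only checks whether two sequences share any exact word, using a word size of 11 on the assumption that this matches the usual nucleotide BLAST default. The sequences are invented for illustration.

```python
def shared_words(query, subject, word_size):
    """Return the exact words of length word_size found in both sequences."""
    q_words = {query[i:i + word_size] for i in range(len(query) - word_size + 1)}
    s_words = {subject[i:i + word_size] for i in range(len(subject) - word_size + 1)}
    return q_words & s_words


def percent_identity(a, b):
    """Ungapped identity between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# A made-up 20-mer and a copy with one substitution near the middle.
query = "ACGTTGCAAGGCTTACGGAT"
subject = query[:9] + "T" + query[10:]

print(percent_identity(query, subject))       # identity of the two 20-mers
print(shared_words(query, subject, 11))       # no exact 11-mer in common
print(len(shared_words(query, subject, 7)))   # shorter words do match
```

Here the two sequences differ at a single position, yet every 11-letter window spans that position, so a search that requires an exact 11-letter seed would never begin an alignment between them. A smaller word size or a different alignment approach avoids this particular blind spot.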

8. Too Many Gatekeepers

Restricted access rights to proprietary databases, cumbersome search software user interfaces, and outdated business practices often prevent the person asking the question from using sequence data search systems directly; that person must instead work through one or more gatekeepers. A well-defined project submission process can prevent intended queries from getting “lost in translation,” but when sequence searches are outsourced, queries are often composed broadly in order to prevent potentially relevant results from being excluded from the search report returned by the service. This results in an oversized report and a long manual search process for the sequence records of real interest. Gatekeeper delays also inhibit creative sequence data exploration, where hunches and hypotheses can be quickly formed and investigated using fast, iterative database queries.

9. Ignoring Workflow Issues

Commercially licensed or in-house bioinformatics solutions often become very popular within organizations as researchers learn to use them to great advantage. But an effort to provide genome search capability to the user base that does not consider workflow issues can result in the installation of an isolated, standalone information “silo” with an unfamiliar interface. The standalone solution is itself likely to be underutilized, and also fails to take advantage of organizational knowledge built up around previously existing bioinformatics applications.

With its extensive data coverage (over 500 million sequences), powerful search tools and user-friendly functionality, Aptean GenomeQuest is the obvious choice for searching the entire sequence domain, both patent and non-patent.

Avoid the pitfalls of using free solutions for IP sequence searching. Download our RFP template or start a free trial today!