The Protein Data Bank (PDB) bioinformatics database is the world’s largest repository of experimentally-determined structures of proteins, nucleic acids, and complex assemblies. All data is gathered using experimental methods such as X-ray, spectroscopy, crystallography, NMR, etc.
The PDB has a lot of repeating structures with different resolutions, methods, mutations, etc. Doing an experiment with the same or similar proteins can produce bias in any group analysis, so we will need to choose the correct structure from among any set of duplicates. For that purpose, we need to use a non-redundant (NR) set of proteins.
For the purpose of normalization, I recommend downloading the chemical compound dictionary for importing atom names into a database that uses 3NF or uses a star schema and dimensional modeling. (I’ve also used DSSP to help eliminate problematic structures. I won’t cover that in this article, but note that I didn’t use any other DSSP features.)
Data used in this research contains single-unit proteins who contain at least one disulfide bond taken from different species. To perform an analysis, all disulfide bonds are first classified as consecutive or nonconsecutive, by domain (archaea, prokaryote, viral, eukaryote, or other), and also by length.
Primary and tertiary protein structures, before and after protein folding.
Output
To be ready for input into R, SPSS, or some other tool, an analyst will need the data to be in a database table with this structure:
Column | Type | Description |
code | character(4) | Experiment ID (alphanumeric, case-insensitive, and cannot start with a zero) |
title | character varying(1000) | Title of the experiment, for reference (field can be also text format) |
ss_bonds | integer | Number of disulfide bonds in the chosen chain |
ssbonds_overlap | integer | Number of overlapping disulfide bonds |
intra_count | integer | Number of bonds made within the same chain |
sci_name_src | character varying(5000) | Scientific name of organism from which the sequence is taken |
tax_path | character varying | Path in Linnaean classification tree |
src_class | character varying(20) | Top-level class of organism (eukaryote, prokaryote, virus, archaea, other) |
has_reactives7 | boolean | True if and only if the sequence contains reactive centers |
len_class7 | integer | Length of sequence in set 7 (set with p-value 10e-7 calculated by blast) |