If you would like to install SEEK/NExtSEEK: Documentation is located here: https://igb.mit.edu/data-management/seek-and-nextseek
If you would just like to install SEEK: https://seek4science.org/get_seek.html
An introduction to NExtSEEK and an overview of our ongoing data management efforts
NExtSEEK is an active data management platform based out of the Integrated Genomics/Bioinformatics Core Facility of the Koch Institute for Integrative Cancer Research.
NExtSEEK is a modified wrapper, built on the SEEK infrastructure, that allows for active data management of ongoing research projects. Our focus is to ensure that the data associated with these research projects is FAIR: Findable, Accessible, Interoperable, and Reusable. This is achieved by structuring metadata in a relational model, which allows a rich provenance to be created and maintained for data files.
Currently, we support data management for 5 projects:
NExtSEEK is maintained as a private instance that hosts pre-publication research data and is only accessible to those who have federated access. Users from one project cannot see or access the data files or metadata of other projects. If you are a member of one of the projects above and would like to register for a NExtSEEK account, head to
If you are looking to set up your own NExtSEEK instance, head to the page.
Published data from these projects exists in FAIRDOMHub, a metadata repository maintained by the SEEK team. Links to the public data are below:
Summary Statistics as of August 2024:
Project | Sample Count |
---|
SEEK: or
SEEK Documentation:
NExtSEEK:
SEEK/NExtSEEK Installation:
FAIRDOMHub:
Repositories:
Sequence Read Archive:
Gene Expression Omnibus:
Zenodo:
Immport:
MIT.OMERO:
PRIDE:
If you would like to contact the MIT team, please reach out to
Lead Data Specialist
Data Specialist
Northeastern Bioinformatics CO-OP, Fall 2024
System Administrator and Maintainer
Huiming Ding: System Administrator and Developer
Project Lead / Principal Investigator
Northeastern Bioinformatics CO-OP, Spring 2024
Northeastern Bioinformatics CO-OP, Fall 2023
Northeastern Bioinformatics CO-OP, Spring 2023
Northeastern Bioinformatics CO-OP, Fall 2022
: Northeastern Bioinformatics CO-OP, Spring 2022
Data Specialist and Developer
An explanation of the major concepts and major pages of the SEEK/NExtSEEK data management platform
NExtSEEK is a modified wrapper, built on top of the SEEK infrastructure. The fundamental differences between the SEEK and NExtSEEK platforms are outlined in the Although there are differences, SEEK is required for NExtSEEK's functionality, as we use many features from the core SEEK. All data/metadata in NExtSEEK are compatible with SEEK. This compatibility is shown by our use of FAIRDOMHub, an instance of the core SEEK infrastructure, as the metadata repository where we publish our research data. An example of published research data exists
SEEK and NExtSEEK are utilized together, each serving different purposes. Below is a very brief explanation of how the two platforms interact.
Core SEEK: - Functionality: Register for accounts (same account used for SEEK and NExtSEEK), create Projects/SampleTypes/Assays, administer account-project associations
NExtSEEK: - Functionality: Upload/Search/Download Samples, Protocols, and Data Files
We use SEEK as the administrative site (creating assets and administering roles), while NExtSEEK is used for all things data (uploading, downloading, searching).
SEEK/NExtSEEK uses the ISA (Investigation, Study, Assay) metadata tracking framework. In our case: Investigation = Grant/Research Project, Study = Publication, and Assay = Experiment. This is a nested structure: there are multiple Assays in a Study, and multiple Studies in an Investigation.
This is how the data is modeled in the public domain (on FAIRDOMHub), but within NExtSEEK we treat the Investigation and Study as a single node. During the research process, it is often unknown which data will be part of a particular publication, so all data from ongoing research efforts (on NExtSEEK) lives underneath a single Study. When data is published from NExtSEEK to FAIRDOMHub, it is associated with a publication and can then be represented in ISA format.
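The nesting described above can be sketched with a few small data classes. This is only an illustration of the ISA shape, not how SEEK stores it; the class names, the `publish` helper, and the example titles are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    name: str


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


# Ongoing research (NExtSEEK-style): one Study holds every Assay of the project.
ongoing = Investigation(
    "Example Grant",
    [Study("Example Grant (ongoing data)",
           [Assay("DNA Extraction"), Assay("Short Read Sequencing")])],
)


def publish(investigation, publication_title, assay_names):
    """Hypothetical helper: build a per-publication Study from selected assays."""
    source = investigation.studies[0]
    selected = [a for a in source.assays if a.name in assay_names]
    return Study(publication_title, selected)


paper = publish(ongoing, "Example Paper", {"DNA Extraction"})
print(paper)
```

The point of the sketch is that the split into publication-level Studies happens only at publication time; before that, a single Study is enough.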
SEEK/NExtSEEK has a few different flavors/types of assets: Samples, Assays, Protocols, and Data Files.
A sample is any unit of biological, chemical, or data material that is subject to analysis or experimentation. It can range from a tangible entity, such as a patient or tissue specimen, to digital data outputs like raw or analyzed sequencing data files.
Samples are stored as tabular metadata (Excel) and grouped into different Sample Types, each describing a specific type of data or metadata. Each sample type is unique and contains a different subset of attributes. Some attributes are shared, such as UUID (unique identifier/primary key), Name (which must also be unique), Protocol (a field that links to the protocol associated with the sample), Parent (unique identifier of the parent sample), and more.
Sample Type Nomenclature: Samples without a prefix = Metadata samples, D.XXX = Data File, A.XXX = Analyzed Data File.
Examples: PAT: Human Patient, TIS: Tissue, DNA: DNA Library, D.SEQ: Sequencing File, A.GEX: Gene Expression Analysis File.
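As a small illustration of the prefix convention, the function below classifies a sample type code by its prefix. The function name and the returned labels are made up for this sketch; only the prefixes and example codes come from the text above.

```python
def classify_sample_type(sample_type: str) -> str:
    """Classify a sample type code by its prefix (D. / A. / none)."""
    if sample_type.startswith("D."):
        return "data file"
    if sample_type.startswith("A."):
        return "analyzed data file"
    return "metadata"


for code in ["PAT", "TIS", "DNA", "D.SEQ", "A.GEX"]:
    print(code, "->", classify_sample_type(code))
```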
An Assay is a type of experiment/procedure done on a Sample, to generate another sample. These can be broad terms, or more specific. Assays always have two samples associated with them: the Parent sample that feeds into the assay, and the Child sample that is generated from the assay. Examples: PAT -> Tissue Collection -> TIS -> DNA Extraction -> DNA -> Short Read Sequencing -> D.SEQ -> Gene Expression Analysis -> A.GEX
In the above example, the PAT sample feeds into the Tissue Collection Assay and generates a TIS sample.
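The example chain can also be written down as a list of (parent, assay, child) edges, which makes it easy to walk back from any sample type to its origin. This is a sketch of the idea only; the `LINEAGE` list and `ancestors` function are illustrative names, and the sketch assumes each sample type has at most one parent.

```python
# (parent sample type, assay, child sample type) edges from the example above
LINEAGE = [
    ("PAT", "Tissue Collection", "TIS"),
    ("TIS", "DNA Extraction", "DNA"),
    ("DNA", "Short Read Sequencing", "D.SEQ"),
    ("D.SEQ", "Gene Expression Analysis", "A.GEX"),
]


def ancestors(sample_type, edges=LINEAGE):
    """Return the chain of parent sample types, nearest parent first."""
    parent_of = {child: parent for parent, _assay, child in edges}
    chain = []
    current = sample_type
    while current in parent_of:
        current = parent_of[current]
        chain.append(current)
    return chain


print(ancestors("A.GEX"))
```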
An actual data file. Not frequently used. We are not looking to house/manage terabytes of research data, nor be responsible for serving/housing that data to the public (in perpetuity). Instead, we push for data to live in their respective repositories, and until then, in their original home (generating lab). We can store data files on SEEK/NExtSEEK, and those data files can be downloaded by users who have access, but the majority of our use cases point to systems that are much better at managing data transfers (repositories, cloud computing environments, Globus, etc).
There are three pages associated with Data Entry:
There are four pages associated with Data Query:
The sample page has two sections: An interactive Sample Tree and a table of Metadata.
The interactive sample tree shows all connected Parent/Child samples. By clicking on a sample, you then load the sample page of that sample.
The table of Metadata is straightforward - it is the metadata associated with that sample.
Sample pages can take some time to load, depending on the number of parent/child nodes associated with the sample, because they are not all stored in the database and are auto-generated on load.
Once you have an account, you will need to be approved and added to a project (by an administrator) to access SEEK/NExtSEEK. This is what federates access to different Projects, and therefore access to the different assets of those projects. You must be a member of a project to access the assets, therefore allowing multiple projects to exist in the same database.
New asset types (a new Sample Type, a new Assay, or a new Project) are created on the SEEK website.
Documentation surrounding creating these assets can be found directly on the SEEK Documentation, linked below:
Assay: There is no documentation on the SEEK website for this.
Statistic | Sample Count |
---|
Assays are Study specific. To view the full list of assays that are visible to you, head To view the list of assays associated with a study, head to the for your specific project.
A description of the assay/experiment performed on the sample. Can be in any format (PDF, DOCX, XLSX, TXT, IMG, etc). Ideally, this is primary materials from a lab (primary protocols used in-house), but materials and methods sections usually suffice. Examples: Protocols describing , DNA Library Creation, , and . Again, these can be Word documents, PDFs, text files, etc.
: Where a user uploads samples
: Where a user uploads data files/protocols
: Housing sample sheet templates for users to use (to prep and upload files)
More information on how to use these pages exists on the page.
: A text search of the entire database (all samples). Allows complex searching (AND/OR/NOT), partial/exact matches, and sample type specificity.
: Search a single Sample Type, by a single Attribute, by a single Value. Example: All D.SEQ whose Type contains 'RNA-Seq'
: Search through what data files exist in a filterable table. Files are downloadable as well (single + batch).
: Search through what Protocols exist in a filterable table. Files are downloadable as well (single + batch).
More information on how to search / download samples exists on the page.
Each sample has its own page on NExtSEEK located at: XXX (where XXX = the UUID of that sample).
This allows users to add/remove/edit attributes of Sample Types. This feature is only available on NExtSEEK, as SEEK does not allow for attribute editing. This is very useful when we are working with a new group, and they collect (and want us to include) a new field of a Sample Type that already exists.
Accounts are registered on the SEEK website (and then used on both the SEEK and NExtSEEK websites). You can register for an account here: .
To administer project associations: .
You can also request to join a project: .
A link to the full SEEK Documentation exists here: (head to user guides).
Statistic | Sample Count |
---|---|
Samples | 100843 |
Raw Data File Samples | 23993 |
Analyzed Data File Samples | 1141 |
FAIRDOMHub Studies | 30 |
Public FAIRDOMHub Studies | 19 |

Project | Sample Count |
---|---|
IMPAcTB | 58711 |
SRP | 32294 |
MetNet | 4885 |
CSBC | 302 |
BTC | 892 |
An in depth overview of how to upload Samples, Protocols, and Data Files to NExtSEEK
Samples are tabular metadata and are uploaded to the database as Excel sheets. These Excel sheets must be structured in a specific format to be successfully uploaded.
Properly formatted Upload Sheets have four sub-sheets: Instructions, Samples, Ontology, and Assay.
Instructions: This sheet contains all of the information required to add the sample to the database. There are four required columns in this sheet: Field, Database Field, Field Type, and Ontology.
Field = An identical match to the headers of the Samples sheet. The headers/column names do NOT need to be the database name of the attribute.
Database Field = Formatted as SAMPLETYPE::Attribute Name (e.g., TIS::Type or D.SEQ::Name). The attribute name here MUST exactly match the database field name. This maps the value in the Samples sheet to the correct attribute.
Field Type = Text, Number, Date, or Controlled Ontology.
Ontology = If Field Type == Controlled Ontology, the name of the ontology (in the Ontology sheet).
Samples: This is the table of metadata, where each row is a sample, and each column is an attribute for that sample. The column headers (Row 1) are the attribute names, and must identically match the Field column (transposed) of the Instructions page.
Ontology: This sheet contains ontologies (sets of controlled vocabulary terms) that can be used to control the values of an attribute. For an ontology to be enforced, the "Field Type" on the Instructions page for that attribute must be set to "Controlled Ontology", and the name of the Ontology (header in the Ontology sheet) must be set as the "Ontology" of that attribute on the Instructions Page. See the image below for more clarification.
Assay: This sheet determines which Assay(s) the uploaded samples should be associated with. The required columns are: SampleType, AssayType, Assay, Direction.
Attached is an example sample sheet, with notes/annotations as described above.
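To make the Database Field convention concrete, here is a minimal sketch that splits a SAMPLETYPE::Attribute Name entry into its two parts. The function name is hypothetical and this is not the parser NExtSEEK itself uses; it only illustrates the format described above.

```python
def parse_database_field(value: str):
    """Split 'SAMPLETYPE::Attribute Name' into (sample_type, attribute)."""
    sample_type, separator, attribute = value.partition("::")
    sample_type, attribute = sample_type.strip(), attribute.strip()
    if not separator or not sample_type or not attribute:
        raise ValueError(f"Expected SAMPLETYPE::Attribute Name, got {value!r}")
    return sample_type, attribute


print(parse_database_field("TIS::Type"))
print(parse_database_field("D.SEQ::Name"))
```

Because the attribute part must match the database field name exactly, a helper like this is a convenient place to catch malformed entries before upload.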
There are three different types of upload sheets: Assay Sheets, Sample Sheets, and Update Sheets.
Assay Sheets / Sample Sheets follow the Excel Sheet Format as shown above.
Assay/Sample Sheets must be used to upload a sample for the first time, to generate the UID.
The first time you upload an Assay/Sample Sheet, the UID column should be blank and will be automatically generated. Following upload, paste the UIDs into your upload sheet from the auto-generated feedback sheet.
When using an Assay/Sample Sheet to update samples, all attributes for that sample must be included. If an attribute is omitted from a later update, that metadata will be removed from the sample.
An Assay Sheet contains multiple sample types, while a Sample Sheet contains a single sample type.
Update Sheets are used to update a subset of attributes for a sample that has already been uploaded.
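The difference between the two update paths can be pictured with plain dictionaries. This sketch assumes, based on the descriptions above, that an Assay/Sample Sheet replaces a sample's metadata wholesale while an Update Sheet only touches the attributes it lists; the function names and sample values are hypothetical.

```python
def apply_full_sheet_update(existing: dict, new_row: dict) -> dict:
    """Assay/Sample Sheet re-upload: the new row replaces all metadata,
    so any attribute left out of new_row is dropped."""
    return dict(new_row)


def apply_update_sheet(existing: dict, changes: dict) -> dict:
    """Update Sheet: only the listed attributes change; the rest are kept."""
    merged = dict(existing)
    merged.update(changes)
    return merged


sample = {"Name": "TIS-example", "Type": "Lung", "Protocol": "P.example"}

print(apply_full_sheet_update(sample, {"Name": "TIS-example", "Type": "Liver"}))
print(apply_update_sheet(sample, {"Type": "Liver"}))
```

In the first call the Protocol value is lost because it was not re-supplied; in the second it is preserved.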
Below are examples of an Assay Sheet / Update Sheet. An Example of a Sample Sheet is linked above (SampleSheetFormatting_Template_240824.xlsx)
Once you have formatted your Assay/Sample Sheet for Upload, there exists a Sample Validation Script on the Uploading page to check that the sheet is in the correct format.
Choose your prepared Assay/Sample Sheet (not applicable for update sheets) and click validate.
The validation script checks:
That the Excel sheet is formatted correctly (Instructions, Samples, Ontology, Assay)
That the Instructions Page is formatted correctly (Field, Database Field, Field Type, Ontology)
In the above example, the Ontology column is missing
That the entries of Database Field match attributes in the database
In the above example, sample type CEL does not have the attribute Protocols (it should be Protocol)
That the Header row in the Samples page == Field column in the Instructions page
Disregard the 'Field' error, but in the above example, it's finding that there exists an entry in the Instructions page for Source, that does not exist in the Samples page.
That the Assay Sheet is formatted correctly (SampleType, AssayType, Assay, Direction)
In the above example, the column AssayType is missing, and there is an extra column named "1"
Not all of these errors would cause the upload to error out. Any error with the overall structure/format of the sheet would cause an upload to fail. A mislabeled attribute is not going to cause an error, but will instead upload that sample WITHOUT that attribute.
It is good practice to test your sample sheet on sample validation before uploading.
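For readers who want to pre-check sheets locally, the checks listed above can be approximated as follows. This is a rough sketch, not the actual Sample Validation Script: the workbook is modeled as a plain dictionary rather than a real Excel file, and `validate_workbook`, `known_attributes`, and the example data (including the CEL attributes) are assumptions made for illustration.

```python
REQUIRED_SHEETS = ["Instructions", "Samples", "Ontology", "Assay"]
INSTRUCTION_COLUMNS = ["Field", "Database Field", "Field Type", "Ontology"]
ASSAY_COLUMNS = ["SampleType", "AssayType", "Assay", "Direction"]


def validate_workbook(workbook, known_attributes):
    """Return a list of problems found in an upload workbook.

    workbook: dict of sheet name -> {"columns": [...], "rows": [dict, ...]}
    known_attributes: dict of sample type -> set of database attribute names
    """
    missing = [name for name in REQUIRED_SHEETS if name not in workbook]
    if missing:
        # Without all four sheets, the remaining checks cannot run.
        return ["Missing sheet(s): " + ", ".join(missing)]

    errors = []
    instructions = workbook["Instructions"]
    for column in INSTRUCTION_COLUMNS:
        if column not in instructions["columns"]:
            errors.append(f"Instructions is missing column '{column}'")

    assay_columns = workbook["Assay"]["columns"]
    for column in ASSAY_COLUMNS:
        if column not in assay_columns:
            errors.append(f"Assay is missing column '{column}'")
    for column in assay_columns:
        if column not in ASSAY_COLUMNS:
            errors.append(f"Assay has unexpected column '{column}'")

    # Samples headers must match the Field column of Instructions.
    fields = [row.get("Field") for row in instructions["rows"]]
    headers = workbook["Samples"]["columns"]
    for name in fields:
        if name not in headers:
            errors.append(f"Field '{name}' is in Instructions but not in Samples")
    for name in headers:
        if name not in fields:
            errors.append(f"Header '{name}' is in Samples but not in Instructions")

    # Each Database Field must name a real attribute of its sample type.
    for row in instructions["rows"]:
        database_field = row.get("Database Field", "")
        sample_type, _, attribute = database_field.partition("::")
        if attribute not in known_attributes.get(sample_type, set()):
            errors.append(f"'{database_field}' does not match a database attribute")
    return errors


example = {
    "Instructions": {
        "columns": ["Field", "Database Field", "Field Type"],
        "rows": [
            {"Field": "Name", "Database Field": "CEL::Name"},
            {"Field": "Protocol", "Database Field": "CEL::Protocols"},
            {"Field": "Source", "Database Field": "CEL::Source"},
        ],
    },
    "Samples": {"columns": ["Name", "Protocol"], "rows": []},
    "Ontology": {"columns": [], "rows": []},
    "Assay": {"columns": ["SampleType", "Assay", "Direction", "1"], "rows": []},
}
problems = validate_workbook(example, {"CEL": {"Name", "Protocol", "Source"}})
for problem in problems:
    print(problem)
```

The example workbook deliberately reproduces the mistakes discussed above: a missing Ontology column, CEL::Protocols instead of CEL::Protocol, a Source field with no matching Samples header, and a malformed Assay sheet.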
Once you have created your assay or sample sheet, head to the Uploading Page
Submit your sheet through Sample Validation.
Following validation success, place your sheet in the upload box. If you are an admin, select which lab/user you are uploading for. If not, leave as default and it will upload as yourself.
Click Upload. The upload should take around 1 second per sample.
To track your upload, head to either Search page and search for today's date in YYMMDD format (so 8/23/24 = 240823).
By running that search a few times, you should see the number of samples increase, which confirms that your upload is running successfully.
Following upload, paste your generated UIDs from the feedback file back into your upload sheet
IMPORTANT: Quality checks
Check a few samples
Ensure that the correct number of samples was uploaded
Ensure that all attributes for your samples are uploaded.
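The upload-tracking step above searches for a date in YYMMDD format. If you prefer to generate that string rather than type it, a minimal sketch using only the standard library looks like this (the function name is made up for illustration):

```python
from datetime import date


def upload_search_term(day: date) -> str:
    """Format a date as YYMMDD, the form used when searching for an upload."""
    return day.strftime("%y%m%d")


print(upload_search_term(date(2024, 8, 23)))  # 240823
print(upload_search_term(date.today()))
```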
To upload Protocols and Data Files, head to the Protocol/Data File Uploading Page.
Select whether the file(s) you are uploading are Data Files or Protocols
If you are an admin, select which Lab/User you are uploading for. If you are not, leave it as the default
Place the files into the "File Dropzone" and click submit. Wait for data files/protocols to be uploaded
The resulting UID generated in the bottom table will be the UID used to reference that Data File or Protocol. Data File UIDs are SampleTypeUID_FileName. Protocol UIDs are P.LAB-YYMMDD_Version_FileName
For Protocols, following the procedure above is sufficient to upload.
Data Files require a Sample whose File_PrimaryData value == the name of the file you are trying to upload (so that the Data File UID / Link_PrimaryData can be matched automatically to the corresponding samples). If there is no D. sample that matches your data file name, you can make a D.FILE sample to trick the system into uploading it. This is particularly useful when the file you are trying to associate is not a primary file but a supplementary data file, such as a FASTQC.html. Below is a D.FILE_Template.
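Before uploading data files, it can help to check which file names already have a matching sample. The sketch below assumes the match is an exact string comparison against File_PrimaryData, as the paragraph above suggests; the function name, the dictionary layout, and the UIDs are hypothetical.

```python
def match_files_to_samples(file_names, samples):
    """Pair each file name with the UID of the sample whose File_PrimaryData
    equals it; also return the files that have no matching sample."""
    uid_by_file = {
        s["File_PrimaryData"]: s["UID"]
        for s in samples
        if s.get("File_PrimaryData")
    }
    matched = {name: uid_by_file[name] for name in file_names if name in uid_by_file}
    unmatched = [name for name in file_names if name not in uid_by_file]
    return matched, unmatched


samples = [
    {"UID": "SEQ-example-1", "File_PrimaryData": "run1_R1.fastq.gz"},
    {"UID": "SEQ-example-2", "File_PrimaryData": "run1_R2.fastq.gz"},
]
matched, unmatched = match_files_to_samples(
    ["run1_R1.fastq.gz", "FASTQC.html"], samples
)
print(matched)
print(unmatched)  # candidates for a D.FILE sample
```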
An in depth overview of how to search, view, download, and delete samples, protocols, and data files.
This page allows you to search a single sample type, by a single attribute:value. In the example below, I am searching for all NHP where the Species attribute contains the value 'Macaca'.
The resulting table returns all samples that match the query. It displays the Assays, Contributor, and Attribute:Value that it found. All of the empty boxes underneath the headers are text filterable. To view the sample page of a specific sample, click on its hyperlinked UID.
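Conceptually, the simple search is a filter over one sample type, one attribute, and one value. The sketch below mimics that behavior on an in-memory list; it is not the NExtSEEK implementation, the record layout is invented, and treating "contains" as case-insensitive is an assumption.

```python
def simple_search(samples, sample_type, attribute, value):
    """Return samples of one type whose attribute contains the given value."""
    needle = value.lower()
    return [
        s for s in samples
        if s["sample_type"] == sample_type
        and needle in str(s["attributes"].get(attribute, "")).lower()
    ]


samples = [
    {"uid": "NHP-example-1", "sample_type": "NHP",
     "attributes": {"Species": "Macaca mulatta"}},
    {"uid": "NHP-example-2", "sample_type": "NHP",
     "attributes": {"Species": "Chlorocebus aethiops"}},
    {"uid": "TIS-example-1", "sample_type": "TIS",
     "attributes": {"Species": "Macaca mulatta"}},
]

hits = simple_search(samples, "NHP", "Species", "Macaca")
print([s["uid"] for s in hits])
```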
The advanced search page allows for complex querying (AND/OR/NOT) across the entire database. Additionally, you can select if you want it to be partial/exact matches, or limit it to a specific sample type.
The resulting table is identical to the Simple Search page, with one additional feature: Send to Sample Retrieval. Following a search, you can select a subset of samples, and send them to sample retrieval.
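The AND/OR/NOT logic of the advanced search can be sketched as three term lists evaluated against every attribute value of a sample. This is an illustration under assumptions, not the real query engine: the parameter names (`all_of`, `any_of`, `none_of`), the record layout, and the case-insensitive partial matching are all hypothetical.

```python
def term_matches(sample, term, exact=False):
    """True if any attribute value equals (exact) or contains (partial) the term."""
    values = [str(v) for v in sample["attributes"].values()]
    if exact:
        return term in values
    return any(term.lower() in v.lower() for v in values)


def advanced_search(samples, all_of=(), any_of=(), none_of=(),
                    exact=False, sample_type=None):
    """Filter samples with AND (all_of), OR (any_of), and NOT (none_of) terms."""
    results = []
    for s in samples:
        if sample_type is not None and s["sample_type"] != sample_type:
            continue
        if not all(term_matches(s, t, exact) for t in all_of):
            continue
        if any_of and not any(term_matches(s, t, exact) for t in any_of):
            continue
        if any(term_matches(s, t, exact) for t in none_of):
            continue
        results.append(s)
    return results


samples = [
    {"uid": "A", "sample_type": "TIS", "attributes": {"Type": "Lung", "Species": "Mouse"}},
    {"uid": "B", "sample_type": "TIS", "attributes": {"Type": "Liver", "Species": "Mouse"}},
    {"uid": "C", "sample_type": "CEL", "attributes": {"Type": "Lung cells", "Species": "Human"}},
]

print([s["uid"] for s in advanced_search(samples, all_of=["Lung"], none_of=["Human"])])
print([s["uid"] for s in advanced_search(samples, all_of=["Lung"], exact=True)])
```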
On the top bar of every page, there exists a Search by UID box.
By entering a valid UID and pressing search, you automatically redirect to that sample page (assuming you have access and it is a valid UID). Remember, sample pages with lots of associated samples take longer to load.
On the SEEK website, you can either search all assets in the top search bar (Search here...) or head to Browse and select a specific asset type that you would like to search.
The first step of downloading via Search pages is to search for the samples you want to download.
When downloading samples, you must choose whether you want to download just the samples you are looking up or include all samples associated (parent/children).
The pop-up window asks whether you would like to download with Parents. Selecting No downloads only the samples that have been selected. Selecting Yes downloads all associated samples (parent and child).
This choice is only available on the Simple Search page. On the Advanced Search page, parents are included automatically. Be patient when downloading a large subset of samples and their parents.
Attached below is an example of a downloaded file from NExtSEEK - with parents. This data is published already and associated with: https://fairdomhub.org/studies/1134.
Sample Retrieval is a feature that allows a user to download all of the associated samples (parents and children) of the sample(s) searched.
How to Use Sample Retrieval:
Input samples into Sample Retrieval: by pasting in UIDs (delimited by newlines), or sending samples over from advanced search
Run sample retrieval
Select sample types and attributes that you want to download
Download samples
This sample retrieval shows that there are 2 other sample types (Cells and Tissues) associated with the Mice I queried. I then can filter and uncheck any of the attributes that exist for those sample types, and then download all of the metadata.
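One way to picture what Sample Retrieval gathers is a walk over the Parent links in both directions. The sketch below assumes each sample has at most one Parent and that "associated" means ancestors plus descendants; the function names and UIDs are hypothetical, and the real feature may collect more (or differently).

```python
from collections import deque


def parse_uids(text):
    """Split newline-delimited UIDs, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def retrieve_associated(uids, parent_of):
    """Return the queried UIDs plus all of their ancestors and descendants.

    parent_of maps each sample UID to its parent UID (or None).
    """
    children_of = {}
    for child, parent in parent_of.items():
        if parent is not None:
            children_of.setdefault(parent, []).append(child)

    found = set()
    for start in uids:
        found.add(start)
        # Walk up through the chain of parents.
        seen_up = set()
        current = parent_of.get(start)
        while current is not None and current not in seen_up:
            seen_up.add(current)
            current = parent_of.get(current)
        found |= seen_up
        # Walk down through all children, breadth first.
        seen_down = set()
        queue = deque(children_of.get(start, []))
        while queue:
            node = queue.popleft()
            if node in seen_down:
                continue
            seen_down.add(node)
            queue.extend(children_of.get(node, []))
        found |= seen_down
    return found


parent_of = {
    "MOU-1": None, "MOU-2": None,
    "TIS-1": "MOU-1", "CEL-1": "TIS-1", "CEL-2": "MOU-2",
}
print(sorted(retrieve_associated(parse_uids("MOU-1\n\n"), parent_of)))
```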
These two pages are identical (images below). They are filterable tables that allow you to search what protocol/data files are visible to you, along with links to download individual files, and an option to download files in batch.
To download a specific file, click the File URL. To batch download files, select the checkbox, and select Batch download files selected. The original file name redirects you to the SEEK page for that specific data file/protocol.
Globus is a cloud-based file transfer and storage service that allows users to move and share large amounts of data between different resources.
NExtSEEK houses metadata that describes/annotates actual data files. NExtSEEK does not have a good solution for storing and sharing data files, while Globus does. Below is an overview of how Globus and NExtSEEK are used together.
Register for a Globus account with your Institution email
Email fairdata@mit.edu with your Globus Username and Project Association
The fairdata team will reply once you have been added to the correct Globus Collections
Head to Collections > Shared with You to see the collections shared with you
There exist two collections that you will have access to: {Project_Name}-Staging and {Project_Name}-Public
Data is Uploaded to {Project_Name}-Staging and Downloaded from {Project_Name}-Public
The fairdata team will curate (move) data from Staging to Public when the relevant metadata has been uploaded to NExtSEEK. Following "curation", the Link_PrimaryData attribute of the D.SampleType on NExtSEEK will be the corresponding Globus link.
The full Globus documentation exists here: https://docs.globus.org/
Deleting samples happens on the search pages. Similar to downloading, first, you need to search and select the samples you want to delete. You also need to ensure that no samples are children of the samples you are trying to delete.
For example, in the image below, if I am trying to delete those 5 NHPs, no samples in the database can have those 5 NHPs as their parent.
Once you've selected the samples, click delete and, assuming you are an admin, type 'DELETE'. Let it run (it takes around 6 seconds per sample). Again, the deletion will error out if there are downstream samples associated.
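The "no children" rule can be checked ahead of time if you have the Parent links for your samples. The sketch below is an illustration, not the check NExtSEEK performs; the function name and UIDs are hypothetical, and it follows the rule as stated above (any sample naming a target as its parent blocks the delete).

```python
def blocked_deletions(uids_to_delete, parent_of):
    """Map each target UID to the child samples that would block its deletion.

    parent_of maps each sample UID to its parent UID (or None).
    """
    targets = set(uids_to_delete)
    blocked = {}
    for child, parent in parent_of.items():
        if parent in targets:
            blocked.setdefault(parent, []).append(child)
    return blocked


parent_of = {
    "NHP-1": None, "NHP-2": None,
    "TIS-1": "NHP-1",
}
print(blocked_deletions(["NHP-1", "NHP-2"], parent_of))
```

An empty result means none of the selected samples has a child, so the delete should not fail for that reason.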
To delete a Protocol or Data File - head to the SEEK website:
Find the Protocol / Data File you want to delete, click actions, and delete.