MIT Data Management Analysis Core

Overview

An introduction to NExtSEEK and an overview of our ongoing data management efforts

NExtSEEK is an active data management platform based out of the MIT BioMicro Center: the Integrated Genomics/Bioinformatics Core Facility of the Koch Institute for Integrative Cancer Research.

NExtSEEK is a modified wrapper, built on the SEEK infrastructure, that allows for active data management of ongoing research projects. Our focus is to ensure that the data associated with these research projects is FAIR: Findable, Accessible, Interoperable, and Reusable. This is achieved by structuring metadata in a relational model, allowing for the creation and maintenance of a rich provenance for data files.

Currently, we support data management for 5 projects:

NExtSEEK is maintained as a private instance that hosts pre-publication research data and is accessible only to those who have federated access. Users from one project cannot see or access the data files or metadata of other projects. If you are a member of one of the projects above and would like to register for a NExtSEEK account, see the Account Registration / Project Association section.

If you are looking to set up your own NExtSEEK instance, head to the installation page.

Published data of these projects exist in FAIRDOMHub - a metadata repository maintained by the SEEK team. Links to the public data are below:

Summary Statistics as of August 2024:

Project    Sample Count
IMPAcTB    58711
SRP        32294
MetNet     4885
CSBC       302
BTC        892

Statistic                     Sample Count
Samples                       100843
Raw Data File Samples         23993
Analyzed Data File Samples    1141

FAIRDOMHub Studies

Public FAIRDOMHub Studies

Using SEEK and NExtSEEK

An explanation of the major concepts and major pages of the SEEK/NExtSEEK data management platform

Concepts:

SEEK vs NExtSEEK

NExtSEEK is a modified wrapper built on top of the SEEK infrastructure. The fundamental differences between the SEEK and NExtSEEK platforms are outlined in the NExtSEEK publication. Despite these differences, NExtSEEK requires SEEK to function, as it relies on many features of the core SEEK. All data/metadata in NExtSEEK are compatible with SEEK. This compatibility is demonstrated by our use of FAIRDOMHub, an instance of the core SEEK infrastructure, as the metadata repository where we publish our research data. An example of published research data exists here.

SEEK and NExtSEEK are utilized together, each serving different purposes. Below is a very brief explanation of how the two platforms interact.

Core SEEK: fairdata.mit.edu - Functionality: Register for accounts (same account used for SEEK and NExtSEEK), create Projects/SampleTypes/Assays, administer account-project associations

NExtSEEK: nextseek.mit.edu - Functionality: Upload/Search/Download Samples, Protocols, and Data Files

We use SEEK as the administrative site (creating assets and administering roles), while NExtSEEK is used for all things data (uploading, downloading, searching).

ISA Structure

SEEK/NExtSEEK uses the ISA metadata tracking framework as described here. ISA = Investigation, Study, Assay. In our case: Investigation = Grant/Research Project, Study = Publication, and Assay = Experiment. This is a nested structure: there are multiple Assays in a Study, and multiple Studies in an Investigation.

This is how the data is modeled in the public domain (on FAIRDOMHub), but within NExtSEEK we treat the Investigation and Study as a single node. During the research process, it is often unknown which data will be part of a particular publication, so all data from ongoing research efforts (on NExtSEEK) lives under a single Study. When data is published from NExtSEEK to FAIRDOMHub, it is associated with a publication and can then be represented in ISA format.

Types of Assets

SEEK/NExtSEEK has a few different flavors/types of assets: Samples, Assays, Protocols, and Data Files.

Samples:

A sample is any unit of biological, chemical, or data material that is subject to analysis or experimentation. It can range from a tangible entity, such as a patient or tissue specimen, to digital data outputs like raw or analyzed sequencing data files.

Samples are stored as tabular metadata (Excel) and grouped into Sample Types, each describing a specific type of data or metadata. Each Sample Type is unique and contains a different subset of attributes. Some attributes are shared, such as UUID (unique identifier/primary key), Name (which must also be unique), Protocol (a field that links to the protocol associated with the sample), Parent (unique identifier of the Parent sample), and more.

Sample Type Nomenclature: Sample Types without a prefix are metadata samples; D.XXX = Data File; A.XXX = Analyzed Data File. Examples: PAT: Human Patient, TIS: Tissue, DNA: DNA Library, D.SEQ: Sequencing File, A.GEX: Gene Expression Analysis File.
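The prefix convention can be expressed as a small helper. This is only a sketch: the function name classify_sample_type and the returned labels are illustrative and are not part of NExtSEEK.

```python
def classify_sample_type(sample_type: str) -> str:
    """Classify a Sample Type name by its prefix (hypothetical helper)."""
    if sample_type.startswith("D."):
        return "data file"
    if sample_type.startswith("A."):
        return "analyzed data file"
    # No prefix: a metadata sample such as PAT, TIS, or DNA.
    return "metadata"


for name in ["PAT", "TIS", "DNA", "D.SEQ", "A.GEX"]:
    print(name, "->", classify_sample_type(name))
```

Note that DNA is classified as metadata: the check looks for the "D." prefix including the dot, not just a leading letter.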

Assays:

An Assay is a type of experiment/procedure done on a Sample, to generate another sample. These can be broad terms, or more specific. Assays always have two samples associated with them: the Parent sample that feeds into the assay, and the Child sample that is generated from the assay. Examples: PAT -> Tissue Collection -> TIS -> DNA Extraction -> DNA -> Short Read Sequencing -> D.SEQ -> Gene Expression Analysis -> A.GEX

In the above example, the PAT sample feeds into the Tissue Collection Assay and generates a TIS sample.
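One way to picture the chain is as a list of steps, each linking a Parent Sample Type to a Child Sample Type through an Assay. The sketch below is illustrative only: AssayStep, CHAIN, and lineage are hypothetical names, and it assumes each Sample Type is produced by at most one assay, as in the example.

```python
from typing import NamedTuple


class AssayStep(NamedTuple):
    parent: str  # Sample Type that feeds into the assay
    assay: str   # Assay name
    child: str   # Sample Type generated by the assay


# The example chain from the text, one entry per assay.
CHAIN = [
    AssayStep("PAT", "Tissue Collection", "TIS"),
    AssayStep("TIS", "DNA Extraction", "DNA"),
    AssayStep("DNA", "Short Read Sequencing", "D.SEQ"),
    AssayStep("D.SEQ", "Gene Expression Analysis", "A.GEX"),
]


def lineage(sample_type: str, steps: list[AssayStep]) -> list[str]:
    """Walk from a Sample Type back to the root and return the path root-first."""
    produced_by = {step.child: step for step in steps}
    path = [sample_type]
    while path[-1] in produced_by:
        path.append(produced_by[path[-1]].parent)
    return list(reversed(path))


print(lineage("D.SEQ", CHAIN))
```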

Assays are Study specific (see ISA format). To view the full list of assays that are visible to you, head here. To view the list of assays associated with a study, head to the study page for your specific project.

Protocols:

A description of the assay/experiment performed on the sample. It can be in any format (PDF, DOCX, XLSX, TXT, IMG, etc). Ideally, these are primary materials from a lab (primary protocols used in-house), but materials and methods sections usually suffice. Examples: Protocols describing Tissue Extraction, DNA Library Creation, Sequencing, and Gene Expression Analysis.

Data Files:

An actual data file. Not frequently used. We are not looking to house/manage terabytes of research data, nor be responsible for serving/housing that data to the public (in perpetuity). Instead, we push for data to live in their respective repositories, and until then, in their original home (generating lab). We can store data files on SEEK/NExtSEEK, and those data files can be downloaded by users who have access, but the majority of our use cases point to systems that are much better at managing data transfers (repositories, cloud computing environments, Globus, etc).

Pages on NExtSEEK:

Data Entry

There are three pages associated with Data Entry:

Assay Sheet Uploading: Where a user uploads samples
Data File/Protocol Uploading: Where a user uploads data files/protocols
Templates: Housing sample sheet templates for users to use (to prep and upload files)

More information on how to use these pages exists on the Uploading page.

Data Query

There are four pages associated with Data Query:

Advanced Search: A text search of the entire database (all samples). Allows complex searching (AND/OR/NOT), partial/exact matches, and sample type specificity.
Simple Search: Search a single Sample Type, by a single Attribute, by a single Value. Example: All D.SEQ whose Type contains 'RNA-Seq'
Data File Query: Search through what data files exist in a filterable table. Files are downloadable as well (single + batch).
Protocol Query: Search through what Protocols exist in a filterable table. Files are downloadable as well (single + batch).

More information on how to search / download samples exists on the Searching / Downloading page.

Sample Pages

Each sample has its own page on NExtSEEK located at: https://nextseek.mit.edu/seek/sampletree/uid=XXX (where XXX = the UUID of that sample).
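If you need sample page links in bulk, for example in a spreadsheet or report, the URL pattern above can be filled in programmatically. This is a minimal sketch; the helper name sample_page_url is hypothetical, and percent-encoding the UID is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote

SAMPLE_PAGE_BASE = "https://nextseek.mit.edu/seek/sampletree/uid="


def sample_page_url(uid: str) -> str:
    """Return the sample page URL for a UID (hypothetical helper)."""
    # Assumption: characters that are unsafe in a URL should be percent-encoded.
    return SAMPLE_PAGE_BASE + quote(uid, safe="")


print(sample_page_url("XXX"))
```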

The sample page has two sections: An interactive Sample Tree and a table of Metadata.

The interactive sample tree shows all connected Parent/Child samples. By clicking on a sample, you then load the sample page of that sample.

The table of Metadata is straightforward - it is the metadata associated with that sample.

Sample pages can take some time to load, depending on the number of nodes (child/parent) associated with the sample, because they are not all stored in the database and are auto-generated on load.

Attribute Editor

This page allows users to add/remove/edit attributes of Sample Types. This feature is only available on NExtSEEK, as SEEK does not allow for attribute editing. This is very useful when we are working with a new group, and they collect (and want us to include) a new field of a Sample Type that already exists.

Pages on SEEK:

Account Registration / Project Association:

Accounts are registered on the SEEK website (and then used on both the SEEK and NExtSEEK websites). You can register for an account here: https://fairdata.mit.edu/signup.

Once you have an account, you will need to be approved and added to a project (by an administrator) to access SEEK/NExtSEEK. Project membership is what federates access to the different Projects, and therefore to the assets of those projects. Because you must be a member of a project to access its assets, multiple projects can exist in the same database.

To administer project associations: https://docs.seek4science.org/help/user-guide/administer-project-members.html#add-and-remove-people-from-a-project.

You can also request to join a project: https://docs.seek4science.org/help/user-guide/join-a-project.html.

Creating Assets (Sample Types, Assays, Projects)

To create a new asset type, whether a new Sample Type, a new Assay, or a new Project, use the SEEK website.

Documentation surrounding creating these assets can be found directly on the SEEK Documentation, linked below:

Sample Type
Assay: There is no documentation on the SEEK website for this.
Project
Investigation/Study
Institution

SEEK Documentation Link

A link to the full SEEK Documentation exists here: https://docs.seek4science.org/ (head to user guides).

Uploading

An in-depth overview of how to upload Samples, Protocols, and Data Files to NExtSEEK

Uploading Samples

Excel Sheet Structure

Samples are tabular metadata and are uploaded to the database as Excel sheets. These Excel sheets must be structured in a specific format to be successfully uploaded.

Properly formatted Upload Sheets have four sub-sheets: Instructions, Samples, Ontology, and Assay.

Instructions: This sheet contains all of the information required to add the sample to the database. There are four required columns in this sheet: Field, Database Field, Field Type, and Ontology.
- Field: An identical match to the headers of the Samples sheet. The headers/column names do NOT need to be the database name of the attribute.
- Database Field: Formatted as SAMPLETYPE::Attribute Name (for example, TIS::Type or D.SEQ::Name). The attribute name here MUST exactly match the DB Field Name. This maps the value in the Samples sheet to the correct attribute.
- Field Type: Text, Number, Date, or Controlled Ontology.
- Ontology: If Field Type is Controlled Ontology, the name of the ontology (in the Ontology sheet).

Samples: This is the table of metadata, where each row is a sample, and each column is an attribute for that sample. The column headers (Row 1) are the attribute names, and must identically match the Field column (transposed) of the Instructions page.

Ontology: This sheet contains ontologies (sets of controlled vocabulary terms) that can be used to control the values of an attribute. For an ontology to be enforced, the "Field Type" on the Instructions page for that attribute must be set to "Controlled Ontology", and the name of the Ontology (header in the Ontology sheet) must be set as the "Ontology" of that attribute on the Instructions Page. See the image below for more clarification.

Assay: This sheet determines which Assay(s) the uploaded samples should be associated with. The required columns are: SampleType, AssayType, Assay, Direction.

Attached is an example sample sheet, with notes/annotations as described above.
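To make the Instructions sheet rules concrete, here is a minimal sketch that checks one Instructions row represented as a Python dictionary. It is not the NExtSEEK upload code: the function names, the example field "Tissue Type", and the ontology name "TissueTypes" are all hypothetical.

```python
FIELD_TYPES = {"Text", "Number", "Date", "Controlled Ontology"}


def parse_database_field(database_field: str) -> tuple[str, str]:
    """Split 'SAMPLETYPE::Attribute Name' into (sample type, attribute name)."""
    sample_type, sep, attribute = database_field.partition("::")
    if not sep or not sample_type or not attribute:
        raise ValueError(f"expected SAMPLETYPE::Attribute Name, got {database_field!r}")
    return sample_type, attribute


def check_instruction_row(row: dict, ontology_names: set[str]) -> list[str]:
    """Return a list of problems found in one row of the Instructions sheet."""
    problems = []
    try:
        parse_database_field(row["Database Field"])
    except ValueError as err:
        problems.append(str(err))
    field_type = row["Field Type"]
    if field_type not in FIELD_TYPES:
        problems.append(f"unknown Field Type: {field_type!r}")
    elif field_type == "Controlled Ontology" and row.get("Ontology") not in ontology_names:
        # The named ontology must be a header in the Ontology sheet.
        problems.append(f"ontology not found in Ontology sheet: {row.get('Ontology')!r}")
    return problems


# Hypothetical row: the Field and Ontology values are made up for illustration.
row = {"Field": "Tissue Type", "Database Field": "TIS::Type",
       "Field Type": "Controlled Ontology", "Ontology": "TissueTypes"}
print(parse_database_field(row["Database Field"]))
print(check_instruction_row(row, ontology_names={"TissueTypes"}))
```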

Assay Sheet, Sample Sheet, and Update Sheets

There are three different types of upload sheets: Assay Sheets, Sample Sheets, and Update Sheets.

Assay Sheets / Sample Sheets follow the Excel Sheet Format as shown above.
- Assay/Sample Sheets must be used to upload a sample for the first time, to generate the UID.
  - The first time you upload an Assay/Sample Sheet, the UID column should be blank and will be automatically generated. Following upload, paste the UIDs into your upload sheet from the auto-generated feedback sheet.
- When using an Assay / Sample Sheet to update samples, all attributes for that sample must be included. If an attribute is omitted in a later update, that metadata will be removed from the sample.
An Assay Sheet contains multiple sample types, while a Sample Sheet contains a single sample type.
Update Sheets are used to update a subset of attributes for a sample that has already been uploaded
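The difference between re-uploading an Assay/Sample Sheet and using an Update Sheet can be illustrated with plain dictionaries. This is a conceptual sketch only, not how NExtSEEK stores samples; the function names and the attribute values are hypothetical.

```python
def apply_full_sheet(existing: dict, uploaded: dict) -> dict:
    """Assay/Sample Sheet re-upload: the uploaded attributes replace the sample's metadata."""
    return dict(uploaded)


def apply_update_sheet(existing: dict, uploaded: dict) -> dict:
    """Update Sheet: only the listed attributes change; the rest are kept."""
    return {**existing, **uploaded}


# Hypothetical stored sample and a change that omits the Protocol attribute.
stored = {"Name": "TIS-001", "Type": "Lung", "Protocol": "P.EXAMPLE"}
change = {"Name": "TIS-001", "Type": "Spleen"}

print(apply_full_sheet(stored, change))    # Protocol is dropped
print(apply_update_sheet(stored, change))  # Protocol is kept
```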

Below are examples of an Assay Sheet / Update Sheet. An Example of a Sample Sheet is linked above (SampleSheetFormatting_Template_240824.xlsx)

Sample Validation Script

Once you have formatted your Assay/Sample Sheet for upload, use the Sample Validation Script on the Uploading page to check that the sheet is in the correct format.

Choose your prepared Assay/Sample Sheet (not applicable for update sheets) and click validate.

The validation script checks:

That the Excel sheet is formatted correctly (Instructions, Samples, Ontology, Assay)
That the Instructions Page is formatted correctly (Field, Database Field, Field Type, Ontology)
- In the above example, the Ontology column is missing
That the entries of Database Field match attributes in the database
- In the above example, sample type CEL does not have the attribute Protocols (it should be Protocol)
That the Header row in the Samples page == Field column in the Instructions page
- Disregard the 'Field' error, but in the above example, it's finding that there exists an entry in the Instructions page for Source, that does not exist in the Samples page.
That the Assay Sheet is formatted correctly (SampleType, AssayType, Assay, Direction)
- In the above example, the column AssayType is missing, and there is an extra column named "1"

Not all of these errors would cause the upload to error out. Any error with the overall structure/format of the sheet would cause an upload to fail. A mislabeled attribute is not going to cause an error, but will instead upload that sample WITHOUT that attribute.
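The structural checks listed above can be approximated offline before submitting a sheet. The sketch below is not the NExtSEEK validation script: it assumes the workbook has already been read into a dictionary of sheet name to rows (first row = header, every sheet has at least a header row), and it skips the check against database attributes, which requires access to the database. All names are hypothetical.

```python
REQUIRED_SHEETS = ["Instructions", "Samples", "Ontology", "Assay"]
INSTRUCTION_COLUMNS = ["Field", "Database Field", "Field Type", "Ontology"]
ASSAY_COLUMNS = ["SampleType", "AssayType", "Assay", "Direction"]


def validate_workbook(workbook: dict[str, list[list[str]]]) -> list[str]:
    """Return structural problems; an empty list means these checks passed."""
    missing = [name for name in REQUIRED_SHEETS if name not in workbook]
    if missing:
        return [f"missing sheet: {name}" for name in missing]

    errors = []
    instructions = workbook["Instructions"]
    header = instructions[0]
    for column in INSTRUCTION_COLUMNS:
        if column not in header:
            errors.append(f"Instructions sheet is missing column: {column}")
    for column in ASSAY_COLUMNS:
        if column not in workbook["Assay"][0]:
            errors.append(f"Assay sheet is missing column: {column}")

    # Compare the Field column of Instructions with the Samples header row.
    if "Field" in header:
        index = header.index("Field")
        fields = [row[index] for row in instructions[1:]]
        sample_headers = workbook["Samples"][0]
        for name in fields:
            if name not in sample_headers:
                errors.append(f"Instructions field not in Samples header: {name}")
        for name in sample_headers:
            if name not in fields:
                errors.append(f"Samples header not in Instructions: {name}")
    return errors


# A deliberately broken, made-up workbook that mirrors the kinds of errors described above.
workbook = {
    "Instructions": [["Field", "Database Field", "Field Type"],
                     ["Name", "CEL::Name", "Text"],
                     ["Source", "CEL::Source", "Text"]],
    "Samples": [["Name"], ["CEL-001"]],
    "Ontology": [[]],
    "Assay": [["SampleType", "Assay", "Direction", "1"]],
}
for problem in validate_workbook(workbook):
    print(problem)
```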

It is good practice to test your sample sheet on sample validation before uploading.

How to Upload

Once you have created your assay or sample sheet, head to the Uploading Page
Submit your sheet through Sample Validation.
Following validation success, place your sheet in the upload box. If you are an admin, select which lab/user you are uploading for. If not, leave as default and it will upload as yourself.

Click Upload. The upload should take around 1 second per sample.
1. To track your upload, head to either Search page. Search for today's date in YYMMDD format (so 8/23/24 = 240823).
2. By running that search a few times, you should see the number of samples increasing, which confirms that your upload is running successfully.
Following upload, paste your generated UIDs from the feedback file back into your upload sheet
IMPORTANT: Quality checks
1. Check a few samples
2. Ensure that the correct # of samples got uploaded
3. Ensure that all attributes for your samples are uploaded.
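The YYMMDD search term used to track an upload, and a rough time estimate based on the figure of about 1 second per sample, can be computed as follows. This is a small convenience sketch; the function names are hypothetical and the timing is only the approximation quoted in the steps.

```python
from datetime import date


def upload_search_term(day: date) -> str:
    """Format a date as YYMMDD, the form used to search for a day's uploads."""
    return day.strftime("%y%m%d")


def estimated_upload_seconds(sample_count: int, seconds_per_sample: float = 1.0) -> float:
    """Rough upload duration, using the approximate per-sample figure from the text."""
    return sample_count * seconds_per_sample


print(upload_search_term(date(2024, 8, 23)))  # 240823
print(estimated_upload_seconds(500))
```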

Uploading Protocols / Data Files

To upload Protocols and Data Files, head to the Protocol/Data File Uploading Page.

Select whether the file(s) you are uploading are Data Files or Protocols
If you are an admin, select which Lab/User you are uploading for. If you are not, leave it as the default
Place the files into the "File Dropzone" and click submit. Wait for data files/protocols to be uploaded
The resulting UID generated in the bottom table will be the UID used to reference that Data File or Protocol. Data File UIDs are SampleTypeUID_FileName. Protocol UIDs are P.LAB-YYMMDD_Version_FileName

For Protocols, following the procedure above is sufficient to upload.

Data Files require a Sample whose File_PrimaryData attribute equals the name of the file you are trying to upload (this is used to automatically match the Data File UID / Link_PrimaryData to the corresponding samples). If no D. sample matches your data file name, you can create a D.FILE sample so that the system accepts the upload. This is particularly useful when the file you are trying to associate is not a primary file but a supplementary data file, such as a FASTQC.html. Below is a D.FILE_Template.
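Before uploading, it can help to check which files already have a sample with a matching File_PrimaryData value. The sketch below works on sample metadata you have downloaded into a list of dictionaries; the function name, the "UID" key, and the file names are hypothetical, and exact string equality is an assumption about how matching works.

```python
def match_files_to_samples(file_names: list[str], samples: list[dict]) -> dict[str, list[str]]:
    """Map each file name to the UIDs of samples whose File_PrimaryData equals it."""
    matches = {name: [] for name in file_names}
    for sample in samples:
        primary = sample.get("File_PrimaryData")
        if primary in matches:
            matches[primary].append(sample["UID"])
    return matches


# Made-up sample rows and file names for illustration.
samples = [
    {"UID": "uid-1", "File_PrimaryData": "run1_R1.fastq.gz"},
    {"UID": "uid-2", "File_PrimaryData": "run1_R2.fastq.gz"},
]
files = ["run1_R1.fastq.gz", "FASTQC.html"]

result = match_files_to_samples(files, samples)
unmatched = [name for name, uids in result.items() if not uids]
print(result)
print("Needs a D.FILE sample first:", unmatched)
```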

Documentation surrounding Globus (Uploading and Downloading) exists here:

Searching / Downloading

An in-depth overview of how to search, view, download, and delete samples, protocols, and data files.

Searching

Simple Search Page

This page allows you to search a single sample type, by a single attribute:value. In the example below, I am searching for all NHP where the Species attribute contains the value 'Macaca'.

The resulting table returns all samples that match the query. It displays the Assays, Contributor, and Attribute:Value that it found. All of the empty boxes underneath the headers are text filterable. To view the sample page of a specific sample, click on its hyperlinked UID.
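Conceptually, a Simple Search is a filter on one Sample Type, one attribute, and one value. The sketch below reproduces that logic on an in-memory list of samples; it is not the NExtSEEK query code, the "SampleType" and "UID" keys are hypothetical, and case-sensitive matching is an assumption.

```python
def simple_search(samples: list[dict], sample_type: str, attribute: str, value: str) -> list[dict]:
    """Return samples of one Sample Type whose attribute contains the value."""
    return [
        sample for sample in samples
        if sample.get("SampleType") == sample_type
        and value in str(sample.get(attribute, ""))
    ]


# Made-up records for illustration.
samples = [
    {"UID": "uid-1", "SampleType": "NHP", "Species": "Macaca mulatta"},
    {"UID": "uid-2", "SampleType": "NHP", "Species": "Macaca fascicularis"},
    {"UID": "uid-3", "SampleType": "TIS", "Species": "Macaca mulatta"},
]
hits = simple_search(samples, "NHP", "Species", "Macaca")
print([sample["UID"] for sample in hits])
```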

Advanced Search Page

The advanced search page allows for complex querying (AND/OR/NOT) across the entire database. Additionally, you can select if you want it to be partial/exact matches, or limit it to a specific sample type.

The resulting table is identical to the Simple Search page, with one additional feature: Send to Sample Retrieval. Following a search, you can select a subset of samples, and send them to sample retrieval.
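The AND/OR/NOT and partial/exact options can be pictured with the following sketch, which searches every attribute value of in-memory sample records. It is an illustration of the query semantics under stated assumptions, not the actual search implementation; the function and parameter names are hypothetical.

```python
def contains_term(sample: dict, term: str, exact: bool = False) -> bool:
    """True if any attribute value matches the term (the whole value if exact)."""
    values = [str(value) for value in sample.values()]
    if exact:
        return term in values
    return any(term in value for value in values)


def advanced_search(samples, all_of=(), any_of=(), none_of=(), exact=False):
    """Combine terms with AND (all_of), OR (any_of), and NOT (none_of)."""
    hits = []
    for sample in samples:
        if not all(contains_term(sample, t, exact) for t in all_of):
            continue
        if any_of and not any(contains_term(sample, t, exact) for t in any_of):
            continue
        if any(contains_term(sample, t, exact) for t in none_of):
            continue
        hits.append(sample)
    return hits


# Made-up records for illustration.
samples = [
    {"UID": "uid-1", "Type": "RNA-Seq", "Tissue": "Lung"},
    {"UID": "uid-2", "Type": "RNA-Seq", "Tissue": "Spleen"},
    {"UID": "uid-3", "Type": "ATAC-Seq", "Tissue": "Lung"},
]
hits = advanced_search(samples, all_of=["RNA-Seq"], none_of=["Spleen"])
print([sample["UID"] for sample in hits])
```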

Search by UID

On the top bar of every page, there exists a Search by UID box.

By entering a valid UID and pressing search, you are automatically redirected to that sample page (assuming you have access and the UID is valid). Remember, sample pages with lots of associated samples take longer to load.

Searching on the SEEK Website

On the SEEK website, you can either search all assets in the top search bar (Search here...) or head to Browse and select a specific asset type that you would like to search.

Downloading

Downloading via Search Pages

The first step of downloading via Search pages is to search for the samples you want to download.

When downloading samples, you must choose whether you want to download just the samples you are looking up or include all samples associated (parent/children).

The pop-up window asks whether you would like to download with Parents. By selecting No, you will download only the samples that have been selected. By selecting Yes, you will download all associated samples (parent and child).

This choice is only available on the Simple Search page. On the Advanced Search page, parents are included automatically. Be patient when downloading a large subset of samples and their parents.

Attached below is an example of a downloaded file from NExtSEEK - with parents. This data is published already and associated with: https://fairdomhub.org/studies/1134.

Sample Retrieval

Sample Retrieval is a feature that allows a user to download all of the associated samples (parents and children) of the sample(s) searched.

How to Use Sample Retrieval:

Input samples into Sample Retrieval: by pasting in UIDs (delimited by newlines), or sending samples over from advanced search
Run sample retrieval
Select sample types and attributes that you want to download
Download samples

This sample retrieval shows that there are 2 other sample types (Cells and Tissues) associated with the Mice I queried. I then can filter and uncheck any of the attributes that exist for those sample types, and then download all of the metadata.
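The idea behind Sample Retrieval, collecting every parent and child connected to the queried samples, can be sketched as a walk over Parent links. This is an assumption-laden illustration, not the NExtSEEK implementation: it assumes each sample has at most one Parent, takes "associated" to mean ancestors plus descendants, and uses made-up UIDs.

```python
from collections import defaultdict, deque


def related_samples(start_uids: list[str], parent_of: dict[str, str]) -> set[str]:
    """Collect the queried samples plus all of their ancestors and descendants."""
    children_of = defaultdict(list)
    for child, parent in parent_of.items():
        children_of[parent].append(child)

    found = set(start_uids)

    # Walk upward through Parent links, guarding against accidental cycles.
    for uid in start_uids:
        current = uid
        seen = {uid}
        while current in parent_of and parent_of[current] not in seen:
            current = parent_of[current]
            seen.add(current)
        found |= seen

    # Walk downward through children, breadth-first.
    queue = deque(start_uids)
    visited = set(start_uids)
    while queue:
        uid = queue.popleft()
        for child in children_of[uid]:
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return found | visited


# Child UID -> Parent UID (made-up identifiers).
parent_of = {"TIS-1": "MUS-1", "CEL-1": "TIS-1", "TIS-2": "MUS-2"}
print(sorted(related_samples(["TIS-1"], parent_of)))
```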

Protocol / Data File Query and Download

These two pages are identical (images below). They are filterable tables that allow you to search what protocol/data files are visible to you, along with links to download individual files, and an option to download files in batch.

To download a specific file, click the File URL. To batch download files, select the checkbox, and select Batch download files selected. The original file name redirects you to the SEEK page for that specific data file/protocol.

Globus

Globus is a cloud-based file transfer and storage service that allows users to move and share large amounts of data between different resources.

NExtSEEK houses metadata that describes/annotates actual data files. NExtSEEK does not have a good solution for storing and sharing data files, while Globus does. Below is an overview of how Globus and NExtSEEK are used together.

Register for a Globus account with your Institution email
Email fairdata@mit.edu with your Globus Username and Project Association
The fairdata team will reply once you have been added to the correct Globus Collections
Head to Collections > Shared with You to see the collections shared with you
There exist two collections that you will have access to: {Project_Name}-Staging and {Project_Name}-Public

Data is Uploaded to {Project_Name}-Staging and Downloaded from {Project_Name}-Public

The fairdata team will curate (move) data from Staging to Public when the relevant metadata has been uploaded to NExtSEEK. Following "curation", the Link_PrimaryData attribute of the D.SampleType on NExtSEEK, will be the corresponding Globus link.

The full Globus documentation exists here: https://docs.globus.org/

Deleting

Deleting Samples

Deleting samples happens on the search pages. Similar to downloading, first, you need to search and select the samples you want to delete. You also need to ensure that no samples are children of the samples you are trying to delete.

For example, in the image below, if I am trying to delete those 5 NHPs, no samples in the database can have those 5 NHPs as their parent.

Once you've selected the samples, click delete and, assuming you are an admin, type 'DELETE'. Let it run (it takes around 6 seconds per sample). Again, it will error out if there are downstream samples associated.
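Before starting a deletion, you can check downloaded metadata for samples that would block it. The sketch below flags any sample in the deletion list that is still referenced as a Parent and estimates the run time from the figure of about 6 seconds per sample. The function names and UIDs are hypothetical, and this is not the check NExtSEEK itself performs.

```python
def blocked_deletions(to_delete: list[str], parent_of: dict[str, str]) -> dict[str, list[str]]:
    """Map each sample that cannot be deleted to the children that still reference it."""
    blocked = {}
    for child, parent in parent_of.items():
        if parent in to_delete:
            blocked.setdefault(parent, []).append(child)
    return blocked


def estimated_delete_seconds(sample_count: int, seconds_per_sample: float = 6.0) -> float:
    """Rough deletion time, using the approximate per-sample figure from the text."""
    return sample_count * seconds_per_sample


# Child UID -> Parent UID (made-up identifiers).
parent_of = {"TIS-1": "NHP-1"}
print(blocked_deletions(["NHP-1", "NHP-2"], parent_of))
print(estimated_delete_seconds(5))
```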

Deleting Protocols and Data Files

To delete a Protocol or Data File - head to the SEEK website:

Find the Protocol / Data File you want to delete, click actions, and delete.

Useful Links

SEEK: https://seek4science.org/ or https://fairdata.mit.edu/

SEEK Documentation: https://docs.seek4science.org/

NExtSEEK: https://nextseek.mit.edu/

SEEK/NExtSEEK Installation: https://igb.mit.edu/data-management/seek-and-nextseek

FAIRDOMHub: https://fairdomhub.org/

Repositories:

Sequence Read Archive: https://www.ncbi.nlm.nih.gov/sra
Gene Expression Omnibus: https://www.ncbi.nlm.nih.gov/geo/
Zenodo: https://zenodo.org/
Immport: https://www.dev.immport.org/home
MIT.OMERO: https://omero.mit.edu/webclient/
PRIDE: https://www.ebi.ac.uk/pride/

Installation

If you would like to install SEEK/NExtSEEK: Documentation is located here: https://igb.mit.edu/data-management/seek-and-nextseek

If you would just like to install SEEK: https://seek4science.org/get_seek.html

Contact / Staff

Contact Email:

If you would like to contact the MIT team, please reach out to

Lead Data Specialist

Data Specialist

Northeastern Bioinformatics CO-OP, Fall 2024

System Administrator and Maintainer

Huiming Ding: System Administrator and Developer

Project Lead / Principal Investigator

Previous Team Members:

Northeastern Bioinformatics CO-OP, Spring 2024
Northeastern Bioinformatics CO-OP, Fall 2023
Northeastern Bioinformatics CO-OP, Spring 2023
Northeastern Bioinformatics CO-OP, Fall 2022
Northeastern Bioinformatics CO-OP, Spring 2022
Data Specialist and Developer